
Statistics for Engineers
A Very Brief Introduction
M. Stob
November 7, 2007
Preface
These notes are for the statistics portion of the course Mathematics 232, Engineering
Mathematics, taught at Calvin College and required of all students in the engineering
program. The prerequisites for the course include two semesters of calculus and a course
in differential equations and linear algebra. Mathematics 232 includes three units: linear
algebra, statistics, and vector calculus.
It isn’t possible to do justice to the topic of statistics in the five weeks that are given
over to this topic in Mathematics 232. Nevertheless this course is intended to give at least
a broad overview into the central questions in statistical analysis. One important feature
of the approach of these notes is to integrate the use of an industry-standard statistical
computer program from the very beginning. Not only does this give the student some
familiarity with the tools used in the so-called “real world,” it also allows us to move more
quickly through the central notions of statistical analysis. I hope that all students in this
course find statistics useful and interesting enough to learn more statistics sometime down
the road.
These notes are not intended to be self-contained. First, they assume that the student
has access to a basic introduction to the R computer language. I refer explicitly in the
text to the very nice introduction SimpleR, written by John Verzani and available on
the web at http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf. Second, the notes
assume that the student comes to class and no apologies are made for leaving out extended
discussions from the notes of issues treated carefully in class. Finally, the problems must be
completed independently by the reader to ensure that the concepts are clearly understood.
These notes have been written expressly for Mathematics 232. It seems unlikely that one
would find them ideal for some other purpose. Nevertheless, the notes are freely available
to anyone for their personal, non-commercial use (i.e., don’t sell these notes - they’re not
worth buying anyway).
These notes are part of a larger project to organize the teaching of statistics at Calvin
College. That means, among other things, that much of the material in these notes is
not original but has been shamelessly plagiarized from Foundations and Applications of
Statistics by Randall Pruim, the text for Mathematics 343-344. In turn, these notes will
also morph into part of the text for Mathematics 243 at Calvin.
This is the first edition of these notes. Thus errors, typographical and otherwise, abound.
I encourage readers to communicate them to me at [email protected]. I take full responsibility for the errors, even the ones that I plagiarized from Pruim.
Contents
1 Introduction   101
2 Data   201
2.1 Data - Basic Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
2.2 Graphical and Numerical Summaries of Univariate Data . . . . . . . . . . . . . . . . . . 206
2.2.1 Graphical Summaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
2.2.2 Measures of the Center of a Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 209
2.2.3 Measures of Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
2.3 The Relationship Between Two Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
2.4 Describing a Linear Relationship Between Two Quantitative Variables . . . . . 223
2.5 Describing a Non-linear Relationship Between Two Variables . . . . . . . . . . . . . 234
2.6 Data - Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
2.7 Data - Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
2.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
3 Probability   301
3.1 Modelling Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
3.2 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
3.2.1 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
3.2.2 The Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
3.2.3 The Hypergeometric Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
3.3 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
3.3.1 pdfs and cdfs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
3.3.2 Uniform Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
3.3.3 Exponential Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
3.3.4 Weibull Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
3.4 Mean and Variance of a Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
3.4.1 The Mean of a Discrete Random Variable . . . . . . . . . . . . . . . . . . . . . 323
3.4.2 The Mean of a Continuous Random Variable . . . . . . . . . . . . . . . . . . 325
3.4.3 Transformations of Random Variables . . . . . . . . . . . . . . . . . . . . . . . 327
3.4.4 The Variance of a Random Variable . . . . . . . . . . . . . . . . . . . . . . . . . 328
3.5 The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
4 Inference   401
4.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
4.2 Inferences about the Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
4.3 The t-Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
4.4 Inferences for the Difference of Two Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
4.5 Regression Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
1 Introduction
Kellogg’s makes Raisin Bran and packages it in boxes that are labeled “Net Weight: 20
ounces”. How might we test this claim? It seems obvious that we need to actually weigh
some boxes. However we certainly cannot require that every box that we weigh contains
exactly 20 ounces. Surely some variation in weight from box to box is to be expected and
should be allowed. So we are faced with several questions: How many boxes should we
weigh? How should we choose these boxes? How much deviation in weight from the 20
ounces should we allow? These are the kind of questions that the discipline of statistics is
designed to answer.
Definition 1.0.1 (Statistics). Statistics is the scientific discipline concerned with collecting, analyzing and making inferences from data.
While we cannot tell the whole Raisin Bran story here, the answers to our questions as
prescribed by NIST (National Institute of Standards and Technology) and developed from
statistical theory are something like this. Suppose that we are at a Meijer’s warehouse
that has just received a shipment of 250 boxes of Raisin Bran. We first select twelve boxes
out of the whole shipment at random. By at random we mean that no box should be any
more likely to occur in the group of twelve than any other. In other words, we shouldn’t
simply take the first twelve boxes that we find. Next we weigh the contents of the twelve
boxes. If any of the boxes is “too” underweight, we reject the whole shipment - that is we
disbelieve the claim of Kellogg’s (and they are in trouble). If that is not the case, then we
compute the average weight of the twelve boxes. If that average is not “too” far below 20
ounces, we do not disbelieve the claim.
Of course there are some details in the above paragraph. We’ll address the issue of how
to choose the boxes more carefully in Section 2.6. We’ll address the issue of summarizing
the data (in this case, using the average weight) in Section 2.2. The question of how far
below 20 ounces Kellogg’s should be allowed to be will be dealt with in Section 4.2.
Underlying our statistical techniques is the theory of probability which we take up in
Chapter 3. The theory of probability is meant to supply a mathematical model for situations in which there is uncertainty. In the context of Raisin Bran, we will use probability
to give a model for the variation that exists from box to box. We will also use probability
to give a model of the uncertainty introduced because we are only weighing a sample of
boxes.
If the whole course were only about Raisin Bran, it wouldn’t be worth it (except perhaps
to Kellogg’s), even an abbreviated course like this one. But you are probably sophisticated
enough to be able to generalize this example. Indeed, the above story can be told in every
branch of science (biological, physical, and social). Each time we have a hypothesis about
a real-world phenomenon that is measurable but variable, we need to test that hypothesis
by collecting data. We need to know how to collect that data, how to analyze it, and how
to make inferences from it.
So without further ado, let’s talk about data.
2 Data
2.1 Data - Basic Notions
The OED defines data as “facts and statistics used for reference or analysis.” (And the
OED notes that while the word data is technically the plural of datum, it is often used
with a singular verb and that usage is now generally deemed to be acceptable.) For our
purposes, the sort of data that we will use comes to us in collections or datasets. A dataset
consists of a set of objects, variously called individuals, cases, items, instances, units,
or subjects, together with a record of the value of a certain variable or variables defined
on the items.
Definition 2.1.1 (variable). A variable is a function defined on the set of objects.
Ideally, each individual has a value for each variable. However there are often missing
values.
Example 2.1.1
Calvin College maintains a dataset of all currently active students. The individuals
in this dataset are the students. Many different variables are defined and recorded in
this dataset. For example, every student has a GPA, a GENDER, a CLASS, etc. Not
every student has an ACT score — there are missing values for this variable.
We will normally think of data as presented in a two-dimensional table. The rows of
the table correspond to the individuals. (Thus the individuals need to be ordered in some
way.) The columns of the table correspond to the variables. Each of the rows and the
columns normally has a name. In R, the canonical way to store such data is in an object
called a data.frame. A number of datasets are included with the basic installation of R.
The following example shows how an included dataset is accessed in R.
> data(iris)    # the dataset called iris is loaded into a data.frame called iris
> dim(iris)     # list dimensions of iris data
[1] 150   5
> iris[1:5,]    # print first 5 rows (individuals), all columns
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
>
Notice that the data.frame has rows and columns. The individuals (rows) are, by
default, numbered (they can also be named) and the variables (columns) are named. The
numbers and names are not part of the dataset. Each column of a data.frame is a vector
and behaves like the mathematical object called a vector. In the iris dataset, there are 150
individuals (plants) and five variables. Notice that four of the variables (Sepal.Length,
Sepal.Width, Petal.Length, Petal.Width) are quantitative variables. That is, the
value of the variable is a number. The fifth variable is categorical. A categorical variable
usually has a finite number of possible values. The possible values of a categorical variable
are often called its levels. In this example the variable Species is categorical with three
levels. A categorical variable is often called a factor. Sometimes categorical variables
use numbers for the category names. For example we might code gender by using 0 for
males and 1 for females. We need to be careful not to treat these variables as quantitative
simply because numbers are used. The following example shows how to look at pieces of
the dataset.
> iris$Sepal.Length                   # returns a vector
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
> iris$Species                        # a boring vector
  [1] setosa     setosa     setosa     setosa     setosa     setosa
  [7] setosa     setosa     setosa     setosa     setosa     setosa
 [13] setosa     setosa     setosa     setosa     setosa     setosa
 [19] setosa     setosa     setosa     setosa     setosa     setosa
 [25] setosa     setosa     setosa     setosa     setosa     setosa
 [31] setosa     setosa     setosa     setosa     setosa     setosa
 [37] setosa     setosa     setosa     setosa     setosa     setosa
 [43] setosa     setosa     setosa     setosa     setosa     setosa
 [49] setosa     setosa     versicolor versicolor versicolor versicolor
 [55] versicolor versicolor versicolor versicolor versicolor versicolor
 [61] versicolor versicolor versicolor versicolor versicolor versicolor
 [67] versicolor versicolor versicolor versicolor versicolor versicolor
 [73] versicolor versicolor versicolor versicolor versicolor versicolor
 [79] versicolor versicolor versicolor versicolor versicolor versicolor
 [85] versicolor versicolor versicolor versicolor versicolor versicolor
 [91] versicolor versicolor versicolor versicolor versicolor versicolor
 [97] versicolor versicolor versicolor versicolor virginica  virginica
[103] virginica  virginica  virginica  virginica  virginica  virginica
[109] virginica  virginica  virginica  virginica  virginica  virginica
[115] virginica  virginica  virginica  virginica  virginica  virginica
[121] virginica  virginica  virginica  virginica  virginica  virginica
[127] virginica  virginica  virginica  virginica  virginica  virginica
[133] virginica  virginica  virginica  virginica  virginica  virginica
[139] virginica  virginica  virginica  virginica  virginica  virginica
[145] virginica  virginica  virginica  virginica  virginica  virginica
Levels: setosa versicolor virginica
> iris$Petal.Width[c(1:5,146:150)]    # selecting some individuals of this variable
 [1] 0.2 0.2 0.2 0.2 0.2 2.3 1.9 2.0 2.3 1.8
>
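As a brief aside on the warning above about categorical variables that are coded with numbers, the following small sketch (the 0/1 gender codes here are made up and are not part of the iris data) shows how the factor() function tells R to treat such codes as categorical rather than quantitative.

> gender=c(0,1,1,0,1)                         # 0 coded for male, 1 for female
> mean(gender)                                # R happily averages the codes
[1] 0.6
> gender=factor(gender,labels=c('male','female'))
> table(gender)                               # now R treats gender as categorical
gender
  male female
     2      3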
Accessing datasets in R
We have already seen the first way of accessing a dataset in R. There are a large number of
datasets that are included with the standard distribution of R. Many of these are historically
important datasets or datasets that are often used in statistics courses. A complete list of
such datasets is available by typing data().
Many users of R have made other datasets available by creating a package. A package
is a collection of R datasets and/or functions that a user can load. Some of these packages
come with the standard distribution of R. Others are available from CRAN. To load a package, use library(package.name) or require(package.name). For example, the faraway
package contains several datasets. One such dataset records various health statistics on
768 adult Pima Indians for a medical study of diabetes.
> library(faraway)
> data(pima)
> dim(pima)
[1] 768   9
> pima[1:5,]
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
>
If the package is not included in the distribution of R installed on your machine, the
package can be installed from a remote site. This can be done easily in both Windows and
Mac implementations of R using menus.
Finally, datasets can be loaded from a file that is located on one’s local computer or on
the internet. Two things need to be known: the format of the data file and the location
of the data file. The most common format of a datafile is CSV (comma separated values).
In this format, each individual is a line in the file and the values of the variables are
separated by commas. The first line of such a file contains the variable names. There
are no individual names. The R function read.csv reads such a file. Other formats are
possible and the function read.table can be used with various options to read these. The
following example shows how a file is read from the internet. The file contains the offensive
statistics of all major league baseball teams for the complete 2007 season.
> bball=read.csv('http://www.calvin.edu/~stob/data/baseball2007.csv')
> bball[1:4,]
         CLUB LEAGUE    BA   SLG   OBP   G   AB   R    H   TB X2B X3B  HR RBI
1    New York      A 0.290 0.463 0.366 162 5717 968 1656 2649 326  32 201 929
2     Detroit      A 0.287 0.458 0.345 162 5757 887 1652 2635 352  50 177 857
3     Seattle      A 0.287 0.425 0.337 162 5684 794 1629 2416 284  22 153 754
4 Los Angeles      A 0.284 0.417 0.345 162 5554 822 1578 2317 324  23 123 776
  SH SF HBP  BB IBB   SO  SB CS GDP  LOB SHO   E  DP TP
1 41 54  78 637  32  991 123 40 138 1249   8  88 174  0
2 31 45  56 474  45 1054 103 30 128 1148   3  99 148  0
3 33 40  62 389  32  861  81 30 154 1128   7  90 167  0
4 32 65  40 507  55  883 139 55 146 1100   8 101 154  0
>
Creating datasets in R
Probably the best way to create a new dataset for use in R is to use an external program to
create it. Excel, for example, can save a spreadsheet in CSV format. The editing features
of Excel make it very easy to create such a dataset. Small datasets can be entered into R
by hand. First, vectors can be created using the c() or scan() functions.
> x=c(1,2,3,4,5:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y=c('a', 'b','c')
> y
[1] "a" "b" "c"
> z=scan()
1: 2 3 4
4: 11 12 19
7: 4
8:
Read 7 items
> z
[1]  2  3  4 11 12 19  4
>
The scan() function prompts the user with the number of the next item to enter. Items
are entered delimited by spaces or commas. We can use as many lines as we like and the
input is terminated by a blank line. There is also a data editor available in the graphical
user interfaces but it is quite primitive.
A data.frame can be made from vectors of the same length.
> x=c(’Tom’,’Dick’,’Harry’)
> y=c(23,28,27)
> people=data.frame(names = x, ages = y)
> people
  names ages
1   Tom   23
2  Dick   28
3 Harry   27
>
2.2 Graphical and Numerical Summaries of Univariate Data
Now that we can get our hands on some data, we would like to develop some tools to help
us understand the distribution of a variable in a data set. By distribution we mean two
things: what values does the variable take on, and with what frequency. Simply listing all
the values of a variable is not an effective way to describe a distribution unless the data
set is quite small. For larger data sets, we require some better methods of summarizing a
distribution.
2.2.1 Graphical Summaries
The type of summary that we generate will vary depending on the type of data that we
are summarizing. A table is useful for summarizing a categorical variable. The following
table is a useful description of the distribution of species of iris flowers in the iris dataset.
> table(iris$Species)

    setosa versicolor  virginica
        50         50         50
Tables can be generated for quantitative variables as well.
> table(iris$Sepal.Length)

4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 6.1
  1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6   6
6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9
  4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1
Figure 2.1: Homeruns in major leagues: hist() and histogram() (frequency and percent-of-total histograms of bball$HR).
The table function is more useful in conjunction with the cut() function. The second
argument to cut() gives a vector of endpoints of half-open intervals. Note that the default
behavior is to use intervals that are open to the left, closed to the right.
> table(cut(iris$Sepal.Length,c(4,5,6,7,8)))

(4,5] (5,6] (6,7] (7,8]
   32    57    49    12
The kind of summary in the above table is graphically presented by means of a histogram. There are two R commands that can be used to build a histogram: hist() and
histogram(). hist() is part of the standard distribution of R. histogram() can only be
used after first loading the lattice graphics package, which now comes standard with all
distributions of R. The R functions are used as in the following excerpt which generates the
two histograms in Figure 2.1. Notice that two forms of the histogram() function are given.
The second form (the “formula” form) will be discussed in more detail in Section 2.3.
> bball=read.csv('http://www.calvin.edu/~stob/data/baseball2007.csv')
> hist(bball$HR)
> histogram(bball$HR)          # lattice histogram of a vector
> histogram(~HR,data=bball)    # formula form of histogram
Notice that the histograms produced differ in several ways. Besides aesthetic differences,
the two histogram algorithms typically choose different break points. Also, the vertical scale
of histogram() is in percentages of total while the vertical scale of hist() contains actual
counts. As one might imagine, there are optional arguments to each of these functions that
can be used to change such decisions.

Figure 2.2: Skewed and symmetric distributions (panels labeled neg. skewed, pos. skewed, and symmetric).
In these notes, we will always use histogram() and indeed we will assume that the
lattice package has been loaded. Graphics functions in the lattice package often have
several useful features. We will see some of these in later Sections.
A histogram gives a shape to a distribution and distributions are often described in
terms of these shapes. The exact shape depicted by a histogram will depend not only on
the data but on various other choices, such as how many bins are used, whether the bins
are equally spaced across the range of the variable, and just where the divisions between
bins are located. But reasonable choices of these arguments will usually lead to histograms
of similar shape, and we use these shapes to describe the underlying distribution as well
as the histogram that represents it.
Some distributions are approximately symmetrical with the distribution of the larger
values looking like a mirror image of the distribution of the lower values. We will call a
distribution positively skewed if the portion of the distribution with larger values (the
right of the histogram) is more spread out than the other side. Similarly, a distribution is
negatively skewed if the distribution deviates from symmetry in the opposite manner.
Later we will learn a way to measure the degree and direction of skewness with a number;
for now it is sufficient to describe distributions qualitatively as symmetric or skewed. See
Figure 2.2 for some examples of symmetric and skewed distributions.
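To see such shapes for yourself, the following short sketch (ours, not from the text) simulates some positively skewed data and plots it; rexp() produces values that pile up near zero and trail off to the right, so the resulting histogram resembles the pos. skewed panel of Figure 2.2.

> x=rexp(200,rate=1/3)        # simulated positively skewed data
> histogram(x)                # long right tail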
Notice that each of these distributions is clustered around a center where most of the
values are located. We say that such distributions are unimodal. Shortly we will discuss
ways to summarize the location of the “center” of unimodal distributions numerically. But
first we point out that some distributions have other shapes that are not characterized by a
strong central tendency. One famous example is eruption times of the Old Faithful geyser
in Yellowstone National Park. The command

> data(faithful);
> histogram(faithful$eruptions,n=20);

produces the histogram in Figure 2.3, which shows a good example of a bimodal distribution. There appear to be two groups or kinds of eruptions, some lasting about 2 minutes
and others lasting between 4 and 5 minutes.

Figure 2.3: Old Faithful eruption times (based on the faithful data set).
2.2.2 Measures of the Center of a Distribution
Qualitative descriptions of the shape of a distribution are important and useful. But we
will often desire the precision of numerical summaries as well. Two aspects of unimodal
distributions that we will often want to measure are central tendency (what is a typical
value? where do the values cluster?), and the amount of variation (are the data tightly
clustered around a central value, or more spread out?)
Two widely used measures of center are the mean and the median. You are probably
already familiar with both. The mean is calculated by adding all the values of a variable
and dividing by the number of values. Our usual notation will be to denote the n values as
$x_1, x_2, \ldots, x_n$, and the mean of these values as $\bar{x}$. Then the formula for the mean becomes
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}. \]
The median is a value that splits the data in half – half of the values are smaller than
the median and half are larger. By this definition, there could be more than one median
(when there are an even number of values). This ambiguity is removed by taking the mean
of the “two middle numbers” (after sorting the data). See Exercises 2.4 – 2.6 for some
problems that explore aspects of the mean and median that may be less familiar.
The mean and median are easily computed in R. For example,
> mean(iris$Sepal.Length); median(iris$Sepal.Length);
[1] 5.843333
[1] 5.8
We can also compute the mean and median of the Old Faithful eruption times.
> mean(faithful$eruptions); median(faithful$eruptions);
[1] 3.487783
[1] 4
Notice, however, that in the Old Faithful eruption times histogram (Figure 2.3) there are
very few eruptions that last between 3.5 and 4 minutes. So although these numbers are
the mean and median, neither is a very good description of the typical eruption time(s)
of Old Faithful. It will often be the case that the mean and median are not very good
descriptions of a data set that is not unimodal. In the case of our Old Faithful data,
there seem to be two predominant peaks, but unlike in the case of the iris data, we do
not have another variable in our data that lets us partition the eruptions times into two
corresponding groups. This observation could, however, lead to some hypotheses about
Old Faithful eruption times. Perhaps eruption times are different at night than during the
day. Perhaps there are other differences in the eruptions. Subsequent data collection (and
statistical analysis of the resulting data) might help us determine whether our hypotheses
appear correct.
One disadvantage of a histogram is that the actual data values are lost. For a large
data set, this is probably unavoidable. But for more modestly sized data sets, a stem plot
can reveal the shape of a distribution without losing the actual (perhaps rounded) data
values. A stem plot divides each value into a stem and a leaf at some place value. The
leaf is rounded so that it requires only a single digit. The values are then recorded as in
Figure 2.4.
From this output we can readily see that the shortest recorded eruption time was 1.60
minutes. The second 0 in the first row represents 1.70 minutes. Note that the output of
stem() can be ambiguous when there are not enough data values in a row.
> stem(faithful$eruptions);

  The decimal point is 1 digit(s) to the left of the |

  16 | 070355555588
  18 | 000022233333335577777777888822335777888
  20 | 00002223378800035778
  22 | 0002335578023578
  24 | 00228
  26 | 23
  28 | 080
  30 | 7
  32 | 2337
  34 | 250077
  36 | 0000823577
  38 | 2333335582225577
  40 | 0000003357788888002233555577778
  42 | 03335555778800233333555577778
  44 | 02222335557780000000023333357778888
  46 | 0000233357700000023578
  48 | 00000022335800333
  50 | 0370

Figure 2.4: Stemplot of Old Faithful eruption times using stem().
Comparing mean and median
Why bother with two different measures of central tendency? The short answer is that
they measure different things, and sometimes one measure is better than the other. If a
distribution is (approximately) symmetric, the mean and median will be (approximately)
the same. (See Exercise 2.4.)
If the distribution is not symmetric, however, the mean and median may be very different.
For example, if we begin with a symmetric distribution and add in one additional value
that is very much larger than the other values (an outlier), then the median will not
change very much (if at all), but the mean will increase substantially. We say that the
median is resistant to outliers while the mean is not. A similar thing happens with a
skewed, unimodal distribution. If a distribution is positively skewed, the large values in
the tail of the distribution increase the mean (as compared to a symmetric distribution)
but not the median, so the mean will be larger than the median. Similarly, the mean of a
negatively skewed distribution will be smaller than the median.
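A tiny numerical illustration of this resistance (our own made-up numbers, not from the text):

> x=c(1,2,3,4,5,6,7,8,9)
> mean(x); median(x)
[1] 5
[1] 5
> x=c(x,100)                  # add a single large outlier
> mean(x); median(x)          # the mean jumps, the median barely moves
[1] 14.5
[1] 5.5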
Whether a resistant measure is desirable or not depends on context. If we are looking
at the income of employees of a local business, the median may give us a much better
indication of what a typical worker earns, since there may be a few large salaries (the
business owner’s, for example) that inflate the mean. This is also why the government
reports median household income and median housing costs.
On the other hand, if we compare the median and mean of the value of raffle prizes, the
mean is probably more interesting. The median is probably 0, since typically the majority
of raffle tickets do not win anything. This is independent of the values of any of the prizes.
The mean will tell us something about the overall value of the prizes involved. In particular,
we might want to compare the mean prize value with the cost of the raffle ticket when we
decide whether or not to purchase one.
The trimmed mean compromise
There is another measure of central tendency that is less well known and represents a kind
of compromise between the mean and the median. In particular, it is more sensitive to the
extreme values of a distribution than the median is, but less sensitive than the mean.
The idea of a trimmed mean is very simple. Before calculating the mean, we remove
the largest and smallest values from the data. The percentage of the data removed from
each end is called the trimming percentage. A 0% trimmed mean is just the mean; a 50%
trimmed mean is the median; a 10% trimmed mean is the mean of the middle 80% of the
data (after removing the largest and smallest 10%). A trimmed mean is calculated in R by
setting the trim argument of mean(), e.g. mean(x,trim=.10). Although a trimmed mean
in some sense combines the advantages of both the mean and median, it is less common
than either the mean or the median. This is partly due to the mathematical theory that has
been developed for working with the median and especially the mean of sample data.
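The following small sketch (our own made-up data, ending with one large value) compares the three measures; as expected, the 10% trimmed mean falls between the median and the heavily influenced mean.

> x=c(1,3,5,5,6,8,9,14,14,100)
> mean(x)                     # pulled upward by the single large value
[1] 16.5
> mean(x,trim=0.10)           # drops the smallest and largest value before averaging
[1] 8
> median(x)
[1] 7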
2.2.3 Measures of Dispersion
It is often useful to characterize a distribution in terms of its center, but that is not the
whole story. Consider the distributions depicted in the histograms below.
[Two histograms, labeled A and B, of values ranging from about −10 to 30, with Density on the vertical axis.]
In each case the mean and median are approximately 10, but the distributions clearly have
very different shapes. The difference is that distribution B is much more “spread out”.
“Almost all” of the data in distribution A is quite close to 10; a much larger proportion
of distribution B is “far away” from 10. The intuitive (and not very precise) statement in
the preceding sentence can be quantified by means of quantiles. The idea of quantiles is
probably familiar to you since percentiles are a special case of quantiles.
Definition 2.2.1 (Quantile). Let p ∈ [0, 1]. A p-quantile of a quantitative distribution is
a number q such that the (approximate) proportion of the distribution that is less than q
is p.
Figure 2.5: An illustration of a method for determining quantiles from data (the values 1, 4, 9, 16, 25, 36, 49, 64, 81, 100 laid out along a ruler). Arrows indicate the locations of the .25-quantile and the .5-quantile.
So for example, the .2-quantile divides a distribution into 20% below and 80% above.
This is the same as the 20th percentile. The median is the .5-quantile (and the 50th
percentile).
The idea of a quantile is quite straightforward. In practice there are a few wrinkles to be
ironed out. Suppose your data set has 15 values. What is the .30-quantile? 30% of the data
would be (.30)(15) = 4.5 values. Of course, there is no number that has 4.5 values below
it and 11.5 values above it. This is the reason for the parenthetical word approximate in
Definition 2.2.1. Different schemes have been proposed for giving quantiles a precise value,
and R implements several such methods. They are similar in many ways to the decision we
had to make when computing the median of a variable with an even number of values.
Two important methods can be described by imagining that the sorted data have been
placed along a ruler, one value at every unit mark and also at each end. To find the p-quantile, we simply snap the ruler so that proportion p is to the left and 1 − p to the right.
If the break point happens to fall precisely where a data value is located (i.e., at one of the
unit marks of our ruler), that value is the p-quantile. If the break point is between two data
values, then the p-quantile is a weighted mean of those two values. For example, suppose we
have 10 data values: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100. The 0-quantile is 1, the 1-quantile
is 100, the .5-quantile (median) is midway between 25 and 36, that is 30.5. Since our ruler
is 9 units long, the .25-quantile is located 9/4 = 2.25 units from the left edge. That would
be one quarter of the way from 9 to 16, which is 9 + .25(16 − 9) = 9 + 1.75 = 10.75. (See
Figure 2.5.) Other quantiles are found similarly. This is precisely the default method used
by quantile().
> quantile((1:10)^2)
    0%    25%    50%    75%   100%
  1.00  10.75  30.50  60.25 100.00
A second scheme is just like this one except that the data values are placed midway
between the unit marks. In particular, this means that the 0-quantile is not the smallest
value. This could be useful, for example, if we imagined we were trying to estimate the
lowest value in a population from which we only had a sample. Probably the lowest value
overall is less than the lowest value in our particular sample. Other methods try to refine
this idea, usually based on some assumptions about what the population of interest is like.
Fortunately, for large data sets, the differences between the different quantile methods
are usually unimportant, so we will just let R compute quantiles for us using the quantile()
function. For example, here are the deciles and quartiles of the Old Faithful eruption
times.
> quantile(faithful$eruptions,(0:10)/10);
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
1.6000 1.8517 2.0034 2.3051 3.6000 4.0000 4.1670 4.3667 4.5330 4.7000 5.1000
> quantile(faithful$eruptions,(0:4)/4);
     0%     25%     50%     75%    100%
1.60000 2.16275 4.00000 4.45425 5.10000
The latter of these provides what is commonly called the five number summary. The
0-quantile and 1-quantile (at least in the default scheme) are the minimum and maximum
of the data set. The .5-quantile gives the median, and the .25- and .75-quantiles (also called
the first and third quartiles) isolate the middle 50% of the data. When these numbers are
close together, then most (well, half, to be more precise) of the values are near the median.
If those numbers are farther apart, then much (again, half) of the data is far from the
center. The difference between the first and third quartiles is called the inter-quartile
range and abbreviated IQR. This is our first numerical measure of dispersion.
The five-number summary is often presented by means of a boxplot. The standard R
function is boxplot() and the lattice function is bwplot(). A boxplot of the Sepal.Width
of the iris data is in Figure 2.6 and was generated by
> bwplot(iris$Sepal.Width)
The sides of the box are drawn at the quartiles. The median is represented by a dot
in the box. In some boxplots, the whiskers extend out to the maximum and minimum
values. However the boxplot that we are using here attempts to identify outliers. Outliers
are values that are unusually large or small and are indicated by a special symbol beyond
the whiskers. The whiskers are then drawn from the box to the largest and smallest non-outliers. One common rule for automating outlier detection for boxplots is the 1.5 IQR
rule. Under this rule, any value that is more than 1.5 IQR away from the box is marked
as an outlier. Indicating outliers in this way is useful since it allows us to see if the whisker
is long only because of one extreme value.

Figure 2.6: Boxplot of Sepal.Width of iris data.
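To see the 1.5 IQR rule numerically, here is a minimal sketch (ours) that computes the quartiles, the IQR, and the resulting fences for the sepal widths; the values falling outside the fences are the points plotted individually in Figure 2.6.

> quantile(iris$Sepal.Width,c(.25,.75))
 25%  75%
 2.8  3.3
> IQR(iris$Sepal.Width)
[1] 0.5
> 2.8 - 1.5*0.5; 3.3 + 1.5*0.5    # lower and upper fences
[1] 2.05
[1] 4.05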
Variance and Standard Deviation
Another important way to measure the dispersion of a distribution is by comparing each
value with the mean of the distribution. If the distribution is spread out, these differences
will tend to be large, otherwise these differences will be small. To get a single number, we
could simply add up all of the deviation from the mean:
\[ \text{total deviation from the mean} = \sum_{i=1}^{n} (x_i - \bar{x}). \]
The trouble with this is that the total deviation from the mean is always 0 (see Exercise 2.9).
The problem is that the negative deviations and the positive deviations always exactly
cancel out.
To fix this problem we might consider taking the absolute value of the deviations from
the mean:
\[ \text{total absolute deviation from the mean} = \sum_{i=1}^{n} |x_i - \bar{x}|. \]
This number will only be 0 if all of the data values are equal to the mean. Even better
would be to divide by the number of data values. Otherwise large data sets will have large
sums even if the values are all close to the mean.
\[ \text{mean absolute deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|. \]
This is a reasonable measure of the dispersion in a distribution, but we will not use it very
often. There is another measure that is much more common, namely the variance, which
is defined by
\[ \text{variance} = \mathrm{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2. \]
You will notice two differences from the mean absolute deviation. First, instead of using
an absolute value to make things positive, we square the deviations from the mean. The
chief advantage of squaring over the absolute value is that it is much easier to do calculus
with a polynomial than with functions involving absolute values. Because the squaring
changes the units of this measure, the square root of the variance, called the standard
deviation, is commonly used in place of the variance.
\[ \text{standard deviation} = \mathrm{sd}(x) = \sqrt{\mathrm{Var}(x)}. \]
The second difference is that we divide by n − 1 instead of by n. There is a very good
reason for this, even though dividing by n probably would have felt much more natural
to you at this point. We’ll get to that very good reason later in the course. For now, we’ll
settle for a less good reason. If you know the mean and all but one of the values of a
variable, then you can determine the remaining value, since the sum of all the values must
be the product of the number of values and the mean. So once the mean is known, there
are only n − 1 independent pieces of information remaining. That is not a particularly
satisfying explanation, but it should help you remember to divide by the correct quantity.
All of these quantities are easy to compute in R.
> x=c(1,3,5,5,6,8,9,14,14,20);
>
> mean(x);
[1] 8.5
> x - mean(x);
 [1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5  0.5  5.5  5.5 11.5
> sum(x - mean(x));
[1] 0
> abs(x - mean(x));
 [1]  7.5  5.5  3.5  3.5  2.5  0.5  0.5  5.5  5.5 11.5
> sum(abs(x - mean(x)));
[1] 46
> (x - mean(x))^2;
 [1]  56.25  30.25  12.25  12.25   6.25   0.25   0.25  30.25  30.25 132.25
> sum((x - mean(x))^2);
[1] 310.5
> n= length(x);
> 1/(n-1) * sum((x - mean(x))^2);
[1] 34.5
> var(x);
[1] 34.5
> sd(x);
[1] 5.87367
> sd(x)^2;
[1] 34.5
2.3 The Relationship Between Two Variables
Many scientific problems are about describing and explaining the relationship between two
or more variables. In the next three sections, we begin to look at graphical and numerical
ways to summarize such relationships. In this section, we consider the case where one or
both the variables are categorical.
We first consider the case when one of the variables is categorical and the other is
quantitative. This is the situation with the iris data if we are interested in the question
of how, say, Sepal.Length varies by Species. A very common way of beginning to answer
this question is to construct side-by-side boxplots.
> bwplot(Sepal.Length~Species,data=iris)
We see from these boxplots that the virginica variety of iris tends to have the longest sepal
length though the sepal lengths of this variety also have the greatest variation.
Figure 2.7: Box plot for iris sepal length as a function of Species.

The notation used in the first argument of bwplot() is called formula notation and
is extremely important when considering the relationship between two variables. This
formula notation is used throughout lattice graphics and in other R functions as well.
The generic form of a formula is
y ~ x | z
which can often be interpreted as “y modeled by x conditioned on z”. For plotting, y will
typically contain a variable presented on the vertical axis, and x a variable to be plotted
along the horizontal axis. In this case, we are modeling (or describing) sepal length by
species. In this example, there is no conditioning variable z.
An example of the use of a conditioning variable occurs in histogram(). The same
information in the boxplots above is contained in the side-by-side histograms of Figure 2.8.
> histogram(~Sepal.Length | Species,data=iris,layout=c(3,1))
In the case of a histogram, the values for the vertical axis are frequencies computed from
the x variable, so y is omitted (or can be thought of as a frequency variable that is always
included in a histogram implicitly). The condition z is a variable that is used to break the
data into different groups. In the case of histogram(), the different groups are plotted in
separate panels. When z is categorical there is one panel for each level of z. When z is
quantitative, the data is divided into a number of sections based on the values of z.
Figure 2.8: Sepal lengths of three species of irises (histograms of Sepal.Length, in Percent of Total, in separate panels for setosa, versicolor, and virginica).
The formula notation is used for more than just graphics. In the above example, we
would also like to compute summary statistics (such as the mean) for each of the species
separately. There are two ways to do this in R. The first uses the aggregate() function.
A much easier way uses the summary() function from the Hmisc package. The summary()
function allows us to apply virtually any function that has vector input to each level of a
categorical variable separately.
> require(Hmisc) # load Hmisc package
Loading required package: Hmisc
...............................
> summary(Sepal.Length~Species,data=iris,fun=mean);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |
|       |virginica | 50|6.588000    |
+-------+----------+---+------------+
|Overall|          |150|5.843333    |
+-------+----------+---+------------+
> summary(Sepal.Length~Species,data=iris,fun=median);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.0         |
|       |versicolor| 50|5.9         |
|       |virginica | 50|6.5         |
+-------+----------+---+------------+
|Overall|          |150|5.8         |
+-------+----------+---+------------+
> summary(Sepal.Length~Species,iris);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |
|       |virginica | 50|6.588000    |
+-------+----------+---+------------+
|Overall|          |150|5.843333    |
+-------+----------+---+------------+
Notice that the default function used in summary() computes the mean.
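For comparison, here is a sketch of the base R alternative mentioned above; aggregate() produces the same species means without loading any extra package (the result is a small data.frame whose value column is named x).

> aggregate(iris$Sepal.Length,by=list(Species=iris$Species),FUN=mean)
     Species     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588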
From now on we will assume that the lattice and Hmisc packages have been
loaded and will not show the loading of these packages in our examples. If
you try an example in this book and R reports that it cannot find a function,
it is likely that you have failed to load one of these packages. You can set
up R to automatically load these two packages every time you launch R if you
like.
In the above example, we investigated the relationship between a categorical and a
quantitative variable. We now consider an example where both variables are categorical.
A 1981 paper investigating racial biases in the application of the death penalty reported
on 326 cases in which the defendant was convicted of murder. For each case they noted
the race of the defendant and whether or not the death penalty was imposed. We can use
R to cross tabulate this data for us:
> deathpenalty=read.table('http://www.calvin.edu/~stob/data/deathPenalty.txt',header=T)
> deathpenalty[1:5,]
  Penalty Victim Defendant
1     Not  White     White
2     Not  Black     Black
3     Not  White     White
4     Not  Black     Black
5   Death  White     Black
> xtabs(~Penalty+Defendant,data=deathpenalty)
       Defendant
Penalty Black White
  Death    17    19
  Not     149   141
>
(Notice some R features. We have used read.table which is suitable to read files that
are not CSV but rather in which the data is separated by spaces. However read.table()
does not assume a header with variable names. Notice also that xtabs() uses the formula
format in a similar way to histogram(), namely with no output variable in the formula.
The output in xtabs() is counts.)
From the output, it does not look like there is much of a difference in the rates at which
black and white defendants receive the death penalty although a white defendant is slightly
more likely to receive the death penalty. However a different picture emerges if we take
into account the race of the victim.
> xtabs(~Penalty+Defendant+Victim,data=deathpenalty)
, , Victim = Black

       Defendant
Penalty Black White
  Death     6     0
  Not      97     9

, , Victim = White

       Defendant
Penalty Black White
  Death    11    19
  Not      52   132
It appears that black defendants are more likely to receive the death penalty when the
victim is black and also when the victim is white. This phenomenon is known as Simpson’s
Paradox.
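One way to see the reversal numerically (a sketch of our own) is to convert the counts to proportions with prop.table(); the margin argument specifies which classifications the proportions are computed within.

> tab=xtabs(~Penalty+Defendant,data=deathpenalty)
> prop.table(tab,margin=2)          # proportions within each defendant race
        Defendant
Penalty      Black     White
  Death 0.1024096 0.1187500
  Not   0.8975904 0.8812500
> # the same call on the three-way table, with margin=c(2,3), gives the
> # death penalty rates within each combination of defendant and victim race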
2.4 Describing a Linear Relationship Between Two Quantitative
Variables
Many data analysis problems amount to describing the relationship between two quantitative variables.
Example 2.4.1
Thirteen bars of 90-10 Cu/Ni alloys were submerged for sixty days in sea water. The
bars varied in iron content. The weight loss due to corrosion for each bar was recorded.
The R dataset below gives the percentage content of iron (Fe) and the weight loss in
mg per square decimeter (loss).
> library(faraway)
> data(corrosion)
> corrosion[c(1:3,12:13),]
Fe loss
1 0.01 127.6
2 0.48 124.0
3 0.71 110.8
12 1.44 91.4
13 1.96 86.2
> xyplot(loss~Fe, data=corrosion)
It is evident from the plot (Figure 2.9) that the greater the percentage of iron, the
less corrosion. The plot suggests that the relationship might be linear. In the second
plot, a line is superimposed on the data. (How we choose the line is the subject of this
chapter.) Note that to plot the relationship between two quantitative variables, we
may use either plot from the base R package or xyplot from lattice. The function
xyplot() uses the same formula notation as histogram().
Figure 2.9: The corrosion data (loss plotted against Fe), with a “good” line added on the right.
What is the role of the line that we superimposed on the plot of the data in this example?
Obviously, we do not mean to claim that the relationship between iron content and corrosion loss is completely captured by the line. But as a “model” of the relationship between
these variables, the line has at least three possible important uses. First, it provides a
succinct description of the relationship that is difficult to see in the unsummarized data.
The line plotted has equation
loss = 129.79 − 24.02Fe.
Both the intercept and slope of this line have simple interpretations. For example, the
slope suggests that every increase of 1% in iron content means a decrease in weight loss
of 24.02 mg per square decimeter. Second, the model might be used for prediction in a
situation where we have a yet untested object. We can easily use this line to make a
prediction for the material loss in an alloy of 2% iron content. Finally, it might figure in
a scientific explanation of the phenomenon of corrosion. All three uses of such a “model”
will be illustrated in the examples of this chapter.
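For instance, using the rounded coefficients reported above, a quick computation (a sketch; the exact value depends on the unrounded least-squares coefficients) gives the predicted weight loss for an alloy with 2% iron content.

> b0=129.79; b1=-24.02        # intercept and slope of the plotted line
> b0 + b1*2                   # predicted loss (mg per square decimeter) at Fe = 2
[1] 81.75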
Example 2.4.2
The current world records for men’s track appear in Table 2.1. The plot of record
distances (in meters) and times (in seconds) looks roughly linear. We know of course
(for physical reasons) that this relationship cannot be a linear one. Nevertheless, it
appears that a smooth curve might approximate the data very well and that this curve
might have a relatively simple formula.

Distance   Time       Record Holder
100        9.77       Asafa Powell (Jamaica)
200        19.32      Michael Johnson (US)
400        43.18      Michael Johnson (US)
800        1:41.11    Wilson Kipketer (Denmark)
1000       2:11.96    Noah Ngeny (Kenya)
1500       3:26.00    Hicham El Guerrouj (Morocco)
Mile       3:43.13    Hicham El Guerrouj (Morocco)
2000       4:44.79    Hicham El Guerrouj (Morocco)
3000       7:20.67    Daniel Komen (Kenya)
5000       12:37.35   Kenenisa Bekele (Ethiopia)
10,000     26:17.53   Kenenisa Bekele (Ethiopia)

Table 2.1: Men’s World Records in Track (IAAF)

[Scatter plot of the world-record times (Seconds) against distances (Meters).]
Example 2.4.3
The R dataset trees contains the measurements of the volume (in cu ft), girth
(diameter of tree in inches measured at 4 ft 6 in above the ground), and height (in ft)
of 31 black cherry trees in a certain forest. Since girth is easily measured, we might
want to use girth to predict volume of the tree. A plot shows the relationship.
> data(trees)
> trees[c(1:2,30:31),]
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
30  18.0     80   51.0
31  20.6     87   77.0
> xyplot(Volume~Girth,data=trees)

[Scatter plot of Volume against Girth for the trees data.]
These three examples share the following features. In each, we are given n observations
$(x_1, y_1), \ldots, (x_n, y_n)$ of quantitative variables $x$ and $y$. In each case we would like to find a
“model” that explains $y$ in terms of $x$. Specifically, we would like to find a simple functional
relationship $y = f(x)$ between these variables. Summarizing, our goal is the following

Goal: Given $(x_1, y_1), \ldots, (x_n, y_n)$, find a “simple” function $f$ such that $y_i$ is
approximately equal to $f(x_i)$ for every $i$.

The goal is vague. We need to make precise the notion of “simple” and also the measure
of fit we will use in evaluating whether $y_i$ is close to $f(x_i)$. In the rest of this section, we
make these two notions precise. The simplest functions we study are linear functions such
as the function that we used in Example 2.4.1. For the remainder of this chapter we will
investigate the problem of fitting linear functions to our data. Namely, we will be trying
to find $b_0$ and $b_1$ so that $y_i \approx b_0 + b_1 x_i$ for all $i$. (Statisticians use $b_0$, $b_1$ or $a$, $b$ for the intercept
and slope rather than the $b$, $m$ that is typical in mathematics texts. We will use $b_0$, $b_1$.)
Of course, in only one of our motivating examples does it seem sensible to use a line to
approximate the data. So two important questions that we will need to address are: How
do we tell if a line is an appropriate description of the relationship? and What do we do if
a linear function is not the right relationship? We will address both questions later.
How shall we measure the goodness of fit of a proposed function f to the data? For
each xi the function f predicts a certain value ŷi = f (xi ) for yi . Then ri = yi − ŷi is the
“mistake” that f makes in the prediction of yi . Obviously we want to choose f so that
the values ri are small in absolute value. Introducing some terminology, we will call ŷi the
fitted or predicted value of the model and ri the residual. The following is a succinct
statement of the relationship
observation = predicted + residual
It will be impossible to choose a line so that all the values of ri are simultaneously small
(unless the data points are collinear). Various values of b0 , b1 might make some values
of ri small while making others large. So we need some measure that aggregates all the
residuals. Many choices are possible and R provides software to find the resulting line but
the canonical choice and the one we investigate here is the sum of squares of the residuals.
Namely, our goal is now refined to the following
Goal: Given $(x_1, y_1), \ldots, (x_n, y_n)$, find $b_0$ and $b_1$ such that if $f(x) = b_0 + b_1 x$
and $r_i = y_i - f(x_i)$, then $\sum_{i=1}^{n} r_i^2$ is minimized.

We call $\sum_{i=1}^{n} r_i^2$ the sum of squares of residuals and denote it by SSResid or SSE (for sum of
squares error). Before we discuss the solution of this problem, we show how to solve it
in R using the data of Example 2.4.1. The R function lm finds the coefficients of the line
that minimizes the sums of squares of the residuals. Note that it uses the same syntax for
expressing the relationship between variables as does xyplot.
> lm(loss~Fe,data=corrosion)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Coefficients:
(Intercept)           Fe
     129.79       -24.02
The problem of finding the line in question can be solved using multivariate calculus. We
need to find b0 and b1 to minimize a certain function of b0 and b1 . This is a straightforward
minimization problem that is solved by finding partial derivatives. However we will take a
different approach and find b0 and b1 by recasting the problem as a linear algebra problem.
Given the observations $(x_1, y_1), \ldots, (x_n, y_n)$, we construct vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ by
\[ \mathbf{x} = (x_1, \ldots, x_n), \qquad \mathbf{y} = (y_1, \ldots, y_n). \]
Given a vector x and a function y = f (x), it is obvious how to interpret f (x). Namely
f (x) is the vector in Rn that results from applying f to each of the elements of the vector
x. (Most scalar functions in R behave in precisely this manner when given a vector as an
argument.) We define the vector ŷ by ŷ = f (x). Then the vector r = y − ŷ is precisely
the vector (r1 , . . . , rn ) of the residuals ri that we defined above. It seems natural to choose
the function f so that the length of the vector r is minimized. Minimizing the length
of r = y − ŷ is the same as minimizing the sums of the squares of the residuals (since
minimizing the length is the same as minimizing the square of the length). So finally we
restate our goal
Goal:
Given the vectors x and y, find b0 and b1 so that if ŷ = b0 + b1 x, the
length of the vector r = y − ŷ is minimized.
Our goal is to find b0 , b1 to minimize the length of the residual vector r. The resulting
line is called the least-squares line.
Given the vector $\mathbf{x}$, define a matrix $X$, called the model matrix, by
\[ X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}. \]
Also define the vector $\mathbf{b} = (b_0, b_1)$. We call $\mathbf{b}$ the coefficient vector. Then we have that
\[ \hat{\mathbf{y}} = X\mathbf{b}, \qquad \mathbf{r} = \mathbf{y} - X\mathbf{b}. \]
In the best case we can find b such that y = Xb. In fact we know exactly when this is possible: such a b can be found if and only if y lies in the column space of X. The column space of X is a
two-dimensional subspace of Rn . Of course in general y will not be in this subspace.
The vector that we seek is the vector in the column space of X (i.e., a vector of form Xb)
that is closest to the vector y. The following picture illustrates the situation. The plane
in the illustration represents the column space of X. Namely, the vectors in this plane are
all vectors of form Xb as b ranges over all possible coefficient vectors. This plane lives in
Rn and the vector y is illustrated in the picture as a vector not in the column space of X.
Figure 2.10: The relationship among the data vector (observation y), the fitted vector (fit ŷ = Xb, lying in the model space), and the residual vector (y − ŷ).
From the picture, we can see exactly what we need. We need to choose b so that the
residual vector r = y − Xb is orthogonal to the column space of X. That is Xb will be
the projection of y onto the column space of X. The condition that r is orthogonal to the
column space of X can be written as
Xᵀr = Xᵀ(y − Xb) = 0.

To solve for b we find that we must solve the equation

Xᵀy = XᵀXb.    (2.1)
Equation 2.1 is usually called the normal equations. The vector on the left of 2.1 is a
vector in R². This equation will have a (unique) solution if the matrix XᵀX has rank two, which holds exactly when X has rank 2, that is, when the columns of X are independent. This is the case whenever our data vector x is not a constant vector, and a constant vector x obviously gives data inappropriate for our problem. For numerical
purposes, it is best to solve for b directly from the equation 2.1. However, by finding the
inverse of XᵀX, we can find an explicit formula for b. We have

b = (XᵀX)⁻¹Xᵀy.

With this expression for b, we can also find the vector ŷ:

ŷ = Xb = Hy,    where    H = X(XᵀX)⁻¹Xᵀ.
The matrix H in this equation is usually called the hat matrix. While there is no need
to know explicit expressions for b0 and b1 (these are always computed using software) it is
easy to show that
b1 = ∑(xi − x̄)yi / ∑(xi − x̄)²,    b0 = ȳ − b1x̄,    (2.2)

where each sum runs over i = 1, . . . , n.
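As a quick check of formula (2.2), the estimates can be computed directly from sums and means. The following is a minimal sketch; it assumes that the corrosion data frame of Example 2.4.1 (with columns Fe and loss) has been loaded.

> x = corrosion$Fe; y = corrosion$loss
> b1 = sum( (x-mean(x))*y )/sum( (x-mean(x))^2 )   # slope, formula (2.2)
> b0 = mean(y) - b1*mean(x)                        # intercept, formula (2.2)
> c(b0,b1)    # should agree with the lm coefficients 129.79 and -24.02 above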
We illustrate the solution of the least squares problem in R with the following example.
Note that R provides tools to do linear algebra calculations but Octave might be a better
vehicle (though we have no reason to do the calculations explicitly).
Example 2.4.4
A random sample of eighty seniors at a certain undergraduate college in Michigan
was chosen and their ACT scores (the Composite score) and grade point averages
were recorded. The population was all students who had senior status as of February
15, 2003 and who had taken the ACT test. There appears to be a modest positive
relationship between ACT scores and GPA. The least-squares solution is found and
graphed below.
> sr=read.csv('sr.csv')
> dim(sr)
[1] 80  2
> sr[1:3,]
  ACT   GPA
1  20 3.300
2  22 3.409
3  27 3.224
Figure 2.11: GPA predicted by ACT for 80 seniors.
> l.sr=lm(GPA~ACT,data=sr)
> l.sr

Call:
lm(formula = GPA ~ ACT, data = sr)

Coefficients:
(Intercept)          ACT
    1.25622      0.07589
The following code computes the solution to the normal equation in R. It also
computes the hat matrix and the sum of the squares of the residuals. Note that
t(X) provides the transpose, %*% is matrix multiplication, and solve() solves linear
equations.
> x=sr$ACT
> y=sr$GPA
> X=cbind(1,x)          # a matrix with two columns, 1, x
> X[c(1:2,79:80),]      # the first and last two rows of X
     x
[1,] 1 20
[2,] 1 22
[3,] 1 32
[4,] 1 31
> solve(t(X)%*%X,t(X)%*%y)          # solve solves systems of linear equations
        [,1]
  1.25621992
x 0.07588661
> Hat = X%*%solve(t(X)%*%X)%*%t(X)  # solve with one argument computes inverses
> yhat = Hat %*% y
> sum( (y-yhat)^2)                  # sum of squares of residuals
[1] 9.973544
> anova(l.sr)                       # R computes sums of squares of residuals
Analysis of Variance Table

Response: GPA
          Df Sum Sq Mean Sq F value    Pr(>F)
ACT        1 6.3436  6.3436  49.611 6.509e-10 ***
Residuals 78 9.9735  0.1279
The analysis of variance table in the last example has an entry for SSResid. Another
picture helps us understand the other sum of squares in that table. The idea behind this
picture is the following. We are using the value of the variable x to help “explain” or
“predict” the value of y. We wish to know how much x helps us to do that. Consider
the vector ȳ = (ȳ, . . . , ȳ), that is, the constant vector of the average value of the yi. This vector is in the column space of our model matrix X. Now the vector y − ȳ, labeled the recentered observation in Figure 2.12, measures the deviation of the observation vector from this mean vector. This vector therefore represents the total variation in the yi. Now consider the right triangle in the figure determined by the vectors y − ȳ, ŷ − ȳ, and y − ŷ. This is a right triangle since the residual vector is orthogonal to any vector in the model space, including ȳ. The vector ŷ − ȳ is the variation in the yi that is explained by the vector ŷ, i.e., by the model. By the Pythagorean Theorem, we have

||y − ȳ||² = ||ŷ − ȳ||² + ||y − ŷ||²
Each of the lengths in this equation is a sum of squares of quantities defined from the data.
Figure 2.12: Analysis of variance decomposition: the observation y, the overall mean ȳ, the fit ŷ = Xb, the recentered observation y − ȳ, the recentered fit ŷ − ȳ, and the residual y − ŷ, in the model space.
Namely,

||y − ȳ||² = ∑(yi − ȳ)²      (SST, sum of squares total)
||ŷ − ȳ||² = ∑(ŷi − ȳ)²      (SSR, sum of squares regression)
||y − ŷ||² = ∑(yi − ŷi)²     (SSResid, sum of squares residual)

where each sum runs over i = 1, . . . , n.
From these definitions, we get the following important relationship.
SST = SSR + SSResid
In the R output above, SSR is the entry in the column Sum Sq and the row labelled ACT.
This equation is usually summarized by saying something like this:
The total variation is the variation explained by x plus the error variation.
The fraction SSR / SST can then be interpreted as the percentage of variation in the yi
accounted for by the xi . This fraction is called R2 and is usually expressed as a percentage.
R computes this fraction and reports it in the summary of an lm object. For Example 2.4.4,
we would say that “39% of the variation in GPAs is explained by ACT scores.”
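The computation of this fraction from the analysis of variance table is a one-liner; a minimal sketch using the sums of squares reported by anova(l.sr) above:

> SSR = 6.3436; SSResid = 9.9735    # from the anova table above
> SST = SSR + SSResid
> SSR/SST                           # R^2, about 0.389; compare summary(l.sr) below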
> summary(l.sr)

Call:
lm(formula = GPA ~ ACT, data = sr)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90882 -0.19068  0.03028  0.30582  0.53473

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.25622    0.29443   4.267 5.53e-05 ***
ACT          0.07589    0.01077   7.044 6.51e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3576 on 78 degrees of freedom
Multiple R-Squared: 0.3888,     Adjusted R-squared: 0.3809
F-statistic: 49.61 on 1 and 78 DF,  p-value: 6.509e-10
2.5 Describing a Non-linear Relationship Between Two Variables
A linear function is not always the appropriate model to describe the relationship between
two variables. In this section we consider two different approaches to fitting a nonlinear
model. We will continue to assume that we are given data (x1 , y1 ), . . . , (xn , yn ).
Approach 1 - Linearize
Example 2.5.1
Suppose we wish to fit a function y = b0 e^(b1x) to the data. This equation transforms to

ln y = ln b0 + b1x

We then can use standard linear regression with the data (xi, ln yi). This returns ln b0 and b1. However this choice of b0, b1 does not minimize the sums of the squares of the residuals ri = yi − b0 e^(b1xi). Rather, it minimizes the sums of squares of ln yi − (ln b0 + b1xi). In a given application, it might not be so clear that this is desirable.
Note that in the above example, though ln y is nonlinear in y, linear regression finds the
coefficients b0 and b1 . Generalizing this example, suppose that f is a possibly nonlinear
function of one variable x that depends on two unknown parameters b0 and b1 . The goal
is to transform the data (x, y) to (g(x), h(y)) so that the equation y = f(x) is equivalent to h(y) = b0′ + b1′ g(x), where b0′ and b1′ are known functions of b0 and b1.
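To make Approach 1 concrete, the following sketch fits the exponential model of Example 2.5.1 by regressing ln y on x. The data are simulated, so the variable names (xsim, ysim) and the parameter values 3 and 0.4 are purely illustrative.

> xsim = seq(1,10,by=0.5)
> ysim = 3*exp(0.4*xsim)*exp(rnorm(length(xsim),sd=0.1))   # simulated data
> l.exp = lm(log(ysim)~xsim)    # linear regression of ln y on x
> b1 = coef(l.exp)[2]           # estimate of b1
> b0 = exp(coef(l.exp)[1])      # estimate of b0 (undo the logarithm)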
Approach 2 - Nonlinear Least Squares
Example 2.5.2
Continuing Example 2.5.1, suppose we wish to fit y = b0 e^(b1x) to data by minimizing the sums of the squares of the residuals ri = yi − b0 e^(b1xi). This is a problem in minimizing
a nonlinear function of two variables. Usually this requires an iterative method to
approximate the solution.
The Approaches Compared in R
Example 2.5.3
In Example 2.4.3, the relationship between the volume V and girth G of a sample of
cherry trees is nonlinear. Both the plot of the data and our geometrical intuition tell us this. Suppose that we assume the relationship has the form

V = b0 G^(b1)

This is not unreasonable, as we might expect volume to vary approximately as the square of the girth. Linearizing gives

ln V = ln b0 + b1 ln G

Regression yields ln b0 = −2.353 (b0 = .095) and b1 = 2.20. On the other hand, minimizing the sums of squares of residuals directly gives b0 = .087 and b1 = 2.24. SSE = 313.75 when minimized directly and SSE = 317.85 when linearized. Note that the nonlinear least-squares algorithm implemented in R is an iterative procedure that needs starting values for the unknowns.
> data(trees)
> attach(trees)
> logG=log(Girth)
> logV=log(Volume)
> l.trees=lm(logV~logG)
> l.trees
Call:
lm(formula = logV ~ logG)

Coefficients:
(Intercept)         logG
     -2.353        2.200

> fit1=predict(l.trees)
> sum( (Volume-exp(fit1))^2)
[1] 317.8461
> nl.trees=nls(Volume~b0*Girth^b1,start=list(b0=.2, b1=2.2))
> nl.trees
Nonlinear regression model
  model: Volume ~ b0 * Girth^b1
   data: parent.frame()
     b0      b1
0.08661 2.23639
 residual sum-of-squares: 313.8

Number of iterations to convergence: 4
Achieved convergence tolerance: 4.831e-07
2.6 Data - Samples
In the next two sections, we consider the question of data collection. If we are to make
decisions based on data, we need to be careful in their collection. In this section we consider
one common way of generating data, that of sampling from a population. Returning to
the Raisin Bran example, it is simply not feasible to weigh every box of Raisin Bran in
the warehouse to determine whether Kellogg’s is telling the truth in its claim that the net
weight of the boxes is 20 ounces. Instead, NIST tells us to select a sample consisting of a
relatively small number of boxes and weigh those. The hope is that this smaller sample is
representative of the larger collection.
Definition 2.6.1 (Population). A population is a well-defined collection of individuals.
As with any mathematical set, sometimes we define a population by a census or enumeration of the elements of the population. The registrar can easily produce an enumeration
of the population of all currently registered Calvin students. Other times, we define a population by properties that determine membership in the population. (In mathematics, we
define sets like this all the time since many sets in mathematics are infinite and so do not
admit enumeration.) For example, the set of all Michigan registered voters is a population
even though a census of the population would be very difficult to produce. It is perfectly
clear for an individual whether that individual is a Michigan registered voter or not.
Definition 2.6.2 (sample). A subset S of population P is called a sample from P .
Quite typically, we are studying a population P but have only a sample S and have the
values of one or several variables for each element of S. The canonical goal of (inferential)
statistics is:
Goal: Given a sample S from population P and values of a variable X on the elements of S, make inferences about the values of X on the elements of P.
Most commonly, we will be making inferences about parameters of the population.
Definition 2.6.3 (parameter). A parameter is a numerical characteristic of the population.
For example, we might want to know the mean value of a certain variable defined on
the population. One strategy for estimating the mean of such a variable is to take a
random sample and compute the mean of the sample elements. Such an estimate is called
a statistic.
Definition 2.6.4 (statistic). A statistic is a numerical characteristic of a sample.
Obviously, our success at solving this problem will depend to a large extent on how
representative S is of the whole population P with respect to the properties measured by
X. In turn, the representativeness of the sample will depend on how the sample is chosen. A
convenience sample is a sample chosen simply by locating units that conveniently present
themselves. A convenience sample of Calvin students could be produced by grabbing the
first 100 students that come through the doors of Johnny’s. It’s pretty obvious that in this
case, and for convenience samples in general, there is no guarantee that the sample is likely
to be representative of the whole population. In fact we can predict some ways in which a
“Johnny’s sample” would not be representative of the whole student population.
One might suppose that we could construct a representative sample by carefully choosing
the sample according to the important characteristics of the units. For example, to choose a
sample of 100 Calvin students, we might ensure that the sample contains 54 females and 46
males. Continuing, we would then ensure a representative proportion of first-year students,
dorm-livers, etc. There are several problems with this strategy. There are usually so many
characteristics that we might consider that we would have to take too large a sample so
as to get enough subjects to represent all the possible combinations of characteristics in
the proportions that we desire. It might be expensive to find the individuals with the
desired characteristics. We have no assurance that the subjects we choose with the desired
combination of characteristics are representative of the group of all the individuals with
those characteristics. Finally, even if we list many characteristics, it might be the case that
the sample will be unrepresentative according to some other characteristic that we didn’t
think of and that characteristic might turn out to be important for the problem at hand.
Statisticians have settled on using sampling procedures that employ chance mechanisms.
The simplest such procedure (and also by far the most important) is known as simple
random sampling.
Definition 2.6.5 (simple random sample). A simple random sample (SRS) of size k from
a population is a sample that results from a procedure for which every subset of size k has
the same chance to be the sample chosen.
For example, to pick a random sample of size 100 of Calvin students, we might write
the names of all Calvin students on index cards and choose 100 of these cards from a
well-mixed bag of all the cards. In practice, random samples are often picked by computers that produce “random numbers.” (A computer can’t really produce random numbers since a computer can only execute a deterministic algorithm. However computers
can produce numbers that behave as if they are random.) In this case, we would number all students from 1 to 4,224 and then choose 100 numbers from 1 to 4224 in such
a way that any set of 100 numbers has the same chance of occurring. The R command
sample(1:4224,100,replace=F) will choose such a set of 100 numbers.
Now it is certainly possible that a random sample is unrepresentative in some significant
way. Since all possible samples are equally likely to be chosen, by definition it is possible
that we choose a bad sample. For example, a random sample of Calvin students might
fail to have any seniors in it. However the fact that a sample is chosen by simple random
sampling enables us to make quantitative statements about the likelihood of certain kinds of
nonrepresentativeness. This in turn will enable us to make inferences about the population
and to make statements about how likely it is that our inferences are accurate.
The concept of random sampling can be extended to produce samples other than simple
random samples. For example, we might want to take into account at least some of the
characteristics of the members of the population without falling prey to the basic problems
with this approach that we described above. For example, we might want to ensure that our
sample of Calvin students is at least representative as far as class level goes. In our sample
of 100 students, we would then want to choose a sample according to the breakdowns in
Table 2.6.
Having defined the sizes of our subsamples however, we would then proceed to choose
simple random samples from each subpopulation.
Definition 2.6.6 (stratified random sample). A stratified random sample of size k from a population is a sample that results from a procedure that chooses simple random samples from each of a finite number of groups (strata) that partition the population.
In the above example, we chose the random sample so that the number of individuals in the sample from each stratum was proportional to the size of that stratum. While this procedure has much to recommend it, it is not necessary and sometimes not even desirable.
Class Level    Population    Sample
First-year          1,129        27
Sophomore           1,008        24
Junior                897        21
Senior              1,041        24
Other                 149         4
Total               4,224       100

Table 2.2: Population of Calvin Students and Proportionate Sample Sizes
For example, only 4 “other” students appear in our sample of size 100 from the whole
population. This is fine if we are only interested in making inferences about the whole
population, but often we would like to say something about the subgroups as well. For
example, we might want to know how much Calvin students work in off-campus jobs but we
might expect and would like to discover differences among the class levels in this variable.
For this purpose, we might choose a sample of 20 students from each of the five strata.
(Of course we would have to be careful about how to combine our numbers when making
inferences about the whole population.) We would say about this sample that we have
“oversampled” one of the groups. In public opinion polls, it is often the case that small
minority groups are oversampled. The sample that results will still be called a random
sample.
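In R, a stratified random sample like the one in Table 2.2 can be drawn by taking a simple random sample within each stratum. This is a minimal sketch under the assumption of a hypothetical data frame students with columns name and class (the class level of every student).

> sizes = c('First-year'=27,'Sophomore'=24,'Junior'=21,'Senior'=24,'Other'=4)
> strat = lapply(names(sizes), function(lev) {
+    pool = as.character(students$name[students$class==lev])  # everyone in this stratum
+    sample(pool, sizes[lev], replace=F)                      # SRS within the stratum
+ })
> stratified.sample = unlist(strat)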
Definition 2.6.7 (random sample). A random sample of size k from a population is
a sample chosen by a procedure such that each element of the population has a fixed
probability of being chosen as part of the sample.
While we need to give a definition of probability in order to make this definition precise,
it is clear from the above examples what we mean. This definition differs from that of
a simple random sample in two ways. First, it does not requires that each object has
the same likelihood of being the sample chosen. Second, it does not require that equal
likelihood extends to groups. A sampling method that we might employ given a list of
Calvin students is to choose one of the first 422 students in the list and then choose every
422nd student thereafter. Obviously some subsets can never occur as the sample since two
students whose names are next to each other in the list can never be in the same sample.
Such a sample might indeed be representative however.
It is very important to note that we cannot guarantee by using random sampling of
whatever form that our sample is representative of the population along the dimension
we are studying. In fact, with random sampling it is always possible that we select a really bad (unrepresentative) sample. What we hope to be able to do
(and we will later see how to do it) is to be able to quantify our uncertainty about the
representativeness of the sample. The next example gives us an idea of how this might
work.
Example 2.6.1
The dataset http://www.calvin.edu/~stob/data/miaa05.csv contains the statistics on every basketball player who played for an MIAA Men’s basketball team in 2005.
This collection of players will be our population. Of course there is no reason to take
a sample to answer a question about this population, but let’s see what would happen
if we did. Suppose that we are interested in the points per game (PTSG) of these
players. In the code below, we first take a sample of size 5.
> miaa=read.csv('http://www.calvin.edu/~stob/data/miaa05.csv')
> miaa[1:5,]
  Number              Player GP GS Min AvgMin  FG FGA FGPct FG3 FG3A
1     14 Brian Schaefer..... 25 19 769   30.8 146 366 0.399  67  185
2     32 Billy Collins Jr... 25 19 641   25.6 119 285 0.418  41  131
3      5 Mike Lewis......... 25 18 553   22.1  99 162 0.611   0    2
4     30 Adam Novak......... 20 13 453   22.6  95 163 0.583   3    3
5     24 Jeff Nokovich...... 25 17 702   28.1  38 109 0.349   7   31
  FG3Pct FT FTA FTPct Off Def Tot RBG PF FO   A TO Blk Stl Pts PTSG
1  0.362 66  94 0.702  24  42  66 2.6 37  1  96 69   1  40 425 17.0
2  0.313 37  60 0.617  18  41  59 2.4 51  0  37 35   1  19 316 12.6
3  0.000 47  63 0.746  58  81 139 5.6 65  1  29 40   6  26 245  9.8
4  1.000 45  64 0.703  52  79 131 6.6 42  2  47 25   5  33 238 11.9
5  0.226 36  60 0.600  20  60  80 3.2 63  2 104 49   3  52 119  4.8
> ptsg=miaa$PTSG
> s=sample(ptsg,5,replace=F)
> mean(s)
[1] 6.7
The sample of size 5 that we chose has a mean of 6.7. It would be plausible to use this
sample to estimate the mean of the entire population. But if we had chosen a different
sample, we would have computed a different sample mean. In the code below, we show what might happen if we choose 1,000 different samples of size 5.

Figure 2.13: PTSG of MIAA players and average PTSG of 1000 samples of size 5.
> r=replicate(1000,mean(sample(ptsg,5,replace=F)))
> h1=histogram(ptsg)
> h2=histogram(r)
> summary(r)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.300   4.000   5.430   5.591   6.925  14.620
> mean(ptsg)
[1] 5.593284
The population and the 1,000 simulated samples are in Figure 2.13. In this case, since
we know the true mean of the population (5.59), we can see what would happen if we
used a sample of size 5 to estimate this number. It is quite possible that our estimate
would be relatively close to 5.59. However it is also possible that we would get a very
unrepresentative sample - in some of the simulated samples, the mean is more than 10!
In fact, in this case we could (in principle) list all possible samples of size 5 (there are only (134 choose 5) = 333,859,526 of them) and look at the distribution of the means in the population of all samples.
The above example illustrates a basic paradigm of statistical analysis. In it we have a
variable defined on a population. The distribution of that variable is unknown (our example was artificial in that it was known). In random sampling from the population, we
compute a statistic related to the variable. That statistic itself has a distribution, known
as the sampling distribution of the statistic. We simulated the sampling distribution in the
example above but conceptually at least we can envision the entire sampling distribution.
If the distribution of the population variable is not known, we will not know the sampling
distribution. However, if we make some assumptions about the unknown distribution of
the population variable, we can draw some conclusions about the shape of the sampling
distribution. Section 3.5 addresses this issue in the case that our statistic is the sample
mean. Even in the above example, it is possible to make some qualitative observations about
the sampling distribution of the sample mean. It appears, for example, that the distribution of the sample mean is more symmetric than the distribution of the variable in the
population. Also it appears that the variation in the sample mean is less than the variation
in the variable itself. We’ll make these tentative conclusions more precise later.
2.7 Data - Experiments
The American Music Conference is an organization that promotes music education at all
levels. On their website http://www.amc-music.com/research_briefs.htm they promote music education as having all sorts of benefits. For example, they quote a study
performed at the University of Sarasota in which “middle school and high school students
who participated in instrumental music scored significantly higher than their non-band
peers in standardized tests”. Does this mean that if we increase the availability of and
participation in instrumental programs in the schools, standardized test scores would
generally increase? The problem with that conclusion is that there might be other factors
that cause the higher test scores of the clarinetists. For example, it might be the case
that students who play in bands are more likely to come from schools with more financial
resources. They are also more likely to be in families that are actively involved in their
education. It might be that music participation and higher test scores are a result of these
variables.
Scientists have long known that to establish a causal relationship between two variables,
it is necessary to construct an experiment in which conditions, such as the values of other variables,
are controlled. The data above come from an observational study rather than an experiment. Even worse, the observational study was retrospective rather than prospective.
(In a prospective study we could at least observe the conditions of the subjects along the
way and record the possibly relevant variables, even if we didn’t control them.)
The “gold standard” for establishing a cause and effect relationship between two variables is the randomized comparative experiment. In an experiment, we want to study
the relationship between two or more variables. At least one variable is an explanatory
variable and the value of the variable can be controlled or manipulated. At least one
variable is a response variable. The experimenter has access to a certain set of experimental units (subjects, individuals, cases), sets various values of the explanatory
variable to create a treatment, and records the values of the response variables. The
experiment is a comparative experiment since the goal is to compare the responses given
different values of the explanatory variables, i.e., different treatments. In a randomized
experiment we assign the individuals to the various treatments at random. For example,
if we took 100 fifth graders and randomly chose 50 of them to be in the band and 50 of
them not to receive any music instruction, we could begin to believe that differences in
their test scores could be explained by the different treatments. Randomization here plays
the same role that it did in the previous section - we are attempting to arrange that the
group assigned any particular treatment is representative of the whole group of subjects.
Consider the chickwts data. In this experiment, the experimenter was attempting to
determine which chicken feed caused the greatest weight gain. Feed was the explanatory
variable and there were six treatments (six different feeds). Weight was the response
variable. The first step in designing this experiment was to assign baby chicks at random
to the six different feed groups.
Often clinical trials of pharmaceuticals or medical procedures are randomized comparative experiments. In testing a new drug, there are often two groups of subjects – those who
receive the drug and those who do not. A group receiving “no” treatment is often called
a control group. A control is simply a level of the explanatory variable that represents
the status quo or no treatment at all. In pharmaceutical trials, the control group is often
given a placebo. A placebo is a treatment that looks like the others but has no effect. In
a pharmaceutical trial, for example, a placebo might be a pill that does nothing. One often
finds drug documentation that refers to a “placebo-controlled, randomized, comparative
experiment.”
Randomization ensures that there is no bias in the assignment of subjects to the experimental treatments. In a medical study, this ensures that characteristics of the patients
(e.g., age, severity of the disease, height, weight, eye color) are not the explanation of
any relationship found between the explanatory and response variable. In the chickwts
example, the differences in weight among the six groups of chickens are not due to the chickens
(if indeed they were randomly assigned groups). Randomization in a pharmaceutical trial
ensures that any difference between the placebo group and the drug group is not due to,
say, age. But this is not enough to claim that the difference in treatments “causes” the
difference in the “response.”
In the example of music participation (explanatory variable) and test scores (response
variable), we noted that there was a third variable (poverty) that was a better explanation
of the differences in the test scores than music participation (or at least so we conjectured).
In this example, poverty is a lurking variable. A lurking variable is any variable that has
a significant effect on the response but that hasn’t been included in the study variables.
Lurking variables are a key reason that observational studies (particularly retrospective
ones) fail to determine causality. It’s pretty easy to construct examples of lurking variables.
The more churches a city has, the more bars it has, but it is unlikely that increased church
attendance causes increased drunkenness. The lurking variable here is size of city. But
lurking variables often exist in experimental designs as well. For example, in the chickwts
data, the chickens were probably located in six different areas. Perhaps the physical setup
of these six different areas had some important effect on the eating patterns of the chicks.
Lurking variables such as weather or soil conditions are a particular concern in agricultural
experiments.
Ideally, if we know that a variable has an effect on the response variable we should control for it. A blocking variable is a variable other than the explanatory and response
variables that is controlled in the experiment, usually because it is thought that it might
have an effect on the response variable. The term comes originally from agricultural experiments where a plot of land was a block. Suppose that we are trying to determine the
effect of fertilizer (explanatory variable) on yield (response variable). Suppose that we have
three unimaginatively named fertilizers A, B, C. We could divide the plot of land that we
are using as in the first diagram of Figure 2.14. But it might be the case that the further
north in the plot, the better the soil conditions. Northernness would then be a lurking
variable. Instead, we could divide the patch using the second diagram in Figure 2.14. Of
course there still might be variations in the soil conditions across the three fertilizers. But
we would at least be able to measure the effect of northernness.

Figure 2.14: Two experimental designs for three fertilizers.

In clinical trials, age and gender are often used as blocking variables. We hope to uncover a relationship between the drug dosage and the response of the patient, but it might be the case that this relationship is different for males and females. Of course randomization will likely ensure that the gender breakdown of the two treatment groups is roughly the
same, but if we think that gender is an important factor, we can ensure that the gender
breakdown is exactly the same. This will help us decide how much of the variation
between the treatment groups is due to gender and not the drug. (Note that in this
example, gender is a different sort of variable than is drug. We cannot control it and assign
subjects randomly to one of the two genders!)
We have only touched on the major issues in the subject of experimental design. There
are many considerations beyond what we have described here. But the fundamental principles are the same:
1. Randomize. Randomly assigning individuals to treatment ensures that certain
uncontrollable sources of variation are spread equally over the treatments. Further,
randomization allows us to use statistical techniques to draw conclusions about the
variation in the response variable.
2. Block. Variables that could affect the response or the relationship between the
treatment and the response should be controlled if possible. Constructing blocks
for the different levels of such a variable allows us to separate out the effects of the
treatment variables and the blocking variables.
3. Replicate. Just like a larger random sample is better than a smaller one, assigning
many subjects to each treatment allows us to separate out the normal variation in
individuals from the variation caused by the treatment variables.
2.8 Exercises
2.1 Read Sections 1 and 2 of SimpleR. (Get this from http://cran.r-project.org/doc/
contrib/Verzani-SimpleR.pdf.) Do problems 1, 2, 5, and 6 of Section 2 of SimpleR.
2.2 Load the builtin R dataset chickwts. (Use data(chickwts).)
a) How many individuals are in this dataset?
b) How many variables are in this dataset?
c) Classify the variables as quantitative or categorical.
2.3 The dataset singer comes with the lattice package. Make sure that you have loaded
the lattice package and then load that dataset. The dataset contains the heights of 235
singers in the New York Choral Society. Make some comments about the nature of the
distribution of heights. Use a histogram to inform those comments.
2.4 The distribution of a quantitative variable is symmetric about m if whenever there are k data values equal to m + d there are also k data values equal to m − d.
a) Show that if a distribution is symmetric about m then m is the median. (You may
need to handle separately the cases where the number of values is odd and even.)
b) Show that if a distribution is symmetric about m then m is the mean.
c) Create a small distribution that is not symmetric about m, but the mean and median
are both equal to m.
2.5 Describe some situations where the mean or median is clearly a better measure of
central tendency than the other.
2.6 We could compute the mean absolute deviation from the median instead of from the
mean. Show that the mean absolute deviation from the median is always smaller than the
mean absolute deviation from the mean.
2.7 Let SS(c) = ∑(xi − c)². (SS stands for sum of squares.) Show that the smallest value of SS(c) occurs when c = x̄. This shows that the mean is a minimizer of SS. (Hint: use calculus.)
2.8 Sketch a boxplot of a distribution that is positively skewed.
2.9 Show that the total deviation from the mean, defined by

total deviation from the mean = ∑(xi − x̄)    (sum over i = 1, . . . , n),

is 0 for any distribution.
2.10 Find a distribution with 10 values between 0 and 10 that has as large a variance as
possible.
2.11 Find a distribution with 10 values between 0 and 10 that has as small a variance as
possible.
2.12 Suppose that x1 , . . . , xn are the values of some variable and a new variable y is defined
by adding a constant c to each xi . In other words, yi = xi + c for all i.
a) How does ȳ compare to x̄?
b) How does Var(y) compare to Var(x)?
2.13 Repeat Problem 2.12 but with y defined by multiplying each xi by c. In other words,
yi = cxi for all i.
2.14 The R dataset barley has the yield in bushels/acre of barley for various varieties of
barley planted in 1931 and 1932. There are three categorical variables in play: the variety
of barley planted, the year of the experiment, and the site at which the experiment was
done (the site Grand Rapids is in Minnesota, not Michigan). By examining each of these
variables one at a time, make some qualitative statements about the way each variable
affected yield. (e.g., did the year in which the experiment was done affect yield?)
2.15 A dataset from the Data and Story Library on the result of three different methods
of teaching reading can be found at http://www.calvin.edu/~stob/data/reading.csv.
The data includes the results of various pre- and post-tests given to each student. There
were 22 students taught by each method. Using the results of POST3, what can you say
about the differences in reading ability of the three groups at the end of the course? Would
you say that one of the methods is better than the other two? Why or why not?
2.16 The death penalty data illustrated Simpson’s paradox. Construct your own illustration to conform to the following story:
Two surgeons each perform the same kind of heart surgery. The result of
the surgery could be classified as “successful” or “unsuccessful.” They have
each done exactly 200 surgeries. Surgeon A has a greater rate of success than
Surgeon B. Now the surgical patient’s case can be classified as either “severe”
or “moderate.” It turns out that when operating on severe cases, Surgeon B
has a greater rate of success than Surgeon A. And when operating on moderate
cases, Surgeon B also has a greater rate of success than Surgeon A.
By the way, who would you want to be your surgeon?
2.17 Runner’s World has an online calculator http://www.runnersworld.com/cda/trainingcalculator/
0,7169,s6-238-277-279-0-0-0-0,00.html that can be used to predict a runner’s time
T2 in a race of distance D2 from the runner's time T1 in a race of distance D1. The formula used by the website is

T2 = T1 (D2/D1)^1.06
Investigate the accuracy of this formula when applied to the men’s world record data and
report on your findings. Are there any records that are particularly inconsistent with this
formula?
2.18 A dataset containing some statistics on all baseball teams for the 1994-1998 baseball
seasons is available at http://www.calvin.edu/~stob/data/team.csv. Suppose that you
want to predict the number of runs scored (R) by a team just from knowing how many
home runs (HR) the team has.
a) Write the linear regression of R on HR.
b) Compute the predicted values for each of the teams. (Use predict(l) in R.) Make
some comments on the fit. (For example, are there any values not particularly
well-fit? Do you have any explanations for that?)
2.19 Suppose that we wish to fit a linear model without a constant: i.e., y = bx. Find
the value of b that minimizes the sums of squares of residuals in this case. (Hint: there is
only one variable here, b, so this is a straightforward Mathematics 161 max-min problem.)
R will compute b in this case as well with the command lm(y∼x-1). In this expression,
1 stands for the constant term and -1 therefore means leave it out. Alternatively we can
write lm(y∼x+0).
2.20 Data on the 2003 American League Baseball season is in the file http://www.calvin.edu/~stob/data/al2003.csv. Can we predict the number of wins (W) that a team will
have from the number of runs (R) that the team scores?
a) Write W as a linear function of R.
b) A better model takes into account the runs that a team’s opponent has scored as
well. Write W − L as a function of R − OR (here L is losses and OR is opponents
runs scored). You will have to construct new vectors that have the values of W − L
and R − OR. The function lm(R-OR~W-L,data=..) will not work!
c) Why might it make sense from the meaning of the variables W − L and R − OR to
use a linear model without a constant term as in problem 1? Write W − L as a linear
function of R − OR without a constant term.
d) Compare the results of parts (b) and (c) as to the goodness of fit of the model.
2.21 Find a transformation that transforms the following nonlinear equations y = f(x) (that depend on parameters b0 and b1) to linear equations g(y) = b0′ + b1′ h(x).

a) y = b0 / (b1 + x)

b) y = x / (b0 + b1x)

c) y = 1 / (1 + b0 e^(b1x))
2.22 The R dataset Puromycin gives the rate of reaction (in counts/min/min) as a function of the concentration of an enzyme (in ppm) for two different substrates - one treated with Puromycin and one not treated. The biochemistry suggests that these two variables are related by

rate = b0 · conc / (b1 + conc)
Find the least squares estimates of b0 and b1 for the treated condition by both of the
methods suggested in this section and compare the sums of squares of residuals.
2.23 Often, we take a sample by some convenient method (a convenience sample) and
hope that the sample “behaves like” a random sample. For each of the following convenient
methods for sampling Calvin students, indicate in what ways the sample is likely not
to be representative of the population of all Calvin students.
a) The students in Mathematics 232A.
b) The students in Nursing 329.
c) The first 30 students who walk into the FAC west door after 12:30 PM today.
d) The first 30 students you meet on the sidewalk outside Hiemenga after 12:30 PM
today.
e) The first 30 students named in the bod book.
f ) The men’s basketball team.
2.24 Suppose that we were attempting to estimate the average height of a Calvin student.
For this purpose, which of the convenience samples in the previous problem would you
suppose to be most representative?
2.25 Donald Knuth, the famous computer scientist, wrote a book entitled “3:16”. This
book was a Bible study book that studied the 16th verse of the 3rd chapter of each book
of the Bible (that had a 3:16). Knuth’s thesis was that a Bible study of random verses
of the Bible might be edifying. The sample was of course not a random sample of Bible
verses and Knuth had ulterior motives in choosing 3:16. Describe a method for choosing a
random sample of 60 verses from the Bible. Construct a method that is more complicated
than simple random sampling that seeks to get a sample representative of all parts of the
Bible.
2.26 Suppose that we wish to survey the Calvin student body to see whether the student
body favors abolishing the Interim (we could only hope!). Suppose that instead of a simple
random sample, we select a random sample of size 20 from each of the five groups of
Table 2.6. Suppose that of 20 students in each group, 9 of the first-year students, 10 of
the sophomores, 13 of the juniors, 19 of the seniors and all 20 of the other students favor
abolishing the interim. How would you use these numbers to estimate the proportion of
the whole student body that favors abolishing the Interim?
2.27 Consider the set of natural numbers P = {1, 2, . . . , 30} to be a population.
a) How many prime numbers are there in the population?
b) If a sample of size 10 is representative of the population, how many prime numbers
would we expect to be in the sample? How many even numbers would we expect to
be in the sample?
c) Using R choose 5 different samples of size 10 from the population P . Record how
many prime numbers and how many even numbers are in each sample. Make any
comments about the results that strike you as relevant.
2.28 In a clinical trial called “Preemptive Analgesia With OxyContin Versus Placebo
Before Surgery for Long Bone Fractures” which is currently being performed at the Ramban
Health Care Campus, the researchers are attempting to determine whether pain medication
provided before surgery for repair of fractures helps relieve pain (analgesia) after surgery.
(Current clinical trials are described at the website http://www.clinicaltrials.gov/.
There is also a lot of information at this website on how clinical trials are conducted.)
a) What is the explanatory variable and what is the response variable in this experiment?
b) What are the levels of the explanatory variable (i.e., the treatments) suggested by
the title of the experiment?
c) Consider the response variable. How might it be measured? What are some of the
difficulties in measuring it?
d) The experimenters want to enroll 80 subjects in the experiment. How do you think
they should go about assigning the 80 subjects to the treatments?
2.29 The R dataset CO2 has the results of an experiment on the grass species Echinochloa
crus-galli. Look at the data and the help document that accompanies the dataset (to get
a description of a dataset use ?CO2).
a) What are the explanatory and response variables in this experiment according to the
short description of the data?
b) What variables serve as blocking variables in this experiment?
2.30 Most clinical studies are double-blind randomized comparative experiments. Here
blind refers to the fact that the subject does not know which treatment she is getting
(e.g., the drug or the placebo) and double-blind refers to the fact that the clinician who is
monitoring the response variable also does not know which treatment the patient is getting.
Why is it desirable that the experiment be double-blind, if possible?
2.31 In 1957, the Joint Report of the Study Group on Smoking and Health concluded
(in Science, vol. 125, pages 1129–1133) that smoking is an important health hazard for it
causes an increased risk for lung cancer. However for many years after that the tobacco
industry denied this claim. One of their principal arguments was that the data indicating
this relationship came from retrospective observational studies. (Indeed, the data in the
Joint Report came from 16 independent observational studies.)
a) One out of every ten males who smoke at least two packs a day dies of lung cancer.
Only one out of every 275 males who do not smoke dies of lung cancer. Explain
why the tobacco industry claimed that this does not prove that smoking causes lung
cancer in some men.
b) There have been no randomized comparative experiments to investigate the relationship between smoking and lung cancer. Explain why not.
c) Much of the best evidence that smoking causes lung cancer comes from prospective
observational studies. Explain why prospective observational studies might help to
establish this link.
2.32 Some people claim that it is more difficult to make free-throws in the Calvin Field
House while shooting at the south basket than at the north basket. Construct an experiment to test this claim. (You need not perform it!) Use the language of this section to
describe the experiment carefully.
3 Probability
3.1 Modelling Uncertainty
Probability theory is the mathematical discipline concerned with modeling situations in
which the outcome is uncertain. For example, in random sampling, we do not know which
sample of individuals from the population we might actually get. The
basic notion is that of a probability.
Definition 3.1.1 (A probability). A probability is a number meant to measure the likelihood of the occurrence of some uncertain event (in the future).
Definition 3.1.2 (probability). Probability (or the theory of probability) is the mathematical discipline that
1. constructs mathematical models for “real-world” situations that enable the computation of probabilities (“applied” probability)
2. develops the theoretical structure that undergirds these models (“theoretical” or
“pure” probability).
The setting in which we make probability computations is that of a random process.
(What we call a random process is usually called a random experiment in the literature
but we use process here so as not to get the concept confused with that of randomized
experiment.) A random process has three key characteristics:
1. A random process is something that is to happen in the future (not in the past). We
can only make probability statements about things that have not yet happened.
2. The outcome of the process could be any one of a number of outcomes and which
outcome will obtain is uncertain.
3. The process could be repeated indefinitely (under essentially the same circumstances),
at least in theory.
Historically, some of the basic random processes that were used to develop the theory of
probability were those originating in games of chance. Tossing a coin or dealing a poker
hand from a well-shuffled deck are examples of such processes. For our purposes the two
most important random processes are producing a random sample from a population and
assigning subjects randomly to the treatments of a randomized comparative experiment.
Essentially all the probability statements that we want to make in statistics come from
these two situations (and their cousins).
Given a random process, the set (collection) of all possible outcomes will be referred
to as the sample space. An event is simply a set of some of the outcomes. These two
fundamental notions are illustrated by the following example of random sampling.
Example 3.1.1
Twenty-nine students are in a certain statistics class. It is decided to choose a simple
random sample of 5 of the students. There are a boatload of possible outcomes. (It
can be shown that there are 118,755 different samples of 5 students out of 29.) One
event of interest is the collection of all outcomes in which all 5 of the students are
male. Suppose that 25 of the students in the class are male. Then it can be shown
that 53,130 of the outcomes comprise this event.
Given a random process, our goal is to assign to each event E a number P (E) (called
the probability of E) such that P (E) measures in some way the likelihood of E. In
order to assign such numbers however, we need to understand what they are intended to
measure. Interpreting probability computations is fraught with all sorts of philosophical
issues but it is not too great a simplification at this stage to distinguish between two
different interpretations of probability statements.
The frequentist interpretation.
The probability of an event E, P (E), is the limit of the relative frequency that E
occurs in repeated trials of the process as the number of trials approaches infinity.
The subjectivist interpretation.
The probability of an event E, P (E), is an expression of how confident the assignor is
that the event will happen in the next trial of the process.
It is easy to think of examples of probability statements in the real world that are
more naturally interpreted using either of these interpretations rather than the other.
In this text, we will usually phrase our interpretations of probability statements using the
frequentist interpretation. Mathematics cannot tell us which of these two interpretations is
right or indeed how to assign probabilities in any particular situation. But mathematicians
have developed some basic axioms to constrain our choice of probabilities. The three
fundamental axioms of probability are
Axiom 3.1.3. For all events A, P (A) ≥ 0.
Axiom 3.1.4. P (S) = 1, where S denotes the sample space.
Axiom 3.1.5. If A1 and A2 are disjoint events (i.e., have no outcomes in common) then
P (A1 or A2 ) = P (A1 ) + P (A2 )
If one interprets probabilities as limiting relative frequencies, it is easy to see that these
three axioms should be true.
The axioms do not tell us how to assign the probabilities in any particular case. They only
provide some minimal constraints on this assignment. There are two important methods
for assigning that we will use extensively.
The equally likely outcomes model
In some cases, we can list all the outcomes in such a way that is plausible to suppose
that each outcome is equally likely. For example, the very definition of choosing a random
sample of size 5 from a class of 29 requires us to develop a method so that each sample of
size 5 is equally likely to occur. In this case, it is easy to compute the probability of an
event E. If there are N equally likely outcomes, the probability of each outcome should
be 1/N . The probability of an event E is k/N where there are k outcomes in the event.
Example 3.1.2
A six-sided die is rolled. Then one of six possible outcomes occurs. From the
symmetry of the die it is reasonable to assume that the six outcomes are equally likely.
Therefore, the probability of each outcome is 1/6. If E is the event that is described
by “the die comes up 1 or 2” then P (E) = 2/6 = 1/3.
In the more interesting and more useful Example 3.1.1, there are 118,755 possible different samples of five students from 29 and, by the definition of simple random sample, these samples are equally likely to occur. Since 53,130 of these comprise the event E of getting all males in the sample, the probability of this event is 53130/118755 = 44.7%.
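These counts are easy to verify in R with the choose() function; a minimal sketch:

> choose(29,5)               # number of possible samples of 5 from 29
[1] 118755
> choose(25,5)               # number of outcomes in which all 5 students are male
[1] 53130
> choose(25,5)/choose(29,5)  # probability of an all-male sample, about 0.447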
Example 3.1.3
Perhaps the canonical historical example of a random process for which it is possible
to generate a list of equally likely outcomes is the process in which two dice are thrown
and the number on each face is recorded. It is easy to see that there are 36 equally
likely outcomes (list the pairs (i, j) of numbers where i is the number on the first die,
j is the number on the second die and i and j range from 1 to 6). One event related
to this process is the event E that the throw results in a sum of 7 on the two dice. It
is easy to see that there are 6 outcomes in E so that P (E) = 6/36 = 1/6.
Past performance as an indicator of the future
In some cases, we have data on many previous trials of the process. In this case we may
estimate the probability of each outcome by the relative frequency with which it occurred
in the previous trials. This method is used extensively in the insurance industry. For
example, the probability that a male alive on his 55th birthday dies before his 56th is currently estimated to be 0.0081, or slightly less than 1%, based on the recent history of 55 year old males.
Example 3.1.4
In the 2007 baseball season, Manny Ramirez came to the plate 569 times. Of those
569 times, he had 89 singles, 33 doubles, 1 triple, 20 homeruns, 78 walks (and hit
by pitch), and 348 outs. We might estimate that the probability Ramirez will hit a
homerun in his next plate appearance to be 20/569 = .035.
For the purpose of investigating how random processes work, it is very useful to use
R. In the following example, we simulate one, and then five, of Manny Ramirez’s plate
appearances.
> outcomes=c('Out','Single','Double','Triple','Homerun','Walk')
> ramirez=c(348,89,33,1,20,78)/569
> sum(ramirez)
[1] 1
> ramirez
[1] 0.611599297 0.156414763 0.057996485 0.001757469 0.035149385 0.137082601
> sample(outcomes,1,prob=ramirez)
[1] "Double"
> sample(outcomes,5,prob=ramirez,replace=T)
[1] "Out"    "Double" "Out"    "Out"    "Walk"
In the next example, we simulate the tossing of a coin 1,000 times. The graph provides
some evidence that the limiting relative frequency of “Heads” is 0.5.
> coins=sample(c('H','T'),1000,replace=T)
> cumfrequency = cumsum(coins=='H')/c(1:1000)
> plot(cumfrequency,type='l')
3.2 Discrete Random Variables
3.2.1 Random Variables
If the outcomes of a random process are numbers, we will call the random process a random
variable. Since non-numerical outcomes can always be coded with numbers, restricting
our attention to random variables results in no loss of generality. We will use upper-case
letters to name random variables (X, Y , etc.) and the corresponding lower-case letters (x,
y, etc.) to denote the possible values of the random variable. Then we can describe events
by equalities and inequalities so that we can write such things as P (X = 3), P (Y = y) and
P (Z ≤ z). Some examples of random variables include
1. Choose a random sample of size 12 from 250 boxes of Raisin Bran. Let X be the
random variable that counts the number of underweight boxes and let Y be the
random variable that is the average weight of the 12 boxes.
2. Choose a Calvin senior at random. Let Z be the GPA of that student and let U be
the composite ACT score of that student.
3. Assign 12 chicks at random to two groups of six and feed each group a different feed.
Let D be the difference in average weight between the two groups.
4. Throw a fair die until all six numbers have appeared. Let T be the number of throws
necessary.
We will consider two types of random variables, discrete and continuous.
Definition 3.2.1 (discrete random variable). A random variable X is discrete if its possible
values can be listed x1 , x2 , x3 , . . . .
In the example above, the random variables X, U , and T are discrete random variables.
Note that the possible values for X are 0, 1, . . . , 12 but that T has infinitely many possible values 6, 7, 8, . . . . The random variables Y , Z, and D above are not discrete. The random
variable Z (GPA) for example can take on all values between 0.00 and 4.00. (We should
make the following caveat here however. All variables are discrete in the sense that there
are only finitely many different measurements possible to us. Each measurement device
that we use has divisions only down to a certain tolerance. Nevertheless it is usually more
helpful to view these measurements as on a continuous scale rather than a discrete one.
We learned that in calculus.)
Definition 3.2.2 (continuous random variable). A random variable X is continuous if its
possible values are all x in some interval of real numbers.
In this section, we focus on properties of discrete random variables.
Example 3.2.1
Two dice are thrown and the sum X of the numbers appearing on their faces is
recorded. X is a random variable with possible values 2, 3, . . . , 12. By using the equally
likely outcomes method we can see that P (X = 7) = 1/6 and P (X ≤ 5) = 5/18.
If X is a discrete random variable, we will be able to compute the probability of any
event defined in terms of X if we know all the possible values of X and the probability
P (X = x) for each such value x.
Definition 3.2.3 (probability mass function). The probability mass function (pmf) of a
random variable X is the function f such that for all x, f (x) = P (X = x). We will
sometimes write fX to denote the probability mass function of X when we want to make
it clear which random variable is in question.
Figure 3.1: The probability histogram for the Calvin class random variable.
The word mass is not arbitrary. It is convenient to think of probability as a unit mass
that is divided into point masses at each possible outcome. The mass of each point is its
probability. Note that mass obeys the probability axioms.
Example 3.2.2
Suppose that a student is chosen at random from the Calvin student body. We
will code the class of the student by 1, 2, 3, 4 for the four standard classes and 5 for
other. The coded class is a random variable. Referring to Table 2.6, we see that the
probability mass function of X is given by f (1) = 0.27, f (2) = 0.24, f (3) = 0.21,
f (4) = 0.25, f (5) = 0.03, and f (x) = 0 otherwise.
One useful way of picturing a probability mass function is by a probability histogram. For
the mass function in Example 3.2.2, we have the corresponding histogram in Figure 3.1.
On the frequentist interpretation of probability, if we repeat the random process many
times, the histogram of the results of those trials should approximate the probability histogram. The probability histogram is not a histogram of data from many trials however.
It is a representation of what might happen in the next trial. We will often use this idea to
work in reverse. In other words, given a histogram of data obtained from successive
trials of a random process, we will choose the pmf to fit the data. Of course we will not
insist on a perfect fit; instead we will choose a pmf f that fits the data approximately but
has some simple form.
Several families of random variables are particularly important to us and provide models
for many real-world situations. We examine two such families here. Each arises from
a common kind of random process that will be important for statistical inference. The
second of these arises from the very important case of simple random sampling from a
population. We will first study a somewhat different case (which, among other uses, can
be used to study sampling with replacement).
3.2.2 The Binomial Distribution
A binomial process is a process characterized by the following conditions:
1. The process consists of a sequence of finitely many (n) trials of some simpler process.
2. Each trial results in one of two possible outcomes, usually called success (S) and
failure (F ).
3. The probability of success on each trial is a constant denoted by π.
4. The trials are independent one from another - that is the outcome of one trial does
not affect the outcome of any other.
Thus a binomial process is characterized by two parameters, n and π. Given a binomial
process, the natural random variable to observe is the number of successes.
Definition 3.2.4 (binomial random variable). Given a binomial process, the binomial
random variable X associated with this process is defined to be the number of successes
in the n trials of the process. If X is a binomial random variable with parameters n and
π, we write X ∼ Binom(n, π).
Example 3.2.3
The following are all natural examples of binomial random variables.
1. A fair coin is tossed n = 10 times with the probability of a HEAD (success) being
π = .5. X is the number of heads.
2. A basketball player shoots n = 25 freethrows with the probability of making each
freethrow being π = .70. Y is the number of made freethrows.
3. A quality control inspector tests the next n = 12 widgets off the assembly line
each of which has a probability of 0.10 of being defective. Z is the number of
defective widgets.
4. Ten Calvin students are randomly sampled with replacement. W is the number
of males in the sample.
The probability mass function for a binomial distribution is given in the following theorem.
Theorem 3.2.5 (The Binomial Distribution). Suppose that X is a binomial random variable with parameters n and π. The pmf of X is given by
$$f_X(x; n, \pi) = \binom{n}{x}\,\pi^x (1-\pi)^{n-x} = \frac{n!}{x!\,(n-x)!}\,\pi^x (1-\pi)^{n-x}$$
Note the use of the semicolon in the definition of fX in the theorem. We will use a
semicolon to separate the possible values of the random variable (x) from the parameters (n,
π). For any particular binomial experiment, n and π are fixed. If n and π are understood,
we might write fX (x) for fX (x; n, π).
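Even though we will usually rely on R's built-in dbinom() below, it is worth checking the formula directly once. Here is a minimal sketch using R's choose() function; the particular values n = 10 and π = 0.3 are arbitrary.

> n=10; p=0.3; x=c(0:10)
> f=choose(n,x)*p^x*(1-p)^(n-x)           # the pmf computed from the formula
> all.equal(f,dbinom(x,size=n,prob=p))    # TRUE: agrees with R's built-in pmf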
For all but very small n, computing f by hand is tedious. We will use R to do this.
Besides computing the mass function, R can be used to compute the cumulative distribution
function FX which is the useful function defined in the next definition.
Definition 3.2.6 (cumulative distribution function). If X is any random variable, the
cumulative distribution function of X (cdf) is the function FX given by
$$F_X(x) = P(X \le x) = \sum_{y \le x} f_X(y)$$
We will usually use the convention that the pmf of X is named by a lower-case letter
(usually fX ) and the cdf by the corresponding upper-case letter (usually FX ). The R
functions to compute the cdf, pdf, and also to simulate binomial processes are as follows if
X ∼ Binom(n, π).
function (& parameters)
explanation
rbinom(n,size,prob)
makes n random draws of the random variable
X and returns them in a vector.
dbinom(x,n,size,prob)
returns P(X = x) (the pmf).
pbinom(q,n,size,prob)
returns P(X ≤ q) (the cdf).
Suppose that a manufacturing process produces defective parts with probability π = .1.
If we take a random sample of size 10 and count the number of defectives X, we might
assume that X ∼ Binom(10, 0.1). Some examples of R related to this situation are as
follows.
> defectives=rbinom(n=30, size=10,prob=0.1)
> defectives
[1] 2 0 2 0 0 0 0 2 0 1 1 1 0 0 2 2 3 1 1 2 1 1 0 2 0 1 1 0 1 1
> table(defectives)
defectives
0 1 2 3
11 11 7 1
> dbinom(c(0:4),size=10,prob=0.1)
[1] 0.34867844 0.38742049 0.19371024 0.05739563 0.01116026
> dbinom(c(0:4),size=10,prob=0.1)*30    # pretty close to table
[1] 10.4603532 11.6226147  5.8113073  1.7218688  0.3348078
> pbinom(c(0:5),size=10,prob=0.1)       # same as cumsum(dbinom(...))
[1] 0.3486784 0.7360989 0.9298092 0.9872048 0.9983651 0.9998531
>
It is important to note that
• R uses size for the number of trials (what we have called n) and n for the number
of random draws.
• pbinom() gives the cdf not the pdf. Reasons for this naming convention will become
clearer later.
• There are similar functions in R for many of the distributions we will encounter, and
they all follow a similar naming scheme. We simply replace binom with the R-name
for a different distribution.
3.2.3 The Hypergeometric Distribution
The hypergeometric distribution arises from considering the situation of random sampling
from a population in which there are just two types of individuals. (That is there is a
categorical variable defined on the population with just two levels.) It is traditional to
describe the distribution in terms of the urn model. Suppose that we have an urn with two
different colors of balls. There are m white balls and n black balls. Suppose we choose k
balls from the urn in such a way that every set of k balls is equally likely to be chosen (i.e.,
a random sample of balls) and count the number X of white balls. We say that X has the
hypergeometric distribution with parameters m, n, and k and write X ∼ Hyper(m, n, k).
Example 3.2.4
Remember our class of 29 intrepid souls, 25 of whom are male. Let’s call the females
the white balls and the males the black balls. Recall that for some reason we wanted
a sample of size 5. Let X be the number of females in our sample. Then X ∼
Hyper(4, 25, 5).
There is a simple formula for the pmf of the hypergeometric distribution. This formula
comes from careful counting of the equally likely outcomes.
$$f_X(x) = \frac{\binom{m}{x}\binom{n}{k-x}}{\binom{m+n}{k}}.$$
R knows the hypergeometric distribution and the syntax is exactly the same as for the
binomial distribution (except that the names of the parameters have changed).
function (& parameters)
explanation
rhyper(nn,m,n,k)
makes nn random draws of the random variable X and returns them in a vector.
dhyper(x,m,n,k)
returns P(X = x) (the pmf).
phyper(q,m,n,k)
returns P(X ≤ q) (the cdf).
Some interesting computations related to Example 3.2.4 are below.
> dhyper(x=c(0:5),m=4,n=25,k=5)
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
> dhyper(x=c(0:5),k=5,m=4,n=25)    # order of named arguments does not matter
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
> phyper(q=c(0:5),m=4,n=25,k=5)
[1] 0.4473917 0.8734790 0.9896846 0.9997895 1.0000000 1.0000000
> rhyper(nn=30,m=4,n=25,k=5)       # note nn for number of random outcomes
[1] 2 1 1 1 1 2 2 2 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 1 2 0 0 0
> dhyper(0:5,4,25,5)               # default order of unnamed arguments
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000
>
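As a check on the formula for the hypergeometric pmf, we can compare it directly with dhyper for the situation of Example 3.2.4. A quick sketch:

> x=c(0:5)
> choose(4,x)*choose(25,5-x)/choose(29,5)   # pmf from the formula with m=4, n=25, k=5
> dhyper(x,m=4,n=25,k=5)                    # should give the same numbers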
3.3 Continuous Random Variables
Recall that a continuous random variable X is one that can take on all values in an
interval of real numbers. For example, the height of a randomly chosen Calvin student in
inches could be any real number between, say, 36 and 80. Of course all continuous random
variables are idealizations. If we measure heights to the nearest quarter inch, there are only
finitely many possibilities for this random variable and we could, in principle, treat it as
discrete. We know from calculus however that treating measurements as continuous valued
functions often simplifies rather than complicates our techniques. In order to understand
what kinds of probability statements we would like to make about continuous random
variables, however, it is helpful to keep in mind this idea of the finite precision of our
measurements. For example, a statement that a randomly chosen individual is 72 inches tall is
Figure 3.2: Discretized pmf for T.
not a claim that the individual is exactly 72 inches tall but rather a claim that the height
of the individual is in some small interval (maybe 71 3/4 to 72 1/4 if we are measuring to the
nearest half inch). So probabilities of the form P (X = x) are not meaningful. Rather the
appropriate probability statements will be of the form P (a ≤ X ≤ b).
3.3.1 pdfs and cdfs
Recall the analogy of probability and mass. In the case of discrete random variables, we
represented the probability P(X = x) by a point of mass P(X = x) at the point x and had
total mass 1. In this case mass is continuous and the appropriate weighting of mass is a
density function. In the following example, we can see how this works.
Example 3.3.1
A Geiger counter emits a beep when a radioactive particle is detected. The rate of
beeping determines how radioactive the source is. Suppose that we record the time
T to the next beep. It turns out that T behaves like a random variable. Suppose
that we measured T with increasing precision. We might get histograms that look like
those in Figure 3.2 for the pmf of T . It’s pretty obvious that we want to replace these
histograms by a smooth curve. In fact the pictures should remind us of the pictures
drawn for the Riemann sums that define the integral.
The analogue to a probability mass function for a continuous variable is a probability
density function.
Definition 3.3.1 (probability density function, continuous random variable). A probability density function (pdf) is a function f such that
• f(x) ≥ 0 for all real numbers x, and
• $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
The continuous random variable X defined by the pdf f satisfies
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
for any real numbers a ≤ b.
The following simple lemma demonstrates one way in which continuous random variables
are very different from discrete random variables.
Lemma 3.3.2. Let X be a continuous random variable with pdf f . Then for any a ∈ R,
1. P(X = a) = 0,
2. P(X < a) = P(X ≤ a), and
3. P(X > a) = P(X ≥ a).
Proof. $P(X = a) = \int_a^a f(x)\,dx = 0$. And $P(X \le a) = P(X < a) + P(X = a) = P(X < a)$.
Example 3.3.2
Q. Consider the function $f(x) = 3x^2$ for $x \in [0, 1]$ and $f(x) = 0$ otherwise. Show that f is a pdf and calculate P(X ≤ 1/2).
A. Let's begin by looking at a plot of the pdf.
[Plot of the pdf f(x) = 3x^2 on [0, 1].]
The rectangular region of the plot has an area of 3, so it is plausible that the area
under the graph of the pdf is 1. We can verify this by integration.
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 3x^2\,dx = x^3\Big|_0^1 = 1,$$
so f is a pdf and $P(X \le 1/2) = \int_0^{1/2} 3x^2\,dx = x^3\Big|_0^{1/2} = 1/8$.
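We could also have checked both of these computations numerically with R's integrate() function; a quick sketch:

> f=function(x) 3*x^2
> integrate(f,0,1)     # total probability; should be 1
> integrate(f,0,0.5)   # P(X <= 1/2); should be 0.125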
The cdf of a continuous random variable is defined the same way as it was for a discrete
random variable, but we use an integral rather than a sum to get the cdf from the pdf in
this case.
Definition 3.3.3 (cumulative distribution function). Let X be a continuous random variable with pdf f , then the cumulative distribution function (cdf) for X is
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$
Example 3.3.3
Q. Determine the cdf of the random variable from Example 3.3.2.
A. For any x ∈ [0, 1],
$$F_X(x) = P(X \le x) = \int_0^x 3t^2\,dt = t^3\Big|_0^x = x^3.$$
So
$$F_X(x) = \begin{cases} 0 & x \in (-\infty, 0) \\ x^3 & x \in [0, 1] \\ 1 & x \in (1, \infty). \end{cases}$$
Notice that the cdf FX is an antiderivative of the pdf fX . This follows immediately from
the Fundamental Theorem of Calculus. Notice also that P(a ≤ X ≤ b) = F (b) − F (a).
Lemma 3.3.4. Let FX be the cdf of a continuous random variable X. Then the pdf fX
satisfies
$$f_X(x) = \frac{d}{dx} F_X(x).$$
Just as the binomial and hypergeometric distributions were important families of discrete
random variables, there are several important families of continuous random variables that
are often used as models of real-world situations. We investigate a few of these in the next
three subsections.
3.3.2 Uniform Distributions
The continuous uniform distribution has a pdf that is constant on some interval.
Definition 3.3.5 (uniform random variable). A continuous uniform random variable on
the interval [a, b] is the random variable with pdf given by
$$f(x; a, b) = \begin{cases} \frac{1}{b-a} & x \in [a, b] \\ 0 & \text{otherwise.} \end{cases}$$
It is easy to confirm that this function is indeed a pdf. We could integrate, or we could
simply use geometry. The region under the graph of the uniform pdf is a rectangle with
width $b - a$ and height $\frac{1}{b-a}$, so the area is 1.
Example 3.3.4
Q. Let X be uniform on [0, 10]. What is P(X > 7)? What is P(3 ≤ X < 7)?
A. Again we argue geometrically. P(X > 7) is represented by a rectangle with base
from 7 to 10 along the x-axis and a height of .1, so P(X > 7) = 3 · 0.1 = 0.3.
Similarly P(3 ≤ X < 7) = 0.4. In fact, for any interval of width w contained in
[0, 10], the probability that X falls in that particular interval is w/10.
We could also compute these results by integrating, but this would be silly.
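The same probabilities can also be read off directly from R's punif() function (introduced below); a quick check:

> punif(7,0,10,lower.tail=FALSE)    # P(X > 7)
[1] 0.3
> punif(7,0,10)-punif(3,0,10)       # P(3 <= X < 7)
[1] 0.4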
Example 3.3.5
Q. Let X be uniform on the interval [0, 1] (which we denote X ∼ Unif(0, 1)). What is
the cdf for X?
A. For $x \in [0, 1]$, $F_X(x) = \int_0^x 1\,dt = x$, so
$$F_X(x) = \begin{cases} 0 & x \in (-\infty, 0) \\ x & x \in [0, 1] \\ 1 & x \in (1, \infty). \end{cases}$$
[Plots of the pdf and cdf for Unif(0, 1).]
Although it has a very simple pdf and cdf, this random variable actually has several
important uses. One such use is related to random number generation. Computers
are not able to generate truly random numbers. Algorithms that attempt to simulate
randomness are called pseudo-random number generators. X ∼ Unif(0, 1) is a model
for an idealized random number generator. Computer scientists compare the behavior
of a pseudo-random number generator with the behavior that would be expected for
X to test the quality of the pseudo-random number generator.
There are R functions for computing the pdf and cdf of a uniform random variable as well
as a function to return random numbers. An additional function computes the quantiles
of the uniform distribution. If X ∼ Unif(min, max) the following functions can be used.
function (& parameters)
explanation
runif(n,min,max)
makes n random draws of the random variable
X and returns them in a vector.
dunif(x,min,max
returns fX (x), (the pdf).
punif(q,min,max)
returns P(X ≤ q) (the cdf).
qunif(p,min,max)
returns x such that P(X ≤ x) = p.
Here are examples of computations for X ∼ Unif(0, 10).
> runif(6,0,10)      # 6 random values on [0,10]
[1] 5.449745 4.124461 3.029500 5.384229 7.771744 8.571396
> dunif(5,0,10)      # pdf is 1/10
[1] 0.1
> punif(5,0,10)      # half the distribution is below 5
[1] 0.5
> qunif(.25,0,10)    # 1/4 of the distribution is below 2.5
[1] 2.5
3.3.3 Exponential Distributions
In Example 3.3.1 we considered a “waiting time” random variable, namely the waiting
time until the next radioactive event. Waiting times are important random variables in
reliability studies. For example, a common characteristic of a manufactured object is MTTF,
or mean time to failure. The model often used for the Geiger counter random variable is the
exponential distribution. Note that a waiting time can be any x in the range 0 ≤ x < ∞.
Definition 3.3.6 (The exponential distribution). The random variable X has the exponential distribution with parameter λ > 0 (X ∼ Exp(λ)) if X has the pdf
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0. \end{cases}$$
It is easy to see that the function fX of the previous definition is a pdf for any value of λ.
R refers to the value of λ as the rate so the appropriate functions in R are rexp(n,rate),
dexp(x,rate), pexp(x,rate), and qexp(p,rate). We will see later that rate is an apt
name for λ as λ will be the rate per unit time if X is a waiting time random variable.
Example 3.3.6
Suppose that a random variable T measures the time until the next radioactive event
is recorded at a Geiger counter (time measured since the last event). For a particular
radioactive material, a plausible model for T is T ∼ Exp(0.1), where time is measured
in seconds. Then the following R session computes some important values related to T .
> pexp(q=0.1,rate=.1)    # probability waiting time less than .1
[1] 0.009950166
> pexp(q=1,rate=.1)      # probability waiting time less than 1
[1] 0.09516258
> pexp(q=10,rate=.1)
[1] 0.6321206
> pexp(q=20,rate=.1)
[1] 0.8646647
> pexp(100,rate=.1)
[1] 0.9999546
> pexp(30,rate=.1)-pexp(5,rate=.1)   # probability waiting time between 5 and 30
[1] 0.5567436
> qexp(p=.5,rate=.1)                 # probability is .5 that T is less than 6.93
[1] 6.931472
The graphs in Figure 3.3 are graphs of the pdf and cdf of this random variable. All
exponential distributions look the same except for the scale. The rate of 0.1 here means
that we can expect that in the long run this process will average 0.1 counts per second.
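A quick simulation sketch makes the meaning of the rate concrete; with rate 0.1 the average waiting time should be about 1/0.1 = 10 seconds.

> times=rexp(100000,rate=0.1)   # 100,000 simulated waiting times
> mean(times)                   # should be close to 10 seconds
> 1/mean(times)                 # long-run rate; should be close to 0.1 per second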
3.3.4 Weibull Distributions
A very important generalization of the exponential distributions is the family of Weibull distributions. They are often used by engineers to model phenomena such as failure, manufacturing
or delivery times. They have also been used for applications as diverse as fading in wireless
Figure 3.3: The pdf and cdf of the random variable T ∼ Exp(0.1).
communications channels and wind velocity. The Weibull is a two-parameter family of
distributions. The two parameters are a shape parameter α and a scale parameter β.
Definition 3.3.7 (The Weibull distributions). The random variable X has a Weibull
distribution with shape parameter α > 0 and scale parameter β > 0 (X ∼ Weib(α, β)) if
the pdf of X is
$$f_X(x; \alpha, \beta) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\, x^{\alpha-1} e^{-(x/\beta)^{\alpha}} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
Notice that if X ∼ Weib(1, β) then X ∼ Exp(1/β). Varying α in the Weibull distribution changes the shape of the distribution while changing β changes the scale. The
effect of fixing β (β = 5) and changing α (α = 1, 2, 3) is illustrated by the first graph
in Figure 3.4 while the second graph shows the effect of changing β (β = 1, 3, 5) with α
fixed at α = 2. The appropriate R functions to compute with the Weibull distribution are
dweibull(x,shape,scale), pweibull(q,shape,scale), etc.
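The relationship between the Weibull and exponential distributions noted above can be checked numerically; a small sketch (the particular scale value 10 is arbitrary):

> x=seq(0,50,by=0.5)
> all.equal(dweibull(x,shape=1,scale=10),dexp(x,rate=1/10))   # TRUE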
Example 3.3.7
The Weibull distribution is sometimes used to model the maximum wind velocity
measured during a 24 hour period at a specific location. The dataset http://www.
calvin.edu/~stob/data/wind.csv gives the maximum wind velocity at the San Diego
airport on each of 6,209 consecutive days. It is claimed that the maximum wind velocity
measured on a day behaves like a random variable W that has a Weibull distribution
Figure 3.4: Left: fixed β. Right: fixed α.
with α = 3.46 and β = 16.90. The R code below investigates that model using this past
data. (In fact, this model is not a very good one although the output below suggests
that it might be plausible.)
> w$Wind
[1] 14 11 10 13 11 11 26 21 14 13 10 10 13 10 13 13 12 12 13 17 11 11 13 25 15
[26] 18 13 17 12 14 15 10 16 17 17 13 18 14 12 20 11 14 20 16 12 14 18 17 13 16
[51] 13 16 11 13 11 15 13 15 16 18 14 15 15 14 14 16 15 18 14 16 14 10 17 14 12
.............
> cutpts=c(0,5,10,15,20,25,30)
> table(cut(w$Wind,cutpts))
  (0,5]  (5,10] (10,15] (15,20] (20,25] (25,30]
      2     434    3303    1910     409      95
> length(w$Wind[w$Wind<12.5])/6209   # 27.3% days with max windspeed less than 12.5
[1] 0.2728298
> pweibull(12.5,3.46,16.9)           # 29.7% predicted by Weibull model
[1] 0.2968784
> length(w$Wind[w$Wind<22.5])/6209
[1] 0.951361
> pweibull(22.5,3.46,16.9)
[1] 0.9322498
> simulation=rweibull(100000,3.46,16.9)   # 100,000 simulated days
> mean(simulation)     # simulated days have mean about the same as actual
[1] 15.18883
> mean(w$Wind)
[1] 15.32405
> sd(simulation)       # simulated days have greater variation
[1] 4.85144
> sd(w$Wind)
[1] 4.239603
>
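Another way to compare the model with the data is to turn the model into predicted counts for the same bins used above; a quick sketch reusing the cut points from the session:

> cutpts=c(0,5,10,15,20,25,30)
> round(diff(pweibull(cutpts,3.46,16.9))*6209)   # counts the Weibull model predicts for each bin

These predicted counts can then be compared bin by bin with the observed table above.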
3.4 Mean and Variance of a Random Variable
Just as numerical summaries of a data set can help us understand our data, numerical
summaries of the distribution of a random variable can help us understand the behavior
of that random variable. In this section we develop two of the most important numerical
summaries of random variables: mean and variance. In each case, we will use our experience
with data to help us develop a definition.
3.4.1 The Mean of a Discrete Random Variable
Example 3.4.1
Q. Let’s begin with a motivating example. Suppose a student has taken 10 courses
and received 5 A’s, 4 B’s and 1 C. Using the traditional numerical scale where an A is
worth 4, a B is worth 3 and a C is worth 2, what is this student’s GPA (grade point
average)?
A. The first thing to notice is that $(4 + 3 + 2)/3 = 3$ is not correct. We cannot simply add up
the values and divide by the number of values. Clearly this student should have a GPA
that is higher than 3.0, since there were more A's than C's.
Consider now a correct way to do this calculation and some algebraic reformulations
of it.
$$\text{GPA} = \frac{4+4+4+4+4+3+3+3+3+2}{10} = \frac{5\cdot 4 + 4\cdot 3 + 1\cdot 2}{10} = \frac{5}{10}\cdot 4 + \frac{4}{10}\cdot 3 + \frac{1}{10}\cdot 2 = 4\cdot\frac{5}{10} + 3\cdot\frac{4}{10} + 2\cdot\frac{1}{10} = 3.4$$
Our definition of the mean of a random variable follows the example above. Notice that
we can think of the GPA as a sum of terms of the form
(grade)(proportion of students getting that grade) .
Since the limiting proportion of outcomes that have a particular value is the probability of
that value, we are led to the following definition.
Definition 3.4.1 (mean). Let X be a discrete random variable with pmf f . The mean
(also called expected value) of X is denoted as µX or E(X) and defined by
$$\mu_X = E(X) = \sum_x x \cdot f(x).$$
The sum is taken over all possible values of X.
Example 3.4.2
Q. If we flip four fair coins and let X count the number of heads, what is E(X)?
A. If we flip four fair coins and let X count the number of heads, then the distribution
of X is described by the following table. (Note that X ∼ Binom(4, .5).)
value of X       0      1      2      3      4
probability   1/16   4/16   6/16   4/16   1/16

So the expected value is
$$0\cdot\frac{1}{16} + 1\cdot\frac{4}{16} + 2\cdot\frac{6}{16} + 3\cdot\frac{4}{16} + 4\cdot\frac{1}{16} = 2.$$
On average we get 2 heads in 4 tosses. This is certainly in keeping with our informal
understanding of the word average.
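The same expected value can be computed in R directly from the definition; a quick sketch:

> x=c(0:4)
> sum(x*dbinom(x,size=4,prob=0.5))   # E(X) from the definition
[1] 2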
More generally, the mean of a binomial random variable is found by the following Theorem.
Theorem 3.4.2. Let X ∼ Binom(n, π). Then E(X) = nπ.
Similarly, the mean of a hypergeometric random variable is just what we think it should
be.
Theorem 3.4.3. Let X ∼ Hyper(m, n, k). Then E(X) = km/(m + n).
The following example illustrates the computation of the mean for a hypergeometric
random variable.
> x=c(0:5)
> p=dhyper(x,m=4,n=25,k=5)
> sum(x*p)
[1] 0.6896552
> 4/29 * 5
[1] 0.6896552
3.4.2 The Mean of a Continuous Random Variable
If we think of probability as mass, then the expected value for a discrete random variable
X is the center of mass of a system of point masses where a mass fX (x) is placed at each
possible value of X. The expected value of a continuous random variable should also be
the center of mass where the pdf is now interpreted as density.
Definition 3.4.4 (mean). Let X be a continuous random variable with pdf f . The mean
of X is defined by
$$\mu_X = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx.$$
Example 3.4.3
Recall the pdf in Example 3.3.2: $f(x) = 3x^2$ for $x \in [0, 1]$ and $f(x) = 0$ otherwise. Then
$$E(X) = \int_0^1 x \cdot 3x^2\,dx = 3/4.$$
The value 3/4 seems plausible from the graph of f .
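A quick numerical check of this integral in R:

> integrate(function(x) x*3*x^2,0,1)   # should be 3/4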
We compute the mean of two of our favorite continuous random variables in the next
Theorem.
Theorem 3.4.5.
1. If X ∼ Unif(a, b) then E(X) = (a + b)/2.
2. If X ∼ Exp(λ) then E(X) = 1/λ.
Proof. The proof of each of these is a simple integral. These are left to the reader.
Our intuition tells us that in a large sequence of trials of the random process described
by X, the sample mean of the observations should usually be close to the mean of X. This
is in fact true and is known as the Law of Large Numbers. We will not state that law
precisely here but we will illustrate it using several simulations in R.
> r=rexp(100000,rate=1)
> mean(r)                           # should be 1
[1] 0.9959467
> r=runif(100000,min=0,max=10)
> mean(r)                           # should be 5
[1] 5.003549
> r=rbinom(100000,size=100,p=.1)
> mean(r)                           # should be 10
[1] 9.99755
> r=rhyper(100000,m=10,n=20,k=6)
> mean(r)                           # should be 2
[1] 1.99868
3.4.3 Transformations of Random Variables
After collecting data, we often transform it. That is we apply some function to all the
data. For example, we saw the value of using a logarithmic transformation to linearize
some bivariate relationships. Now consider the notion of transforming a random variable.
Definition 3.4.6 (transformation). Suppose that t is a function defined on all the possible
values of the random variable X. Then the random variable t(X) is the random variable
that has outcome t(x) whenever x is the outcome of X.
If the random variable Y is defined by Y = t(X), then Y itself has an expected value. To
find the expected value of Y , we would need to find the pmf or pdf of Y , fY (y), and then
use the definition of E(Y ) to compute E(Y ). There is an easier way to compute E(t(X))
however which is given in the following lemma.
Lemma 3.4.7. If X is a random variable (discrete or continuous) and t a function defined
on the values of X, then if Y = t(X) and X has pdf (pmf) fX
$$E(Y) = \begin{cases} \sum_x t(x) f_X(x) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} t(x) f_X(x)\,dx & \text{if } X \text{ is continuous.} \end{cases}$$
We will not give the proof but it is easy to see that this lemma should be so (at least for
the discrete case) by looking at an example.
Example 3.4.4
Let X be the result of tossing a fair die. X has possible outcomes 1, 2, 3, 4, 5, 6. Let
Y be the random variable |X − 2|. Then the lemma gives
$$E(Y) = \sum_{x=1}^{6} |x - 2| \cdot \frac{1}{6} = 1\cdot\frac{1}{6} + 0\cdot\frac{1}{6} + 1\cdot\frac{1}{6} + 2\cdot\frac{1}{6} + 3\cdot\frac{1}{6} + 4\cdot\frac{1}{6} = \frac{11}{6}.$$
But we can also compute E(Y) directly from the definition. Noting that the possible
values of Y are 0, 1, 2, 3, 4, we have
$$E(Y) = \sum_{y=0}^{4} y f_Y(y) = 0\cdot\frac{1}{6} + 1\cdot\frac{2}{6} + 2\cdot\frac{1}{6} + 3\cdot\frac{1}{6} + 4\cdot\frac{1}{6} = \frac{11}{6}.$$
The sum that computes E(Y) directly is clearly the same sum as the one given by the lemma, just in a “different order” and with some terms combined, since more than one value of x can produce a given value of Y.
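Both computations are easy to reproduce in R; a quick sketch of the first one:

> x=c(1:6)
> sum(abs(x-2)*(1/6))   # E(|X-2|) by the lemma; equals 11/6
[1] 1.833333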
Example 3.4.5
Suppose that X ∼ Unif(0, 1) and that $Y = X^2$. Then
$$E(Y) = \int_0^1 x^2 \cdot 1\,dx = 1/3.$$
This is consistent with the following simulation.
> x=runif(1000,0,1)
> y=x^2
> mean(y)
[1] 0.326449
While it is not necessarily the case that E(t(X)) = t(E(X)) (see problem 3.23), the next
proposition shows that the expectation function is a “linear operator.”
Lemma 3.4.8. If a and b are real numbers, then E(aX + b) = a E(X) + b.
3.4.4 The Variance of a Random Variable
We are now in a position to define the variance of a random variable. Recall that the
variance of a set of n data points x1, . . . , xn is almost the average of the squared deviations
from the sample mean:
$$\mathrm{Var}(x) = \sum_{i=1}^{n} (x_i - \bar{x})^2/(n-1)$$
The natural analogue for random variables is the following.
Definition 3.4.9 (variance, standard deviation of a random variable). Let X be a random
variable. The variance of X is defined by
$$\sigma_X^2 = \mathrm{Var}(X) = E((X - \mu_X)^2).$$
The standard deviation is the square root of the variance and is denoted σX .
The following lemma records the variance of several of our favorite random variables.
Lemma 3.4.10.
1. If X ∼ Binom(n, π) then Var(X) = nπ(1 − π).
2. If X ∼ Hyper(m, n, k) then $\mathrm{Var}(X) = k\,\dfrac{m}{m+n}\,\dfrac{n}{m+n}\,\dfrac{m+n-k}{m+n-1}$.
3. If X ∼ Unif(a, b) then Var(X) = (b − a)²/12.
4. If X ∼ Exp(λ) then Var(X) = 1/λ².
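Any of these formulas can be checked numerically from the definition of variance. Here is a minimal sketch for X ∼ Binom(10, 0.1); the parameter values are arbitrary.

> x=c(0:10)
> f=dbinom(x,size=10,prob=0.1)
> mu=sum(x*f)
> sum((x-mu)^2*f)    # Var(X) from the definition; should equal 10*0.1*0.9 = 0.9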
3.5 The Normal Distribution
The most important distribution in statistics is called the normal distribution.
Definition 3.5.1 (normal distribution). A random variable X has the normal distribution
with parameters µ and σ if X has pdf
$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}, \qquad -\infty < x < \infty.$$
We write X ∼ Norm(µ, σ) in this case.
Figure 3.5: The pdf of a standard normal random variable.
The mean and variance of a normal distribution are µ and σ 2 so that the parameters are
aptly, rather than confusingly, named. R functions dnorm(x,mean,sd), pnorm(q,mean,sd),
rnorm(n,mean,sd), and qnorm(p,mean,sd) compute the relevant values.
If µ = 0 and σ = 1 we say that X has a standard normal distribution. Figure 3.5
provides a graph of the density of the standard normal distribution. Notice the following
important characteristics of this distribution: it is unimodal, symmetric, and can take on
all possible real values both positive and negative. The curve in Figure 3.5 suffices to
understand all of the normal distributions due to the following lemma.
Lemma 3.5.2. If X ∼ Norm(µ, σ) then the random variable Z = (X − µ)/σ has the
standard normal distribution.
Proof. To see this, we show that P(a ≤ Z ≤ b) is computed by the integral of the standard
normal density function.
$$P(a \le Z \le b) = P\!\left(a \le \frac{X-\mu}{\sigma} \le b\right) = P(\mu + a\sigma \le X \le \mu + b\sigma) = \int_{\mu+a\sigma}^{\mu+b\sigma} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}\,dx.$$
Now in the integral, make the substitution u = (x − µ)/σ. We have then that
$$\int_{\mu+a\sigma}^{\mu+b\sigma} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2}\,dx = \int_a^b \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\,du.$$
But the latter integral is precisely the integral that computes P(a ≤ U ≤ b) if U is a
standard normal random variable.
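In R, the lemma means that a probability for any normal distribution can be computed either directly or after standardizing; a quick sketch with arbitrary values µ = 50 and σ = 10:

> mu=50; sigma=10
> pnorm(mu+2*sigma,mean=mu,sd=sigma)   # P(X <= mu + 2*sigma)
> pnorm(2)                             # same probability for the standard normal (about 0.977)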
The normal distribution is used so often that it is helpful to commit to memory certain
important probability benchmarks associated with it.
The 68–95–99.7 Rule
If Z has a standard normal distribution, then
1. P(−1 ≤ Z ≤ 1) ≈ 68%
2. P(−2 ≤ Z ≤ 2) ≈ 95%
3. P(−3 ≤ Z ≤ 3) ≈ 99.7%.
If the distribution of X is normal (but not necessarily standard normal), then these
approximations have natural interpretations using Lemma 3.5.2. For example, we can say
that the probability that X is within one standard deviation of the mean is about 68%.
Example 3.5.1
In 2000, the average height of a 19-year old United States male was 69.6 inches.
The standard deviation of the population of males was 5.8 inches. The distribution of
heights of this population is well-modeled by a normal distribution. Then the percentage of males within 5.8 inches of 69.6 inches was approximately 68%. In R,
> pnorm(69.6+5.8,69.6,5.8)-pnorm(69.6-5.8,69.6,5.8)
[1] 0.6826895
It turns out that the normal distribution is a good model for many variables. Whenever
a variable has a unimodal, symmetric distribution in some population, we tend to think of
the normal distribution as a possible model for that variable. For example, suppose that we
take repeated measures of a difficult to measure quantity such as the charge of an electron.
It might be reasonable to assume that our measurements center on the true value of the
quantity but have some spread around that true value. And it might also be reasonable to
assume that the spread is symmetric around the true value with measurements closer to
the true value being more likely to occur than measurements that are further away from
the true value. Then a normal random variable is a candidate (and often used) model for
this situation.
The most important use of the normal distribution stems from the way that it arises in
the analysis of repeated trials of a random experiment. This is a result of what might be
called the Fundamental Theorem of Statistics — The Central Limit Theorem. Before we
state the theorem, we give two examples illustrating the principles.
Example 3.5.2
Suppose that X is the result of tossing a single die and recording the number. Now
suppose that we wish to toss the die 100 times and record the results, x1 , . . . , x100 that
obtain. These data can be viewed as the result of performing 100 random processes
represented by random variables X1 , . . . , X100 which all have the same distribution
and are independent one from another. Consider now the sum y = x1 + · · · + x100
of the 100 tosses. (We’d expect this number to be in the ballpark of 350, wouldn’t
we?) We can consider this number y to be the result of a random variable, namely
Y = X1 + · · · + X100 . Y itself has a distribution and in theory we could write the pmf
for Y . (Y is discrete with possible values 100, 101, . . . , 599, 600.) A simulation suggests
what happens.
> trials10000=replicate(10000,sum(sample(c(1:6),100,replace=T)))
> summary(trials10000)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  286.0   339.0   350.0   350.2   362.0   414.0
> histogram(trials10000,xlab="Sum of 100 dice rolls")
Note that the histogram in Figure 3.6 suggests that Y has a distribution that is unimodal and symmetric.
Example 3.5.3
The random variable in the previous example was discrete. Suppose instead that X
is a continuous random variable. For example, suppose that X ∼ Exp(1). X might be
a waiting time random variable that measures the time until the next radioactive event
detected at a Geiger counter. Suppose that X1 , . . . , Xn are n independent trials of the
random process X. This would be a natural model for the experiment in which we wait
Figure 3.6: 10,000 trials of the sum of 100 dice.
Figure 3.7: Sums of independent exponential random variables.
for not just one radioactive event but for n in succession. In this case Y = X1 +· · ·+Xn
is just the time until n events have happened. The histograms in Figure 3.7 show what
might happen if n = 5, n = 10, and n = 20. One can see that as the number of trials
of the experiment increases, the distribution of the sum becomes more symmetric.
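Simulations like those behind Figure 3.7 are easy to run. Here is a minimal sketch for n = 5 and n = 20; it assumes the same histogram() plotting function (from the lattice package) used in Example 3.5.2.

> sums5=replicate(10000,sum(rexp(5,rate=1)))
> sums20=replicate(10000,sum(rexp(20,rate=1)))
> mean(sums5)       # should be near 5
> mean(sums20)      # should be near 20
> histogram(sums20) # roughly symmetric and bell-shaped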
To describe this situation in general, we note that the situation we are imagining is
that we have n random variables X1 , . . . , Xn that have the same distribution and that are
independent one from another (i.e., the dice don’t talk to each other). We will call such
random variables i.i.d. (for independent and identically distributed). Random variables
that arise from repeating a random process and observing the same random variable are
the canonical example of i.i.d. random variables. In this situation, we sometimes refer
to the original random variable that describes the distribution in question as being the
population random variable. If X1 , . . . , Xn are i.i.d. random variables, X1 , . . . , Xn are
usually called a random sample. Note that this is the same term that we used to
describe a sample from a population. These meanings are related but different. Before we
state the Central Limit Theorem, we consider the properties of Y = X1 + · · · + Xn in terms
of those of X. Specifically we have
Lemma 3.5.3. Suppose that X1 , . . . , Xn are random variables and that Y = X1 +· · ·+Xn .
Then
1. $E(Y) = \sum_{i=1}^{n} E(X_i)$, and
2. if in addition the $X_i$ are independent, then $\mathrm{Var}(Y) = \sum_{i=1}^{n} \mathrm{Var}(X_i)$, and
3. if in addition the $X_i$ have normal distributions and are independent, then Y has a normal distribution.
In particular, if the Xi are i.i.d. with mean µ and variance σ 2 , then µY = nµ and Var(Y ) =
nσ 2 .
The lemma says that the sum of random variables that have normal distributions and
are independent is normal. The Central Limit Theorem says that even if the Xi are not
normal, if n is large the sum of the Xi is approximately normal.
Theorem 3.5.4 (Central Limit Theorem). Suppose that X1 , . . . , Xn are i.i.d. random
variables with common mean µ and variance σ 2 . Then as n gets large the random variable
Yn = X1 + · · · + Xn
has a distribution that approaches the normal distribution.
Given i.i.d. random variables X1 , . . . , Xn , we will often be interested in the mean of the
values of the random variables rather than the sum.
Definition 3.5.5 (sample mean). Given i.i.d. random variables X1 , . . . , Xn (i.e., a random
sample), the sample mean is the random variable $\overline{X}_n$ defined by
$$\overline{X}_n = (X_1 + \cdots + X_n)/n.$$
Corollary 3.5.6. Suppose that X1 , . . . , Xn are i.i.d. random variables with common mean
µ and variance σ². Then as n gets large the sample mean $\overline{X}_n$ has a distribution that is
approximately normal with mean µ and variance σ 2 /n.
Returning to Example 3.5.3, we have that if we find the sum Y of 10 independent exponential
random variables with λ = 1, the mean and variance of Y are each 10. (Recall that the
mean and variance of X ∼ Exp(λ) are 1/λ and 1/λ2 respectively.) Corollary 3.5.6 is
especially important since we are often interested in the mean of data values x1 , . . . , xn
that can be modeled as resulting from repeating a random process n times.
Example 3.5.4
In Example 2.6.1 we considered simple random samples of size 5 from a population
of 134 MIAA basketball players. We observed the points per game of each of the 5
players in our sample. We could consider the sample of size 5 that we generated as
the result of 5 random variables X1 , . . . , X5 . Now this sample does not fit exactly the
framework of Corollary 3.5.6. Namely, these random variables are not independent.
Once we choose the first player at random (X1 , a discrete random variable with 134
possibilities) the distribution of points per game of X2 changes. This is because we
generally sample without replacement. We can rectify this in two ways. First, we may
sample with replacement. That guarantees that the five random variables are i.i.d.
Else, we can sample without replacement but believe that the random variables Xi are
close enough to being independent so as not to affect the result too much. This is
especially true if the sample size (5) is much smaller than the population size (134).
3.6 Exercises
3.1 Suppose that four coins (a penny, nickel, dime and quarter) are tossed and the face-up
side of each is observed as heads or tails.
a) How many equally likely outcomes are there? List them.
b) In how many of these outcomes is exactly one head showing?
c) What is the probability that exactly one head is showing?
3.2 Suppose that ten coins are tossed.
a) How many equally likely outcomes are there? Do not list them!
b) In how many of these outcomes is exactly one head showing?
c) What is the probability that exactly one head is showing?
3.3 Use R to simulate the rolling of a fair six-sided die. (e.g., sample(c(1:6),1) will do
the trick). Roll the die 600 times.
a) How many of each of the numbers 1 through 6 did you “expect” to occur?
b) How many of each of the numbers 1 through 6 actually occurred?
c) Are you surprised by the discrepancy between your answers to (a) and (b)? Why or
why not?
3.4 Suppose that a small class of 10 students has 4 male students and 6 female students.
A random sample of two students is chosen from this class. What is the probability that
both of the students are male? (Hint: first find the number of equally likely outcomes.)
3.5 Toss a coin 1,000 times (a simulated coin, not a real one!).
a) What is the number of heads in the 1,000 tosses? (You can do this very easily if you
code heads as 1 and tails as 0.)
b) Now repeat this procedure 10,000 times (that is toss 1,000 coins 10,000 times). You
now have 10,000 different answers to part (a). Don’t write them all down but describe
the distribution of these 10,000 numbers using the terminology and techniques for
describing distributions.
3.6 Let E C be the event “E doesn’t happen.” For example, if we toss one die and E is
the event that the die comes up 1 or 2, then E C is the event that the die doesn’t come up
1 or 2 (and so E C is the event that the die comes up 3, 4, 5, or 6). Show from the axioms
of probability that P (E C ) = 1 − P (E).
3.7 Suppose that you roll 5 standard dice. Determine the probability that all the dice are
the same.
3.8 Suppose that you deal 5 cards from a standard deck of cards. Determine the probability
that all the cards are of the same color. (A standard deck of cards has 52 cards in two
colors. There are 26 red and 26 black cards.)
3.9 Acceptance sampling is a procedure that tests some of the items in a lot and decides
to accept or reject the entire lot based on the results of testing the sample. Suppose that
the test determines whether an item is “acceptable” or “defective”. Suppose that in a lot
of 100 items, 4 are tested and that the lot is rejected if one or more of those four are found
to be defective.
a) If 10% of the lot of 100 are defective, what is the probability that the purchaser will
reject the shipment?
b) If 20% of the lot of 100 are defective, what is the probability that the purchaser will
reject the shipment?
3.10 Suppose that there are 10,000 voters in a certain community. A random sample
of 100 of the voters is chosen and are asked whether they are for or against a new bond
proposal.
a) If only 4,500 of the voters are for the bond proposal, what is the probability that
fewer than half of the sampled voters are in favor of the bond proposal?
b) Suppose instead that the sample consists of 2,000 voters. Answer the same question
as in the previous part.
3.11 If the population is very large relative to the size of the sample, it seems like sampling
with replacement should yield very similar results to that of sampling without replacement.
Suppose that an urn contains 10,000 balls, 3,000 of which are white.
a) If 100 of these balls are chosen at random with replacement, what is the probability
that at most 25 of these are white?
b) If 100 of these balls are chosen at random without replacement, what is the probability that at most 25 of these are white?
3.12 A random variable X has the triangular distribution if it has pdf $f_X(x) = 2x$ for $x \in [0, 1]$ and $f_X(x) = 0$ otherwise.
a) Show that fX is indeed a pdf.
b) Compute P(0 ≤ X ≤ 1/2).
c) Find the number m such that P(0 ≤ X ≤ m) = 1/2. (It is natural to call m the
median of the distribution.)
3.13 Let $f(x) = k(x-2)(x+2)$ for $-2 \le x \le 2$ and $f(x) = 0$ otherwise.
a) Determine the value of k that makes f a pdf. Let X be the corresponding random
variable.
b) Calculate P(X ≥ 0).
c) Calculate P(X ≥ 1).
d) Calculate P(−1 ≤ X ≤ 1).
3.14 Describe a random variable that is neither continuous nor discrete. Does your random
variable have a pmf? a pdf? a cdf?
3.15 Show that if f and g are pdfs and α ∈ [0, 1], then αf + (1 − α)g is also a pdf.
3.16 Suppose that a number of measurements that are made to 3 decimal digits accuracy
are each rounded to the nearest whole number. A good model for the “rounding error”
introduced by this process is that X ∼ Unif(−.5, .5) where X is the difference between the
true value of the measurement and the rounded value.
a) Explain why this uniform distribution might be a good model for X.
b) What is the probability that the rounding error has absolute value smaller than .1?
3.17 If X ∼ Exp(λ), find the median of X. That is find the number m such that P(X ≤
m) = 1/2.
3.18 A part in the shuttle has a lifetime that can be modeled by the exponential distribution with parameter λ = 0.01, where the units are hours. The shuttle mission is scheduled
for 200 hours.
a) What is the probability that the part fails on the mission?
b) The event that is described in part (a) is BAD. So the shuttle carries two replacements
for the part (a total of three altogether). What is the probability that the mission
ends without all three failing?
3.19 The lifetime of a certain brand of water heaters in years can be modeled by a Weibull
distribution with α = 2 and β = 25.
a) What is the probability that the water heater fails within its warranty period of 10
years?
b) What is the probability that the water heater lasts longer than 30 years?
c) Using a simulation, estimate the average life of one of these water heaters.
3.20 Prove Theorem 3.4.5.
3.21 Suppose that you have an urn containing 100 balls, some unknown number of which
are red and the rest are black. You choose 10 balls without replacement and find that 4 of
them are red.
a) How many red balls do you think are in the urn? Give an argument using the idea
of expected value.
b) Suppose that there were only 20 red balls in the urn. How likely is it that a sample
of 10 balls would have at least 4 red balls?
3.22 The file http://www.calvin.edu/~stob/data/scores.csv contains a dataset that
records the time in seconds between scores in a basketball game played between Kalamazoo
College and Calvin College on February 7, 2003.
a) This waiting time data might be modeled by an exponential distribution. Make some
sort of graphical representation of the data and use it to explain why the exponential
distribution might be a good candidate for this data.
b) If we use the exponential distribution to model this data, which λ should we use? (A
good choice would be to make the sample mean equal to the expected value of the
random variable.)
c) Your model of part (b) makes a prediction about the proportion of times that the
next score will be within 10, 20, 30 and 40 seconds of the previous score. Test that
prediction against what actually happened in this game.
3.23 Show that it is not necessarily the case that E(t(X)) = t(E(X)).
3.24 Let X be the random variable that results from tossing a fair six-sided die and reading
the result (1–6). Since E(X) = 3.5, the following game seems fair. I will pay you $3.5^2$ and
then we will roll the die and you will pay me the square of the result. Is the game fair?
Why or why not?
3.25 In this problem we compare sampling with replacement to sampling without replacement. You will recall that the former is modeled by the binomial distribution and the
latter by the hypergeometric distribution. Consider the following setting. There are 4,224
students at Calvin and we would like to know what they think about abolishing the interim. We take a random sample of size 100 and ask the 100 students whether or not they
favor abolishing the interim. Suppose that 1,000 students favor abolishing the interim and
the other 3,224 misguidedly want to keep it.
a) Suppose that we sample these 100 students with replacement. What is the mean and
the variance of the random variable that counts the number of students in the sample
that favor abolishing the interim?
b) Now suppose that we sample these 100 students without replacement. What is the
mean and the variance of the random variable that counts the number of students in
the sample that favor abolishing the interim?
c) Comment on the similarities and differences between the two. Give an intuitive
reason for any difference.
3.26 Scores on IQ tests are scaled so that they have a normal distribution with mean 100
and standard deviation 15 (at least on the Stanford-Binet IQ Test).
a) MENSA, a society supposedly for persons of high intellect, requires a score of 130
on the Stanford-Binet IQ test for membership. What percentage of the population
qualifies for MENSA?
b) One psychology text labels those with IQs of between 80 and 115 as having “normal
intelligence.” What percentage of the population does this range contain?
c) The top 25% of scores on an IQ test are in what range?
d) If two different individuals are chosen at random, what is the probability that the
sum of their IQ scores is greater than 240?
3.27 In this problem we investigate the accuracy of the Central Limit Theorem by simulation. Suppose that X is a random variable that is exponential with parameter λ = 1/2.
Then $\mu_X = 2$ and $\sigma_X^2 = 4$. Suppose that we repeat the random experiment n times to get
independent random variables X1, . . . , Xn each of which is exponential with λ = 1/2. (The
R function rexp(n,rate=.5) will simulate this experiment.)
a) From Lemma 3.5.3, $\overline{X} = (X_1 + \cdots + X_5)/5$ has what mean and variance?
b) If n = 5, what does the Central Limit Theorem predict for $P(1.5 < \overline{X} < 2.5)$?
c) Simulate the distribution of $\overline{X}$ by taking many samples of size 5. Compute the
proportion of your samples for which $1.5 < \bar{x} < 2.5$ and compare to part (b).
d) Repeat parts (b) and (c) for n = 30.
4 Inference
4.1 Hypothesis Testing
Suppose that a real-world process is modeled by a binomial distribution for which we know
n but do not know π. Examples abound.
Example 4.1.1
1. We have said that a fair coin is equally likely to be heads or tails when tossed.
But now suppose we have a coin and toss it 100 times. How do we know it is fair?
That is, how do we know π = 0.5?
2. A factory produces the ubiquitous widget. It claims that the probability that
any widget is defective is less than 0.1%. We receive a shipment of widgets. We
wonder whether the claim about the defective rate is really true. If we test 100
widgets, this is an example of a binomial experiment with n = 100 and π unknown.
3. A National Football League team is trying to decide whether to replace its field
goal kicker with a new one. The current kicker makes about 30% of his kicks from
45 yards out. The team tests the new kicker by asking him to try 20 kicks from
45 yards out. This might be modeled by a binomial distribution with n = 20 and
π unknown. The team is hoping that π > .3.
4. A standard test for ESP works as follows. A card with one of five printed symbols
is selected without the person claiming to have ESP being able to see it. The
purported psychic is asked to name what symbol is on the card while the experimenter looks at it and “thinks” about it. A typical experiment consists of 25
trials. This is an example of a binomial experiment with n = 25 and unknown π.
The experimenter usually believes that π = .2.
In each of the instances of Example 4.1.1 we have a hypothesis about π that we could
be considered to be testing. In the four cases we could be considered to be testing the
hypotheses π = .5, π ≤ 0.001, π ≤ 0.3, and π = .2. A hypothesis proposes a possible state
of affairs with respect to a probability distribution governing an experiment that we are
about to perform. There are a variety of kinds of hypotheses that we might want to test.
1. A hypothesis stating a fixed value of a parameter: π = .5.
2. A hypothesis stating a range of values of a parameter: π ≤ .3.
3. A hypothesis about the nature of the distribution itself: X has a binomial distribution.
To test the hypothesis that the coin is fair (π = .5) we must actually collect data. Suppose
that we toss the coin n = 100 times and get x = 40 heads. What should we conclude about
our hypotheses? The first thing to note is that we cannot conclude anything with certainty
in this case. Any value of x = 0, 1, . . . , 100 is consistent with both π = 0.5 and any other
value of π. However, if the coin really is fair, some results for x are more surprising than
others. In this case, for example, if our hypothesis is true, then P(X ≤ 40) = 0.02844,
so we would only get 40 or fewer heads about 2.8% of the times that we did this test. In other
words, getting only 40 heads is pretty unusual, but not extremely unusual. This gives us
some evidence to suggest that the coin is biased. After all, one of two things must be true.
Either
• the coin is fair (π = 0.50) and we were just “unlucky” in our particular 100 tosses, or
• the coin is not fair, in which case the probability calculation we just did doesn’t apply
to the coin.
That in a nutshell is the logic of a statistical hypothesis test. We will learn a number
of hypothesis tests, but they all follow the same basic outline.
Step 1: State the null and alternative hypotheses
In a typical hypothesis test, we pit two hypotheses against each other.
1. Null Hypothesis. The null hypothesis, usually denoted H0 , is generally a hypothesis that the data analysis is intended to investigate. It is usually thought of as the
“default” or “status quo” hypothesis that we will accept unless the data gives us
substantial evidence against it.
2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha , is the
hypothesis that we want to put forward as true if we have sufficient evidence
against the null hypothesis.
In the example of the supposedly fair coin, it is clear that the hypotheses should be
H0 : π = 0.5
Ha : π ≠ 0.5
The null hypothesis simply says that the coin is fair while the alternate hypothesis says
that it is not. We want to choose between these two hypotheses. In this example, the
alternate hypothesis is two-sided. There are also situations when we wish to consider a
one-sided alternate hypothesis. Consider the ESP example. Our null hypothesis is surely
that the subject cannot do better than chance (π = .2) but our alternate hypothesis is that
the subject can do better than chance (π > .2). In our particular test, we do not allow for
the possibility that the subject somehow typically does worse than chance (although
this is logically possible).
Step 2: Calculate a test statistic
In our example, we compute the number of heads (40). This is the number that we will
use to test our hypothesis. The number 40 in this instance is called a statistic. Since we
use this statistic to test our hypothesis, we will sometimes call it a test statistic. In fact
we will use the term statistic in two different ways. In this case, the number 40 is a specific
value that is computed from the data. But also, 40 is the value of a certain random variable
that is computed from the experiment of tossing a coin 100 times. We will refer to both
the random variable and its value as statistics. In keeping with our notation for random
variables and data, upper-case letters will denote random variables and lower-case letters
their particular values.
A test statistic should be some number that measures in some way how true the null
hypothesis looks. In this case, a number near 50 is in keeping with the null hypothesis.
The farther x is from 50, the stronger the evidence against the null hypothesis.
Step 3: Compute the p-value
Now we need to evaluate the evidence that our test statistic provides. To do this requires
that we think about our statistic as a random variable. In the case of the supposedly fair
coin, our test statistic X ∼ Binom(100, π). As a random variable, our test statistic has a
distribution. The distribution of the test statistic is called its sampling distribution.
Now we can ask probability questions about our test statistic. The general form of the
question is “How unusual would my test statistic be if the null hypothesis were true?” To
do this, it is important that we know about the distribution of X when the null hypothesis
is true. In this case, X ∼ Binom(100, 0.5). So how unusual is it to get only 40 heads?
Assuming that the null hypothesis is true (i.e., that the coin is fair),
P(X ≤ 40) = pbinom(40,100,.5) = 0.0284 , and
P(X ≥ 60) = 1 - pbinom(59,100,.5) = 0.0284 .
So the probability of getting a test statistic at least as extreme (unusual) as 40 is 0.0568.
This probability is called a p-value.
There is some subtlety to the above computation and we shall return to it.
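For the record, the whole computation can be carried out in one line in R; this is just a sketch reproducing the numbers above.
> pbinom(40,100,.5) + (1 - pbinom(59,100,.5))   # P(X <= 40) + P(X >= 60), about 0.0568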
Step 4: Draw a conclusion
Drawing a conclusion from a p-value is a judgment call and it is a scientific rather than
mathematical decision. Our p-value is 0.0568. This means that if we repeatedly tossed a
fair coin 100 times, between 5 and 6% of these repetitions would give fewer than 41 or more than 59
heads. So our result of 40 is a bit on the unusual side, but not extremely so. Our data
provide some evidence to suggest that the coin may not be fair, but the evidence is far
from conclusive. If we are really interested in the coin, we probably need to gather more
data.
Other hypothesis tests will proceed in a similar fashion. The details of how to compute
a test statistic and how to convert it into a p-value will change from test to test, but the
interpretation of the p-value is always the same. The p-value measures how surprising the
value of the test statistic would be if the null hypothesis were true. The next example illustrates the steps of the hypothesis testing paradigm in a case where the alternate hypothesis
is one-sided.
Example 4.1.2
A company receives a shipment of printed circuit boards. The claim of the manufacturer is that the defective rate is at most 1%. If 100 boards are tested, should we
dispute the claim of the manufacturer if we find 3 defective boards in this test? In this
situation, the pair of hypotheses to test are
H0 : π = 0.01
Ha : π > 0.01
The following R session is relevant to this example.
> 1-pbinom(c(0:5),100,.01)
[1] 0.633968 0.264238 0.079373 0.018374 0.003432 0.000535
From this computation, we find that even if the null hypothesis is true, we could
expect to find 3 or more defective boards 7.9% of the time if we test 100. This result
doesn’t seem surprising enough to reject the null hypothesis or the shipment. (But
perhaps you disagree!) In this example, we have illustrated how we proceed when
the alternate hypothesis is one-sided. Namely, we only consider results to favor the
alternate hypothesis when they deviate from the null hypothesis in the direction of the alternate hypothesis.
That is, we wouldn’t consider having too few defectives as evidence against the null
hypothesis in favor of the alternate hypothesis.
It is often the case that we must make a decision based on our hypothesis test. In
Example 4.1.2, for example, we must finally decide whether to reject the shipment. There
are of course two different kinds of errors that we could make.
Definition 4.1.1 (Type I and Type II errors). A Type I error is the error of rejecting
H0 even though it is true.
A Type II error is the error of not rejecting H0 even though it is false.
Of course, if we reject the null hypothesis, we cannot know whether we have made a
Type I error. Similarly, if we do not reject the null hypothesis, we cannot know whether
we have made a Type II error. Whether we have committed such an error depends on the
true value of π which we cannot ever know simply from data. What we can do however
is to compute the probability that we will make such an error given our decision rule and
our true state of nature.
To illustrate the computation of these two kinds of errors, let’s return to the computation
of the p-value in the case of the (un)fair coin. Suppose that we decide that whenever we
toss a coin 100 times, we will consider it unfair if we have 40 or fewer or 60 or more heads.
Then the p-value computation (recall the p-value was 0.0568) tells us that
If the null hypothesis is true, our decision rule will make a Type I error with
probability 5.68%
Is this the right decision rule to use? If instead we decide to reject the null hypothesis
only if X ≤ 39 or X ≥ 61, we find that we will make a Type I error with probability only
pbinom(39,100,.5) + (1-pbinom(60,100,.5))=0.035. Which decision rule should we
use? A common convention is to make some canonical choice of a probability of Type I
error that we are willing to tolerate. A probability of Type I error of 5% is often chosen. If
5% were the greatest Type I error probability we were willing to tolerate, then we would not
reject a null hypothesis if our p-value was greater than 5%. In the coin example, observing 40 heads
would not lead us to reject the null hypothesis, but observing 39 heads would. The choice of 5% is conventional but somewhat
arbitrary. It is usually better to report the result of a hypothesis test as a p-value rather
than simply reporting that the null hypothesis is rejected. We usually denote by α the
probability of a Type I error that we are willing to accept in our decision rule.
Notice that if we lower α it becomes more difficult to reject the null hypothesis. This
means that if the null hypothesis is false, the probability of a Type II error increases with
decreasing α. (Oddly enough, the probability of a Type II error is named β.) We cannot
compute the probability of a Type II error however without knowing the true value of π.
Consider the case of the (un)fair coin. Suppose we choose α = .05 and so we choose to reject
the null hypothesis only if X ≤ 39 or X ≥ 61. What is the probability that we make a
Type II error if the true value of π is 0.55? The probability that we reject the null hypothesis in this case is computed by
> pbinom(39,100,.55) + (1-pbinom(60,100,.55))
[1] 0.1351923
Notice that we will reject the null hypothesis only 13.5% of the time so that the probability
that we make a Type II error is 86.5%! Obviously, our test is very conservative and will not
detect an unfair coin very often. That is the penalty we pay for wanting to be reasonably
sure that we do not make a Type I error. The next example illustrates these considerations
in the case of a one-sided alternate hypothesis.
Example 4.1.3
As described in Example 4.1.1, the conventional test for ESP is a card test. The
subject is asked to guess what is on 25 consecutive cards each of which contains one of
five symbols. The appropriate pair of hypotheses in this case are
H0 : π = 0.2
Ha : π > .2
The following computation from R will help us develop our test.
> 1-pbinom(c(5:10),25,.2)
[1] 0.38331 0.21996 0.10912 0.04677 0.01733 0.00555
Obviously, our decision rule should say to reject the null hypothesis if the number of
successes is too large. Note that P(X ≥ 9) = 4.7% if the null
hypothesis is true. Therefore if we choose α to be 5%, as is customary, we should reject
the null hypothesis in favor of the alternate hypothesis if the number of successes is at
least 9. If we follow this rule, the probability that we will make a Type I error is 4.7%
if the null hypothesis is true.
What if the null hypothesis is false? For example what if the true value of π =
.3? (This is a rather modest case of ESP but such a person would be interesting!) In this case, our decision rule would reject the null hypothesis with probability
1-pbinom(8,25,.3)=.323. Note that even if our subject has ESP, our test could very
well not detect this.
What one should notice in our treatment of decision rules is the asymmetry between the
two hypotheses. We are generally not willing to tolerate a large probability of a Type I
error – we often set α = 5%. However this seems to lead to a rather large probability of a
Type II error in the case that the null hypothesis is false. This asymmetry is intentional
however as the null hypothesis usually has a preferred status as the “innocent until proven
guilty” hypothesis.
4.2 Inferences about the Mean
One of the most important problems in inferential statistics is that of making inferences
about the (unknown) mean of a population.
Example 4.2.1
1. What is the average height of a Calvin College student? It not being feasible to
measure each student, we might take a random sample of Calvin students and
compute the sample mean x̄ of these students. How close is x̄ likely to be to the
true mean?
2. We have a number of chickens that we feed a diet of sunflower seeds. The average
weight of the chickens after 30 days is 330 grams. How close is this number to the
average weight of the (theoretical) population of “all” similar chickens?
3. We take a number of measurements of the speed of light. How close is the
average of these measurements likely to be to the “true” value?
In this section, we will conceptualize the above examples as instances of this question.
Given i.i.d. random variables X1 , . . . , Xn with unknown mean µX , what can we infer
about µX from a particular outcome x1 , . . . , xn ?
Estimates and Estimators
We will call x̄ an estimate of µX and X̄ an estimator of µX . The difference is that
X̄ is a random variable (you can think of it as a procedure for producing an estimate)
and x̄ is a number. The estimator X̄ has two very important properties that make it a
desirable estimator. The first is that E(X̄) = µX . In other words, in the long run,
sample means average out to the population mean. Because of this, we say that the estimator
X̄ is unbiased. An unbiased estimator doesn’t have a tendency to under- or over-estimate
the quantity in question. The general definition is this.
Definition 4.2.1 (unbiased estimator). Suppose that θ is a parameter of a distribution and
that Y is a statistic computed from a random sample X1 , . . . , Xn from that distribution.
Then Y is an unbiased estimator of θ if E(Y ) = θ.
It turns out that the sample variance S² is an unbiased estimator of σX², which is the real
reason we use n − 1 rather than n in the definition of S².
The second important property is that X̄ is likely to be close to µX if n is large. Formally,
we can say that for every ε > 0 we have that P(|X̄n − µX | > ε) → 0 as n → ∞. While we will
not prove this, it follows from the fact that the variance of X̄n is σ²/n.
These two properties together suggest that X̄ is a good choice for an estimator of µ.
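A quick simulation gives a feel for the second property. This is only a sketch; the exponential distribution with mean 2 is an arbitrary choice of population.
> mean(rexp(10, rate=0.5))      # sample mean based on n = 10; often not very close to the true mean 2
> mean(rexp(10000, rate=0.5))   # sample mean based on n = 10000; typically much closer to 2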
The Idea of a Confidence Interval
While the estimator X̄ may be a good procedure to use, we recognize that in any particular
instance, the estimate x̄ will not be equal to µX . We next will use the Central Limit
Theorem to say something about how close to µX the estimate is likely to be. The Central
Limit Theorem allows us to say that X̄ is approximately normally distributed with mean
µX and variance σ²/n. Thus the following random variable has a distribution that is
approximately standard normal:
Z = (X̄ − µ) / (σ/√n) .

Therefore we can write

P( −1.96 < (X̄ − µ)/(σ/√n) < 1.96 ) ≈ .95 .
(The number 2 in the 68%-95%-99.7% law is actually 1.96.) Using algebra, we find that
P( X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n ) ≈ .95 .
What this probability statement says is that the interval
( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )
is likely to contain the true mean of the distribution. This interval is a random interval.
Definition 4.2.2 (confidence interval). Suppose that X1 , . . . , Xn is a random sample from
a distribution that is normal with mean µ and variance σ 2 . Suppose that x1 , . . . , xn is the
observed sample. The interval
( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n )
is called an approximate 95% confidence interval for µ.
How does this notion of a confidence interval help us? Actually not much since this
interval is defined in terms of σ, the standard deviation of the original distribution. But
σ is not likely to be known (after all, we don’t even know the mean µ of the original
distribution). Let’s set that issue aside and consider an example.
Example 4.2.2
A machine creates rods that are to have a diameter of 23 millimeters. It is known
that the standard deviation of the actual diameters of parts created over time is 0.1
mm. A random sample of 40 parts are measured precisely to determine if the machine
is still producing rods of diameter 23 mm. The data and 95% confidence interval are
given by
> x
 [1] 22.958 23.179 23.049 22.863 23.098 23.011 22.958 23.186 23.015 22.995
[11] 23.166 22.883 22.926 23.051 23.146 23.080 22.957 23.054 23.019 23.059
[21] 23.040 23.057 22.985 22.827 23.172 23.039 23.029 22.889 23.089 22.894
[31] 22.837 23.045 22.957 23.212 23.092 22.886 23.018 23.031 23.073 23.117
> mean(x)
[1] 23.024
> c(mean(x)-(1.96)*.1/sqrt(40),mean(x)+(1.96)*.1/sqrt(40))
[1] 22.993 23.055
It appears that the process could still be producing rods of average diameter 23 mm.
We use the term confidence interval for this interval since we are reasonably confident
that the true mean of the rods is in the interval (22.993, 23.055). We even have a number
that quantifies that confidence, 95%. But we need to be very careful in what we are saying.
We are not saying that
(BAD - DO NOT SAY) the probability that the true mean is in the interval
(22.993, 23.055) is 95%.
There is no probability after the data are collected. Either the mean is in the interval
or it isn’t. Rather we are making a statement before the data are collected:
If we are to generate a 95% confidence interval for the mean from a random
sample of size 40 from a normal distribution with standard deviation 0.1, then
the probability is 95% that the resulting confidence interval will contain the
mean.
On the frequentist conception of probability we could say
If we generate many 95% confidence intervals by this procedure, approximately
95% of them will contain the mean of the population.
After the data are collected, a good way of describing the confidence interval that results
is
Either the population mean is in (22.993, 23.055) or something surprising happened.
Notice that the confidence interval says something about the precision of our estimate.
A wide confidence interval means that our estimate is not very precise.
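The frequentist interpretation above is easy to check by simulation. The following sketch uses a true mean of 23 and σ = 0.1 to echo the rod example; the fraction of intervals that cover the true mean should come out close to 0.95.
> xbar = replicate(10000, mean(rnorm(40, mean=23, sd=0.1)))
> covered = (xbar - 1.96*0.1/sqrt(40) < 23) & (23 < xbar + 1.96*0.1/sqrt(40))
> mean(covered)    # the proportion of the 10000 intervals that contain the true mean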
But σ Isn’t Known!
Using the Central Limit Theorem, we have seen that
P( X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n ) ≈ .95 .    (4.1)
The next step is to make another approximation. We need to get rid of σ. Since S²,
the sample variance, is an unbiased estimate of σ², the trick is to use S = √S², the sample
standard deviation, to estimate σ. Thus we have
P( X̄ − 1.96 S/√n < µ < X̄ + 1.96 S/√n ) ≈ .95 .
Now, after the experiment we have values for both X̄ and S. We illustrate the procedure
for getting our new confidence interval using Example 4.2.2. Note that the following R code
computes a 95% confidence interval for µX .
> sd(x)
[1] 0.098755
> c( mean(x) - 1.96* sd(x)/sqrt(40), mean(x) + 1.96 * sd(x)/sqrt(40))
[1] 22.993 23.054
Removing the Approximations
Our new 95% confidence interval for the mean
( x̄ − 1.96 s/√n , x̄ + 1.96 s/√n )
makes two approximations:
• We use the CLT to say that we can use the normal distribution (that’s where the
1.96 comes from)
• We use S instead of σ simply because we do not know σ
The CLT Approximation
There are two ways of getting around the fact that we use the CLT in our approximation.
First, we could assume that the underlying distribution is normal. Then there is no need
to approximate since the distribution of X is exactly normal. Or we could use facts about
the particular distribution in question. For example if X is binomial, we could use similar
facts about the binomial distribution to develop a different kind of confidence interval.
In general however we are just going to have to be content with the fact that our
confidence intervals are approximate and hope that our sample size n is large enough.
The Approximation of using S for σ
The bottom line here is that we will change the 1.96 used in our current approximation to a
slightly larger number to compensate for the approximation that results from not knowing
σ. It seems right to do this: if we are less sure that we are using the right endpoints for the
interval, we should make the interval a little wider to ensure that we have a 95% chance
of capturing the mean. How much wider we should make the interval is a somewhat tricky
(and long) story that we will tell in the next section.
Before we modify our intervals to take into account the approximation of σ by S, we
note that we could modify our confidence intervals in a number of ways. For example, the
number 95% is not sacred. It should be clear how to generate a 68% confidence interval
or even an 80% confidence interval. We merely need to look up the appropriate fact about
the standard normal distribution. A second way in which we might modify our intervals is
to make them one-sided. For example, if we wanted a lower-bound for our rod diameters,
since qnorm(.05,0,1)=-1.644854 we could use
P( X̄ − 1.64 S/√n < µ < ∞ ) ≈ .95 .
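For the rod diameters of Example 4.2.2 (assuming the vector x is still in the workspace), such a lower confidence bound could be computed as follows.
> mean(x) - 1.64*sd(x)/sqrt(40)   # an approximate 95% lower confidence bound for the mean diameter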
4.3 The t-Distribution
In the last section, we left the problem of finding a confidence interval for µ at the point
where we had a perfectly reasonable, but approximate, confidence interval. There
were two approximations: the use of the CLT and the approximation of σ by S. We
focus on the latter problem here. We will begin by assuming that the random sample
X1 , . . . , Xn are normal random variables so that we need not concern ourselves with the
CLT approximation. Then the question is, what is the effect of replacing (X̄ − µ)/(σ/√n) by (X̄ − µ)/(S/√n)?
The t-distribution holds the key.
Definition 4.3.1 (t-distribution). A random variable T has a t distribution (with parameter ν ≥ 1, called the degrees of freedom of the distribution) if it has pdf
f (t) = (1/√(πν)) · (Γ((ν + 1)/2) / Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2) ,    −∞ < t < ∞ .
(The Γ function in the definition of the pdf above is an important function from analysis
that is a continuous extension of the factorial function. But in this instance, it doesn’t
really matter what it is since its purpose is simply as a constant to ensure that the integral
of the density is 1.)
Some properties of the t-distribution include
1. f is symmetric about t = 0 and unimodal. In fact f looks bell-shaped.
2. The mean of T is 0 if ν > 1 (and does not exist if ν = 1).
3. The variance of T is ν/(ν − 2) if ν > 2.
4. For large ν, T is approximately standard normal.
R knows the t-distribution of course and the appropriate functions are dt(x,df), pt(),
qt(), and rt(). The graphs of the normal distribution and two t-distributions are produced
by the code below.
> x=seq(-3,3,.01)
> y=dt(x,3)
> z=dt(x,10)
> w=dnorm(x,0,1)
> plot(w~x,type="l",ylab="density")
> lines(y~x)
> lines(z~x)
[Figure: densities of the standard normal distribution and the t-distributions with 3 and 10 degrees of freedom, plotted against x.]
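The last property listed above is easy to check numerically with qt; a small sketch:
> qt(.975, df=c(5, 30, 1000))   # t critical values; these approach the normal value as df grows
> qnorm(.975)                   # the corresponding normal value, about 1.96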
The importance of the t-distribution is contained in the following Theorem.
Theorem 4.3.2. If X1 , . . . , Xn are i.i.d. normal random variables with mean µ and variance σ 2 , then the random variable
(X̄ − µ) / (S/√n)
has a t distribution with n − 1 degrees of freedom.
It is now clear how to generate an exact confidence interval for µ in the case that the
data come from a normal distribution. For any number β, let tβ,ν be the unique number
such that
P (T > tβ,ν ) = β
where T is a random variable that has a t distribution with ν degrees of freedom. Then we
have
Theorem 4.3.3. If x1 , . . . , xn are the observed values of a random sample from a normal
distribution with unknown mean µ and t∗ = tα/2,n−1 , the interval
( x̄ − t∗ s/√n , x̄ + t∗ s/√n )

is a 100(1 − α)% confidence interval for µ.
In Example 4.2.2 where we considered the diameter of manufactured rods, we had n = 40.
If we assume that the measurements come from a normal distribution, we would use the
t-distribution with ν = 39. To find a 95% confidence interval we need t.025,39 . R of course
computes this as qt(.975,39)= 2.022691 . So the effect of not knowing σ in this case
is to use 2.02 in determining the width of the confidence interval rather than 1.96.
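Assuming the rod diameters of Example 4.2.2 are still stored in x, the corresponding computation is sketched below; the resulting interval is only slightly wider than the one computed earlier with 1.96.
> t.star = qt(.975, 39)
> c(mean(x) - t.star*sd(x)/sqrt(40), mean(x) + t.star*sd(x)/sqrt(40))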
Notice that in this confidence interval there are three components. The first is an estimate
x̄ of the quantity it is a confidence interval for. Second there is a number t∗ determined
from the t-distribution by the level of confidence and the degrees of freedom. This number
is usually referred to as a critical value. Finally, there is an estimate s/√n of the standard
deviation of the estimator. The number σ/√n is often called the standard error (of the
estimator or of the mean) and is often denoted σe . The estimate s/√n of this standard
error is often denoted se . Therefore we have that the confidence interval is of the form
(estimate) ± (critical value) · (estimate of standard error) .
Many other confidence intervals in statistics have the same form. The critical values and
estimates change based on the situation but the general form of the interval is the same.
Because of the importance of confidence intervals for µ that are generated by the t-distribution, there is a function in R that does the table lookup and the arithmetic for us.
We illustrate in the next example.
Example 4.3.1
Returning to the iris data, we might want to know the average sepal width of virginica
irises. There is a lot to ignore in the following output but note that two confidence
intervals are generated (95% and 90%) and that the t-distribution is used with 49
degrees of freedom (as n = 50).
> data(iris)
> sw=iris$Sepal.Width[iris$Species=="virginica"]
> hist(sw)
> t.test(sw)
One Sample t-test
data: sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.882347 3.065653
sample estimates:
mean of x
2.974
> t.test(sw,conf.level=.9)
One Sample t-test
data: sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
2.897536 3.050464
sample estimates:
mean of x
2.974
Now that we have exact confidence intervals in the case that data come from a normal
distribution even if σ is unknown, we turn to the case that the underlying distribution is
unknown. In this case we advocate using the t-distribution just as above, recognizing that
the result is just an approximation. We illustrate in the following example.
Example 4.3.2
Thirty seniors are chosen at random from the collection of 1,333 seniors at a certain midwest college. The average GPA of the thirty seniors chosen is 3.2891. What
inferences can we make about the mean GPA of the 1,333 seniors? We first simplify
and assume that the 30 seniors represent the result of thirty i.i.d. random variables.
Though sampling was without replacement, this seems like a relatively harmless assumption. We next realize that the underlying distribution of GPAs is not likely to
be normal but rather to be negatively skewed. (This does not mean that we expect
to find negative GPAs!) The technology of the last section suggests using the normal
distribution with s in place of σ. Using the t-distribution instead produces
> sr=read.csv(’http://www.calvin.edu/~stob/data/actgpa.csv’)
> sr$GPA
[1] 3.992 2.533 3.377 3.009 3.509 3.969 3.917 3.547 3.416 3.287 4.000 3.446
[13] 3.905 2.926 3.100 3.446 2.785 3.663 3.368 3.352 3.929 2.750 3.620 3.765
[25] 2.763 1.986 2.836 2.696 3.119 2.662
> t.test(sr$GPA)
One Sample t-test
data: sr$GPA
t = 35.1095, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
3.097500 3.480700
sample estimates:
mean of x
3.2891
In this case, the effect of using the t-distribution is to replace 1.96 by 2.045 in the
computation of the width of the confidence interval. It seems prudent to use a wider
confidence interval since we are only approximating the “true” 95% confidence interval
in light of the fact that we are using the CLT.
Most statisticians recommend the approach of the last example. Namely, when constructing an approximate confidence interval in the case when our data are not from a
normal distribution, we use the t-distribution with its slightly wider intervals than would
be constructed by using the normal distribution. Statisticians have found that when this is
done, the confidence intervals constructed work well for a wide variety of underlying non-normal distributions. That is, 95% confidence intervals produced from the t-distribution
tend to be approximately 95% confidence intervals even though the distributional hypothesis is not satisfied. We say that this method of producing confidence intervals is robust,
meaning it is not particularly sensitive to departures from the hypothesis (normality) on
which it is based. (Older books suggest that one could use the normal distribution instead
of the t-distribution if n ≥ 30 but this was a computational simplification. R knows all the
t-distributions and can use one as easily as another.)
There are two very important cautions to be made here. Although the t-distribution
works well over a wide range of distributions and sample sizes, it still is an approximation
and in particular can give poor results if the sample size is small and the underlying distribution is quite skewed. And the t-distribution will often fail disastrously if the independence
assumption is violated.
4.4 Inferences for the Difference of Two Means
In this section we consider the problem of making inferences about the difference of two
unknown means. We first give some examples.
Example 4.4.1
1. One might hypothesize that females get better grades at Calvin than males on
average. One way of stating this claim precisely is to claim that the average
GPA of females is greater than the average GPA of males. Since Calvin does
not publish the average GPA by gender, we might test this claim by choosing a
random sample of males and a separate random sample of females and comparing
the two sample means.
2. One might claim that Tylenol is better than ibuprofen for treating pain from
fractures in young children. To test this, one might assign children with leg
fractures at random to treatment by Tylenol or ibuprofen. One would then
compare the averages of some measure of pain relief in the two groups.
3. Kaplan claims to be able to raise SAT scores by 100 points with its tutoring
program. To test the claim, they take a number of individuals who have already
taken the SAT test and subject them to their program. The students then take
the SAT test after the program and their before and after scores are compared.
In the first case of the example, it is easy to see that we are choosing a random sample
from each of two different populations. The second case is somewhat different. The “populations” of ibuprofen and Tylenol takers are really theoretical and not actual populations.
But we can still think of the results as random samples from these theoretical populations
(e.g., the population of all children with similar injuries who might be given ibuprofen),
in part because we randomized the assignment of individuals to the two groups. The third
case of the example is clearly different. The before and after scores do not represent two
independent populations since we measured these scores on the same individuals. In
this section we address the issue of determining whether there is a difference in means
between the two populations, considering the situation that arises in
the first two cases of the example. We will call this the “two independent samples” case.
Assumptions for two independent samples:
1. X1 , . . . , Xm is a random sample from a population with mean µX and variance σX².
2. Y1 , . . . , Yn is a random sample from a population with mean µY and variance σY².
3. The two samples are independent one from another.
4. The samples come from normal distributions.
We first write a confidence interval for the difference in the two means µX − µY . Just
as did our confidence intervals for one mean µ, our confidence interval will have the form
(estimate) ± (critical value) · (estimate of standard error) .
The natural choice for an estimator of µX − µY is X̄ − Ȳ . To write the other two pieces
of the confidence interval, we need to know the distribution of X̄ − Ȳ . The necessary fact
is this:

( X̄ − Ȳ − (µX − µY ) ) / √( σX²/m + σY²/n ) ∼ Norm(0, 1) .
Analogously to confidence intervals for a single mean, it seems like the right way to
proceed is to estimate σX by sX , σY by sY and to investigate the random variable
( X̄ − Ȳ − (µX − µY ) ) / √( SX²/m + SY²/n ) .    (4.2)
The problem with this approach is that the distribution of this quantity is not known
in general (unlike the case of the single mean where the analogous quantity has a t-distribution). We need to be content with an approximation.
Lemma 4.4.1. (Welch) The quantity in Equation 4.2 has a distribution that is approximately a t-distribution with degrees of freedom ν where ν is given by
ν = ( SX²/m + SY²/n )² / ( (SX²/m)²/(m − 1) + (SY²/n)²/(n − 1) )    (4.3)
(It isn’t at all obvious from the formula but it is good to know that min(m − 1, n − 1) ≤
ν ≤ n + m − 2.)
We are now in a position to write a confidence interval for µX − µY .
An approximate 100(1 − α)% confidence interval for µX − µY is
x̄ − ȳ ± t∗ √( sX²/m + sY²/n )    (4.4)
where t∗ is the appropriate critical value tα/2,ν from the t-distribution with ν degrees
of freedom given by (4.3).
We note that ν is not necessarily an integer and we leave it to R to compute both the value
of ν and the critical value t∗ .
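For the curious, the Welch formula is easy enough to compute directly. The sketch below assumes the two samples are stored in vectors x and y (names chosen only for illustration).
> vx = var(x)/length(x)
> vy = var(y)/length(y)
> (vx + vy)^2 / ( vx^2/(length(x)-1) + vy^2/(length(y)-1) )   # the Welch degrees of freedom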
Example 4.4.2
The t-test is due to “Student” (a pseudonym of William Sealy Gosset whose employer, Guinness Brewery, did not allow him to publish under his own name). In a
famous paper in 1908 addressing the issue of the inference about means, Student considered data from a sleep experiment. Two different soporifics were tried on a number
of subjects and the amount of extra sleep that each subject attained was recorded. The
question is whether one soporific worked better than another.
> sleep
   extra group
1    0.7     1
2   -1.6     1
3   -0.2     1
4   -1.2     1
5   -0.1     1
.................
> t.test(extra~group,data=sleep)
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33
We see that each group averaged a positive amount of extra sleep and the excess was greater for the
subjects in group 2. However it does not appear that we could say one drug was
clearly better than the other (after all, 0 is in the confidence interval so that the mean
difference could be 0). A 95% confidence interval for the difference in mean effect of
the two drugs is (−3.37, 0.21). We can see that the degrees of freedom is 17.776 and
we can be grateful that we didn’t have to compute it or the critical value. Note too
that R refers to this as a Welch test.
We should remark at this point that older books (and the Fundamentals of Engineering
Exam) suggest an alternate approach to the problem of writing confidence intervals for
µX − µY . These books suggest that we assume that the two standard deviations σX and
σY are equal. In this case the exact distribution of our quantity is known. The problem
with this approach is that there is usually no reason to suppose that σX and σY are equal
and if they are not the proposed confidence interval procedure is not as robust as the one we
are using. In these notes we take the approach of not even mentioning what this alternate
procedure is since it has fallen into disfavor.
Hypotheses and Cautions
Confidence intervals generated by Equation 4.4 are probably the most common confidence
intervals in the statistical literature. But those who generate such intervals are not always
sensitive to the hypotheses that are necessary to be confident about the confidence intervals
generated. It should first be noted that the confidence intervals constructed are based on
the hypothesis that the two populations are normally distributed. It is often apparent from
even a cursory examination of the data that this hypothesis is unlikely to be true. However,
if the sample sizes are large enough, we can rely on the Central Limit Theorem to tell us
our results are approximately true. There are a number of different rules of thumb as to
what “large enough” means; n, m > 15 for distributions that are relatively symmetric
and n, m > 40 for most distributions are common. A second approximation
concerns the approximation made in computing the Welch interval. The rule of thumb
here is that we can be more confident in intervals for which the quotients sX²/m and sY²/n are
similar in size than in those for which they are quite different.
Turning Confidence Intervals into Hypothesis Tests
It is often the case that we are interested in testing a hypothesis about µX − µY rather
than computing a confidence interval for that quantity. For example, the null hypothesis
µX − µY = 0 in the context of an experiment is a claim that there is no difference in the
two treatments represented by X and Y . Hypothesis testing of this sort has fallen into
disfavor in many circles since the knowledge that µX − µY ≠ 0 is of rather limited interest
unless the size of this quantity is known. A confidence interval answers that question more
directly. Nevertheless, since the literature is still littered with such hypothesis tests, we
give an example here.
Example 4.4.3
Returning to our favorite chicks, we might want to know if we should believe that
the effect of a diet of horsebean seed is really different from that of a diet of linseed. Suppose
that x1 , . . . , xm are the weights of the m chickens fed horsebean seed and y1 , . . . , yn
are the weights of the n chickens fed linseed. The hypothesis that we really want
to test is H0 : µX − µY = 0. We note that if the null hypothesis is true, then
T = (X̄ − Ȳ )/√( SX²/m + SY²/n ) has a distribution that is approximately a t-distribution
with the Welch formula giving the degrees of freedom. Thus the obvious strategy is
to reject the null hypothesis if the value of T is too large. Fortunately, R does all the
appropriate computations. Notice that the mean weight of the two groups of chickens
differs by 58.5 but that a 95% confidence interval for the true difference in means is
(−99.1, −18.0). On this basis we would conclude that the linseed diet is superior,
i.e., that there is a difference in the mean weights of the two populations. This is
verified by the hypothesis test of H0 : µX − µY = 0 which results in a p-value of 0.007.
That is, this great a difference in mean weight would have been quite unlikely to occur
if there was no real difference in the mean weights of the populations.
> hb=chickwts$weight[chickwts$feed=="horsebean"]
> ls=chickwts$weight[chickwts$feed=="linseed"]
> t.test(hb,ls)
Welch Two Sample t-test
data: hb and ls
t = -3.0172, df = 19.769, p-value = 0.006869
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-99.05970 -18.04030
sample estimates:
mean of x mean of y
   160.20    218.75
Variations
One-sided confidence intervals and one-sided tests are possible as are intervals of different
confidence levels. All that is needed is an adjustment of the critical numbers (for confidence
intervals) or p-values for tests.
Example 4.4.4
A random dot stereogram is shown to two groups of subjects and the time it takes
for the subject to see the image is recorded. Subjects in one group (VV) are told
what they are looking for but subjects in the other group (NV) are not. The quantity
of interest is the difference in average times. If µX is the theoretical average of the
population of the NV group and µY is the average of the VV group, then we might
want to test the hypothesis
H0 : µX − µY = 0
Ha : µX > µY
> rds=read.csv(’http://www.calvin.edu/~stob/data/randomdot.csv’)
> rds
       Time Treatment
1  47.20001        NV
2  21.99998        NV
3  20.39999        NV
......................
77  1.10000        VV
78  1.00000        VV
> t.test(Time~Treatment,data=rds,conf.level=.9,alternative="greater")
Welch Two Sample t-test
data: Time by Treatment
t = 2.0384, df = 70.039, p-value = 0.02264
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
 1.099229       Inf
sample estimates:
mean in group NV mean in group VV
        8.560465         5.551429
>
From this we see that a lower bound on the difference µX − µY is 1.10 at the 90%
level of confidence. And we see that the p-value for the result of this hypothesis test
is 0.023. We would probably conclude that those getting no information take longer
than those who do on average.
4.5 Regression Inference
In Section 2.4, we tried to describe the relationship between two quantitative variables by
fitting a line to the data that came to us in pairs (x1 , y1 ), . . . , (xn , yn ). In this section, we
describe a statistical model that attempts to account for both the linear relationship in the
data and also the fact that the data are not exactly collinear. What results is known as
the standard linear model.
The standard linear model is given by the following equation that relates the values of
x and y.
Yi = β0 + β1 xi + εi ,
where
1. β0 , β1 are (unknown) parameters,
2. εi is a random variable with mean 0 and (unknown) variance σ²,
3. thus Yi is a random variable with mean β0 + β1 xi and variance σ²,
4. the random variables εi (and hence the variables Yi ) are independent,
5. the random variables εi are normally distributed.
We can write this model more succinctly in terms of linear algebra. Let β = (β0 , β1 ).
Then the model says that Y = Xβ + ε where ε is a random vector. There are three
unknown parameters to estimate in this model: β0 , β1 , and σ².
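To make the model concrete, the following sketch simulates one data set from the standard linear model; the values β0 = 2, β1 = 3, σ = 0.5 and the choice of x's are arbitrary and only for illustration.
> x = runif(20, 0, 10)               # 20 arbitrary values of the predictor
> y = 2 + 3*x + rnorm(20, 0, 0.5)    # responses from the model with beta0 = 2, beta1 = 3, sigma = 0.5
> lm(y ~ x)                          # the fitted coefficients should be near 2 and 3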
Estimating β0 and β1
One obvious choice for the estimates of β0 and β1 is given by the coefficients b0 , b1 of the
least squares regression line. It turns out that there are good statistical reasons for using
b0 , b1 to estimate β0 , β1 .
Lemma 4.5.1. The estimates b0 and b1 are unbiased estimates of β0 and β1 respectively.
Therefore, ŷi = b0 + b1 xi is an unbiased estimate of β0 + β1 xi .
Since b0 and b1 are the estimates, we will use B0 , B1 for the estimators (just as we used
X̄ and x̄ for the estimator and estimate of the mean). Unbiased estimators are not much
good to us if they have large variance. It is fairly easy to show (using equation 2.2, say)
that
Var(B1 ) = σ² / Σ(xi − x̄)²

Var(B0 ) = σ² Σxi² / ( n Σ(xi − x̄)² ) = σ² ( 1/n + x̄²/Σ(xi − x̄)² )
An inspection of these formulas for the variances of the coefficients shows that the variances of the estimators decrease as the number of observations increases (provided that the
values xi are not all identical). The variance depends not only on the error variance but
also on the spread of the independent variables xi . Qualitatively, at least, the variances of
the estimators behave as we would want them to. But could we find estimators with even
smaller variance? The following famous theorem says that the least-squares estimates of
β0 and β1 are the best estimators in a certain precise sense.
Theorem 4.5.2 (Gauss-Markov Theorem). Assume that E(εi ) = 0, Var(εi ) = σ², and the
random variables εi are independent. Then the estimators B0 and B1 are the unbiased
estimators of minimum variance among all unbiased estimators that are linear in the random variables Yi . (We say that these estimators are BLUE which stands for Best Linear
Unbiased Estimator.)
Estimating σ 2
The random variables εi have mean 0, variance σ², and are independent. Thus E(εi²) = σ².
So we could estimate σ² by

( Σi (yi − (β0 + β1 xi ))² ) / n ,    (4.5)

where the sum is over i = 1, . . . , n.
This fraction would give an unbiased estimate of σ 2 . This is not much good however as we
do not know β0 and β1 . Substituting estimates for β0 and β1 and changing the denominator
of the fraction gives us the estimate we need
MSE = Σi (yi − (b0 + b1 xi ))² / (n − 2) = SSResid / (n − 2) .
This estimate, denoted MSE, is called the mean squared error. The justification for
substituting n − 2 in the denominator rather than n, which would be more natural, is the
same as that for using n − 1 in the definition of s² in Section 4.2. Namely, the use of n − 2
ensures that MSE is an unbiased estimate of σ². Notice that the denominator in each case
accounts for the number of parameters estimated (one in the case of s² and two in the case
of MSE).
Example 4.5.1
A class taught at a college in the midwest took three tests and a final exam. There
were 32 students in the class. The final exam scores are related to the scores on Test
1. The result of a regression analysis appears below.
> class=read.csv(’http://www.calvin.edu/~stob/data/m222.csv’)
> class[1:3,]
  Test1 Test2 Test3 Exam
1    98   100    98  181
2    93    91    89  168
3   100    99    99  193
> l.class=lm(Exam~Test1,data=class)
> summary(l.class)
Call:
lm(formula = Exam ~ Test1, data = class)
Residuals:
     Min       1Q   Median       3Q      Max
-33.6930 -10.1574  -0.9462   8.5918  44.0759
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.9652    22.7916   1.051    0.301
Test1         1.6044     0.2729   5.880 1.95e-06 ***
---
Signif. codes:  0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 18.86 on 30 degrees of freedom
Multiple R-Squared: 0.5354,     Adjusted R-squared: 0.5199
F-statistic: 34.57 on 1 and 30 DF, p-value: 1.952e-06
> p.class=predict(l.class)
> mse=sum( (class$Exam-p.class)^2/30 )
> rse=sqrt(mse)
> rse
[1] 18.86495
Notice that R computes √MSE, which is called the residual standard error. The
residual standard error is used as an estimate for σ (although it is not an unbiased
estimate of σ). In keeping with our previous use of s to denote the estimate of the
standard deviation of an unknown distribution, we will generally use se to denote the
residual standard error (and Se to denote the corresponding estimator).
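The same number can also be extracted directly from the summary of the fitted model; a sketch:
> summary(l.class)$sigma    # 18.86495, the same value as rse above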
What we have done until now does not depend on the normality assumptions on the
random variables εi but only on the fact that they are independent with mean 0 and
common variance σ². In order to make inferences about the parameters β0 and β1 , we
need to assume something about the distribution of the εi and so we now assume also that
the random variables εi are normally distributed. This in turn implies that the random
variables Yi are normally distributed with E(Yi ) = β0 + β1 xi and Var(Yi ) = σ².
Under this assumption, it turns out the estimators B0 and B1 are normally distributed
as well. So we have that
B0 ∼ N( β0 , σ² ( 1/n + x̄²/Σ(xi − x̄)² ) )

B1 ∼ N( β1 , σ² / Σ(xi − x̄)² )
We will primarily be concerned with constructing confidence intervals and hypothesis
tests for β1 , the slope in the regression line. The reason for this is that the slope tells
us the direction and size of the supposed linear relationship between x and y. The same
reasoning can be used to write confidence intervals and tests for β0 .
Our procedure for writing a confidence interval for β1 is very similar to that of constructing a confidence interval for the mean µ of an unknown distribution. Just as in that
case, the unknown standard deviation σ is a nuisance parameter and we must substitute
an estimate of σ for it. In this case we use se . This in turn means that the sampling distribution of our statistic becomes a t-distribution rather than a normal distribution. The
resulting fact is that the statistic T below has a t-distribution with n − 2 degrees of freedom. (Here
n − 2 matches the denominator in the definition of se .)

T = (B1 − β1 ) / ( Se / √(Σ(xi − x̄)²) )

We define sb1 = se / √(Σ(xi − x̄)²). This number sb1 is called the estimate of the standard
error of b1 . We now have the following result.
Confidence Intervals for β1
A 100(1 − α)% confidence interval for β1 is given by
(b1 − tα/2,n−2 sb1 , b1 + tα/2,n−2 sb1 )
Example 4.5.2
In Example 2.4.1 we used linear regression to write a relationship between iron
content and material loss in certain Cu/Ni alloy bars. The dataset was the corrosion
dataset in R. In what follows, we write a 95% confidence interval for the slope of the
regression line.
> summary(l.corrosion)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  129.787      1.403   92.52  < 2e-16 ***
Fe           -24.020      1.280  -18.77 1.06e-09 ***
---
Signif. codes:  0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Residual standard error: 3.058 on 11 degrees of freedom
Multiple R-Squared: 0.9697,     Adjusted R-squared: 0.967
F-statistic: 352.3 on 1 and 11 DF, p-value: 1.055e-09
> qt(.975,11)
[1] 2.200985
> c(-24.020 - qt(.975,11)*1.280, -24.020+qt(.975,11)*1.280)
[1] -26.83726 -21.20274
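Incidentally, R can produce such intervals directly from the fitted model with the confint function (not used in these notes); a sketch:
> confint(l.corrosion, level=.95)   # confidence intervals for both the intercept and the slope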
The confidence interval constructed, (−26.84, −21.20), is a 95% confidence interval for
the slope of the “true” linear relationship between x and the mean of y. To interpret
this, we might say something like “We are 95% confident that an increase in iron
content of 1% results in an average loss of between 21.2 and 26.8 milligrams per square
decimeter of material.” Notice the high R2 value in this model. A very high percentage
of the loss due to corrosion is explained by the percentage iron content of the bar.
4.6 Exercises
4.1 A basketball player claims to be a 90% free-throw shooter. Namely, she claims to be
able to make 90% of her free-throws. Should we doubt her claim if she makes 14 out of 20
in a session at practice? Set this problem up as a hypothesis testing problem and answer
the following questions.
a) What are the null and alternate hypotheses?
b) What is the p-value of the result 14?
c) If the decision rule is to reject her claim if she makes 15 or fewer free-throws, what
is the probability of a Type I error?
4.2 In Example 4.1.1(c), we are trying to decide whether to fire the old kicker and hire a
new one on the basis of a trial of 20 kicks. Suppose that we decide to hire the new kicker
if he makes 8 or more kicks.
a) Suppose that he makes exactly 8 kicks. What is the p-value of this result?
b) What is α, the probability of a Type I error, for this decision rule?
c) If the kicker truly has a 35% chance of making each kick, what is the probability of
a Type II error (i.e., that we don’t believe that he is better than the old kicker)?
4.3 Nationally, 79% of students report that they have cheated on an exam at some point in
their college career. You can’t believe that the number is this high at your own institution.
Suppose that you take a random sample of size 50 from your student body. Since 50 is so
small compared to the size of the student body, you can treat this sampling situation as
sampling with replacement for the purposes of doing a statistical analysis.
a) Write an appropriate set of hypotheses to test the claim that 79% of students cheat.
b) Construct a decision rule so that the probability of a Type I error is less than 5%.
4.4 In this problem, you will develop a hypothesis test for a random variable other than
a binomial one. Suppose that you believe the waiting time until you are served at the
McDonald’s on 28th street is a random variable with an exponential distribution but with
unknown λ. The sign at the drive-up window says that the average wait time is 1 minute.
You actually wait 2 minutes. Your friend in the car says that this is outrageous and that
the claim on the sign must be wrong.
a) Write a pair of hypothesis about λ that captures the discussion between you and
your friend.
b) What is the p-value of the single data point of a 2 minute wait?
c) Write a sentence that explains clearly to your friend the meaning of that p-value.
Remember that your friend has not yet been fortunate enough to take a statistics
course.
d) How long would you have to have waited to be suspicious of McDonald’s claim?
There are many right answers to this question but any answer needs statistical justification.
4.5 In Example 4.2.2 we generated an approximate 95% confidence interval for µ assuming
that σ is known.
a) Construct instead a 90% confidence interval for µ.
b) Construct both 90% one-sided confidence intervals for µ.
c) Describe clearly a situation in which you would want a one-sided confidence interval
rather than a two-sided one.
4.6 Suppose that we are in a situation where we would want to construct a confidence
interval for µ and we knew σ = 0.3. How large a sample should we take to ensure that a
95% confidence interval would estimate µ to within 0.1?
4.7 Suppose that X1 , . . . , Xn are i.i.d. from an exponential distribution with parameter λ
unknown. In this problem we write a confidence interval for λ using X.
a) Rewrite Equation 4.1 in this case by substituting for µ and σ the appropriate expressions involving λ.
b) Solve the inequality that results in part (a) for an inequality of form a < λ < b where
a and b do not involve λ.
c) Suppose that n = 30 and X = 4.23. Using (b), write an approximate 95% confidence
interval for λ. Note that this confidence interval relies of the CLT but makes no other
approximation.
4.8 The chickwts dataset presents the results of an experiment in which chickens are fed
six different feeds. Suppose that we assume that the chickens were assigned to the feed
groups at random so that we can assume that the chickens can be thought of as coming
from one population. For each feed, we can assume that the chickens fed that feed are a
random sample of the (theoretical) population that would result from feeding all chickens
that feed.
a) Write 95% confidence intervals for the mean weight of chickens fed each of the six
feeds.
b) From an examination of the six resulting confidence intervals, is there convincing
evidence that some diets are better than others?
c) Since you no doubt used the t-distribution to generate the confidence intervals in (a),
you might wonder whether that is appropriate. Are there any features in the data
that suggest that this might not be appropriate?
4.9 The dataframe in http://www.calvin.edu/~stob/data/miaa05.csv contains statistics on each of the 134 players in the MIAA 2005 Men’s Basketball season. Choose 10
different random samples of size 15 from this dataset.
a) From each, compute a 90% confidence interval for the mean PTSG (points per game)
of all players.
b) Of the 10 confidence intervals you computed in part (a), how many actually did
contain the true mean? (Which you can compute since you have the population in
this instance.)
c) How many of the 10 confidence intervals in part (a) would you have expected (before
you actually generated them) to contain the true mean?
d) In light of your answer in (c), are you surprised by your answer in (b)?
4.10 The dataset http://www.calvin.edu/~stob/data/reading.csv contains the results of an experiment done to test the effectiveness of three different methods of reading
instruction. We are interested here in comparing the two methods DRTA and Strat. Let’s
suppose, for the moment, that students were assigned randomly to these two different
treatments.
a) Use the scores on the third posttest (POST3) to investigate the difference between
these two teaching methods by constructing a 95% confidence interval for the difference in the means of posttest scores.
b) Your confidence interval in part (a) relies on certain assumptions. Do you have any
concerns about these assumptions being satisfied in this case?
c) Using your result in (a), can you make a conclusion about which method of reading
instruction is better?
4.11 Surveying a choir, you might expect that there would not be a significant height
difference between sopranos and altos but that there would be between sopranos and basses.
The dataset singer from the lattice package contains the heights of the members of the
New York Choral Society together with their singing parts.
a) Decide whether these differences do or do not exist by computing relevant confidence
intervals.
b) These singers aren’t random samples from any particular population. Explain what
your conclusion in (a) might be about.
4.12 Returning to the sport of baseball one last time, let’s reexamine the results of the
1994–1998 baseball seasons in http://www.calvin.edu/~stob/data/team.csv. Earlier,
we tried to predict R (runs) by HR (homeruns). Let’s refine that analysis here.
a) Instead of predicting R from HR, use regression to write a linear relationship to
predict RG (runs per game) from HRG (homeruns per game).
b) Interpret the slope and intercept of the line in part (a) informally.
c) Write a 95% confidence interval for the slope of the line in (a).
4.13 The dataset http://www.calvin.edu/~stob/data/lakemary.csv contains the age
and length (in mm) of 78 bluegills captured from Lake Mary, Minnesota. (Richard Frie, J.
Amer. Stat. Assoc., (81), 922-929).
a) Write a linear function to predict the length from the age.
b) Interpret the slope and intercept of the line in (a).
c) Write a 95% confidence interval for the slope of the regression line.
d) Do you have any comments about the data or the model?
4.14 The dataset http://www.calvin.edu/~stob/data/home.csv contains the prices of
homes in a certain community at two different points in time.
a) Write a linear function to predict the old price from the new.
b) Write a 90% confidence interval for the slope of the line in (a).
c) Write a sentence explaining what the confidence interval in (b) means.