Mathematics 343 Class Notes
1 Introduction
Goal: To define probability and statistics and explain the relationship between the
two
1.1 What are Probability and Statistics?
1. Probability
Definition 1.1. A probability is a number meant to measure the likelihood of the occurrence of some uncertain event (in the future).
Definition 1.2. Probability (or the theory of probability) is the mathematical discipline
that
(a) constructs mathematical models for “real-world” situations that enable the computation of probabilities (“applied” probability)
(b) develops the theoretical structure that undergirds these models (“theoretical” or
“pure” probability).
2. Statistics
Definition 1.3. Statistics is the scientific discipline concerned with collecting, analyzing
and making inferences from data.
3. The relation between probability theory and statistics
Often, we arrange to collect the data by a process for which we have a probabilistic model.
Then probability theory informs our data analysis. The relationship between statistics
and probability theory is much like the relationship between mechanics and calculus.
1.2 An example of the relationship
Kellogg’s sells boxes of Raisin Bran labeled “Net Wt. 20 oz.”
1. What is meant by this claim?
Important Observation 1.4. Some variation is to be expected and should be allowed.
A probabilistic model is an appropriate “description” of this variation.
2. How does NIST recommend checking this claim?
Important Observation 1.5. We cannot check every box that Kellogg’s produces. We
must limit ourselves to a sample of such boxes and make inferences about the whole “population” of boxes.
1.3 Populations and Samples
1. Populations
Definition 1.6. A population is a well-defined collection of individuals.
We distinguish between actual (concrete) populations and conceptual (hypothetical) populations.
Definition 1.7. A parameter is a numerical characteristic of a population.
2. Samples
Definition 1.8. A sample is a subcollection of a given population.
Definition 1.9. A simple random sample of a given size n is a sample chosen from a
finite population in a manner such that each possible sample of size n is equally likely to
occur.
Homework.
1. Read the syllabus.
2. Read Section 1.1 of Devore and Berk.
3. Read Section 2 of the notes SimpleR. These notes and the R package are available at the
R section of the course webpage.
4. Download R to the computer that you use regularly. If you do not have easy access to
high-speed internet, ask your instructor for an installation CD.
5. Do problems 2.1,2,5,6 of SimpleR.
6. Do problems 1.4,6,9 of Devore and Berk.
2 Random Experiments
Goal: To develop the language for describing random (probabilistic) experiments
2.1 Experiments
1. Experiment (or random experiment) is an undefined term.
2. Experiments have three key characteristics
(a) future, not past
(b) could have any one of a number of outcomes, and which outcome will obtain is uncertain
(c) could be performed repeatedly (under essentially the same circumstances)
3. Examples of experiments.
2.2 The sample space
1. The sample space of an experiment is the set of all possible outcomes of that experiment.
(we usually use S for the name of the sample space)
2. Examples of sample spaces.
2.3 Events
1. An event is a subset of the set of outcomes.
2. The fundamental goal of a probability model:
We want to assign to each event E a number P (E) such that P (E) is the
likelihood that event E will happen if the experiment corresponding to the
sample space S is performed.
2.4 Language of Set Theory
Definition 2.1. Suppose that E and F are events.
1. The union of events E and F , denoted E ∪ F , is the set of outcomes that are in either E
or F
2. The intersection of events E and F , denoted E ∩ F , is the set of outcomes that are in
both E and F
3. The complement of an event E, denoted E′, is the set of outcomes that are in S but not in E
• Two special events are ∅ (nothing happens!) and S (something happens)
2.5 Using R to generate random events
To construct a simple random sample of size 12 from a lot of size 250, we could use the following
R code.
> x=c(1:250)
> x
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
....................
[217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
[235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
> sample(x,12,replace=F)
 [1]  82 145  19 129 198  27 237  25 106  83  34 170
> sample(x,12,replace=F)
 [1] 222 240  34  30 239 109  27 141 112 248  69 243
>
Homework.
1. Read Section 2.1 of Devore and Berk.
2. Do problems 2.2, 4, 5, 10.
3. There are (obviously) 100 positive integers in the range 1–100.
(a) How many of these are even? are prime?
(b) In a random sample of size 10, how many even numbers would one expect to find?
prime numbers?
(c) Use R to construct a random sample of size 10 from these integers. Record the
integers in your sample. How many elements of your sample are even? how many are
prime?
3 Probability Functions
Goal: To assign to each event A a number P (A), the probability of A
3.1 The Meaning of Probability Statements
1. The frequentist interpretation: the probability of an event A is the limit of the relative
frequency that A occurs in repeated trials of the experiment as the number of trials goes
to infinity.
2. The subjectivist interpretation: the probability of an event A is an expression of how
confident the assignor is that the event will happen.
3.2 Assigning Probabilities - Theory
1. Axioms.
Axiom 3.1. For all events A, P (A) ≥ 0.
Axiom 3.2. P (S) = 1.
Axiom 3.3. If A1 , A2 , A3 , . . . is a sequence of disjoint events, then
P(A_1 ∪ A_2 ∪ A_3 ∪ · · ·) = ∑_{i=1}^{∞} P(A_i)
2. Consequences.
Theorem 3.4. P (∅) = 0.
Theorem 3.5. If A and B are disjoint sets, then P (A ∪ B) = P (A) + P (B).
Theorem 3.6. For every event A, P (A0 ) = 1 − P (A).
Theorem 3.7. For all events A and B, P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
3.3 Assigning Probabilities - Practice
1. If we can list the outcomes (simple events) E1 , E2 , E3 , . . . and assign a probability P (Ei )
to each, then the probability of any event A is determined by
P(A) = ∑_{E_i ⊆ A} P(E_i)
2. Special case: there are N possible outcomes and we judge that each outcome is equally likely to occur. Then each outcome has probability 1/N. And if an event A consists of k outcomes, then P(A) = k/N.
Example: toss two fair dice. Will a sum of 7 occur? (An R sketch of this computation follows this list.)
Example: choose a random sample of 12 boxes of Raisin Bran from a lot of 250. Will
there be an unacceptably underweight box in the sample?
3. Special case: if we have data on previous trials of the experiment, we may estimate the
probability of each outcome by the relative frequency with which that outcome occurred
in the previous trials.
Example: Jim Thome bats. Will a homerun occur?
Example: a 54 year old male buys a life insurance policy. Will he die in the next 10 years?
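For the dice example in case 2 above, one way to carry out the equally-likely computation in R is sketched below; the variable name rolls and the use of expand.grid are just one possible choice.
> rolls=expand.grid(die1=1:6,die2=1:6)    # all 36 equally likely outcomes
> mean(rolls$die1+rolls$die2==7)          # proportion of outcomes with sum 7, namely 6/36
[1] 0.1666667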
3.4 Using R to Simulate Random Experiments
> outcomes=c(’Out’,’Single’,’Double’,’Homerun’,’Walk’)
> outcomes
[1] "Out"
"Single" "Double" "Homerun" "Walk"
> relfreq=c(299,65,23,39,86)/512
> sum(relfreq)
[1] 1
> sample(outcomes,1,prob=relfreq)
[1] "Out"
> sample(outcomes,4,prob=relfreq,replace=T)
[1] "Out"
"Out"
"Out"
"Single"
Homework.
1. Read Devore and Berk, Section 2.2.
2. Do problems 2.22,24,26,28,30 of Devore and Berk.
4 Counting
Goal: To develop methods for counting equally likely outcomes
4.1 Two Problems
Example 4.1. Suppose that there are 10 underweight boxes of Raisin Bran in a shipment of
250. What is the probability that there will be an underweight box in a random sample of 12
such boxes?
Example 4.2. Suppose that a fair coin is tossed 100 times. Should we be surprised if it comes
up heads more than 60 times?
4.2 The Fundamental Theorem of Counting
Proposition 4.3. Suppose that a set consists of ordered pairs such that there are n1 possible
choices for the first element of the ordered pair and for each such element there are n2 choices
for the second element. Then there are n1 n2 ordered pairs in the set.
Proposition 4.4. Suppose that a set consists of ordered k-tuples such that there are n1 choices
for the first element, for each choice of the first element there are n2 choices for the second
element, for each choice of the first two elements there are n3 choices for the third element, etc.
Then there are n1 n2 · · · nk−1 nk ordered tuples in the set.
4.3 Permutations and Combinations
Definition 4.5. An ordered sequence of k objects chosen from a set of n ≥ k objects is called
a permutation of size k. The number of such permutations is denoted Pk,n and is computed by
Pk,n = n(n − 1)(n − 2) · · · (n − k + 1).
Note that Pn,n = n!
Definition 4.6. A subset of k objects chosen from a set of n objects is called a combination of size k. The number of such combinations is denoted \binom{n}{k} (or sometimes C_{k,n}) and is computed by

\binom{n}{k} = \frac{P_{k,n}}{P_{k,k}} = \frac{n!}{k!(n − k)!}

The number \binom{n}{k} is usually called a binomial coefficient and is often read as “n choose k.”
Note that in each of these situations, we are selecting k objects without replacement. If order
“matters”, we are dealing with permutations. If order does not matter, we are dealing with
combinations.
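In R, combinations can be counted with the built-in choose function, and the permutation count P_{k,n} can be obtained from factorials; a quick sketch:
> choose(10,3)                   # number of combinations of size 3 from a set of 10 objects
[1] 120
> factorial(10)/factorial(7)     # P_{3,10} = 10*9*8, the number of ordered sequences of size 3
[1] 720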
4.4 Solution to the two basic problems
> 1-choose(240,12)/choose(250,12)
[1] 0.3942094
> heads=c(0:100)
> prob=choose(100,heads)/(2^100)
> sum(prob)
[1] 1
> prob
[1] 7.888609e-31 7.888609e-29 3.904861e-27 1.275588e-25 3.093301e-24
[6] 5.939138e-23 9.403635e-22 1.262774e-20 1.467975e-19 1.500596e-18
................................
[91] 1.365543e-17 1.500596e-18 1.467975e-19 1.262774e-20 9.403635e-22
[96] 5.939138e-23 3.093301e-24 1.275588e-25 3.904861e-27 7.888609e-29
[101] 7.888609e-31
> sum(prob[61:100])     # prob[61:100] gives P(X = 60), ..., P(X = 99)
[1] 0.02844397
Homework.
1. Read Section 2.3 of Devore and Berk.
2. Do problems 2.35,40,42 of Devore and Berk.
3. A company receives a shipment of 100 computer chips. Inevitably, there will be defective
chips in the shipment. However the company is willing to accept the shipment if there
are no more than 5 defective chips in the shipment. Unfortunately, testing the chips for
defects is expensive and destructive. Suppose that the company decides to test 10 chips
and decides to reject the shipment if there is at least one defective chip among the 10 tested.
(a) If there are 6 defective chips in the shipment, what is the probability that the company will reject the shipment?
(b) If there are only 5 defective chips in the shipment, what is the probability that the
company will reject the shipment?
(c) If the company is limited to testing just 10 chips, do you think that it is employing
the right decision rule?
5 Conditional Probability
Goal: To compute the probability of an event A given knowledge as to whether
another event B has occurred
5.1 Examples
In each of the following examples, it appears that event A and event B “depend” on each other.
1. Experiment: Choose a Calvin senior at random. Event A: Student has a GPA greater
than 3.5. Event B: Student has an ACT score greater than 30.
2. Experiment: Choose 12 Raisin Bran boxes from a shipment of 250. Event A: There is a
box weighing less than 19.23 ounces. Event B: The average weight of the 12 boxes is less
than 19.9 ounces.
3. Experiment: Throw two dice. Event A: The sum of the two dice is twelve. Event B: A
six occurs on at least one of the two dice.
5.2 Conditional Probability Defined
Definition 5.1. Suppose that A and B are events such that P (B) > 0. The conditional
probability of A given B, denoted P (A|B), is
P(A|B) = \frac{P(A ∩ B)}{P(B)}
Example 5.2. In example 3 above, we have P(A) = 1/36, P(B) = 11/36, and P(A ∩ B) = 1/36, so P(A|B) = 1/11 and P(B|A) = 1.
Proposition 5.3. Given a sample space S and an event B with P(B) > 0, the function P′(A) = P(A|B) is a probability function defined on the new sample space S′ = B.
5.3 Using Conditional Probability to Compute Unconditional Probabilities
Since P (A ∩ B) = P (B)P (A|B), we can use conditional probabilities (such as P (A|B)) to
compute unconditional probabilities such as P (A ∩ B).
Sampling without replacement is a typical example.
Example 5.4. A class of 31 calculus students has 14 females. If a random sample of size 2 is chosen from the class, what is the probability that both are female?
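One way to work Example 5.4 is to use P(A ∩ B) = P(B)P(A|B): the probability that the first student chosen is female is 14/31, and given that, the probability that the second is also female is 13/30. A quick check in R:
> (14/31)*(13/30)    # P(both female)
[1] 0.1956989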
Example 5.5. A certain rare disease has an incidence of 0.1% in the general population. There
is a test for this disease but the test can be in error. It is estimated that the test indicates
false positives 1% of the time (that is a person that doesn’t have the disease tests positive) and
false negatives 5% of the time (that is a person with the disease tests negative). What is the
probability that a randomly chosen person receives a positive test result?
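For Example 5.5, the unconditional probability of a positive test can be assembled from the conditional probabilities, conditioning on whether the person has the disease (D) or not (D′); a sketch of the intended computation in R:
> 0.95*0.001 + 0.01*0.999    # P(positive) = P(+|D)P(D) + P(+|D')P(D')
[1] 0.01094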
5.4 Independence
Informally, A and B are independent if the fact that B occurs does not affect the probability
that A occurs. Formally,
Definition 5.6. Events A and B are independent if P (A ∩ B) = P (A)P (B).
Note that if P (B) = 0 or P (A) = 0 then events A and B are automatically independent. The
following proposition gives a more intuitive characterization of independence in the case that
P (B) > 0.
Proposition 5.7. Suppose that P (B) > 0. Then events A and B are independent iff P (A) =
P (A|B).
Sampling with replacement is a typical situation where independence is applied.
Homework.
1. Read pages 73–78 and also 83, 84
2. Do problems 2.45, 46, 48, 49, 55, 56
6 Bayes’ Theorem
Goal: To compute conditional probabilities of the form P (B|A) from P (A|B)
6.1 Simple Statement of the Theorem
Theorem 6.1 (Bayes’ Theorem). Suppose that A and B are events such that P (A) > 0 and
P (B) > 0. Then
P(B|A) = \frac{P(A|B)P(B)}{P(A)}
Proof. By the definition of conditional probability
P (A ∩ B) = P (A|B)P (B)
and
P (A ∩ B) = P (B|A)P (A)
The result follows immediately by equating the two expressions for P (A ∩ B).
6.2 Examples
Example 6.2. Medical testing for a rare disease.
T: the person tests positive for a certain disease
D: the person has the disease
Mammograms have, on some reports, a 30% false negative rate: P(T|D) = 0.7. The false positive rate is lower - perhaps 10%; P(T|D′) = 0.1. In typical situations, P(D) = 0.005.
Obviously the important question is what is P (D|T )?
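Using Bayes’ Theorem with the numbers above, a sketch of the computation of P(D|T) in R:
> num=0.7*0.005               # P(T|D)P(D)
> den=0.7*0.005+0.1*0.995     # P(T), conditioning on D and D'
> num/den                     # P(D|T)
[1] 0.03398058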
Example 6.3. The dependability of the judicial system.
G: the defendant is guilty
J: the jury finds the defendant guilty
A scholarly study estimated that in capital cases in Illinois, P(G|J) = .93. How accurate are juries? That is, what are P(J|G) and P(J|G′)?
6.3 Aside - Information Markets
Estimates of probabilities from past events (e.g., mammogram errors) can often be made accurately by computing relative frequencies. What about probabilities of possible future events
that may turn out one way or another?
Example 6.4. What is the probability that the Democratic Party will win the 2008 Presidential Election if Hillary Clinton is nominated?
Events:
• Clinton is Nominated
• Clinton is Elected
• The Democratic Candidate Wins
• McCain is Nominated
• McCain is Elected
• The Republican Candidate Wins
Homework.
1. Read Section 2.4.
2. Do problems 2.59, 60, 106, 109.
7 Random Variables and The Binomial Distribution
Goal: To introduce the concept of random variables by way of an extraordinarily
important example
7.1 Random Variables
Definition 7.1. Given an experiment with sample space S, a random variable is a function X
defined on S that has real number values. For a given outcome o ∈ S, X(o) is the value of the
random variable on outcome o.
We generally use uppercase letters near the end of the alphabet for random variables (X, Y ,
etc.). Examples:
1. Choose a random sample of size 12 from 250 boxes of Raisin Bran. Let X be the random
variable that counts the number of underweight boxes and let Y be the random variable
that is the average weight of the 12 boxes.
2. Choose a Calvin senior at random. Let Z be the GPA of that student and let U be the
composite ACT score of that student.
3. Choose a football player in the National Football League at random. Let W be the weight
(in pounds) of that player.
4. Throw a fair die until all six numbers have appeared. Let T be the number of throws
necessary.
For most purposes, we can consider a random variable as an experiment with outcomes that
are numbers. Random variables can have finitely many or infinitely many different values.
Definition 7.2. A random variable is discrete if it has only finitely many different values or
infinitely many values that can be listed in a list v1 , v2 , v3 , . . . .
Notice that X, U and T above are discrete random variables while the others are not.
7.2 Binomial Random Variables
A binomial experiment is a random experiment characterized by the following conditions:
1. The experiment consists of a sequence of finitely many (n) trials of some simpler experiment.
2. Each trial results in one of two possible outcomes, usually called success (S) and failure
(F ).
3. The probability of success on each trial is a constant denoted by p.
4. The trials are independent one from another - that is the outcome of one trial does not
affect the outcome of any other.
Thus a binomial experiment is characterized by two parameters, n and p.
Definition 7.3. Given a binomial experiment, the binomial random variable X associated with
this experiment is defined by X(o) is the number of successes in the n trials of the experiment.
Examples:
1. A fair coin is tossed n = 10 times with the probability of a HEAD (success) being p = .5.
X is the number of heads.
2. A basketball player shoots n = 25 freethrows with the probability of making each freethrow
being p = .70. Y is the number of made freethrows.
3. A quality control inspector tests the next n = 12 widgets off the assembly line each of
which has a probability of 0.10 of being defective. Z is the number of defective widgets.
7.3 The pmf of a discrete random variable
Definition 7.4. Suppose that X is a discrete random variable. The probability mass function
(pmf) of X is the function
pX (x) = P r({o : X(o) = x})
(We will write the right hand side of this equation as P (X = x).)
We will drop the subscript X in naming pX if the random variable in question is clear. And
sometimes, as in the case of a binomial random variable, we will give the pmf a more suggestive
name. The pmf is also sometimes called the distribution of X.
Theorem 7.5 (The Binomial Distribution). Suppose that X is a binomial random variable
with parameters n and p. The pmf of X, denoted by b(x; n, p), is given by
b(x; n, p) = \binom{n}{x} p^x (1 − p)^{n−x}
7.4 R Computes the pmf of any Binomial Random Variable
> help(dbinom)
> dbinom(x=7,size=10,prob=.7)
[1] 0.2668279
> x=c(0:10)
> dbinom(x=x,size=10,prob=.7)
[1] 0.0000059049 0.0001377810 0.0014467005 0.0090016920 0.0367569090
[6] 0.1029193452 0.2001209490 0.2668279320 0.2334744405 0.1210608210
[11] 0.0282475249
> y=dbinom(x=x,size=10,prob=.7)
> plot(x,y)
>
The Binomial Distribution
Description:
Density, distribution function, quantile function and random
generation for the binomial distribution with parameters ’size’
and ’prob’.
Usage:
dbinom(x, size, prob, log = FALSE)
pbinom(q, size, prob, lower.tail = TRUE, log.p = FALSE)
qbinom(p, size, prob, lower.tail = TRUE, log.p = FALSE)
rbinom(n, size, prob)
Arguments:
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If ’length(n) > 1’, the length is
taken to be the number required.
size: number of trials.
prob: probability of success on each trial.
..............
Value:
’dbinom’ gives the density, ’pbinom’ gives the distribution
function, ’qbinom’ gives the quantile function and ’rbinom’
generates random deviates.
If ’size’ is not an integer, ’NaN’ is returned.
[Figure 1: b(x; 10, .7) — the plot of the pmf values y = dbinom(x, 10, .7) against x produced by plot(x,y) above]
Homework.
1. It’s not easy to give a reading assignment as we have just jumped all around in Chapter
3. However this material is covered on pages 95, 99, and 125–128.
2. For each of the following binomial random experiments, use R to compute the indicated
probability. In each, identify clearly the parameters of the experiment and the outcome
of a trial that you are considering a “success.”
(a) 100 boxes of Raisin Bran coming off the production line are inspected. The probability that any one of these is significantly underweight is .01. What is the probability that none of the 100 boxes are underweight? one is underweight? two or more are underweight?
(b) Jermaine Dye hits a homerun in 7.48% of his plate appearances. Suppose that he
comes to the plate 4 times in tomorrow night’s game. What is the probability that
he will get at least one homerun in that game?
(c) A section of SAT problems has 12 multiple choice problems, each with five possible
options. What is the probability that a guesser will be so unlucky as to get every
one of the problems wrong?
(d) Normally, the random variable associated with the roll of a die has 6 outcomes.
But suppose that we are concerned only with whether a six appears on the die or
not. What is the probability that a six occurs at least three times in five rolls (an
important question in the game of Yahtzee!)? What is the probability that one of
the six numbers occurs at least three times in five rolls (justify your answer)?
8 Binomial Distribution Continued and The Hypergeometric Distribution
Goal: To introduce further properties of random variables using the example of the
binomial distribution and to introduce the hypergeometric distribution
8.1 The Cumulative Distribution Function
The probability mass function (pmf) of a discrete random variable X is sufficient to answer any
probability question about X. However answering such questions as
What is the probability of between 40 and 60 heads in 100 tosses of a fair coin?
requires a considerable amount of addition.
Definition 8.1. The cumulative distribution function (cdf ) of a discrete random variable X
with pmf p is the function F defined by

F(x) = P(X ≤ x) = ∑_{y ≤ x} p(y)

We will write F_X for the cdf of X when the relevant random variable is in doubt.
Proposition 8.2. For any discrete random variable X, the cdf F of X satisfies
1. lim_{x→−∞} F(x) = 0
2. lim_{x→∞} F(x) = 1
3. if x_1 < x_2 then F(x_1) ≤ F(x_2)
4. p(b) = F(b) − F(b−) where F(b−) = lim_{x→b−} F(x).
> pbinom(q=c(0:10),size=10,prob=.7)     # pbinom computes the cdf
[1] 0.0000059049 0.0001436859 0.0015903864 0.0105920784 0.0473489874
[6] 0.1502683326 0.3503892816 0.6172172136 0.8506916541 0.9717524751
[11] 1.0000000000
> pbinom(11,10,.7)
[1] 1
> pbinom(-3,10,.7)
[1] 0
> pbinom(8,10,.7)-pbinom(5,10,.7)
[1] 0.7004233
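The question posed above about 100 tosses can now be answered with pbinom; a sketch (the numerical value indicated in the comment is approximate):
> pbinom(60,100,.5)-pbinom(39,100,.5)    # P(40 <= X <= 60) for 100 tosses of a fair coin; roughly 0.965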
8.2 Simulations
In order to get a better understanding of a random process, it is helpful to be able to perform the
underlying random experiment many times. While it is usually impractical or even impossible
to do that with a real experiment, if the pdf is known, the computer can be used to simulate
the random variable.
Example 8.3. Suppose that a manufacturing process produces widgets in lots of 1,000 and
the probability that any particular widget is defective is 0.1%. In R, the command rbinom
simulates any particular binomial distribution. The following code simulates 100 lots of 1,000
widgets.
> sim=rbinom(100,size=1000,prob=.001)
> table(sim)
sim
0 1 2 3 4
43 32 18 5 2
> dbinom(c(0:4),1000,.001)
[1] 0.36769542 0.36806349 0.18403174 0.06128251 0.01528996
> 1-pbinom(4,1000,.001)
[1] 0.003636878
8.3 The Hypergeometric Distribution
An hypergeometric experiment is characterized by the following assumptions.
1. there is a population of N individuals,
2. the individuals are divided into two groups, one of M individuals (called the “success”
group) and one of N − M individuals (called the “failure” group)
3. a sample of n individuals is selected without replacement in such a way that each subset
of size n is equally likely to be chosen.
Given a hypergeometric experiment with parameters n, N , and M , the random variable X that
counts the number of successes in the sample is said to have a hypergeometric distribution.
Proposition 8.4. A random variable X that has a hypergeometric distribution with parameters
n, N and M has pmf given by

h(x; n, M, N) = \frac{\binom{M}{x}\binom{N−M}{n−x}}{\binom{N}{n}},        max{0, n − N + M} ≤ x ≤ min{M, n}
In R, the functions dhyper and phyper compute the pmf and cdf of a hypergeometric random
variable. Unfortunately, R uses completely different notation than above. In R, m is used for
M , but n is used for N − M . In other words, m is the number of successes and n is the
number of failures in the population which therefore has size m+n. The size of the sample is
called k in R. Therefore the following R code references a hypergeometric distribution where
the population has size 10, there are 7 successes and 3 failures, and we are choosing 5 objects
without replacement.
> x=c(0:5)
> dhyper(x,m=7,n=3,k=5)
[1] 0.00000000 0.00000000 0.08333333 0.41666667 0.41666667 0.08333333
> phyper(x,m=7,n=3,k=5)
[1] 0.00000000 0.00000000 0.08333333 0.50000000 0.91666667 1.00000000
> table(rhyper(100,m=7,n=3,k=5))
 2  3  4  5
 6 39 46  9
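As a cross-check, the Raisin Bran calculation from Section 4.4 can be redone with dhyper, treating the 10 underweight boxes as the success group:
> 1-dhyper(0,m=10,n=240,k=12)    # P(at least one underweight box in the sample of 12)
[1] 0.3942094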
Homework.
1. Read Sections 3.1 and 3.2 and pages 134–136 of Devore and Berk. Note the notation
X ∼ Bin(n, p) to signify that X is a random variable that has a binomial distribution
with parameters n and p.
2. Do problems 3.12, 3.23 (of Section 3.2 of Devore and Berk).
3. Do problem 3.36 (in Section 3.5).
4. Do problem 3.82(a,b,c) (in Section 3.6).
9 Expected Values
Goal: To define the expected value of a random variable and to compute expected
values for binomial and hypergeometric random variables
9.1 Expected Value Defined
Definition 9.1. Suppose that X is a discrete random variable with pmf p. The expected value
of X, denoted E(X) or µX , is the following sum, provided that it exists:
E(X) = ∑_x x p(x)
where the sum is taken over all possible values of X.
Example 9.2. Suppose that a fair six-sided die is tossed once and that X is the number that
occurs. Then p(x) = 1/6 for each x = 1, 2, 3, 4, 5, 6 and
E(X) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 21/6 = 3.5
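The same computation in R, as a quick check:
> sum((1:6)*(1/6))    # E(X) for one roll of a fair die
[1] 3.5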
Intuitively, if the experiment yielding the random variable X is performed many times, the
average of the values of X that actually occur should be approximately E(X). Note also that
E(X) is the center of mass of a system of points with mass p(x) located at x.
9.2 The Expected Value of Binomial and Hypergeometric Random Variables
Theorem 9.3. Suppose that X ∼ Bin(n, p). Then E(X) = np.
Theorem 9.4. If X is a hypergeometric random variable with parameters n, M , and N , then
E(X) = nM/N .
The proof of Theorem 9.4 will be deferred until we have more technology making the proof
easy.
9.3 Functions of Random Variables
Suppose that h is a function that is defined on all values of a random variable X. Then we can
define a new random variable Y by Y = h(X).
Example 9.5. Let h(x) = x2 . Then if X is the random variable that is the numerical result
of rolling a fair die, Y = h(X) is simply the square of that value. The pmf of Y is

p_Y(y) = 1/6,        y = 1, 4, 9, 16, 25, 36
If X is a random variable and Y = h(X), we can find E(Y) in two steps: first find the pmf p_Y of Y from p_X and then compute ∑_y y p_Y(y). The next theorem gives us a shortcut.

Theorem 9.6. If X is a random variable and Y = h(X) then

E(Y) = E(h(X)) = ∑_x h(x) p_X(x)

provided that the sum ∑_x |h(x)| p_X(x) exists.
> x=(0:10)
> px=dbinom(x,10,.65)
> sum(x*px)
[1] 6.5
> y=x^2
> sum(y*px)
[1] 44.525
> 6.5^2
[1] 42.25
> r=rbinom(100,10,.65)
> mean(r)
[1] 6.48
> mean(r^2)
[1] 44.04
An important property of expectation is linearity.
Proposition 9.7 (Linearity of Expectation). Suppose that X is a discrete random variable and
a and b are constants. Then E(aX + b) = aE(X) + b.
9.4 The Negative Binomial Random Variable
A negative binomial distribution results from an experiment similar to a binomial experiment.
The distribution is characterized by two parameters: r a positive integer and p a probability.
The conditions for the distribution are the following:
1. The experiment consists of a sequence of independent trials.
2. Each trial can result in either a success or a failure.
3. The probability of a success on each trial is p.
4. The experiment continues until a total of r successes have been observed.
The random variable X that results from a negative binomial experiment is the number of
failures that precede the rth success. (Some books count the total number of trials rather than
the number of failures.) Notice this is the first example of a random variable that can take on
infinitely many values – X can be 0, 1, 2, . . . .
Proposition 9.8. The pmf of a random variable X that results from a negative binomial
experiment with parameters p and r is
nb(x; r, p) = P(X = x) = \binom{x+r−1}{r−1} p^r (1 − p)^x,        x = 0, 1, 2, . . .
Theorem 9.9. The expected value of a negative binomial random variable X with parameters
r and p is r(1 − p)/p.
Obviously, R knows the negative binomial distribution. In dnbinom, the parameters r and p are named size and prob respectively (the calls below abbreviate prob as p, which R’s partial matching allows).
> dnbinom(x=(0:30),size=3,p=1/6)
[1] 0.004629630 0.011574074 0.019290123 0.026791838 0.033489798 0.039071431
[7] 0.043412701 0.046513608 0.048451675 0.049348928 0.049348928 0.048601217
[13] 0.047251183 0.045433830 0.043270314 0.040866408 0.038312257 0.035682985
[19] 0.033039801 0.030431396 0.027895446 0.025460129 0.023145572 0.020965192
[25] 0.018926909 0.017034219 0.015287119 0.013682915 0.012216889 0.010882861
[31] 0.009673654
> pnbinom(30,size=3,p=1/6)
[1] 0.929983
> sim=rnbinom(1000,size=3,p=1/6)
> table(sim)
sim
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
4 14 12 27 25 47 44 35 44 46 48 53 55 40 44 40 49 31 30 33 37 19 28 17 12 21
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 48 49 50 51 52
21 12 17 8 9 10 8 14 7 3 6 6 1 5 2 2 2 1 1 1 1 1 1 1 1 1
53 57 59
1 1 1
Homework.
1. Read Devore and Berk, pages 109–111 and page 138.
2. Do problems 3.29, 3.33, 3.34, 3.37, and 3.87 of Devore and Berk.
3. Consider the random variable D obtained by adding the faces of two fair dice when they are thrown.
(a) Write the pmf of D.
(b) Compute the expected value of D.
4. A basketball player makes 90% of her free throws. Suppose that in practice she shoots free
throws until she misses one. What is the probability that she takes at least 20 throws?
5. One of the first problems in probability theory was proposed to Pascal by the Chevalier
de Mere. The gambling game in question was this. The player throws two dice repeatedly
until a double-six occurs. The gambler wins the game if this happens in 24 or fewer
throws. What is the probability that the gambler wins?
10 Variance
Goal: To define the variance of a discrete random variable
10.1 The Variance
Definition 10.1. Let X be a discrete random variable with pmf p and expected value µ. Then the variance of X, denoted by V(X) or σ_X^2, is

V(X) = E[(X − µ)^2] = ∑_x (x − µ)^2 p(x)

The standard deviation of X, denoted SD(X) or σ_X, is

σ_X = \sqrt{σ_X^2}
The variance of a random variable X is a measure of the spread or range of possible values of
X. The larger the variance of X, the more likely it is that X will take on values far away from
the mean. The following table summarizes the properties of the discrete distributions that we
have met so far.
Distribution          Parameters   pmf                                            Mean          Variance
Binomial              n, p         \binom{n}{x} p^x (1 − p)^{n−x}                 np            np(1 − p)
Hypergeometric        n, N, M      \binom{M}{x}\binom{N−M}{n−x} / \binom{N}{n}    n(M/N)        n(M/N)(1 − M/N)(N − n)/(N − 1)
Negative Binomial     r, p         \binom{x+r−1}{r−1} p^r (1 − p)^x               r(1 − p)/p    r(1 − p)/p^2

10.2 Formulas for Variance
It is often easier to compute the variance using the following fact:
Proposition 10.2. For any discrete random variable X with expected value µ,
V (X) = E(X 2 ) − (E(X))2 = E(X 2 ) − µ2
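Using the values computed in the R session of Section 9.3 for X ∼ Bin(10, .65), this formula gives a quick numerical check against the np(1 − p) entry in the table above:
> 44.525-6.5^2    # E(X^2) - mu^2 for X ~ Bin(10,.65)
[1] 2.275
> 10*.65*.35      # np(1-p), which agrees
[1] 2.275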
The variance of a linear function of X is easily computed from that of X.
Proposition 10.3. Suppose that X is a random variable and Y = aX + b for constants a, b.
Then
σ_Y^2 = a^2 σ_X^2        and        σ_Y = |a| σ_X.
10.3 Chebyshev’s Inequality
Chebyshev’s Inequality says that it is likely that the value of X will be within a few standard deviations of the mean of X.
Theorem 10.4 (Chebyshev’s Inequality). Suppose that X is a random variable with mean µ
and variance σ 2 and let k ≥ 1. Then
P(|X − µ| ≥ kσ) ≤ \frac{1}{k^2}

Proof. Let A = {x : |x − µ| ≥ kσ} and let D be the set of all possible values of X. We have

σ^2 = ∑_{x∈D} (x − µ)^2 p(x) ≥ ∑_{x∈A} (x − µ)^2 p(x) ≥ ∑_{x∈A} k^2 σ^2 p(x)

Therefore

\frac{1}{k^2} ≥ ∑_{x∈A} p(x)
This is the desired inequality. (Notice that the inequality is true for k < 1 as well however it
isn’t very interesting!)
The following R example shows that the Chebyshev bound is rather conservative, at least in the case of this particular random variable. The example being illustrated there is a binomial random variable with n = 100, p = .5 and so µ = 50 and σ = \sqrt{100/4} = 5.

> pbinom(40,100,.5)+(1-pbinom(59,100,.5))    # k=2
[1] 0.05688793
> pbinom(35,100,.5)+(1-pbinom(64,100,.5))    # k=3
[1] 0.003517642
> pbinom(30,100,.5)+(1-pbinom(69,100,.5))    # k=4
[1] 7.85014e-05
Homework.
1. Read Section 3.3 of Devore and Berk.
2. Suppose that a multiple choice test has 30 questions with 5 choices each. Suppose that
the test-taker guesses purely randomly on each question. Let X be the random variable
that counts the number of correct guesses.
(a) Compute the expected value of X.
(b) Compute the variance of X.
(c) Use Chebyshev’s inequality to bound the probability that the test taker gets at least 15 problems right.
(d) Use R to find the exact probability that the test taker gets at least 15 problems
right.
3. Do problem 28 on p. 116 and problem 72 on p. 133.
11 Hypothesis Testing
Goal: To introduce hypothesis testing through the example of the binomial distribution
11.1 Setting
Suppose that a real-world process is modeled by a binomial distribution for which we know n
but do not know p. Examples abound.
Example 11.1. A factory produces the ubiquitous widget. It claims that the probability that
any widget is defective is less than 0.1%. We receive a shipment of 100 widgets. We wonder
whether the claim about the defective rate is really true. The shipment is an example of a
binomial experiment with n = 100 and p unknown.
Example 11.2. A National Football League team is trying to decide whether to replace its
field goal kicker with a new one. The team estimates that the current kicker makes about 30%
of his kicks from 45 yards out. They want to try out the new kicker by asking him to try 20
kicks from 45 yards out. This might be modeled by a binomial distribution with n = 20 and p
unknown. The team is hoping that p > .3.
Example 11.3. A standard test for ESP works as follows. A card with one of five printed
symbols is selected without the person claiming to have ESP being able to see it. The purported
psychic is asked to name what symbol is on the card while the experimenter looks at it and
“thinks” about it. A typical experiment consists of 25 trials. This is an example of a binomial
experiment with n = 25 and unknown p. The experimenter usually believes that p = .2.
In each of these examples, we have a hypothesis about p that we can consider ourselves to be testing. In Example 11.1, we are testing whether p = .001. In Example 11.2, we are testing
whether p ≤ .3. Finally, in Example 11.3, we are obviously testing the hypothesis that p = .2.
11.2 Hypotheses
A hypothesis proposes a possible state of affairs with respect to a probability distribution
governing an experiment that we are about to perform. Examples:
1. A hypothesis stating a fixed value of a parameter: p = .5.
2. A hypothesis stating a range of values of a parameter: p ≥ .7.
3. A hypothesis about the nature of the distribution itself; X has a binomial distribution.
In a typical hypothesis test, we pit two hypotheses against each other:
1. Null Hypothesis. The null hypothesis, usually denoted H0 , is generally a hypothesis
that the data analysis is intended to investigate. It is usually thought of as the “default” or
“status quo” hypothesis that we will accept unless the data gives us substantial evidence
against it.
2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha , is the
hypothesis that we are wanting to put forward as true if we have sufficient evidence
against the null hypothesis.
3. Possible Decisions. On the basis of the data we will either reject H0 (in favor of Ha) or fail to reject H0.
4. Asymmetry. Note that H0 and Ha are not treated equally. The idea is that H0 is the
default and only if we are reasonably sure that H0 is false do we reject it in favor of Ha .
H0 is “innocent until proven guilty” and this metaphor from the criminal justice system
is good to keep in mind.
In the examples above, the pairs of hypotheses that we are probably wanting to test are:
        Example 11.1      Example 11.2      Example 11.3
H0      p = .001          p = .3            p = .2
Ha      p > .001          p > .3            p > .2

11.3 Decisions and Errors
How do we decide to reject H0 ? Obviously, we perform the experiment and decide whether the
result is in greater accord with H0 or Ha . There are two types of errors that we could make.
Definition 11.4. A Type I error is the error of rejecting H0 even though it is true. The
probability of a type I error is denoted by α.
A Type II error is the error of not rejecting H0 even though it is false. The probability of a
Type II error is denoted by β.
In all of the examples above, a reasonable strategy to follow is to perform the experiment and
reject H0 if the resulting number of successes is too large.
Definition 11.5. A test statistic is a random variable on which the decision is to be based.
A rejection region is the set of all possible values of the test statistic that would lead us to
reject H0 .
In all these examples, our test statistic will simply be the number of successes of the binomial
experiment X. Our rejection region will be sets of the form R = {x : x ≥ x0 } for some constant
x0 . The number x0 will be chosen based on our choice of α.
Convention. Random variables are denoted by uppercase letters such as X. Random variables
are functions to be applied to the outcomes of experiments. The value of a random variable X
computed on a particular trial of the experiment will be denoted by the corresponding lowercase
letter, in this instance x. After the experiment, x is data and is simply a number.
The following R output provides relevant information for Example 11.3.
> pbinom(c(5:10),25,.2)
[1] 0.6166894 0.7800353 0.8908772 0.9532258 0.9826681 0.9944451
Consider, in this example, the decision rule:
Reject H0 if and only if x ≥ 10.
This decision procedure has a probability of less than 2% of making a type I error. (If the null
hypothesis is true, the probability is 98.2% that we will not get a value this extreme.)
One approach to setting up a test procedure is to choose a rejection region in advance of the
experiment based on some desired limit on the value of α. Typical choices are .05 and .01. For
the above example, if we decide that α should be no more than .05 we would reject H0 if x ≥ 9.
If we are more conservative and desire α ≤ .01, then we would reject H0 if x ≥ 11.
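The α values for these two rules can be read off from the pbinom output above or computed directly; a quick check:
> 1-pbinom(8,25,.2)     # alpha for the rule: reject H0 if x >= 9
[1] 0.0467742
> 1-pbinom(10,25,.2)    # alpha for the rule: reject H0 if x >= 11
[1] 0.0055549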
Another approach to testing is to avoid defining the rejection region altogether and to report
the result of the experiment in such a way that those interpreting the results can choose their
own α.
Definition 11.6. The p-value of the value x of a test statistic X is the probability that, if H0
is true, X would have a value at least as extreme (in favor of the alternate hypothesis) as x.
In the above example, the p-value of an outcome of 10 is 1 − .982 = .018. The p-value tells us
exactly for which values of α we would reject H0.
Note the extraordinarily confusing feature of the name of the p-value. There are
two completely different p’s in this story. There is the p that is the value of the
parameter of the underlying binomial distribution. And there is the p that results
from analyzing the data. We’ll just have to deal with this. By the way, some authors
call the parameter in the binomial distribution π. That would help.
Homework.
1. Devore and Berk pages 418–422 cover this material. They might be somewhat difficult to
read however as they use language developed over the course of Chapters 7 and 8.
2. Do problems 3.68 and 3.69 of Devore and Berk.
3. Do problem 9.9 of Devore and Berk (yes, Chapter 9). This introduces a new wrinkle in
that the alternate hypothesis is two-sided.
4. In Example 11.2, the team decides to fire the old kicker and hire the new one if the kicker
makes 8 or more kicks (out of the 20 trial kicks).
(a) What is the p-value of this test?
(b) Suppose that the new kicker actually has a 35% probability of making such kicks.
What is the probability that the team will hire the new kicker based on this test?
12 Continuous Random Variables
Goal: To introduce the concept of continuous random variable
12.1 Provisional Definition
Definition 12.1. A random variable X is continuous if its possible values consist of a union
of intervals of real numbers and the probability that X is any single real number is 0.
We will have a better definition later. This one will suffice for now. The basic idea is that
continuous random variables are those which represent measurements that are taken on a real
number scale.
Example 12.2. Suppose a student is selected at random from the Calvin population. Continuous random variables that we might report are weight, height, GPA, cholesterol level, etc.
12.2 Probability as Area or Mass

[Figure 2: Senior GPAs — two histograms of the senior GPA data, the first with counts and the second with density units on the y axis]
The two histograms are of the GPAs of all seniors at a certain college located somewhere in
Western Michigan. The difference between the two histograms is in the units used on the y
axis. In the first histogram, one can read off the counts (out of the 1333 seniors) of the number
of seniors in each bin. The second histogram is a density histogram. The y-axis units are called
density units and are scaled so that the total area of all bins in the histogram is 1 (where area
is simply width in x units times height in these density units).
Important Observation 12.3. Given a density histogram of a population, the probability
that a randomly chosen individual from the population comes from any particular bar of the
histogram is equal to the area of that bar.
12.3 Density as Limit of Histograms

[Figure 3: GPA Data, Different Bin Widths — two density histograms of the GPA data, the one on the right with narrower bins]
The histogram on the right, with its smaller bin widths, gives a better picture of the distribution
of senior GPAs. With a finite population, there is a limit to the narrowness of the bins. With
very small bin widths, many bins would have no individuals while some might have two or three.
This would lead to a very spiky appearance and a loss of information. The histogram in Figure 4 has a smooth curve superimposed, and this curve seems to approximate the distribution of the GPAs well. We will defer the discussion of exactly how the curve is drawn until much later in the course, but a key property of the histograms that it preserves is that the area under the curve
is 1.
There are two ways to view this curve. On the one hand we might consider it a smooth approximation of a finite population. On the other hand, we might think of the curve as modelling
a conceptual population of something like “all possible seniors at the unnamed college.” In
general, of course such a curve can only fit exactly an infinite, conceptual population.
[Figure 4: GPA Data with Smooth Density — a density histogram of the GPA data with a smooth density curve superimposed]
12.4 Probability Density Functions
The above discussion motivates the following definition.
Definition 12.4. A probability density function (pdf) is a function f(x) such that

1. f(x) ≥ 0 for all x and
2. ∫_{−∞}^{∞} f(x) dx = 1.

A pdf is the pdf for a random variable X if for every a, b

P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Example 12.5. A model for the waiting time X (in minutes) to the next radioactive event
recorded on a Geiger counter is given by the following pdf:
f(x) = 100e^{−100x},        x ≥ 0
The probability that we will wait at most 0.01 minutes is given by
∫_0^{0.01} f(x) dx = [−e^{−100x}]_0^{0.01} = 1 − e^{−1} ≈ .632
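The same probability can be obtained in R from the exponential cdf; pexp uses the rate parameterization, here rate = 100. A quick check:
> 1-exp(-1)              # the value 1 - e^(-1) from the computation above
[1] 0.6321206
> pexp(0.01,rate=100)    # P(X <= 0.01)
[1] 0.6321206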
12.5 R Code for Drawing Histograms
> sr=read.csv(’sr.csv’)
> sr
     SATM SATV ACT   GPA
1     700  670  30 3.992
2      NA   NA  25 3.376
3      NA   NA  20 3.020
4      NA   NA  24 3.509
5      NA   NA  32 3.970
10    710  680  32 4.000
......................
1330  670  700  27 3.830
1331   NA   NA  25 2.234
1332   NA   NA  27 3.163
1333   NA   NA  24 2.886
> hist(sr$GPA)
> hist(sr$GPA,br=seq(1.7,4.0,.230))
> layout(matrix(c(1,2),1,2))
> layout.show(2)
> hist(sr$GPA,br=seq(1.7,4.0,.230))
> hist(sr$GPA,br=seq(1.7,4.0,.230),prob=T)
> layout(1)
> hist(sr$GPA,br=seq(1.7,4.0,.115),prob=T)
> d=density(sr$GPA)
> lines(d)
Homework.
1. Read Devore and Berk, pages 155–159
2. Do problems 4.1,3,4,10
13 Continuous Random Variables - II
Goal: To define the cdf of a continuous random variable and to introduce two important families of random variables
13.1 The cdf of a Random Variable
Definition 13.1. The cumulative distribution function (cdf ) of a continuous random variable
X is the function F (x) = P (X ≤ x).
The following proposition is almost the same as Proposition 8.2.
Proposition 13.2. For any continuous random variable X, the cdf F of X satisfies
1. lim_{x→−∞} F(x) = 0
2. lim_{x→∞} F(x) = 1
3. if x1 < x2 then F (x1 ) ≤ F (x2 )
4. for all a, F (a) = P (X ≤ a) = P (X < a)
5. for all a, b, P (a ≤ X ≤ b) = F (b) − F (a).
We now provide the correct definition of a continuous random variable.
Definition 13.3. A random variable X is continuous if the cdf F of X is a continuous function.
Part (4) of the last proposition is not true for the cdf of a discrete random variable but is the
new feature of continuous random variables.
If X is a continuous random variable, X does not necessarily have a pdf but it always has a
cdf. However if X has a pdf, we have
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(y) dy
More importantly,
Theorem 13.4. Suppose that X is a continuous random variable with continuous pdf f . Then
the cdf F of X satisfies F′(x) = f(x) for all x.
13.2 Uniform Random Variables
Definition 13.5. A random variable X has a uniform distribution with parameters A < B if
the pdf of X is
f(x) = \frac{1}{B − A},        A ≤ x ≤ B
(Convention: a definition of a pdf such as that of f above implies that f (x) = 0 for all x not in
the specified domain.)
R commands dunif, punif, and runif compute the pdf, cdf, and random numbers from a
uniform distribution.
> punif(.3,0,1)
[1] 0.3
> punif(13,10,20)
[1] 0.3
> s=runif(100,0,1)
> hist(s,prob=T)
13.3 Exponential Random Variables
Definition 13.6. Random variable X has an exponential distribution with parameter λ if X
has pdf
f (x) = λe−λx
x≥0
Besides the commands dexp, pexp and rexp, the command qexp computes quantiles (or percentiles) of the exponential distribution.
Definition 13.7. If p is a real number such that 0 < p < 1, the 100pth -percentile of X is any
number q such that P (X ≤ q) = p. If p = .5, a 100pth -percentile is called a median of X.
For a continuous random variable X, F (q) = p if q is the 100pth -percentile. However this is not
usually the case for discrete random variables.
> x=seq(0,10,.01)
> p=dexp(x,.5)
> plot(x,p)
> pexp(1,.5)
[1] 0.3934693
> s=rexp(100,.5)
> hist(s)
> qexp(.25,.5)
[1] 0.5753641
> qexp(.5,.5)
[1] 1.386294
> qexp(.75,.5)
[1] 2.772589
> qexp(1,.5)
[1] Inf
>
The exponential distribution has an important property that characterizes it and dictates when
it should be used in an application. Suppose that X is an exponential random variable with
parameter λ that is used to model the waiting time in minutes for something to occur. Then
the probability that the waiting time exceeds t minutes is
P(X ≥ t) = ∫_t^∞ f(x) dx = e^{−λt}

Now suppose that we have an exponential random variable and we have already waited t_0 units of time. The probability that we must wait at least t more units of time is

P(X ≥ t + t_0 | X ≥ t_0) = \frac{e^{−λ(t+t_0)}}{e^{−λt_0}} = e^{−λt}
In other words, the conditional probability that we must wait at least t more minutes does not
depend on how long we have already waited. This property of the exponential distribution is
referred to as the memoryless property.
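The memoryless property is easy to check numerically; with λ = .5, t_0 = 2 and t = 3, for example (a quick sketch):
> (1-pexp(5,.5))/(1-pexp(2,.5))    # P(X >= 2+3 | X >= 2)
[1] 0.2231302
> 1-pexp(3,.5)                     # P(X >= 3): the same value, as the property predicts
[1] 0.2231302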
Homework.
1. Read Section 4.1.
2. Do problems 4.11, 4.12, 4.14, 4.16
14 Expected Values of Continuous Random Variables
Goal: To define the expected value of continuous random variables and introduce
an important theoretical tool - the moment generating function
14.1 Expected Value
Definition 14.1. The expected value of a random variable X with pdf f is
E(X) = µ_X = ∫_{−∞}^{∞} x f(x) dx
Note that it is possible that the expected value of X does not exist due to the divergence of the
improper integral. However if the range of X is a finite interval, the expected value will always
exist.
Example 14.2. The expected value of a uniform random variable with parameters A and B
is (A + B)/2. The expected value of an exponential random variable with parameter λ is 1/λ.
14.2 Functions of Random Variables
Just as in the discrete case, given a function h and a random variable X we can define a new
random variable Y = h(X). To compute E(Y ) by the definition we would have to find the pdf
of Y and then integrate. Techniques for finding the pdf of Y will be discussed in a later section
but it is not always easy to do. However the following theorem saves us.
Theorem 14.3 (The Law of the Lazy Statistician). Suppose that X is a continuous random
variable with pdf f and that Y = h(X). Then
E(Y) = E(h(X)) = ∫_{−∞}^{∞} h(x) f(x) dx
Example 14.4. Suppose that a random variable has a uniform distribution with parameters
A = 0 and B = 1. Then Y = X^2 has expected value

µ_Y = ∫_0^1 x^2 dx = \frac{1}{3}.
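The integral in Example 14.4 can be checked numerically in R with integrate; dunif supplies the uniform pdf (a quick sketch):
> f=function(x) x^2*dunif(x,0,1)    # the integrand h(x)f(x)
> integrate(f,0,1)$value            # E(X^2); should be 1/3
[1] 0.3333333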
Note that for all constants a, b,
E(aX + b) = aE(X) + b
On the basis of this property, we say that E is a linear operator.
14.3 Variance
Definition 14.5. The variance of a continuous random variable with pdf f and mean µ is
V(X) = σ_X^2 = E[(X − µ)^2] = ∫_{−∞}^{∞} (x − µ)^2 f(x) dx

The standard deviation of X, σ_X, is given by σ_X = \sqrt{σ_X^2}.
Just as in the case of a discrete random variable, we have the following computational formula
for the variance:
σ_X^2 = E(X^2) − µ_X^2

The variance of a uniform random variable with parameters A and B is (B − A)^2/12. The variance of an exponential random variable with parameter λ is 1/λ^2.
14.4 The Moment Generating Function
Definition 14.6. The moment generating function of a random variable X is the function M
defined by
M(t) = E(e^{tX})

provided this expectation exists for all t in some open interval containing 0.
Note that the definition makes sense for discrete as well as continuous random variables. Note
also that M(0) = 1 but that M(t) may not exist for any t ≠ 0. That is the issue with the
provision at the end of the definition.
Example 14.7. Suppose that X is an exponential random variable with parameter λ. Then

M(t) = ∫_0^∞ e^{tx} λe^{−λx} dx = \frac{λ}{λ − t}

which exists for all t < λ.
Let M_X^{(r)}(t) denote the rth derivative of M_X. Then

Proposition 14.8. Suppose that X is a random variable and M_X(t) exists. Then

E(X^r) = M_X^{(r)}(0)
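For instance, applying Proposition 14.8 to the mgf of Example 14.7 recovers the mean of an exponential random variable found in Example 14.2 (a quick check):

M′(t) = \frac{λ}{(λ − t)^2},        so        E(X) = M′(0) = \frac{1}{λ}.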
An important property of the moment generating function is the following.
Theorem 14.9. Suppose that X and Y are random variables that have moment generating
functions. Then X and Y are identically distributed (have the same cdf ) if and only if MX (t) =
MY (t) for all t in some open interval containing 0.
Homework.
1. Read Section 4.2.
2. Do problems 4.18,21,25,28,31
15 Normal Distribution
Goal: To define the normal distribution and introduce its properties
15.1 The Normal Distribution
Definition 15.1. A continuous random variable X has the normal distribution with parameters µ and σ (−∞ < µ < ∞ and 0 < σ) if the pdf of X is

f(x; µ, σ) = \frac{1}{\sqrt{2π} σ} e^{−(x−µ)^2/(2σ^2)},        −∞ < x < ∞
We write X ∼ N (µ, σ) if X is a normal random variable with parameters µ and σ. Annoyingly,
some books will use σ 2 as the parameter so one has to check carefully. R uses σ. Note that the
pdf of a normal random variable is unimodal and symmetric about x = µ.
Proposition 15.2.
∫_{−∞}^{∞} \frac{1}{\sqrt{2π} σ} e^{−(x−µ)^2/(2σ^2)} dx = 1
The parameters µ and σ are aptly named.
Proposition 15.3. If X is a normal random variable with parameters µ and σ, then
µ_X = µ        and        σ_X = σ
Proposition 15.4. The moment generating function of a normal random variable with parameters µ and σ is
M_X(t) = e^{µt + σ^2 t^2/2}
15.2 The Standard Normal Distribution
The normal distribution with mean µ = 0 and standard deviation σ = 1 is called the standard
normal distribution. Such a random variable is often named Z and the cdf of Z is often called
Φ. In other words
Φ(z) = P(Z ≤ z) = ∫_{−∞}^{z} \frac{1}{\sqrt{2π}} e^{−x^2/2} dx
The standard normal distribution occurs so frequently, that a standard terminology has developed to name certain of its properties. For example
Definition 15.5. If Z is a standard normal random variable and 0 < α ≤ .5, then zα denotes
the number such that
P(Z ≥ z_α) = α
It is helpful to commit to memory certain important values of zα .
z.05 = 1.645
z.025 = 1.96
z.005 = 2.58
Due to the symmetry of the normal pdf, we can use these values to write probability statements
such as
P (−1.96 < Z < 1.96) = 1 − 2(.025) = .95
That is, approximately 95% of the probability is within 2σ of µ.
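These values can be reproduced with qnorm (as illustrated again in Section 15.4); a quick check:
> qnorm(c(.95,.975,.995))    # z_.05, z_.025, z_.005
[1] 1.644854 1.959964 2.575829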
We can directly translate results about the standard normal distribution to any normal distribution using the following.
Proposition 15.6. If X is a normal random variable with parameters µ and σ, then
Z = \frac{X − µ}{σ}

is a standard normal random variable. In other words

P(X ≤ x) = Φ\left(\frac{x − µ}{σ}\right)
15.3 The Normal Approximation of the Binomial
We will eventually have a whole theory of approximating various distributions with a normal
distribution. One important example is the following
Proposition 15.7. Suppose that X is a binomial random variable with parameters n and p
and such that n is large and p is not too close to the extreme values of 0 and 1. Then the distribution of X is approximately normal with µ = np and σ = \sqrt{np(1 − p)}. In other words

P(X ≤ x) ≈ Φ\left(\frac{x + .5 − np}{\sqrt{np(1 − p)}}\right)        (15.1)
(The .5 in the numerator is called the “continuity correction” and it attempts to account for the
fact that X is discrete while the standard normal distribution is continuous.)
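As a quick illustration, the approximation can be compared with an exact value from the ESP example of Section 11 (n = 25, p = .2); a sketch, with the approximate value left in the comment:
> pbinom(9,25,.2)                   # exact P(X <= 9)
[1] 0.9826681
> pnorm((9+.5-5)/sqrt(25*.2*.8))    # approximation (15.1); roughly 0.988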
15.4 R examples
> x=seq(0,10,.1)
> p=dnorm(x,5,2)
> plot(x,p)
> pnorm(7,5,2)
[1] 0.8413447
> pnorm(1.64,0,1)
[1] 0.9494974
> qnorm(.95,0,1)
[1] 1.644854
> sim=rnorm(1000,0,1)
> hist(sim)
Homework.
1. Read Section 4.3 of Devore and Berk.
2. Do Problems 4.40,42,55,60
3. In this problem we investigate the accuracy of the approximation in Equation 15.1.
(a) Suppose that X ∼ Bin(100, .5). What percentage error does Equation 15.1 make in
approximating P (X ≤ 20)? P (X ≤ 30)? P (X ≤ 40)? P (X ≤ 45)?
(b) Suppose that X ∼ Bin(100, .1). What percentage error does Equation 15.1 make in
approximating P (X ≤ 5)? P (X ≤ 10)? P (X ≤ 15)? P (X ≤ 20)?
(c) Suppose that X ∼ Bin(30, .3). What percentage error does Equation 15.1 make in
approximating P (X ≤ 4)? P (X ≤ 6)? P (X ≤ 8)? P (X ≤ 12)?
16 The Gamma Function and Distribution
Goal: To define the gamma distribution
16.1 The Gamma Function
Definition 16.1. The gamma function Γ is defined by
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx,        α > 0.
Proposition 16.2. The gamma function satisfies the following properties
1. Γ(1) = 1
2. For all α > 1, Γ(α) = (α − 1)Γ(α − 1)
3. For all natural numbers n > 1, Γ(n) = (n − 1)!
4. Γ(1/2) = \sqrt{π}.
16.2 The Gamma Distribution
Definition 16.3. A continuous random variable X has the gamma distribution with parameters
α and β (X ∼ Gamma(α, β)) if the pdf of X is
f(x; α, β) = \frac{1}{β^α Γ(α)} x^{α−1} e^{−x/β},        x ≥ 0
It is easy to see that f (x; α, β) is a pdf by using the properties of the gamma function. The
parameter α is usually called a “shape” parameter and β is called a “scale” parameter.
Proposition 16.4. If X ∼ Gamma(α, β), then
µX = αβ        σX² = αβ²
16.3
Special Cases of the Gamma Distribution
The exponential distribution with parameter λ is a special case of the gamma distribution with
α = 1 and β = 1/λ.
The chi-squared distribution has one parameter ν called the degrees of freedom of the distribution and is the special case of the gamma distribution with α = ν/2 and β = 2. Note that this
implies that a chi-squared random variable has µ = ν and σ 2 = 2ν.
R knows both the gamma function and the gamma distribution.
> gamma(4)
[1] 6
> gamma(.5)
[1] 1.772454
> x=seq(0.1,5.0,.01)
> y=dgamma(x,shape=2,scale=3)
> y=pgamma(x,shape=2,scale=3)
> y=dgamma(x,shape=1,scale=2)-dexp(x,1/2)    # should be zeros
> y=dgamma(x,shape=2,scale=2)-dchisq(x,4)    # should be zeros
>
16.4
The Poisson Distribution
The Poisson Distribution is a discrete distribution related to the exponential distribution.
Definition 16.5. A Poisson random variable X with parameter λ is a discrete random variable
with pmf
p(x; λ) = e^{−λ} λ^x / x!,    x = 0, 1, 2, . . .
A Poisson random variable with parameter λ has mean λ and variance λ. The relation to the
exponential distribution is this.
Proposition 16.6. Let P be a process that generates events such that the distribution of elapsed
time between the occurrence of successive events is exponential with parameter λ. Then the
distribution of the number of occurrences in a time interval of length 1 is Poisson with parameter
λ.
We will see a proof of this later in the course.
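Even before the proof, Proposition 16.6 can be checked by simulation. This is only a sketch; λ = 3 and the number of replications are chosen arbitrarily.

lambda = 3
counts = replicate(10000, {
  t = cumsum(rexp(50, lambda))   # arrival times; 50 arrivals is far more than a unit interval needs
  sum(t <= 1)                    # number of occurrences in a time interval of length 1
})
table(counts)/10000              # simulated pmf of the count
dpois(0:9, lambda)               # Poisson pmf for comparison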
Homework.
1. Read Section 4.4.
2. Do problems 4.71, 76, 77.
17
Transformations of Random Variables
Goal: To find the pdf of Y = g(X) given that of X
17.1
The CDF Method
Suppose that X is a continuous random variable and Y = g(X). We will first suppose that g
is one-to-one on the range of X.
Definition 17.1. The range of a random variable X is the set of all x such that X(o) = x for
some outcome of the experiment.
We first illustrate the cdf method with a simple X and an increasing g.
Example 17.2. Suppose that X is uniform on [0, 1] and Y = X 2 so that g(x) = x2 . (Note
that g is not one-to-one on every interval but it is in this case.) The range of Y is [0, 1] as well.
We have
FY(y) = P(Y ≤ y) = P(X² ≤ y) = P(X ≤ √y) = √y,    0 ≤ y ≤ 1
Therefore we have
fY(y) = FY′(y) = 1/(2√y),    0 ≤ y ≤ 1
In this next example, the only thing that changes is that g is decreasing on the range of X.
Example 17.3. Suppose that X is uniform on [0, 1] and Y = 1/X so that g(x) = 1/x. The
range of Y is [1, ∞). We have
FY(y) = P(Y ≤ y) = P(1/X ≤ y) = P(X ≥ 1/y) = 1 − 1/y,    1 ≤ y < ∞
Therefore we have
fY(y) = FY′(y) = 1/y²,    1 ≤ y < ∞
Here is the general result.
Proposition 17.4. Suppose that X is a continuous random variable and Y = g(X). Suppose
further that g is one-to-one and differentiable on the range of X so that g has a differentiable
inverse h. Then
fY (y) = fX (h(y))|h0 (y)|
for all y in the range of Y .
Proof. Suppose that g is an increasing function on the range of X so that h is increasing on
the range of Y . The proof for a decreasing g is similar. Then
FY (y) = P (Y ≤ y) = P (g(X) ≤ y) = P (X ≤ h(y)) = FX (h(y))
Thus
fY(y) = (d/dy) FY(y) = (d/dy) FX(h(y)) = FX′(h(y)) h′(y) = fX(h(y)) h′(y)
17.2
g is not Monotonic
The cdf method still works if g is not monotonic. One simply has to be careful in finding the
appropriate regions on which to evaluate the cdf of X.
Example 17.5. Let Z be a standard normal random variable (i.e., µ = 0 and σ = 1). Recall
that the cdf of Z is denoted Φ and the range of Z is (−∞, ∞). Let Y = Z 2 . To find the pdf of
Y we again use the cdf method.
FY(y) = P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y) = Φ(√y) − Φ(−√y),    y ≥ 0    (17.2)
Therefore
fY(y) = (1/(2√y)) fZ(√y) + (1/(2√y)) fZ(−√y),    y ≥ 0
By the symmetry of fZ , we thus have
fY(y) = (1/√y)(1/√(2π)) e^{−(√y)²/2} = (1/√(2π)) y^{−1/2} e^{−y/2},    y ≥ 0
This density is an example of a gamma distribution. In fact it is a chi-squared distribution with
one degree of freedom.
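A small simulation sketch makes this concrete: squaring standard normal values and comparing the histogram with dchisq(·, 1).

z = rnorm(10000)
hist(z^2, prob=TRUE, breaks=50)
x = seq(0.02, 10, 0.02)
lines(x, dchisq(x, 1))     # chi-squared density with one degree of freedom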
Notice in the preceding example that equation 17.2 does not depend on the fact that we were
using the standard normal distribution. Whenever the transformation Y = X 2 is used we have
that
FY(y) = FX(√y) − FX(−√y),    y ≥ 0
Homework.
1. Read Section 4.7.
2. Do problems 4.109, 110, 114, 118.
18
Jointly Distributed Random Variables
Goal: To consider experiments in which several random variables are observed
18.1
Many Random Variables
It is often the case that a random experiment results in several random variables that are of
interest.
Example 18.1. A senior at a fixed but unnamed college is selected and both her GPA and
ACT score are recorded. The result is an ordered pair: (GPA, ACT). We really want to record
this data as an ordered pair — recording the GPA and ACT values separately loses the fact
that these measurements may be related.
Example 18.2. A mediocre golfer hits a drive. For the drive both the distance travelled D
and the deviation from the center line M are recorded.
Example 18.3. A voter is selected for an exit poll and is asked a dozen yes-no questions. The
answers are recorded as Q1 , . . . , Q12 (using a convention such as 1 means yes and 0 means no).
Formally,
Definition 18.4. If n random variables X1 , . . . , Xn are defined on the sample space of an
experiment, the random vector (X1 , . . . , Xn ) is the function that assigns to each outcome o the
vector (X1 (o), . . . , Xn (o)).
Though many of our examples will be random pairs which we will usually denote (X, Y ) it is
not uncommon in “real” applications to consider random vectors that have a large number of
components. One very important special case of this is to repeat a single random experiment
k times for some large k, and think of the k-different values of a random variable X defined
on that experiment as one k-tuple of values. Essentially, we treat the k replications of an
experiment as one replication of a larger, compound experiment.
18.2
Discrete Random Vectors
If each of the k random variables that form a random vector (X1 , . . . , Xk ) are discrete, we call
the random vector a discrete random vector. The natural extension of the pmf to this situation
is given by the following definition (which we give for the case k = 2).
Definition 18.5. Given random variables X and Y defined on the sample space of a random
experiment, the joint probability mass function p of X and Y is defined by
p(x, y) = P (X = x and Y = y)
Obviously, for random variables X1 , . . . , Xn , the pmf is a function of n-variables usually written
p(x1 , . . . , xn ).
Example 18.6. Suppose that a random experiment consists of tossing two fair dice. Two
random variables associated with this experiment are
S = the smaller of the two numbers that appear
L = the larger of the two numbers that appear
The joint pmf of S and L is given by the following table (entries not shown are 0):

                               S
   L       1       2       3       4       5       6
   1     1/36
   2     1/18    1/36
   3     1/18    1/18    1/36
   4     1/18    1/18    1/18    1/36
   5     1/18    1/18    1/18    1/18    1/36
   6     1/18    1/18    1/18    1/18    1/18    1/36
Notice that if the pmf p for such a random vector is given, we can compute the probability of
any event by adding the appropriate values of p. In particular, we can recover the pmf of each
of the random variables that make up the random vector. For example, if (X, Y ) is a random
pair, we compute pX (x) and pY (y) by
pX(x) = Σ_y p(x, y)          pY(y) = Σ_x p(x, y)
In the case of a random pair (X, Y ) the functions pX (x) and pY (y) are called the marginal
probability mass functions of X and Y respectively.
While pX and pY can always be recovered from the joint pmf p, the converse is not true. An important special case in which the joint pmf is determined by the marginals is the case that X and Y are independent.
Definition 18.7. Discrete random variables X and Y are independent if for every x and y,
p(x, y) = pX (x)pY (y)
This definition really just says that the events X = x and Y = y are independent. The random
variables S and L of Example 18.6 are obviously not independent.
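The failure of independence can be seen numerically by entering the joint pmf of Example 18.6 as a matrix and comparing p(x, y) with pX(x)pY(y); the following is only an illustrative sketch.

p = matrix(0, 6, 6)                     # rows indexed by S, columns by L
for (s in 1:6) for (l in 1:6) {
  p[s, l] = if (s == l) 1/36 else if (s < l) 1/18 else 0
}
pS = rowSums(p); pL = colSums(p)        # marginal pmfs of S and L
p[1, 2]                                 # joint probability 1/18
pS[1] * pL[2]                           # (11/36)(1/12), not equal to p[1,2]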
The definitions extend to discrete random vectors consisting of n random variables. Rather
than give a general definition with an unreasonable amount of notation, we give an example.
Proposition 18.8. Let (X1, . . . , Xn) be a random vector with pmf p. The joint pmf of (X1, X2) is the function pX1,X2 given by
pX1,X2(x1, x2) = Σ_{x3,...,xn} p(x1, x2, . . . , xn)
There is of course a joint pmf for any subvector (Xi1 , . . . , Xik ) of the vector (X1 , . . . , Xn ).
Definition 18.9. The random variables X1 , . . . , Xn are independent if for every subvector
(Xi1 , . . . , Xik ) of the vector (X1 , . . . , Xn ), the joint pmf of (Xi1 , . . . , Xik ) is equal to the product
of the marginal pmfs of the Xij .
18.3
The Multinomial Distribution
The exact generalization of the binomial distribution to an experiment with r > 2 possible
outcomes is the multinomial distribution. Thus the conditions on a multinomial experiment
are:
1. there are n independent trials of a simpler experiment
2. each trial results in one of r possible outcomes
3. the ith outcome happens with probability pi in any trial
Definition 18.10. A random vector (X1 , . . . , Xr ) has the multinomial distribution with parameters n, p1 , . . . , pr if the joint pmf is
p(x1, . . . , xr) = (n!/(x1! · · · xr!)) p1^{x1} · · · pr^{xr},    x1 + · · · + xr = n
> dmultinom(c(1,2,2,2,1,2),10,prob=c(1/6,1/6,1/6,1/6,1/6,1/6))
[1] 0.003750857
> rmultinom(1,10,prob=c(1/6,1/6,1/6,1/6,1/6,1/6))
     [,1]
[1,]    1
[2,]    2
[3,]    3
[4,]    1
[5,]    2
[6,]    1
>
Homework.
1. Read Devore and Berk pages 230, 231.
2. Do problems 5.3, 5.4, 5.8.
19
Continuous Random Vectors
Goal: To extend the concept of random vector to continuous random variables
19.1
Joint Probability Density Functions
Suppose that (X, Y ) is a random vector and that both X and Y are continuous random variables. The notion of pdf extends to this situation in a natural way.
Definition 19.1. Suppose that X and Y are continuous random variables associated with some
experiment. Then a function f (x, y) is a joint probability density function for X and Y if for
every (reasonable) set A in the Cartesian plane
P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy
In the special case that A is the rectangle {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d},
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f(x, y) dy dx
The physical interpretation of probability is the same as in the one variable case. The plane
has mass (probability) 1 and the density function p(x, y) is the mass per unit area at the point
(x, y). A joint density function must satisfy the properties familiar from the one-variable case:
1. For all x, y, we have that f (x, y) ≥ 0,
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1
The definition of pdf is easily extended to random vectors (X1, . . . , Xn); however, multiple integration with n-fold integrals quickly becomes difficult.
19.2
Marginal pdf and Independence
Just as in the single variable case, the pdf of each individual random variable is recoverable
from the joint pdf.
Proposition 19.2. Given a random vector (X, Y ) with pdf f , the marginal pdf of X and Y
can be computed as
fX(x) = ∫_{−∞}^{∞} f(x, y) dy          fY(y) = ∫_{−∞}^{∞} f(x, y) dx
And the natural definition of independence is
Definition 19.3. Random variables X and Y with joint pdf f are independent if for every
pair (x, y)
f (x, y) = fX (x)fY (y)
The definitions of marginal pdf and independence extend to n random variables in the obvious
way.
Homework.
1. Read Devore and Berk Section 5.1.
2. Do problems 5.12, 13, 15
20
Expected Values, the Multivariate Case
Goal: To compute expected values of functions of a random vector
20.1
Computing Expected Values
For ease of notation, we consider random pairs (X, Y ) but we could easily extend to random
n-vectors.
Suppose that Z = h(X, Y ) is a function of the random pair (X, Y ). Since Z is a random
variable, we could compute its expected value if we knew its pmf (pdf). But as in the single
variable case, there is a shortcut. We state the theorem only for the continuous case; the discrete case is the same with sums replacing integrals.
Theorem 20.1. Suppose that Z = h(X, Y ) and the pair (X, Y ) has pdf f (x, y). Then
E(Z) = E[h(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x, y) f(x, y) dx dy
Example 20.2. Suppose that the space shuttle has a part and a spare, and the times to failure X and Y of these parts are exponential with parameter λ. Presumably, the expected value of X + Y is relevant to determining the expected lifetime of the system on which the part depends.
It is often the case (as in the previous example) that we are dealing with independent X and Y .
Depending on the form of h(X, Y ), the work involved in computing E[h(X, Y )] can be greatly
simplified.
Example 20.3. Suppose that X and Y are independent random variables with uniform distributions on [0, 1]. What is E(XY )?
The general fact is
Proposition 20.4. Suppose that X and Y are independent and that h(X, Y ) = f (X)g(Y ) for
some functions f, g. Then
E[h(X, Y )] = E[f (X)]E[g(Y )].
It’s important to realize that E(XY ) is not necessarily equal to E(X)E(Y ) in the general case.
Example 20.5. Suppose that (X, Y ) has the joint pdf
f(x, y) = x + y,    0 ≤ x, y ≤ 1
Then E(XY) = 1/3 but E(X)E(Y) = (7/12)².
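Both expectations in Example 20.5 can be verified by a crude numerical integration over a grid; the sketch below is only a check, not part of the theory.

h = 0.001
g = expand.grid(x = seq(h/2, 1 - h/2, h), y = seq(h/2, 1 - h/2, h))
f = g$x + g$y                      # joint density evaluated on the grid
EXY = sum(g$x * g$y * f) * h^2     # approximately 1/3
EX  = sum(g$x * f) * h^2           # approximately 7/12
c(EXY, EX^2)                       # E(XY) versus E(X)E(Y), since E(X) = E(Y) by symmetry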
20.2
Covariance and Correlation
Definition 20.6. The covariance between X and Y is
Cov(X, Y ) = E[(X − µX )(Y − µY )]
The covariance between X and Y measures the relationship between X and Y . There is a
shortcut formula for covariance.
Proposition 20.7. Cov(X, Y ) = E(XY ) − µX µY .
Proposition 20.8. For every X, Y and constants a, b,
1. Cov(X, Y ) = Cov(Y, X)
2. Cov(aX, Y ) = a Cov(X, Y )
3. Cov(X + b, Y ) = Cov(X, Y )
4. Cov(X, X) = Var(X).
5. If X and Y are independent then Cov(X, Y ) = 0.
It is important to note that the converse of Proposition 20.8(5) is not true.
Example 20.9. There are random variables X and Y such that Cov(X, Y ) = 0 but X and
Y are not independent. There is a straightforward discrete example in the text. For a continuous example let X be uniform on [−1, 1] and let Y = |X|. It is obvious that Y and X are
not independent – indeed the value of Y is completely determined from that of X. However
Cov(X, Y ) = Cov(X, |X|) = E(X|X|) − E(X)E(|X|) = 0 − 0(1/2) = 0.
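A quick simulation sketch of this continuous example:

x = runif(100000, -1, 1)
cov(x, abs(x))      # near 0, even though Y = |X| is completely determined by X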
Covariance has units – it is convenient to make covariance a dimensionless quantity by normalizing.
Definition 20.10. The correlation between X and Y is given by
Corr(X, Y) = Cov(X, Y)/(σX σY)
Corr(X, Y ) is also denoted ρX,Y or simply ρ if X and Y are understood.
Proposition 20.11. For all random variables X and Y
1. −1 ≤ Corr(X, Y ) ≤ 1
2. Corr(aX + b, X) = 1 if a > 0 and Corr(aX + b, X) = −1 if a < 0.
3. If | Corr(X, Y )| = 1 then Y = aX + b for all but a set of x-values of probability 0.
Proof. To prove part 1, define random variables V and W by
W = X − µX
V = Y − µY
Now for any real number t
E[(tW + V)²] = t²E(W²) + 2tE(VW) + E(V²)    (20.3)
Since E[(tW + V)²] ≥ 0 for every t, the right-hand side of (20.3), which is a quadratic polynomial in t, must be nonnegative for every t. This in turn means that
4E(VW)² − 4E(W²)E(V²) ≤ 0
Now the result follows since E(V W ) = Cov(X, Y ), E(W 2 ) = Var(X), and E(V 2 ) = Var(Y ).
To prove part 3, note that |Corr(X, Y)| = 1 only if the quadratic in (20.3) has the value 0 for some t0. For this t0, we have E[(t0W + V)²] = 0. This must mean that t0W + V is the zero random variable, at least on a set of probability 1. In this case we have V = −t0W, which is equivalent to part 3.
From the above proof, we can say more. If | Corr(X, Y )| = 1 then, letting ρ = Corr(X, Y ),
(Y − µY)/σY = ρ (X − µX)/σX
Example 20.12. Suppose that (X1, . . . , Xr) has a multinomial distribution with parameters n, p1, . . . , pr. Then we have
Xi ∼ Bin(n, pi)        E(Xi) = npi        Var(Xi) = npi(1 − pi)        Cov(Xi, Xj) = −npipj (i ≠ j)
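These moments can be checked with rmultinom; in this sketch n = 10 and equal cell probabilities are chosen only for illustration.

sims = rmultinom(10000, size = 10, prob = rep(1/6, 6))   # one replicate per column
mean(sims[1, ])             # approximately np1 = 10/6
var(sims[1, ])              # approximately np1(1 - p1) = 25/18
cov(sims[1, ], sims[2, ])   # approximately -np1p2 = -10/36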
Homework.
1. Read Devore and Berk, Section 5.2.
2. Do problems 5.18, 5.23, 5.24, 5.30 of Devore and Berk.
21
Conditional Distributions
Goal: To define conditional distributions
21.1
Conditional Mass and Density Functions
Definition 21.1. Given a random pair (X, Y ), the random variable Y |x is defined by the pmf
(pdf)
pY|x(y) = p(x, y)/pX(x)          fY|x(y) = f(x, y)/fX(x)
The random variable X|y is defined similarly. Often pY |x (y) is written pY |x (y|x).
Example 21.2. Recall the experiment in example 18.6 in which two dice are tossed and the
smaller and larger numbers are reported as S and L. Then
pS|2(1) = p(1, 2)/pL(2) = (2/36)/(3/36) = 2/3
Example 21.3. Let (X, Y ) be defined by the joint pdf
f(x, y) = x + y,    0 ≤ x, y ≤ 1
This joint distribution was considered in Example 20.5. Then
fY|x(y) = f(x, y)/fX(x) = (x + y)/(x + 1/2)
Since Y |x is a random variable, we can compute E(Y |x) and Var(Y |x) in the usual way. In
general, of course, E(Y |x) depends on x.
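For instance, in Example 21.3,
E(Y|x) = ∫_0^1 y (x + y)/(x + 1/2) dy = (x/2 + 1/3)/(x + 1/2),
which equals 2/3 when x = 0 and 7/12 when x = 1/2.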
Note also that if X and Y are independent, then fY |x (y) = fY (y) as we would expect from the
informal meaning of independence.
21.2
Using Conditional Probabilities to Generate Joint Distributions
Example 21.4. Suppose that two random numbers X and Y are generated as follows. X has
a uniform distribution on [0, 1]. Given X = x, Y is chosen to have a uniform distribution on
[x, 1]. In other words Y ≥ X but no particular values of Y > X are “favored.” Then
f(x, y) = fX(x) fY|x(y) = 1 · 1/(1 − x) = 1/(1 − x),    0 ≤ x ≤ y ≤ 1
We have, for example,
E(Y) = ∫_0^1 ∫_0^y y f(x, y) dx dy = ∫_0^1 ∫_0^y y/(1 − x) dx dy = 3/4
21.3
The Bivariate Normal Distribution
The most important joint distribution is the bivariate normal distribution. It is a five parameter
distribution: µX , µY , σX , σY , and ρ. The density of this joint distribution is the amazing
expression
f(x, y) = 1/(2πσXσY√(1 − ρ²)) · Exp[ −1/(2(1 − ρ²)) ( ((x − µX)/σX)² − 2ρ(x − µX)(y − µY)/(σXσY) + ((y − µY)/σY)² ) ]
which has domain −∞ < x, y < ∞.
It is characterized by the following properties:
Proposition 21.5. If the random pair (X, Y) has the bivariate normal distribution, then
1. X has a normal distribution with mean µX and variance σX².
2. Y has a normal distribution with mean µY and variance σY².
3. the correlation of X and Y is ρ.
4. the conditional distribution of Y|x is a normal distribution with
µY|x = µY + ρσY (x − µX)/σX        σ²Y|x = σY²(1 − ρ²)
Conversely, any random pair with these five properties has the bivariate normal distribution.
R makes the multivariate normal distribution available via a package that must be installed
and loaded: mvtnorm. The functions dmvnorm, pmvnorm and rmvnorm return what you would
expect. Since we haven’t defined the cdf of a random pair, we have that definition here:
Definition 21.6. If (X, Y ) is a random pair, the cdf of (X, Y ) is the function
F (x, y) = P (X ≤ x and Y ≤ y)
The function pmvnorm computes the cdf of the multivariate normal distribution. One needs to
supply these functions with a vector of means (µX , µY ) and the covariance matrix
Var(X)      Cov(X, Y)
Cov(X, Y)   Var(Y)
Recall that Cov(X, Y ) = ρσX σY . Suppose that SAT Math and Verbal scores have means of
500, standard deviations of 100 and a correlation of 0.8. Then the following computes the
probability that both math and verbal scores of a randomly chosen individual are below 500.
> m=c(500,500)
> cv=matrix(c(10000,8000,8000,10000),nrow=2,ncol=2)
> pmvnorm(upper=c(500,500),sigma=cv,mean=m)
[1] 0.3975836
attr(,"error")
[1] 1e-15
attr(,"msg")
[1] "Normal Completion"
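As a sketch (not in the text), rmvnorm from the same mvtnorm package can be used to check property 4 of Proposition 21.5 for these SAT parameters: among simulated pairs whose math score is near 600, the verbal scores should average about 500 + (.8)(100)(600 − 500)/100 = 580 with standard deviation 100√(1 − .8²) = 60.

library(mvtnorm)
sims = rmvnorm(100000, mean = m, sigma = cv)           # m and cv as defined above
verbal = sims[sims[, 1] > 590 & sims[, 1] < 610, 2]    # verbal scores when math is near 600
mean(verbal)    # approximately 580
sd(verbal)      # approximately 60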
Homework.
1. Read Devore and Berk, Section 5.3.
2. Do problems 5.36, 5.45, 5.46, 5.53, 5.57
22
Statistics
Goal: To define statistics and study them by simulation
22.1
Random Samples
Definition 22.1. The random variables X1 , . . . , Xn are a random sample if
1. The Xi ’s are independent,
2. The Xi ’s have identical distributions.
We also say that the Xi ’s are i.i.d. (independent and identically distributed).
The R commands rnorm, runif, rbinom, etc. are intended to simulate a random sample.
Not all data can be considered to have arisen from a random sample. There are often dependencies to worry about. For example, sampling from a finite population almost never produces
independent random variables. However we will often try to construct experiments so that the
data can be analyzed as having come from a random sample.
Note that in any experiment, x1 , . . . , xn denote the result of the random variables X1 , . . . , Xn
for this particular trial of the experiment. Uppercase letters are random variables and the
corresponding lowercase letter is a number.
22.2
Statistics
To analyze data, we often compute certain summary values. For example
Definition 22.2. Given x1, . . . , xn, the (sample) mean of these numbers, x̄, is
x̄ = (x1 + · · · + xn)/n
The process of producing the sample mean is the result of applying a function to the values of
the random variables X1 , . . . , Xn . The sample mean could be considered the result of a random
variable.
Definition 22.3. Given a random sample X1 , . . . , Xn any function Y = h(X1 , . . . , Xn ) is called
a statistic.
Definition 22.4. Given a random sample X1 , . . . , Xn , the sample mean is the random variable
X̄ = (X1 + · · · + Xn)/n
We sometimes write X n to signify the random variable that is the sample mean of a sample of
size n.
22.3
The Sampling Distribution of the Statistic
If X1 , . . . , Xn is a random sample and Y = g(X1 , . . . , Xn ) is a statistic, then Y has a distribution
that depends on that of the Xi ’s (which all have the same distribution). Obviously, it might
not be easy to figure out what the pdf of Y is, even if we know the pdf of the Xi ’s.
22.4
Simulation
Simulations sometimes help to understand and estimate the sampling distribution of Y . The
following R code simulates samples of size 100 from an exponential distribution with λ = 1.
The experiment of generating random samples of size 100 is replicated 1,000 times. The sample
mean is computed for each replicate. Therefore we get 1,000 different values for the sample
mean. R accomplishes the simulation in the following steps. First, 100,000 random numbers are
generated from the exponential distribution with parameter λ = 1. Presumably, these numbers
act like they are being generated independently. These 100,000 numbers comprise the vector x.
The vector y contains 100 copies of the numbers from 1 to 1,000. The idea is that the numbers
in y are being used to label the numbers in x with 100 numbers in x to receive each of the
numbers from 1 to 1,000. Numbers in x with the same label from y are treated as if they are
in the same sample of size 100. The tapply command takes three arguments. The first two
are vectors of the same length, in this case x and y. The third is an R function. The command
applies the function to each group of elements in x with the groups determined by the labels in
y. Therefore, in the example below xbar consists of 1,000 sample means of 100 numbers from
an exponential distribution with λ = 1.
> y=rep(1:1000,100)
> x=rexp(100000,1)
> summary(x)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
1.181e-05 2.891e-01 6.968e-01 1.003e+00 1.395e+00 1.050e+01
> xbar=tapply(x,y,mean)
> length(xbar)
[1] 1000
> summary(xbar)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.7259  0.9326  1.0020  1.0030  1.0670  1.3740
>
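The same simulation can be written more compactly with replicate, which later sections use; this is an equivalent sketch, not a different method.

xbar = replicate(1000, mean(rexp(100, 1)))   # 1,000 sample means of samples of size 100
summary(xbar)
hist(xbar)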
Homework.
1. Read Section 6.1 of Devore and Berk.
2. Using the R example in the notes as a model, study the sampling distribution of the
sample mean X of random samples from a gamma distribution with paramters α = 2 and
β = 1. Do four different simulations with the sample size being simulated being n = 4,
n = 10, n = 30 and n = 100. Do 500 replications of each sample.
(a) Comment on the differences in shape of the sampling distribution as n varies. In
other words, as n increases how does the shape of the sampling distribution change?
(b) Comment of the differences in spread of the sampling distribution as n varies.
23
Distributions of Sums of Random Variables
Goal: To develop methods for determining the distribution of a sum of random
variables
23.1
Linear Combinations of Random Variables
Theorem 23.1. If Y = a1 X1 + · · · + an Xn for some constants a1 , . . . , an , then
E(Y ) = a1 E(X1 ) + · · · + an E(Xn )
Corollary 23.2. If each of X1 , . . . , Xn has mean µ, then E(X) = µ.
Notes:
1. A random variable Y of the form Y = a1X1 + · · · + anXn is called a linear combination of the
random variables X1 , . . . , Xn .
2. Note that Theorem 23.1 does not require independence of the random variables X1 , . . . , Xn .
Similarly Corollary 23.2 does not require that the random variables are identically distributed but only that they have the same mean.
The situation for the variance is more complicated.
Theorem 23.3. If Y = a1 X1 + · · · + an Xn for some constants a1 , . . . , an , then
Var(Y) = Σ_{i=1}^{n} Σ_{j=1}^{n} ai aj Cov(Xi, Xj)
Corollary 23.4. For any two random variables X1 and X2 ,
Var(X1 + X2 ) = Var(X1 ) + Var(X2 ) + 2 Cov(X1 , X2 )
Var(X1 − X2 ) = Var(X1 ) + Var(X2 ) − 2 Cov(X1 , X2 )
Corollary 23.5. If Y = a1 X1 + · · · + an Xn for some constants a1 , . . . , an , and the random
variables X1 , . . . , Xn are independent then
Var(Y) = Σ_{i=1}^{n} ai² Var(Xi)
The following theorem is a direct corollary of the Corollaries 23.2 and 23.5.
Theorem 23.6. Suppose that random variables X1 , . . . , Xn are a random sample each with
mean µ and variance σ 2 . Then
E(X̄) = µ        Var(X̄) = σ²/n
23.2
The Distribution of a Linear Combination - The mgf Method
Theorem 23.7. Suppose that X1, . . . , Xn are independent random variables such that the moment generating function MXi(t) of each rv Xi exists. Let Y = a1X1 + · · · + anXn. Then
MY (t) = MX1 (a1 t)MX2 (a2 t) · · · MXn (an t)
Example 23.8. Suppose that X1 and X2 are independent exponential random variables with
λ = 1. Then Y = X1 + X2 is a gamma random variable with α = 2 and β = 1.
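To see this from Theorem 23.7: MXi(t) = 1/(1 − t) for t < 1, so MY(t) = (1 − t)^{−2}, which is the moment generating function (1 − βt)^{−α} of a gamma distribution with α = 2 and β = 1.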
Corollary 23.9. Suppose that the random variables X1 , . . . , Xn are independent and that each
is normally distributed. Then Y = a1 X1 + · · · + an Xn is normally distributed.
Homework.
1. Read Devore and Berk, Section 6.3.
2. Do problems 6.28, 6.36 of Devore and Berk.
3. Suppose that X1 , . . . , Xn are independent random variables so that Xi has a gamma
distribution with parameters α and β. What is the distribution of Y = X̄? (Hint: use
the Moment generating function method.) From this, verify that Theorem 23.6 holds in
this special case.
4. Suppose that X1 , . . . , Xn are independent chi-squared random variables such that Xi has
νi “degrees of freedom.” What is the distribution of Y = X1 + · · · + Xn ?
24
The Central Limit Theorem
Goal: To develop further the properties of the distribution of the sample mean
24.1
The Distribution of the Sample Mean
Summarizing the preceding section we have
Theorem 24.1. Suppose that X1 , . . . , Xn are iid with mean µ and variance σ 2 . Then
E(X̄) = µ        Var(X̄) = σ²/n
Theorem 24.2. Suppose that X1 , . . . , Xn are independent and normally distributed with mean
µ and standard deviation σ. Then X is normally distributed with mean µ and standard deviation
σ/√n.
24.2
The Central Limit Theorem
If the underlying distribution is not normal, we often do not know the distribution of the sample
mean. The following theorem is an important remedy.
Theorem 24.3 (The Central Limit Theorem). Suppose that X1 , . . . , Xn are iid with mean µ
and variance σ². Then
lim_{n→∞} P( (X̄ − µ)/(σ/√n) ≤ z ) = P(Z ≤ z) = Φ(z)
Corollary 24.4. Suppose that X has a binomial distribution with parameters n and p. Then if n is large, X is approximately normally distributed with mean np and standard deviation √(np(1 − p)).
Homework.
1. Read Section 6.2. pages 291–296.
2. No problems.
25
Investigating the CLT
Goal: To investigate the accuracy of approximating the distribution of the sample
mean by the normal distribution
25.1
Corollaries and Variants of the CLT
Theorem 25.1 (The Central Limit Theorem). Suppose that X1 , . . . , Xn are iid with mean µ
and variance σ². Then
lim_{n→∞} P( (X̄ − µ)/(σ/√n) ≤ z ) = P(Z ≤ z) = Φ(z)
Corollary 25.2 (Informal Statement). For n large, X n has a distribution that is approximately
normal with mean µ and variance σ 2 /n. Additionally, T = X1 + · · · + Xn has a distribution
that is approximately normal with mean nµ and variance nσ 2 .
Corollary 25.3. Suppose that X has a binomial distribution with parameters n and p. Then if n is large, X is approximately normally distributed with mean np and standard deviation √(np(1 − p)).
25.2
Using R To Investigate CLT
In the following example we superimpose the density of the appropriate normal distribution
over the histogram of random variables that should be approximately normal by the CLT.
Example 25.4. The Uniform Distribution. If X1, . . . , Xn is a random sample from the distribution that is uniform on [0, 1], then µ = 1/2 and σ² = 1/12. The function superhist
produces the histogram of sample means and superimposes the approximating normal distribution as predicted by the CLT. The arguments to superhist include
sm      a vector of sample means
n       sample size
mean    mean of population random variable
sd      standard deviation of population random variable
> superhist
function (sm,mean,sd,n){
hist(sm,prob=T);
normcurve(mean,sd,n)}
> normcurve
function (m,s,n) {
x= seq(m-3*s/sqrt(n),m+3*s/sqrt(n),length=100);
y=dnorm(x,m,s/sqrt(n));
return(lines(x,y))
}
> superhist(replicate(1000,mean(runif(2))),1/2,1/sqrt(12),2)
> superhist(replicate(1000,mean(runif(10))),1/2,1/sqrt(12),10)
> superhist(replicate(1000,mean(rexp(2))),1,1,2)
> superhist(replicate(1000,mean(rexp(10))),1,1,10)
The histograms below are the result of simulating sample sizes of 2 and 10 respectively from a
uniform distribution on [0, 1].
[Figure: histograms of sm for the two uniform simulations (n = 2 on the left, n = 10 on the right), each with the approximating normal density superimposed.]
The uniform distribution is symmetric. The exponential distribution is highly skewed. The
result of simulating sample sizes of 2 and 10 respectively from an exponential distribution with
λ = 1 is shown in the next pair of histograms.
[Figure: histograms of sm for the two exponential simulations (n = 2 on the left, n = 10 on the right), each with the approximating normal density superimposed.]
25.3
Numerical Comparisons
The function normquant compares the simulated distribution of the sample mean to what is
predicted by the normal cdf by computing the number of simulated values in the intervals
(−∞, −3), (−3, −2), (−2, −1), (−1, 0), (0, 1), (1, 2), (2, 3), (3, ∞) (in standardized units).
> normquant
function (sm,mean,sd,n){
devs=c(); reps=length(sm);
pct=0;normpct=0;
for (i in -3:3){ br= mean+i*sd/sqrt(n);
counts=length(sm[sm<br]);
devs=c(devs,counts/reps-pct-pnorm(i,0,1)+normpct);
normpct=pnorm(i,0,1);
pct=counts/reps};
devs=c(devs,(1-pct)-(1-normpct));
return(devs)}
> normquant(replicate(1000,mean(runif(2))),1/2,1/sqrt(12),2)
[1] -0.001349898 -0.002400234 0.001094878 -0.011344746 0.002655254
[6] 0.019094878 -0.006400234 -0.001349898
> normquant(replicate(1000,mean(runif(10))),1/2,1/sqrt(12),10)
[1] -0.000349898 0.007599766 -0.018905122 0.019655254 -0.008344746
[6] 0.007094878 -0.005400234 -0.001349898
> normquant(replicate(1000,mean(rexp(2))),1,1,2)
[1] -0.001349898 -0.021400234 -0.014905122 0.129655254 -0.100344746
[6] -0.015905122 0.006599766 0.017650102
Homework.
1. Investigate the accuracy of the approximation of the CLT for the gamma distribution.
Specifically, consider the gamma distribution with parameters α = 5 and β = 2.
(a) From Exercise 3 of Section 23, what is the distribution of the sample mean of a
sample of size n from this distribution? (No new question here.) What is the mean
and the standard deviation of X n ?
(b) Fix n = 2. Using the distribution in part (a), find the probability that the sample
mean X is in each of the intervals (−∞, −3), (−3, −2), (−2, −1), (−1, 0), (0, 1), (1, 2),
(2, 3), (3, ∞) where the intervals are expressed in terms of the standardization of X.
Compare these probabilities with those predicted by the Central Limit Theorem.
(c) Repeat the previous part for n = 10.
26
“Proof ” of the Central Limit Theorem
Goal: To prove the Central Limit Theorem and also to introduce a graphical technique for testing normality
26.1
Proof of the CLT
Theorem 26.1 (The Central Limit Theorem). Suppose that X1 , . . . , Xn are iid with mean µ
and variance σ². Then
lim_{n→∞} P( (X̄ − µ)/(σ/√n) ≤ z ) = P(Z ≤ z) = Φ(z)
Proof. Appendix, Chapter 5. Note that the proof relies on some facts about moment generating
functions that are beyond the scope of the course. But it is a good example of the moment
generating function method of determining the distribution of a function of random variables.
26.2
Normal Probability Plots and qqnorm
The book’s version of normal probability plots varies slightly from the version implemented
in R (qqnorm) but the differences are small and the idea is the same. A normal probability
plot is a way of plotting a vector of data so that if the data comes from a distribution that is
normal, the plot should be (almost) a straight line.
Definition 26.2. If x1 , . . . , xn is a sequence of n (not necessarily distinct) numbers, the ith
smallest such is denoted x(i) . (In other words, x(1) ≤ x(2) ≤ · · · ≤ x(n) .)
The next definition has some variants in different books. This one is most common (and is
reasonable).
Definition 26.3. If x1, . . . , xn is a sequence of n numbers, then x(i) is called the 100(i − .5)/n th sample percentile.
Earlier, we defined the 100pth percentile of a distribution. Namely, given a continuous random
variable X and p such that 0 ≤ p ≤ 1, the 100pth percentile of X is the unique number qp such
that P (X ≤ qp ) = p.
Definition 26.4. A normal probability plot for a sequence x1 , . . . , xn of numbers is the plot
of the pairs
(x(i) , q(i−.5)/n )
where qp is the 100pth percentile of the standard normal distribution.
If the sequence of numbers x1 , . . . , xn arises from a random sample from a normal distribution,
then its normal probability plot should be (approximately) a straight line. In R, the function
qqnorm produces a normal probability plot. If n > 10, the plot it produces is exactly the plot
described in Definition 26.4. If n ≤ 10, the quantiles of the normal distribution are modified
somewhat for technical reasons that we will consider later.
> qqnorm(runif(100),main="Uniform")
> qqnorm(rexp(100),main="Exponential")
> qqnorm(rnorm(100),main="Normal")
[Figure: three normal probability plots produced by the qqnorm commands above, titled Uniform, Exponential, and Normal; each plots Sample Quantiles against Theoretical Quantiles.]
Homework.
1. Read the Appendix to Chapter 5 in Devore and Berk and also Section 4.6, pages 206–212.
2. No further problems are assigned.
27
Estimation of Parameters
Goal: To introduce the concept of estimation of a parameter
27.1
Parameter of a Population
Recall the notions of population and parameter from Section 1.3.
Definition 27.1. A population is a well-defined collection of individuals.
Definition 27.2. A parameter is a numerical characteristic of a population.
The framework for the next several sections is this: while the population is well-defined, the
parameter is unknown. We will collect data (a sample of the population) to estimate the
parameter. To cast this as a problem in statistics, we will assume that sampling from the
population is represented by a random variable X called the population random variable.
Definition 27.3. A random sample from a population X is an iid sequence of random variables
X1 , . . . , Xn all of which have the same distribution as X.
Problem 27.4 (Parameter Estimation). Given a random variable X and a parameter θ associated with X the value of which is unknown, compute an estimate of θ from the result of a
random sample X1 , . . . , Xn from X.
27.2
Example I - Estimating p in the Binomial Distribution
Example 27.5. The manufacturing process of a part results in a certain probability of a
defective part. Estimate that rate.
Example 27.6. Not every seed in a bag of seeds will germinate. Estimate the germination
percentage.
Example 27.7. Some free-throw shooters are better than others. Estimate the probability
that a given free-throw shooter will make a free-throw.
Each of these examples is a (theoretical) population characterized by a parameter p and the
value of p is unknown. (Insert diatribe about Greek or roman letters here.) The relevant
random variable is the one that results from the experiment of testing a part (planting a seed,
shooting a free-throw) and recording whether the result of an experiment is a success or not.
The random sample is to repeat the experiment n times independently and record the results.
The obvious candidate for an estimator is to compute X/n where X is the number of successes, since we know that the distribution of X is binomial with parameters n (known) and p
(unknown).
Definition 27.8. An estimator of θ is a statistic θ̂ used to estimate θ. The number that results
from evaluating the estimator on the result of the experiment is called the estimate of the
parameter.
Summary: If X is a binomial random variable, p̂ = X/n is an estimator of p.
Example 27.9. (Example 27.5 continued) We test n = 100 parts. Record the number of
defectives X. Then X/100 is an estimator of the defective rate. If we observe 5 defectives, .05
is the estimate of the defective rate.
27.3
Competing Estimators
Instead of X/n, Laplace proposed (X + 1)/(n + 2). Wilson (and Devore and Berk) propose
(X + 2)/(n + 4).
Which should we choose?
Principle I - Lack of Bias
Note that if X ∼ Bin(n, p),
E(X/n) = p        E[(X + 1)/(n + 2)] = (np + 1)/(n + 2)        E[(X + 2)/(n + 4)] = (np + 2)/(n + 4)
Note that the Laplace and Wilson estimators are “biased” in the sense that they tend to
overestimate small p and underestimate large p. The estimator X/n has no such bias.
Definition 27.10. An estimator θ̂ of a parameter θ is unbiased if E(θ̂) = θ. The bias of an
estimator θ̂ is E(θ̂) − θ.
(Note that we don’t generally know the bias since we don’t know θ.)
Principle II – Close to θ on average
Definition 27.11. The mean square error of an estimator θ̂ is
MSE(θ̂) = E[(θ̂ − θ)2 ]
(Again, we do not generally know MSE(θ̂) since it depends on the unknown θ.)
Proposition 27.12. MSE(θ̂) = Var(θ̂) + Bias(θ̂)²
Let p̂, p̂L , and p̂W be the three estimators for p above (L for Laplace and W for Wilson).
Estimator      Bias                    Variance
p̂              0                       p(1 − p)/n
p̂L             (1 − 2p)/(n + 2)        p(1 − p)/(n + 4 + 4/n)
p̂W             (2 − 4p)/(n + 4)        p(1 − p)/(n + 8 + 16/n)
A plot of the mean square error of each of these three estimators for n = 10 and n = 30 shows
that which estimator has least mean square error depends on the true (unknown) value of p
and the (known) value of n.
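A sketch of R code that produces such a plot (here for n = 10; the n = 30 panel only changes n). The MSE formulas combine the variances and biases in the table above using Proposition 27.12.

p = seq(0, 1, 0.01); n = 10
mse  = p*(1 - p)/n                               # MSE of the unbiased estimator X/n
mseL = (n*p*(1 - p) + (1 - 2*p)^2)/(n + 2)^2     # Laplace: variance plus squared bias
mseW = (n*p*(1 - p) + (2 - 4*p)^2)/(n + 4)^2     # Wilson: variance plus squared bias
plot(p, mse, type = "l", ylab = "MSE")
lines(p, mseL, lty = 2); lines(p, mseW, lty = 3)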
[Figure: MSE of the three estimators plotted against p, for n = 10 (left) and n = 30 (right).]
Homework.
1. Read Devore and Berk, pages 326–331.
2. Do problems 7.7,8,14
28
Estimating µ
Goal: To estimate µ from a random sample
Proposition 28.1. If X is a population random variable and X1 , . . . , Xn is a random sample
from X, then µ̂ = X is an unbiased estimator of µX .
Note that we also know the variance of the estimator X̄ even if we do not know the distribution
of X, provided X itself has a variance. Namely
Var(X̄) = σX²/n
Note also that Var(X) = MSE(X).
In the special case that X has a normal distribution, we have something more.
Proposition 28.2. If X is normally distributed, then X̄ is an unbiased estimator of µ and among all unbiased estimators it has the least variance. (We call X̄ the MVUE - Minimum Variance Unbiased Estimator.)
Important Notes:
1. If X is not normal, X may not be MVUE.
2. An estimator that is MVUE need not have minimum MSE.
Homework.
1. Read Devore and Berk, pages 326–335.
2. Do problems 7.15,16,19.
3. Extra Credit: Problem 7.20 looks interesting although it took me quite a while to figure
out what was interesting in my answer to 7.20a. Try it if you like.
29
Estimating σ 2
Goal: To find an unbiased estimator for σ 2
29.1
S2
Setting: X1 , . . . , Xn is a random sample from a distribution that has mean µ and variance σ 2 .
Goal: Estimate σ 2 .
The natural estimator is Σ_{i=1}^{n} (Xi − µ)²/n, which is obviously an unbiased estimator. But µ is usually not known so this is not helpful.
Definition 29.1. The sample variance is the statistic
S² = Σ_{i=1}^{n} (Xi − X̄)²/(n − 1)
Proposition 29.2. If X1 , . . . , Xn is a random sample from a distribution with variance σ 2 ,
then S 2 is an unbiased estimator of σ 2 .
Note that the Proposition implies that using n in the denominator gives a biased estimator (one that on average underestimates σ²). Some insight as to why comes from
Proposition 29.3. Let x1 , . . . , xn be real numbers with mean x. Then x is the unique number
c that minimizes
f(c) = Σ_{i=1}^{n} (xi − c)²
In other words, by approximating µ by X, we make the expression for computing variance
smaller.
29.2
The Normal Case
In Example 17.5, we showed that if Z is a standard normal random variable then Z 2 has a
chi-squared distribution with one degree of freedom. Also in Exercise 4 of Section 23, we found
that the sum of n independent chi-squared random variables with degrees of freedom ν1, . . . , νn respectively is also chi-squared with ν1 + · · · + νn degrees of freedom. The following is
immediate.
Proposition 29.4. Suppose that X1 , . . . , Xn is a random sample from a normal distribution
with mean µ and variance σ 2 . Then
Σ_{i=1}^{n} ((Xi − µ)/σ)²
has a chi-squared distribution with n degrees of freedom and
((X̄ − µ)/(σ/√n))²
has a chi-squared distribution with 1 degree of freedom.
Lemma 29.5.
Σ_{i=1}^{n} ((Xi − X̄)/σ)² = Σ_{i=1}^{n} ((Xi − µ)/σ)² − ((X̄ − µ)/(σ/√n))²
Proposition 29.6. If X1 , . . . , Xn is a random sample from a normal distribution then S 2 and
X are independent random variables.
Corollary 29.7. If X1 , . . . , Xn is a random sample from a normal distribution, then the
random variable (n − 1)S 2 /σ 2 has a chi-squared distribution with n − 1 degrees of freedom.
Thus S 2 has mean σ 2 and variance 2σ 4 /(n − 1).
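Corollary 29.7, and the fact that S is a biased estimator of σ, can both be seen in a quick simulation; in this sketch the sample size 20, mean 5, and standard deviation 2 are arbitrary.

sims = replicate(10000, { x = rnorm(20, mean = 5, sd = 2); c(var(x), sd(x)) })
mean(sims[1, ])       # approximately sigma^2 = 4, since S^2 is unbiased
mean(sims[2, ])       # a bit less than sigma = 2, since S is biased
var(19*sims[1, ]/4)   # approximately 2(n - 1) = 38, the chi-squared(19) variance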
Important Notes:
1. Proposition 29.6 is true of normal distributions but is not true in general.
2. Though S 2 is always an unbiased estimator of σ 2 , S is never an unbiased estimator of σ.
Homework.
1. Read Devore and Berk pages 309–312.
2. Do problem 7.10 of Devore and Berk.
30
Summary of Properties of Estimators
Goal: To review the main properties of estimators.
30.1
Summary
1. X is a random variable and θ is a parameter (unknown) of X
2. A random sample X1 , . . . , Xn is generated from X
3. A statistic θ̂ = θ̂(X1 , . . . , Xn ) is computed
4. We would like θ̂ to have small MSE(θ̂) = E[(θ̂ − θ)2 ]
5. θ̂ is unbiased if E(θ̂) = θ. Unbiasedness is a good thing.
6. MSE(θ̂) = Bias²(θ̂) + Var(θ̂), so small variance is a good thing.
7. For any X, X̄ is an unbiased estimator of µX with variance σX²/n (and the MVUE when X is normal).
8. For any X, S 2 is an unbiased estimator of σ 2 .
9. For normal X, S² and X̄ are independent and (n − 1)S²/σ² is chi-squared with n − 1 degrees
of freedom.
10. It’s often much easier to find an unbiased estimator than to find the estimator with least
MSE. Furthermore, there may not be an estimator with least MSE over the full range of
parameter values.
The Standard Error of the Estimator
Definition 30.1. If θ̂ is an estimator for θ, the standard error of θ̂ is
σθ̂ = √(Var(θ̂))
The standard error of θ̂ is a measure of how precise the estimator is. Unfortunately, we usually
don’t know it since it depends on unknown parameters. If we can estimate σθ̂ , we write sθ̂ for
this estimate.
Example 30.2. If X ∼ Bin(n, p) with p unknown, we have that p̂ = X/n is an unbiased estimator of p with Var(p̂) = p(1 − p)/n. Thus σp̂ = √(p(1 − p)/n). Since p̂ is an estimator of p it is reasonable to use sp̂ = √(p̂(1 − p̂)/n).
Homework.
1. Read pages 338-339 of Devore and Berk.
2. Do problems 7.11, 7.13 of Devore and Berk.
Information concerning Test 2
The test is Monday, November 6. The test is in-class and closed-book. Calculators are allowed
and our favorite distribution chart will be available. The test covers Sections 16–26 of the
notes. This includes the following sections of the book: 4.4,4.7,5.1,5.2,5.3,6.1,6.2,6.3 (and a few
miscellaneous other pages).
The test has five problems. They are approximately as follows.
1. If X has such and such a distribution and Y = g(X), what is the pdf of Y ? (see 4.7)
2. If X has such and such a distribution and Y has a distribution that is conditional on X,
tell me a lot about Y . (This is similar to 5.45 or the problem gone over in class on the
day this one was handed back.)
3. A question about the CLT. You must be able to state it accurately including defining all
notation used in its statement. You should know exactly what it is about and how it is
applied.
4. A certain joint density is given where X and Y are both discrete. Compute miscellaneous
things concerning it.
5. Some linear combination of random variables is described. You might have to compute
some or all of the mean, variance, and moment generating function of said random variables.
31
Method of Moments
Goal: To estimate parameters using the method of moments
31.1
Moments
Definition 31.1. Suppose that X1 , . . . , Xn are a random sample from a population random
variable X.
1. The kth population moment is µ′k = E(X^k).
2. The kth central population moment is µk = E[(X − µ)^k] for k ≥ 2.
3. The kth sample moment is Mk = (1/n) Σ_{i=1}^{n} Xi^k.
Obviously, E(Mk) = µ′k. Also note that µ′1 = µX, µ′2 = σX² + µX², and M1 = X̄.
31.2
Method of Moments - One parameter
Example 31.2. Suppose that X is exponential with parameter λ. Then µ = 1/λ. So E(X) =
1/λ. Therefore to estimate λ we might use 1/X. (Of course this estimator is not unbiased.)
In general, the method of moments to estimate a parameter θ works like this. Suppose µ is
some known function of θ. Then to estimate θ we use
X̄ = µ
and solve this equation for θ in terms of X.
Example 31.3. Suppose that X1 , . . . , Xn is a random sample from a distribution that has pdf
f(x; θ) = θ(x − 1/2) + 1,    0 ≤ x ≤ 1,    −2 ≤ θ ≤ 2
One can show that µX = 1/2 + θ/12. Therefore the method of moments estimator comes from solving
X̄ = 1/2 + θ̂/12
So
θ̂ = 12X̄ − 6
Note something peculiar about this estimator. It is certainly possible that θ̂ gives a value of θ
that is impossible.
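For example (a simulation sketch): take θ = 2, so that f(x) = 2x and X can be simulated as the square root of a uniform random number. About half of the estimates then exceed the largest allowable value θ = 2.

thetahat = replicate(1000, 12*mean(sqrt(runif(25))) - 6)   # samples of size 25 from f(x) = 2x
mean(thetahat > 2)                                         # proportion of impossible estimates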
31.3
Two parameters
Example 31.4. The gamma distribution has two parameters α and β. Suppose that we wish
to estimate both of these given a sample.
> x=rgamma(30,shape=alpha,scale=beta)
> x
 [1] 24.62216 29.25221 35.77913 39.77655 27.73842 20.47076 68.53810 29.17282
 [9] 32.55297 23.14998 29.32863 22.15848 47.13237 35.77347 25.08374 25.61091
[17] 28.38134 20.18289 47.29603 22.40046 32.42016 58.93412 19.39311 38.03705
[25] 26.45068 39.11045 21.36603 17.03293 26.85245 35.56088
> hist(x,prob=T)
We have µ = αβ and σ² = αβ², so µ′2 = σ² + µ² = αβ² + α²β². So we have the two equations
M1 = α̂β̂        M2 = α̂β̂² + α̂²β̂²
In our particular example we have
> m1=mean(x)
> m2=mean(x^2)
> betahat=(m2-m1*m1)/m1
> alphahat=m1/betahat
> alphahat
[1] 7.51535
> betahat
[1] 4.211644
The plot of the true density (α = 10 and β = 2) is shown on the next page with the density
computed using estimated parameters and also with the histogram of the data.
Example 31.5. The method of moments estimator can always be used to estimate µ and σ 2 .
We must solve
M1 = µ̂        M2 = µ̂² + σ̂²
This obviously yields our usual µ̂ = X̄ and the estimate Σ_{i=1}^{n} (Xi − X̄)²/n for σ².
[Figure: histogram of the data with the true gamma density and the density at the estimated parameters superimposed.]
Homework.
1. Read Devore and Berk pages 344-346.
2. Do problem 7.23a.
3. A Rayleigh distribution has one parameter θ and pdf
f(x; θ) = (x/θ²) e^{−x²/(2θ²)},    x ≥ 0
Find the method of moments estimator of θ.
32
Maximum Likelihood Estimation
Goal: To estimate parameters using the method of maximum likelihood
32.1
An Example
Example 32.1. Suppose that X ∼ Bin(n, p) with p unknown. The density is
f(x; p) = (n choose x) p^x (1 − p)^{n−x}    (32.4)
If x is known, we can think of f(x; p) as a function of p. It is reasonable to choose p̂ to maximize this function. That is, for fixed x we will maximize (32.4). Note that (n choose x) is a positive constant so we can omit it, and also it is enough to find p̂ that maximizes the natural logarithm of f.
Namely our problem is to maximize
L(p) = x ln p + (n − x) ln(1 − p)
Solving L′(p) = 0 gives p = x/n. In other words we should use p̂ = X/n.
32.2
The General Method
Suppose that X1 , . . . , Xn have joint density function f (x1 , . . . , xn ; θ1 , . . . , θm ) where θ1 , . . . , θm
are unknown parameters. If x1 , . . . , xn are observed sample values, the function
L(θ1 , . . . , θm ) = f (x1 , . . . , xn ; θ1 , . . . , θm )
is called the likelihood function of the sample.
Definition 32.2. Given the n values x1 , . . . , xn of random variables X1 , . . . , Xn with likelihood
function L(θ1 , . . . , θm ), the maximum likelihood estimates of θ1 , . . . , θm are those values that
maximize the function L. If the random variables Xi are substituted in place of their values xi ,
the resultant random variables are the maximum likelihood estimators θ̂1 , . . . , θ̂m of θ1 , . . . , θm .
Example 32.3. Suppose X1 , . . . , Xn is a random sample from an exponential distribution with
parameter λ. Then the joint density function is
f(x1, . . . , xn; λ) = (λe^{−λx1}) · · · (λe^{−λxn}) = λ^n e^{−λ Σ xi}
The log of the likelihood function is therefore
n ln λ − λ Σ xi
Maximizing this function we find that
λ̂ = 1/X̄.
Example 32.4. Suppose that X1 , . . . , Xn is a random sample from a normal distribution with
parameters µ and σ². Then the maximum likelihood estimators of µ and σ² are
µ̂ = X̄        σ̂² = Σ(Xi − X̄)²/n
Unfortunately, it is often difficult to find the values of the parameters that maximize the likelihood function.
Example 32.5. Let X1 , . . . , Xn be a random sample from a gamma distribution with parameters α and β. The likelihood function is
L(α, β) = β^{−nα} (Γ(α))^{−n} (x1 · · · xn)^{α−1} e^{−(Σ xi)/β}
Homework.
1. Read Devore and Berk, pages 346–348.
2. Do problems 7.21, 7.23b, 7.25.
33
Nonlinear Optimization Using R
Goal: To use R to find maximum likelihood estimates for the gamma distribution
33.1
Finding Minima of Nonlinear Functions
Example 33.1. The function h(x) = (x − 2)² has a minimum of 0 at x = 2. R finds minima
numerically. The command nlm takes two arguments: a function to be minimized and a starting
value for the iterative procedure.
> h
function(x){ (x-2)^2 }
> nlm(h,3)
$minimum
[1] 0
$estimate
[1] 2
$gradient
[1] 0
$code
[1] 1
$iterations
[1] 2
Example 33.2. The function k(x) = x4 − 8x3 + 16x + 1 has two local minima. Which one of
these is found depends on the starting point of the iteration.
> k
function (x) {x^4-8*x^3+16*x+1}
> nlm(k,3)
$minimum
[1] -335.912
$estimate
[1] 5.884481
$gradient
[1] 2.414972e-07
$code
[1] 1
$iterations
[1] 6
> nlm(k,-1)
$minimum
[1] -7.316241
$estimate
[1] -0.7687347
$gradient
[1] 3.939959e-06
$code
[1] 1
$iterations
[1] 5
Example 33.3. The function m(x, y) = x² + xy + y² − 3x − 3y + 4 has a minimum at (1, 1).
Note that the argument to m in R is a vector of length two that carries the two variables of
the function.
> m
function (w) {x=w[1]; y=w[2] ; x^2 + x*y+y^2-3*x-3*y+4 }
> nlm(m,c(2,3))
$minimum
[1] 1
$estimate
[1] 0.9999997 0.9999997
$gradient
[1] -4.440892e-10 -1.332268e-09
$code
[1] 1
$iterations
[1] 3
Example 33.4. Let’s get more interesting and solve a problem from a recent Mathematics 162
test.
A piece of sheet metal of length 10 inches is bent to form a trapezoidal trough as in the picture.
Find the length of the sides x and the angle of the bend θ so that the area of the cross-section
of the trough is maximized.
[Figure: cross-section of the trapezoidal trough, with base 10 − 2x, slanted sides of length x, and bend angle θ.]
The maximum occurs at x = 10/3, θ = π/6. Note that R gets close but is dumb enough not
to know that it should find a reasonable value of θ. Note also that the function trap returns
the negative of the area of the trapezoid. This is because a minimum of trap corresponds to a
maximum value for the area.
> trap
function (x) { s=x[1]; t=x[2]; return(-s*cos(t)*(s*sin(t)+(10-2*s)))}
> n=nlm(trap,c(3,1))
> n
$minimum
[1] -14.43376
$estimate
[1]
3.333313 -12.042781
$gradient
[1] 1.971768e-08 1.389486e-07
$code
[1] 1
$iterations
[1] 9
> n$estimate[2]+4*pi
[1] 0.5235893
> pi/6
[1] 0.5235988
33.2
Maximum Likelihood Estimation for the Gamma Function
Example 33.5. Let X1 , . . . , Xn be a random sample from a gamma distribution with parameters α and β. The likelihood function is
L(α, β) = β^{−nα} (Γ(α))^{−n} (x1 · · · xn)^{α−1} e^{−(Σ xi)/β}
Rather than maximizing L, we minimize − log(L(α, β)). Notice that dgamma can be used to
return the logarithm of the density function. Also, note that we need to pass the values of the
array x through nlm to the function f. This is the role of the third argument to nlm.
> f
function (a,x) {alpha=a[1];beta=a[2]; -sum(dgamma(x,shape=alpha,scale=beta,log=T)) }
> x=rgamma(30,shape=10,scale=4)
> n=nlm(f,c(8,2),x=x)
> n
$minimum
[1] 115.6313
$estimate
[1] 10.788300 3.588843
$gradient
[1] -1.654462e-06 8.604496e-06
$code
[1] 1
$iterations
[1] 18
Homework.
1. Do problem 7.28a.
34
Beta Distribution Project
Goal: To keep students occupied while the instructor is out of town
34.1
The (Standard) Beta Distribution
The beta distribution is a distribution with parameters α, β > 0. (Refer to pages 203–204 of
Devore and Berk where this distribution is called the standard beta distribution.) The pdf is
f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1},    0 ≤ x ≤ 1
The beta distribution has mean and variance given by
µ = α/(α + β)        σ² = αβ/[(α + β)²(α + β + 1)]
R knows the beta distribution. The commands dbeta, rbeta, pbeta work as expected. The
parameters α and β are called shape1 and shape2 in R.
> x=seq(0,1,.01)
> y=dbeta(x,shape1=2,shape2=4)
> plot(x,y,type="l")
[Figure: the Beta(2, 4) density on [0, 1] drawn by the commands above.]
34.2
Collect Some Data
Given the domain of the beta distribution (0 ≤ x ≤ 1), it is often used to model data that
are percentages. For example, we could use the beta distribution to model the distribution of
free-throw shooting percentages of basketball players in the National Basketball Association.
Collect some data that is (or at least approximates) a random sample from what might be
a beta distribution. For example, you might find a dataset that represents a population of
percentages and choose your own random sample from it. Be creative here, I don’t want to see
eight collections of free-throw shooting percentages of National Basketball Association players.
1. Clearly describe your data, the manner in which you collected it.
2. Describe exactly the population (whether actual or theoretical) that the data is a sample
from.
3. Is your data a true random sample from the population? If not, what biases might have
been introduced in the sampling?
4. Present your data numerically and also in some useful graphical way.
34.3
Estimate Some Parameters
1. Using the data you have collected, compute the method of moments estimators for the
parameters α and β in the beta distribution.
2. Using the data you have collected, compute the maximum likelihood estimators for the
parameters α and β.
3. Assess the fit of your two respective sets of estimators in some useful (qualitative) way.
Perhaps you might want to plot the respective densities obtained from these estimators.
35
Maximum Likelihood Estimation Yet Again
Goal: To deal with some “issues” related to maximum likelihood estimation
35.1
Maximization Is Not Always Through Differentiation
We know this already. The maximum of a function may occur at an endpoint of the domain or
at a point where the function is not differentiable rather than at a point where the derivative
is 0.
Example 35.1. Suppose that x1 , . . . , xn is a sample from a distribution that is uniform on
[0, θ], where θ is unknown. Then we know that θ ≥ x1 , . . . , xn and for all such θ, L(θ) = 1/θn .
This is maximimized when θ is as small as possible, that is θ = max{x1 , . . . , xn }.
Example 35.2. Suppose that X is hypergeometric with parameters n, M , and N and that N
is the only unknown parameter. Then the likelihood function is
L(N) = C(M, x) C(N − M, n − x) / C(N, n)
where C(a, b) denotes the binomial coefficient “a choose b.”
This is a function of an integer variable and so it is not differentiable. We analyze this situation
as follows. Consider L(N)/L(N − 1). We have
L(N)/L(N − 1) = (N − M)(N − n) / [N(N − M − n + x)]
This ratio is larger than 1 if and only if N < Mn/x. Thus the value of N that maximizes L(N)
is the greatest integer N such that N < Mn/x, which is denoted ⌊Mn/x⌋.
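A quick numerical check of this formula (the values M = 50, n = 20, x = 3 are made up for illustration and are not from the text):
> M = 50; n = 20; x = 3
> N = 70:600                     # candidate population sizes
> L = dhyper(x, M, N - M, n)     # L(N); dhyper takes (x, #successes, #failures, sample size)
> N[which.max(L)]
[1] 333
> floor(M*n/x)                   # the formula above: floor(1000/3)
[1] 333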
35.2
The Invariance Principle
Proposition 35.3. Suppose that θ̂ is the maximum likelihood estimator of θ. Then h(θ̂) is the
maximum likelihood estimator of h(θ). (This result is true for m-tuples of parameters as well.)
Example 35.4. Suppose that α̂ and β̂ are the maximum likelihood estimators of α and β in
the gamma distribution. Then α̂β̂ is the maximum likelihood estimator of αβ = µ.
Example 35.5. The invariance principle is very useful as it applies to any function h. But in
a sense it is bad news. For example, we know that the maximum likelihood estimator of µ in
the normal distribution is X̄ and is unbiased. Thus X̄² is the maximum likelihood estimator of
µ², but it cannot be an unbiased estimator, since E(X̄²) = µ² + σ²/n > µ².
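A small simulation makes this bias visible (the choices n = 5, µ = 2, σ = 1 are ours, purely for illustration):
> xbar2 = replicate(10000, mean(rnorm(5,2,1))^2)   # 10000 values of X̄² with n = 5, µ = 2, σ = 1
> mean(xbar2)                                      # close to µ² + σ²/n = 4.2, not to µ² = 4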
35.3
Large Sample Behavior of the MLE
Although the MLE is sometimes difficult to compute and is not necessarily unbiased, it can be
proved that for large samples the MLE has some desirable properties. Suppose that we have a
random sample from a population with a pdf f (x; θ) that depends on one unknown parameter.
One technical assumption that we must make is that the possible values of x do not depend on
θ (unlike Example 35.1). Then we have the following theorem, stated informally. (It is really a
limit theorem like the Central Limit Theorem.)
Theorem 35.6. For large n, the distribution of the Maximum Likelihood Estimator θ̂ of θ
approaches a normal distribution, its mean approaches θ and its variance approaches 0. Furthermore, for large n, its variance is nearly as small as that of any unbiased estimator of θ.
But beware, small samples are not large!
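As an illustration (the exponential distribution, true rate 2, and n = 200 are our own choices, not from the text): the MLE of the rate λ of an exponential distribution is 1/X̄, and a simulation shows that its sampling distribution is already nearly normal and tightly concentrated near λ for n = 200.
> lam = replicate(1000, 1/mean(rexp(200,rate=2)))   # 1000 simulated MLEs, n = 200, true rate 2
> hist(lam)                                         # roughly bell-shaped, centered near 2
> sd(lam)                                           # close to 2/sqrt(200), the large-sample value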
Homework.
1. Read Devore and Berk, pages 351–353.
2. Do problem 7.31 of Devore and Berk.
Mathematics 343 Class Notes
36
94
Confidence Intervals
Goal: To develop the theory of confidence intervals in the context of a simple (unrealistic) example.
36.1
Example
Suppose that X1 , . . . , Xn is a random sample from a distribution that is normal with unknown
mean µ and known variance σ 2 . (Knowing σ 2 without knowing µ is unrealistic.) Then we
know
1. X̄ is an unbiased estimator of µ
2. (X̄ − µ)/(σ/√n) has a standard normal distribution.
Therefore
P( −1.96 < (X̄ − µ)/(σ/√n) < 1.96 ) = .95
and so by algebra
P( X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n ) = .95
The interval
( X̄ − 1.96 σ/√n , X̄ + 1.96 σ/√n )
is a random interval.
Definition 36.1. Suppose that X1 , . . . , Xn is a random sample from a distribution that is
normal with mean µ and known variance σ 2 . Suppose that x1 , . . . , xn is the observed sample.
The interval
( x̄ − 1.96 σ/√n , x̄ + 1.96 σ/√n )
is called a 95% confidence interval for µ.
Example 36.2. A machine creates rods that are to have a diameter of 23 millimeters. It is
known that the standard deviation of the actual diameters of parts created over time is 0.1
mm. A random sample of 40 parts are measured precisely to determine if the machine is still
producing rods of diameter 23 mm. The data and 95% confidence interval are given by
> x
[1] 22.958 23.179 23.049 22.863 23.098 23.011 22.958 23.186 23.015 22.995
[11] 23.166 22.883 22.926 23.051 23.146 23.080 22.957 23.054 23.019 23.059
[21] 23.040 23.057 22.985 22.827 23.172 23.039 23.029 22.889 23.089 22.894
[31] 22.837 23.045 22.957 23.212 23.092 22.886 23.018 23.031 23.073 23.117
> mean(x)
[1] 23.024
> c(mean(x)-(1.96)*.1/sqrt(40),mean(x)+(1.96)*.1/sqrt(40))
[1] 22.993 23.055
It appears that the process could still be producing rods of diameter 23 mm.
36.2
Interpreting Confidence Intervals
In Example 36.2, we can say something like “we are 95% confident that the true mean is in the
interval (22.993, 23.055).” But beware:
This is not a probability statement! That is, we do not say that the probability
that the true mean is in the interval (22.993, 23.055) is 95%. There is no probability
after the experiment is done, only before.
The correct probability statement is one that we make before the experiment.
If we are to generate a 95% confidence interval for the mean from a random sample of
size 40 from a normal distribution with standard deviation 0.1, then the probability
is 95% that the resulting confidence interval will contain the mean.
Another way of saying this using the relative frequency interpretation of probability is
If we generate many 95% confidence intervals by this procedure, approximately 95%
of them will contain the mean of the population.
After the experiment, a good way of saying what confidence means is this
Either the population mean is in (22.993, 23.055) or something very surprising happened.
Below we generate 1000 random samples of size 40 from a normal distribution with mean 20
and standard deviation 1. Since we know µ for this simulation, we can check whether the 1000
corresponding confidence intervals contain µ. We would expect about 950 of them to contain
µ.
> zint
function (x,sigma,alpha) { z = -qnorm(alpha/2,0,1);
c(mean(x)-z*sigma/sqrt(length(x)),mean(x)+z*sigma/sqrt(length(x)))}
> cints=replicate(1000,zint(rnorm(40,20,1),1,.05))
> cints[,1]
[1] 19.504 20.124
> sum((cints[1,]<20)&(cints[2,]>20))
[1] 964
> cints=replicate(1000,zint(rnorm(40,20,1),1,.05))
> sum((cints[1,]<20)&(cints[2,]>20))
[1] 952
36.3
Confidence Levels and Sample Size
Nothing is magic about 95%. To generate a confidence interval at a different level of confidence,
we simply change the 1.96. Let zβ denote the number such that Φ(zβ ) = 1 − β where Φ is the
cdf of a standard normal random variable. Then a 100(1 − α)% confidence interval for µ is
given by
( x̄ − zα/2 σ/√n , x̄ + zα/2 σ/√n )
Note that the width of the confidence interval is determined by α and n. Of course higher
levels of confidence require wider confidence intervals, and larger sample sizes result in narrower
confidence intervals. Both of these facts are intuitively obvious.
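This relationship can be turned around to choose a sample size. A sketch, using the σ = 0.1 mm of Example 36.2 (the target half-width of 0.01 mm is our own choice): to make the 95% interval have half-width at most m we need zα/2 σ/√n ≤ m, that is, n ≥ (zα/2 σ/m)².
> z = -qnorm(.025)              # 1.959964
> sigma = .1 ; m = .01          # desired half-width of 0.01 mm
> ceiling((z*sigma/m)^2)        # (1.96*0.1/0.01)^2 = 384.2, so n = 385 rods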
Homework.
1. Read Section 8.1 of Devore and Berk.
2. Do problems 8.2, 8.3, 8.5, 8.8, 8.9 of Devore and Berk.
37
An Important (but Overused) Confidence Interval
Goal: To construct confidence intervals for the mean of a normal distribution without the silly assumption that the variance is known
37.1
The Problem
Given a random sample X1 , . . . , Xn from a normal distribution with unknown mean we want
to construct a confidence interval for µ without assuming that σ is known. The steps are:
1. Recall that (X̄ − µ)/(σ/√n) has a standard normal distribution.
2. Since σ is unknown, it seems advisable to approximate σ by S, the sample standard
deviation.
3. Now we need to know the distribution of (X̄ − µ)/(S/√n).
37.2
The t Distribution
Definition 37.1. A random variable T has a t distribution (with parameter ν ≥ 1, called the
degrees of freedom of the distribution) if it has pdf
f(t) = [1/√(πν)] [Γ((ν + 1)/2)/Γ(ν/2)] [1/(1 + t²/ν)^((ν+1)/2)],    −∞ < t < ∞
Some properties of the t distribution include
1. f is symmetric about t = 0 and unimodal. In fact f looks bell-shaped.
2. Indeed the mean of T is 0 if ν > 1 and does not exist if ν = 1.
3. The variance of T is ν/(ν − 2) if ν > 2.
4. For large ν, T is approximately standard normal.
Theorem 37.2. If X1 , . . . , Xn is a random sample from a normal distribution with mean µ
and variance σ 2 , then the random variable
(X̄ − µ)/(S/√n)
has a t distribution with n − 1 degrees of freedom.
37.3
The Confidence Interval
Analogous to zβ , define tβ,ν to be the unique number such that
P (T > tβ,ν ) = β
where T is random variable that has a t distribution with ν degrees of freedom. We have the
following:
Proposition 37.3. If x1 , . . . , xn are the observed values of a random sample from a normal
distribution with unknown mean µ and t∗ = tα/2,n−1 , the interval
( x̄ − t∗ s/√n , x̄ + t∗ s/√n )
is a 100(1 − α)% confidence interval for µ.
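The interval in Proposition 37.3 can be computed directly with qt. A minimal sketch, assuming the observed sample is stored in a vector x; the next subsection shows that t.test reports the same interval.
> tstar = -qt(.025, length(x)-1)              # t* for a 95% interval
> m = mean(x) ; se = sd(x)/sqrt(length(x))
> c(m - tstar*se, m + tstar*se)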
37.4
R and t
As usual, dt, pt, qt, and rt compute the usual functions associated with the t distribution.
The relevant parameter is the degrees of freedom.
> -qt(.025,5)
[1] 2.570582
> -qt(.025,10)
[1] 2.228139
> -qt(.025,30)
[1] 2.042272
> -qt(.025,100)
[1] 1.983972
> x=seq(-3,3,.01)
> y=dt(x,5)
> z=dt(x,10)
> w=dnorm(x)
> plot(x,y,type="l")
> lines(x,z)
> lines(x,w)
[Figure: densities of the t distribution with 5 and 10 degrees of freedom together with the standard normal density, plotted on (−3, 3).]
Confidence intervals are produced using t.test. Below we construct 95% and 90% confidence
intervals for the sepal width of the species virginica irises in an historically important dataset
(that is built into R).
> data(iris)
> sw=iris$Sepal.Width[iris$Species=="virginica"]
> hist(sw)
> t.test(sw)
One Sample t-test
data: sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.882347 3.065653
sample estimates:
mean of x
2.974
> t.test(sw,conf.level=.9)
One Sample t-test
data: sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
2.897536 3.050464
sample estimates:
mean of x
2.974
37.5
What if X is not Normal?
Suppose that X1 , . . . , Xn is a random sample from a distribution that is not known to be normal
but we still want to construct a confidence interval for µ. If the sample size is large and the
distribution is known to be reasonably symmetric, it is common to use the same confidence
interval as if it were normal. This probably should be a last resort.
Example 37.4. The GPAs of the 1333 seniors at a college somewhere in the Midwest do not
have a normal distribution. (The distribution is decidedly skewed to the right.) A sample of
30 seniors is taken and a 95% confidence interval for the population mean GPA is reported using a
t confidence interval. In an R simulation of 1000 such samples, we find that 949 of the 1000
confidence intervals contain the true mean of the population!
> g$GPA
[1] 3.992 3.376 3.020 3.509 3.970 3.917 3.243 3.547 3.416 4.000 3.448 3.908
..........................................
[1321] 3.312 2.621 3.494 2.507 3.222 2.892 3.344 3.417 3.656 3.830 2.234 3.163
[1333] 2.886
> f
function (){x=sample(g$GPA,30,replace=F) ;
t=t.test(x);
t$conf.int}
> cints=replicate(1000,f())
> sum( (cints[1,]<mean(g$GPA))&(cints[2,]>mean(g$GPA)) )
[1] 949
Homework.
1. Read Devore and Berk pages 383–385.
2. Do problem 8.34.
38
Confidence Intervals Using the CLT
Goal: To construct approximate confidence intervals for parameters of distributions
that are not normal
38.1
Setting
We suppose that we have a random sample X1 , . . . , Xn from a distribution that is not necessarily
normal and an unknown parameter θ. Suppose that θ̂ is an estimator for θ that has the following
properties:
1. the distribution of θ̂ is approximately normal
2. it is approximately unbiased
3. an expression for σθ̂ is available (in terms of quantities that can be estimated)
Then
P( −zα/2 < (θ̂ − θ)/σθ̂ < zα/2 ) ≈ 1 − α
We can use this to generate an approximate confidence interval. If sθ̂ is the estimate for σθ̂,
then the interval is
( θ̂ − zα/2 sθ̂ , θ̂ + zα/2 sθ̂ )
38.2
Example 1 - Estimating µ
By the CLT, if X1, . . . , Xn is a random sample from any distribution and n is large, then X̄ is
an unbiased estimator for µ and (X̄ − µ)/(σ/√n) has a distribution that is approximately normal.
Since σ is unknown, we need to estimate σ to construct a confidence interval. If we estimate σ
by S, we have the following approximate confidence interval for µ
P( X̄ − zα/2 S/√n < µ < X̄ + zα/2 S/√n ) ≈ 1 − α
Note that this is the same interval as generated in the last section except that there the t
distribution is used.
38.3
Example 2 - The Binomial Distribution
Suppose that X is binomial with parameters n and p and that p is unknown. The estimator
p̂ = X/n is an unbiased estimator of p. The CLT allows us to approximate the binomial
distribution by a normal distribution and so we can write
P( −zα/2 < (p̂ − p)/√(p(1 − p)/n) < zα/2 ) ≈ 1 − α        (38.5)
Equation 38.5 is the starting point for several different approximate confidence intervals.
38.3.1
The Wald interval.
If we estimate σ by substituting p̂ for p, we get the “Wald interval”
( p̂ − zα/2 √(p̂(1 − p̂)/n) , p̂ + zα/2 √(p̂(1 − p̂)/n) )
Until about the year 2000, this was the standard confidence interval suggested in most elementary statistics textbooks, provided the sample size was large enough. Books varied as to what “large
enough” meant. A typical piece of advice was to use this interval only if np̂(1 − p̂) ≥ 10. However,
you should never use this interval.
Definition 38.1. Suppose that I is a random interval used as a confidence interval for θ. The
coverage probability of I is P (θ ∈ I). (In other words, the coverage probability is the true
confidence level of the confidence intervals produced by I.)
The coverage probability of the (approximately) 95% Wald confidence intervals is almost always
less than 95% and could be quite a bit less depending on p and the sample size. For example,
if p = .2, it takes a sample size of 118 to guarantee that the coverage probability of the
Wald confidence interval is at least 93%. For very small probabilities, it takes thousands of
observations to ensure that the coverage probability of the Wald interval approaches 95%.
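Coverage probabilities like these are easy to estimate by simulation. A minimal sketch (the function name and the choices n = 50, p = .2 are ours, purely for illustration):
> wald.coverage = function(n,p,alpha=.05,reps=10000) {
+   z = -qnorm(alpha/2)
+   phat = rbinom(reps,n,p)/n                       # simulated sample proportions
+   lower = phat - z*sqrt(phat*(1-phat)/n)
+   upper = phat + z*sqrt(phat*(1-phat)/n)
+   mean(lower < p & upper > p) }                   # fraction of intervals that cover p
> wald.coverage(50,.2)                              # typically noticeably below .95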
38.3.2
The Wilson Interval.
Since 1927, a much better interval than the Wald interval has been known although it wasn’t
always appreciated how much better the Wilson interval is. The Wilson interval is derived by
solving the inequality in 38.5 so that p is isolated in the middle. We get the following (impressive-looking) approximate confidence interval statement:
P(  [ p̂ + zα/2²/(2n) − zα/2 √( p̂(1 − p̂)/n + zα/2²/(4n²) ) ] / [ 1 + zα/2²/n ]  <  p  <
    [ p̂ + zα/2²/(2n) + zα/2 √( p̂(1 − p̂)/n + zα/2²/(4n²) ) ] / [ 1 + zα/2²/n ]  )  ≈  1 − α
The following R code computes this interval in the case that x = 7, n = 10. (The option
correct=F will be considered later.)
> prop.test(7,10,correct=F)
1-sample proportions test without continuity correction
data: 7 out of 10, null probability 0.5
X-squared = 1.6, df = 1, p-value = 0.2059
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3967781 0.8922087
sample estimates:
p
0.7
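As a check on the formula above (a sketch; the variable names are our own), the same interval can be computed directly:
> x = 7 ; n = 10 ; z = -qnorm(.025)
> phat = x/n
> center = (phat + z^2/(2*n))/(1 + z^2/n)
> halfwidth = z*sqrt(phat*(1-phat)/n + z^2/(4*n^2))/(1 + z^2/n)
> c(center - halfwidth, center + halfwidth)       # agrees with prop.test: (0.3968, 0.8922)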
The Wilson interval performs much better than the Wald interval. If np̂(1 − p̂) ≥ 10, you can
be reasonably certain that the coverage probability of the 95% Wilson interval is at least 93%.
Notice that the center of the Wilson interval is not p̂. It is
[ p̂ + zα/2²/(2n) ] / [ 1 + zα/2²/n ]  =  ( x + zα/2²/2 ) / ( n + zα/2² )
A way to think about this is that the center of the interval comes from adding zα/2² trials and
zα/2²/2 successes to the observed data. For a 95% confidence interval, this is very close to adding
2 successes and 4 trials (which gives a point estimator for p that we studied earlier).
38.3.3
The Agresti-Coull Interval.
Agresti and Coull (1998) suggest combining the (biased) estimator of p used as the center of the
Wilson interval with the simpler estimate of the standard error that comes from the Wald
interval. In particular, if we are looking for a 100(1 − α)% confidence interval and x is the
number of successes observed in n trials, define
x̃ = x + zα/2²/2        ñ = n + zα/2²        p̃ = x̃/ñ
Then the Agresti-Coull interval is
( p̃ − zα/2 √(p̃(1 − p̃)/ñ) , p̃ + zα/2 √(p̃(1 − p̃)/ñ) )
In practice, this estimator is even better than the Wilson estimator and is now almost universally
the recommended one, even in basic statistics textbooks. For the particular example of x = 7
and n = 10, the Wilson and Agresti-Coull intervals are compared below.
> agco
function (x,n,alpha) { z= -qnorm(alpha/2); ntilde=n+z^2;
ptilde = (x+z^2/2)/ntilde ;
se = sqrt( ptilde*(1-ptilde)/ntilde);
c(ptilde-z*se,ptilde+z*se)}
> agco(7,10,.05)
[1] 0.3923253 0.8966616
> prop.test(7,10,correct=F)
1-sample proportions test without continuity correction
data: 7 out of 10, null probability 0.5
X-squared = 1.6, df = 1, p-value = 0.2059
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3967781 0.8922087
sample estimates:
p
0.7
Homework.
1. Read Devore and Berk, Section 8.2.
2. Do problems 8.20 and 8.22 of Devore and Berk. You may use either Wilson or Agresti-Coull confidence intervals. Problem 8.22 asks for an upper confidence bound, i.e., a
one-sided confidence interval. In R, a useful option to prop.test is alternative=. Read
the help document to learn more. These homework problems are due Friday, December
8.
39
The Bootstrap
Goal: To generate “bootstrap” confidence intervals
39.1
Approximations
We constructed exact 95% confidence intervals for the mean from random samples from a normal
distribution. There are other instances where exact confidence intervals can be constructed.
But usually, the 95% confidence intervals that we construct are approximate (i.e., their coverage
probability is only approximately 95%). The approximation is usually because we do not know
the distribution of our estimator θ̂ and use something like the CLT to approximate it.
The bootstrap is a method of constructing approximate confidence intervals that relies on two
pieces of intuition.
1. Rather than make a distributional assumption about the population (e.g., the data comes
from a gamma distribution), we use the data itself to give us an approximation to the
distribution of the population.
2. Simulation can be used to construct approximate confidence intervals if the distribution
is known.
39.2
The Empirical Density Function
Definition 39.1. Suppose that x1 , . . . , xn are n numbers that are the result of a random sample
from a population. The empirical density function of the sample is the function
p(xi) = 1/n,    1 ≤ i ≤ n
The first principle used in the bootstrap is that the empirical density function is a good approximation to the actual density function of the unknown random variable. More precisely,
the cumulative distribution function corresponding to this density function is a good approximation to the population cdf. The next example graphs the empirical cumulative distribution
function of a sample (from a known distribution) against the actual cdf.
> x=rnorm(30,10,1)
> x=sort(x)
> y=c(1:30)/30
> plot(x,y,type="s")
> y=pnorm(x,10,1)
> lines(x,y)
[Figure: empirical cdf of a sample of size 30 from a N(10, 1) distribution (step function) with the N(10, 1) cdf overlaid.]
39.3
The Bootstrap
Suppose we want to find a confidence interval for θ and we have a reasonable estimator θ̂ for θ.
Let x1 , . . . , xn be the result of a random sample of size n. The bootstrap simulates an estimate
of the sampling distribution for θ̂ as follows.
Definition 39.2. A bootstrap sample from x1 , . . . , xn is a random sample of size n from
{x1 , . . . , xn } with replacement. That is, a bootstrap sample is a random sample from a population with pdf equal to the empirical pdf of the original sample.
To compute a bootstrap confidence interval, choose many (e.g., 1000) bootstrap samples and
compute the values of θ̂ for each. Let θ̂i be the value of θ̂ in the ith bootstrap sample. The
values θ̂i give an estimate for the distribution of θ̂.
Definition 39.3. A 100(1 − α)% bootstrap (percentile) confidence interval for θ is (a, b) where
a is the 100(α/2) percentile of the set of bootstrap values θ̂i and b is the 100(1 − α/2) percentile
of these values.
> x
[1] 1.903 3.901 3.027 2.391 3.813 3.364 3.108 3.596 2.709 3.873 2.728 3.435
[13] 3.167 3.820 3.719 3.212 3.763 3.773 3.611 3.764 2.544 3.619 3.527 2.768
[25] 3.058 2.960 3.612 3.774 2.979 3.041
> hattheta=replicate(1000,mean(sample(x,30,replace=T)))
> hist(hattheta)
> s=sort(hattheta)
> s[25]
[1] 3.093667
> s[975]
[1] 3.460367
> t.test(x)
One Sample t-test
data: x
t = 35.3424, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
3.095183 3.475417
sample estimates:
mean of x
3.2853
[Figure: histogram of the 1000 bootstrap values hattheta.]
There are more sophisticated ways to generate confidence intervals from the bootstrap values
θ̂i . Devore and Berk describe BCa (bias-corrected, accelerated). An R package, simpleboot
(which must be loaded), computes the percentile and BCa intervals (as well as others).
> x
[1] 3.267 2.496 2.534 3.393 3.439 3.737 3.833 3.336 3.319 2.630 2.382 1.989
[13] 3.330 3.061 2.452 3.258 3.054 3.918 3.788 3.821 3.162 2.794 3.919 3.623
[25] 3.319 3.651 3.764 2.607 3.745 2.946
> b=one.boot(x,mean,1000)
> b
$t0
[1] 3.2189
$t
[,1]
[1,] 3.273233
[2,] 3.170400
[3,] 3.211500
................
[997,] 3.293700
[998,] 3.318433
[999,] 3.405000
[1000,] 3.072833
$R
[1] 1000
> boot.ci(b)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b)
Intervals :
Level      Normal              Basic
95%   ( 3.031,  3.404 )   ( 3.038,  3.417 )

Level     Percentile            BCa
95%   ( 3.020,  3.400 )   ( 3.011,  3.397 )
Calculations and Intervals on Original Scale
Warning message:
bootstrap variances needed for studentized intervals in: boot.ci(b)
Homework.
1. Read Devore and Berk, Section 8.5.
40
Testing Hypotheses About the Mean
Goal: To review hypothesis testing in the context of hypotheses about the mean of
a (normal) distribution
40.1
Setting
X1 , . . . , Xn is a random sample from a normal distribution with unknown µ. We want to test
a hypothesis about µ.
Example 40.1. Kellogg’s makes Raisin Bran and fills boxes that are labelled 11 oz. NIST
mandates testing protocols to ensure that this claim is accurate. Suppose that a shipment
of 250 boxes, called the inspection lot, is to be tested. The mandated procedure is to take a
random sample of 12 boxes from this shipment. If any box is more than 1/2 ounce underweight,
then the lot is declared defective. Else, the sample mean x̄ and the sample standard deviation
s are computed. The shipment is rejected if (x̄ − 11)/s ≤ −0.635.
40.2
The Hypothesis Testing Technology
We need two hypotheses:
1. Null Hypothesis. The null hypothesis, denoted H0 , is a hypothesis that the data
analysis is intended to investigate. It is usually thought of as the “default” or “status
quo” hypothesis that we will accept unless the data gives us substantial evidence against
it.
2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha , is the
hypothesis that we are wanting to put forward as true if we have sufficient evidence
against the null hypothesis.
In the context of Example 40.1, our hypotheses are
H0 : µ = 11
Ha : µ < 11
We will make one of two decisions. Either we will reject H0 (in favor of Ha ) or we will not
reject H0 .
There are two possible errors:
1. A Type I error is the error of rejecting H0 even though it is true. The probability of a
type I error is denoted by α.
2. A Type II error is the error of not rejecting H0 even though it is false. The probability
of a Type II error is denoted by β.
We test the hypothesis by computing a test statistic.
Definition 40.2. A test statistic is a random variable on which the decision is to be based.
A rejection region is the set of all possible values of the test statistic that would lead us to
reject H0 .
40.3
Testing Hypotheses About the Mean of the Normal Distribution
Suppose that X1 , . . . , Xn is a random sample from a normal distribution with unknown mean
µ. Suppose that we have the following null and alternate hypotheses:
H0 : µ = µ0
Ha : µ < µ0
We will use the following test statistic:
T = (X̄ − µ0)/(S/√n)
The important fact about this statistic is that if H0 is true then the distribution of T is known.
(It is a t distribution with n − 1 degrees of freedom.) Thus we can construct a rejection region
based on our desired α. In this case our test is
Reject H0 if and only if T < −tα,n−1 .
Note that in Example 40.1, n = 12. If we let α = .025, we have that the test gives
Reject H0 if and only if (X̄ − 11)/S < −t.025,11/√12 = −0.635.
This means that the NIST test really is a hypothesis test with α = .025. Indeed the NIST
manual says that “this method gives acceptable lots a 97.5% chance of passing.” Of course
the NIST method implicitly is relying on the assumption that the distribution of the lot is
normal. Is this unwise? And the sample size is only 12 so we should be cautious about using
the t-distribution for a non-normal population.
The following R session shows the hypothesis test of a possible sample of size 12 of 11 oz boxes
of Raisin Bran. Note that even though this sample seems to suggest underfilling, the lot passes
the test.
> x
[1] 10.74900 11.04724 10.86442 10.98675 10.80881 11.33170 10.73323 10.69521
[9] 10.83790 10.90010 10.99387 10.88968
> t.test(x,alternative="less",mu=11)
One Sample t-test
data: x
t = -1.9358, df = 11, p-value = 0.0395
alternative hypothesis: true mean is less than 11
95 percent confidence interval:
-Inf 10.99300
sample estimates:
mean of x
10.90316
> (mean(x)-11)/sd(x)
[1] -0.5588134
In the use of t.test above, we specified the null hypothesis, mu=11, the alternate hypothesis alternative="less", and (of course) the data. The possible alternative hypotheses are
two.sided, less, and greater.
40.4
p-Value of a Hypothesis Test
An important number that R reports is the p-value of the statistic.
Definition 40.3. The p-value of a statistical test is the probability that, if H0 is true, the test
statistic T would have a value at least as extreme (in favor of the alternate hypothesis) as the
value t that actually occurred.
The p-value is generally taken to be a measure of the strength of the evidence against H0 .
A small p-value is taken to be strong evidence against H0 . There are various incorrect ways
of saying what the p-value means. It is not the probability that the null hypothesis is true.
Indeed, it is a probability that only makes sense if the null hypothesis is true. Similarly, 1 − p
is not the probability that the alternative hypothesis is true.
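For the Raisin Bran sample in the previous section, the p-value reported by t.test can be reproduced directly from the t cdf, using the observed t = −1.9358 and 11 degrees of freedom from the output above:
> pt(-1.9358, df=11)      # P(T < -1.9358) for T ~ t with 11 df; this is the 0.0395 reported above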
Rather than reporting the results of hypothesis tests as “accept” or “reject” at the such-and-such level, it is much more standard to report the p-value of the test. If a p-value is reported,
the reader can determine precisely those α for which the null hypothesis would be rejected at
significance level α. If the p-value of the test is less than α, we would reject the null hypothesis.
In this case, it is also sometimes said that the test is “significant at the α level.”
40.5
Power
How likely is it that the NIST test will identify a bad lot of Raisin Bran as being bad? This
is a question about β, the probability of a Type II error. The answer depends on how bad the
lot actually is. In order to ask this question, we need to know the distribution of
T = (X̄ − 11)/(S/√12)
if µ ≠ 11. This distribution depends on the true mean µ, the standard deviation σ (which we
do not know), and the sample size. Suppose for example that the true mean is 10.9 and the
standard deviation is 0.1. The following R code determines the power of the test. (Power is
1 − β.) Note that the arguments are
delta         the deviation of the true mean from the null hypothesis mean
sd            the true standard deviation
n             the sample size
sig.level     α
type          this t-test is called a one.sample test
alternative   we tested a one.sided alternative
> power.t.test(n=12,delta=.1,sd=.1,sig.level=.025,
+ type="one.sample",alternative="one.sided")
One-sample t test power calculation
              n = 12
          delta = 0.1
             sd = 0.1
      sig.level = 0.025
          power = 0.8828915
    alternative = one.sided
With these hypothesized values, the power of the test is 88%. This means that the hypothesis
test would detect underfilling by an average of 0.1 oz 88% of the time if the true
standard deviation is also 0.1 oz. Obviously power goes up as n goes up and goes down as σ
goes up.
> sd=c(.05,.1,.15)
> power.t.test(n=12,delta=.1,sd=sd,
+ sig.level=.025,type="one.sample",alternative="one.sided")
One-sample t test power calculation
              n = 12
          delta = 0.1
             sd = 0.05, 0.10, 0.15
      sig.level = 0.025
          power = 0.9999909, 0.8828915, 0.5580037
    alternative = one.sided
The power of the test would also go up as the true deviation from the mean goes up.
> diff=seq(0,.1,.01)
> power.t.test(n=12,delta=diff,sd=.1,
+ sig.level=.025,type="one.sample",alternative="one.sided")
One-sample t test power calculation
n = 12
delta = 0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10
sd = 0.1
sig.level = 0.025
power = 0.02500000, 0.05024502, 0.09249152, 0.15643493, 0.24401839,
0.35263574, 0.47466264, 0.59891866, 0.71365697, 0.80978484, 0.88289152
alternative = one.sided
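The first power computed above can also be obtained without power.t.test from the noncentral t distribution (this check is our own, not part of the text): under µ = 10.9 and σ = 0.1, the statistic T = (X̄ − 11)/(S/√12) has a noncentral t distribution with 11 degrees of freedom and noncentrality parameter (10.9 − 11)/(0.1/√12) = −√12, so the power is the probability that T falls below −t.025,11.
> pt(qt(.025,11), df=11, ncp=-sqrt(12))   # matches the 0.8828915 reported by power.t.test above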
Homework.
1. Read pages 418–421 and 435–436 of Devore and Berk.
2. Do problem 9.35.