part B

Math 243 – Spring 2007
January 29, 2007
Important Information
1. Course Web Page: http://www.calvin.edu/~rpruim/courses/m243/S07/.
2. Provide exam “black out dates” (with reasons) via email by Friday.
Introduction to Statistics
1. What is statistics?
2. 8 steps in a statistical study (Lady Tasting Tea Example)
January 30, 2007
Announcements
• Send exam “blackout dates” via email by Friday.
• Meet in Mac Lab on Thursday. (NH 067, in the basement.) Read SimpleR, pages 1–7 BEFORE
class on Thursday.
• Problem Set 2 is due on Friday.
• Current plan: problem sets will usually be due Tuesdays and Fridays.
Looking at Data
Goals:
Today we will learn
1. how data sets are organized and the terminology used to describe
and talk about data sets.
2. some common graphical displays of variables (stemplots and histograms) and what they tell us about the distribution of a variable.
3. how to use R to make these graphical displays for us.
1. Structure of a data set
(a) units (individuals, subjects, cases, ...)
(b) variables (measurements made on individuals)
2. Types of Variables
(a) categorical (qualitative) – places units into categories
(b) quantitative (numerical) – measurement on some scale
• nominal:
• ordinal:
• interval:
• ratio:
• discrete vs. continuous:
3. Distribution of a variable (what values and how often)
4. Graphing distributions
(a) stemplots
(b) histograms
Useful R Commands
• Getting Started, Getting Data
◦ getCalvin(’m243’) – load functions, data, etc. for this class.
◦ source(’foo.R’) – execute the commands in the file foo.R.
· From your own machine you can use
source(’http://www.calvin.edu/~rpruim/R/courses/m243.R’)
to do the same thing as getCalvin(’m243’).
◦ require(Devore6) – load package of data sets from textbook. (Other packages can be loaded
similarly. getCalvin('m243') will load a few for you, including Devore6.)
◦ data(xmp01.05), data(ex01.05), data(faithful) – attach data set for Example 1.5,
Exercise 1.5, or a named data set. (The data type most commonly used for a data set in R is
called a data frame. A data frame can contain many variables, each of which is represented
as a vector, which is much like an array.)
◦ names(faithful) – see the names of the variables in this data set.
• Graphical Summaries
◦ stem(xmp01.05$bingePct) – basic stemplot of bingePct variable. Notice $ to get at a single
variable (as a vector).
◦ stemplot(xmp01) – fancier stemplot defined in m243.R; by default makes a stemplot for each
numeric variable in data frame.
◦ hist(faithful$eruptions) – make a histogram.
◦ histogram(~eruptions,data=faithful) – make a histogram of eruptions times for Old Faithful.
(Requires lattice package.)
• Getting Some Help
◦ ?hist, ?faithful – get documentation for the hist() command or the faithful data frame.
(Works with most other built-in or prepackaged commands as well. Won’t necessarily work with
utilities I provide.)
◦ args(hist) – list the arguments of the hist() function.
◦ apropos('plot') – list all functions that have plot in their name. Useful for recalling the name
of a function or checking to see if a function exists.
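For example, here is a minimal sketch of a session using only base R and the built-in faithful data
set (the course-specific getCalvin() and stemplot() utilities are assumed to come from m243.R):
> data(faithful)             # attach the Old Faithful data set
> names(faithful)            # lists the variables: eruptions and waiting
> stem(faithful$eruptions)   # stemplot of eruption durations
> hist(faithful$eruptions)   # histogram of eruption durations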
February 2, 2007
Quantiles and the Five Number Summary
1. What is a median?
A1. the middle value of a sorted data set (or the average of the two middle values if there is an even number of data values)
A2. a value with half the distribution above and half the distribution below
2. q-quantile is a value such that the proportion of the distribution below the value is q (and 1-q is
above)
• percentile is the same thing expressed as a percentage instead of as a proportion.
• quartiles, quintiles, deciles also common
• median = .5-quantile = 50th percentile
3. Different methods for determining quantiles from data
• median of each half
◦ variations depending on whether we include or exclude the median of the original data
• ruler method:
◦ imagine the data arranged along a ruler and determine location of quantiles by measuring
◦ variations depending on where the data are “positioned” along the ruler.
4. Boxplots are a graphical version of the five-number summary (minimum, lower quartile, median, upper quartile, maximum).
Useful R Commands
• quantile() and fivenum() give (different) versions of quantiles.
• boxplot() and bwplot() produce boxplots. bwplot() requires the lattice package (which is installed
as part of the basic R installation but may need to be loaded).
• summary() – summarize an object
The summary() command does different things depending on the type of object it is given. For a
data frame it gives a summary of each variable. For quantitative variables it gives the five-number
summary plus the mean. For a categorical variable it provides (a portion of) the distribution as a
table. summary() can be used on many other kinds of objects as well.
> summary(1:10)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    3.25    5.50    5.50    7.75   10.00
• The colon operator can be used to make a consecutive sequence. seq() gives finer control:
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(0,100,by=20)
[1]   0  20  40  60  80 100
• table(x) gives a table of the values in the vector x.
• cut() makes a categorical variable (which R calls a factor) from a quantitative variable by placing
each value into a bin. Bin divisions are specified with the breaks argument. This is much like the first
step in forming a histogram. Can you figure out this example?
> table(cut((1:10)^2,breaks=c(0,20,40,60,80,100)))
  (0,20]  (20,40]  (40,60]  (60,80] (80,100]
       4        2        1        1        2
• cumsum() is sometimes handy. Here is an example
> cumsum(1:10)
[1] 1 3 6 10 15 21 28 36 45 55
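Putting a few of these commands together, here is a small sketch using the built-in faithful data
(quantile() and fivenum() can give slightly different answers because they use different rules):
> quantile(faithful$eruptions)    # 0%, 25%, 50%, 75%, 100% quantiles
> fivenum(faithful$eruptions)     # Tukey-style five-number summary
> summary(faithful$eruptions)     # five-number summary plus the mean
> boxplot(faithful$eruptions)     # graphical version of the five-number summary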
February 5, 2007
Randomness & Probability
1. Calculation of variance and standard deviation
2. Two key properties of randomness
(a)
(b)
3. Important terminology
• outcome:
• sample space:
• event:
4. Random Variables
• Notation: P(X = 8), P(X ≤ 7), etc.
• Important example: sampling distributions
5. Two types of probability
(a)
(b)
Before class tomorrow: Find a penny (the newest American penny you can locate) and flip/spin/tip it
50 times, keeping track of how many heads and tails you get for each.
Useful R Commands
R has several utilities for working with important random variables. The Lady Tasting Tea situation that we
talked about on day one is an example of a binomial random variable. We’ll learn more about this random
variable in the coming days.
• rbinom(n,size,prob) – simulates X n times, where X is a binomial random variable with success
probability prob and size trials.
To simulate 1000 Ladies Tasting Tea who are just guessing: rbinom(1000,10,.5).
• dbinom(q,size,prob) – calculates P(X = q) where X is a binomial random variable with success
probability prob and size trials.
For the Lady Tasting Tea, dbinom(8,10,.5) gives the probability of getting exactly 8 of 10 correct
just by guessing. We’ll learn how to do this calculation by hand tomorrow.
• pbinom(q,size,prob) – calculates P(X ≤ q) where X is a binomial random variable with success
probability prob and size trials.
For the Lady Tasting Tea, 1 - pbinom(7,10,.5) gives the probability of getting at least 8 of 10
correct just by guessing.
• There are versions of the functions above for many other random variables, too. The names are all
rdist, ddist, pdist, where dist is replaced by some version of the name of the distribution. Stay
tuned.
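As a quick check on these commands, here is a sketch comparing simulation with the exact calculation
for the Lady Tasting Tea (the seed is arbitrary; it just makes the simulation reproducible):
> set.seed(123)                              # arbitrary seed for reproducibility
> sims <- rbinom(10000, size=10, prob=.5)    # 10000 guessing ladies, 10 cups each
> mean(sims >= 8)                            # simulated P(X >= 8)
> 1 - pbinom(7, size=10, prob=.5)            # exact P(X >= 8), about 0.055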
February 6, 2007
Probability calculations using the Theoretical Method
1. Three Probability Axioms
(a) P(E) ∈ [0, 1]
(b) P(S) = 1 (where S is the sample space)
(c) Probability of disjoint union is sum of probabilities.
• P(A ∪ B) = P(A) + P(B) provided A ∩ B = ∅.
• P(A1 ∪ A2 ∪ · · · ∪ Ak) = P(A1) + P(A2) + · · · + P(Ak) provided Ai ∩ Aj = ∅ whenever i ≠ j.
• P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai) provided Ai ∩ Aj = ∅ whenever i ≠ j.
2. More Probability Rules and Examples
(a) Equally Likely Rule: If every outcome in a sample space is equally likely, then
P(E) = |E|/|S| = N(E)/N(S) = (size of event)/(size of sample space)
• probability of heads is .5 if heads and tails are equally likely
• many other applications (fair dice, shuffled cards, etc.)
• easy to apply if we can determine sizes of E and S (see item 3 below)
(b) Complement Rule: P(E) = 1 − P(E′).
• Every probability problem is really two probability problems; we can choose to solve the
“easier” of the two.
(c) Inclusion/Exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
3. Counting without counting
To use the equally likely rule, we need to be able to count the number of outcomes in our event and in
the sample space. For small sets, we can do this by inspection, but for larger sets, we need to “count
without counting”.
(a) Bijection Rule: Don’t double count, don’t skip
(b) Sum Rule: Divide and Conquer
(c) Difference Rule: Compensate for double counting
(d) Product Rule: One stage at a time
• Permutations
(e) Division Rule: Another way to compensate for “double” counting
• Combinations (binomial coefficients)
• Lady Tasting Tea (guessing)
Examples: Calculating Theoretical Probabilities
1. Equally Likely Rule: If every outcome in a sample space is equally likely, then
P(E) = |E|/|S| = N(E)/N(S) = (size of event)/(size of sample space)
• Only applies when all outcomes are equally likely.
• Easy way to determine probabilities if the counting problems are easy; can be challenging when
counting is more difficult.
• Two easy examples:
◦ Probability of rolling doubles (with two fair 6-sided dice)
◦ Daily 3 Lottery game
2. Permutations and Combinations as systematic counting methods
(a) Probability of getting a (5-card) flush
• ordered hand method (using multiplication principle)
Number of hands = 52 · 51 · 50 · 49 · 48
Number of flushes = 52 · 12 · 11 · 10 · 9
• unordered hand method
Each unordered hand has 5 · 4 · 3 · 2 · 1 orderings. So
Number of hands = (52 · 51 · 50 · 49 · 48)/(5 · 4 · 3 · 2 · 1)
Number of flushes = (52 · 12 · 11 · 10 · 9)/(5 · 4 · 3 · 2 · 1)
(b) Permutations: Pk,n = number of ways to order k objects from a set of n objects.
Pk,n = n(n − 1)(n − 2) · · · (n − k + 1) = n!/(n − k)!
(c) Combinations: (n choose k) = number of ways to select k objects from a set of n objects (order doesn’t
matter).
(n choose k) = Pk,n/k! = n!/(k!(n − k)!)
• Probability of getting a 5-card flush = [4 · (13 choose 5)] / (52 choose 5)
(numerator: choose a suit, then 5 cards from that suit; denominator: choose 5 cards from 52)
Useful R Commands
• factorial(n) – compute n! (n factorial)
• choose(n,k) – compute (n choose k), the binomial coefficient
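For instance, a sketch reproducing a couple of the calculations above with these functions:
> factorial(5)                                # 5! = 120 orderings of a 5-card hand
> choose(4,1) * choose(13,5) / choose(52,5)   # P(5-card flush), about 0.00198
> choose(10,8) / 2^10                         # Lady Tasting Tea: P(exactly 8 of 10 correct)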
February 8, 2007
More Probability Problems
1. Lady Tasting Tea
Suppose the lady is just guessing, let X be the number of correct guesses.
• P(X = 8) = (number of ways to guess 8 correctly in 10 tries)/(number of ways to guess) = (10 choose 8)/2^10
• R can easily make a table for different numbers of correct guesses:
> cbind(5:10,dbinom(5:10,10,.5),1/dbinom(5:10,10,.5))
[1,]  5 0.2460937500    4.063492   # about 1 time in 4
[2,]  6 0.2050781250    4.876190
[3,]  7 0.1171875000    8.533333
[4,]  8 0.0439453125   22.755556
[5,]  9 0.0097656250  102.400000
[6,] 10 0.0009765625 1024.000000   # 1 time in 1024
Not very likely to get 8 or more correct by guessing.
2. Probability of rolling 5 dice and getting at least two that match.
• Think complement: P(at least two match) = 1 − P(all different).
• P(all different) = (6 · 5 · 4 · 3 · 2)/6^5
3. Let X be the number of suits in a 5-card hand, what is the distribution of X?
> p <- rep(NA,4)
> p[1] <- choose(4,1) * choose(13,5) / choose(52,5)
> p[2] <- choose(4,2) * ( choose(26,5) - (choose(13,5)+choose(13,5))) / choose(52,5)
> p[4] <- choose(4,1) * choose(13,2) * 13 * 13 * 13 / choose(52,5)
> p[3] <- 1 - sum(p[-3])      # sum of all probabilities must be 1
> rbind(1:4,p)
  1.000000000 2.0000000 3.0000000 4.0000000
p 0.001980792 0.1459184 0.5883553 0.2637455
Conditional Probability
1. Q: If a family has two kids and at least one is a boy, what is the probability that both are boys?
GG   GB   BG   BB        (answer: probability = 1/3)
2. General Formula: Probability of A given B = P(A | B) = P(A ∩ B)/P(B)
• a Venn diagram is useful to picture this
February 9, 2007
Conditional Probability
1. Definition: P(A | B) = P(A ∩ B)/P(B)
2. Example: Suppose items are produced on two assembly lines. Some are good and some are defective
(bad). One day the production data were as follows:
                    Bad   Good
Assembly Line 1       2      6
Assembly Line 2       1      9
• P(Bad) = 3/18;    P(AL 1) = 8/18
• P(Bad | AL 1) = 2/8 = (2/18)/(8/18);    P(AL 1 | Bad) = 2/3 = (2/18)/(3/18)
3. Turning the formula around: P (A ∩ B) = P(A) · P(B | A) = P(B) · P(A | B)
(a) Example: Yahtzee; roll 5 dice
• P(5 different numbers) = 1 · 5/6 · 4/6 · 3/6 · 2/6 = 0.0925926
• P(large straight | 5 different numbers) = 2/6 = 1/3 (6 possible missing numbers – two yield a
large straight.)
• So P(large straight) = P(5 different) · P(large straight | 5 different) = (5/6 · 4/6 · 3/6 · 2/6) · (1/3) = .0308642
(b) Example: Medical Tests.
Suppose a test correctly identifies diseased people 98% of the time, and correctly identifies healthy
people 99% of the time. Furthermore assume that one person in 1000 has the disease. If a random person is
tested and the test comes back positive, what is the probability that the person has the disease?
• P(D) = 0.001; P(H) = 0.999
• P(+ | D) = 0.98; P(− | H) = .99
P(− | D) = 0.02; P(+ | H) = .01
• P(D | +) = P(D ∩ +)/P(+) = [P(D) · P(+ | D)] / [P(D ∩ +) + P(H ∩ +)] = (.001 · .98)/(.001 · .98 + .999 · .01) = 0.0893
(Wow!)
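A sketch of the same calculation in R (the variable names are just for illustration):
> pD <- 0.001; pH <- 0.999                   # P(D) and P(H)
> pPosD <- 0.98; pPosH <- 0.01               # P(+ | D) and P(+ | H)
> pD * pPosD / (pD * pPosD + pH * pPosH)     # P(D | +), about 0.089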
Independence
1. If P (A | B) = P(A) then we say that A and B are independent.
• Intuition: knowing whether or not B occurs gives no information to change the probabilities that
A occurs; the two events are independent.
2. If A and B are independent, then P (A ∩ B) = P(A) · P(B).
3. Two ways independence shows up in statistics:
(a) As a hypothesis to be tested using data. (E.g., does it appear that smoking and getting lung
cancer are independent?)
(b) As an assumption (because it makes probability calculations easier). For example, we will usually
assume some sort of independence in our sampling methods. (Stay tuned.)
February 12, 2007
Discrete Distributions
1. Can be described by giving a table of probabilities.
2. Can be described by giving the pmf (probability mass function, also called distribution function), a
function giving values for P(X = x).
3. Can be described by giving the cdf (cumulative distribution function), a function giving values for
P(X ≤ x).
4. Often we can give a single formula for all the distributions in a family of distributions. Such
formulas will mention parameters – numbers that distinguish the members of the family from one
another. (See examples below.)
Some Important Families of Distributions
1. Geometric Distributions
• two outcomes for each trial (S and F)
• probability of S on each trial is constant p
• each trial independent of others
• repeat until we get S
If we let X count the number of failures before success, then X is geometric with parameter p.
• pdf = g(x; p) = (1 − p)^x p
2. Binomial Distributions
• two outcomes for each trial (S and F)
• probability of S on each trial is constant p
• each trial independent of others
• repeat a predetermined number of times (n)
If we let X count the number of successes in n tries, then X is binomial with parameters n and p.
• pdf = b(x; n, p) = (n choose x) p^x (1 − p)^(n−x)
We can use binomial distributions to model the lady tasting tea under the assumption that her probability of being correct is p.
Useful R Commands
• dbinom(x,size,prob) = pmf for a binomial distribution (think d for distribution).
• pbinom(x,size,prob) = P(X ≤ x) = cdf for a binomial distribution.
• dgeom(), pgeom() do the same for geometric distribution.
• Similar functions exist for many families of distributions.
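For example, a quick sketch checking the geometric pmf formula against dgeom():
> dgeom(0:3, prob=0.5)        # P(X = 0), ..., P(X = 3) for a geometric with p = 0.5
> (1 - 0.5)^(0:3) * 0.5       # the same values from the formula (1 - p)^x * p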
February 13, 2007
Describing Discrete Distributions
Let X be a discrete random variable with pmf f .
1. Pictures: Probability histograms and line graphs (pages 103–104)
2. Expected Value (mean)
• motivating example: computing GPA = “average grade”
• E(X) = µ_X = Σ x · f(x) = Σ (value · probability)
• Easy to do in R for familiar finite distributions. Example:
> sum(0:10 * dbinom(0:10,size=10,prob=.8))    # = 8
• Expected Value of a function: E(h(X)) = Σ h(x) · f(x).
3. Variance (and standard deviation)
• Var(X) = σ_X² = E[(X − µ_X)²] = Σ (x − µ_X)² f(x)
• Alternate Formula: Var(X) = E(X²) − [E(X)]²
Proof: Expand algebraically; split into 3 sums; simplify each sum.
4. Expected Value of Linear Function: E(aX + b) = a E(X) + b
5. Variance of Linear Function: Var(aX + b) = a² Var(X)
Examples
1. Daily 3 Lottery played straight
W = winnings ($):    0       500
probability:         .999    .001
2. Bernoulli random variable (with parameter p)
value of X:      0        1
probability:     1 − p    p
3. Searching for either one of two out of 5 possibilities. (Sock drawer, for example.) Y = number of
attempts until one is found.
value of Y:      1     2     3     4
probability:     .4    .3    .2    .1
4. Daily 3 Lottery played boxed (123)
Z = winnings ($):    0       41      291
probability:         .994    .005    .001
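A sketch computing the mean and variance for Example 3 in R (any of the tables above could be handled
the same way):
> y <- c(1, 2, 3, 4); p <- c(.4, .3, .2, .1)
> mu <- sum(y * p); mu            # E(Y) = 2
> sum((y - mu)^2 * p)             # Var(Y) = 1
> sum(y^2 * p) - mu^2             # alternate formula gives the same answer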
Important Named Discrete Distributions
family (R name): Bernoulli
    pmf: f(1) = p, f(0) = 1 − p
    mean: p        variance: pq        notes: q = 1 − p
family (R name): binomial (binom)
    pmf: b(x; n, p) = (n choose x) p^x q^(n−x)
    mean: np       variance: npq       notes: q = 1 − p
family (R name): geometric (geom)
    pmf: g(x; p) = q^x p
    mean: q/p = 1/p − 1       variance: q/p²       notes: q = 1 − p
family (R name): negative binomial (nbinom)
    pmf: nb(x; p, r) = (x+r−1 choose r−1) q^x p^r
    mean: r · q/p = r · (1/p − 1)       variance: rq/p²       notes: q = 1 − p
family: hypergeometric [text book]
    pmf: h(x; n, M, N) = (M choose x)(N−M choose n−x) / (N choose n)
    mean: np       variance: [(N − n)/(N − 1)] · npq       notes: p = M/N, q = 1 − p
family (R name): hypergeometric (dhyper(x,m,n,k))
    pmf: (m choose x)(n choose k−x) / (m+n choose k)
    mean: kp       variance: [(m + n − k)/(m + n − 1)] · kpq       notes: p = m/(n+m), q = 1 − p
• Both R and our text count the number of failures to get the value for the random variable called
geometric or negative binomial. Some authors count instead the number of trials (successes + failures).
• The book and R parameterize the hypergeometric distributions in different (but equivalent) ways. It
is traditional to describe the hypergeometric model in terms of selecting objects (usually balls or
marbles) from an urn. Each object has one of two colors. Given this description, here are the two
parameterizations:
                                        book      R
number of objects in urn                 N        n + m
number of objects with target color      M        m
number of objects with other color       N − M    n
number of objects drawn from urn         n        k
In both cases the random variable counts the number of items drawn from the urn that have the
“target” color.
February 15, 2007
More Examples of Discrete Distributions
1. Lady Tasting Tea (again)
Suppose that we are once again in the business of testing whether the lady can really tell the difference
between the two infusion orders. This time we adopt a different strategy. We will again prepare 10 cups
of tea, but this time we make sure that 5 cups are prepared each way. Does this matter? Is it a better
idea than our other method? Worse?
Suppose the lady is just guessing. She knows there are five cups of each type, so she is asked to place a
marker by the 5 cups that had tea put in them first. Let X be the number of these that are correctly
placed. Let’s work out the pmf for X:
value of X:      0    1    2    3    4    5
probability:
What is P(X ≥ 4)? How does it compare with getting 8 or more correct in the binomial situation?
2. The hypergeometric distribution
The random variable in item 1 is an example of a hypergeometric distribution. It is traditional to
describe this distribution in terms of an urn model. Suppose
• m white balls are in the urn,
• n black balls are in the urn,
• k balls are selected at random from the urn, and
• X = the number of white balls selected (among the k);
then X has a hypergeometric distribution. The pmf for X (using R’s parameterization) is
h(x; m, n, k) =
3. Going to a Play
Here are the data from our class survey about going to a play after losing either a ticket or cash.
                attend play?
                no    yes
lost cash        9     13
lost ticket     14      9
The proportion of survey participants who would buy a new ticket differs depending on what was lost,
but is this just random chance, or are people who lose cash really more likely to buy a ticket than
those who lose a ticket?
Useful R Code
> read.csv(’http://www.calvin.edu/~rpruim/data/survey/littleSurvey.csv’) -> survey
> dim(survey)
[1] 279 27
> names(survey)        # output suppressed
> table(survey$section)
 143A  143B  143C  143D  243A dcm07
   21    16    20   110    45    67
> summary(survey)      # output suppressed
> levels(survey$playVer)
[1] "v1" "v2"
> levels(survey$playVer) = c(’lostCash’,’lostTicket’)
> xtabs(~play+playVer,survey)
      playVer
play   lostCash lostTicket
  no         61         69
  yes       103         44
> xtabs(~play+playVer,survey,subset=section==’243A’)
      playVer
play   lostCash lostTicket
              0          0
  no          9         14
  yes        13          9
> round(dhyper(0:5,5,5,5),digits=4)
[1] 0.0040 0.0992 0.3968 0.3968 0.0992 0.0040
> round(dhyper(0:22,22,23,22),digits=4)
[1] 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0044 0.0203 0.0635
[10] 0.1382 0.2124 0.2317 0.1797 0.0987 0.0381 0.0102 0.0018 0.0002
[19] 0.0000 0.0000 0.0000 0.0000 0.0000
> plot(dhyper(0:22,22,23,22))
> xyplot(dhyper(0:22,22,23,22)~0:22,groups=dhyper(0:22,22,23,22) <= dhyper(9,22,23,22),pch=16)
> fisher.test(xtabs(~play+playVer,survey,subset=section==’243A’))
Fisher’s Exact Test for Count Data
data: xtabs(~play + playVer, survey, subset = section == "243A")
p-value = 0.238
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.1141432 1.7066640
sample estimates:
odds ratio
0.4533593
> fisher.test(xtabs(~play+playVer,survey))
Fisher’s Exact Test for Count Data
data: xtabs(~play + playVer, survey)
p-value = 0.0001362
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.2236475 0.6366559
sample estimates:
odds ratio
0.3790412
> binom.test(8,10,.5)
Exact binomial test
data: 8 and 10
number of successes = 8, number of trials = 10, p-value = 0.1094
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.4439045 0.9747893
sample estimates:
probability of success
0.8
> 1- pbinom(7,10,.5)
[1] 0.0546875
> 1- pbinom(7,10,.5) + pbinom(2,10,.5)
[1] 0.109375
February 16 & 19, 2007
Rules for Expected Value and Variance
The following three “rules” can be proven by some straightforward algebra applied to the definitions.
1. E(aX + b) = a E(X) + b
2. Var(aX + b) = a2 Var(X)
3. Var(X) = E(X 2 ) − [E(X)]2
Continuous Distributions
1. Definitions
(a) continuous vs. discrete
(b) pdf (probability density function)
• Key Idea:
• definition of probability using pdf: P(a ≤ X ≤ b) = ∫_a^b f(x) dx
(c) cdf (cumulative distribution function): F(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt
2. Examples
(a) uniform distribution: pdf is constant on an interval, what constant?
(b) f(x) = kx² on [0, 1], what is k?
(c) f(x) = ke^(−x) on [0, ∞), what is k?
3. Expected Value and Variance
• Key Idea:
• Expected Value: E(X) = ∫_{−∞}^{∞} x f(x) dx
• Variance: Var(X) = ∫_{−∞}^{∞} (x − µ_X)² f(x) dx
• Three Rules for Expected Value and Variance still true.
Useful R Commands
• f <- function(x) { x^2 } – define a function
• integrate(f,0,3) – do numerical integration and give a text report that includes the estimate and an
indication of its accuracy
• integrate(f,0,3)$value – do numerical integration, and return the result as a number
• integrate(function(x) {exp(-x)},0,Inf) – numerical approximation of an improper integral
• plot(f,xlim=c(0,3)) – make a sketch of the graph of f
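For example, a sketch using integrate() on item 2(b) above (f(x) = kx² on [0, 1]):
> integrate(function(x) { x^2 }, 0, 1)$value        # = 1/3, so k must be 3
> integrate(function(x) { 3*x^2 }, 0, 1)            # check: total probability is 1
> integrate(function(x) { x * 3*x^2 }, 0, 1)$value  # E(X) = 3/4 for this pdf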
February 20, 2007
Quantiles of Continuous Distributions
1. To find the 90th percentile of a continuous distribution, solve for q in the equation below:
P(X ≤ q) = ∫_{−∞}^q f(x) dx = pdist(q) = .90
2. Other quantiles are found by replacing .90 with the appropriate proportion.
3. Median = 50th percentile = .5-quantile.
4. R has this built in for commonly used distributions. Example: qexp(.90) gives the 90th percentile of
the exponential distribution with mean and variance equal to 1.
The Normal Distributions
Undoubtedly the most important family of continuous distributions for statistics is the family of normal
distributions. This distribution (and some of its cousins) will occupy us nearly every day for the remainder
of the semester.
1. Reason for importance: Central Limit Theorem (stay tuned)
2. pdf: f(x; µ, σ) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)) = dnorm(x, mean=µ, sd=σ)
• Nasty to integrate; usually we can only approximate it numerically
• f (x) has symmetric, bell-shaped graph (the bell curve)
◦ maximum when x = µ
◦ inflection points at µ ± σ
• E(X) = µ; Var(X) = σ²
3. Linear transformations of normal random variables are normal
(a) We can determine the mean and variance using our rules.
• If X ∼ N (µ, σ), and Y = X − µ then Y ∼ N (0, σ).
• If X ∼ N(µ, σ), and Z = (X − µ)/σ, then Z ∼ N(0, 1).
(b) One distribution to rule them all. Questions about X ∼ N (µ, σ) can be converted into
questions about the standard normal distribution: Z ∼ N (0, 1). For example,
P(a ≤ X ≤ b) = P( (a − µ)/σ ≤ Z ≤ (b − µ)/σ )
(c) 68-95-99.7 Rule
P(µ − σ ≤ X ≤ µ + σ) = P (−1 ≤ Z ≤ 1) ≈ .68
P(µ − 2σ ≤ X ≤ µ + 2σ) = P (−2 ≤ Z ≤ 2) ≈ .95
P(µ − 3σ ≤ X ≤ µ + 3σ) = P (−3 ≤ Z ≤ 3) ≈ .997
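These values are easy to confirm with pnorm() (a quick sketch):
> pnorm(1) - pnorm(-1)    # about 0.68
> pnorm(2) - pnorm(-2)    # about 0.95
> pnorm(3) - pnorm(-3)    # about 0.997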
4. Some Notation
(a) cdf for standard normal distribution: Φ(x) = pnorm(x)
(b) “backwards quantile”: zα = qnorm(1 − α), that is
P(Z ≤ −zα ) = P(Z ≥ zα ) = α
5. Computers, Calculators, and Normal Probability Tables
• In the old days, when computers and calculators were not so readily available to do calculations
for us, normal probabilities were obtained from reference tables (as were values of things like
sine, cosine, logarithms, etc.) These probability tables still appear in the backs of most statistics
books. See pages 740–741 of our book, for example.
• Some calculators are also capable of providing you with values equivalent to those given by
pnorm() and qnorm().
February 22, 2007
An Important Situation
Let’s suppose that
• some event is happening “at random times”, and that
• the probability of an event occurring in any small time interval depends only on (and is proportional
to) the length of the interval, not on when it occurs.
We are interested in two random variables associated with such a situation:
• X counts the number of occurrences in a fixed amount of time. [X is discrete, taking values 0, 1, 2, . . . ]
• Y measures the time until the next occurrence.
[Y is continuous on the interval [0, ∞).]
Poisson Distributions
1. What pmf would be a reasonable model for X?
• If we divide our time into n pieces, then the probability of an occurrence in one of these subintervals
is proportional to 1/n, so let’s call the probability λ/n.
• If n is large, then the probability of having two occurrences in one subinterval is very small – so
let’s just pretend it can’t happen.
• If we assume that occurrences in each interval are independent from one another, then a good
approximation for X is
X ≈ Binom(n, λ/n)
because we have n independent subintervals with probability λ/n of an occurrence in each one.
• This approximation should get better and better as n → ∞. When n is very large,
  P(X = x) ≈ (n choose x) (λ/n)^x (1 − λ/n)^(n−x)
           = [ n!/(x!(n − x)!) ] · (λ^x/n^x) · (1 − λ/n)^n · (1 − λ/n)^(−x)
           ≈ (n^x/x!) · (λ^x/n^x) · e^(−λ) · 1
           = e^(−λ) λ^x / x!
• So let’s let p(x; λ) = e^(−λ) λ^x / x! be the pmf for a Poisson random variable with rate parameter
λ. (We’ll see why it is called a rate parameter in just a moment.)
• We should check that this definition is legitimate.
  ◦ Recall from calculus that Σ_{x=0}^{∞} λ^x/x! = e^λ.    [Think: Taylor series.]
  ◦ So Σ_{x=0}^{∞} e^(−λ) λ^x/x! = 1.    [Good thing.]
2. Mean and Variance of X
• E(X) = Σ_{x=0}^{∞} x e^(−λ) λ^x/x! = Σ_{x=1}^{∞} x e^(−λ) λ^x/x! = λ Σ_{x=1}^{∞} e^(−λ) λ^(x−1)/(x−1)! = λ Σ_{x=0}^{∞} e^(−λ) λ^x/x! = λ
  ◦ If on average we expect λ occurrences in our fixed amount of time, then the rate of occurrences
  is λ per unit time. Thus λ is a rate.
• E(X²) = Σ_{x=0}^{∞} x² e^(−λ) λ^x/x! = Σ_{x=1}^{∞} x² e^(−λ) λ^x/x! = λ Σ_{x=1}^{∞} x e^(−λ) λ^(x−1)/(x−1)! = λ Σ_{x=0}^{∞} (x+1) e^(−λ) λ^x/x! = λ(λ + 1)
• So Var(X) = λ(λ + 1) − λ · λ = λ
3. Example: Football fumbles
• Is a Poisson distribution a good model?
4. Example: Customers come to a small business at an average rate of 6 per hour. Let’s assume that a
Poisson distribution is a good model.
• How unusual is it to go 20 minutes without any customers?
• How unusual is it to have 10 or more customers in an hour?
• Rank in order of likelihood the following three events: (a) fewer than six customers in an hour,
(b) exactly six customers in an hour, (c) more than six customers in an hour.
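A sketch of how these questions could be handled in R (20 minutes is 1/3 of an hour, so the Poisson rate
for that interval is λ = 6/3 = 2):
> dpois(0, lambda=2)         # P(no customers in 20 minutes) = e^(-2), about 0.135
> 1 - ppois(9, lambda=6)     # P(10 or more customers in an hour)
> dpois(6, lambda=6)         # (b) exactly six customers in an hour
> ppois(5, lambda=6)         # (a) fewer than six customers in an hour
> 1 - ppois(6, lambda=6)     # (c) more than six customers in an hour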
Exponential Distributions
1. What is the pdf for Y ?
Suppose events occur at a rate of λ per unit time; recall that Y = time until the next occurrence.
• Let’s get the cdf first:
P(Y ≤ y) = 1 − P(Y > y) = 1 − P(0 occurrences in time [0, y]) = 1 − e^(−λy)
where we get the last probability using the Poisson distribution!
• Differentiate to get the pdf:
f(y) = d/dy [ 1 − e^(−λy) ] = λ e^(−λy)
2. Some integration by parts shows that
• E(Y) = 1/λ
• Var(Y) = 1/λ²
3. Example: Football fumbles again.
• How long should we expect to wait until the next fumble?
• What is the probability of having no fumbles in the first half of a game?
Important Distributions
family: binomial
    pmf: dbinom(x,size=n,prob=p) = (n choose x) p^x q^(n−x)
    mean: np       variance: npq       notes: q = 1 − p
family: hypergeometric
    pmf: dhyper(x,m,n,k) = (m choose x)(n choose k−x) / (m+n choose k)
    mean: kp       variance: [(m + n − k)/(m + n − 1)] · kpq       notes: p = m/(n+m), q = 1 − p
family: negative binomial
    pmf: dnbinom(x,size=r,prob=p) = (x+r−1 choose r−1) q^x p^r
    mean: r · q/p       variance: rq/p²       notes: q = 1 − p
family: Poisson
    pmf: dpois(x,lambda=λ) = e^(−λ) λ^x / x!
    mean: λ        variance: λ         notes: λ = rate
family: uniform
    pdf: dunif(x,a,b) = 1/(b − a) on [a, b]
    mean: (b + a)/2       variance: (b − a)²/12
family: normal
    pdf: dnorm(x,mean=µ,sd=σ)
    mean: µ        variance: σ²
family: exponential
    pdf: dexp(x,rate=λ) = λ e^(−λx)
    mean: 1/λ      variance: 1/λ²      notes: λ = rate
• gamma – a generalization of exponential with two parameters (α = shape and β = scale = 1/rate).
Possible values are [0, ∞). The distributions are all right skewed. The exponential distributions arise
when α = 1 and β = 1/λ.
• Weibull – another generalization of exponential, again with two parameters (α = shape and β =
scale = 1/rate). Possible values are [0, ∞). Not all Weibull distributions are right skewed. The
Weibull distributions have proved to be good models of “time to failure” data (how long manufactured
parts last before failure, for example). The exponential distributions arise when α = 1 and β = 1/λ.
• beta – the standard Beta distributions are continuous distributions that give values between 0 and 1.
(They can be shifted or stretched to give values on any other interval, if desired.) There are two shape
parameters (α = shape1 and β = shape2). The uniform distribution on [0, 1] is a special case when
α = β = 1.
February 27, 2007
Some Properties of Expected Value & Variance
Let’s suppose that X and Y are discrete random variables with pmf’s p(x) and q(y). Recall that X and Y
are independent if and only if P (X = x & Y = y) = p(x) · q(y). Then the following can be verified by
looking at the sums involved and using some algebra:
1. For any rv’s X and Y, E(X + Y) = E(X) + E(Y). (Note: independence not required.)
   E(X + Y) = Σ_{x,y} (x + y) · P(X = x & Y = y)
            = Σ_{x,y} [ x · P(X = x & Y = y) + y · P(X = x & Y = y) ]
            = Σ_x Σ_y x · P(X = x & Y = y) + Σ_y Σ_x y · P(X = x & Y = y)
            = Σ_x x · Σ_y P(X = x & Y = y) + Σ_y y · Σ_x P(X = x & Y = y)
            = Σ_x x · P(X = x) + Σ_y y · P(Y = y)
            = E(X) + E(Y)
2. If X and Y are independent, then E(XY) = E(X) · E(Y).
   E(XY) = Σ_{x,y} (xy) · P(X = x & Y = y)
         = Σ_{x,y} (xy) · p(x) q(y)
         = Σ_x Σ_y xy · p(x) q(y)
         = Σ_x x p(x) · Σ_y y · q(y)
         = Σ_x x p(x) · E(Y)
         = E(X) · E(Y)
3. If X and Y are independent, then Var(X + Y) = Var(X) + Var(Y).
   Var(X + Y) = E((X + Y)²) − [E(X + Y)]²
              = E(X² + 2XY + Y²) − [E(X) + E(Y)]²
              = E(X²) + E(2XY) + E(Y²) − [ (E(X))² + 2 E(X) E(Y) + (E(Y))² ]
              = E(X²) + 2 E(X) E(Y) + E(Y²) − [ E(X)² + 2 E(X) E(Y) + E(Y)² ]
              = E(X²) + E(Y²) − E(X)² − E(Y)²
              = E(X²) − E(X)² + E(Y²) − E(Y)²
              = Var(X) + Var(Y)
Continuous Variables
The results above are still true for continuous variables. The proofs are almost the same, after replacing
sums with integrals and pmf’s with pdf’s.
Application: Mean and Variance of Binomial Distributions
If X ∼ Bin(n, p), we can use these properties to determine the mean and variance of X by writing
X = X1 + X2 + · · · + Xn,
where each Xi is an independent Bernoulli random variable with probability p of success. Recall that
E(Xi) = p and Var(Xi) = p(1 − p).
E(X) = E(X1 + X2 + · · · + Xn ) = E(X1 ) + E(X2 ) + · · · + E(Xn ) = p + p + · · · + p = np
Var(X) = Var(X1 + X2 + · · · + Xn ) = Var(X1 ) + Var(X2 ) + · · · + Var(Xn ) = np(1 − p)
An Extra Property for Normal Distributions
If X and Y are independent and normally distributed, then X + Y will also be normal (and we can calculate
the mean and variance using the methods just described).
Example: If X ∼ N(10, 3) and Y ∼ N(12, 4), what is P(X ≥ Y )?
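A sketch of the example: if X and Y are independent, then X − Y is normal with mean 10 − 12 = −2 and
standard deviation √(3² + 4²) = 5, so
> 1 - pnorm(0, mean=-2, sd=5)     # P(X - Y >= 0) = P(X >= Y), about 0.34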
March 2, 2007
The Distribution of X̄
Consider the following common sampling situation:
• population with mean µ and variance σ²
• simple random sample (SRS): X1, X2, . . . , Xn
• X̄ = (X1 + X2 + · · · + Xn)/n = (1/n)(X1 + X2 + · · · + Xn)
It is important to distinguish three distributions:
1. Population distribution
2. Each particular sample has a distribution
3. The sampling distribution: distribution of X̄ over many samples.
We want to know about the sampling distribution of X̄.
• First apply our rules for means:
  E(X̄) = E[(1/n)(X1 + X2 + · · · + Xn)] = (1/n) E(X1 + X2 + · · · + Xn)
       = (1/n)(E(X1) + E(X2) + · · · + E(Xn))
       = (1/n)(µ + µ + · · · + µ)
       = (1/n) · nµ = µ
• Then apply our rules for variances, remembering that for an SRS the Xi’s are independent:
  Var(X̄) = Var[(1/n)(X1 + X2 + · · · + Xn)] = (1/n²) Var(X1 + X2 + · · · + Xn)
         = (1/n²)(Var(X1) + Var(X2) + · · · + Var(Xn))
         = (1/n²)(σ² + σ² + · · · + σ²)
         = (1/n²) · nσ² = σ²/n
• And if the population is normal, then the distribution of X̄ is normal, too.
• Better still,
Central Limit Theorem. For any population with mean µ and standard deviation σ,
if n is large enough, then the distribution of X̄ ≈ N(µ, σ/√n).
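A small simulation sketch illustrating the CLT for a decidedly non-normal population (an exponential
population with mean 1; the sample size 16 and the 5000 repetitions are arbitrary choices):
> xbars <- replicate(5000, mean(rexp(16, rate=1)))    # 5000 sample means, each with n = 16
> mean(xbars); sd(xbars)     # close to mu = 1 and sigma/sqrt(n) = 1/4
> hist(xbars)                # roughly bell-shaped, as the CLT predicts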
March 5, 2007
An Important General Setting
1. We are interested in information about some unknown parameter(s) of a population or process.
• We will often make some assumptions about the population distribution. Examples:
◦ perhaps some parameters are known (e.g., we know the standard deviation but not the mean)
◦ maybe we assume we know the family of distributions it comes from (e.g., the population is
normal, but we don’t know the mean and standard deviation)
2. We obtain some information by means of a simple random sample (SRS) or by repetition of the process.
• X1, X2, . . . , Xn – notation for random sample thought of as random variables. Each Xi has the
same distribution as the population and is independent of the others. (We say they are i.i.d.,
independent and identically distributed.)
• x1 , x2 , . . . , xn – notation for a particular sample (our data).
3. Big question: What does our sample data reveal about the unknown parameters of the population?
(a) Estimation: Can we give good estimates for the unknown parameters, along with some indication
of accuracy?
• What proportion of cups does the lady tasting tea correctly identify?
◦ Unknown parameter: success probability.
◦ Desired answer: “probably” between ______ and ______.
(b) Hypothesis testing: Can we answer, with a specified degree of confidence, some yes/no question
about the unknown parameters?
• Does the lady tasting tea get the answer correct more than half the time? (Unknown
parameter: success probability.)
• We will develop formal procedures for many situations of this sort over the next weeks. For today, we
want to think a bit more generally about the process.
Estimators and Estimates
Let θ be an unknown parameter we want to know about.
• Statistic: a function from a sample to the real numbers (ℝ).
◦ We said earlier that a statistic is a number that describes a sample. This definition makes that
idea a bit more precise.
◦ Examples: mean, median, variance, standard deviation, quantiles, etc.
• Estimator: a statistic applied to X1 , X2 , . . . , Xn .
• Estimate: a statistic applied to x1 , x2 , . . . , xn .
• An estimator is a random variable, an estimate is a number.
◦ estimator: X̄ = (X1 + X2 + · · · + Xn)/n
◦ estimate: x̄ = (x1 + x2 + · · · + xn)/n
What makes a good estimator?
1. Unbiased: E(θ̂) = θ – it’s correct “on average”
2. Low Variance:
• Standard deviation of an estimator is called standard error and is denoted σθ̂ .
• If we can only estimate the standard error, we have an estimated standard error, which is denoted
σ̂θ̂ or sθ̂.
3. Known Distribution (or known good approximation to it) – that way we can make probability claims
about our estimator.
Estimators (µ̂) for the Population Mean (µ)
The sample mean (X) is a very nice estimator for µ:
• Always unbiased.
• Sampling distribution is approximately normal if n is large enough; exactly normal for any n if the
population is normal. (And easy formulas for mean and variance, too!)
• If the population is normal, it is the MVUE (minimum variance unbiased estimator).
But other estimators may be better in some situations:
• median may be better if the population distribution is symmetric and has heavy tails (since outlying
values have a big effect on the mean and are likely to be in our data).
• mean of extremes (mean of maximum and minimum) is best for a population that is uniform.
• trimmed mean works well (but usually not best) in a very wide range of situations. (But it is more
complicated to study mathematically.)
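A simulation sketch comparing three estimators of the mean of a Unif(0, 1) population (sample size 25
and 2000 repetitions are arbitrary choices):
> est <- replicate(2000, {
+   x <- runif(25)
+   c(mean=mean(x), median=median(x), midrange=mean(range(x)))
+ })
> apply(est, 1, mean)    # all three are (roughly) unbiased for 0.5
> apply(est, 1, sd)      # the mean of the extremes has the smallest standard error here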
Other Examples
1. Estimating maximum possible of uniform distribution.
2. Estimating variance
(a) if the mean is known
(b) if the mean is unknown
March 6, 2007
Sampling Distributions
• Yesterday: General ideas about estimation (estimators and estimates)
• Next two weeks: Two important examples
◦ Estimating µ, a population mean (quantitative data)
  · Estimator: X̄
  · Sampling Distribution: X̄ ∼ N(µ, σ/√n)
◦ Estimating p, a population proportion (categorical data)
  · Estimator: p̂ = X/n
  · Sampling Distributions: X ∼ Binom(n, p);  p̂ ≈ N(p, √(pq/n))
March 8–9, 2007
Confidence Intervals For the Mean of a Normal Population
Assumptions
1. Data are SRS: X1 , X2 , . . . , Xn
2. Population is normal
3. The population variance σ 2 is known. (The population mean µ is unknown.)
Key Idea
Since
• X̄ is “usually” “close to” µ, and
• we can quantify this probabilistically since we know X̄ ∼ N(µ, σ/√n), and
• whenever X̄ is close to µ, µ must be close to X̄,
we can try to find a number m such that
P(|X̄ − µ| ≤ m) = 1 − α
(quantifying “probably” and “close to”). This leads to an interval of the form (X̄ − m, X̄ + m) = X̄ ± m
around X̄ where 1 − α is the confidence level (α is the probably-part) and m is the margin of error (m is
the close-to-part).
The Magic Formula
If we specify a level of confidence of 1 − α, then
P(|X̄ − µ| ≤ z* SE) = 1 − α
where z* = z_{α/2} is “some number of standard deviations”, and SE = σ/√n is the standard error (standard
deviation of the sampling distribution). The critical number z_{α/2} is chosen so that
P(Z > z_{α/2}) = α/2
that is, we are choosing the critical number so that the fraction 1 − α of the standard normal distribution lies
within ±z_{α/2}. (Draw a quick picture when making these calculations.)
So our confidence interval becomes:
x̄ ± z_{α/2} · σ/√n
or more simply written:
x̄ ± z* SE
Robustness Issues
What if our assumptions are not met?
1. If the population is not normal. . .
If n is “large enough” then the sampling distribution for X is still approximately normal by the Central
Limit Theorem, so the methods are still approximately correct. “Large enough” depends on how the
population differs from normality. A good rule of thumb is that 30 is large enough for nearly all
distributions.
2. If the population variance is unknown (usually the case). . .
we could try plugging in s for σ in all our formulas, BUT . . .
That would not be quite right, even if the population is normal. The problem is that the sampling
distribution of
(X̄ − µ) / (s/√n)
would no longer be normal. The good news is that the distribution it has is known. It is the t
distribution with n − 1 degrees of freedom.
Our new CI becomes
x̄ ± t* eSE    or    x̄ ± t_{n−1,α/2} · s/√n
3. If we have both problems (don’t know σ and population is not normal). . .
The t intervals are also robust and can be used in most situations where n > 30 and for even smaller
n if the population is unimodal and not terribly skewed.
4. If the sample is not an SRS, then things can become much more difficult (depending on how it differs
from an SRS).
To say these confidence intervals are robust means that the stated confidence level is quite close to the actual
coverage rate of the interval. Sometimes this is verified theoretically (by folks who know more statistics than
we have learned); sometimes it is verified empirically (by running simulations).
More Variations
1. One-sided and lop-sided confidence intervals
2. Determining sample size for specified confidence and margin of error.
3. Confidence Intervals for population proportions (method 1).
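As a sketch, here are both styles of interval computed in R from some summary numbers (x̄ = 12.087,
s = 0.2, n = 30, borrowed from a later example; treat them as hypothetical here):
> xbar <- 12.087; s <- 0.2; n <- 30
> xbar + c(-1, 1) * qnorm(.975) * s/sqrt(n)         # 95% z interval (treating s as sigma)
> xbar + c(-1, 1) * qt(.975, df=n-1) * s/sqrt(n)    # 95% t interval (n - 1 = 29 df)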
March 12, 2007
Things we have not covered
There are some topics from Chapter 7 that we have not yet covered:
• tolerance and prediction intervals
• an improved method of confidence intervals for proportions (Box 7.10 on page 295).
Hypothesis Testing
Last week we talked about confidence intervals. This week we will focus on the other main category of
inference procedure – hypothesis testing. Actually, we have discussed most of the ideas involved already.
The lady tasting tea is an example of hypothesis testing.
• Hypothesis:
• Statistical Hypothesis:
◦ µ = 100: The population mean is 100.
◦ µ > 100: The population mean is greater than 100.
◦ µM > µF : The mean weight of male ogres is more than the mean weight of female ogres.
◦ The population is normal.
• Our goal with hypothesis testing is to assess the evidence provided by data with respect to some claim
(hypothesis) about a population.
◦ A hypothesis test is a formal procedure for comparing observed (sample) data with a hypothesis
whose truth we want to ascertain.
◦ The results of a test are expressed in terms of a probability that measures how well the data
and the hypothesis agree. In other words, it helps us decide if a hypothesis is reasonable or
unreasonable based on the likelihood of getting sample data similar to our data.
Just like confidence intervals, hypothesis testing is based on our knowledge of sampling distributions.
4 step procedure for testing a hypothesis
Step 1: Identify parameters and state the null and alternate hypotheses.
The hypothesis to be tested is called the null hypothesis, designated H0 .
The alternate hypothesis describes what you will believe if you reject the null hypothesis. It is designated
Ha. (It may well be that H0 is a straw man and that Ha is what we hope or suspect is true instead of H0.)
Example 1. The average LDL of healthy middle-aged women is 108.4. We suspect that smokers have a
higher LDL (LDL is supposed to be low).
Example 2. The average Beck Depression Index (BDI) of women at their first post-menopausal visit is
5.1. We wonder if women who have quit smoking between their baseline and post visits have a BDI different
from the other healthy women.
Step 2: Compute the test statistic
A test statistic measures how well the sample data agrees with the null hypothesis.
When we are testing a hypothesis about the mean of a population, the test statistic has the form
[(estimate) − (hypothesized value)] / (SE or eSE)
We use this statistic because of what we know about sampling distributions when the population is ∼ N(µ, σ):
• X̄ ∼ N(µ, σ/√n)
• (X̄ − µ)/(σ/√n) ∼ N(0, 1)
• (X̄ − µ)/(s/√n) ∼ t_{n−1}
These results are quite robust, so we can apply them in many situations where we are not willing to assume
the population is normal – especially if the sample size is large.
In each of these cases, the larger the test statistic, the stronger the evidence against the null hypothesis.
This is true of most test statistics.
For our HWS examples (LDL: 159 smokers, x̄ = 115.7, sd = 29.8; BDI: 27 quitters, x̄ = 5.6, sd = 5.1):
LDL: t = (x̄ − µ0)/(s/√n) = (115.7 − 108.4)/(29.8/√159) = 3.089
BDI: t = (x̄ − µ0)/(s/√n) = (5.6 − 5.1)/(5.1/√27) = 0.509
Step 3: Compute the p-value
The p-value (for a given hypothesis test and sample) is the probability, if the null hypothesis is true,
of obtaining a test statistic as extreme or more extreme than the one actually observed.
How do we compute the p-value?
Here is how hypothesis testing works based on the t-distribution:
Suppose that an SRS of size n is drawn from a population having unknown mean µ. To
test the hypothesis H0 : µ = µ0 based on an SRS of size n, compute the one-sample t
statistic
t = (x̄ − µ0)/(s/√n)
The p-value for a test of H0 against
• Ha: µ > µ0     is P(T > t)
• Ha: µ < µ0     is P(T < t)
• Ha: µ ≠ µ0     is P(T > |t|) + P(T < −|t|) = 2 P(T > |t|)
where T is a random variable with a t distribution (df = n − 1).
These p-values are exact if the population distribution is normal and are approximately
correct for large enough n in other cases.
The first two kinds of alternatives lead to one-sided or one-tailed tests. The third alternative leads to
two-sided or two-tailed test.
Example 3. Find the p-value for testing our hypotheses regarding LDL in the HWS. t = 3.09, one-tailed
alternate hypothesis, n = 159 smokers.
>1- pt(3.09,df=158)
[1] 0.0011829
Example 4. Find the p-value for testing our hypothesis regarding BDI in the HWS. t=.51, two-tailed
alternate hypothesis, n = 27 quitters.
>2 * (1 - pt(.51,df=26))
[1] 0.61435
Note: This is exactly how we evaluated the claims of the lady tasting tea.
Step 4: State a conclusion
A decision rule is simply a statement of the conditions under which the null hypothesis is or is not rejected.
This condition generally means choosing a significance level α. If p is less than the significance level
α, we reject the null hypothesis, and decide that the sample data do not support our null hypothesis
and the results of our study are then called statistically significant at significance level α.
If p > α, then we do not reject the null hypothesis. This doesn’t necessarily mean that the null
hypothesis is true, only that our data do not give strong enough evidence to reject it. It is rather like a
court trial. If there is strong enough evidence, we convict (guilty), but if there is reasonable doubt, we find
the defendant “not guilty”. This doesn’t mean the defendant is innocent, only that we don’t have enough
evidence to claim with confidence that he or she is guilty.
Looking back at our previous examples, if we select a significance level of α = 0.05, what do we conclude?
• LDL (p-value < 0.0025)
Since 0.0025 < 0.05, we
the null hypothesis and conclude that the smokers have a significantly
higher LDL. ”Smokers have higher LDL (t = 3.09, df = 158, p < .0025)”.
• BDI (p-value > .5)
This p-value is higher than .05, so we
the null hypothesis. It is quite possible that the null
hypothesis is true and our data differed from the hypothesized value just based on random variation. In
this case we say that the difference between the mean of 5.6 (those who quit smoking) and 5.1 (average
healthy women) is not statistically significant. ”Those who quit smoking do not have a significantly
different average BDI than middle-aged healthy women (t = .51, df = 26, p > .50).”
It’s good to report the actual p-value, rather than just the level of significance at which it was or was not
significant. This is because:
Example 5. Suppose that it costs $60 for an insurance company to investigate accident claims. This cost
was deemed exorbitant compared to other insurance companies, and cost-cutting measures were instituted.
In order to evaluate the impact of these new measures, a sample of 26 recent claims was selected at random.
The sample mean and sd were $57 and $10, respectively. At the 0.01 level, is there a reduction in the average
cost – or can the difference of three dollars ($60-$57) be attributed to the sample we happened to pick?
Hypotheses:
Test statistic:
p-value:
Conclusion:
Example 6. The amount of lead in a certain type of soil, when released by a standard extraction method,
averages 86 parts per million (ppm). A new extraction method is tried, and they wondered if it would
extract a significantly different amount of lead. 41 specimens were obtained, with a mean of 83 ppm lead
and a sd of 10 ppm.
Hypotheses:
Test statistic:
p-value:
Conclusion:
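As a sketch (not a full solution), the test statistic and p-value for Example 6 could be computed from the
summary statistics like this (two-sided alternative):
> tstat <- (83 - 86) / (10 / sqrt(41)); tstat    # about -1.92
> 2 * pt(-abs(tstat), df=40)                     # two-sided p-value, about 0.06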
Example 7. The final stage of a chemical process is sampled and the level of impurities determined. The
final stage is recycled if there are too many impurities, and the controls are readjusted if there are too few
impurities (which is an indication that too much catalyst is being added). If it is concluded that the mean
impurity level=.01 gram/liter, the process is continued without interruption. A sample of n = 100 specimens
is measured, with a mean of 0.0112 and a sd of 0.005 g/l. Should the process be interrupted? Use a level of
significance of .05.
Hypotheses:
Test statistic:
p-value:
Conclusion:
March 13, 2007
Warm-up problem
Example 1.
A sample of 20 mini-boxes of raisins reveals a mean count of 24.2 raisins per box with a standard deviation
of 1.5. Is this enough evidence to doubt the raisin company’s claim that there is an average of 25 raisins per
box?
Statistical Significance
If the p-value is small enough, we say that we have evidence against the null hypothesis that is statistically
significant. The cut-off point for statistical significance is called the significance level and is denoted α.
So if α = 0.05, we will call any test with a p-value smaller than 0.05 statistically significant.
The Rejection Region Approach
0. Determine α (significance level).
1. State null and alternative hypotheses.
2. Determine the rejection region (for which values of our test statistic will we reject the null hypothesis?)
3. Compute test statistic from data.
4. Compare test statistic with rejection region.
Example 2. Repeat the warm-up example (Example 1 above) using the rejection region approach.
The two approaches yield identical conclusions. The rejection region approach is useful for thinking about
statistical power (see below).
Type I Error, Type II Error, and the Power of a Test
Four possible situations:
                        H0 is true      H0 is false
reject H0
do not reject H0
Two of these are good situations. Two of them are errors. These errors are referred to as type I and
type II errors.
• P(reject H0 when H0 is true) = α. So α is the type I error rate.
• β = P(don’t reject H0 when H0 is false) can’t be computed. Why?
◦ β = type II error rate can only be computed if
◦ 1 − β is called the power of the test (against a particular alternative).
Power Calculations
In general, power calculations are difficult. But when σ is known, power calculations are straightforward.
Example 3. Suppose σ = 2; H0 : µ = 0; Ha : µ > 0; α = 0.05.
Using power.t.test()
R provides a function to determine any one of power, sample size, effect size, standard deviation or significance if all the others are specified.
Example 4. Suppose σ is unknown; H0 : µ = 0; Ha : µ > 0; α = 0.05. Let’s guess that s will be
approximately 2.
• What is the power of this test against an alternative of Ha : µ = 1 when the sample is of size 20?
> power.t.test(sd=2,sig.level=.05,delta=1,n=20,type='one.sample')
     One-sample t test power calculation

              n = 20
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.5644829
    alternative = two.sided
• That’s not so good (almost 50% of the time our test will fail to detect an effect of this magnitude).
How large must the sample size be to have 90% power?
> power.t.test(sd=2,sig.level=.05,delta=1,power=.9,type='one.sample')
     One-sample t test power calculation

              n = 43.99552
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.9
    alternative = two.sided
• We can also look at power graphically by varying one quantity. Let’s see how sample size affects power:
> xyplot(power.t.test(n=5:1000,sd=2,delta=.5,type='one.sample')$power~5:1000,
type='l',lwd=2,xlab="sample size",ylab="power",
main="Power of t test when sd=2 and effect size is .5")
• What about effect size and power?
> xyplot(power.t.test(n=40,sd=2,delta=seq(0,3,by=.05),
type=’one.sample’)$power~seq(0,3,by=.05),
type=’l’,lwd=2,xlab="effect size",ylab="power",
main="Power of t test when sd=2 and n=40")
[Two plots from the xyplot() calls above: power vs. sample size (sd = 2, fixed effect size) and power vs.
effect size (sd = 2, n = 40).]
March 14, 2007
Hypothesis Tests for Proportions
1. State Null and Alternative Hypotheses:
• H0 : p = p0
• Ha : p ≠ p0 [ or p < p0 or p > p0 ]
Note: If H0 is true, and X = # of “successes”:
• X ∼ Binom(n, p)
• X ≈ N(np, √(np(1 − p)))        [provided n is “large enough”]
• X/n ≈ N(p, √(p(1 − p)/n))      [provided n is “large enough”]
2. Calculate the test statistic
• x = sample count (of successes)
• p̂ = x/n = sample proportion
• z = (p̂ − p0)/SE, where SE = √( p0(1 − p0)/n )
3. Calculate p-value.
If H0 is true, then
(a) Z ≈ N(0, 1)      [z-test, prop.test()]
• Rule of Thumb: approximation is good enough provided np0 ≥ 10 and n(1 − p0) ≥ 10.
(b) X ∼ Binom(n, p0)      [binom.test()]
4. Draw a conclusion. (Is the p-value small enough to reject the null hypothesis?)
Example 1. You are assigned the task of controlling an incoming supply of parts for quality. Your company
will accept a shipment of parts if no more than 5% are defective. You can’t inspect every part, so you sample
200. You find 16 to be defective. What should you do?
Example 2. You are playing a game that involves rolling dice. In the first 20 rolls, a 6 comes up 7 times.
Should you suspect loaded dice?
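A sketch of how Example 1 could be checked in R (16 defective out of 200, testing H0: p = .05 against
Ha: p > .05):
> binom.test(16, 200, p=.05, alternative="greater")    # exact binomial test
> prop.test(16, 200, p=.05, alternative="greater")     # approximate z-type test (with continuity correction)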
Comparing significance tests and confidence intervals
Example 3. The drained weights for a sample of 30 cans of fruit have mean 12.087 oz, and standard
deviation 0.2 oz. Test the hypothesis that on average, a 12-oz. drained weight standard is being maintained.
Now let’s compute a 95% CI.
What is the relationship between confidence intervals and two-sided hypothesis tests?
Example 4. Compute a 95% confidence interval for the percentage of defective parts in Example 1 (the parts-inspection example above).
Better Confidence Intervals for Proportions
The simple approximate confidence intervals for proportions that we developed the other day:
p̂ ± z* · √( p̂(1 − p̂)/n )        (estimate ± crit. val. · eSE)        (20)
are not very good – especially for small and moderate sample sizes.
• problem: coverage rates are not very accurate
• cause: eSE is a poor estimator for SE unless n is large.
A Better Approach
We could derive a 1 − α confidence interval in a different way by asking the question:
What values of p0 would we not reject if we tested the null hypothesis H0 : p = p0 (with a
two-sided alternative)?
• These values are “plausible values” for p.
• p0 is plausible if the test statistic isn't too big; that is, if

    |p̂ − p0| / √( p0(1 − p0)/n )  ≤  zα/2
• The corresponding equation is quadratic in p0 , so we can solve using the quadratic formula. I’ll spare
you the gory details; here’s the solution:
    p0 = [ p̂ + z*²/(2n)  ±  z* √( p̂q̂/n + z*²/(4n²) ) ]  /  ( 1 + z*²/n )        (21)
• For large n, some terms have negligible magnitude, and this is very close to our easier formula (20).
• For a 95% confidence interval, z∗ ≈ 2, and (21) is very nearly (see problem 7.27)
    p̃ ± z* √( p̃(1 − p̃)/n )        (22)

  where p̃ = (x + 2)/(n + 4). This is sometimes called the plus four confidence interval or the Wilson confidence interval. It has been shown that this confidence interval is very accurate even for quite small sample sizes, and since it is no more difficult to compute than (20), it is recommended that it be used for all sample sizes when the confidence level is near 95% (say between 90% and 99%).
• For other confidence levels and computer calculations, (21) can be used.
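As a quick sketch (re-using the counts from the defective-parts example, 16 defectives in 200 parts), the plus four interval of (22) can be computed directly and compared with the interval prop.test() reports:

> x <- 16; n <- 200
> p.tilde <- (x + 2)/(n + 4)
> p.tilde + c(-1,1) * qnorm(.975) * sqrt(p.tilde*(1-p.tilde)/n)   # formula (22)
> prop.test(x, n)$conf.int                                        # R's interval, for comparison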
March 26, 2007
Review & Preview: Design of Statistical Studies
It is going to become increasingly important to pay attention to study design. In general, by study design
we mean answers to questions like the following:
• What population are we studying? How will we sample?
• What variables will be measured? How will they be measured?
• Observational study or experiment?
• What statistical analysis will we do? What assumptions are required for this analysis? How will we
check the assumptions?
Before spring break the situations we considered all had the following simple univariate study design:
• random sample (SRS) from a population
• one variable recorded for each individual (two flavors: quantitative or categorical)
◦ for quantitative variables there were typically some normality assumptions (less important as
sample size gets larger).
• inference via confidence intervals and/or hypothesis tests
The rest of the semester will consist of variations on these themes that are appropriate for other study
designs. What study designs are appropriate in the following examples? For which ones do we already have
the necessary tools?
Example 1. What is the average length of an eruption of Old Faithful? (data(faithful))
Example 2. Are healthy women who quit smoking more likely to gain weight or lose weight? How much
weight?
Example 3. If school kids are asked which is more important to them – good grades, athletic ability, or
popularity –
1. What percentage would choose athletic ability?
2. Which response would be selected most often?
3. Would the responses be different for boys and girls?
4. Would the responses be different in rural, urban, and suburban schools?
5. Would the responses be different for kids of different races?
> read.table(’http://www.calvin.edu/~rpruim/data/dasl/schoolkids.txt’,header=T) -> kids
> source(’http://www.calvin.edu/~rpruim/R/courses/m243.R’) # or getCalvin(’m243’)
> getData(’dasl/schoolkids’) -> kids
Example 4. Gosset (the man who invented Student’s t) did an experiment to test whether kiln dried seeds
have better yield (lbs/acre) than regular seeds. How would you design such an experiment?
> getData(’dasl/gosset’) -> corn
Example 5. Tire treadwear can be measured two different ways: by weighing the tire and by measuring
the depth of the grooves.
1. Do these methods give comparable results?
2. Can we predict one measurement from the other? (With what degree of accuracy?)
> getData(’dasl/tirewear’) -> tire
Paired Study Design
For several of our examples a paired study design is appropriate:
• random sample (SRS) from a population
• two quantitative measurements for each individual
• interested in the difference between the two measurements (how much improvement, weight loss,
change, etc.)
Key idea: Once we form A − B, we are back to a univariate situation, so we already know what to do!
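A minimal sketch of this idea using the Gosset corn data loaded above (the data set has variables reg and kiln, as used later in these notes):

> diffs <- corn$kiln - corn$reg                   # form the differences
> t.test(diffs)                                   # one-sample t test on the differences
> with(corn, t.test(kiln, reg, paired=TRUE))      # the equivalent paired t test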
A Correction
The power.t.test() examples from March 13 have been updated. I thought that one-sample tests were
the default, but it turns out that two-sample tests (one of our topics for this week) are the default. The
updated examples are available online.
Normal Quantile Plots, Normal Probability Plots (Sec. 4.6)
We have been making normality assumptions for many of our procedures. How does one check that these
assumptions are reasonable? One useful graphical check is called a normal quantile plot. The idea is to
compare data to the theoretical quantiles of a standard normal distribution.
Example 6. Sample data (in sorted order): 83 92 96 96 97 99 100 101 107 108
• Using our ruler method with data centered in each unit, we see that 83 corresponds to the 0.05-quantile
of our data (see second row of table below).
• What values would we expect for the 0.05-quantile, the 0.15-quantile, etc. if this data came from a
normal distribution? We can use qnorm(seq(.05,.95,by=.1),mean=mean(x),sd=sd(x)) to find out.
• If there is a good fit, plotting the data values against these normal quantiles should give approximately the straight line with slope = 1 and intercept = 0 (y = x).
• Since all normal distributions are linear transformations of a standard normal distribution (y = µ + σz), we could use qnorm(seq(.05,.95,by=.1)) instead. A good fit will still be roughly linear (y = µ + σz); the slope and intercept will be the mean and standard deviation of the normal distribution (if we put our data on the vertical axis).
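As a minimal sketch, the rows of the table below can be reproduced directly in R from the sample data:

> x <- c(83, 92, 96, 96, 97, 99, 100, 101, 107, 108)
> p <- seq(.05, .95, by=.1)
> qnorm(p, mean=mean(x), sd=sd(x))   # fitted normal quantiles (y)
> qnorm(p)                           # standard normal quantiles (z)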
Notice that the graphs look identical except for the scale.
                            1      2      3      4      5      6      7      8      9     10
data (x)                83.00  92.00  96.00  96.00  97.00  99.00 100.00 101.00 107.00 108.00
probability              0.05   0.15   0.25   0.35   0.45   0.55   0.65   0.75   0.85   0.95
normal quantiles (y)     86.08  90.45  93.05  95.13  97.00  98.80 100.67 102.75 105.35 109.72
st normal quantiles (z)  −1.64  −1.04  −0.67  −0.39  −0.13   0.13   0.39   0.67   1.04   1.64

[Figure: the data plotted against the fitted normal quantiles (y) and against the standard normal quantiles (z); the two plots look identical except for the scale on the horizontal axis.]

Of course, R can do this all automatically for you: require(lattice); qqmath(x) or qqnorm(x). (Actually, R uses a slightly different ruler method for small data sets, but the basic idea is the same.)
March 27, 2007
Variations on the theme
The same basic ideas that we used when computing confidence intervals and evaluating hypothesis tests for
means of a quantitative variable can be applied in a number of related situations. The best way to think of
these different situations is as variations on an inference theme. To make this easier, we will use a systematic
notation scheme throughout:
• parameters (population)
◦ π, proportion (of a categorical variable)
◦ µ, mean (of quantitative variable)
◦ σ, standard deviation
• statistics (sample)
◦ n, sample size
◦ X, count (of a categorical variable)
◦ p = X/n, proportion (of a categorical variable)
◦ x̄, mean (of quantitative variable)
◦ s, standard deviation
• sampling distribution
◦ SE, standard deviation of the sampling distribution (σx̄ or σp̂ are also used for this)
◦ eSE, estimated standard error of the sampling distribution (an estimate for SE)
◦ µp , µx̄ , mean of sampling procedure (for when determining p̂ and x̄, respectively)
Subscripts will be used to indicate
The procedures involving the z (normal) and t distributions are all very similar.
• To do a hypothesis test, compute
    t or z = (data estimate − hypothesized value) / (SE or eSE),
and compare with the appropriate distribution (using tables or computer).
• To compute a confidence interval, first determine the critical value for the desired level of confidence
(z ∗ or t∗ ), then the confidence interval is
data estimate ± (critical value)(SE or eSE) .
Two Sample Procedures
A two-sample problem is one in which:
1. the goal is to compare the responses in two groups
2. each group is considered to be a sample from a distinct population
3. the responses in each group are independent from those in the other group
The difference between two-sample problems and matched pairs problems is that in matched pairs we have
two measurements that are dependent in some way, whereas in two-sample problems we have independent
measurements from two distinct populations
Suppose we want to compare µ1 with µ2 . We do this by drawing a sample from each population and
calculating x̄1 and x̄2 . In order to know what x̄1 and x̄2 tell us about the difference between µ1 and µ2 we
need to know about the sampling distribution for X̄1 − X̄2 .
Assuming each population has a normal distribution with means µi and standard deviations σi , we already
know that
    X̄1 ∼ N(µ1, σ1/√n1)      and      X̄2 ∼ N(µ2, σ2/√n2)

Using our rules for combining means and variances, we see that

    X̄1 − X̄2 ∼ N(µ1 − µ2, SE),   where SE = √( σ1²/n1 + σ2²/n2 )
Of course, we won’t usually know σ1 and σ2 , so we need to estimate SE using
    eSE = √( s1²/n1 + s2²/n2 ).
Unfortunately the t statistic computed from this does not have a t-distribution with n1 + n2 − 2 degrees
of freedom, as we might have hoped. Why? A t distribution replaces a N(0,1) distribution only when a
single population standard deviation σ is replaced by a single sample standard deviation s. In this case, we
replaced two standard deviations (σ1 and σ2 ) with their estimates (s1 and s2 ).
The resulting distribution is still approximately a t-distribution, but with the following degrees of freedom:
    df = ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
           = (eSE1² + eSE2²)² / [ eSE1⁴/df1 + eSE2⁴/df2 ]
Some algebra shows that
• If n1 = n2 and s1 = s2 , then ν = n1 + n2 − 2
• ν ≤ n1 + n2 − 2 (sum of degrees of freedom)
• ν ≥ min(n1 − 1, n2 − 1) (smaller degrees of freedom)
Now that we know the distribution involved and a value for eSE, we are all set to do hypothesis testing or
to compute CIs.
Example 1. An agronomist has developed a new plant food. She hopes it will improve yield. To find out,
she treated 48 plants with the new food and obtained a mean yield of 24.4 lbs (s=4.8 lbs). 45 identical
plants were untreated and had a mean yield of 22.3 lbs (s =2.3 lbs). Do the data provide sufficient evidence
to determine that the new plant food is better than no treatment?
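A sketch of the computations from the summary statistics (again, no raw data, so t.test() can't be used directly; the one-sided alternative is that the treated mean is larger):

> n1 <- 48; x1 <- 24.4; s1 <- 4.8        # treated plants
> n2 <- 45; x2 <- 22.3; s2 <- 2.3        # untreated plants
> eSE <- sqrt(s1^2/n1 + s2^2/n2)
> t <- (x1 - x2)/eSE; t
> df <- (s1^2/n1 + s2^2/n2)^2 / ((s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1)); df   # Welch df
> 1 - pt(t, df)                           # one-sided p-value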
Example 2. The weather bureau measured the ozone level at 5 random locations in Orange City before a
cool front moved through and at 5 different random locations afterward. Test whether there is a significant
drop in the ozone level after the front has moved through.
Time           n   mean    variance
Before front   5   0.122   0.00067
After front    5   0.094   0.00016
Question. How could the design of these studies be changed to make them matched pairs designs? What
would be the advantages/disadvantages of such changes?
More Examples
Some of these are two-sample designs, others are paired designs.
Example 3. In testing food products for palatability, General Foods employed a 7-point scale from -3
(terrible) to +3 (excellent) with 0 representing ”average”. Their standard method for testing palatability
was to conduct a taste test with 50 persons - 25 men and 25 women. Does the amount of liquid used in the
samples matter?
> source('http://www.calvin.edu/~rpruim/R/courses/m243.R') # or getCalvin('m243')
> getData('dasl/tastetest') -> taste
> t.test(score~liq,data=taste)
> t.test(score~liq,data=taste,paired=T)
Example 4. Gosset (the man who invented Student’s t) did an experiment to test whether kiln dried seeds
have better yield (lbs/acre) than regular seeds. How would you design such an experiment?
> getData(’dasl/gosset’) -> corn
> with(corn,t.test(reg,kiln))
> with(corn,t.test(reg,kiln,paired=T))
Example 5. Tire treadwear can be measured two different ways: by weighing the tire and by measuring
the depth of the grooves.
1. Do these methods give comparable results?
2. Can we predict one measurement from the other? (With what degree of accuracy?)
> getData(’dasl/tirewear’) -> tire
> with(tire,t.test(weight,groove))
> with(tire,t.test(weight,groove,paired=T))
Example 6. Two varieties of oats were compared in an experiment to determine which variety had the
higher yield. Since soil type also affects yield, the experimenter blocked out its effect by planting each variety
of oats in seven different types of soil. With the data paired by soil types as given below, does it appear
that variety A has the higher mean yield?
H0 :
Ha :

                     Yield
Soil type        1      2      3      4      5      6      7
A             71.2   72.6   47.8   76.9   42.5   49.6   62.8
B             65.2   60.7   42.8   73.0   41.7   56.6   57.3
x = A − B      6.0   11.9    5.0    3.9    0.8   −7.0    5.5

mean(x) = 3.7286        sd(x) = 5.7792

eSE =
p-value =
95% CI for difference:
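One way to check the blanks above (a sketch; the test uses the one-sided alternative that variety A has the higher mean yield):

> A <- c(71.2, 72.6, 47.8, 76.9, 42.5, 49.6, 62.8)
> B <- c(65.2, 60.7, 42.8, 73.0, 41.7, 56.6, 57.3)
> t.test(A, B, paired=TRUE, alternative='greater')   # paired t test of Ha: mean difference > 0
> t.test(A - B)$conf.int                              # 95% CI for the mean difference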
March 30, 2007
Letting R do the work
R can, of course, do all of the procedures we have been learning (and many more). Some useful R commands
for this are
• prop.test(), binom.test()
• t.test(), power.t.test()
• qqmath() (and xqqmath(), which adds a reference line)   [require(lattice)]
• histogram() and bwplot()   [require(lattice)]
• summary() for computing numerical summaries of data   [require(Hmisc)]
• table() and xtabs() for tabulating counts of data
R Formulas
Lattice graphics and many statistical procedures in R are based on a formula interface. Simple formulas in R have the following form: y~x|z or ~x|z. Often the |z is not needed. A formula can be thought of as “y is modeled by x (conditioned on z)”. When y is missing, it usually means that R will be calculating something
for you. For example, in a histogram, R computes the heights of the bars (the y-coordinate). In xtabs(),
R computes the frequencies for the cross-table.
Some R Examples
• xyplot( y~x|z, data=mydata ) will make a scatterplot of y by x with separate plots for each level of a categorical variable z. All the variables are taken from the data frame mydata.
• bwplot( ~x|z, data=mydata ) will make a boxplot of x with separate plots for each level of a categorical variable z. bwplot( y~x, data=mydata ) will make side-by-side boxplots (one of x and y should be categorical, the other quantitative).
• summary(y~x, data=mydata, fun=favstats) will compute summary statistics of y for each level of
x. favstats is a little function I wrote. You can put things like mean, sd, quantile in there as well.
(favstats does all three.)
• t.test(y~x,data=mydata) will do a two-sample t-test if x has two levels. If you add paired=TRUE it will do a paired t-test, assuming that the order in the data set determines the pairs.
• t.test(mydata$x) or with(mydata, t.test(x)) will do a one-sample t-test. t.test(mydata$x, mydata$y) or with(mydata, t.test(x,y)) will do a two-sample or paired t-test (depending on how you set paired).
• You can do pooled t-tests (remember, these assume that σ1 = σ2 and are not very robust against
violations of that assumption), by adding var.equal=TRUE to a t.test().
• Other useful parameters for t.test() include alternative, mu, and conf.level. See ?t.test for
details.
• prop.test() and binom.test() work with summarized data. For example, prop.test(25,100) will
do a 1-proportion test and CI where your data had 25 successes and a sample size of 100. See the help
for these functions for more details.
• For help in tabulation, table(x) and xtabs(~x+y,data=mydata) are useful commands.
More Examples
Here are some questions that we can now answer using the procedures we have been developing.
Example 1. On average, how much weight did a healthy non-smoker gain during the course of a study?
Example 2. Are healthy women who quit smoking more likely to gain weight or lose weight? How much
weight?
Example 3. Are women who quit smoking more likely to gain weight than smokers who do not quit? Do
they gain more weight on average? If so, how much more?
Example 4. If school kids are asked which is more important to them – good grades, athletic ability, or
popularity –
1. What percentage would choose athletic ability?
2. Which response would be selected most often?
3. Would the responses be different for boys and girls?
4. Would the responses be different in rural, urban, and suburban schools?
5. Would the responses be different for kids of different races?
> source(’http://www.calvin.edu/~rpruim/R/courses/m243.R’) # or getCalvin(’m243’)
> getData(’dasl/schoolkids’) -> kids
Example 5. In testing food products for palatability, General Foods employed a 7-point scale from -3
(terrible) to +3 (excellent) with 0 representing ”average”. Their standard method for testing palatability
was to conduct a taste test with 50 persons - 25 men and 25 women. Does the amount of liquid used in the
samples matter?
> getData(’dasl/tastetest’) -> taste
> t.test(score~liq,data=taste)
> t.test(score~liq,data=taste,paired=T)
Example 6. Gosset (the man who invented Student’s t) did an experiment to test whether kiln dried seeds
have better yield (lbs/acre) than regular seeds. How would you design such an experiment?
> getData(’dasl/gosset’) -> corn
> with(corn,t.test(reg,kiln))
> with(corn,t.test(reg,kiln,paired=T))
Example 7. Tire treadwear can be measured two different ways: by weighing the tire and by measuring
the depth of the grooves.
1. Do these methods give comparable results?
2. Can we predict one measurement from the other? (With what degree of accuracy?)
> getData(’dasl/tirewear’) -> tire
> with(tire,t.test(weight,groove))
> with(tire,t.test(weight,groove,paired=T))
Example 8. Does spending a weekend with a group of men banging on drums increase one’s masculinity?
> getData(’m243/drumbeating’) -> tire
R Output
Here is a log of the R commands as done in class today (with my typos removed):
> source(’http://www.calvin.edu/~rpruim/R/courses/m243.R’) # or getCalvin(’m243’)
> getData(’dasl/schoolkids’) -> kids
Getting data from: http://www.calvin.edu/~rpruim/data/dasl/schoolkids.txt
# Reference:
# Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a Social
# Determinant for Children," Research Quarterly for Exercise and Sport, 63,
# 418-424
#
> table(kids$Goals)
 Grades Popular  Sports 
    247     141      90 
> prop.test(90,247+141+90)
1-sample proportions test with continuity
correction
data: 90 out of 247 + 141 + 90, null probability 0.5
X-squared = 184.5377, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.1548111 0.2268756
sample estimates:
p
0.1882845
> prop.test(90,247+141+90,p=1/3)
1-sample proportions test with continuity
correction
data: 90 out of 247 + 141 + 90, null probability 1/3
X-squared = 44.6049, df = 1, p-value = 2.411e-11
alternative hypothesis: true p is not equal to 0.3333333
95 percent confidence interval:
0.1548111 0.2268756
sample estimates:
p
0.1882845
> xtabs(~Race+Goals,data=kids)
       Goals
Race    Grades Popular Sports
  Other     22       9      5
  White    225     132     85
> prop.test(c(5,85),c(22+9+5, 225+132+85))
2-sample test for equality of proportions with
continuity correction
data: c(5, 85) out of c(22 + 9 + 5, 225 + 132 + 85)
X-squared = 0.3212, df = 1, p-value = 0.5709
alternative hypothesis: two.sided
95 percent confidence interval:
-0.18723283 0.08039522
sample estimates:
   prop 1    prop 2 
0.1388889 0.1923077 
> xtabs(~wt.gain+smoke.status,data=hws)
       smoke.status
wt.gain new smoker nonsmoker quitter smoker
    no           2       142       4     42
    yes          4       150      20     43
> prop.test(c(20,43), c(24,85))
2-sample test for equality of proportions with
continuity correction
data: c(20, 43) out of c(24, 85)
X-squared = 6.9395, df = 1, p-value = 0.008431
alternative hypothesis: two.sided
95 percent confidence interval:
0.1176301 0.5372719
sample estimates:
   prop 1    prop 2 
0.8333333 0.5058824 
> xtabs(~wt.gain+smoke.status,data=hws)
       smoke.status
wt.gain new smoker nonsmoker quitter smoker
    no           2       142       4     42
    yes          4       150      20     43
> x=c(4,42,20,43); dim(x) = c(2,2)
> x
     [,1] [,2]
[1,]    4   20
[2,]   42   43
> fisher.test(x)
Fisher’s Exact Test for Count Data
data: x
p-value = 0.004687
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.04756274 0.69157777
sample estimates:
odds ratio
0.2075003
> summary(wt.chg~smoke.status,data=hws,fun=favstats)
wt.chg    N=407

+------------+----------+---+-----+------+-----+------+----+---------+---------+
|            |          |N  |0%   |25%   |50%  |75%   |100%|mean     |sd       |
+------------+----------+---+-----+------+-----+------+----+---------+---------+
|smoke.status|new smoker|  6| -2.0| 2.500| 6.25|10.000|13.0| 6.000000| 5.648008|
|            |nonsmoker |292|-32.0|-0.125| 5.00|10.625|63.0| 5.637671|10.794352|
|            |quitter   | 24| -3.5| 6.375|13.25|16.875|33.5|12.395833| 8.362675|
|            |smoker    | 85|-48.0| 1.000| 5.00|10.500|40.0| 5.597647|12.346572|
+------------+----------+---+-----+------+-----+------+----+---------+---------+
|Overall     |          |407|-48.0| 0.000| 5.50|11.000|63.0| 6.033170|11.043234|
+------------+----------+---+-----+------+-----+------+----+---------+---------+
> bwplot(wt.chg~smoke.status,hws)
> t.test(wt.chg~smoke.status,hws,subset=smoke.status %in% c(’smoker’,’quitter’))
Welch Two Sample t-test
data: wt.chg by smoke.status
t = 3.1333, df = 54.383, p-value = 0.002784
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.449030 11.147342
sample estimates:
mean in group quitter  mean in group smoker 
            12.395833              5.597647 
> qqmath(~wt.chg|smoke.status,hws)
April 5, 2007
Statistics with Two or More Variables
Explanatory and Response Variables
• Response variable (also called dependent variable):
• Explanatory variable (also called independent variable):
If we are interested in determining a causal relationship, the ideal situation is to have an experiment
where the researchers determine the values of the explanatory variable(s) and measure the response, but the
terms explanatory and response are also used in observational studies.
Roadmap
Which statistical procedures we will use will depend upon the kind of variables (categorical vs. quantitative,
response vs. explanatory). We're going to focus on 4 main situations (and some variations on those themes):
Situation                                        Procedure
two categorical variables                        Chi-square
two quantitative variables                       simple linear regression
categorical explanatory, quantitative response   1-way ANOVA
quantitative explanatory, categorical response   logistic regression
Extensions
These methods can be extended to deal with more than 2 variables as well. In these situations we are
typically interested in one response variable and several explanatory variables.
Often (but not always) we are primarily interested in one of the explanatory variables and the others
are included as covariates.
Some reasons for including covariates in a study:
•
•
•
Examples
response                    explanatory                        possible covariates
clicking speed              hand used
drug effect                 dosage
getting disease or not      version of a gene
winning or losing a game    rating difference between teams
The Cartoon Guide to One-way ANOVA
The basic ANOVA situation
• Main Question: Do the means of the quantitative variable depend on which group (given by categorical
variable) the individual is in? Or are they all the same?
• If categorical variable has only 2 values:
• ANOVA allows for 3 or more groups/treatments/sub-populations
Example 1. Treating Blisters.
• Subjects: 25 patients with blisters
• Treatments: Treatment A, Treatment B, Placebo (P)
• Measurement: # of days until blisters heal
• Data [and means]:
◦ A: 5, 6, 6, 7, 7, 8, 9, 10           [7.25]
◦ B: 7, 7, 8, 9, 9, 10, 10, 11         [8.875]
◦ P: 7, 9, 9, 10, 10, 10, 11, 12, 13   [10.11]
Question: Are these differences significant? or would we expect differences this large just by random chance?
[Figure: dotplots of days by treatment group (A, B, P) for two example data sets, panels Ex_1 and Ex_2.]
Whether differences between the groups are significant depends on
•
•
•
We need to develop a test statistic that takes these things into account.
Some notation for ANOVA
• n = number of individuals all together
• I = number of groups
• xij = value for individual j in group i
• x̄ = sample mean of quant. variable for entire data set (grand mean)
• s = sample s.d. of quant. variable for entire data set
Group i has
• ni = # of individuals in group i
• x̄i = sample mean for group i (group mean)
• si = sample standard deviation for group i
From our example (I = 3; 3 groups)
• n1 = 8, n2 = 8, n3 = 9
• x̄1 = 7.25, x̄2 = 8.875, x̄3 = 10.11
> summary(days~treatment,data=blisters,fun=favstats) # requires m243 stuff
days
N=25
+---------+-+--+--+----+---+-----+----+--------+--------+
|         | |N |0%|25% |50%|75%  |100%|mean    |sd      |
+---------+-+--+--+----+---+-----+----+--------+--------+
|treatment|A| 8|5 |6.00| 7 | 8.25|10 | 7.25000|1.669046|
|         |B| 8|7 |7.75| 9 |10.00|11 | 8.87500|1.457738|
|         |P| 9|7 |9.00|10 |11.00|13 |10.11111|1.763834|
+---------+-+--+--+----+---+-----+----+--------+--------+
|Overall  | |25|5 |7.00| 9 |10.00|13 | 8.80000|1.979057|
+---------+-+--+--+----+---+-----+----+--------+--------+
The ANOVA model
    xij  =  µi  +  εij          εij ∼ N(0, σ)
  (DATA  =  FIT  +  ERROR)
Assumptions of this model
It is assumed that each group (subpopulation) . . .
• is normally distributed about its group mean (µi )
• has the same standard deviation
That is, Xij ∼ N(µi , σ)
Checking the assumptions
• Equal Variances Check.
◦ Rule of Thumb:
+---------+-+--+--+----+---+-----+----+--------+--------+
|         | |N |0%|25% |50%|75%  |100%|mean    |sd      |
+---------+-+--+--+----+---+-----+----+--------+--------+
|treatment|A| 8|5 |6.00| 7 | 8.25|10 | 7.25000|1.669046|
|         |B| 8|7 |7.75| 9 |10.00|11 | 8.87500|1.457738|
|         |P| 9|7 |9.00|10 |11.00|13 |10.11111|1.763834|
+---------+-+--+--+----+---+-----+----+--------+--------+
• Overall Normality Check:
◦ uses residual = xij − x̄i (how different is each data value from its group mean?)
[Figure: Normal Q-Q plot for the blister residuals (sample quantiles vs. theoretical quantiles).]
What does ANOVA do?
At its simplest (there are extensions) ANOVA tests the following hypotheses:
• H0 : The means of all the groups are equal. (µ1 = µ2 = · · · = µI )
• Ha : Not all the means are equal.
◦ doesn’t say how or which ones differ
◦ can follow up with “multiple comparisons” if we reject H0
A quick look at the ANOVA table
            Df  Sum Sq  Mean Sq  F value  Pr(>F)
treatment    2   34.74    17.37     6.45  0.0063
Residuals   22   59.26     2.69
Conclusion:
How ANOVA works – some details
ANOVA measures two sources of variation in the data and compares their relative sizes
• variation BETWEEN groups (treatment effect)
◦ for each data value look at the difference between its group mean and the overall mean
    Σ (group mean − overall mean)²  =  Σ (x̄i − x̄)²  =  SSTr
• variation WITHIN groups (“error”)
◦ for each data value we look at the difference between that value and the mean of its group
    Σ (individual measurement − group mean)²  =  Σ (xij − x̄i)²  =  SSE
• SST r and SSE are then adjusted to account for sample sizes and the number of groups:
◦ DFTr = I − 1 = degrees of freedom for numerator
◦ DFE = n − I = degrees of freedom for denominator
◦ MSTr = SSTr/DFTr;   MSE = SSE/DFE
The ANOVA F-statistic is the ratio of the Between Group Variation (explained by FIT) to the Within Group Variation (Residuals):

    F  =  Between/Within  =  explained variation / unexplained variation  =  MSTr/MSE
• A large value of F is evidence
H0 , since it indicates that
• If H0 is true (and the model assumptions are also true), then the sampling distribution for F has an
F-distribution.
• In fact, if H0 is true (and the model assumptions are also true), then
E(MSTr) = E(MSE) = σ²
so F will tend to be about 1.
• When H0 is not true (but model assumptions are still true), then
E(MSTr) > E(MSE) = σ²
so F will tend to be > 1.
Computing F (an example)
Let’s try an even smaller example. Suppose we have three groups
Group 1:  5.3, 6.0, 6.7         [x̄1 = 6.00]
Group 2:  5.5, 6.2, 6.4, 5.7    [x̄2 = 5.95]
Group 3:  7.5, 7.2, 7.9         [x̄3 = 7.53]
overall mean: 6.44; F = 2.5528/0.25025 = 10.21575
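A sketch that lets R reproduce this computation for the small example (the group labels are made up; only the grouping pattern matters):

> y <- c(5.3, 6.0, 6.7,  5.5, 6.2, 6.4, 5.7,  7.5, 7.2, 7.9)
> grp <- factor(rep(c('g1','g2','g3'), times=c(3,4,3)))
> anova(lm(y ~ grp))     # should match the ANOVA table below (up to rounding)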
The ANOVA table revisited
The traditional way to summarize all these calculations is in an ANOVA table (which matches the information above up to round-off error):
            Df  Sum Sq  Mean Sq  F value  Pr(>F)
group        2    5.13     2.56    10.22  0.0084
Residuals    7    1.76     0.25
Now let's fill in the ANOVA table for our blister example:

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
treatment        34.74                    0.0063
Residuals        59.26
SST, MST, and R²

Let's add another row to our ANOVA table for the blister example:

            Df  Sum Sq  Mean Sq  F value  Pr(>F)
treatment    2   34.74    17.37     6.45  0.0063
Residuals   22   59.26     2.69
Total       24   94.00     3.92

DFT = 24        SST = 94        MST = 3.92        √MST = 1.98

> favstats(blisters$days)
      0%      25%      50%       75%      100%     mean       sd
5.000000 7.000000 9.000000 10.000000 13.000000 8.800000 1.979057

R² = SSTr/SST = proportion of variation explained by groups
Getting R to do ANOVA
For simple ANOVA, the key function is lm() (stands for linear model). The basic form is
model <- lm(response~explanatory,data=mydata) # saving results in a variable called model
Here’s how to analyze our blister data in R (We’ll learn some more ANOVA-related R commands next week).
> model <- lm(days~treatment,data=blisters)
> bwplot(days~treatment,data=blisters) # make side-by-side boxplots
> anova(model)
Analysis of Variance Table

Response: days
          Df Sum Sq Mean Sq F value   Pr(>F)
treatment  2 34.736  17.368  6.4474 0.006256 **
Residuals 22 59.264   2.694
> plot(model) # some diagnostic plots
> qqmath(resid(model)) # normal quantile plot of residuals
> summary(model)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2500     0.5803  12.494 1.83e-11 ***
treatmentB    1.6250     0.8206   1.980  0.06033 .
treatmentP    2.8611     0.7975   3.588  0.00164 **

Residual standard error: 1.641 on 22 degrees of freedom
Multiple R-Squared: 0.3695,    Adjusted R-squared: 0.3122
F-statistic: 6.447 on 2 and 22 DF,  p-value: 0.006256
April 9, 2007
Dot notation
• x·· = grand mean (all items in all groups)
• xi· = mean for group i
Where’s the difference?
If the ANOVA test gives a small p-value, the natural follow-up question is: which groups differ from which?
An easy idea is to do a bunch of 2-sample t tests (on each pair of groups) to answer this question. This
doesn’t quite work because of . . .
The problem of multiple comparisons
Example 1. If you do a study with 5 groups, how many pairs of groups are there?
Example 2. If you do 10 independent hypothesis tests with α = 0.05, and the null hypothesis is true for
each test, what is the probability that at least one will be “significant at the α = 0.05 level” just by random
chance? What if we make 10 confidence intervals?
One solution to this problem was proposed by Tukey. We won’t go into details, but here are the key ideas:
• compare each pair of groups by forming confidence intervals
• adjust the critical value (t∗ ) to widen the intervals so that the probability that all the confidence
intervals formed by a random sample correctly contain the parameter is the desired confidence level
(95%, for example). [simultaneous confidence intervals]
• consider two groups to be significantly different if the confidence interval for that pair does not
include 0.
• if each group has J observations, the CI's have the following form

    xi· − xj·  ±  Q* √(MSE/J)

  where Q* is our adjusted critical value and J is the size of each group.
• distribution for Q is called the Studentized range distribution.
• this can be adjusted to work for groups of similar but not exactly equal size.
R can automate this:
• TukeyHSD(aov(response~explanatory,data=mydata))
• TukeyHSD(aov(model)) where model is the result of lm().
• simint() in the multcomp package can handle Tukey’s method as well as several other methods for simultaneous confidence intervals: simint(response~explanatory,data=mydata,type="Tukey")
Attracting Bugs
> getData('m243/bugs') -> bugs
> summary(NumTrap~Color,bugs,fun=favstats)
NumTrap    N=24
+-------+-+--+--+-----+----+-----+----+--------+---------+
|       | |N |0%|25%  |50% |75%  |100%|mean    |sd       |
+-------+-+--+--+-----+----+-----+----+--------+---------+
|Color  |B| 6| 7|11.75|15.0|19.00|21 |14.83333| 5.344779|
|       |G| 6|15|26.75|34.5|38.50|41 |31.50000| 9.914636|
|       |W| 6|12|13.25|15.5|17.00|21 |15.66667| 3.326660|
|       |Y| 6|38|45.25|46.5|47.75|59 |47.16667| 6.794606|
+-------+-+--+--+-----+----+-----+----+--------+---------+
|Overall| |24| 7|14.75|21.0|39.50|59 |27.29167|14.947674|
+-------+-+--+--+-----+----+-----+----+--------+---------+
> bwplot(NumTrap~Color,bugs)

[Figure: side-by-side boxplots of NumTrap by Color (B, G, W, Y).]

> bugs.lm <- lm(NumTrap~Color,bugs)
> bugs.aov <- aov(NumTrap~Color,bugs)
> summary(bugs.lm)

Residuals:
     Min       1Q   Median       3Q      Max
-16.5000  -2.9167   0.1667   5.2083  11.8333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  14.8333     2.7696   5.356 3.05e-05 ***
ColorG       16.6667     3.9168   4.255 0.000387 ***
ColorW        0.8333     3.9168   0.213 0.833671
ColorY       32.3333     3.9168   8.255 7.16e-08 ***

Residual standard error: 6.784 on 20 degrees of freedom
Multiple R-Squared: 0.8209,    Adjusted R-squared: 0.794
F-statistic: 30.55 on 3 and 20 DF,  p-value: 1.151e-07
> summary(bugs.aov)
            Df  Sum Sq Mean Sq F value    Pr(>F)
Color        3  4218.5  1406.2  30.552 1.151e-07 ***
Residuals   20   920.5    46.0

> anova(bugs.lm)
Analysis of Variance Table

Response: NumTrap
          Df  Sum Sq Mean Sq F value    Pr(>F)
Color      3  4218.5  1406.2  30.552 1.151e-07 ***
Residuals 20   920.5    46.0

> plot(bugs.lm)
> plot(bugs.aov)

[Figures: diagnostic plots for the fitted model, including residuals vs. fitted values and residuals vs. order of observation.]

> TukeyHSD(bugs.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = NumTrap~Color, data = bugs)

$Color
          diff        lwr       upr     p adj
G-B  16.6666667   5.703670 27.629663 0.0020222
W-B   0.8333333 -10.129663 11.796330 0.9964823
Y-B  32.3333333  21.370337 43.296330 0.0000004
W-G -15.8333333 -26.796330 -4.870337 0.0032835
Y-G  15.6666667   4.703670 26.629663 0.0036170
Y-W  31.5000000  20.537004 42.462996 0.0000006

> plot(TukeyHSD(bugs.aov))
> simint(NumTrap~Color,bugs,type='Tukey')

Simultaneous confidence intervals: Tukey contrasts

95 % confidence intervals

               Estimate    2.5 %  97.5 %
ColorG-ColorB    16.667    5.704  27.630
ColorW-ColorB     0.833  -10.130  11.796
ColorY-ColorB    32.333   21.370  43.296
ColorW-ColorG   -15.833  -26.796  -4.870
ColorY-ColorG    15.667    4.704  26.630
ColorY-ColorW    31.500   20.537  42.463

> plot(simint(NumTrap~Color,bugs,type='Tukey'))
[Figures: plot(TukeyHSD(bugs.aov)) and plot(simint(...)) — 95% family-wise (two-sided) confidence intervals for the Tukey contrasts (differences in mean levels of Color) — and a Residuals vs Fitted diagnostic plot for the fitted model.]

ChickWeight
> data(ChickWeight)
> bwplot(weight~Diet,data=ChickWeight,subset=Time==21)
> chick.lm <- lm(weight~Diet,data=ChickWeight,subset=Time==21)
> xyplot(weight~Time|Chick,data=ChickWeight,type='b',lwd=2)
> xyplot(weight~Time|Diet,groups=Chick,data=ChickWeight,type='b',lwd=2)
April 10, 2007
Checking Model Assumptions
The ANOVA assumptions are about the population, not about the samples, so we can’t directly check them.
But we can check to see if our data look like a reasonable sample from a population with
• normal distributions for each group
• the same variance in each group
We do this by looking at the residuals (xij − xi·). plot(model) – where model is the result of lm() – will show some diagnostic plots.
Fixing Problems
Dealing with outliers
• Outliers that can be determined to be clear errors of some sort should be fixed, if possible.
• If we are sure the value is wrong, but have no information about what the correct value is, we may
choose to remove the value from the data.
• We can’t just remove outliers because that makes the statistics work better.
• Winsorizing: moving the top α of the data to the 1 − α-quantile and the bottom α of the data to
the α-quantile (where α is not too big) is a way to make extreme data values less influential.
Data transformations
• Sometimes applying a function to the data before analysis improves normality and/or homoscedasticity (equal variance in the different groups). Common transformations include: powers, roots, logarithms, and exponentials.
• Some transformations are suggested by the context of the problem. Square root often works well for Poisson data.
• Sometimes transformations that fix heteroscedasticity hurt normality and vice versa.
• If the assumptions of ANOVA don’t seem to be met and no (reasonable) transformation fixes the
problem, then we need to turn to other methods.
ANOVA with 2 groups vs. 2-sample t
ANOVA with two groups is exactly the same thing as a two-sample t-test that assumes a common variance
in the two groups. (Recall that this is not as robust as the two sample t-test that does not make this
assumption). In fact, in this case F = t2 .
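A quick sketch to check this with the blister data (restricting to two of the treatment groups; var.equal=TRUE gives the pooled t test that corresponds to ANOVA):

> two <- subset(blisters, treatment %in% c('A','B'))
> two$treatment <- factor(two$treatment)                          # drop the unused level
> t.test(days ~ treatment, data=two, var.equal=TRUE)$statistic^2  # t^2
> anova(lm(days ~ treatment, data=two))                           # F should agree with t^2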
A little linear algebra
The Big Picture
[Figure: vector diagram. The observation y decomposes into the overall mean y·· plus the treatment vector yi· − y·· (both lying in the model space, with fit yi·) plus the fitted error y − yi·; the corrected observation is y − y··.]
Key ideas:
• The important triangle is the red-green-blue one showing
y − y·· = yi· − y·· + y − yi·
• It is pretty easy to show that yi· − y·· ⊥ y − yi· .
• This means that (by the Pythagorean identity)
|y − y·· |2 = |yi· − y·· |2 + |y − yi· |2
• But this is the same as
    Σij (Yij − Y··)²  =  Σij (Yi· − Y··)²  +  Σij (Yij − Yi·)²
          SST         =        SSTr        +        SSE
• Degrees of freedom are just dimensions of subspaces:
◦ Since the sum of the components in y−y·· must be 0 (why?) y−y·· lives in an (n−1)-dimensional
subspace. (Now you know why degrees of freedom is n − 1.)
◦ yi· lives in an I-dimensional space called the model space since there are I group means to specify.
yi· is the closest vector in the model space to y. (It can be obtained by projection onto the model
space.)
◦ This means that yi· − y·· lives in an (I − 1)-dimensional space. This explains why DF T r = I − 1.
◦ y − yi· lives in an (n − I)-dimensional space. This explains why DF E = n − I.
• Our test statistic is F = (SSTr/DFTr) / (SSE/DFE).
The primary distinctions between different types of ANOVA are
• how one further subdivides the green treatment vector, and
• what restrictions are placed on the model space (which affects the purple fit vector). These restrictions
will typically specify proposed (types of) relationships between the group means.
An ANOVA example using linear algebra
Here is a very small example.
  pollution  location
1    124.00  Hill Suburb
2    110.00  Hill Suburb
3    107.00  Plains Suburb
4    115.00  Plains Suburb
5    126.00  Central City
6    138.00  Central City
The data consist of 6 measurements of air pollution – 2 each at 3 locations in a metropolitan area. In our
data, the largest values occur in the central city, but perhaps that is just due to random chance. We can use
the method just described to test the null hypothesis that there is no difference in air quality between the
3 locations.
This can be done easily in R.
> getCalvin(’m344’); pol <- getData(’m344/airpollution’)
> pol.lm <- lm(pollution~location,data=pol)
> anova(pol.lm)
Analysis of Variance Table
Response: pollution
          Df Sum Sq Mean Sq F value Pr(>F)
location   2    468     234    3.48   0.17
Residuals  3    202      67
SSTr and SSE are located in the second column (Sum Sq); each is divided by the appropriate degrees of freedom (subspace dimension, in the first column) to give MSTr and MSE in the third column, from which the test statistic F is
computed in the fourth column. The p-value is also listed. In this case the evidence is not strong enough to
reject the hypothesis that air quality is the same at all three locations. This isn’t too surprising given such
a small sample size. Notice that SST is missing from this table and that R uses the word ‘residuals’ instead
of ‘error’.
Show me the vectors!
A suitable set of vectors to use as a basis for R⁶ in this case is (written here as rows to save space)

    v1 = (1, 1, 1, 1, 1, 1)
    v2 = (1, 1, −1, −1, 0, 0)
    v3 = (1, 1, 1, 1, −2, −2)
    v4 = (1, −1, 0, 0, 0, 0)
    v5 = (0, 0, 1, −1, 0, 0)
    v6 = (0, 0, 0, 0, 1, −1)

We would, of course, normalize to unit length to get our vectors U1, . . . , U6. A little matrix algebra gives

    y = (124, 110, 107, 115, 126, 138)
      = (120, 120, 120, 120, 120, 120) + (3, 3, −3, −3, 0, 0) + (−6, −6, −6, −6, 12, 12)
        + (7, −7, 0, 0, 0, 0) + (0, 0, −4, 4, 0, 0) + (0, 0, 0, 0, −6, 6)

And our Pythagorean decomposition is

    ‖y − (120, . . . , 120)‖² = ‖(3, 3, −3, −3, 0, 0)‖² + ‖(−6, −6, −6, −6, 12, 12)‖²
                                + ‖(7, −7, 0, 0, 0, 0)‖² + ‖(0, 0, −4, 4, 0, 0)‖² + ‖(0, 0, 0, 0, −6, 6)‖²
                          670 = 36 + 432 + 98 + 32 + 72
                          670 = 468 + 202
                              = ‖(−3, −3, −9, −9, 12, 12)‖² + ‖(7, −7, −4, 4, −6, 6)‖²

Notice how these results correspond to the R output above.
Friday the Thirteenth of April, 2007
A new situation
For ANOVA, we had a categorical explanatory variable and a quantitative response variable. Regression (at
least our first form of regression) deals with two quantitative variables.
Taking a look
As with ANOVA, we’ll begin with pictures:
situation     plot                     R command
ANOVA         side-by-side boxplots    bwplot(y~x, data=mydata)
regression    scatter plot             xyplot(y~x, data=mydata)
                                       xyplot(y~x|z, data=mydata)
                                       xyplot(y~x, data=mydata, groups=z)
Examples
> data(iris) ; data(ToothGrowth) ; data(iris) # built-in datasets
> require(MASS) ; data(GAGurine) ; data(Animals) # datasets in package MASS
The simple linear regression model
The ANOVA model was quite simple – we were just trying to tell if a bunch of means were the same or
different. Now things are a bit more complicated. There are many different relationships that could exist
between two quantitative variables.
The model for simple linear regression is
    Yi  =  β0 + β1 xi  +  εi          εi ∼ N(0, σ)
  (data  =     fit     +  error)
The assumptions for the linear regression model are
1.
2.
3.
The best fit line
When this sort of linear relationship between our variables seems plausible, how do we decide which line is
the best fit for our data? In other words, how do we estimate the parameters β0 and β1 and how reliable
are those estimates?
The most common answer to this question is the least squares line, that is, the line that minimizes the
sum of the squared residuals.2
That is, we want to choose b0 and b1 so that
    SSE = Σi [yi − (b0 + b1 xi)]² = Σi residuali²
is as small as possible.
So to find b0 and b1 we need to solve an optimization problem. Fortunately, calculus provides a nice solution
to this problem. Take a couple partial derivatives, do a little algebra, and we learn . . .
• ȳ = b0 + b1 x̄, so (x̄, ȳ) is always on the regression line, and b0 = ȳ − b1 x̄.

• b1 = Sxy/Sxx = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)²
     = [ Σi ((yi − ȳ)/sy) · ((xi − x̄)/sx) / (n − 1) ] · (sy/sx)
     = r · (sy/sx)
So for a given value of x, our regression line prediction for y is

    ŷ = b0 + b1 x = β̂0 + β̂1 x
Correlation Coefficient
The expression r = Σi [ ((yi − ȳ)/sy) · ((xi − x̄)/sx) ] / (n − 1) is called the correlation coefficient. Notice that the terms in the numerator are products of z-scores for x and y.
1. The value of r is a measure of the strength and direction of the association between x and y.
2. r is symmetric in x and y, so it doesn’t depend on which variable you consider to be explanatory and
which to be response.
3. −1 ≤ r ≤ 1
4. r² = SSR/SST = 1 − SSE/SST, where SSR = Σi (ŷi − ȳ)², SSE = Σi (yi − ŷi)², and SST = Σi (yi − ȳ)² (just like in ANOVA).
² Linear algebra note: in terms of linear algebra, this is the closest point in the model space to the data point.
Estimating σ²

We now have our least squares estimates for β0 and β1; what about the other parameter in the model, namely σ²?

    σ̂² = SSE/(n − 2) = Σi (yi − ŷi)² / (n − 2)
Why n − 2?
• This is the correct denominator to make the estimate unbiased: E(σ̂ 2 ) = σ 2 .
• This is the correct degrees of freedom3 , since (roughly) we lose a degree of freedom for estimating each
of β0 and β1 .
As usual, we will let s = √(σ̂²).
R and regression
The basic command for doing regression in R is the same as the one for doing ANOVA. In fact, ANOVA is
really just a special case of regression (more on that idea later).
Example 1. We want to predict ACT scores from SAT scores. We sample scores from 60 students who
have taken both tests.
> getData("m243/act-sat") -> act
> xyplot(ACT~SAT,act,panel=panel.lm)
> act.lm <- lm(ACT~SAT,data=act)
> plot(act.lm)
> anova(act.lm)
Analysis of Variance Table

Response: ACT
          Df Sum Sq Mean Sq F value    Pr(>F)
SAT        1 874.37  874.37  116.16 1.796e-15 ***
Residuals 58 436.56    7.53

> summary(act.lm)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.427642   1.691981  -0.844    0.402
SAT          0.024498   0.001807  13.556   <2e-16 ***

Residual standard error: 2.744 on 58 degrees of freedom
Multiple R-Squared: 0.667,     Adjusted R-squared: 0.6612
F-statistic: 116.2 on 1 and 58 DF,  p-value: 1.796e-15

[Figure: scatterplot of ACT vs. SAT with the fitted regression line.]

> act[47,]
   Student SAT ACT
47      47 420  21
> actAdj <- act[-47,]; lm.actAdj <- lm(ACT~SAT,actAdj)
> summary(lm.actAdj)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.626282   1.844230   0.882    0.382
SAT         0.021374   0.001983  10.778 1.80e-15 ***

Residual standard error: 2.333 on 57 degrees of freedom
Multiple R-Squared: 0.7632,    Adjusted R-squared: 0.7591
F-statistic: 183.8 on 1 and 57 DF,  p-value: < 2.2e-16

³ Linear algebra note: degrees of freedom equals dimension of subspace.
Since the model is based on normal distributions and we don’t know σ . . .
1. Regression is going to be sensitive to outliers. Outliers with especially large or small values of the
independent variable are especially influential.
2. We can check if the model is reasonable by looking at our residuals:
(a) Histograms and normal quantile plots indicate overall normality.
We are looking for a roughly bell-shaped histogram or a roughly linear normal quantile plot.
(b) Plots of
• residuals vs x, or
• residuals vs. order, or
• residuals vs. fit
note: fit =
indicate whether the standard deviation appears to remain constant throughout.
We are looking to NOT see any clear pattern in these plots. A pattern would indicate something
other than randomness is influencing the residuals.
3. We can do inference for β0 , β1 , etc. using the t distributions, we just need to know the corresponding
eSE and degrees of freedom.
parameter   estimator   eSE                                        df
β0          β̂0 = b0     eSEb0 = s √( 1/n + x̄²/Σ(xi − x̄)² )         n − 2
β1          β̂1 = b1     eSEb1 = s / √( Σ(xi − x̄)² )                n − 2
We won’t ever compute these eSE’s by hand, but notice that they are made up of pieces that look
familiar (square roots, n in the denominator, square of differences from the mean, all the usual stuff.)
Furthermore, just by looking at the formulas, we can learn something about the behavior of the
confidence intervals and hypothesis tests involved.
eSEb0 and eSEb1 are easy to identify in the computer output. We can also find the (two-sided) p-values
for two hypothesis tests.
(a) H0 : β0 = 0 [usually not interesting]
(b) H0 : β1 = 0
This is much more interesting for two reasons. First, the slope is often a very interesting parameter
to know because
Second, this is a measure of how useful the model is for making predictions because β1 = 0 only
if
4. Confidence intervals for β0 (usually not interesting) and β1 (usually interesting) have a familiar form:
estimate ± t∗ eSE
April 16, 2007
Inference for Regression
Four inference situations: β0 (usually least interesting), β1 (and model utility), predicting the mean response
for a given value of the explanatory variable, predicting an individual response for a given value of the
explanatory variable. All of these are based on the normal and t- distributions (because of the model).
In each of the cases below,

    (estimate − parameter) / eSE  ∼  t(n−2)

parameter                   estimator                    eSE                                               df
β0                          β̂0 = b0                      eSEb0 = s √( 1/n + x̄²/Σ(xi − x̄)² )                n − 2
β1                          β̂1 = b1                      eSEb1 = s / √( Σ(xi − x̄)² )                       n − 2
µy|x*                       µ̂y|x* = ŷ = β̂0 + β̂1 x*       eSEµ̂ = s √( 1/n + (x* − x̄)²/Σ(xi − x̄)² )          n − 2
y (individual prediction)   ŷ = β̂0 + β̂1 x*               eSEŷ = s √( 1 + 1/n + (x* − x̄)²/Σ(xi − x̄)² )      n − 2
Confidence and prediction intervals in R
> act.lm <- lm(ACT~SAT,data=act)
> predict(act.lm,newdata=data.frame(SAT=1000),interval='confidence')
          fit      lwr      upr
[1,] 22.99997 22.21076 23.78917
> predict(act.lm,newdata=data.frame(SAT=1000),interval='prediction')
          fit      lwr      upr
[1,] 22.99997 17.45177 28.54816
> xyplot(ACT~SAT,data=act,panel=panel.lmbands)
Model Checking for Regression
• If a line doesn’t fit, don’t fit a line. (Anscombe’s examples)
• Look at various residual plots.
Where does regression go from here?
• Regression with transformed data
• More than one explanatory variable
• Categorical explanatory variables
(ANOVA is regression)
• Robust regression
• Logistic regression
(allows for categorical response)
Example: Skin Thickness and Body Density
There are many reasons why one would like to know the fat content of a human body. The most accurate way to estimate this is by determining the body density (weight per unit volume). Since fat is less dense than other body tissue, a lower density indicates a higher relative fat content. Body density is difficult to measure directly (the standard method requires weighing the subject underwater), so scientists have looked for other measurements that can accurately predict body density. One such measurement, which we will call skinfold thickness, is actually the logarithm of the sum of four skinfold thicknesses measured at different points on the body.
Use the R output below to answer the question: How well does skinfold thickness predict body density?

[Figures: scatterplot of density vs. skthick, and the standard diagnostic plots for the fitted model (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance).]

Residuals:
      Min        1Q    Median        3Q       Max
-0.018967 -0.005092 -0.000498  0.004949  0.023679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.16300    0.00656   177.3   <2e-16
skthick     -0.06312    0.00414   -15.2   <2e-16

Residual standard error: 0.00854 on 90 degrees of freedom
Multiple R-Squared: 0.72,     Adjusted R-squared: 0.717
F-statistic: 232 on 1 and 90 DF,  p-value: <2e-16

Analysis of Variance Table

          Df  Sum Sq Mean Sq F value Pr(>F)
skthick    1 0.01691 0.01691     232 <2e-16
Residuals 90 0.00656 0.00007
April 20, 2007
Regression with transformed data
Reasons to transform data
• better fit
• better residual behavior
• theoretical model
Some common transformations
If we transform the data by x′ = f(x) and y′ = g(y), then we can fit the model:

    y′ = β0 + β1 x′ + ε′
and then back-transform to get a model in terms of the original variables.
transformation               linear form                 back transformation         residuals
x′ = log(x), y′ = y          y = β0 + β1 log(x)          y = β0 + β1 log(x) + ε′     ε = ε′ ∼ N(0, σ)
y′ = log(y), x′ = x          log(y) = β0 + β1 x          y = β0 · e^(β1 x) · ε       log(ε) = ε′ ∼ N(0, σ)
x′ = log(x), y′ = log(y)     log(y) = β0 + β1 log(x)     y = β0 · x^β1 · ε           log(ε) = ε′ ∼ N(0, σ)
x′ = 1/x, y′ = y             y = β0 + β1/x               y = β0 + β1/x + ε           ε = ε′ ∼ N(0, σ)
...                          ...                         ...                         ...
Some examples
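A minimal sketch of fitting one of these transformed models in R, using the Animals data from the MASS package (mentioned earlier); the log-log model here is just one plausible illustration, not necessarily the example worked in class, and panel.lm is the helper from the course's m243.R:

> require(MASS); data(Animals)
> animals.lm <- lm(log(brain) ~ log(body), data=Animals)         # fit log(y) = b0 + b1 log(x)
> summary(animals.lm)
> xyplot(log(brain) ~ log(body), data=Animals, panel=panel.lm)   # check linearity on the log scale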
April 23, 2007
Logistic regression: Regression with a categorical response
The logistic regression model
• Idea: predict p(x) = probability of success for given value of explanatory variable x.
• Problem: p(x) ∈ [0, 1], but the range of β0 + β1 x is (−∞, ∞)
• A Fix: transform p(x) to something with range (−∞, ∞)
  ◦ odds: p(x)/(1 − p(x)) ∈ [0, ∞)    [assuming p(x) ≠ 1]
  ◦ log odds: log( p(x)/(1 − p(x)) ) ∈ (−∞, ∞)    [assuming p(x) ≠ 0 and p(x) ≠ 1]
• logistic model (fit and back transformation)
    log( p(x)/(1 − p(x)) ) = β0 + β1 x        and        p(x) = e^(β0 + β1 x) / (1 + e^(β0 + β1 x))
Fitting the logistic regression model
Logistic regression is not fit using a least squares method. Instead it uses the maximum likelihood
method.
• data: (1, S), (2, S), (3, F )
• likelihood:

    L(b0, b1) = p(1) · p(2) · (1 − p(3))
              = [ e^(b0 + b1·1) / (1 + e^(b0 + b1·1)) ] · [ e^(b0 + b1·2) / (1 + e^(b0 + b1·2)) ] · [ 1 − e^(b0 + b1·3) / (1 + e^(b0 + b1·3)) ]
• We want to find b1 and b0 that make L(b0 , b1 ) as large as possible.
◦ maximum likelihood estimates for logistic regression are approximated by numerical methods.
For simple linear regression, the least squares estimates and the maximum likelihood estimates are the same.
Using R
R is happy to fit a logistic regression model for you.
Example 1. Space Shuttle O-rings.
Here is an example using the space shuttle O-ring data (Example 13.6). Note: the data in the book do not
match the electronic data. As far as I can tell, the printed data in the book are incorrect.
> data(xmp13.06)
> glm(Failure~Temperature,data=xmp13.06,family=’binomial’) -> glm.shuttle
> summary(glm.shuttle)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.74641    6.02142   1.951   0.0511 .
Temperature -0.18843    0.08909  -2.115   0.0344 *
> predict(glm.shuttle,newdata=data.frame(Temperature=31))
[1] 5.905015
> predict(glm.shuttle,newdata=data.frame(Temperature=31),type=’response’)
[1] 0.9972817
>
> temps <- seq(30,100,by=2)
> xyplot(predict(glm.shuttle,type=’response’,newdata=data.frame(Temperature=temps))~temps)
Example 2. Baseball
How well does average margin of victory predict winning percentage for major league baseball teams?
> getData("m243/runswins04") -> bb
> bb$runmargin = (bb$R - bb$OR) / (bb$G)
# the data set has one summarized row for each team, so different syntax here:
> glm(cbind(W,L)~runmargin,data=bb,family=’binomial’) -> glm.bb
> summary(glm.bb)
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.001175   0.029057   -0.04    0.968
runmargin    0.447539   0.041646   10.75   <2e-16 ***
> bb$winP <- bb$W/bb$G
> bb$predWinP <- predict(glm.bb,newdata=data.frame(runmargin=bb$runmargin),type=’response’)
> bb[,c(1,22,23)]
        TEAM      winP  predWinP
1    Anaheim 0.5679012 0.5696953
2     Boston 0.6049383 0.6221895
3  Baltimore 0.4814815 0.5079932
4  Cleveland 0.4938272 0.5003968
...
28  New York 0.4382716 0.4672926
29  Montreal 0.4135802 0.4082120
30 Milwaukee 0.4161491 0.4150606
>
Note: In this particular example, a simple linear regression works quite well too because winning percentages for baseball teams are quite close to .500, and the middle of the curve modeled by logistic regression is pretty flat.
April 26, 2007
Interpreting the parameters in logistic regression
Once again, the more interesting parameter is β1. According to the logistic regression model,

    β1 = log( p(x+1) / (1 − p(x+1)) ) − log( p(x) / (1 − p(x)) )
       = log( [p(x+1) / (1 − p(x+1))] / [p(x) / (1 − p(x))] )
       = log(odds ratio)

and

    e^β1 = odds ratio
That is, for each 1-unit increase in the explanatory variable, the odds change by a factor of e^β1.
We can use the output from R to construct a confidence interval for this odds ratio:
> glm(Failure~Temperature,data=xmp13.06,family=’binomial’) -> glm.shuttle
> summary(glm.shuttle)
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  11.74641    6.02142   1.951   0.0511 .
Temperature  -0.18843    0.08909  -2.115   0.0344 *
> coef(glm.shuttle)[2] + c(-1,1) * 0.08909 * qnorm(.975)
[1] -0.36304535 -0.01381897
> exp(coef(glm.shuttle)[2] + c(-1,1) * 0.08909 * qnorm(.975))
[1] 0.6955549 0.9862761
From which we determine the confidence intervals

    95% CI for β1:  −0.18843 ± 1.96 · 0.08909 = (−0.363, −0.0138)

    95% CI for the odds ratio (e^β1):  (e^−0.363, e^−0.0138) = (0.696, 0.986)
Relative risk
If we have two probabilities p and q (that don't necessarily add to 1), then

• odds ratio: [p / (1 − p)] / [q / (1 − q)]

• relative risk: p / q

Notice that the odds ratio satisfies

    [p / (1 − p)] / [q / (1 − q)] = (p/q) · (1 − q)/(1 − p) ,

so relative risk and odds ratio are nearly the same when (1 − q)/(1 − p) ≈ 1. This is often the case when p and q are small – rates of diseases, for example. Since relative risk is easier to interpret (I'm 3 times as likely . . . ) than odds ratio (my odds are 3 times greater than . . . ), we can use relative risk to help us get a feeling for odds ratios in some situations.
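A quick numerical sketch (made-up probabilities, not from the printed notes) of how close the two quantities are when p and q are small, and how far apart they can be otherwise:

p <- 0.02; q <- 0.01                                  # made-up small "disease rate" probabilities
c(relative.risk = p / q,
  odds.ratio    = (p / (1 - p)) / (q / (1 - q)))      # 2 vs. about 2.02 -- nearly the same
p <- 0.60; q <- 0.30
c(relative.risk = p / q,
  odds.ratio    = (p / (1 - p)) / (q / (1 - q)))      # 2 vs. 3.5 -- no longer close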
Multiple Regression
Multiple regression handles more than one explanatory variable. A typical model with k explanatory variables (often called predictors or regressors) has the form
Y = β0 + β1 x1 + β2 x2 + · · · + βk xk + ε
where ε ∼ N(0, σ). This model is called the general additive model.
Higher order terms and interaction
Many interesting regression models are formed by using transformations or combinations of explanatory
variables as predictors. Suppose we have two predictors. Here are some possible models
• First-order model (same as above):
      Y = β0 + β1 x1 + β2 x2 + ε
• Second-order, no interaction:
      Y = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x2² + ε
• First-order, plus interaction:
      Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε
• Complete Second-order model:
      Y = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x2² + β5 x1 x2 + ε
Categorical predictors
We can even handle categorical predictors! We do this by introducing dummy variables (less pejoratively
called indicator variables). If a variable v has only two possible values – A and B – we can build an
indicator variable for v as follows:
    x = 1 if v = B,   and   x = 0 if v ≠ B

That is, we simply code the possibilities as 0 and 1.
If we have more than two possible values (levels), we introduce multiple dummy variables4 (one less than
the number of levels). For a variable with three levels (A, B and C), one standard encoding (and the one
that R uses by default) is
    x1 = 1 if v = B, and 0 if v ≠ B        x2 = 1 if v = C, and 0 if v ≠ C

We don't need to have 3 dummy variables, since the intercept term captures the effect of one of the levels. Notice that ANOVA is simply a first-order linear model with this sort of encoding. R conveniently takes care of the recoding for us.
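One way to see the coding R uses is model.matrix(); a small sketch (not from the printed notes) with a made-up three-level factor:

v <- factor(c("A", "B", "C", "A", "B", "C"))   # made-up three-level factor
y <- c(1, 2, 3, 1.5, 2.5, 3.5)                 # made-up response
model.matrix(y ~ v)
# columns: (Intercept), vB, vC -- level A is absorbed into the intercept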
Footnote 4: There are models that do other things. Coding with a single variable with values 0, 1, 2 is possible, for example, but is a very different model. In this alternative there is an implied order among the groups and the effect of moving from category 0 to category 1 is the same size as the effect of moving from category 1 to category 2. This model should only be used if these assumptions make sense.
Interpreting the parameters
• βi for i > 0 can be thought of as adjustments to the baseline effect given by β0. (This is especially useful when the predictors are categorical and the intercept has a natural interpretation.)
• When there are no interaction or higher order terms in the model, the parameter βi can be interpreted as the amount we expect the response to change if we increase xi by 1 and leave all other predictors fixed. The effects due to different predictors are additive and do not depend on the values of the other predictors. Graphically this gives us parallel lines or planes (see the sketch after this list).
• With higher order terms in the model, the dependence of the response on one predictor (with all other
predictors fixed) may not be linear.
• With interaction terms, the model is no longer additive: the effect of changing one predictor may
depend on the values of the other predictors. Graphically, our curves are no longer parallel.
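A small simulated sketch of the "parallel lines" idea (the numbers here are made up; this is not from the printed notes): with no interaction term the two groups share a common slope, while an interaction term lets the slopes differ.

set.seed(1)
x <- runif(60, 0, 10)
g <- factor(rep(c("A", "B"), each = 30))
y <- 2 + 1 * x + 3 * (g == "B") + 0.8 * x * (g == "B") + rnorm(60)   # true slopes differ by 0.8
fit.add <- lm(y ~ x + g)    # additive model: parallel lines (one common slope)
fit.int <- lm(y ~ x * g)    # interaction model: the slopes may differ
coef(fit.add)
coef(fit.int)               # the x:gB coefficient estimates the difference in slopes (near 0.8)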
Fitting the model
The model can again be fit using least squares (or maximum likelihood) estimators.5 As you can imagine,
the formulas for estimated standard errors become quite complicated, but statistical software will easily
output this information for you, so we will focus on using the output from R and interpreting the results.
Example 1. Let’s fit a second-order model with and without interaction for the data in Example 13.13.
The variables are strength of concrete, % limestone (x1), and water-cement ratio (x2).
> summary(lm(strength~x1*x2,xmp13.13))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.217     30.304   0.205   0.8455
x1             5.779      2.079   2.779   0.0389 *
x2            51.333     50.434   1.018   0.3555
x1:x2         -9.357      3.461  -2.704   0.0426 *

Residual standard error: 2.423 on 5 degrees of freedom
Multiple R-Squared: 0.8946,     Adjusted R-squared: 0.8314
F-statistic: 14.15 on 3 and 5 DF,  p-value: 0.00706

> summary(lm(strength~x1+x2,xmp13.13))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  84.8167    12.2415   6.929 0.000448 ***
x1            0.1643     0.1431   1.148 0.294673
x2          -79.6667    20.0349  -3.976 0.007313 **

Residual standard error: 3.47 on 6 degrees of freedom
Multiple R-Squared: 0.7406,     Adjusted R-squared: 0.6541
F-statistic: 8.565 on 2 and 6 DF,  p-value: 0.01746
> anova(lm(strength~x1*x2,xmp13.13))
Analysis of Variance Table

Response: strength
          Df  Sum Sq Mean Sq F value   Pr(>F)
x1         1  15.870  15.870  2.7037 0.161039
x2         1 190.403 190.403 32.4376 0.002328 **
x1:x2      1  42.903  42.903  7.3090 0.042605 *
Residuals  5  29.349   5.870

> anova(lm(strength~x1+x2,xmp13.13))
Analysis of Variance Table

Response: strength
          Df  Sum Sq Mean Sq F value   Pr(>F)
x1         1  15.870  15.870  1.3179 0.294673
x2         1 190.403 190.403 15.8117 0.007313 **
Residuals  6  72.252  12.042
Model utility test
• H0 : βi = 0 for all i > 0 (all but the intercept coefficient).
  ◦ if H0 is true, then there is random variation about an overall mean response (β0), but this variation is not explained by any of our predictors – so our model is useless to predict the response.
• Ha : βi ≠ 0 for at least one i > 0.
• Test statistic: F = MSM/MSE, where SSE = Σ(yi − ŷi)², SST = Σ(yi − ȳ)², and SSM = SST − SSE. (A small numerical sketch follows this list.)
• degrees of freedom for model with k predictors (k + 1 parameters):
  ◦ intercept (β0): 1
  ◦ model (β1 . . . βk): k
  ◦ residuals: n − 1 − k

Footnote 5: If there are k predictors (k + 1 parameters including β0), then using the least squares approach will lead to a system of k + 1 equations in k + 1 unknowns, so standard linear algebra can be used to find the estimates.
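A sketch of this computation in R (not from the printed notes), assuming the xmp13.13 data from the example above is loaded; the result should agree with the F-statistic reported by summary():

fit <- lm(strength ~ x1 + x2, xmp13.13)
SSE <- sum(resid(fit)^2)
SST <- sum((xmp13.13$strength - mean(xmp13.13$strength))^2)
SSM <- SST - SSE
k <- 2; n <- nrow(xmp13.13)
Fstat <- (SSM / k) / (SSE / (n - 1 - k))    # MSM / MSE
Fstat                                       # should be about 8.565 (see the summary above)
summary(fit)$fstatistic                     # R reports the same value with its degrees of freedom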
April 27, 2007
Fitting the multiple regression model (cont’d)
Individual parameters
These should look familiar by now. Estimates and estimated standard errors for each parameter are part
of the summary() output. We can test hypotheses of the form H0 : βi = 0 or build confidence intervals for
βi from this information. Note that these tests depend on the model as well as the parameter being tested.
That is, we are testing whether or not βi = 0 in a model with a specific set of parameters. The parameter
βi may be significantly non-zero in some models, but not in others.
There are methods similar to the multiple comparisons method of Tukey (that we used in the ANOVA context) that allow one to construct simultaneous confidence intervals or confidence ellipsoids for sets of parameters.
We won’t cover these in this course.
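A quick sketch (not from the printed notes), again using the concrete data from above and assuming xmp13.13 is loaded: confint() builds confidence intervals for the individual parameters directly from this information.

fit <- lm(strength ~ x1 + x2, xmp13.13)
summary(fit)$coefficients     # estimates, standard errors, t values, p-values
confint(fit)                  # 95% confidence intervals for each parameter
confint(fit, level = 0.99)    # a different confidence level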
The other F tests
In our simple model (two first-order predictors, no interaction) the F-tests that appear in the ANOVA table are equivalent to the t tests in the coefficients table. In fact, F = t² and the degrees of freedom for the denominator of F are the same as the degrees of freedom for t.

In more complex situations the F-tests and p-values are not as easy to interpret and this table is not so directly useful, but you can see how the degrees of freedom of the model utility test correspond to a dividing up of SST. This output will be more interesting when we have two categorical predictors (stay tuned).
Model Comparison
Multiple regression has opened up a world in which there is (seemingly) no end to the number of models one
could choose to fit to a data set. How does one choose which one to actually use in a particular analysis?
The coefficient of determination: r²

As in simple linear regression, r² = SSM/SST measures the fraction of total variation explained by the model. A larger value of r² indicates that more of the variation is explained by the model, so generally bigger is better. But there are some difficulties in interpreting this number:

• The absolute size of a "good" r² is highly dependent on the area of application. If there is a lot of variation in the population at each possible setting of the explanatory variables, then it is impossible to get high r² values.

• If we add parameters to a model, r² will always increase. (Technically it could stay the same, but it can't decrease.) So when comparing models with different numbers of parameters, the model with more parameters has an edge. The "adjusted r²" shown in R output is one way to compensate for this. (See page 594 of Devore6 for the formula used.)
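A sketch of the adjustment (not from the printed notes), using the standard adjusted r² formula, which should agree with the Devore6 formula, and the additive concrete model from above:

fit <- lm(strength ~ x1 + x2, xmp13.13)
r2 <- summary(fit)$r.squared
n <- nrow(xmp13.13); k <- 2
1 - (1 - r2) * (n - 1) / (n - 1 - k)   # adjusted r^2; should be about 0.6541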
Individual parameters
If the p-value from a test for an individual parameter is not small, then we don’t have convincing evidence
that the parameter is non-zero. Such parameters become candidates for removal from the model.
Comparing nested models
Two models are said to be nested if the smaller model (the one with fewer parameters) is obtained from
the larger model by setting some of the parameters to 0. For this situation there is a formal test:
• H0 : βi = 0 for all i > l (the last k − l parameters in some ordering)
  ◦ if H0 is true, then the "reduced" model is correct.
• Ha : βi ≠ 0 for at least one i > l.
  ◦ if Ha is true, then the "full" model is better because at least some of the extra parameters are not 0.
• Test statistic: F = MSEdiff / MSEfull
The idea of this test is to take the unexplained variation from the reduced model and split it into two
pieces: the portion explained by the full model but not by the reduced model (SSEdif f = SSEreduced −
SSEf ull ) and the portion unexplained even in the full model (SSEf ull ). The degrees of freedom for the
numerator is the difference in the number of parameters for the two models. The degrees of freedom
for the denominator is the residual degrees of freedom for the full model.
> lm(strength~x1*x2,xmp13.13) -> lm.full
> lm(strength~x1+x2,xmp13.13) -> lm.noint
> anova(lm.noint,lm.full)
Analysis of Variance Table

Model 1: strength ~ x1 + x2
Model 2: strength ~ x1 + x2 + x1:x2
  Res.Df    RSS Df Sum of Sq     F  Pr(>F)
1      6 72.252
2      5 29.349  1    42.903 7.309 0.04260 *
Here RSS is what we have typically called SSE and stands for Residual Sum of Squares. Notice that

    F = [(RSSreduced − RSSfull) / dfdiff] / [RSSfull / dffull] = (42.903/1) / (29.349/5) = 7.309
In this case we get a small p-value because a large portion of the variation that the smaller model cannot explain is explained by the larger model. So it seems we should include the interaction term. Now let's
compare the full model to a model without a term for x2:
> lm(strength~x1+x1:x2,xmp13.13) -> lm.nox2;
> summary(lm.nox2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.0167     1.6200  22.850  4.6e-07 ***
x1            3.7478     0.5863   6.392  0.00069 ***
x1:x2        -5.9725     0.9628  -6.203  0.00081 ***

Residual standard error: 2.43 on 6 degrees of freedom
Multiple R-Squared: 0.8728,     Adjusted R-squared: 0.8304
F-statistic: 20.58 on 2 and 6 DF,  p-value: 0.002058
> summary(lm.full) # output shown above
> anova(lm.nox2, lm.full)
Analysis of Variance Table

Model 1: strength ~ x1 + x1:x2
Model 2: strength ~ x1 + x2 + x1:x2
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1      6 35.430
2      5 29.349  1     6.081 1.036 0.3555
The evidence suggests that dropping x2 from the model (but leaving the interaction term x1:x2) is appropriate.
Relationship to 1-parameter tests
If we compare two models, one our (larger) model of interest and the other a model with one parameter removed, we get another way to do our test for a single parameter. These tests are equivalent, and F = t².
> lm(strength~x1*x2,xmp13.13) -> lm.full
> lm(strength~x2+x1:x2,xmp13.13) -> lm.nox1
> anova(lm.nox1,lm.full)
Analysis of Variance Table

Model 1: strength ~ x2 + x1:x2
Model 2: strength ~ x1 * x2
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1      6 74.694
2      5 29.349  1    45.345 7.7251 0.03893 *

> lm(strength~x1+x2,xmp13.13) -> lm.noint
> lm(strength~x2,xmp13.13) -> lm.x2
> anova(lm.x2,lm.noint)
Analysis of Variance Table

Model 1: strength ~ x2
Model 2: strength ~ x1 + x2
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      7 88.122
2      6 72.252  1    15.870 1.3179 0.2947
> summary(lm.full)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.217     30.304   0.205   0.8455
x1             5.779      2.079   2.779   0.0389 *
x2            51.333     50.434   1.018   0.3555
x1:x2         -9.357      3.461  -2.704   0.0426 *

Residual standard error: 2.423 on 5 degrees of freedom
Multiple R-Squared: 0.8946,     Adjusted R-squared: 0.8314
F-statistic: 14.15 on 3 and 5 DF,  p-value: 0.00706

> summary(lm.noint)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  84.8167    12.2415   6.929 0.000448 ***
x1            0.1643     0.1431   1.148 0.294673
x2          -79.6667    20.0349  -3.976 0.007313 **

Residual standard error: 3.47 on 6 degrees of freedom
Multiple R-Squared: 0.7406,     Adjusted R-squared: 0.6541
F-statistic: 8.565 on 2 and 6 DF,  p-value: 0.01746
Notice that this indicates that it might be OK to drop x1 from the x1+x2 model, but not from the larger
model. Whether or not a parameter can be dropped depends on the rest of the model.
Diagnostics

The same sorts of diagnostic checks can be done for multiple regression as for simple linear regression. Most of these involve looking at the residuals. As before, plot(lm(...)) will make several types of residual plots. Consideration of the normality and homoscedasticity assumptions of regression should also be a part of selecting a model.
The final choice
So how do we choose a model? There are no hard and fast rules, but here are some things that play a role:
• A priori theory.
Some models are chosen because there is some scientific theory that predicts a relationship of a certain form.
Statistics is used to find the most likely parameters in a model of this form. If there are competing
theories, we can fit multiple models and see which seems to fit better.
• Previous experience.
Models that have worked well in other similar situations may work well again.
• The data.
Especially in new situations, we may only have the data to go on. Regression diagnostics, adjusted r²,
various hypothesis tests, and other methods like the commonly used information criteria AIC and BIC
can help us choose between models. In general, it is good to choose the simplest model that works
well.
There are a number of methods that have been proposed to automate the process of searching through
many models to find the “best” one. One commonly used one is called stepwise regression. Stepwise
regression works by repeatedly dropping or adding a single term from the model until there are no
such single parameter changes that improve the model (based on some criterion; AIC is the default in
R.) The function step() will do this in R.
If the number of parameters is small enough, one could try all possible subsets of the parameters. This
could find a “better” model than the one found by stepwise regression.
AIC: Akaike's Information Criterion

    AIC = 2k + n ln(RSS/n)

where k is the number of parameters and RSS is the residual sum of squares (SSE). Smaller is better. There are theoretical reasons for this particular formula (based on likelihood methods), but notice that the first addend increases with each parameter in the model and the second decreases; so this balances the expected gains from adding new parameters against the costs of complication and the potential for over-fitting. The scale of AIC is only meaningful relative to a fixed data set.
There are several other criteria that have been proposed.
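A sketch of this formula in R (not from the printed notes), assuming the xmp13.13 data from the earlier example is loaded. For linear models, extractAIC() computes essentially this quantity (R's AIC() function differs by an additive constant that does not affect comparisons on a fixed data set):

fit <- lm(strength ~ x1 + x2, xmp13.13)
n <- nrow(xmp13.13)
RSS <- sum(resid(fit)^2)
k <- length(coef(fit))             # number of parameters in the model (3 here)
2 * k + n * log(RSS / n)           # AIC from the formula above
extractAIC(fit)                    # (edf, AIC) -- the second number should match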
April 30, 2007
More regression examples
Example 1. Where regression got its name.
In the 1890’s Karl Pearson gathered data on over 1100 British families. Among other things he measured
the heights of parents and children. The analysis below uses his data for heights of mothers and
daughters. Can you see why this data (and analysis) led him to coin the phrase “regression toward the
mean”?
> getCalvin('m243')
> require(alr3)
> data(heights)
> xyplot(Dheight~Mheight,heights)
> summary(lm(Dheight~Mheight,heights))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.91744    1.62247   18.44   <2e-16 ***
Mheight      0.54175    0.02596   20.87   <2e-16 ***

Residual standard error: 2.266 on 1373 degrees of freedom
Multiple R-Squared: 0.2408,     Adjusted R-squared: 0.2402
F-statistic: 435.5 on 1 and 1373 DF,  p-value: < 2.2e-16
Example 2. Rats were given a dose of a drug proportional to their body weight. The rats were then
slaughtered and the amount of drug in the liver and the weight of the liver were measured.
> require(alr3); data(rat)
> summary(lm(y~BodyWt*LiverWt,rat))

Call:
lm(formula = y ~ BodyWt * LiverWt, data = rat)

Residuals:
      Min        1Q    Median        3Q       Max
-0.133986 -0.043370 -0.007227  0.036389  0.184029

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.573822   1.468416   1.072    0.301
BodyWt         -0.007759   0.008568  -0.906    0.379
LiverWt        -0.177329   0.198202  -0.895    0.385
BodyWt:LiverWt  0.001095   0.001138   0.962    0.351

Residual standard error: 0.09193 on 15 degrees of freedom
Multiple R-Squared: 0.1001,     Adjusted R-squared: -0.07985
F-statistic: 0.5563 on 3 and 15 DF,  p-value: 0.6519

> plot(lm(y~BodyWt*LiverWt,rat))
None of the parameters looks significantly non-zero. What is the interpretation?
Example 3. Home field advantage?
A professor from the University of Minnesota ran tests to see if adjusting the air conditioning in the Metrodome could affect the distance a batted ball travels. Notice that Cond is categorical and indicates whether there was an (artificial) headwind or tailwind.
> require(alr3); data(domedata)
> summary(domedata)
   Cond       Velocity         Angle
 Head:19   Min.   :149.3   Min.   :48.30
 Tail:15   1st Qu.:154.1   1st Qu.:49.50
           Median :155.5   Median :50.00
           Mean   :155.2   Mean   :49.98
           3rd Qu.:156.3   3rd Qu.:50.60
           Max.   :160.9   Max.   :51.00
...
> lm.dome01 <- lm(Dist~Velocity+Angle+BallWt+BallDia+Cond, data=domedata)
> summary(lm.dome01)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  181.7443   335.6959   0.541  0.59252
Velocity       1.7284     0.5433   3.181  0.00357 **
Angle         -1.6014     1.7995  -0.890  0.38110
BallWt        -3.9862     2.6697  -1.493  0.14659
BallDia      190.3715    62.5115   3.045  0.00502 **
CondTail       7.6705     2.4593   3.119  0.00418 **

Residual standard error: 6.805 on 28 degrees of freedom
Multiple R-Squared: 0.5917,     Adjusted R-squared: 0.5188
F-statistic: 8.115 on 5 and 28 DF,  p-value: 7.81e-05
It looks like the angle and ball weight don’t matter much. Velocity and direction of wind do (this makes
sense). So does ball diameter; that’s a bit unfortunate. Let’s do some model comparisons.
> step(lm.dome01,direction="both")
Start:  AIC= 135.8
Dist ~ Velocity + Angle + BallWt + BallDia + Cond

           Df Sum of Sq     RSS    AIC
- Angle     1     36.67 1333.24 134.75
<none>                  1296.57 135.80
- BallWt    1    103.24 1399.81 136.40
- BallDia   1    429.46 1726.03 143.53
- Cond      1    450.46 1747.03 143.94
- Velocity  1    468.68 1765.26 144.29

Step:  AIC= 134.75
Dist ~ Velocity + BallWt + BallDia + Cond

           Df Sum of Sq     RSS    AIC
<none>                  1333.24 134.75
- BallWt    1    111.05 1444.30 135.47
+ Angle     1     36.67 1296.57 135.80
- BallDia   1    408.92 1742.16 141.84
- Velocity  1    481.14 1814.38 143.22
- Cond      1    499.48 1832.72 143.56

Call:
lm(formula = Dist ~ Velocity + BallWt + BallDia + Cond, data = domedata)

Coefficients:
(Intercept)     Velocity       BallWt      BallDia     CondTail
    133.824        1.750       -4.127      184.842        7.990

> lm.dome02 <- lm(Dist~Velocity+Cond,data=domedata)
> summary(lm.dome02)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.9887    83.9460  -0.333   0.7411
Velocity      2.4388     0.5404   4.513 8.63e-05 ***
CondTail      6.5118     2.5935   2.511   0.0175 *

Residual standard error: 7.497 on 31 degrees of freedom
Multiple R-Squared: 0.4513,     Adjusted R-squared: 0.4159
F-statistic: 12.75 on 2 and 31 DF,  p-value: 9.118e-05

> anova(lm.dome02,lm.dome01)
Analysis of Variance Table

Model 1: Dist ~ Velocity + Cond
Model 2: Dist ~ Velocity + Angle + BallWt + BallDia + Cond
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1     31 1742.45
2     28 1296.57  3    445.88 3.2096 0.03812 *

> lm.dome03 <- lm(Dist~Velocity+Cond+BallDia,data=domedata)
> anova(lm.dome02,lm.dome03)
Analysis of Variance Table

Model 1: Dist ~ Velocity + Cond
Model 2: Dist ~ Velocity + Cond + BallDia
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)
1     31 1742.45
2     30 1444.30  1    298.15 6.193 0.01860 *
Two-way ANOVA
Two-way ANOVA is used to measure the dependence of a quantitative response on two categorical predictors.
It can be thought of as multiple regression with 2 categorical variables and interaction. In the simplest
design (often attainable in experimental situations), we measure the response at every combination of the
two categorical predictors the same number of times. The number of times per combination must be at least
two, else the model has as many parameters as the data set has values, so although we can fit the model,
it will always fit perfectly and there will be no degrees of freedom left over to estimate the accuracy of the
predictions. If the number of replicates at each combination of explanatory variables is not equal, then a
more complicated analysis is necessary.
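A quick way to check that a design is balanced is to tabulate the two factors. A sketch (not from the printed notes), assuming a data frame like the rats data used in Example 4 below is loaded:

xtabs(~ poison + treat, data = rats)   # counts for each poison/treatment combination
# For the rats data used below this should show the same count (4) in every cell,
# i.e., a balanced design with 4 replicates per combination.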
Example 4. More rats.
Rats were given one of three types of poison and one of four types of treatment. The time until death was recorded. An initial fit reveals violations of our homoscedasticity assumption, so we'll try some transformations before fitting the model.
> lm(time~poison*treat,rats) -> lm.rats
> plot(lm.rats)                           # reveals some problems
> lm(log(time)~poison*treat,rats) -> lm.rats2
> plot(lm.rats2)                          # better, but not perfect
> lm(I(1/time)~poison*treat,rats) -> lm.rats3
> plot(lm.rats3)                          # looking pretty good
> anova(lm.rats3)
Analysis of Variance Table

Response: I(1/time)
             Df Sum Sq Mean Sq F value    Pr(>F)
poison        2 34.877  17.439 72.6347 2.310e-13
treat         3 20.414   6.805 28.3431 1.376e-09
poison:treat  6  1.571   0.262  1.0904    0.3867
> summary(lm.rats3)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       2.48688    0.24499  10.151 4.16e-12
poisonII          0.78159    0.34647   2.256 0.030252
poisonIII         2.31580    0.34647   6.684 8.56e-08
treatB           -1.32342    0.34647  -3.820 0.000508
treatC           -0.62416    0.34647  -1.801 0.080010
treatD           -0.79720    0.34647  -2.301 0.027297
poisonII:treatB  -0.55166    0.48999  -1.126 0.267669
poisonIII:treatB -0.45030    0.48999  -0.919 0.364213
poisonII:treatC   0.06961    0.48999   0.142 0.887826
poisonIII:treatC  0.08646    0.48999   0.176 0.860928
poisonII:treatD  -0.76974    0.48999  -1.571 0.124946
poisonIII:treatD -0.91368    0.48999  -1.865 0.070391

Residual standard error: 0.49 on 36 degrees of freedom
Multiple R-Squared: 0.8681,     Adjusted R-squared: 0.8277
F-statistic: 21.53 on 11 and 36 DF,  p-value: 1.289e-12
The final model has a nice interpretation since the reciprocal of time is a rate of death. Let's look at the ANOVA table. It displays the results of three hypothesis tests.

• A test for interaction.
  This is a model comparison between models with and without the (6) interaction terms. Since it is not significant, we can proceed to look at main effects. If the p-value had been small, then we could not really interpret the main effects in isolation.

• Two "main effects" tests.
  These show a highly significant main effect for both factors: It matters both which poison and which treatment method are used, and the effects of these are largely independent. (The "better" treatments for one poison are better for the other poisons as well, etc.)

Although the table does not show a row for residuals, each of these tests uses a statistic of the form F = MSeffect / MSresiduals. Notice that we are dividing our variation into three pieces rather than just two.
We can follow up by fitting a model without interaction and by making interaction plots. (The interaction plots could also have been used from the start as a quick look, much like we used side-by-side boxplots for 1-way ANOVA.)
> lm(I(1/time)~poison+treat,rats) -> lm.rats4
> summary(lm.rats4)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.6977     0.1744  15.473  < 2e-16 ***
poisonII      0.4686     0.1744   2.688  0.01026 *
poisonIII     1.9964     0.1744  11.451 1.69e-14 ***
treatB       -1.6574     0.2013  -8.233 2.66e-10 ***
treatC       -0.5721     0.2013  -2.842  0.00689 **
treatD       -1.3583     0.2013  -6.747 3.35e-08 ***

Residual standard error: 0.4931 on 42 degrees of freedom
Multiple R-Squared: 0.8441,     Adjusted R-squared: 0.8255
F-statistic: 45.47 on 5 and 42 DF,  p-value: 6.974e-16
> with(rats,interaction.plot(treat,poison,1/time))
> with(rats,interaction.plot(poison,treat,1/time))
More Examples
Example 5. Gas pump emissions.
> require(alr3); data(sniffer)
Example 6.
> getData(’m243/ratsleep’) ->sleep;
Example 7.
> getData(’m243/drumbeating’) -> drum;
May 7–9, 2007
Chi-squared goodness of fit
Example 1. Golfballs
> golfballs <- c(137, 138, 107, 104)
> chisq.test(golfballs)
Chi-squared test for given probabilities
data: golfballs
X-squared = 8.4691, df = 3, p-value = 0.03725
# extra information displayed using tool from m243
> show.chisq.test(golfballs)
Chi-squared test for given probabilities
data: golfballs
X-squared = 8.4691, df = 3, p-value = 0.03725

  137.00     138.00     107.00     104.00
 (121.50)   (121.50)   (121.50)   (121.50)
 [ 1.98]    [ 2.24]    [ 1.73]    [ 2.52]
 < 1.41>    < 1.50>    <-1.32>    <-1.59>

key:
  observed
  (expected)
  [contribution to X-squared]
  <residual>
Example 2. M&M’s
> mm <- c(114,123,137,84,79,74)
> nullProb <- c(.24,.20,.16,.14,.13,.13)
> show.chisq.test(mm,p=nullProb)
Chi-squared test for given probabilities
data: mm
X-squared = 23.4223, df = 5, p-value = 0.0002802

  114.00     123.00     137.00      84.00      79.00      74.00
 (146.64)   (122.20)   ( 97.76)   ( 85.54)   ( 79.43)   ( 79.43)
 [ 7.2652]  [ 0.0052]  [15.7506]  [ 0.0277]  [ 0.0023]  [ 0.3712]
 <-2.695>   < 0.072>   < 3.969>   <-0.167>   <-0.048>   <-0.609>

key:
  observed
  (expected)
  [contribution to X-squared]
  <residual>
Chi-squared for two-way tables
Example 3. Smoking
> getData(’m243/familySmoking’) -> smoke
> names(smoke)
[1] "Student" "Parents"
> xtabs(~Student+Parents,data=smoke)
              Parents
Student        BothSmoke NeitherSmokes OneSmokes
  DoesNotSmoke      1380          1168      1823
  Smokes             400           188       416
> xtabs(~Student+Parents,data=smoke) -> smokeTab
> show.chisq.test(smokeTab)
Pearson's Chi-squared test
data: smokeTab
X-squared = 37.5663, df = 2, p-value = 6.96e-09

 1380.00     1168.00     1823.00
(1447.51)   (1102.71)   (1820.78)
[ 3.1488]   [ 3.8655]   [ 0.0027]
<-1.774>    < 1.966>    < 0.052>

  400.00      188.00      416.00
( 332.49)   ( 253.29)   ( 418.22)
[13.7086]   [16.8288]   [ 0.0118]
< 3.703>    <-4.102>    <-0.109>

key:
  observed
  (expected)
  [contribution to X-squared]
  <residual>
Example 4. On time arrival of airlines
> getData(’m243/airlineArrival’)-> air
> xtabs(~Result+Airline,air) -> airTab
> airTab
         Airline
Result    Alaska AmericaWest
  Delayed    501         787
  OnTime    3274        6438

# row percentages are less interesting (why?)
> col.perc(airTab)
         Airline
Result       Alaska AmericaWest
  Delayed 0.1327152   0.1089273
  OnTime  0.8672848   0.8910727
> chisq.test(airTab)
Pearson's Chi-squared test with Yates' continuity correction
data: airTab
X-squared = 13.3426, df = 1, p-value = 0.0002594

> show.chisq.test(airTab,correct=F)
Pearson's Chi-squared test
data: airTab
X-squared = 13.5717, df = 1, p-value = 0.0002296

  501.00      787.00
 ( 442.02)   ( 845.98)
 [7.87]      [4.11]
 < 2.81>     <-2.03>

 3274.00     6438.00
 (3332.98)   (6379.02)
 [1.04]      [0.55]
 <-1.02>     < 0.74>

key:
  observed
  (expected)
  [contribution to X-squared]
  <residual>
Chi-squared applied to 2 × 2 tables tends to give p-values that are too small when the cell counts are small. Yates proposed a "continuity correction" that subtracts 0.5 from |observed − expected| before squaring. His correction tends to produce p-values that are a bit too large. When cell counts are large, there is very little difference, but the Yates p-value will always be larger than the uncorrected p-value.
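A sketch of the correction applied by hand (not from the printed notes), with the airline table entered directly as a matrix; the two statistics should match the chisq.test() output above.

obs <- matrix(c(501, 787, 3274, 6438), nrow = 2, byrow = TRUE)   # Delayed / OnTime rows
exp <- outer(rowSums(obs), colSums(obs)) / sum(obs)              # expected counts
sum( (abs(obs - exp))^2       / exp )    # uncorrected X-squared (about 13.57)
sum( (abs(obs - exp) - 0.5)^2 / exp )    # Yates-corrected X-squared (about 13.34)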
This is another example that shows Simpson’s paradox:
> col.perc(xtabs(~Result+Airline,air,subset=Airport=="LosAngeles"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.1109123   0.1442663
  OnTime  0.8890877   0.8557337

> col.perc(xtabs(~Result+Airline,air,subset=Airport=="Phoenix"))
         Airline
Result        Alaska AmericaWest
  Delayed 0.05150215  0.07897241
  OnTime  0.94849785  0.92102759

> col.perc(xtabs(~Result+Airline,air,subset=Airport=="SanFrancisco"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.1685950   0.2873051
  OnTime  0.8314050   0.7126949

> col.perc(xtabs(~Result+Airline,air,subset=Airport=="Seattle"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.1421249   0.2328244
  OnTime  0.8578751   0.7671756

> col.perc(xtabs(~Result+Airline,air,subset=Airport=="SanDiego"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.0862069   0.1450893
  OnTime  0.9137931   0.8549107
A mosaic plot provides a visual representation of what is happening in the data.
> levels(air$Airport)
[1] "LosAngeles"
"Phoenix"
"SanDiego"
[4] "SanFrancisco" "Seattle"
> levels(air$Airport)[4] <- "SanF"
> levels(air$Airport)[3] <- "SanD"
> levels(air$Airport)[1] <- "LAX"
> levels(air$Airline)
[1] "Alaska"
"AmericaWest"
> levels(air$Airline)[2] = "AmWest"
> mosaic(~Airport+Result+Airline,air,shade=T)
> levels(air$Airline)[1] = "AK"
> levels(air$Airline)[2] = "AW"
> levels(air$Result)
[1] "Delayed" "OnTime"
> levels(air$Result)[1] = "Lt"
> levels(air$Result)[2] = "OT"
> levels(air$Result)[1] = "Lt"
> require(vcd); mosaic(~Airport+Airline+Result,air)
> mosaicplot(~Airport+Airline+Result,air)
Example 5. Gun Deaths
> guns <- getData('m243/gunDeaths')
> names(guns)
[1] "Death"   "Firearm"
> xtabs(~Death+Firearm,guns) -> gunsTab
> gunsTab
           Firearm
Death       Handgun Rifle Shotgun Unknown
  Homicides     468    15      28      13
  Suicides      124    24      22       5
> gunsTab[,-4]
           Firearm
Death       Handgun Rifle Shotgun
  Homicides     468    15      28
  Suicides      124    24      22
> show.chisq.test(gunsTab[,-4])
Pearson's Chi-squared test
data: gunsTab[, -4]
X-squared = 42.6264, df = 2, p-value = 5.544e-10

  468.00      15.00      28.00
 (444.22)   ( 29.26)   ( 37.52)
 [ 1.27]    [ 6.95]    [ 2.41]
 < 1.13>    <-2.64>    <-1.55>

  124.00      24.00      22.00
 (147.78)   (  9.74)   ( 12.48)
 [ 3.83]    [20.90]    [ 7.26]
 <-1.96>    < 4.57>    < 2.69>

key:
  observed
  (expected)
  [contribution to X-squared]
  <residual>
Entering a table by hand
> x <- c(468,15,28,124,24,22)
> matrix(x,nrow=2,byrow=T) -> xt
> xt
     [,1] [,2] [,3]
[1,]  468   15   28
[2,]  124   24   22
> chisq.test(xt)
Pearson’s Chi-squared test
data: xt
X-squared = 42.6264, df = 2, p-value = 5.544e-10