Math 141 - Lecture 9: Inference

Albyn Jones
Library 304
www.people.reed.edu/~jones/courses/141
Inference: Main Topics
Estimation: What’s our best guess? (last time!)
Hypothesis Testing: Are the data consistent with a hypothesized model?
Confidence Intervals: How accurate is our estimate?
We will develop these concepts in the context of
the binomial distribution first...
Motivation: Hypothesis tests
Imagine I tell you I have tossed a coin 10 times, and got 10
Heads in the 10 tosses. What are the possible explanations?
The coin really isn’t a fair coin: P(H) ≠ 1/2.
The coin is fair, and we observed a rare event.
Other possibilities: the tosses weren’t
independent, I made it all up, etc.
The first two possibilities lie at the heart of
hypothesis testing.
The problem: No Certainty!
[Figure: bar plot of the distribution for X ∼ Binomial(10, 1/2); outcomes 0 to 10 on the horizontal axis, probabilities from 0.00 to 0.20 on the vertical axis.]
There are no impossible outcomes, though some are
improbable.
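As a rough check of this claim, the following R sketch tabulates the Binomial(10, 1/2) probabilities with dbinom. The variable names (outcomes, probs, p_ten_heads) are illustrative and not part of the lecture.

```r
# Probabilities of 0..10 heads in 10 tosses of a fair coin
outcomes <- 0:10
probs <- dbinom(outcomes, size = 10, prob = 0.5)

# Every outcome has positive probability, so none is impossible
all(probs > 0)

# Ten heads in ten tosses is improbable but not impossible: (1/2)^10 = 1/1024
p_ten_heads <- dbinom(10, size = 10, prob = 0.5)
p_ten_heads

# Even the most probable outcome (5 heads) has probability below 0.25
max(probs)
```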
Terminology
Null Hypothesis H0
the specification of a probability model corresponding to the
substantive question we would like to answer. Often it is a
hypothesis we hope to reject!
Example:
A researcher tests a new treatment for anorexia, and wants to
know if it has any effect on subjects’ gain or loss of weight. Let
X be the number of subjects who gain weight after treatment.
Assuming subjects are independent, we might wish to test the
null hypothesis that the treatment does no better than chance:
the probability of weight gain is p = 1/2, and thus
X ∼ Binomial(n, 1/2).
Decisions, Decisions!
The standard solution: choose a Rejection Region, a set of
outcomes that you consider to be evidence that the specified
probability model should be rejected.
Important: getting an outcome in the rejection region does not
prove that the model is wrong, since the outcomes in the
rejection region are not impossible, just improbable.
We reject hypotheses, rather than proving them false: either we
have observed a rare event, or the null hypothesis is false.
More Terminology!
Significance Level, α
The Significance Level, or α-level, also called the size of the
test, is the probability of getting an outcome in the rejection
region given that the null hypothesis is correct.
If X is the test statistic, and RR the rejection region, then
P(X ∈ RR|H0 ) = α
α is a conditional probability: how often do we reject H0 when
H0 is really correct?
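One way to read α as a long-run frequency is to simulate many experiments in which H0 is true and count how often the outcome lands in the rejection region. The sketch below assumes 10 tosses of a fair coin and borrows the rejection region {0, 1, 9, 10} used in the lecture's examples; the seed and the number of simulated experiments are arbitrary choices.

```r
set.seed(141)                      # arbitrary seed, for reproducibility only
RR <- c(0, 1, 9, 10)               # rejection region (assumed for this sketch)

# Simulate 100,000 experiments of 10 tosses each, with H0: p = 1/2 true
x <- rbinom(100000, size = 10, prob = 0.5)

# Fraction of simulated experiments in which we would (wrongly) reject H0
alpha_hat <- mean(x %in% RR)

# Exact value of P(X in RR | H0) for comparison
alpha_exact <- sum(dbinom(RR, size = 10, prob = 0.5))

c(simulated = alpha_hat, exact = alpha_exact)
```

The simulated fraction should be close to the exact value, with the gap shrinking as the number of simulated experiments grows.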
Example: Testing H0 : p = 1/2
The rejection region is colored red
[Figure: bar plot of Binomial(10, 0.5) over outcomes 0 to 10, probabilities from 0.00 to 0.20; alpha = 0.021.]
Computing α
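H0 : p = 1/2, for 10 coin tosses, vs. H1 : p ≠ 1/2. Let the
Rejection Region be {0, 1, 9, 10}. What is α?
# outcomes in the rejection region
RR <- c(0,1,9,10)
# alpha = P(X in RR | H0): sum the Binomial(10, 1/2) probabilities over RR
sum(dbinom(RR,10,.5))
[1] 0.02148438
α ≈ .02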
Two-tailed vs One-tailed tests
Suppose X ∼ Binomial(10, p).
The choice of rejection region depends on the alternative
hypothesis.
Two-tailed test: H0 : p = .5 vs. H1 : p ≠ .5;
for α ≈ .02, choose
RR = {0, 1, 9, 10}
One-tailed test: H0 : p ≥ .5 vs. H1 : p < .5;
for α ≈ .01, choose
RR = {0, 1}
One-tailed test: H0 : p ≤ .5 vs. H1 : p > .5;
for α ≈ .01, choose
RR = {9, 10}
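The three significance levels quoted above can be checked directly. This is a minimal sketch using dbinom and pbinom with probabilities computed at the boundary value p = .5; the variable names are illustrative.

```r
n <- 10
p0 <- 0.5

# Two-tailed: RR = {0, 1, 9, 10}
alpha_two <- sum(dbinom(c(0, 1, 9, 10), n, p0))

# One-tailed, H1: p < .5, RR = {0, 1}, i.e. P(X <= 1)
alpha_lower <- pbinom(1, n, p0)

# One-tailed, H1: p > .5, RR = {9, 10}, i.e. P(X > 8)
alpha_upper <- pbinom(8, n, p0, lower.tail = FALSE)

round(c(two = alpha_two, lower = alpha_lower, upper = alpha_upper), 4)
```

Because Binomial(10, 1/2) is symmetric, each one-tailed region has exactly half the size of the two-tailed region.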
Testing H0 : p ≥ 1/2 vs. H1 : p < 1/2
The rejection region is colored red
[Figure: bar plot of Binomial(10, 0.5) over outcomes 0 to 10, probabilities from 0.00 to 0.20; alpha = 0.011.]
Still More Terminology: P-Values
p-value
For a two-sided test: the probability of getting an outcome
at least as unlikely as the observed outcome, given that
the null hypothesis is correct.
For a one-sided test: the probability of getting an outcome
X ≤ X_obs if H1 : p < p0 , or
X ≥ X_obs if H1 : p > p0 .
In other words, the probability of getting an outcome at
least as inconsistent with H0 as the observed outcome.
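These definitions translate fairly directly into code. The helper below, binom_pvalue, is a hypothetical function written for illustration (it is not part of base R or of the lecture); it is applied to the opening example of 10 heads in 10 tosses.

```r
# Hypothetical helper: exact binomial p-value computed from the definitions
binom_pvalue <- function(x_obs, n, p0,
                         alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  outcomes <- 0:n
  probs <- dbinom(outcomes, n, p0)          # probabilities under H0
  switch(alternative,
    less      = sum(probs[outcomes <= x_obs]),   # P(X <= x_obs)
    greater   = sum(probs[outcomes >= x_obs]),   # P(X >= x_obs)
    # outcomes at least as unlikely as x_obs; the small tolerance
    # guards against rounding error when probabilities tie
    two.sided = sum(probs[probs <= dbinom(x_obs, n, p0) * (1 + 1e-7)])
  )
}

# Ten heads in ten tosses, H0: p = 1/2
binom_pvalue(10, 10, 0.5, "two.sided")   # counts both 0 and 10 heads
binom_pvalue(10, 10, 0.5, "greater")     # counts 10 heads only
```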
Another version: P-Values
p-value
The significance level (size) of the smallest rejection region
containing the observed statistic, defined by the alternative
hypothesis with probabilities computed under H0 : p = p0 .
H0 : p = 1/2, two sided p-value
Outcomes contributing to the p-value are colored red
Suppose we toss a coin n = 10 times to test H0 vs. the
alternative p ≠ 1/2. We observe X = 7. What is the p-value?
[Figure: bar plot of H0: X ~ Binomial(10, 0.5) over outcomes 0 to 10, probabilities from 0.00 to 0.20; p-value = 0.344.]
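The two-sided p-value for X = 7 can be reproduced by listing the outcomes that are at least as unlikely as 7 under H0 and summing their probabilities. The sketch below does this by hand and then compares the result with R's built-in binom.test; the variable names are illustrative.

```r
n <- 10
p0 <- 0.5
x_obs <- 7

outcomes <- 0:n
probs <- dbinom(outcomes, n, p0)

# Outcomes at least as unlikely as the observed one under H0
# (the small tolerance guards against rounding error in ties)
extreme <- outcomes[probs <= dbinom(x_obs, n, p0) * (1 + 1e-7)]
extreme

# Two-sided p-value: total probability of those outcomes
p_two_sided <- sum(dbinom(extreme, n, p0))
p_two_sided

# Cross-check against the exact binomial test in base R
p_check <- binom.test(x_obs, n, p = p0, alternative = "two.sided")$p.value
```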
Testing H0 : p ≤ 1/2 vs H1 : p > 1/2
Outcomes contributing to the p-value are colored red
Suppose we toss a coin n = 10 times to test H0 vs. the
alternative p > 1/2. We observe X = 7. What is the p-value?
[Figure: bar plot of H0: X ~ Binomial(10, 0.5) over outcomes 0 to 10, probabilities from 0.00 to 0.20; p-value = 0.172.]
Testing H0 : p ≥ 1/2 vs H1 : p < 1/2
Outcomes contributing to the p-value are colored red
Suppose we toss a coin n = 10 times to test H0 vs. the
alternative p < 1/2. We observe X = 7. What is the p-value?
[Figure: bar plot of H0: X ~ Binomial(10, 0.5) over outcomes 0 to 10, probabilities from 0.00 to 0.20; p-value = 0.945.]
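Both one-sided p-values for X = 7 are tail probabilities of Binomial(10, 1/2), so pbinom gives them directly. A minimal sketch, with illustrative variable names:

```r
n <- 10
p0 <- 0.5
x_obs <- 7

# H1: p > 1/2, so the p-value is P(X >= 7) = P(X > 6)
p_greater <- pbinom(x_obs - 1, n, p0, lower.tail = FALSE)

# H1: p < 1/2, so the p-value is P(X <= 7)
p_less <- pbinom(x_obs, n, p0)

round(c(greater = p_greater, less = p_less), 3)

# The two p-values sum to more than 1 because both tails include X = 7
p_greater + p_less - 1
dbinom(x_obs, n, p0)
```

The same observation gives weak evidence in one direction (0.172) and essentially none in the other (0.945), which is why the alternative has to be fixed before looking at the data.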
An Argument for Two-sided Tests
In almost all settings, two-sided tests are the standard ritual:
The p-value for a one-sided test can be half that for a
two-sided test with the same observed outcome.
What if we get a result that would allow us to reject H0 with
a one-sided test, but not with the two-sided test we initially
conducted?
Rude Question: Did the researcher change the alternative
hypothesis after computing the p-value for the two-sided
alternative?
Some journals and federal agencies will not accept
one-sided hypothesis tests!
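To see how the choice of alternative can flip a decision, consider a hypothetical experiment (the numbers are not from the lecture): 12 heads in 16 tosses, tested against H0 : p = 1/2 at α = .05. The sketch uses binom.test from base R.

```r
# Hypothetical data, chosen only to illustrate the point
x_obs <- 12
n <- 16

# One-sided alternative H1: p > 1/2
p_one <- binom.test(x_obs, n, p = 0.5, alternative = "greater")$p.value

# Two-sided alternative H1: p != 1/2
p_two <- binom.test(x_obs, n, p = 0.5, alternative = "two.sided")$p.value

round(c(one_sided = p_one, two_sided = p_two), 4)

# Decisions at alpha = .05
c(reject_one_sided = p_one < 0.05, reject_two_sided = p_two < 0.05)
```

With these numbers the one-sided test rejects and the two-sided test does not, so switching alternatives after seeing the data would change the reported conclusion.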
Terminology
Suppose we have tested some null hypothesis H0 . What do we
say when we reject H0 ? Here are some standard expressions
you will hear or see in journals:
We reject H0 at α = .05.
We reject H0 , p = .043. (p here refers to the p-value, not
the binomial parameter p!)
p̂ is statistically significantly different from p0 .
The result is statistically significant.
The difference is large enough that it is unlikely to have
occurred by chance.
Significance
It is important to distinguish statistical significance from
substantive significance:
statistical significance: the difference is large enough to
detect with this sample size.
substantive significance: the difference is large enough that
we care about it!
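A small numerical illustration of the distinction, using made-up data: the same observed proportion, 0.51, is tested against H0 : p = 1/2 with two different sample sizes. Both data sets are hypothetical.

```r
# Hypothetical data: the observed proportion is 0.51 in both cases
p_small <- binom.test(51, 100, p = 0.5)$p.value          # n = 100
p_large <- binom.test(51000, 100000, p = 0.5)$p.value    # n = 100,000

c(n_100 = p_small, n_100000 = p_large)
```

The difference from 1/2 is the same size in both cases; only the larger sample makes it statistically detectable. Whether a difference of 0.01 matters is a substantive question that the p-value does not answer.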
More Terminology
Suppose we have tested some null hypothesis H0 . What do we
say when we fail to reject H0 ? Here are some standard
expressions you will hear or see in journals:
We fail to reject H0 at α = .05.
The observed difference is no bigger than would be
expected due to chance.
The result is marginally significant at p = .07. We expect a
larger replication of the experiment would demonstrate
statistical significance.
Your paper is not accepted for publication. The editors
agree that your result would be interesting if it were
confirmed, and encourage you to resubmit after collecting
more data.
Interpretation
Some consider p-values to measure the strength of evidence
against H0 . This is dangerous, since p-values only measure the
probability of data at least as extreme as that observed, assuming H0 is true, and:
P(X |H0 ) ≠ P(H0 |X )
Example: Screening Tests
Recall the example of screening tests we studied with Bayes’
Theorem. Let H0 be !D, i.e. ‘no disease’, and let P be ‘tests positive’.
Sensitivity: is power, the probability we reject H0 when it is
false.
P(P|D) = .99
Specificity: one minus the specificity (the false positive rate) is the
significance level or, if the test is positive, the p-value
P(P|!D) = 1 − specificity = .01
Prevalence: typically not available for hypothesis tests!
P(D) = .01
P(P|H0 ) = .01 ≠ .5 = P(H0 |P)
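The value P(H0 |P) = .5 follows from Bayes’ Theorem applied to the three numbers above. A minimal sketch of the arithmetic, with illustrative variable names:

```r
sens      <- 0.99   # P(P | D), sensitivity
false_pos <- 0.01   # P(P | !D), false positive rate
prev      <- 0.01   # P(D), prevalence

# Total probability of a positive test
p_positive <- false_pos * (1 - prev) + sens * prev

# Bayes' Theorem: P(H0 | P) = P(P | !D) P(!D) / P(P)
p_H0_given_P <- false_pos * (1 - prev) / p_positive
p_H0_given_P
```

Even though a positive result has probability only .01 under H0, half of all positive results come from people without the disease, because the disease itself is rare.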
Summary
Null Hypothesis (H0 ): specifies a probability model.
Rejection Region: the set of values that we feel cast
serious doubt on H0 .
Significance Level (α): The probability of getting an
outcome in the rejection region, assuming H0 is true.
p-value: For a two-sided alternative, the probability, assuming H0 is
true, of the set of values at least as unlikely as the observed value.
Language Matters! Statistical significance is not the same
as substantive significance!