
Hypothesis Tests: An Overview
Copyright (c) 2008 by The McGraw-Hill Companies. This material is intended solely for
educational use by licensed users of LearningStats. It may not be copied or resold for profit.
Hypothesis Formulation
H0: The null hypothesis
H1: The alternative hypothesis
Tip 1 H0 and H1 must be mutually exclusive (both cannot be
true) and collectively exhaustive (cover all possibilities). H0 is a
belief that we try to reject, using sample evidence.
Tip 2 H0 generally represents the status quo, a historical
situation, or a benchmark. H1 generally represents a changed
state of affairs that we suspect may exist because of new
circumstances or an experimental condition.
Scientific Method
• Formulate a pair of hypotheses
• Provisionally assume that H0 is true
• Choose an appropriate statistical test
• See if the sample data contradict H0
Your decision:
Reject H0
or
Don’t Reject H0
Accept? Don’t Reject?
• Sample evidence can disprove H0
• But failure to reject H0 does not prove H0
Informally, people may say “Accept H0.”
But remember: “accept” is imprecise terminology, because
you could still reject H0 if new sample evidence comes along.
Does E = mc²?
Albert Einstein proposed his special theory of relativity in 1905.
Scientists have been testing it rigorously and trying to reject it
ever since. For about 100 years, the scientific evidence has not
seriously contradicted Einstein’s main predictions.
So why don’t scientists just say “O.K., Einstein was right!
We accept his theory”?
Because you can’t prove a theory. You can only fail to disprove it.
Why Is Science Tentative?
Question Many non-scientists dislike the
caution that is fundamental to science. They
equate “tentative” with “timid.” They ask
“Why don’t you just get it right?”
Answer Science requires an open
mind and the ability to discard a
favorite belief if new ideas and/or
evidence do arise. And they do!
Type I and II Error

                        When in reality …
If your decision is:    H0 is true       H0 is false
Reject H0               Type I error     No error
Don't reject H0         No error         Type II error
Three Definitions
Type I error: P(Reject H0 | H0 is true) = α
Type II error: P(Fail to reject H0 | H0 is false) = β
Power: P(Reject H0 | H0 is false) = 1 − β
An Ideal Test
The ideal statistical test has:
• Low α (small chance of Type I error)
• Low β (small chance of Type II error)
• High power (1 − β)
Perspective The statistician cannot eliminate risk, but can
(1) help a decision-maker consider the consequences of
Type I and II error, (2) show how to quantify the α and β
risks, and (3) advise on optimal sample sizes to control
these risks, subject to budget and time constraints.
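To make these risks concrete, here is a minimal Python sketch (all numbers hypothetical) that computes α, β, and power for a right-tail z test of a mean with known σ:

```python
from statistics import NormalDist

# Hypothetical setup: H0: mu = 100 vs H1: mu > 100,
# known sigma = 15, n = 36, alpha = 0.05.
mu0, sigma, n, alpha = 100, 15, 36, 0.05
se = sigma / n ** 0.5                      # standard error of the sample mean
z_crit = NormalDist().inv_cdf(1 - alpha)   # right-tail critical value
xbar_crit = mu0 + z_crit * se              # reject H0 if xbar exceeds this

# If the true mean is actually 105, beta is the chance we fail to reject H0.
mu_true = 105
beta = NormalDist(mu_true, se).cdf(xbar_crit)
power = 1 - beta
print(f"alpha = {alpha}, beta = {beta:.3f}, power = {power:.3f}")
```

Here α is fixed by the researcher, while β and power depend on how far the (unknown) true mean lies from the hypothesized value.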
Examples
Example 1: Cancer Screening
H0: No cancer exists
H1: Cancer exists
Type I error (false positive) unnecessarily
alarms the patient. Type II error (false
negative) fails to warn the patient.
Example 2: Quality Control
H0: Defect rate is normal
H1: Defect rate has risen
Type I error is to shut down production
unnecessarily. Type II error is to fail to stop
production to make adjustments.
Question Which risk is most to be feared?
Answer It depends on your tolerance for α and β risk.
More Examples
Example 3: Athlete Drug Testing
H0: No drug usage by athlete
H1: Drug usage by athlete
Type I error unfairly disqualifies a drug-free athlete.
Type II error lets a drug user get away with it.
Example 4: Airport Security
H0: Suitcase has no bomb
H1: Suitcase has a bomb
Type I error is to delay a flight unnecessarily. Type II
error is to endanger the entire flight.
Question Why can’t we simply avoid both types of error?
Answer Because information is always imperfect.
Still More Examples
Example 5: Criminal Trial
H0: Defendant is innocent
H1: Defendant is guilty
Type I error wrongly convicts the innocent. Type II
error lets a criminal back on the street.
Example 6: Biometric ATM Identification System
H0: You are yourself
H1: Someone is pretending to be you
Type I error is to deny you access to your money.
Type II error is to let someone else get your money.
Question In each scenario, how would you assess the
cost of Type I error? Of Type II error?
Medical Tests
Sensitivity The fraction of those with the disease
correctly identified as positive by the test
Specificity The fraction of those without the disease
correctly identified as negative by the test
Positive predictive value The fraction of people with
positive tests who actually have the condition
Negative predictive value The fraction of people with
negative tests who don’t have the condition
Medical Tests
Likelihood Ratio = Sensitivity / (1 − Specificity)
Interpretation
If you have a positive test, how many times more likely are
you to have the disease than someone with a negative test?
For example, a likelihood ratio of 5.0 means that those with a
positive test are five times more likely to have the disease.
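A short Python sketch of these four definitions and the likelihood ratio, using a hypothetical 2×2 table of screening results:

```python
# Hypothetical 2x2 screening results: rows = disease status, cols = test result.
TP, FN = 90, 10     # diseased patients: positive / negative tests
FP, TN = 20, 180    # healthy patients:  positive / negative tests

sensitivity = TP / (TP + FN)          # P(test+ | disease)
specificity = TN / (TN + FP)          # P(test- | no disease)
ppv = TP / (TP + FP)                  # P(disease | test+)
npv = TN / (TN + FN)                  # P(no disease | test-)
lr_positive = sensitivity / (1 - specificity)

print(sensitivity, specificity, round(ppv, 3), round(npv, 3), round(lr_positive, 1))
```

Note that sensitivity and specificity describe the test itself, while the predictive values also depend on how common the disease is in the group tested.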
Tradeoffs
For a given sample size, reducing α will increase β:
α ↓ ⇒ β ↑ (ceteris paribus)
For a given sample size, there is a tradeoff between
α and β. The choice depends on how much we
fear the Type I and Type II hazards.
If we can increase the sample size, we can reduce
both α and β simultaneously:
n ↑ ⇒ α ↓ and β ↓
Alternatively, a larger sample size can be used to
reduce either α or β, depending on how we weigh the
two types of risk.
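The tradeoff can be demonstrated numerically. This Python sketch (a hypothetical right-tail test of H0: μ = 100 against a true mean of 105, σ known) computes β for different choices of n and α:

```python
from statistics import NormalDist

# Hypothetical right-tail z test: H0: mu = 100, sigma = 15, true mu = 105.
def beta_risk(n, alpha, mu0=100, sigma=15, mu_true=105):
    se = sigma / n ** 0.5
    xbar_crit = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return NormalDist(mu_true, se).cdf(xbar_crit)   # P(fail to reject | H0 false)

# Same n: cutting alpha from 0.05 to 0.01 inflates beta.
print(round(beta_risk(36, 0.05), 3), round(beta_risk(36, 0.01), 3))
# Same alpha: raising n from 36 to 100 shrinks beta.
print(round(beta_risk(36, 0.05), 3), round(beta_risk(100, 0.05), 3))
```
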
Examples of One-Sample Tests

Parameter    Left-Tail Test     Two-Tail Test      Right-Tail Test
Mean         H0: μ ≥ 130        H0: μ = 130        H0: μ ≤ 130
             H1: μ < 130        H1: μ ≠ 130        H1: μ > 130
Proportion   H0: p ≥ 0.07       H0: p = 0.07       H0: p ≤ 0.07
             H1: p < 0.07       H1: p ≠ 0.07       H1: p > 0.07
Variance     H0: σ² ≥ 5.76      H0: σ² = 5.76      H0: σ² ≤ 5.76
             H1: σ² < 5.76      H1: σ² ≠ 5.76      H1: σ² > 5.76
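As one worked instance from the table, here is a Python sketch of the two-tail test of a mean (H0: μ = 130), with a hypothetical sample and an assumed known σ:

```python
from statistics import NormalDist

# Two-tail test of the mean: H0: mu = 130 vs H1: mu != 130.
# Hypothetical data: xbar = 134, known sigma = 12, n = 36.
mu0, xbar, sigma, n = 130, 134, 12, 36
z = (xbar - mu0) / (sigma / n ** 0.5)                 # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-tail p-value
print(f"z = {z:.2f}, two-tail p-value = {p_value:.4f}")
```

With these numbers z = 2.00, so H0 would be rejected at α = 0.05 but not at α = 0.01.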
Examples of Two-Sample Tests

Parameter      Left-Tail Test     Two-Tail Test      Right-Tail Test
2 Means        H0: μ1 ≥ μ2        H0: μ1 = μ2        H0: μ1 ≤ μ2
               H1: μ1 < μ2        H1: μ1 ≠ μ2        H1: μ1 > μ2
2 Proportions  H0: p1 ≥ p2        H0: p1 = p2        H0: p1 ≤ p2
               H1: p1 < p2        H1: p1 ≠ p2        H1: p1 > p2
2 Variances    H0: σ1² ≥ σ2²      H0: σ1² = σ2²      H0: σ1² ≤ σ2²
               H1: σ1² < σ2²      H1: σ1² ≠ σ2²      H1: σ1² > σ2²
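A Python sketch of the right-tail two-proportion test, using the usual pooled proportion under H0 and hypothetical counts:

```python
from statistics import NormalDist

# Right-tail test: H0: p1 <= p2 vs H1: p1 > p2.
# Hypothetical counts: 48 successes of 200 vs 30 of 200.
x1, n1, x2, n2 = 48, 200, 30, 200
p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                        # pooled proportion under H0
se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1 - p2) / se
p_value = 1 - NormalDist().cdf(z)                     # right-tail p-value
print(f"z = {z:.3f}, right-tail p-value = {p_value:.4f}")
```
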
k-Sample Tests

One-Factor ANOVA (completely randomized)
Model: Xij = μ + τj + εij
H0: τj = 0 vs H1: τj ≠ 0

Two-Factor ANOVA (randomized block)
Model: Xijk = μ + τj + φk + εijk
H0: τj = 0 vs H1: τj ≠ 0 (treatment effect)
H0: φk = 0 vs H1: φk ≠ 0 (block effect)

Two-Factor ANOVA (factorial model with interaction)
Model: Xijk = μ + τj + φk + (τφ)jk + εijk
H0: τj = 0 vs H1: τj ≠ 0 (row main effect)
H0: φk = 0 vs H1: φk ≠ 0 (column main effect)
H0: (τφ)jk = 0 vs H1: (τφ)jk ≠ 0 (interaction)
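The one-factor ANOVA F statistic can be computed by hand. A Python sketch with three hypothetical treatment groups:

```python
# Hypothetical one-factor ANOVA: three treatment groups, H0: all tau_j = 0.
groups = [[10, 12, 14], [15, 17, 19], [20, 22, 24]]

n_total = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / n_total

# Between-groups (treatment) and within-groups (error) sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) df")
```

A large F (here 18.75 on 2 and 6 degrees of freedom) indicates that the between-group variation is too large to attribute to chance, so H0 would be rejected.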
Nonparametric Tests

Runs Test (e.g., is the sequence TTTTFFFFFFTT random?)
H0: Sequence is random
H1: Sequence is not random

Kruskal-Wallis Test
H0: M1 = M2 = … = Mk (medians are equal)
H1: At least one Mj differs from the others

Rank Correlation Test for Paired Ranks (e.g., to see whether
two experts agree on the severity of illness in n patients)
H0: ρ = 0 (ranks are uncorrelated)
H1: ρ ≠ 0 (ranks are correlated)
Test Statistic
Statisticians know the sampling distributions for many
types of statistics (e.g., means, proportions, variances).
If we assume that the null hypothesis is true, a sample
result can be compared with the hypothesized parameter to
produce a test statistic. This is compared with a critical
value (from a table or from Minitab, Excel, or another
computer package) that specifies its likely range if the null
hypothesis were true.
If the test statistic lies within this range (the acceptance
region), the null hypothesis is not rejected. Outside the
likely range is the rejection region. If the test statistic
falls there, we conclude that the null hypothesis is false.
Critical Value
Right-tail test
Reject H0 if the test statistic exceeds the upper-tail critical value that
designates a given right-tail area α under the sampling distribution.
Left-tail test
Reject H0 if the test statistic is smaller than the lower-tail critical value
that designates a given left-tail area α under the sampling distribution.
Two-tail test
Reject H0 if the test statistic exceeds the upper-tail critical value, or is
smaller than the lower-tail critical value, where each critical value
designates a tail area of α/2 under the sampling distribution.
Level of Significance
The level of significance is the desired probability of Type
I error (the probability of rejecting a true null hypothesis).
Commonly, this is set at α = 0.10, 0.05, 0.025, 0.01, or
0.001, depending on the situation.
For example, when someone says that a test is significant
at the 5% level, they mean that the null hypothesis can be
rejected with less than a 5% chance of Type I error.
The level of significance dictates the critical value. The
smaller the level of significance, the stronger the sample
result must be in order to reject the null hypothesis.
Common Right-Tailed z Values
α = 0.10  → z = 1.282
α = 0.05  → z = 1.645
α = 0.025 → z = 1.960
α = 0.01  → z = 2.326
Common Two-Tailed z Values
α = 0.10 → z = 1.645
α = 0.05 → z = 1.960
α = 0.02 → z = 2.326
α = 0.01 → z = 2.576
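Both tables can be reproduced with the inverse normal CDF, which is available in Python's standard library:

```python
from statistics import NormalDist

# Recompute the critical z values for right-tail and two-tail tests.
for alpha in (0.10, 0.05, 0.025, 0.01):
    z_right = NormalDist().inv_cdf(1 - alpha)        # all of alpha in one tail
    z_two = NormalDist().inv_cdf(1 - alpha / 2)      # alpha split between two tails
    print(f"alpha = {alpha}: right-tail z = {z_right:.3f}, two-tail z = {z_two:.3f}")
```

Note that the two-tail critical value for α is the same as the right-tail value for α/2, which is why 1.960 appears in both tables.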
Choosing α
Question
The choice of α is up to the researcher.
So why not always choose a very low
level of significance such as 0.001?
Answer
For a given sample size, smaller α implies a
larger β. So when we reduce α, we also reduce
the power of the test.
The Old Way: Setting α
Step 1. Specify the desired α level (called the level of significance).
Commonly this would be 0.10, 0.05, 0.01, or 0.001.
Step 2. Set up a decision rule that defines a rejection region, based on
the distribution of the test statistic under the null hypothesis.
Step 3. Make the decision. If the test statistic falls in the rejection
region, the null hypothesis is rejected (the result is not attributed to chance).
Problem We must specify α in advance. This could tempt
the researcher to manipulate the decision by choosing an
α that would produce one result or the other.
The New Way: p-Values
Step 1. Calculate the test statistic in the usual way.
Step 2. Find the probability that a test statistic as extreme as the one
observed would occur due to chance, if the null hypothesis were true.
Step 3. Report the p-value, and let others decide on its significance,
using whatever criteria they wish.
Note This method was impractical until the advent of fast computers
and precise algorithms to find exact areas under the common
theoretical distributions (e.g., z, t, chi-square, and F).
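A Python sketch of Steps 1–3 for a z-based test, starting from a hypothetical observed test statistic:

```python
from statistics import NormalDist

# Step 1 (assumed done): a hypothetical observed test statistic.
z_observed = 1.87
# Step 2: probability of a result at least this extreme, if H0 were true.
right_tail_p = 1 - NormalDist().cdf(z_observed)
two_tail_p = 2 * (1 - NormalDist().cdf(abs(z_observed)))
# Step 3: report the p-value and let the reader judge its significance.
print(f"right-tail p = {right_tail_p:.4f}, two-tail p = {two_tail_p:.4f}")
```

For this hypothetical statistic, the two-tail p-value falls between 0.05 and 0.10, so different readers applying different α criteria could reach different conclusions, which is exactly why reporting the p-value itself is informative.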
p-Values: An Example
Problem Two drugs to treat migraines are compared. The proportion
of patients experiencing relief is calculated from matched samples of
patients. A right-tail test is used to test whether drug A yields a higher
proportion of patients experiencing relief than drug B. The resulting
p-value for the difference of proportions is p = 0.092.
Interpretation This p-value would lead to rejection of the null
hypothesis at α = 0.10 (though just barely) but not at α = 0.05 or
any smaller α. A researcher might consider this a marginal result.
However, patients who suffer migraines might want to try drug A
anyway, because there is at least some evidence that it is better
than drug B.
Meta-Analysis
Scenario One study of an experimental drug’s effect on one-year
post-stroke mortality showed a reduction of 1.78% relative to a control
group (p = 0.0779). A similar study in a different hospital showed a
one-year post-stroke mortality reduction of 2.39% relative to a control
group (p = 0.1080). Neither result is significant at α = 0.05, since each
p-value exceeds 0.05. Would combining the studies improve the power?
Meta-Analysis Combining the two samples shows a mortality
reduction of 1.96% (a weighted average of the two studies).
However, the larger sample size of the combined studies improves
the power of the test (p = 0.0341) and leads to the conclusion that
the mortality reduction is statistically significant at α = 0.05.
Note Meta-analysis assumes that
the studies are comparable.
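The scenario does not give the studies' sample sizes, so the following Python sketch uses hypothetical counts to illustrate the same effect: two individually non-significant studies that become significant when pooled.

```python
from statistics import NormalDist

# Hypothetical counts (the slide's actual sample sizes are not given).
def right_tail_p(d1, n1, d2, n2):
    """One-tail z test of H1: control mortality > treatment mortality."""
    p1, p2 = d1 / n1, d2 / n2                    # control vs treatment rates
    p_pool = (d1 + d2) / (n1 + n2)               # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    return 1 - NormalDist().cdf((p1 - p2) / se)

# Two identical hypothetical studies: 50/200 control deaths vs 40/200 treated.
p_single = right_tail_p(50, 200, 40, 200)        # one study alone
p_pooled = right_tail_p(100, 400, 80, 400)       # both studies pooled
print(round(p_single, 4), round(p_pooled, 4))
```

The effect size is the same in both calculations; only the sample size changes, and that alone moves the p-value across the 0.05 threshold.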
Power Curve
Comment Slight
departures from H0 are
hard to detect.
Interpretation This is a power curve for a two-tailed test of the
hypothesis that the true proportion of side effects for a drug is
15%, using a sample of n = 225 patients. The graph shows that if
the true proportion of side effects is 9%, the power of the test is only
47%. In general, power is low if the true proportion is near the
hypothesized 15%, but rises quickly the more the true proportion
differs from the hypothesized 15%.
Power and
True Parameter Value
Comment Large
departures from H0 are
easier to detect.
Interpretation This is the same power curve for a two-tailed test
of the hypothesis that the true proportion of side effects for a drug
is 15% using a sample of n = 225 patients. The graph shows that
if the true proportion of side effects is 7%, the power of the test is
86%. The test has more power because 7% is farther from the
hypothesized 15% than in the previous diagram.
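These power figures can be checked numerically. The slides do not state the α used, so this Python sketch assumes α = 0.01, which reproduces the quoted 47% and 86%:

```python
from statistics import NormalDist

# Power of a two-tail test of H0: p = 0.15 with n = 225.
# Assumption: alpha = 0.01 (not stated on the slide).
def power(p_true, p0=0.15, n=225, alpha=0.01):
    se0 = (p0 * (1 - p0) / n) ** 0.5             # SE under H0 (sets the cutoffs)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo, hi = p0 - z * se0, p0 + z * se0          # acceptance region for p-hat
    dist = NormalDist(p_true, (p_true * (1 - p_true) / n) ** 0.5)
    return dist.cdf(lo) + (1 - dist.cdf(hi))     # P(reject H0 | p_true)

print(round(power(0.09), 2), round(power(0.07), 2))
```

Evaluating this function over a grid of true proportions traces out the whole power curve described in the two slides above.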
Sample Size and Power
Comment Large
sample size improves
power.
Interpretation This diagram shows a family of power curves for
three different sample sizes, using the same two-tailed test of the
hypothesis that the true proportion of side effects for a drug is
15%. For any value along the X axis, the power of the test is
higher when n is large.
Alpha and Power
Comment Larger α means more power. Why?
α ↑ ⇒ β ↓ ⇒ power ↑,
since power = 1 − β.
Interpretation This diagram shows a family of power curves for
three different α levels, using the same two-tailed test of the
hypothesis that the true proportion of side effects for a drug is 15%.
For any value along the X axis, the power of the test increases when
α is increased. But should we increase α (accepting more Type I risk)
to gain power?