Hypothesis Tests: An Overview

Copyright (c) 2008 by The McGraw-Hill Companies. This material is intended solely for educational use by licensed users of LearningStats. It may not be copied or resold for profit.

Hypothesis Formulation

H0: The null hypothesis
H1: The alternative hypothesis

Tip 1: H0 and H1 must be mutually exclusive (both cannot be true) and collectively exhaustive (together they cover all possibilities). H0 is a belief that we try to reject using sample evidence.

Tip 2: H0 generally represents the status quo, a historical situation, or a benchmark. H1 generally represents a changed state of affairs that we suspect may exist because of new circumstances or an experimental condition.

Scientific Method

• Formulate a pair of hypotheses
• Provisionally assume that H0 is true
• Choose an appropriate statistical test
• See if the sample data contradict H0

Your decision: Reject H0 or Don't Reject H0

Accept? Don't Reject?

• Sample evidence can disprove H0
• But failure to reject H0 does not prove H0

Informally, people may say "Accept H0." But remember: "accept" is imprecise terminology, because you could still reject H0 if new sample evidence comes along.

Does E = mc²?

Albert Einstein proposed his special theory of relativity, which includes the relation E = mc², in 1905. Scientists have been testing it rigorously and trying to reject it ever since. For about 100 years, the scientific evidence has not seriously contradicted Einstein's main predictions. So why don't scientists just say, "O.K., Einstein was right! We accept his theory"? Because you can't prove a theory. You can only fail to disprove it.

Why Is Science Tentative?

Question: Many non-scientists dislike the caution that is fundamental to science. They equate "tentative" with "timid." They ask, "Why don't you just get it right?"

Answer: Science requires an open mind and the ability to discard a favorite belief if new ideas or evidence arise. And they do!
Type I and II Error

  When in reality …    If your decision is:
                       Reject H0        Don't Reject H0
  H0 is true           Type I error     No error
  H0 is false          No error         Type II error

Three Definitions

Type I error: P(Reject H0 | H0 is true) = α
Type II error: P(Accept H0 | H0 is false) = β
Power: P(Reject H0 | H0 is false) = 1 − β

An Ideal Test

The ideal statistical test has:
• Low α (small chance of Type I error)
• Low β (small chance of Type II error)
• High power (1 − β)

Perspective

The statistician cannot eliminate risk, but can (1) help a decision-maker consider the consequences of Type I and II error, (2) show how to quantify the α and β risk, and (3) advise on optimal sample sizes to control these risks, subject to budget and time constraints.

Examples

Example 1: Cancer Screening
H0: No cancer exists
H1: Cancer exists
Type I error (false positive) unnecessarily alarms the patient. Type II error (false negative) fails to warn the patient.

Example 2: Quality Control
H0: Defect rate is normal
H1: Defect rate has risen
Type I error is to shut down production unnecessarily. Type II error is to fail to stop production to make adjustments.

Question: Which risk is most to be feared?
Answer: It depends on your tolerance for α and β risk.

More Examples

Example 3: Athlete Drug Testing
H0: No drug usage by athlete
H1: Drug usage by athlete
Type I error unfairly disqualifies a drug-free athlete. Type II error lets a drug user get away with it.

Example 4: Airport Security
H0: Suitcase has no bomb
H1: Suitcase has a bomb
Type I error is to delay a flight unnecessarily. Type II error is to endanger the entire flight.

Question: Why can't we simply avoid both types of error?
Answer: Because information is always imperfect.

Still More Examples

Example 5: Criminal Trial
H0: Defendant is innocent
H1: Defendant is guilty
Type I error wrongly convicts the innocent. Type II error lets a criminal back on the street.

Question: In each scenario, how would you assess the cost of Type I error? Of Type II error?
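The three definitions above can be checked by simulation. The sketch below uses illustrative settings (not from the text): a right-tail z test of H0: μ ≤ 0 with known σ = 1, n = 25, α = 0.05, and a true mean of 0.5 under H1. Repeated sampling estimates α, β, and power directly from their definitions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n, sigma, alpha = 25, 1.0, 0.05         # assumed illustrative settings
z_crit = norm.ppf(1 - alpha)            # right-tail critical value (about 1.645)
reps = 20_000

# Type I error: sample repeatedly under H0 (true mean = 0), count rejections
means_h0 = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
z0 = means_h0 / (sigma / np.sqrt(n))
alpha_hat = np.mean(z0 > z_crit)        # should be close to 0.05

# Type II error and power: sample under H1 (assumed true mean = 0.5)
means_h1 = rng.normal(0.5, sigma, size=(reps, n)).mean(axis=1)
z1 = means_h1 / (sigma / np.sqrt(n))
power_hat = np.mean(z1 > z_crit)        # P(Reject H0 | H0 false)
beta_hat = 1 - power_hat                # P(Accept H0 | H0 false)

print(f"alpha ~ {alpha_hat:.3f}, beta ~ {beta_hat:.3f}, power ~ {power_hat:.3f}")
```

The simulated α lands near the chosen 0.05, and power plus β sum to 1, mirroring the definitions.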
Example 6: Biometric ATM Identification System
H0: You are yourself
H1: Someone is pretending to be you
Type I error is to deny you access to your money. Type II error is to let someone else get your money.

Medical Tests

Sensitivity: the fraction of those with the disease correctly identified as positive by the test.
Specificity: the fraction of those without the disease correctly identified as negative by the test.
Positive predictive value: the fraction of people with positive tests who actually have the condition.
Negative predictive value: the fraction of people with negative tests who do not have the condition.

Likelihood Ratio = Sensitivity / (1 − Specificity)

Interpretation: if you have a positive test, how many times more likely are you to have the disease than someone with a negative test? For example, a likelihood ratio of 5.0 means that those with a positive test are five times more likely to have the disease.

Tradeoffs

For a given sample size, reducing α will increase β: ↓α ⇒ ↑β (ceteris paribus). For a given sample size, there is a tradeoff between α and β; the choice depends on how much we fear the Type I and Type II hazards. If we can increase the sample size, we can reduce both α and β simultaneously: ↑n ⇒ ↓α and ↓β. Alternatively, a larger sample size can be used to reduce either α or β, depending on how we weigh the two types of risk.
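The four medical-test definitions above can be computed directly from a 2×2 table of test results. The counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical screening results (illustrative counts, not from the text):
#                 disease   no disease
# test positive      90          50
# test negative      10         850
tp, fp, fn, tn = 90, 50, 10, 850

sensitivity = tp / (tp + fn)   # 90 / 100  = 0.90   (positives among the diseased)
specificity = tn / (tn + fp)   # 850 / 900 ~ 0.944  (negatives among the healthy)
ppv = tp / (tp + fp)           # 90 / 140  ~ 0.643  (diseased among positives)
npv = tn / (tn + fn)           # 850 / 860 ~ 0.988  (healthy among negatives)

lr_plus = sensitivity / (1 - specificity)   # likelihood ratio, = 16.2 here

print(sensitivity, specificity, ppv, npv, lr_plus)
```

Note that even with high sensitivity and specificity, the positive predictive value can be modest when the disease is rare, which is why all four quantities are reported.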
Examples of One-Sample Tests

Parameter    Left-Tail Test      Two-Tail Test       Right-Tail Test
Mean         H0: μ ≥ 130         H0: μ = 130         H0: μ ≤ 130
             H1: μ < 130         H1: μ ≠ 130         H1: μ > 130
Proportion   H0: p ≥ 0.07        H0: p = 0.07        H0: p ≤ 0.07
             H1: p < 0.07        H1: p ≠ 0.07        H1: p > 0.07
Variance     H0: σ² ≥ 5.76       H0: σ² = 5.76       H0: σ² ≤ 5.76
             H1: σ² < 5.76       H1: σ² ≠ 5.76       H1: σ² > 5.76

Examples of Two-Sample Tests

Parameter      Left-Tail Test     Two-Tail Test      Right-Tail Test
2 Means        H0: μ1 ≥ μ2        H0: μ1 = μ2        H0: μ1 ≤ μ2
               H1: μ1 < μ2        H1: μ1 ≠ μ2        H1: μ1 > μ2
2 Proportions  H0: p1 ≥ p2        H0: p1 = p2        H0: p1 ≤ p2
               H1: p1 < p2        H1: p1 ≠ p2        H1: p1 > p2
2 Variances    H0: σ1² ≥ σ2²      H0: σ1² = σ2²      H0: σ1² ≤ σ2²
               H1: σ1² < σ2²      H1: σ1² ≠ σ2²      H1: σ1² > σ2²

k-Sample Tests

One-Factor ANOVA (completely randomized): X_ij = μ + τ_j + ε_ij
  H0: τ_j = 0; H1: τ_j ≠ 0
Two-Factor ANOVA (randomized block): X_ijk = μ + τ_j + φ_k + ε_ijk
  H0: τ_j = 0; H1: τ_j ≠ 0 (treatment effect)
  H0: φ_k = 0; H1: φ_k ≠ 0 (block effect)
Two-Factor ANOVA (factorial model with interaction): X_ijk = μ + τ_j + φ_k + (τφ)_jk + ε_ijk
  H0: τ_j = 0; H1: τ_j ≠ 0 (row main effect)
  H0: φ_k = 0; H1: φ_k ≠ 0 (column main effect)
  H0: (τφ)_jk = 0; H1: (τφ)_jk ≠ 0 (interaction)

Nonparametric Tests

Runs Test (e.g., is the sequence TTTTFFFFFFTT random?)
  H0: Sequence is random; H1: Sequence is not random
Kruskal-Wallis Test
  H0: M1 = M2 = … = Mk (medians are equal); H1: At least one Mj differs from the others
Rank Correlation Test for Paired Ranks (e.g., to see whether two experts agree on the severity of illness in n patients)
  H0: ρ = 0 (ranks are uncorrelated); H1: ρ ≠ 0 (ranks are correlated)

Test Statistic

Statisticians know the sampling distributions for many types of statistics (e.g., means, proportions, variances). If we assume that the null hypothesis is true, a sample result can be compared with the hypothesized parameter to produce a test statistic. This is compared with a critical value (from a table or from Minitab, Excel, or another computer package) that specifies its likely range if the null hypothesis were true.
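The comparison just described (test statistic versus critical value) can be sketched for the right-tail mean test from the first table, H0: μ ≤ 130 vs. H1: μ > 130. The sample data below are hypothetical, invented only to make the steps concrete:

```python
import numpy as np
from scipy.stats import t

# Hypothetical sample of 10 measurements (not from the text)
x = np.array([134, 128, 141, 137, 129, 133, 140, 138, 131, 136])
mu0, alpha = 130, 0.05

n = len(x)
xbar, s = x.mean(), x.std(ddof=1)              # sample mean and std deviation
t_stat = (xbar - mu0) / (s / np.sqrt(n))       # test statistic
t_crit = t.ppf(1 - alpha, df=n - 1)            # right-tail critical value
p_value = t.sf(t_stat, df=n - 1)               # right-tail p-value

reject = t_stat > t_crit                       # compare statistic to critical value
print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, p = {p_value:.4f}")
```

With these made-up numbers the statistic exceeds the critical value, so H0: μ ≤ 130 would be rejected at α = 0.05; with known σ a z statistic would replace t.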
If the test statistic lies within this range (the acceptance region), the null hypothesis is not rejected. Outside the likely range is the rejection region; if the test statistic falls there, we conclude that the null hypothesis is false.

Critical Value

Right-tail test: Reject H0 if the test statistic exceeds the upper-tail critical value that designates a given right-tail area α under the sampling distribution.
Left-tail test: Reject H0 if the test statistic is smaller than the lower-tail critical value that designates a given left-tail area α under the sampling distribution.
Two-tail test: Reject H0 if the test statistic exceeds the upper-tail critical value or is smaller than the lower-tail critical value, where the two critical values together designate a total tail area α (α/2 in each tail) under the sampling distribution.

Level of Significance

The level of significance is the desired probability of Type I error (the probability of rejecting a true null hypothesis). Commonly, this is set at α = 0.10, 0.05, 0.025, 0.01, or 0.001, depending on the situation. For example, when someone says that a test is significant at the 5% level, they mean that the null hypothesis can be rejected with less than a 5% chance of Type I error. The level of significance dictates the critical value. The smaller the level of significance, the stronger the sample result must be in order to reject the null hypothesis.

Common Right-Tailed z Values
α = 0.10:  z = 1.282
α = 0.05:  z = 1.645
α = 0.025: z = 1.960
α = 0.01:  z = 2.326

Common Two-Tailed z Values
α = 0.10: z = 1.645
α = 0.05: z = 1.960
α = 0.02: z = 2.326
α = 0.01: z = 2.576

Choosing α

Question: The choice of α is up to the researcher. So why not always choose a very low level of significance, such as 0.001?
Answer: For a given sample size, a smaller α implies a larger β. So when we reduce α, we also reduce the power of the test.

The Old Way: Setting α

Step 1. Specify the desired α level (called the level of significance).
Commonly this would be 0.10, 0.05, 0.01, or 0.001.
Step 2. Set up a decision rule that defines a rejection region, based on the distribution of the test statistic under the null hypothesis.
Step 3. Make the decision. If the test statistic falls in the rejection region, the null hypothesis is rejected (the result is not attributed to chance).

Problem: We must specify α in advance. This could tempt the researcher to manipulate the decision by specifying an α that would produce one result or the other.

The New Way: p-Values

Step 1. Calculate the test statistic in the usual way.
Step 2. Find the probability that a test statistic as extreme as the one observed would occur by chance if the null hypothesis were true.
Step 3. Report the p-value, and let others decide on its significance, using whatever criteria they wish.

Note: This method was impossible until the advent of fast computers and precise algorithms to find exact areas under the common theoretical distributions (e.g., z, t, chi-square, and F).

p-Values: An Example

Problem: Two drugs to treat migraines are compared. The proportion of patients experiencing relief is calculated from matched samples of patients. A right-tail test is used to test whether drug A yields a higher proportion of patients experiencing relief than drug B. The resulting p-value for the difference of proportions is p = 0.092.

Interpretation: This p-value would lead to rejection of the null hypothesis at α = 0.10 (though just barely) but not at α = 0.05 or any smaller α. A researcher might consider this a marginal result. However, patients who suffer migraines might want to try drug A anyway, because there is at least some evidence that it is better than drug B.

Meta-Analysis

Scenario: One study of an experimental drug's effect on one-year post-stroke mortality showed a reduction of 1.78% relative to a control group (p = 0.0779). A similar study in a different hospital showed a one-year post-stroke mortality reduction of 2.39% relative to a control group (p = 0.1080).
Neither result is significant at α = 0.05, since each p-value exceeds 0.05. Would combining the studies improve the power?

Combining the two samples shows a mortality reduction of 1.96% (a weighted average of the two studies). However, the larger sample size of the combined studies improves the power of the test (p = 0.0341) and leads to the conclusion that the mortality reduction is statistically significant at α = 0.05.

Note: Meta-analysis assumes that the studies are comparable.

Power Curve

Comment: Slight departures from H0 are hard to detect.
Interpretation: This is a power curve for a two-tailed test of the hypothesis that the true proportion of side effects for a drug is 15%, using a sample of n = 225 patients. The graph shows that if the true proportion of side effects is 9%, the power of the test is only 47%. In general, power is low when the true proportion is near the hypothesized 15%, but rises quickly the more the true proportion differs from the hypothesized 15%.

Power and True Parameter Value

Comment: Large departures from H0 are easier to detect.
Interpretation: This is the same power curve for a two-tailed test of the hypothesis that the true proportion of side effects for a drug is 15%, using a sample of n = 225 patients. The graph shows that if the true proportion of side effects is 7%, the power of the test is 86%. The test has more power because 7% is farther from the hypothesized 15% than in the previous diagram.

Sample Size and Power

Comment: A large sample size improves power.
Interpretation: This diagram shows a family of power curves for three different sample sizes, using the same two-tailed test of the hypothesis that the true proportion of side effects for a drug is 15%. For any value along the X axis, the power of the test is higher when n is large.

Alpha and Power

Comment: A larger α means more power. Why? ↑α ⇒ ↓β ⇒ ↑power, since power = 1 − β.
Interpretation: This diagram shows a family of power curves for three different α levels, using the same two-tailed test of the hypothesis that the true proportion of side effects for a drug is 15%. For any value along the X axis, the power of the test increases when α is increased. But should we sacrifice a low α to gain power?
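The power figures quoted in the interpretations above (47% at a true proportion of 9%, 86% at 7%) can be reproduced with the normal approximation. The slides do not state the α used; α = 0.01 is an assumption here, chosen because it matches the quoted values:

```python
import numpy as np
from scipy.stats import norm

def power_two_tailed_prop(p_true, p0=0.15, n=225, alpha=0.01):
    """Power of a two-tailed z test of H0: pi = p0, normal approximation.

    alpha = 0.01 is assumed (the source does not state it)."""
    se0 = np.sqrt(p0 * (1 - p0) / n)           # std error of p-hat under H0
    z = norm.ppf(1 - alpha / 2)                # two-tailed critical z
    lo, hi = p0 - z * se0, p0 + z * se0        # acceptance region for p-hat
    se1 = np.sqrt(p_true * (1 - p_true) / n)   # std error under the true proportion
    # Power = probability that p-hat falls outside the acceptance region
    return norm.cdf((lo - p_true) / se1) + norm.sf((hi - p_true) / se1)

print(round(power_two_tailed_prop(0.09), 2))   # ~ 0.47, as in the slides
print(round(power_two_tailed_prop(0.07), 2))   # ~ 0.86
```

Evaluating the function over a grid of true proportions and plotting the results would trace out the power curves the slides describe.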