19. Inference about a population proportion

The Practice of Statistics in the Life Sciences, Third Edition
© 2014 W.H. Freeman and Company

Objectives (PSLS Chapter 19)

Inference for a population proportion
* Conditions for inference on proportions
* The sample proportion p̂ (p-hat)
* The sampling distribution of p̂
* Significance test for a proportion
* Confidence interval for p
* Sample size for a desired margin of error

Conditions for inference on proportions

Assumptions:
1. The data used for the estimate are a random sample from the population studied.
2. The population is at least 20 times as large as the sample. This ensures independence of successive trials in the random sampling.
3. The sample size n is large enough that the shape of the sampling distribution is approximately Normal. How large depends on the type of inference conducted.

The sample proportion p̂

We now study categorical data and draw inference on the proportion, or percentage, of the population with a specific characteristic.

If we call a given categorical characteristic in the population “success,” then the sample proportion of successes, p̂ (p-hat), is:

$\hat{p} = \dfrac{\text{count of successes in the sample}}{\text{count of observations in the sample}}$

We treat a group of 120 herpes patients with a new drug; 30 get better:
p̂ = 30/120 = 0.25 (proportion of patients improving in sample)

Sampling distribution of p̂

The sampling distribution of p̂ is never exactly Normal. But for large enough sample sizes, it can be approximated by a Normal curve.

The mean and standard deviation (width) of the sampling distribution are both completely determined by p and n.

Significance test for p

When testing H0: p = p0 (a given value we are testing):

If H0 is true, the sampling distribution of p̂ is known: it is approximately Normal, centered at p0, with standard deviation $\sqrt{p_0(1 - p_0)/n}$.

The test statistic is the standardized value of p̂:

$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$

This is valid when both expected counts (expected successes np0 and expected failures n(1 − p0)) are each 10 or larger.
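The test statistic and its validity condition can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `one_proportion_z` is made up for this example, and the null value 0.20 applied to the herpes data is hypothetical, since the slides do not state a null hypothesis for that sample.

```python
import math

def one_proportion_z(successes, n, p0):
    """Return (z, conditions_met) for a test of H0: p = p0."""
    p_hat = successes / n
    # Standard deviation of p-hat computed under H0 (uses p0, not p-hat)
    sd0 = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / sd0
    # Expected successes and failures under H0 must each be 10 or larger
    conditions_met = n * p0 >= 10 and n * (1 - p0) >= 10
    return z, conditions_met

# Herpes data from the slide: 30 of 120 patients improved.
# The null value 0.20 is hypothetical, chosen only for illustration.
z, ok = one_proportion_z(30, 120, 0.20)
print(f"z = {z:.2f}, conditions met: {ok}")
```

Note that the standard deviation in the denominator uses p0 rather than p̂, because the test statistic is computed as if H0 were true.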
P-value for a one- or two-sided alternative

The P-value is the probability, if H0 were true, of obtaining a test statistic like the one computed or more extreme in the direction of Ha.

Aphids evade predators (ladybugs) by dropping off the leaf. An experiment examined the mechanism of aphid drops.

“When dropped upside-down from delicate tweezers, live aphids landed on their ventral side in 95% of the trials (19 out of 20). In contrast, dead aphids landed on their ventral side in 52.2% of the trials (12 out of 23).”

Is there evidence (at significance level 5%) that live aphids land right side up (on their ventral side) more often than chance would predict?

Here, “chance” would be 50% ventral landings. So we test:

H0: p = 0.5 versus Ha: p > 0.5

$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \dfrac{0.95 - 0.5}{\sqrt{(0.5)(0.5)/20}} = 4.02$

The expected counts of success and failure are each 10, so the z procedure is valid.

The test P-value is P(z ≥ 4.02). From Table B, P = 1 − P(z < 4.02) < 0.0002, highly significant. We reject H0.

There is very strong evidence (P < 0.0002) that the righting behavior of live aphids is better than chance.

Mendel’s first law of genetic inheritance states that crossing dominant and recessive homozygote parents yields a second generation made of 75% dominant-trait individuals.

When Mendel crossed pure breeds of plants producing smooth peas and plants producing wrinkled peas, the second generation (F2) was made of 5474 smooth peas and 1850 wrinkled peas.

Do these data provide evidence that the proportion of smooth peas in the F2 population is not 75%?

The sample proportion of smooth peas is:

$\hat{p} = \dfrac{5474}{5474 + 1850} = 0.7474$

We test:

H0: p = 0.75 versus Ha: p ≠ 0.75

$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \dfrac{0.7474 - 0.75}{\sqrt{(0.75)(0.25)/7324}} = -0.513$

From Table B, we find P = 2P(z < −0.51) = 2 × 0.3050 = 0.61, not significant. We fail to reject H0. The data are consistent with a dominant-recessive genetic model.
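Both worked examples can be checked without Table B. The sketch below uses only the Python standard library and gets the Normal tail area from the complementary error function; the helper names `z_stat` and `normal_upper_tail` are illustrative, not from the slides. Because the code does not round z before looking up the tail area, the results can differ slightly from the table-based values.

```python
import math

def z_stat(successes, n, p0):
    """Standardized value of p-hat under H0: p = p0."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def normal_upper_tail(z):
    """P(Z >= z) for a standard Normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Aphids: 19 ventral landings in 20 trials, one-sided Ha: p > 0.5
z_aphid = z_stat(19, 20, 0.5)
p_aphid = normal_upper_tail(z_aphid)

# Mendel's peas: 5474 smooth out of 5474 + 1850, two-sided Ha: p != 0.75
z_pea = z_stat(5474, 5474 + 1850, 0.75)
p_pea = 2 * normal_upper_tail(abs(z_pea))

print(f"aphids: z = {z_aphid:.2f}, one-sided P = {p_aphid:.6f}")
print(f"peas:   z = {z_pea:.3f}, two-sided P = {p_pea:.2f}")
```

The one-sided test uses a single tail in the direction of Ha, while the two-sided test doubles the tail area beyond |z|.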
Confidence interval for p

When p is unknown, both the center and the spread of the sampling distribution are unknown, which is a problem. We need to “guess” a value for p. Our options:

* Use the sample proportion p̂. This is the “large sample method.” It performs poorly.
* Use an improved estimate, p̃. This is the “plus four method.” It is reasonably accurate.

Always use with caution.

Large-sample confidence interval for p

Level C confidence intervals contain the population proportion p in C% of samples. For an SRS of size n drawn from a large population, with sample proportion p̂ calculated from the data, an approximate level C confidence interval for p is:

CI: p̂ ± m, with $m = z^* \, SE = z^* \sqrt{\hat{p}(1 - \hat{p})/n}$

Use this method when the number of successes and the number of failures are both at least 15.

C is the area under the standard Normal curve between −z* and z*.

Medication side effects

Arthritis is a painful, chronic inflammation of the joints. An experiment on the side effects of pain relievers examined arthritis patients to find the proportion of patients who suffer side effects.

What are some side effects of ibuprofen?

Serious side effects (seek medical attention immediately):
* Allergic reactions (difficulty breathing, swelling, or hives)
* Muscle cramps, numbness, or tingling
* Ulcers (open sores) in the mouth
* Rapid weight gain (fluid retention)
* Seizures
* Black, bloody, or tarry stools
* Blood in your urine or vomit
* Decreased hearing or ringing in the ears
* Jaundice (yellowing of the skin or eyes)
* Abdominal cramping, indigestion, or heartburn

Less serious side effects (discuss with your doctor):
* Dizziness or headache
* Nausea, gaseousness, diarrhea, or constipation
* Depression
* Fatigue or weakness
* Dry mouth
* Irregular menstrual periods

We compute a 90% confidence interval for the population proportion of arthritis patients who suffer some “adverse symptoms.”

What is the sample proportion p̂?

$\hat{p} = \dfrac{23}{440} = 0.052$

For a 90% confidence level, z* = 1.645.
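The critical value z* quoted above is the point that leaves area C between −z* and z* under the standard Normal curve, so it can be computed instead of read from a table. A minimal sketch, assuming Python 3.8 or later for `statistics.NormalDist`; the function name `z_star` is illustrative:

```python
from statistics import NormalDist

def z_star(confidence_level):
    """Critical value z* with central area C under the standard Normal curve."""
    # Central area C leaves (1 - C)/2 in each tail, so z* is the
    # (1 + C)/2 quantile of the standard Normal distribution.
    return NormalDist().inv_cdf((1 + confidence_level) / 2)

for C in (0.90, 0.95, 0.99):
    print(f"C = {C:.2f}: z* = {z_star(C):.3f}")
```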
Using the large sample method:

$m = z^* \sqrt{\hat{p}(1 - \hat{p})/n} = 1.645 \sqrt{0.052(1 - 0.052)/440} = 1.645 \times 0.0106 = 0.017$

90% CI for p: p̂ ± m = 0.052 ± 0.017

With 90% confidence, between 3.5% and 6.9% of arthritis patients taking this pain medication experience some adverse symptoms.

Confidence level C:  0.50   0.60   0.70   0.80   0.90   0.95   0.96   0.98   0.99   0.995  0.998  0.999
z*:                  0.674  0.841  1.036  1.282  1.645  1.960  2.054  2.326  2.576  2.807  3.091  3.291

“Plus four” confidence interval for p

The “plus four” method gives reasonably accurate confidence intervals. We act as if we had four additional observations, two successes and two failures. Thus, the new sample size is n + 4 and the count of successes is X + 2.

The “plus four” estimate of p is:

$\tilde{p} = \dfrac{\text{count of successes} + 2}{\text{count of all observations} + 4}$

An approximate level C confidence interval is:

CI: p̃ ± m, with $m = z^* \, SE = z^* \sqrt{\tilde{p}(1 - \tilde{p})/(n + 4)}$

Use this method when C is at least 90% and the sample size is at least 10.

We want a 90% CI for the population proportion of arthritis patients who suffer some “adverse symptoms.”

What is the value of the “plus four” estimate of p?

$\tilde{p} = \dfrac{23 + 2}{440 + 4} = \dfrac{25}{444} = 0.056$

An approximate 90% confidence interval for p using the “plus four” method is:

$m = z^* \sqrt{\tilde{p}(1 - \tilde{p})/(n + 4)} = 1.645 \sqrt{0.056(1 - 0.056)/444} = 1.645 \times 0.011 = 0.018$

90% CI for p: p̃ ± m = 0.056 ± 0.018

With 90% confidence, between 3.8% and 7.4% of the population of arthritis patients taking this pain medication experience some adverse symptoms.

Sample size for a desired margin of error

You may need to choose a sample size large enough to achieve a specified margin of error. Because the sampling distribution of p̂ is a function of the unknown population proportion p, this process requires that you guess a likely value for p: p*.

$\hat{p} \sim N\left(p, \sqrt{p(1 - p)/n}\right)$, so $n = \left(\dfrac{z^*}{m}\right)^2 p^*(1 - p^*)$

Make an educated guess, or use p* = 0.5 (most conservative estimate).
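The margin of error m targeted by the sample-size formula is the same quantity computed in the two interval methods above. As a check on those worked intervals, here is a minimal Python sketch of both methods; the function names are illustrative. The code carries full precision rather than rounding p̂ and the standard error first, so the endpoints can differ from the hand calculation in the third decimal place.

```python
import math

def large_sample_ci(successes, n, z_star):
    """Large-sample interval: p-hat +/- z* sqrt(p-hat(1 - p-hat)/n)."""
    p_hat = successes / n
    m = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - m, p_hat + m

def plus_four_ci(successes, n, z_star):
    """Plus-four interval: add two successes and two failures first."""
    p_tilde = (successes + 2) / (n + 4)
    m = z_star * math.sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return p_tilde - m, p_tilde + m

# Arthritis data: 23 of 440 patients with adverse symptoms, 90% level
lo_ls, hi_ls = large_sample_ci(23, 440, 1.645)
lo_p4, hi_p4 = plus_four_ci(23, 440, 1.645)
print(f"large sample: {lo_ls:.3f} to {hi_ls:.3f}")
print(f"plus four:    {lo_p4:.3f} to {hi_p4:.3f}")
```

The plus-four interval is shifted slightly toward 0.5 relative to the large-sample interval, which is the intended effect of adding two successes and two failures.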
What sample size would we need in order to achieve a margin of error no more than 0.01 (1 percentage point) with a 90% confidence level?

We could use 0.5 for our guessed p*. However, since the drug has been approved for sale over the counter, we can safely assume that no more than 10% of patients should suffer “adverse symptoms” (a better guess than 50%).

For a 90% confidence level, z* = 1.645.

$n = \left(\dfrac{z^*}{m}\right)^2 p^*(1 - p^*) = \left(\dfrac{1.645}{0.01}\right)^2 (0.1)(0.9) = 2435.4$

To obtain a margin of error no more than 0.01, we round up: we need a sample size n of at least 2436 arthritis patients.
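The sample-size calculation is easy to script, which makes it simple to compare an informed guess for p* against the conservative choice p* = 0.5. A minimal sketch; the function name `sample_size` and its default argument are illustrative choices, not from the slides:

```python
import math

def sample_size(margin, z_star, p_guess=0.5):
    """Smallest whole n with z* sqrt(p*(1 - p*)/n) no larger than margin."""
    n_exact = (z_star / margin) ** 2 * p_guess * (1 - p_guess)
    # Always round up: rounding down would give a margin above the target
    return math.ceil(n_exact)

n_informed = sample_size(0.01, 1.645, p_guess=0.10)
n_conservative = sample_size(0.01, 1.645)  # p* = 0.5
print("informed guess p* = 0.10:", n_informed)
print("conservative   p* = 0.50:", n_conservative)
```

Because p*(1 − p*) is largest at p* = 0.5, the conservative choice always gives the largest required n for a given margin and confidence level.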