Understanding Inference: Confidence Intervals II

Bootstrap distribution
Confidence intervals
Other levels of confidence
Summary

Questions about the Assignment
Part I
The z-score is not the same as the percentile (e.g., a z-score of .98 does not equal the 98th percentile).
The z-score is the number of standard deviations a value lies above or below the mean.
Part II
Some variables appear to be quantitative, but they are really categorical (e.g., year born, income bracket, occupation code).
Good quantitative variables include phrases such as “how many” or “how often”, and the respondent answers with a number.
Check who was asked the question (everyone or a subset of the sample).

The Problem (From Last Class)
To create a plausible range of values for a population parameter:
1. Take many random samples from the population, and compute the sample statistic for each sample.
2. Compute the standard error (i.e., the standard deviation of all these statistics).
3. The plausible range of values = sample statistic ± 2 × standard error

One small problem…
Often we only have one sample! How can we calculate the variation in sample statistics if we only have one sample?
We have one random sample of the population.
We need multiple (i.e., >1,000) random samples from the population to calculate the standard error of the sample statistic.
We cannot afford to conduct 1,000+ random samples.

Is there a way to use the information from the one random sample we have to simulate 1,000+ random samples?
Yes, the bootstrap simulation method enables us to generate a sampling distribution.
Random Sample ≅ Population (if the sample size is sufficiently large)

When we conducted our study yesterday, we treated this bag of Reese’s Pieces as the population and drew random samples of size 10 from it. Is the proportion of orange pieces in this bag pretty close to the proportion in the entire population of Reese’s Pieces?
If our sample size were 500 (the total number of pieces in the bag), we would know the exact proportion of orange pieces in that “population” (N = 500, n = 10).

Why? Because this bag is itself a random sample of the entire population of all Reese’s Pieces produced by Hershey’s (Hershey factory: N = 1 billion+; bag: n = 500).

Random Sample ≅ Population (if the sample size is sufficiently large)
Hershey factory: N = 1 billion+, p ≅ .52
Bag: n = 500, p̂ = .52

Our goal is to determine the population proportion of orange pieces. Let’s say that we’ve determined that a sample size of 500 is sufficiently large to estimate the population proportion adequately.

This bag is the one random sample (n = 500) of the population we have. It gives us one sample statistic, but we need multiple sample statistics. We can use this sample to generate multiple samples.

Because this bag is a random sample of the entire population, it contains approximately the same proportion of orange Reese’s Pieces as the entire population. Reaching into this bag to draw more samples is therefore approximately equivalent to reaching into the vat of Reese’s Pieces at the Hershey factory.

To generate a sampling distribution, we can take repeated random samples from this “population”. Thus, we can draw 1,000+ samples (n = 500) from this “population” to generate the equivalent of 1,000+ random samples from the true population.
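The idea that one sufficiently large bag can stand in for the factory can be checked with a small simulation. The following is only a sketch, not part of the original slides: it assumes a population proportion of 0.52, uses NumPy, and the names `P_TRUE`, `bag`, `factory_props`, and `resampled_props` are hypothetical. It compares the spread of proportions from many fresh bags (which we could never afford to collect) with the spread from resampling a single bag. The resampling is done with replacement so that the resamples can differ from the bag itself.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

P_TRUE = 0.52   # assumed population proportion of orange pieces
N_BAG = 500     # pieces in one bag (our one random sample)
N_REPS = 1000   # number of repeated samples / resamples

# One bag drawn at random from the population: 1 = orange, 0 = not orange.
bag = (rng.random(N_BAG) < P_TRUE).astype(int)
p_hat = bag.mean()

# What we cannot afford in practice: many fresh bags from the factory.
factory_props = rng.binomial(N_BAG, P_TRUE, size=N_REPS) / N_BAG

# What we can do: resample the one bag, with replacement, many times.
resampled_props = np.array([
    rng.choice(bag, size=N_BAG, replace=True).mean() for _ in range(N_REPS)
])

print(f"bag proportion:              {p_hat:.3f}")
print(f"sd of factory proportions:   {factory_props.std(ddof=1):.4f}")
print(f"sd of resampled proportions: {resampled_props.std(ddof=1):.4f}")
```

Under these assumptions the two standard deviations should come out close to each other, which is the sense in which reaching into the bag approximates reaching into the vat.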
Sampling with Replacement
Random Sample ≅ Population (if the sample size is sufficiently large); n = 500

But won’t all of the subsequent samples of size 500 we draw from this bag produce the same sample statistic as the original sample? Yes, unless we sample with replacement.

After we sample a unit, we put it back into the “population”, so each unit can be selected more than once. By sampling with replacement, we ensure that the “population” in the bag keeps the same proportion of orange pieces on every draw, which approximates the proportion in the true population. This type of sampling process is known as bootstrapping.

Why “bootstrap”?
“Pull yourself up by your bootstraps”: lift yourself up into the air simply by pulling up on the laces of your boots. A metaphor for accomplishing an “impossible” task without any outside help.

Bootstrapping Terms
Bootstrap sample: A random sample taken with replacement from the original sample. It needs to be the same size as the original sample.
Bootstrap sample statistic: The sample statistic computed on the bootstrap sample.
Bootstrap sampling distribution: The distribution of many bootstrap sample statistics.

[Diagram: Original Sample (Sample Statistic) → many Bootstrap Samples → one Bootstrap Sample Statistic each → Bootstrap Sampling Distribution → Standard Error]

StatKey: www.lock5stat.com/statkey (count = 260, n = 500)
The variability of the bootstrap statistics is similar to the variability of the sample statistics. The standard error of our sample statistic can be estimated using the standard deviation of the bootstrap sampling distribution.

Reese’s Pieces
Based on this sample, give a 95% confidence interval for the true proportion of Reese’s Pieces that are orange.
A. (0.50, 0.54)
B. (0.48, 0.56)
C. (0.48, 0.52)
D. (0.46, 0.54)

Bootstrap Distribution
You have a sample of size n = 500.
You sample with replacement 1000 times to get 1000 bootstrap samples. What is the sample size of each bootstrap sample?
A. 500
B. 1,000
Bootstrap samples are the same size as the original sample.

Reese’s Pieces bootstrap distribution: mean = .52, standard deviation = .02, so the 95% interval is 0.52 ± 2 × 0.02.

Bootstrap Distribution
You have a sample of size n = 500. You sample with replacement 1000 times to get 1000 bootstrap samples. How many bootstrap sample statistics will you have?
A. 1
B. 500
C. 1,000
Each bootstrap sample yields one bootstrap sample statistic.

Bootstrap Distribution
You have a sample of size n = 500. You sample with replacement 1000 times to get 1000 bootstrap samples. How many dots will be in a dotplot of the bootstrap sampling distribution?
A. 50
B. 1,000
C. 50,000
Each dot in the bootstrap sampling distribution corresponds to one bootstrap sample statistic.

Atlanta Commutes
What’s the mean commute time for workers in metropolitan Atlanta?

Random Sample of 500 Commutes
[Dot plot: CommuteAtlanta, the original sample; horizontal axis: Time, 20 to 180 minutes]
n = 500, x̄ = 29.11 minutes, s = 20.72 minutes
This dotplot is…
A. the sample distribution of commute times.
B. the sampling distribution of sample statistics.

Random Sample of 500 Commutes
The confidence interval around the point estimate is…
A. 29.11 ± 20.72
B. 29.11 ± 2 × 20.72
C. cannot be determined with the data available
Hint: CI = sample statistic ± 2 × standard error
29.11 ± 2 × 20.72 is the interval that contains about 95% of the commute times in the original sample. The variability of the sample statistic x̄ is not known.

How can we determine the variability of the sample statistic so that we can calculate the confidence interval for the population parameter (μ)?
Generate a Bootstrap Distribution Using the Original Sample
www.lock5stat.com/statkey/

The Beauty of Bootstrapping
We can use bootstrapping to assess the uncertainty surrounding any sample statistic. If we have sample data, we can use bootstrapping to estimate a 95% confidence interval for a population parameter.

The 95% Confidence Interval
point estimate ± the margin of error
sample statistic ± 2 × standard error
sample statistic ± 2 × sd of the bootstrap sampling distribution
29.11 ± 2 × 0.915
The 95% confidence interval for the average commute time is
A. (28.2, 30.0)
B. (27.3, 30.9)
C. (26.6, 31.8)

Obama’s Approval Rating
http://www.gallup.com/poll/113980/Gallup-Daily-Obama-Job-Approval.aspx
Gallup surveyed 1,500 Americans between June 9th and 11th, 2012, and 49% of them approved of the job Barack Obama was doing as president.
Sample statistic (sample proportion): p̂ = .49
Calculate a 95% CI for the population proportion.
StatKey (www.lock5stat.com/statkey): count = 735, n = 1,500

Two Methods for Calculating a 95% CI
The Standard Error Method
CI = sample statistic ± 2 × standard error
= .49 ± 2 × 0.013
= .49 ± .026 (remember, Gallup’s margin of error was ± .03)
= (.464, .516)
The Percentile Method
CI = middle 95% of the bootstrap statistics = (.464, .516)
We are 95% confident that the true percentage of all Americans who approve of Obama’s job performance is between 46.4% and 51.6%.

Other Levels of Confidence
What if we want to be more than 95% confident?
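As a cross-check on the StatKey results for the Gallup poll, the two 95% methods can be sketched in Python. This is an illustrative sketch using NumPy rather than StatKey's own procedure; the variable names (`sample`, `boot_props`, `se_ci`, `pct_ci`) are hypothetical, and only the count (735) and sample size (1,500) come from the slides.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is repeatable

COUNT, N = 735, 1500   # Gallup sample: 735 of 1,500 respondents approve
N_BOOT = 1000          # number of bootstrap samples

# Rebuild the original sample as 735 ones (approve) and 765 zeros.
sample = np.concatenate([np.ones(COUNT), np.zeros(N - COUNT)])
p_hat = sample.mean()  # sample proportion, 0.49

# Each bootstrap sample has the same size as the original sample and is
# drawn with replacement; we keep one proportion per bootstrap sample.
boot_props = np.array([
    rng.choice(sample, size=N, replace=True).mean() for _ in range(N_BOOT)
])

# Standard error method: statistic +/- 2 x sd of the bootstrap distribution.
se = boot_props.std(ddof=1)
se_ci = (p_hat - 2 * se, p_hat + 2 * se)

# Percentile method: middle 95% of the bootstrap statistics.
pct_ci = tuple(np.percentile(boot_props, [2.5, 97.5]))

print(f"standard error:        {se:.4f}")
print(f"standard error method: ({se_ci[0]:.3f}, {se_ci[1]:.3f})")
print(f"percentile method:     ({pct_ci[0]:.3f}, {pct_ci[1]:.3f})")
```

Because bootstrap samples are random, the endpoints vary slightly from run to run and from StatKey's, but both intervals should land near (.464, .516).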
How might you produce a 99% confidence interval around the point estimate?

Percentile Method
For a P% confidence interval, keep the middle P% of bootstrap statistics.
For a 99% confidence interval, keep the middle 99%, leaving 0.5% in each tail. The 99% confidence interval would be (0.5th percentile, 99.5th percentile), where the percentiles refer to the bootstrap distribution.
www.lock5stat.com/statkey

Level of Confidence
Which is wider, a 90% confidence interval or a 95% confidence interval?
A. 90% CI
B. 95% CI
A 95% interval captures the middle 95%, which is a wider range than the middle 90%.

The Effects of Sample Size
Are these bootstrap distributions the same?
n = 100: SE = .05, interval from 0.39 to 0.59, centered at 0.49
n = 1500: SE = .013, interval from 0.464 to 0.516, centered at 0.49
A bootstrap using a sample size of 1500 generates a standard error that is much smaller than the standard error generated from a sample size of 100. The width of the 95% interval decreases from 0.200 to 0.052 (the margin of error decreases from 0.100 to 0.026).

Assignment
Part I: Graded Problems 3.74 and 3.76 (a, c, d, and e)
Part II: Go to http://sda.berkeley.edu/cgi-bin/hsda?harcsda+gss10
For the following 3 categorical variables, calculate the confidence interval around the point estimate (i.e., the sample proportion) using:
1. The Standard Error Method (show your work)
2. The Percentile Method (print/email your screenshot from www.lock5stat.com/statkey)
GENDER: calculate the confidence interval for the proportion who are female
DIVORCE: calculate the confidence interval for the proportion who’ve been divorced
GUNLAW: calculate the confidence interval for the proportion who favor gun laws

Finding Sample Proportions from the GSS
Enter the variable name here
Check “Column”
Check “Confidence Intervals”
Check “Unweighted”
Click on “Run Table”

StatKey and the Percentile Method
On the StatKey home page, click on “CI for Single Proportion” to get to this page.
In the GSS table output:
This is the sample proportion.
This is the confidence interval.
This is the number of respondents who indicated being female.
This is the total number of respondents.

On the StatKey page:
Click on “Edit Data” and a window will pop up to enter:
n (the sample size)
count (the number of respondents who are in the category you are interested in, e.g., female)
Click here to generate 1000 bootstrap samples.
Click on “Two-Tail” to get the 95% confidence interval.
These values represent the 95% confidence interval.

Summary
The standard error of a statistic is the standard deviation of the sampling distribution, which can be estimated from a bootstrap distribution.
Confidence intervals for population parameter estimates can be calculated using the standard error or the percentiles of a bootstrap distribution.
Confidence intervals can be calculated this way as long as the bootstrap distribution is approximately symmetric and continuous.
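The summary points about confidence level and sample size can be explored with a short Python sketch. It is an illustration under stated assumptions, not the course's method: the helper names `bootstrap_stats`, `percentile_ci`, and `binary_sample` are hypothetical, NumPy is assumed, and the n = 100 sample is assumed to contain 49 successes so that both samples have a proportion of 0.49.

```python
import numpy as np

def bootstrap_stats(sample, stat=np.mean, n_boot=1000, seed=0):
    """Return n_boot bootstrap statistics; each resample has len(sample) values."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    return np.array([
        stat(rng.choice(sample, size=n, replace=True)) for _ in range(n_boot)
    ])

def percentile_ci(boot_stats, level=95):
    """Keep the middle `level` percent of the bootstrap statistics."""
    tail = (100 - level) / 2
    lo, hi = np.percentile(boot_stats, [tail, 100 - tail])
    return lo, hi

def binary_sample(count, n):
    """A 0/1 sample with `count` ones out of `n` values."""
    return np.concatenate([np.ones(count), np.zeros(n - count)])

small = bootstrap_stats(binary_sample(49, 100))     # assumed n = 100 sample
large = bootstrap_stats(binary_sample(735, 1500))   # Gallup-sized sample

# Higher confidence level -> wider interval (same bootstrap distribution).
for level in (90, 95, 99):
    lo, hi = percentile_ci(large, level)
    print(f"{level}% CI, n = 1500: ({lo:.3f}, {hi:.3f}), width {hi - lo:.3f}")

# Larger sample -> smaller standard error.
print(f"SE with n = 100:  {small.std(ddof=1):.3f}")
print(f"SE with n = 1500: {large.std(ddof=1):.3f}")
```

Passing a different `stat` (for example `np.median`) reuses the same resampling loop for another statistic, subject to the slide's caveat that the bootstrap distribution should be approximately symmetric and continuous.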