Midterm Exam
Instructor: Tessa Childers-Day
Stat 20
12 July 2012

Please write your name and student ID below, and circle your section. With your signature, you certify that you have not observed poor or dishonest conduct on the part of your classmates. You also certify that you have not been a party to poor or dishonest conduct, and that the work on this exam is solely your own.

Name:
Student ID:
Signature:
Date:
Section: 101 (2pm-3pm) 102 (3pm-4pm)

Answer the questions in the spaces provided. There are questions on the front and back of each page. This midterm covers the material from Lectures 1 through 13, and Homeworks 1 through 6. Show your work, including labeling quantities (such as z-scores). The clearer that your work is, the easier it is to award partial or full credit. If you do not show your work, you will not receive credit. You are welcome to leave your answers as fractions. If you use decimals, please round all answers to two significant figures, and hold your rounding until the final calculation.

Question   Points   Score
1          8
2          2
3          6
4          3
5          4
6          20
7          7
8          5
9          5
Total:     60

Stat 20 Midterm Exam, Page 2 of 12 12 July 2012

1. I am interested in exploring rates of condom usage by college students in the United States, including factors that may influence usage. I want to design a survey and collect student responses. Please address each of the following issues, being as specific as possible.

(a) (2 points) (Survey or Census) A fellow investigator believes that we should not do a survey, and instead should perform a census (questioning all college students). How would you explain why a survey is preferable to a census in this case? Or is the other investigator correct?

Solution: A survey is preferable to a census because a census is very expensive in both money and time. Interviewers must be trained or questionnaires mailed, and non-responses must be followed up on.

(b) (3 points) (Sample Design) I propose to create a list of all colleges in the United States, by state. From each state, I will choose the 2 largest colleges. From each college, I will sort students alphabetically, and mail every 100th student a survey. What kind of sampling plan is this? Comment on strengths and weaknesses of this plan, including what sort of biases, if any, I should worry about.

Solution: Many answers could be satisfactory; this is just one. This is a multistage plan, including stratification into states (exhaustive and distinct) and systematic (every 100th student) sampling. A strength is the good geographical coverage (stratification into states guarantees coverage of the whole country). A weakness is that by using the two largest schools, I exclude small schools, such as religious institutions, which may lead to selection bias. I should also be worried about non-response bias (students who fail to return the survey).

(c) (3 points) (Survey Design) Recalling that the purpose of the study is to explore rates of condom usage and factors that influence usage, I decide to ask how often a student uses a condom when engaging in intercourse. Is this all I should ask about, or are there other variables that I should collect data about? If there are other variables, please give several examples. How should questions be asked? What sort of biases, if any, should I worry about?

Solution: Many answers could be satisfactory; this is just one. We should ask about other variables, like gender, age, whether they are in a serious relationship, usage of other forms of birth control, religious/cultural views on birth control, whether the cost of birth control is an influencing factor, etc. Because this is a potentially embarrassing subject, we should use a survey that maintains anonymity, and the questions should be neutrally and non-judgementally worded, to avoid interviewer/response bias.

2.
(2 points) The table below could be used to generate a histogram showing the distribution of the scores on a Stat 20 quiz (though the histogram is not shown here). There were 8 possible points, and due to partial credit, no students scored a 0. The table below gives the intervals and heights of the bars. Intervals contain the right endpoint but not the left. Fill in the height for the missing interval.

score (in points)       0-3    3-4   4-5   5-6   6-8
height (% per point)    3.33   17    19    21    ____

Solution: The areas of the bars must add to 100%. The first four bars cover 3 × 3.33 + 17 + 19 + 21 = 66.99%, leaving 33.01% for the 6-8 interval, which is 2 points wide, so its height is 33.01 / 2 = 16.505% per point.

score (in points)       0-3    3-4   4-5   5-6   6-8
height (% per point)    3.33   17    19    21    16.505

3. Suppose that, on any given day, if the temperature reaches 70° (or above), I open my office window. If the temperature does not reach 70°, I do not open my office window. All temperatures are measured in degrees Fahrenheit. I collect data on 100 days of temperatures, as well as the status of my office window. I record the information about the window as “open” or “closed”. Later, I decide to arbitrarily assign numeric codes to window status, coding “1” if the window is open and “0” if the window is closed. A sample of 5 days of data is shown below.

Temperature               71°    69°      66°      66°      70°    ...
Window Status             open   closed   closed   closed   open   ...
Window Status (coded)     1      0        0        0        1      ...

Circle the correct answer:

(a) (1 point) The level of measurement for temperature is
A. Nominal B. Ordinal C. Interval D. Ratio

(b) (1 point) The level of measurement for window status is
A. Nominal B. Ordinal C. Interval D. Ratio

(c) (1 point) The level of measurement for window status (coded) is
A. Nominal B. Ordinal C. Interval D. Ratio

(d) (3 points) I tell my office-mate, “Since we always open the window if the temperature is 70° or higher, knowing the temperature perfectly predicts the status of our window.
Thus, the correlation between temperature and window status is +1.” She disagrees, claiming to have calculated the correlation, and found that it is 0.83. Is she right, am I right, or are we both wrong? How do you explain this contradiction? (Hint: Drawing a picture may help)

Solution: We are both wrong. In this case, one shouldn’t calculate a correlation at all. Correlation measures the strength of a linear relationship, and this relationship is not linear: the scatter plot is not football shaped. Window status is perfectly predictable from temperature, but not linearly. Also, the coding is arbitrary: we could just as well have coded 0 for open instead of 0 for closed, and the choice affects the result. Treating the window status numerically doesn’t make much sense, so calculating a correlation doesn’t make much sense either.

4. (3 points) I own 10 pairs of socks. They can be described by two attributes: coloring and pattern. 4 pairs are brightly colored, while 6 pairs are striped. 3 pairs are both brightly colored and striped. What is the chance I wear brightly colored or striped socks three days in a row? (Assume that all of my socks are clean at the beginning of the first day, and that I neither do laundry nor re-wear socks).

Solution: Note that color and pattern are NOT mutually exclusive. Thus,

P(brightly colored OR striped) = P(brightly colored) + P(striped) - P(brightly colored AND striped)
$= \frac{4}{10} + \frac{6}{10} - \frac{3}{10} = \frac{7}{10} = 0.7$

We are looking at 3 draws, without replacement, from a box with 7 tickets that we desire (brightly colored or striped) and 3 tickets that we do not desire (neither brightly colored nor striped). Thus:

P(three days) = P(1st day brightly colored OR striped) × P(2nd day brightly colored OR striped | 1st day brightly colored OR striped) × P(3rd day brightly colored OR striped | 2 days brightly colored OR striped)
$= \frac{7}{10} \times \frac{6}{9} \times \frac{5}{8} = \frac{210}{720} = 0.292$

5. Below is a histogram containing the number of drivers killed in the UK for each month, from January 1969 through December 1984.

[Figure: Histogram of Driver Deaths per Month in UK: 1969-1984. Horizontal axis: Number of Drivers Killed (60 to 200). Vertical axis: Density (0.000 to 0.015).]

(a) (2 points) Sketch and label the approximate locations of the mean, median, and mode on the histogram. No calculations are necessary.

Solution: We are only concerned about the rough locations of these quantities, and their ordering.

[Figure: the same histogram, Histogram of Driver Deaths per Month in UK: 1969-1984, with the approximate locations of the mean, median, and mode labeled.]

(b) (2 points) There are 4 boxplots shown. One of them corresponds to the same data set as the histogram in (a) (the other three do not). Indicate which boxplot corresponds to the histogram in (a). (Labels are shown below the corresponding boxplot)

(b) Boxplot 3

Solution: Boxplot 3 is the correct boxplot. Boxplot 1 is too symmetric, with too many outliers, indicating heavy tails. Boxplot 2 is too far from symmetric, with too many outliers on the right tail. Boxplot 4 is skewed in the wrong direction. Boxplot 3 shows the appropriate right tail, two possible outliers, and a median nearly in the middle of the data.

[Figure: four boxplots, each titled Boxplot of Driver Deaths per Month in UK: 1969-1984 and labeled Boxplot #1, Boxplot #2, Boxplot #3, and Boxplot #4. Vertical axis: Number of Drivers Killed (60 to 200).]

6.
Measurements of the weight of 44 snowy plover eggs (along with the weight of the 44 chicks after they are hatched) were collected by BLSS: The Berkeley Interactive Statistical System of Abrahams and Rizzardi. The weight of the egg and of the newly hatched chick are measured in grams (gm). A scatterplot of chick weight versus egg weight is shown below, along with means, SDs, and the correlation coefficient. Assume that the scatterplot is “football shaped”, and thus that egg weight and chick weight have approximately normal histograms.

[Figure: scatterplot, Chick Weight vs. Egg Weight. Horizontal axis: Egg Weight (gm), 7.5 to 10.0. Vertical axis: Chick Weight (gm), 5.5 to 7.0.]

Egg Weight: mean = 8.63 gm, sd = 0.48 gm
Chick Weight: mean = 6.15 gm, sd = 0.41 gm
correlation: r = 0.85

(a) (3 points) Fill in the blank: About 40% of the chicks have weights between 5.8 grams and ____ grams.

Solution: A chick that weighs 5.8 grams has a z-score of (5.8 − 6.15) / 0.41 ≈ −0.85, and is thus at about the 20th percentile. The 60th percentile is then the point which we are looking for. The 60th percentile corresponds to a z-score of 0.25, which is 6.15 + 0.25 × 0.41 = 6.253 grams.

(b) (2 points) One of the chicks is at the 75th percentile for weight. Predict the weight of the egg it hatched from. Include units.

Solution: Here, x = chick weight, y = egg weight. Since the chick is at the 75th percentile, it has a z-score of 0.65, so

$\frac{y - \bar{y}}{SD_y} = r \, \frac{x - \bar{x}}{SD_x}$
$\frac{y - 8.63}{0.48} = 0.85 \, (0.65)$
$y = 0.85 \, (0.65) \, (0.48) + 8.63 = 8.90$

We predict that the egg weighs 8.90 grams.

(c) (2 points) The prediction in (b) is likely to be off by how much? Include units.

Solution: The RMS error for regression tells us how far off a prediction is likely to be. The RMS error for predicting egg weight from chick weight features egg weight as the y variable, so

RMS Error $= \sqrt{1 - r^2} \; SD_y = \sqrt{1 - 0.85^2} \; (0.48) = 0.25$.

Our prediction is likely to be off by 0.25 grams or so.
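
The arithmetic in parts (a) through (c) can be checked numerically. The following is a minimal Python sketch, not part of the original exam. The function and variable names are illustrative assumptions, and it uses the standard library's `statistics.NormalDist` in place of a printed normal table, so part (a) may differ slightly from the table-based value. Part (b) uses the table z-score of 0.65 quoted in the solution.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in Question 6 (all weights in grams).
EGG_MEAN, EGG_SD = 8.63, 0.48
CHICK_MEAN, CHICK_SD = 6.15, 0.41
R = 0.85

STD_NORMAL = NormalDist()  # standard normal curve: mean 0, SD 1


def upper_weight_for_share(lower, share):
    """Chick weight w such that about `share` of chicks weigh between lower and w."""
    z_lower = (lower - CHICK_MEAN) / CHICK_SD
    target_percentile = STD_NORMAL.cdf(z_lower) + share
    return CHICK_MEAN + STD_NORMAL.inv_cdf(target_percentile) * CHICK_SD


def predict_egg_weight(z_chick):
    """Regression prediction of egg weight from a chick's z-score."""
    return EGG_MEAN + R * z_chick * EGG_SD


def rms_error_egg():
    """RMS error of the regression line that predicts egg weight."""
    return sqrt(1 - R ** 2) * EGG_SD


upper_weight = upper_weight_for_share(5.8, 0.40)  # part (a)
predicted_egg = predict_egg_weight(0.65)          # part (b), z taken from the solution
likely_error = rms_error_egg()                    # part (c)

print(f"(a) upper weight:  {upper_weight:.2f} gm")
print(f"(b) predicted egg: {predicted_egg:.2f} gm")
print(f"(c) RMS error:     {likely_error:.2f} gm")
```

Passing the z-score into `predict_egg_weight` directly, rather than a percentile, keeps the sketch aligned with the table lookup used in the solution.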

(d) (4 points) Among all chicks at the 75th percentile for weight, approximately what percent hatched from eggs heavier than 75% of all eggs?

Solution: Chicks at the 75th percentile for weight come from eggs that weigh, on average, 8.90 grams, give or take 0.25 grams or so. These are our new mean and new SD for the normal curve inside the strip. We want to know what percentage of these eggs are heavier than the 75th percentile for all eggs. The 75th percentile for all eggs is at z = 0.65, or 0.65 × 0.48 + 8.63 = 8.94 grams. Thus, we need the area above

$z = \frac{8.94 - 8.90}{0.25} = \frac{0.04}{0.25} = 0.16$.

The area corresponding to z = 0.16 is 11.92%. Thus approximately $\frac{100 - 11.92}{2} = 44.04\%$ of chicks at the 75th percentile for weight hatched from eggs heavier than 75% of all eggs.

(e) (2 points) Using the regression method, and predicting chick weight from egg weight, we find that eggs weighing 8 grams produce chicks weighing 5.69 grams, on average. True or false, and explain your answer: using the regression method, and predicting egg weight from chick weight, we would find that chicks weighing 5.69 grams hatch from eggs weighing 8 grams, on average.

Solution: False. There are 2 regression lines, one that predicts chick weight from egg weight, and one that predicts egg weight from chick weight. The first one tells us that a chick from an egg of weight 8 grams will weigh 5.69 grams, on average. The second one tells us that chicks that weigh 5.69 grams hatch from eggs that weigh 8.17 grams, on average (NOT 8 grams). You did not need to calculate the estimate for egg weight based on chick weight. The explanation was sufficient.

(f) (4 points) Somehow, the data on chicks weighing more than 6.2 grams is lost. The correlation between egg weight and chick weight for the remaining data points will be:
A. About the same (r ≈ 0.85)
B. Somewhat less (r < 0.85)
C. Somewhat greater (r > 0.85)
Circle the correct choice, and explain why it is correct.

Solution: The correlation decreases, because by cutting off part of our data, we lose some of the linearity. The remaining cloud of data points looks more round, and less football shaped, so our correlation decreases. This is called attenuation.

(g) (3 points) One of your classmates observes that chicks that hatch from small eggs tend to be larger than the regression line predicts, while chicks that hatch from large eggs tend to be smaller than the regression line predicts. They attribute this to small eggs having thin shells, and large eggs having thick shells (i.e., the weight is due not to the chick, but to the shell). Does this make sense? Or is something else at work here?

Solution: This is actually the regression effect at work; nothing special is going on here. Your classmate is committing the regression fallacy by believing that a simple phenomenon that occurs in all regression situations is actually due to a significant/important cause.

7. There are 20 coins in a jar. Of these, 8 are quarters, 5 are dimes, 3 are nickels, and 4 are pennies. 8 coins are drawn at random, without replacement, from the jar.

(a) (2 points) What is the chance that the first coin is a quarter and the second coin is a dime?

Solution: P(1st quarter and 2nd dime) = P(1st quarter) × P(2nd dime | 1st quarter) $= \frac{8}{20} \times \frac{5}{19} = 0.105$

(b) (2 points) What is the chance that the fourth coin is a quarter and the eighth coin is a dime?

Solution: Because we know nothing about any of the other coins, this is the same as the probability above. P(4th quarter and 8th dime) = P(1st quarter and 2nd dime) = P(1st quarter) × P(2nd dime | 1st quarter) $= \frac{8}{20} \times \frac{5}{19} = \frac{40}{380} = 0.105$

(c) (3 points) What is the chance that the last two coins are of the same denomination?

Solution: Since we know nothing about any of the other coins, this is the same as the first two draws being of the same denomination. P(last two are the same) = P(first two are the same) = P(first two are quarters or first two are dimes or first two are nickels or first two are pennies). All of these events are mutually exclusive, thus

P(last two are the same) = P(first two are quarters) + P(first two are dimes) + P(first two are nickels) + P(first two are pennies)
$= \frac{8}{20} \times \frac{7}{19} + \frac{5}{20} \times \frac{4}{19} + \frac{3}{20} \times \frac{2}{19} + \frac{4}{20} \times \frac{3}{19} = \frac{94}{380} = 0.247$

8. A fair die is rolled 5 times. Find the chances of the following events.

(a) (2 points) All “6”s

Solution: P(five “6”s) $= \left(\frac{1}{6}\right)^5 = \frac{1}{7776} = 0.00013$

(b) (3 points) At least two “6”s

Solution: Note that a die roll can fit the binomial situation, where P(6) $= \frac{1}{6}$ and P(not 6) $= \frac{5}{6}$.

P(at least two “6”s) = 1 - P(fewer than two “6”s) = 1 - P(zero “6”s or one “6”) = 1 - [P(zero “6”s) + P(one “6”)]
$= 1 - \left[ \frac{5!}{5!\,0!} \left(\frac{5}{6}\right)^5 \left(\frac{1}{6}\right)^0 + \frac{5!}{4!\,1!} \left(\frac{5}{6}\right)^4 \left(\frac{1}{6}\right)^1 \right] = 1 - [0.402 + 0.402] = 0.196$

9. A fair die is rolled 10 times. Find the chances of the following events.

(a) (2 points) Exactly five “5”s

Solution: This is just a single binomial term.

P(five “5”s) $= \frac{10!}{5!\,5!} \left(\frac{1}{6}\right)^5 \left(\frac{5}{6}\right)^5 = 0.013$

(b) (3 points) At most one “6” in the 10 rolls, given there are no “6”s in the first five rolls

Solution: Since there are no “6”s in the first five rolls, we must roll at most one “6” in the second five rolls. The two sets of five rolls are independent of each other. Thus, we simply need the probability of rolling zero “6”s or one “6” in 5 rolls. This was found in Question 8(b).

P(zero “6”s or one “6” in the second five rolls | zero “6”s in the first five rolls) = P(zero “6”s or one “6” in five rolls) = P(zero “6”s) + P(one “6”)
$= \frac{5!}{5!\,0!} \left(\frac{5}{6}\right)^5 \left(\frac{1}{6}\right)^0 + \frac{5!}{4!\,1!} \left(\frac{5}{6}\right)^4 \left(\frac{1}{6}\right)^1 = [0.402 + 0.402] = 0.804$
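
The binomial terms in Questions 8 and 9 can be verified with exact fractions. This is a minimal Python sketch, not part of the original exam; `binom_pmf` and the variable names are hypothetical helpers introduced only for illustration.

```python
from fractions import Fraction
from math import comb


def binom_pmf(n, k, p):
    """Chance of exactly k successes in n independent trials with success chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


P_FACE = Fraction(1, 6)  # chance of any one particular face on a fair die

# Question 8: five rolls.
all_sixes = binom_pmf(5, 5, P_FACE)
zero_or_one_six = binom_pmf(5, 0, P_FACE) + binom_pmf(5, 1, P_FACE)
at_least_two_sixes = 1 - zero_or_one_six

# Question 9(a): ten rolls, exactly five "5"s.
exactly_five_fives = binom_pmf(10, 5, P_FACE)

# Question 9(b): given no "6" in the first five rolls, only the second
# five rolls matter, so the answer is the same zero-or-one chance.
at_most_one_given_none = zero_or_one_six

for label, value in [
    ("8(a)", all_sixes),
    ("8(b)", at_least_two_sixes),
    ("9(a)", exactly_five_fives),
    ("9(b)", at_most_one_given_none),
]:
    print(f"{label}: {value} = {float(value):.5f}")
```

Working in `Fraction` keeps every intermediate value exact, which matches the exam's allowance for leaving answers as fractions and defers rounding to the final step.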