Sample solutions to Assignment 2

5.40 Suppose a positive relationship has been found between each of the following sets of variables. For each set, discuss possible reasons why the connection may not be causal.

a) Number of deaths from automobiles and soft drink sales for each year from 1950 to 2000.
The population has increased in that period, and one would expect a corresponding increase in both variables simply as a result of population increase. Thus the relationship does not necessarily imply that one variable causes the other. Comparing rates per capita, instead of absolute numbers, would adjust for changes in population but not for other possible confounding factors that change over the period.

b) Amount of daily walking and quality of health for men over 65 years old.
Factors that could contribute to the relationship other than a causal effect of walking on health include: (i) men who are healthier to begin with may have a higher propensity to walk; (ii) "over 65" includes a broad range of ages, so age could be a confounding factor if both health and the propensity to walk decrease with age; (iii) other factors such as social class, culture, and location of residence could act as confounding factors.

c) Number of ski accidents and average wait time for the ski lift for each day during one winter at a ski resort.
Average wait time will be highly correlated with the number of skiers on the slope on any given day, as will the number of accidents, even if the probability of an accident for each skier remains the same from day to day. Other confounding factors might include skiing experience, which may tend to be lower on average on busy days, and the density of skiers on the slopes, which will be higher on busy days. However, ski accidents could also cause longer wait times if the use of lifts is preempted by ski patrols in the course of performing rescues.

5.46 The equation is ŷ = 0.25 + 0.384x, where y is foot length in cm and x is height in inches.
a) How much does average foot length increase for each 1-inch increase in height?
0.384 cm/in

b) Predict the difference in foot lengths for men whose heights differ by 10 in.
10 × 0.384 = 3.84 cm

c) Suppose Max is 70 in. tall and has a foot length of 28.5 cm. Based on the regression equation, what is the predicted foot length for Max? What is the value of the prediction error (residual) for Max?
Predicted value: ŷ at x = 70 is 0.25 + 0.384 × 70 = 27.13 cm
Residual: y − ŷ = 28.5 − 27.13 = 1.37 cm

5.48 r² = (−0.36)² = 0.1296
Using hours of study as a predictor for hours of sleep 'explains' 12.96% of the variance. That is, the 'unexplained variance' in hours of sleep (the variance of the residuals after predicting hours of sleep from hours of study) is 87.04% of the overall variance in hours of sleep.

5.49 The winning time in the Olympic men's 500-meter speed skating race over the years 1924 to 1992 can be described by the regression equation:
Winning time = 255 − 0.1094 × Year

a) Is the correlation between winning time and year positive or negative? Explain.
The correlation is negative because the correlation has the same sign as the regression coefficient, since
correlation = regression coefficient × (Std. Dev. of x) / (Std. Dev. of y)
Both standard deviations are positive, so the correlation and the regression coefficient must have the same sign.

b) In 1994, the actual winning time for the gold medal was 36.33 seconds. Use the regression equation to predict the winning time for 1994, and compare the prediction to what actually happened.
Predicted time: ŷ at Year = 1994 is 255 − 0.1094 × 1994 = 36.8564 seconds, so the actual winning time differed from the prediction by 36.33 − 36.8564 = −0.5264, i.e. it was 0.5264 seconds faster than predicted.

c) Explain what the slope of −0.1094 indicates in terms of how winning times change from year to year.
The slope means that the winning time improves (decreases), on average, by 0.1094 seconds per year over the period from 1924 to 1992.
d) Why should we not use this regression equation to predict the winning time in the 2050 Winter Olympics?
The regression assumes a linear relationship, which may be a reasonable approximation for a limited period of time, but it cannot be extrapolated very far: eventually the predicted winning time would fall below what is humanly possible, or even physically possible.

5.59
a) [Scatterplot: Verbal versus PctTook]

b) Assuming that a straight line represents the relationship between average SAT per state and the percentage of students who take the SAT, the predicted average SAT goes down by 1.081 points for each one-percentage-point increase in the participation rate.

c) The scatterplot suggests two major clusters: states in which fewer than 20% take the test and states in which 50% or more take the test, with just a few states scattered between. Among the states in which 50% or more take the test, the expected score seems relatively constant, around 500. Between 10% and 50%, a line seems reasonable.

d) The intercept represents the mean Verbal score when the percent taking the test is 0. However, if no students take the test there are no scores to average, so this mean is not defined.

5.60
[Scatterplot: Verbal versus Math]

a) The relationship is linear with a correlation of about .9 (visually).

b) The most notable outlier is #12 (Hawaii), where the Verbal score is much lower than the Math score. Similar but less marked outliers are California and Illinois. Three states have Verbal scores that are somewhat higher than expected in relation to Math scores: Mississippi, Arkansas and West Virginia. Hawaii is the only state with two official languages, English and Hawaiian, reflecting a demographic composition in which performance on a verbal test in the English language might be hindered by reduced fluency in English. Illinois and California might have large groups of more recent immigrants whose performance might be hindered in a test on the English language.
The positive outliers seem to be states with possibly low immigration and a large proportion of native English speakers, who may have a relative advantage with a test in the English language.

5.62
b) [Scatterplot: Salary versus Age]
The correlation is 0.13 and r-squared is 0.0169.

c) At higher ages there is a tendency to have higher salaries. An increase of one standard deviation in Age is associated with an increase of 0.13 standard deviations in Salary. The relationship is weak.

Chapter 6.

6.7
a) The row percents indicate the percentage of students, for each sex, who fall in each of the three categories when asked their perception of their weight.

b) You might have done the following by hand or with Excel, but here's how it 'could' have been done with R!

c) [Bar chart: Freq (0 to 80) for Male and Female in the categories About Right, Overweight, Underweight]

d) A higher percentage of males than of females consider themselves 'about right' (77% versus 67%). Of those who don't consider themselves about right, women tend to consider themselves overweight (30% overweight vs 2% underweight) while men tend to consider themselves underweight (4% overweight vs 19% underweight).

e) The sample consisted of Penn State statistics students. With respect to weight attitudes they may have been representative of Penn State students in general, although we can't be sure. They are probably not representative of students in general and almost certainly not of the population as a whole.

6.22 Using the terminology of this chapter, what name applies to each of the boldface numbers in the following quotes?

a) "Fontham found increased risks of lung cancer with increasing exposure to secondhand smoke, whether it took place at home, at work, or in a social setting.
A spouse's smoking alone produced an overall 30 percent increase [relative increase in risk] in lung-cancer risk"

b) "What they found was that women who smoked had a risk [of getting lung cancer] 27.9 times as great [relative risk] as non-smoking women; in contrast, the risk for men who smoked regularly was only 9.6 times greater [relative risk] than that for male non-smokers"

c) One student in five [risk] reports abandoning safe-sex practices when drunk.

6.27
a,b) Tax rate in each income category and overall:

Income category    1974     1978
Under 5,000        5.4%     3.5%
5K to 9,999        9.3%     7.3%
10K to 14,999      11.1%    10%
15K to 99,999      16%      15.9%
100,000 +          38.4%    38.3%
Total              14.1%    15.2%

c) They decreased within each income category.
d) The overall tax rate increased.
e) This is an instance of Simpson's Paradox. Although the tax rate decreased in each category from 1974 to 1978, incomes shifted upwards, so that in 1978 a larger proportion of incomes were in the higher categories. As a result, the overall rate in 1978 gives more weight to the higher rates in the higher income categories.

6.29
a) The new treatment was the more successful treatment in Hospital A. The percent surviving with the new treatment was 100/1000 or 10%, compared to 5/100 or 5% with the standard treatment.

b) The new treatment was the more successful treatment in Hospital B. The percent surviving with the new treatment was 95/100 or 95%, compared to 500/1000 or 50% with the standard treatment.

c) The combined contingency table is:

            survive           die               total
standard    505 = 5 + 500     595 = 95 + 500    1100
new         195 = 95 + 100    905 = 900 + 5     1100
total       700               1500              2200

The standard treatment is more successful than the new treatment in the combined data set. The percent surviving with the standard treatment is (505/1100) × 100% = 45.9%, compared to (195/1100) × 100% = 17.7% with the new treatment. This is an example of Simpson's Paradox.
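The reversal in 6.29 can be checked numerically. The following is a minimal sketch in Python using only the counts quoted in parts (a) and (b); the variable and function names are illustrative choices, not part of the original solution.

```python
# Counts from Exercise 6.29 as (survived, total treated),
# for each hospital and treatment. Names here are illustrative only.
counts = {
    "A": {"standard": (5, 100), "new": (100, 1000)},
    "B": {"standard": (500, 1000), "new": (95, 100)},
}


def survival_rate(survived, total):
    """Proportion of treated patients who survived."""
    return survived / total


# Survival rate within each hospital, by treatment.
per_hospital = {
    hospital: {trt: survival_rate(*cell) for trt, cell in by_trt.items()}
    for hospital, by_trt in counts.items()
}

# Survival rate after pooling the two hospitals, by treatment.
pooled = {}
for trt in ("standard", "new"):
    survived = sum(counts[h][trt][0] for h in counts)
    total = sum(counts[h][trt][1] for h in counts)
    pooled[trt] = survival_rate(survived, total)

for hospital, rates in per_hospital.items():
    print(f"Hospital {hospital}: standard {rates['standard']:.1%}, new {rates['new']:.1%}")
print(f"Combined:   standard {pooled['standard']:.1%}, new {pooled['new']:.1%}")
```

The new treatment has the higher survival rate within each hospital, yet the standard treatment has the higher rate in the pooled data, because most patients given the new treatment were in Hospital A, where survival was low for both treatments.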
6.43
Step 1: H0: There is no relationship between hormone therapy and death from CHD.
Ha: There is a relationship between hormone therapy and death from CHD.
Step 2: Expected counts:
Step 3: From above: p-value = 0.4679
Step 4: Fail to reject H0 at the 5% level of significance because the p-value is greater than 0.05.
Step 5: Conclude there is not enough evidence to say that there is a relationship between hormone therapy and risk of death from CHD.

6.56 Since the p-value of 0.0005 is less than 0.001, we conclude that there is very strong evidence of a relationship between Anger and Heart Disease.

6.57 In the "no anger" group, 191 men were free of heart disease while 8 men had heart disease, yielding odds of 191 to 8. Dividing both counts by 8 shows that the odds are 23.875 to 1, or about 24 to 1. In the "most anger" group, 500 men were free of heart disease while 59 men had heart disease, yielding odds of 500 to 59, i.e. 8.475 to 1, or approximately 8.5 to 1.

6.58 The odds ratio is
(odds for no anger) / (odds for most anger) = (191/8) / (500/59) = 2.817
The odds of remaining free of heart disease versus getting heart disease for men with no anger are about 2.8 times the odds of those events for men with the most anger.

6.62
c) Based on the p-value, we conclude that there is evidence of a relationship between Sex and feelings about weight.

d) There appears to be a reversal, in that more men than women feel "about right" at Penn State, in contrast with UC Davis, where more women than men feel about right. The difference may not be significant, however. Among those who are not satisfied with their weight, the women at both Penn State and UC Davis were particularly concerned with being overweight. However, in contrast with Penn State, where men were primarily concerned with being underweight, the men at UC Davis were more equally concerned with being overweight and being underweight.

Chapter 7 Exercises

Notes:
1. Numbers in red need to be done for Assignment 2.
2.
The numbers shown in the text all have the form '7.x', where 'x' is the number of the question within chapter 7. In the following lists I only show 'x'.

Chapter 7, pp. 240-247:
1. Random Circumstances & 2. Interpretations of Probability: 2, 5, 7, 16
3. Probability Definitions and Relationships: 18, 19, 20
4. Basic Rules for Finding Probabilities: 34, 35, 36, 42
5. Strategies for Finding Complicated Probabilities: 44, 45, 46, 47, 50, 54
6. Using Simulation to Estimate Probabilities: none
7. Coincidences and Intuitive Judgments about Probability: 64, 68, 72, 76 (similar question likely to be on test)
Chapter Exercises: 82, 83, 84, 85, 91 to 98 (sequence of exercises on same problem).

7.34
a) 1/2
b) 1/2
c) 1/2 × 1/2 = 1/4
d) 1/2 + 1/2 − 1/4 = 3/4

7.36

7.42
a) The issue involved is whether any of the traits of driving a red pickup truck, being a smoker and having blond hair are related. It seems reasonable to assume the three traits are independent, but we can't know for certain. Perhaps people who drive red vehicles or people who drive pickup trucks are more likely to be risk takers, as are people who smoke.

b) P(red pickup truck and smokes and blond) = (1/50)(.30)(.20) = .0012. Use the multiplication rule for independent events (Rule 3b extension).

c) Number fitting the description of the criminal = 10,000(.0012) = 12. The value of the proportion (.0012) was determined in part (b).

d) Probability = 11/12 = .917 that the driver arrested by the police is innocent. There are 12 vehicle owners who fit the description. Assuming the description is accurate (and that the perpetrator comes from this town), one of these 12 is guilty and the other eleven are innocent.

e) The answer to part (d) suggests an argument against the prosecutor's reasoning. While the evidence narrows the possibilities down to only twelve people, eleven of these twelve people are innocent.
Conditional on the given evidence (and only the given evidence), the probability is high (.917) that the arrested person is innocent.

7.45
a) (.6)(100) = 60.
b) (.2)(60) = 12. You could also solve this as (.6)(.2)(100) = 12.
c)
                Senior    Non-senior    Total
Science         12        48            60
Liberal Arts    12        28            40
Total           24        76            100

d) 24/100 = .24. Therefore, 24% of the students are seniors.

7.46

7.54

7.64

7.68
a) The probability that the person actually carries the virus is 1/11 = .0909 because, in the low-risk population, for every infected person who tests positive there are 10 people who test positive and do not carry the virus.

b) Although the probability of testing positive for those with the disease is high, the reverse probability (having the disease given a positive test) is not. If there are a very large number of people who do not have the disease, then even if only a small percent of them test positive, the result will be a large number of positive tests in healthy people. Here is a table that conveys this idea. Notice that everyone with the disease tests positive, and almost 90% of those without the disease correctly test negative, yet of every 11 people who test positive, only one has the virus.

                 HIV    Non HIV    Total
Test positive    10     100        110
Test negative    0      890        890
Total            10     990        1000

7.72 No, she is not correct. Assuming birth outcomes are independent, having 3 consecutive boys does not change the probability that the fourth child will be a boy.

7.82
a) Percent = 40%, and P(get A | regular attendance) = .40
b) Percent = 10%, and P(get A | not regular attendance) = .10
c) Students who get an A either regularly attend or do not.
Over the whole class:
P(get A) = P(regularly attend and get A) + P(do not regularly attend and get A)
The two components on the right side of the previous equation are:
P(regularly attend and get A) = P(regularly attend) × P(get A | regularly attend) = (.7)(.4) = .28
P(do not regularly attend and get A) = P(do not regularly attend) × P(get A | do not regularly attend) = (.3)(.1) = .03
So, P(get A) = .28 + .03 = .31, which is 31%.

7.84

7.85
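As a check on the arithmetic in Exercise 7.82 c), the calculation can be written as a short Python sketch. The function name and argument names are illustrative assumptions; the probabilities are the ones given in the solution to 7.82.

```python
def total_probability(p_group, p_event_given_group, p_event_given_not_group):
    """P(event) when the population splits into a group and its complement."""
    in_group = p_group * p_event_given_group                # P(group and event)
    not_in_group = (1 - p_group) * p_event_given_not_group  # P(not group and event)
    return in_group + not_in_group


# Values from Exercise 7.82: 70% attend regularly, 40% of them get an A,
# and 10% of those who do not attend regularly get an A.
p_a = total_probability(0.7, 0.4, 0.1)
print(f"P(get A) = {p_a:.2f}")
```

The two terms inside the function correspond to the two components (.28 and .03) in the solution.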