MATH 232: Engineering Mathematics
Problem Set 20, Final version
Due Date: Mon., May 3, 2010

First, here are some summary notes on inference procedures for sample means. In general, we need random samples from normal populations, though our procedures are robust against the latter (normality) assumption when sample sizes are large enough to invoke the magic of the CLT.

Hypothesis Testing

The test statistic takes the form

    t0 or z0 = [(estimate) − (hypothesized value)] / (std. error),

where the relevant quantities for various scenarios are:

Parameter | Variances known? | Test stat | Estimate | Standard Error | Distribution
µ         | yes  | z0 | x̄       | σ/√n                | Norm(0, 1)
µ         | no   | t0 | x̄       | s/√n                | t (df = n − 1)
µ2 − µ1   | yes  | z0 | x̄2 − x̄1 | √(σ1²/n1 + σ2²/n2)  | Norm(0, 1)
µ2 − µ1   | no†  | t0 | x̄2 − x̄1 | √(s1²/n1 + s2²/n2)  | t, with df = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
µ2 − µ1   | no‡  | t0 | x̄2 − x̄1 | sp √(1/n1 + 1/n2), where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2) | t (df = n1 + n2 − 2)

† Formulas provided in text, and employed by R's t.test() command.
‡ Formulas used in (recent/current?) Fundamentals of Engineering documents.

The P-value, if the alternate hypothesis is 1-sided, is the area in the appropriate tail beyond the test statistic in the appropriate distribution. (See figure below.) If the alternate hypothesis is 2-sided, the P-value is twice that tail area.

[Figure: density curve of Norm(0, 1) or t (specified df), with the tail area beyond z0 or t0 shaded.]

100(1 − α)% Confidence Intervals

A 2-sided confidence interval for one of the parameters in the table above is given by

    (estimate) ± (zα/2 or tα/2)(std. error),

where the standard error and corresponding distribution are as given in the table, and the critical value is chosen as pictured.

[Figure: density curve of Norm(0, 1) or t (specified df), with area γ shaded to the right of the critical value zγ or tγ.]

For a one-sided 100(1 − α)% confidence bound, we replace tα/2 (or zα/2) with tα (or zα) and compute just the lower/upper endpoint of the resulting CI as appropriate.
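As a concrete check on the † row's formulas, the following sketch computes the Welch statistic and Satterthwaite degrees of freedom by hand. It is written in Python purely as an illustration (the course work itself is in R), using the two groups of R's built-in sleep dataset, which is analyzed with t.test() later in this hand-out.

```python
import math

# The two groups of R's sleep dataset (extra hours of sleep under each drug)
gp1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
gp2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

def welch(x1, x2):
    """Welch two-sample t statistic and Satterthwaite df (the dagger row)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)   # s1^2
    v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)   # s2^2
    se = math.sqrt(v1 / n1 + v2 / n2)                # sqrt(s1^2/n1 + s2^2/n2)
    t0 = (m1 - m2) / se                              # (estimate - 0) / std. error
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t0, df

t0, df = welch(gp1, gp2)   # matches R's t.test(): t = -1.8608, df = 17.776
```

Note that t.test(gp1, gp2) reports mean(x) − mean(y), so the statistic here is computed in the same order.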
In R

One of R's built-in datasets is called sleep, and contains the results of a sleep experiment. Specifically, two different soporifics (sleep-aids) were tried on subjects, and the amount of extra sleep that each subject attained was recorded. The question is whether one soporific worked better than the other. This data was used in a famous "Student" (William Sealy Gosset) paper of 1908 addressing the issue of inference on two means.

> data(sleep)
> names(sleep)
[1] "extra" "group"
> sleep$extra
 [1]  0.7 -1.6 -0.2 -1.2 -0.1  3.4  3.7  0.8  0.0  2.0  1.9  0.8  1.1  0.1 -0.1
[16]  4.4  5.5  1.6  4.6  3.4
> sleep$group
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2

One may store the "extra" time slept in vectors, one for each group, and then run the t.test() command using both as inputs, as in

> gp1 = sleep$extra[sleep$group == 1]
> gp2 = sleep$extra[sleep$group == 2]
> t.test(gp1, gp2)

        Welch Two Sample t-test

data:  gp1 and gp2
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean of x mean of y
     0.75      2.33

The resulting output contains both a 95% CI for the difference of means µ1 − µ2, as well as the test statistic, number of degrees of freedom, and P-value for the hypothesis test

    H0: µ1 − µ2 = 0,    Ha: µ1 − µ2 ≠ 0.

Study this output carefully so as to understand all that it reports. Another way to get this same output, without first having to create vectors like gp1 and gp2, is by using a formula:

> t.test(extra ~ group, data = sleep)

Notice that the hypothesized value of zero for µ1 − µ2 is both inside the 95% CI and not rejected at the 5% significance level.

Day 1 (Monday)

• Read Sections 5.1–5.3. You may give little attention to the part of Section 5.3.1 devoted to the case σ1² = σ2² = σ², beginning about a third of the way down p. 225 and ending about halfway through p. 228.
You may similarly ignore that case discussed for a page and a half at the start of Section 5.3.3. The reasons for this are theoretical (it is unlikely that, not knowing σ1 and σ2, you would still know them to be equal) and practical (it seems the writers of FE materials do not consider this case necessary to know).

• Section 5.3: Do Problems 16, 20 part (a), 22, 26 and 27.

Day 2 (Thursday)

• Read Sections 5.4 and 6.1 from MRH4e.

• Section 5.4: Do Problems 30, 31, 32, 33, 36, 38 and 39.

Day 3 (Friday)

• Problem ?39: In class, we used R code like

  > lm.results = lm(y ~ x)

to obtain the linear model and store it in a variable. We get a more detailed summary of the model with

  > summary(lm.results)

We can do other things with the results from lm(). In particular, we can obtain predicted and residual values via

  > predict(lm.results)
  > residuals(lm.results)

Use such commands as needed, along with data on the 2003 American League Baseball season, found in the file http://www.calvin.edu/~stob/data/al2003.csv, to investigate the following.

  (a) Suppose we wish to predict the number of runs (R) a team will score on the year given the number of homeruns (HR) the team will hit. Write a linear relationship between these two variables.
  (b) Use this linear relationship to predict the number of runs a team will score given it hits 200 homeruns on the year.
  (c) Are there any teams for which the linear relationship does a poor job of predicting runs from homeruns?

• Problem ?40: From data on the 2003 American League Baseball season, found in the file http://www.calvin.edu/~stob/data/al2003.csv, suppose that we wish to predict the number of games a team will win (W) from the number of runs the team scores (R).

  (a) Write a linear relationship for W in terms of R.
  (b) How many runs must a team score to win 81 games according to this relationship?

• Problem ?40: Suppose that we wish to fit a linear model without a constant: i.e., y = bx.
Find a general formula for the value of b that minimizes the sum of squared residuals, ∑ᵢ₌₁ⁿ (yᵢ − bxᵢ)², in this case. (Hint: there is only one variable here, b, so this is a straightforward Mathematics 161 max-min problem.)

• Problem ?41: In R, if we wish to fit a line y = bx without the constant term, we use lm(y ~ x - 1). (The -1 in the formula notation in this context tells R to omit the constant term.) Using data on the 2003 American League Baseball season, found in the file http://www.calvin.edu/~stob/data/al2003.csv, define new variables for W − L and R − OR. (For example, define wl = s$W - s$L, where s is the data frame containing your data.)

  (a) Write W − L as a linear function of R − OR without a constant term.
  (b) Why do you think it makes sense (given the nature of the variables) to omit a constant term in this model?

• Problem ?42: The R dataset women (load it with the command data(women)) gives the average weight of American women by height. Find (using R) the regression line that predicts average weight from height. Use the xyplot() command from the lattice package, employing the switch type = c('p', 'r'), to view the data along with the regression line. Look also at a plot of the residuals vs. heights. (Include this plot in your write-up, along with the commands you use to produce it.) Finally, calculate the correlation between these two variables. Based upon what you see, do you think that a linear relationship is the best way to describe the relationship between average weight and height? Explain why or why not.
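To make the lm() / predict() / residuals() workflow concrete, recall that the least-squares line y = a + bx has slope b = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)² and intercept a = ȳ − b x̄. A minimal Python sketch of that workflow follows; it is an illustration only (the toy data are made up, not values from al2003.csv), and the course work itself should be done in R.

```python
def lsq_fit(x, y):
    """Least-squares line y = a + b*x; returns (a, b), like coef(lm(y ~ x))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return a, b

def predict(fit, x):
    """Fitted values, like predict(lm.results)."""
    a, b = fit
    return [a + b * xi for xi in x]

def residuals(fit, x, y):
    """Observed minus fitted, like residuals(lm.results)."""
    return [yi - yhat for yi, yhat in zip(y, predict(fit, x))]

# Toy data lying exactly on y = 1 + 2x, so the fit should recover a = 1, b = 2
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
fit = lsq_fit(x, y)
```

With real data the residuals will not all be zero, and inspecting the largest of them answers questions like part (c) of Problem ?39.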