Math 143 – Fall 2006

September 5, 2006
Introduction to Data

• Data is 2-dimensional
  – individuals (cases, subjects, units)
  – variables
• Different kinds of variables (categorical, quantitative)
• Distribution of a variable: What values does a variable take, and with what frequency?
  – can represent with tables
  – can represent with pictures (histograms and barcharts)

September 7, 2006
Graphical and Numerical Summaries of Data

• Distribution of a variable (Review)
  – answers two questions: what values? and how often?
  – can be summarized with tables, graphs, and numbers
  – last time: barcharts & histograms
• Graphical representations of categorical data
  – discussed bar charts last time
  – read about pie charts on your own
• Stemplots (and Histograms)
  – Area =
• Shape
  – key terms: symmetric, skewed (left/right), unimodal, bimodal, outlier
  – Examples: Income and education [eg02_01, eg02_05]; Old Faithful [faithful]
• Numerical Summaries
  – Mean
  – Median
  – Percentile/Decile/Quantile/Five-Number Summary
  – Boxplot (a picture of the five-number summary)
  – Standard Deviation (next time)
• Time Plots
  – Example: college tuition [eg01_06]

R Notes

R has a wide range of utilities for generating numerical and graphical summaries of data. Some of these are included in R Commander. Tomorrow in the lab you will get a chance to try it out, but feel free to give it a try on your own ahead of time. Information about R can be found on the course web page:

http://www.calvin.edu/~rpruim/courses/m143/F06/

Announcements

Meet in the Mac Lab (NH 067) tomorrow.

September 8, 2006
Goals:

1. To begin learning how to use R and R Commander.
2. To use numerical and graphical summaries to describe distributions of variables.

Getting Started with R and R Commander

To use R on a Mac it is important to follow these steps in order.

1. Launch X11. (This is not needed on a Windows machine, but may be on a Linux machine.) Click on the white rectangle with an 'X' in it in the dock. A window will open; you may close it by clicking on the red dot in the upper left corner.
2. Launch R. The dock icon has a blue 'R' in an oval. This will open the R console window.
3. Load the BPS3 package and start R Commander. The BPS3 package has data to accompany our text. This can be done in two ways. Choose one of the following:
   (a) In the Package and Data menu choose Package Manager and then select the BPS3 and Rcmdr packages by clicking on the boxes to the left.
   (b) In the R console window type library(BPS3) and then library(Rcmdr).

That's it. You should be ready to go. There is a link to a more detailed description of using R with R Commander on the course web page.

Getting Data Into R Commander

You can't really do much with R Commander until you have an active data set. The easiest way to get data is from a loaded R package, like BPS3 (which has data sets from our textbook). Several other packages (like datasets) may also be available.

• Use Data > Data in Packages > Load . . . and select eg02_01 from the BPS3 package. All the data sets supplied with your textbook should be available this way. eg is for example, ex for exercise, ta for table. There are also a few larger, named data sets.
• If you like, you can view the data set by clicking on the View Data Set button.
• Statistics > Summaries > Active Data Set is also useful. Try it. You will get a summary for each variable in the data set if the data set has more than one variable.
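If you prefer typing to menus, the same summary is available at the R console. A minimal sketch (assuming the BPS3 package is installed, as described above):

    # Load the textbook data and summarize every variable in eg02_01.
    library(BPS3)
    data(eg02_01)
    summary(eg02_01)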
Graphical and Numerical Summaries

Getting R Commander to make graphical and numerical summaries is quite easy. Look in the Summaries and Graphs menus and see if you can figure out how to do the things listed below. Try other things too. This is a session where you can simply experiment a bit. Note: menu items are grayed out when you don't have the correct kind of data available.

1. Load datasets:faithful, which contains the Old Faithful eruption data we looked at in class.
   • Make a histogram of eruption time. Play around with the settings to try different numbers of bins. What number of bins do you like best? Why?
   • Make a stemplot of eruption times. (There are two ways to do this.)
   • See if you can get a useful stemplot. (You will need to play around with some of the settings to get a useful number of stems.)
   • Are longer waiting times associated with longer or shorter eruption times? Let's make side-by-side boxplots to investigate this question.
     – First, we need to create a new variable using Data > Manage Variables . . . > Bin Numeric Variable . . . to make a categorical variable out of the quantitative variable waiting.
     – Then make a Lattice bwplot and use your new variable for the Plot by groups. What do you conclude?
2. datasets:insectSprays has data from an agricultural experiment. Create side-by-side boxplots and decide which insecticide you think is best.

Saving Your Results

Graphs. Use Graphs > Save graph to file. You will be able to select the file type. PDF is a good option. If that is problematic for you, try PNG or JPEG (under the bitmap menu). You will not get a graph that looks exactly like the one on the screen; it will be regenerated for the file. In particular, you will be asked how large you want the image to be. Once you have saved the image you can import it into other files (like Word documents).

Text Output. You should be able to use copy and paste to move text output to a Word document. On a Mac, <Apple>-C will copy and <Apple>-V will paste.

Files on the Lab Machines Are Temporary

Files that you store on the lab machines will be deleted when the machines are rebooted. Files that you want to keep longer should be stored either on a USB memory device or in your network storage space. You can access your network storage by clicking on one of the little spring-like icons in the dock (one is for legacy and one for limerick).
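For reference, the lab's Old Faithful plots can also be produced by typing commands at the R console. A minimal sketch (binning waiting at its median is just one illustrative choice, not the lab's required binning):

    # Old Faithful plots from the lab, typed directly in R.
    data(faithful)
    hist(faithful$eruptions, breaks = 10)    # experiment with the bins
    stem(faithful$eruptions)                 # stemplot of eruption times
    # Bin the quantitative variable 'waiting' to get a grouping variable,
    # then draw side-by-side boxplots of eruption time by group.
    longwait <- ifelse(faithful$waiting > median(faithful$waiting),
                       "long wait", "short wait")
    boxplot(eruptions ~ longwait, data = faithful)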
September 11, 2006
A Bit More About Numerical Summaries

• measures of center: median, mean, trimmed mean
• measures of spread: range, interquartile range (IQR), variance, standard deviation
• outliers and the 1.5 IQR rule
• Which measure should I use?
  – resistant measures, outliers, and influence
  – mean and standard deviation are closely related to normal distributions (coming soon) and are most useful for describing distributions that look something like normal distributions (unimodal, approximately symmetric)
  – center and spread may not be the right ways to characterize a distribution (e.g., bimodal distributions)
  – always look at a plot, too
  – John Paulos article: It's Mean to Ignore the Median

Using Numerical and Graphical Summaries to Explore Data

• Births by day – time plots
• Bin effects in histograms: Old Faithful data again (next time)

September 12, 2006
Density Curves

A density curve is a mathematical model of a distribution. We can think of it as a "smooth histogram".

• As with a histogram, the key idea is: AREA =
  – density curves are always above the horizontal axis
  – the area under the curve is always 1
• In many situations, the histogram for data will look approximately like some density curve.
• Median:
• Mean:

Normal Density Curves, Normal Distributions

One density curve arises (for mathematical reasons) again and again, both in the natural world and in statistical analyses. Because it shows up so often, it is called the normal density curve, and the distribution it describes is called the normal distribution.

• Actually, there is a whole family of normal distributions, but they are closely related.
• Basic shape:
• Mean (µ):
• Standard Deviation (σ):

September 12 & 14, 2006
68-95-99.7 Rule

For any normal distribution:

• approximately 68% of the distribution lies within 1 standard deviation of the mean
• approximately 95% of the distribution lies within 2 standard deviations of the mean
• approximately 99.7% of the distribution lies within 3 standard deviations of the mean

Picture:

Example 1. ACT scores in a certain school district are approximately normally distributed with mean 20 and standard deviation 5. Approximately what percentage of students scored

• above 20?
• below 15?
• between 15 and 25?
• between 25 and 30?

Z-scores

Because of the property above (and its generalizations), when working with normal distributions, all we really need to know is how many standard deviations away from the mean a data point is. We call this number a standardized value or a z-score:

  z = (x − µ)/σ = number of standard deviations from the mean

Example 2. Heights of young men in the US are distributed approximately normally with a mean of 70 inches and a standard deviation of 2.8 inches. For women the mean is 64.3 inches, and the standard deviation is 2.5 inches. Which is more unusual, a man who is at least 6' 6" (78 inches) or a woman who is at least 6' (72 inches)?

Other Normal Probabilities

Probabilities involving normal distributions and z-scores other than 1, 2, or 3 can be done in a similar way, but instead of memorizing all the values, we will use a computer, calculator, or table to look them up. It's as easy as 1-2-3:

1. sketch the density curve, label the axes, and shade the region of interest
2. convert to the z-scale (standardization)
3. use a table, calculator, or computer to get the area/probability

• in R: use pnorm() (see the sketch below).
• in Rcmdr: use Distributions > Continuous distributions > Normal . . . > Normal probabilities.
• from Table A: look up the z-score in the margins and read the probability inside the table.
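A minimal sketch of these three steps in R, using the men's height distribution from Example 2 (µ = 70, σ = 2.8):

    # P(height > 72) for men: standardize, then look up the area.
    z <- (72 - 70) / 2.8                    # step 2: the z-score
    1 - pnorm(z)                            # step 3: area above z
    # Or skip the hand standardization and let pnorm() do it:
    pnorm(72, mean = 70, sd = 2.8, lower.tail = FALSE)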
Important Note. The tradition for computers and tables is not to give the probability between two z-scores (like −1 and +1) but to give the probability of being below a given z-score.

Picture:

Example 3. Approximately what percentage of young men in the US are over 72 inches tall? Approximately what percentage of young men in the US are between 72 and 78 inches tall?

Working Backwards

We can also work from probabilities to z-scores to data values.

• in R: use qnorm().
• in Rcmdr: use Distributions > Continuous distributions > Normal . . . > Normal quantiles.
• from Table A: look for the probability in the table, read the z-score from the margins.

Example 4. How tall are the tallest 10% of young men in the US? The shortest 25%?
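A minimal sketch of Example 4 in R, again assuming the men's height distribution N(70, 2.8) from Example 2:

    # The tallest 10% are above the 90th percentile of heights;
    # the shortest 25% are below the 25th percentile.
    qnorm(0.90, mean = 70, sd = 2.8)
    qnorm(0.25, mean = 70, sd = 2.8)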
September 15 & 18, 2006
Populations and Samples

• Population: the set of individuals (people, objects, or whatever) that we would like to know about.
• Sample: the set of individuals for which we actually (attempt to) obtain data.
• Statistical inference uses data from a sample to draw conclusions about a population.
  – For this to work we need "good" samples, that is, samples that are in some sense representative of the population in question.
• In a census, the sample is the same as the population, but often a census is not possible.

Sampling Difficulties and Solutions

• A study design is biased if it systematically favors certain outcomes. We want to avoid bias.
• Some sources of bias:
  – convenience sampling and voluntary response samples almost never accurately represent the entire population.
  – undercoverage occurs when some groups of the population are missing from or underrepresented in the sample because of the sample design.
  – nonresponse occurs when someone chosen for a sample is unable or unwilling to be in the sample.
  – response bias occurs when the interaction of the interviewer with the subject leads to bias. (Examples: question wording, environment, sensitive questions, selective memory, etc.)
  – ascertainment bias can occur in "case/control" studies where separate samples of "cases" and "controls" are made. Ideally the cases and controls should be similar in all respects other than case status, but this can be hard to achieve.
• People are problematic: many of these sources of bias are unique to or exaggerated in human subjects. (Manufactured parts never refuse to participate or lie, for example.)
• Examples of issues related to public opinion surveys:
  – In one study, only 13% of Americans thought we spend too much on "assistance for the poor", but 44% thought we spend too much on "welfare."
  – In another study, 51% of Scots would vote for "independence for Scotland", but only 34% for "an independent Scotland separate from the United Kingdom."
  – "Pre-election surveys, even those taken just days before voters go to the polls, often substantially underestimate support for white candidates in races where the other candidate is African-American . . . Those who are reluctant to participate in telephone surveys – and therefore most likely to be missed in quick-turnaround polls that do not include callbacks and refusal conversions – are noticeably less sympathetic toward African-Americans." (1998 Pew Report)
  – race-of-interviewer effects have been studied in many contexts.
• When reading the results of a public opinion survey, it is important to know exactly what question was asked and how.

Randomized Sampling Methods

• Simplest randomized sampling design: the simple random sample (SRS)
  – every possible sample of a certain size from the population of interest is equally likely to be chosen.
  – if we have a list of the population, computers can easily generate random samples.
  – when we don't have a complete list of the population, more complicated methods are required.
• A probability sample is any sample that is chosen by chance. It can be studied if we know what samples are possible and the probability of each.
• Stratified random sample
  – Divide the population into sections called strata; then choose a separate SRS in each stratum. (Example: select 10 students from each dorm on campus.)
  – The process can be iterated to get multistage sampling. (Example: large-scale surveys of the US population might randomly select states, then counties, then "blocks", then houses, generating the necessary lists as they go along.)
• Random samples will vary from sample to sample, but the variation can be studied and quantified mathematically (using probability).
  – We will study this beginning in chapter 13. Eventually we will learn that
    ∗ an SRS is unbiased
    ∗ bigger random samples give more accurate results in a quantifiable way

Some data from the "Little Survey"

  buy ticket?   lose money   lose ticket
  Yes               11            5
  No                 7           10
  Count             18           15

  preferred plan   people saved   people die
  A                      6             4
  B                      7            16
  Count                 13            20

  sounds right?   significantly more homework   fairly typical amount of homework
  Yes                        11                               16
  No                          3                                3

September 22, 2006
Randomness and Probability

Random means:

1.
2.

Probability is the mathematical study of ________.

Probability Models

Consider some random phenomenon.

• The set of all possible outcomes (the things that could happen) is called the sample space. It is often represented by the letter S.
• An event is a set of outcomes (none, one, or multiple outcomes considered together). We often use letters like E or A or B for events.
• A probability model assigns numbers (called probabilities) to events to indicate how likely they are to occur (in the long run).

Three types of probability models (three ways to assign these numbers to events):

•
•
•

We calculate empirical probability by ________.

The biggest problems with empirical probability are ________.

Probability Rules

These rules are satisfied by empirical probability and assumed when doing theoretical probability.

1. A probability is always a number between ___ and ___:  ___ ≤ P(A) ≤ ___
2. The sum of the probabilities of all possible outcomes is always ___:  P(S) = ___
3. The probability that an event does not occur is 1 minus the probability that it does occur: P(E doesn't happen) = 1 − P(E does happen)
4. If two events can't both happen at the same time (we call them ________ or ________), then the probability that at least one of these events happens is the sum of the probabilities of each event: P(A or B) = P(A) + P(B), provided A and B are ________.

Examples

September 25, 2006
More Probability Examples

Example 1. If we roll two fair dice, what is the probability that the sum will be 5?

Example 2. If I have a loaded die weighted so that the numbers 1–5 come up equally often but 6 comes up twice as often, what is the probability of each outcome of a roll of this die?
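As a sketch of how Example 1 could be checked empirically (the long-run idea behind empirical probability), here is a small simulation; the number of repetitions is an arbitrary choice:

    # Roll two fair dice 10,000 times and record the sums.
    rolls <- replicate(10000, sum(sample(1:6, 2, replace = TRUE)))
    mean(rolls == 5)    # proportion of sums equal to 5; compare to 4/36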
Random Variables

A random variable is a variable whose value is the ________ of a ________. That is, a random variable is the result of doing something random and turning the result into a number.

The probability distribution of a random variable X describes what values X can have and the probability of each. When the number of values is finite, it can be handy to describe a random variable in a table with the values in one row and the probabilities in a second row:

  Value of X    | ···
  Probability   | ···

Example 3. Make probability tables for the following random variables.

a) Roll two fair dice and let X be the sum.
b) Let Y be the result of rolling the loaded die from Example 2.

Continuous random variables (like normal random variables) are described by giving their density curves. Probabilities are determined by taking areas under these curves. These probabilities will always follow the 4 probability rules.

Example 4. First Digits. Consider the first non-zero digit of a "random number". It seems obvious that each digit should be equally likely. This means that if we let the random variable X be the first digit of a random number, then the distribution of X should be:

  Value of X     1   2   3   4   5   6   7   8   9
  Probability

This is a ________ probability distribution because it is based on ________ rather than on ________.

In the late 1800s, Newcomb noticed that some pages of the book of logarithm tables he used were dirtier than others – presumably because they were used more frequently than other pages. Which page is used depends on the first non-zero digit of a number. That is, Newcomb observed data that seemed to suggest that the obvious conjecture that all first digits should be equally likely was wrong. Our theoretical distribution and the empirical distribution do not seem to match.

  Table 1: Empirical data collected by Benford (1938) [Source: Wikipedia]

      Title                  1     2     3     4     5     6     7     8     9      N
  A   Rivers, Area         31.0  16.4  10.7  11.3   7.2   8.6   5.5   4.2   5.1    335
  B   Population           33.9  20.4  14.2   8.1   7.2   6.2   4.1   3.7   2.2   3259
  C   Constants            41.3  14.4   4.8   8.6  10.6   5.8   1.0   2.9  10.6    104
  D   Newspapers           30.0  18.0  12.0  10.0   8.0   6.0   6.0   5.0   5.0    100
  E   Specific Heat        24.0  18.4  16.2  14.6  10.6   4.1   3.2   4.8   4.1   1389
  F   Pressure             29.6  18.3  12.8   9.8   8.3   6.4   5.7   4.4   4.7    703
  G   H.P. Lost            30.0  18.4  11.9  10.8   8.1   7.0   5.1   5.1   3.6    690
  H   Mol. Wgt.            26.7  25.2  15.4  10.8   6.7   5.1   4.1   2.8   3.2   1800
  I   Drainage             27.1  23.9  13.8  12.6   8.2   5.0   5.0   2.5   1.9    159
  J   Atomic Wgt.          47.2  18.7   5.5   4.4   6.6   4.4   3.3   4.4   5.5     91
  K   n−1, n−1/2, . . .    25.7  20.3   9.7   6.8   6.6   6.8   7.2   8.0   8.9   5000
  L   Design               26.8  14.8  14.3   7.5   8.3   8.4   7.0   7.3   5.6    560
  M   Reader's Digest      33.4  18.5  12.4   7.5   7.1   6.5   5.5   4.9   4.2    308
  N   Cost Data            32.4  18.8  10.1  10.1   9.8   5.5   4.7   5.5   3.1    741
  O   X-Ray Volts          27.9  17.5  14.4   9.0   8.1   7.4   5.1   5.8   4.8    707
  P   Am. League           32.7  17.6  12.6   9.8   7.4   6.4   4.9   5.6   3.0   1458
  Q   Blackbody            31.0  17.3  14.1   8.7   6.6   7.0   5.2   4.7   5.4   1165
  R   Addresses            28.9  19.2  12.6   8.8   8.5   6.4   5.6   5.0   5.0    342
  S   Powers               25.3  16.0  12.0  10.0   8.5   8.8   6.8   7.1   5.5    900
  T   Death Rate           27.0  18.6  15.7   9.4   6.7   6.5   7.2   4.8   4.1    418
      Average              30.6  18.5  12.4   9.4   8.0   6.4   5.1   4.9   4.7   1011

Later Benford and others came up with other probability models to explain this situation, the most famous of which is now called Benford's Law. Benford's Law says that under certain assumptions, the probability distribution should obey the following formula:

  P(X = k) = log10(k + 1) − log10(k)

Here is that distribution described in table form (rounded to three digits).

  Leading digit    1     2     3     4     5     6     7     8     9
  Probability    .301  .176  .125  .097  .079  .067  .058  .051  .046
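As a quick check, the table's values can be reproduced from the formula in R:

    # Benford's Law probabilities for leading digits 1 through 9.
    k <- 1:9
    round(log10(k + 1) - log10(k), 3)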
Notice how much better it matches the empirical distributions in Table 1.

Sampling Distributions

For statistics, we are interested primarily in the following type of random variable:

• random process: take a random sample (an SRS, if possible) and collect data from the sample
• numerical result: calculate some number from the data (such a number is called a statistic)

The distribution of a sample statistic is called its sampling distribution.

September 26, 2006
Announcements

• Test approaching: information is available on the course web page.
• Problem Set 8 will be due next Monday (to allow time to get it back before the exam).

Mean and Standard Deviation of a Random Variable

We can define the mean and standard deviation of a random variable. The definitions are similar to those used for computing the mean and standard deviation from data. We will only do this in the simplest case, namely when a random variable has only finitely many possible values.

Example 1. We'll motivate the definition of the mean of a random variable by first looking at a calculation of the mean of 10 quiz scores. Suppose your 10 scores are four 10's, three 8's, two 7's, and a 6. To get the mean quiz score we could add these all up and divide by 10:

  mean quiz score = (10 + 10 + 10 + 10 + 8 + 8 + 8 + 7 + 7 + 6) / 10

Of course, we can simplify this a little bit using some multiplication, if we like:

  mean quiz score = (10 · 4 + 8 · 3 + 7 · 2 + 6 · 1) / 10

And a little algebra transforms this to

  mean quiz score = 10 · (4/10) + 8 · (3/10) + 7 · (2/10) + 6 · (1/10)

Notice that each term in this sum has the following form:

  quiz score × probability of getting that quiz score

This idea leads us to the following definition of the mean (and variance) of a random variable X.

Definition of mean and variance of a random variable X:

  mean of X = µX = Σk k · P(X = k)

  variance of X = σX² = Σk (k − µX)² · P(X = k)

Example 2. Let X be the random variable described by the table below. Compute the mean and standard deviation of X.

  Value of X      0    10    100    1000
  Probability   .95   .04   .009   .001
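A minimal sketch of Example 2's computation in R, straight from the definitions above:

    # Mean and standard deviation of X from its probability table.
    x <- c(0, 10, 100, 1000)
    p <- c(.95, .04, .009, .001)
    mu <- sum(x * p)                      # mu_X = sum of k * P(X = k)
    sigma <- sqrt(sum((x - mu)^2 * p))    # sigma_X
    c(mean = mu, sd = sigma)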
October 6, 2006
Statistical Inference

Statistical inference is inferring information about a population from information about a sample. We're generally talking about one of two things:

1.
2.

Confidence intervals

A confidence interval is the interval within which a population parameter is believed to lie with a measurable level of confidence. We want to be able to fill in the blanks in a statement like the following: We estimate the parameter to be between ________ and ________, and our intervals will contain the true value of the parameter approximately ___% of the times we use this method.

What does confidence mean? Where do confidence intervals come from? Let's think about a generic example. Suppose that we take an SRS of size n from a population with mean µ and standard deviation σ. We know that the sampling distribution for sample means (x̄) is (approximately) ________.

By the 68-95-99.7 rule, about 95% of samples will produce a value for x̄ that is within about ___ standard deviations of the true population mean µ. That is, about 95% of samples will produce a value for x̄ that is within ________ of µ. (Actually, it is more accurate to replace 2 by ___, as we can see from Table A.)

But if x̄ is within (1.96) σ/√n of µ, then µ is within (1.96) σ/√n of x̄. This won't be true of every sample, but it will be true of 95% of samples. So we are fairly confident that µ is between ________ and ________. We will express this by saying that the 95% CI for µ is

  x̄ ± 1.96 σ/√n

Example 1. A sample of 25 Valencia oranges weighed an average (mean) of 10 oz per orange. The standard deviation of the population of weights of Valencia oranges is 2 oz. Find a 95% confidence interval (CI) for the population mean.

Of course, we are free to choose any level of confidence we like. The only thing that will change is the critical value 1.96. We denote the critical value by z∗. If we want a confidence level of C, we choose the critical value z∗ so that ________, and then

  the level C confidence interval is x̄ ± z∗ σ/√n

z∗ σ/√n is called ________.

Example 2. Now compute a 99% CI for the mean weight of the Valencia oranges based on our previous sample of size 25. (n = 25, x̄ = 10, σ = 2)
z∗ =     CI:

Example 3. Now compute a 90% CI for the mean weight of the Valencia oranges based on our previous sample of size 25. (n = 25, x̄ = 10, σ = 2)
z∗ =     CI:

Example 4. A random sample of 100 batteries has a mean lifetime of 15 hours. Suppose the standard deviation for battery life in the population of all batteries is 2 hours. Give a 95% CI for the mean battery life.
z∗ =     n =     σ =     x̄ =     CI:

What can we learn from the formula?

1. As n increases, the width of the CI
2. As σ increases, the width of the CI
3. As the confidence level increases, z∗ ________, so the width ________

October 9, 2006
Determining the required sample size for a desired margin of error

Part of designing an experiment is determining what margin of error you can live with. For example, a manufacturer of toothpicks may not need to know the mean width of their product as accurately as a manufacturer of eyeglass screws does.

Example 1. A manufacturer of toothpicks wonders what the mean width of a toothpick is under a new manufacturing method. How many toothpicks must the manufacturer measure to be 90% confident that the sample mean is no farther from the true mean than 0.10 mm? Assume normality, and note that toothpicks produced under the old method had a standard deviation of 0.4 mm. z∗ = z0.05 = 1.645. (Where did 1.645 come from?)

If the manufacturer wants to be 99% confident, we must use z∗ = z0.005 = ___.

General Method: For a margin of error of no more than m, simply solve the equation

  m = z∗ σ/√n

for n, so n = ________

An unreasonable assumption

Notice that in all of the confidence intervals we have seen, we needed to know σ, the standard deviation of the population distribution. In practice, one almost never knows σ. So what can we do? Fortunately, in many situations, we can estimate that σ is probably pretty close to s, the standard deviation of the sample. Once we use s in place of σ, however, the sampling distribution is no longer normal. Instead, we must work with a new distribution, called Student's t-distribution. More on that soon.
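A minimal sketch of the orange CIs from Examples 1–3 in R; qnorm() supplies the critical value z∗ for any confidence level:

    # 90%, 95%, and 99% z confidence intervals with n = 25, xbar = 10, sigma = 2.
    n <- 25; xbar <- 10; sigma <- 2
    for (conf in c(0.90, 0.95, 0.99)) {
      zstar <- qnorm(1 - (1 - conf) / 2)    # critical value z*
      me <- zstar * sigma / sqrt(n)         # margin of error
      cat(conf, "CI:", xbar - me, "to", xbar + me, "\n")
    }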
October 10, 2006
Introduction to Hypothesis Testing

Our goal with hypothesis tests (also known as tests of significance) is to ________

Example 1. Golfballs in the Yard.

Toward a General Framework

A hypothesis test is a formal procedure for comparing observed (sample) data with a hypothesis whose truth we want to ascertain. What do we mean by hypothesis? The hypothesis is a ________ or ________ about the ________ in ________.

The results of a test are expressed in terms of a probability that measures how well the data and the hypothesis agree. In other words, it helps us decide if a hypothesis is reasonable or unreasonable based on the likelihood of getting sample data similar to our data. Just like confidence intervals, hypothesis testing is based on our knowledge of ________.

4-step procedure for testing a hypothesis

1.
2.
3.
4.

Step 1: State the null and alternate hypotheses.

State the hypothesis to be tested. It is called the null hypothesis, designated H0. The alternate hypothesis describes what you will believe if you reject the null hypothesis. It is designated Ha. This is the statement that we hope or suspect is true instead of H0.

Example 2. Cholesterol Levels. The average LDL of healthy middle-aged women is 108.4. We suspect that smokers have a higher LDL (LDL is supposed to be low).

Example 3. Free Throw Frank. Free Throw Frank says he has gotten much better and that he now makes 85% of his free throw attempts. We decide to have him shoot some free throws to see if we believe his claim.

Example 4. Cola Sweeteners. A cola company wants to know if the artificial sweetener it uses loses sweetness over time.

Step 2: Compute the test statistic

A test statistic is a number computed from the data that measures how well the sample data ________.

Example 4 (continued). Ten trained taste testers sampled new cola and "aged cola" and rated each for sweetness. The decreases in sweetness reported by these testers were

  2.0  0.4  0.7  2.0  −0.4  2.2  −1.3  1.2  1.1  2.3

Step 3: Compute the p-value

The p-value (for a given hypothesis test and sample) is the probability, if the null hypothesis is true, of obtaining a test statistic as extreme or more extreme than the one actually observed.

Example 4 (continued). How do we compute the p-value in our example? What is the general method here?

1.
2.
3.

Step 4: State a conclusion

A decision rule is simply a statement of the conditions under which the null hypothesis is or is not rejected. This condition generally means choosing a significance level α.

• If the p-value is less than the significance level α, we reject the null hypothesis and decide that the sample data do not support our null hypothesis; the results of our study are then called statistically significant at significance level α.
• If the p-value is greater than α, then we do not reject the null hypothesis. This doesn't necessarily mean that the null hypothesis is true, only that our data do not give strong enough evidence to reject it.

It is rather like a court trial. If there is strong enough evidence, we convict (guilty), but if there is reasonable doubt, we find the defendant "not guilty". This doesn't mean the defendant is innocent, only that we don't have enough evidence to claim with confidence that he or she is guilty.

Looking back at our example, if we select a significance level of α = 0.05, what do we conclude?
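As an illustration, here is how the sweetness data could be checked by computer; the notes work through the test by hand, so t.test() here is just a convenient stand-in, not the method used in class:

    # One-sided test of H0: mu = 0 (no sweetness loss) vs Ha: mu > 0.
    loss <- c(2.0, 0.4, 0.7, 2.0, -0.4, 2.2, -1.3, 1.2, 1.1, 2.3)
    t.test(loss, mu = 0, alternative = "greater")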
October 13, 2006
The problem with z-procedures

There is a problem with the z-procedures that we have looked at so far: they require that we know σ, the standard deviation of the population. In practice, this is almost never the case. So we need to develop new methods that work without knowing σ.

The obvious idea is to replace σ with s (the sample standard deviation) in all our formulas, since that would seem to give us the best information available about the population standard deviation. That is, instead of working with

  z = (x̄ − µ)/SD = (x̄ − µ)/(σ/√n),

we want to try using

  t = (x̄ − µ)/SE = (x̄ − µ)/(s/√n).

Note the use of SE (standard error) in this expression. That is our new estimate for SD (the standard deviation of the sampling distribution). This idea works, provided we can figure out the sampling distribution of t. The bad news is that t is no longer normal. The good news is that the distribution is known. It is called Student's t-distribution, or simply the t-distribution.

The t-distributions

Actually, there is a whole family of t-distributions. While the family of normal distributions is defined by the mean and standard deviation, the family of t-distributions is defined by degrees of freedom, usually abbreviated df. In many ways the t-distributions are a lot like the standard normal distribution. For example, all t-distributions are ________ and ________, with mean and median both equal to ___. But the density curve for the t-distributions is ________ and ________ than the density curve for the normal distribution. Statisticians like to say that the t-distributions have ________, which means more of the values are farther from the center (0). As df increases, the t-distribution becomes more and more like the standard normal distribution.

t-Tables

We can compute probabilities based on t-distributions just as easily as we can for normal distributions, but we need to use a computer or a different table (see Table C).

Example 1. Suppose I have a random variable T that has a t-distribution with df = 11.

• P(T > 1.36) =
• P(T < 1.36) =
• P(T > −1.36) =
• P(T < −1.36) =
• P(T > 2.2) =
• P(T < −1.796) =
• P(T > 2.00) =

The critical value t(df, α) has an area to the right of it equal to α.

Example 2.

• t(27, .005) =
• t(14, .975) =

Sampling distribution of sample means

For random samples from a normal population with mean µ and standard deviation σ,

• (x̄ − µ)/(σ/√n) has a normal distribution with mean 0 and standard deviation 1 (N(0, 1)).
• (x̄ − µ)/(s/√n) has a t-distribution with n − 1 degrees of freedom (df = n − 1).

But what if the population is not normal? If the population is not normal, then the sampling distributions are approximately as indicated above provided the sample size is large enough. In practice, the approximation to the t-distributions is only good enough when one of the following holds:

1. Small sample sizes: If n < 15, the population must be normal or quite close to normal (unimodal, symmetric). If we see that our data are clearly not normal, or if there are outliers in our data, and we don't have good reason to believe that the population is normal, we should not use t-distributions.
2. Medium sample sizes: The t-procedures will be reasonably accurate for unimodal population distributions even if there is some skewing. As a general rule, we can use the t-distributions so long as we do not see strong skewing or outliers in our data.
3. Large sample sizes: If n ≥ 40, the t-procedures can be used even if the population is strongly skewed.

Confidence intervals revisited

  CI for µ if σ is known:    x̄ ± z∗ σ/√n
  CI for µ if σ is unknown:  x̄ ± t∗ s/√n

Example 3. If we want to compute a 90% CI and n = 15 (df = 14), then t∗ =
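A sketch of how these look-ups could be done in R instead of Table C; pt() and qt() are R's t-distribution functions:

    # Probabilities and critical values for t-distributions.
    pt(1.36, df = 11, lower.tail = FALSE)    # P(T > 1.36) when df = 11
    pt(-1.796, df = 11)                      # P(T < -1.796)
    qt(0.005, df = 27, lower.tail = FALSE)   # critical value t(27, .005)
    qt(0.05, df = 14, lower.tail = FALSE)    # t* for a 90% CI when n = 15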
Example 4. Sixteen plots in a farmer's field were randomly selected and measured for total yield. The mean yield of the sample was 100 kg with a standard deviation of 18 kg. Assuming yield per plot is normally distributed, find a 98% CI for the mean yield per plot in the farmer's field.

When the sample size is very large, the t-distributions and the standard normal distribution are nearly identical. In Table C, critical values for the standard normal distribution can be found in the row labeled z∗. We will use these values also when df is large.

Example 5. If we want to compute a 90% CI and n = 200 (df = 199), then t∗ =

Example 6. U.S. Public Health Service 10-year hypertension study. The mean change in blood pressure for the 171 people using medication was −9.85, with a standard deviation of 11.55. Give a 95% CI for the true mean change in blood pressure.

Some cautions about CIs:

1. The data must be from an SRS.
2. The formula is not correct for more complex designs (e.g., stratified random sampling).
3. Because the mean is not resistant, outliers can have a large effect on the CI.
4. If the sample size is small and the population is not normal, we need different procedures.
5. The margin of error in a CI covers only random sampling errors, not errors in design or data collection.

Hypothesis Testing Revisited

The same four steps apply as before:

1. State the null and alternative hypotheses. Changes: none.
2. Compute a test statistic. Changes: compute a t statistic instead of a z statistic.
3. Compute the p-value. Changes: compare to a t-distribution instead of a normal distribution.
4. Interpret the results/state conclusions. Changes: none.

Suppose that an SRS of size n is drawn from a population having unknown mean µ. To test the hypothesis H0: µ = µ0 based on an SRS of size n, compute the one-sample t statistic

  t = (x̄ − µ0) / (s/√n)

The p-value for a test of H0 against

• Ha: µ > µ0 is P(T > t)
• Ha: µ < µ0 is P(T < t)
• Ha: µ ≠ µ0 is P(T > |t|) + P(T < −|t|) = 2 P(T > |t|)

where T is a random variable with a t-distribution (df = n − 1). These p-values are exact if the population distribution is normal and are approximately correct for large enough n in other cases.

The first two kinds of alternatives lead to one-sided or one-tailed tests. The third alternative leads to a two-sided or two-tailed test.

Example 7. Suppose that it costs $60 for an insurance company to investigate accident claims. This cost was deemed exorbitant compared to other insurance companies, and cost-cutting measures were instituted. In order to evaluate the impact of these new measures, a sample of 26 recent claims was selected at random. The sample mean and sd were $57 and $10, respectively. At the 0.01 level, is there a reduction in the average cost? Or can the difference of three dollars ($60 − $57) be attributed to the sample we happened to pick?

Hypotheses:
Test statistic:
p-value:
Conclusion:

October 16, 2006
More Hypothesis Test Examples

Example 1. The amount of lead in a certain type of soil, when released by a standard extraction method, averages 86 parts per million (ppm). A new extraction method is tried, and the researchers wondered whether it would extract a significantly different amount of lead. 41 specimens were obtained, with a mean of 83 ppm lead and a sd of 10 ppm.

Hypotheses:
Test statistic:
p-value:
Conclusion:
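A sketch of how Example 1's two-sided test could be checked by computer; only summary statistics are given, so we compute t ourselves rather than calling t.test():

    # Two-sided one-sample t test of H0: mu = 86 from summary statistics.
    xbar <- 83; mu0 <- 86; s <- 10; n <- 41
    t <- (xbar - mu0) / (s / sqrt(n))     # one-sample t statistic
    2 * pt(-abs(t), df = n - 1)           # two-sided p-value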
Example 2. The final stage of a chemical process is sampled and the level of impurities determined. The final stage is recycled if there are too many impurities, and the controls are readjusted if there are too few impurities (which is an indication that too much catalyst is being added). If it is concluded that the mean impurity level = .01 gram/liter, the process is continued without interruption. A sample of n = 100 specimens is measured, with a mean of 0.0112 and a sd of 0.005 g/l. Should the process be interrupted? Use a level of significance of .05.

Hypotheses:
Test statistic:
p-value:
Conclusion:

Comparing significance tests and confidence intervals

Example 3. The drained weights for a sample of 30 cans of fruit have mean 12.087 oz and standard deviation 0.2 oz. Test the hypothesis that, on average, a 12-oz drained weight standard is being maintained. Now let's compute a CI. What is the relationship between confidence intervals and two-sided hypothesis tests?

October 19, 2006
Variations on the Inference Theme

Recall that statistical inference is inferring information about a population from information about a sample. We're generally talking about one of two things:

1. Estimating parameters (confidence intervals)
2. Testing hypotheses (significance tests)

The same basic ideas that we used when computing confidence intervals and evaluating hypothesis tests for means of a quantitative variable can be applied in a number of related situations. The best way to think of these different situations is as variations on an inference theme. To make this easier, we will use a systematic notation scheme throughout:

• population parameters
  – p, proportion (of a categorical variable) [Note that this violates our convention about using Greek letters for parameters. Some books will use π or θ instead of p, but we will stick with the notation that matches our textbook.]
  – µ, mean (of a quantitative variable)
  – σ, standard deviation
• sample statistics
  – n, sample size
  – X, count (of a categorical variable)
  – p̂ = X/n, proportion (of a categorical variable)
  – x̄, mean (of a quantitative variable)
  – s, standard deviation
• sampling distribution
  – SD, standard deviation of the sampling distribution (σx̄ or σp̂ are also used for this)
  – SE, standard error of the sampling distribution (an estimate for SD)
  – µp̂, µx̄, mean of the sampling distribution (for p̂ and x̄, respectively)

Subscripts will be used to indicate ________.

The procedures involving the z (normal) and t distributions are all very similar.

• To do a hypothesis test, compute

    t or z = (data value − hypothesized value) / (SD or SE),

  and compare with the appropriate distribution (using tables or a computer).
• To compute a confidence interval, first determine the critical value for the desired level of confidence (z∗ or t∗); then the confidence interval is

    data value ± (critical value)(SD or SE).

Matched Pairs

Inference for matched pairs is hardly new, because generally we combine the values of two quantitative variables to form one new quantitative variable, and then we apply our inference procedures exactly as before. The most common way to combine the variables is simply to take the difference between the two variables.
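A minimal sketch of this combining step in R, using the oats data from the example that follows:

    # Matched pairs: analyze the differences with one-sample methods.
    A <- c(71.2, 72.6, 47.8, 76.9, 42.5, 49.6, 62.8)   # variety A yields
    B <- c(65.2, 60.7, 42.8, 73.0, 41.7, 56.6, 57.3)   # variety B yields
    t.test(A - B, alternative = "greater")             # one-sample t on differences
    t.test(A, B, paired = TRUE, alternative = "greater")   # equivalent paired t-test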
Example. Two varieties of oats were compared in an experiment to determine which variety had the higher yield. Since soil type also affects yield, the experimenter blocked out its effect by planting each variety of oats in seven different types of soil. With the data paired by soil type as given below, does it appear that variety A has the higher mean yield?

  Yield
  Soil type     1     2     3     4     5     6     7
  A           71.2  72.6  47.8  76.9  42.5  49.6  62.8
  B           65.2  60.7  42.8  73.0  41.7  56.6  57.3
  x = A − B    6.0  11.9   5.0   3.9   0.8  −7.0   5.5

  mean(x) = 3.7286,  sd(x) = 5.7792

H0:
Ha:
SE =
p-value =
95% CI for difference:

Example. To test the effect of continuous music on factory workers' output, each of 7 workers was observed for one month with music and one month without. Given the following results, does music help? Find a 95% confidence interval for the mean difference in output.

  Worker                       1     2     3     4     5     6     7
  Average output with music   8.4   5.0   7.2   6.6   6.5   8.7   5.9
  Average output w/o music    7.4   6.1   8.0   6.4   6.8   8.8   6.6
  Diff                        1.0  −1.1  −0.8   0.2  −0.3  −0.1  −0.7

  mean(Diff) = −.257,  sd(Diff) = .709

Comparing Two Means

A two-sample problem is one in which:

1.
2.

The difference between two-sample problems and matched pairs problems is ________.

Graphical exploration

We can begin to look at the difference between our two groups using graphs. The following graphs are useful for this:

1.
2.
3.

What is the sampling distribution?

In a two-sample problem we want to compare µ1 with µ2. We do this by drawing a sample from each population and calculating x̄1 and x̄2. Assuming each population has a normal distribution with mean µi and standard deviation σi, we already know that

  x̄1 ∼ N(µ1, σ1/√n1)  and  x̄2 ∼ N(µ2, σ2/√n2)

In order to work with two samples, we need to know the sampling distribution of ________. In order to determine that distribution, we will take a little detour. . .

Mean and Standard Deviation of Combinations of Random Variables

There are four rules that can help us compute means and variances of combinations of random variables. A fifth rule is an especially important property of normal distributions. These will be especially useful when we try to understand sampling distributions.

Let X and Y be random variables and let a and b be numbers. Then

1. µ(aX+b) = a µX + b
2. µ(X+Y) = µX + µY
3. σ(aX+b) = |a| σX
4. If X and Y are INDEPENDENT, σ²(X+Y) = σ²X + σ²Y. (Note: this rule uses variances.)
5. If X and Y are normally distributed and independent, then so are aX + b and X + Y.

Example. Do tall men and women tend to marry each other? Or are the heights of spouses independent? If the heights were independent, we could use the rules above to determine the probability that the husband is taller than the wife, P(M − W > 0).

Applying the rules to our situation

Using our rules for combining means and variances, we see that the distribution of x̄1 − x̄2

•
•
•

Of course, we won't usually know σ1 and σ2, so we need to estimate SD using ________ and ________. The obvious estimate to try is

• SE =

October 20, 2006
Pop Quiz

1. Charlie and Lucy are working on a statistics project and collect some data together on how fast their dog Snoopy can run an obstacle course in their back yard. Using the same data, Charlie makes a 90% confidence interval, Lucy makes a 98% confidence interval, and Snoopy makes a 95% confidence interval. Whose confidence interval will be widest?
   a) Charlie's   b) Lucy's   c) Snoopy's   d) can't tell without more information
2. The 95% confidence interval for the mean time (in seconds) it takes Snoopy to run the course is (32, 40). Sally performs a hypothesis test using this same data. Her null hypothesis is that Snoopy can run the course in 30 seconds. What can you say about the p-value of her test?
   (a) it is at least 0.2   (b) it is between 0.1 and 0.2   (c) it is between 0.05 and 0.1   (d) it is below 0.05   (e) can't tell without more information

We have discussed three different variations on the t-procedures theme. (Or maybe it is one theme and two variations.)

A) One-Sample t
B) Matched Pairs t
C) Two-Sample t

In each item below, a study design is outlined. For each design, answer A, B, C, or D (none of the above) to indicate which of the variations above is appropriate.

3. Choose 40 romantically attached couples in their mid-twenties. Interview the man and the woman separately and ask them about their romantic attachments at age 15 or 16. Compare the attitudes of men and women.
4. Choose a random sample of college students. On a questionnaire, ask students if they are smokers and what their high school GPA was. Compare the GPAs of smokers and non-smokers.
5. Use random-digit dialing to contact a number of Michigan voters. Record their gender and their preference for governor in the upcoming election. Compare the responses of men to those of women.
6. To check a new lab method, obtain a reference sample from NIST. Measure the concentration in the specimen 20 times with the new method and compare the mean concentration to the concentration claimed by NIST.
7. To check a new lab method, prepare a specimen and measure the concentration in the specimen 20 times, ten times using the new method and ten times using an old method. Compare the mean concentration measurement of the new method to that of the old method.

The Two-Sample t-Procedures

If we knew σ1 and σ2, then the standard deviation of the sampling distribution would be

  SD = √(σ1²/n1 + σ2²/n2)

When we do not know σ1 and σ2 (the usual case), we simply replace each σ with s, so

  SE = √(s1²/n1 + s2²/n2)

Unfortunately the t statistic computed from this,

  t = ((x̄1 − x̄2) − (µ1 − µ2)) / SE,

does not have a t-distribution as we might have hoped, even if the two populations are normal. Why? A t-distribution replaces a N(0, 1) distribution only when a single population standard deviation σ is replaced by a single sample standard deviation s. In this case, we replaced two standard deviations (σ1 and σ2) with their estimates (s1 and s2), and we will get different distributions depending on the relationship between σ1 and σ2.

The good news is that the distribution is almost a t-distribution. Usually it is close enough to use in practice. The only tricky part is coming up with the correct degrees of freedom for the approximation. There are two ways to estimate the degrees of freedom:

1.
2.

Method 1 is typically used by computer software but is a bit messy to do by hand. Furthermore, the simpler method is conservative: the value it gives for df is always on the low side, which means that it will give CIs that are a bit ________ and p-values that are a bit ________.

Putting it all together

Now that we know the distribution involved and a value for SE, we are all set to do hypothesis testing or to compute CIs.

• CI: (x̄1 − x̄2) ± t∗ SE.
• Hypothesis tests: t = ((x̄1 − x̄2) − (µ1 − µ2)) / SE,

where SE is as above.
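A sketch of these formulas in R, using the summary statistics from Example 1 below; taking df = min(n1, n2) − 1 is an assumption here, reflecting the conservative "simpler method" mentioned above:

    # Two-sample t statistic from summary statistics.
    n1 <- 48; xbar1 <- 24.4; s1 <- 4.8    # treated plants
    n2 <- 45; xbar2 <- 22.3; s2 <- 2.3    # untreated plants
    SE <- sqrt(s1^2/n1 + s2^2/n2)
    t <- (xbar1 - xbar2) / SE
    pt(t, df = min(n1, n2) - 1, lower.tail = FALSE)   # one-sided p-value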
Examples

Example 1. An agronomist has developed a new plant food. She hopes it will improve yield. To find out, she treated 48 plants with the new food and obtained a mean yield of 24.4 lbs (s = 4.8 lbs). 45 identical plants were untreated and had a mean yield of 22.3 lbs (s = 2.3 lbs). Do the data provide sufficient evidence to conclude that the new plant food is better than no treatment?

Example 2. The weather bureau measured the ozone level at 5 random locations in Orange City before a cool front moved through and at 5 different random locations afterward. Test whether there is a significant drop in the ozone level after the front has moved through.

  Time           n    mean    variance
  Before front   5   0.122     .00067
  After front    5   0.094     .00016

How would the design of this study need to be changed to make it a matched pairs situation? Which design would be better? Why?

Example 3. The scores of two groups of prison inmates on a rehabilitation test are summarized below:

                First offenders   Repeat offenders
  Mean                300                305
  Variance             20                  4
  Sample size          16                 13

October 23, 2006
Robustness of Two-Sample Procedures

The two-sample t procedures are actually somewhat more robust than the one-sample t procedures.

• When the sizes of the two samples are equal and the distributions of the two populations being compared have similar shapes, p-values from the t-distributions are quite accurate for a broad range of distributions, even when the sample sizes are as small as n1 = n2 = 5.
• When the two population distributions have different shapes, larger samples are needed.
• The guidelines we gave before (sample sizes < 15, 15–40, ≥ 40) can be adapted to two-sample procedures by replacing the one-sample sample size with the sum of the sample sizes n1 + n2.

Using Rcmdr for Two-Sample Problems

Manipulating Data

In order to do a 2-sample t-test in Rcmdr, you need to have the data in the correct format. In particular, the data of interest must be in two columns (variables):

• a quantitative variable (for the means being compared)
• a categorical variable identifying the two groups

If you don't have data in the correct format, Rcmdr will not allow you to use the procedure you are after (it will be grayed out in the menu). Data sets that come in other formats need to be converted first.

1. Converting from quantitative to categorical. Sometimes the grouping variable is given as numerical values (1 or 2, for example). This is the case in dataset ex02_19, for example. We dealt with this problem earlier in the semester when doing side-by-side boxplots. To convert the Group variable to a factor (R's name for categorical variables), use Data > Manage variables. . . > Convert numeric variables to factors. . . . The instructions are pretty clear from that point; just select the variable you want to convert. You will be given a chance to name the variable, and you can choose either to use the numbers or to enter more meaningful descriptions of the groups.
2. Stacking data. Sometimes the data for the two groups are given in separate columns and there is no separate variable for the groups. We can get this sort of data into our desired format using Data > Active data set > Stack variables . . . An example of this is data set ex17_07, where the changes in polyphenol levels for two groups of subjects are listed in separate columns. To choose the two columns, use <ctrl>-click to select both Red and White from the list of variables.
   A new dataframe will be made, and you can name it and the resulting quantitative and categorical variables however you like. (The default names are not very meaningful.)
3. Computing a new variable. Sometimes it is useful to calculate your own (new) variable from other variables – for example, if you want the difference between, or ratio of, values in two columns that already exist. Do this using Data > Manage . . . > Recode variables . . .

Graphs

Useful graphs include:

• histograms (lattice histograms will allow you to make separate histograms for each group).
• side-by-side boxplots (both boxplot and lattice bwplot allow for this).
• stemplots (but there is no option in Rcmdr to do side-by-side or back-to-back stemplots, so this is less useful for two-sample tests).

Hypothesis Testing and Confidence Intervals

Statistical procedures are located in the Statistics menu. Since we are interested in comparing the means of two groups, we look in Statistics > Means > Independent samples t-test. Then we must provide the grouping variable and the response variable. Remember, if your data are not in the right shape, the Independent samples t-test menu will not be available.

Pooled Two-Sample Procedures

You will notice in Rcmdr that when you do two-sample t-tests there is a check box for the question "Assume equal variances?" The default is "No" and matches the procedures we have been describing. If you choose "Yes" you will get the so-called "pooled two-sample t-test".

If the two population distributions have the same variance, there is a better way to compute SE, and the resulting t statistic has exactly a t-distribution (when the populations are normal). The new formula for SE combines, or pools, data from both samples to estimate the common variance. That's where the name comes from. Before the days of easy computer access, this procedure was used much more commonly than it is now. We will generally avoid the "pooled" version because

• it is not much more powerful than the version that does not assume equal variances, and
• it is not robust against violations of the assumption that the variances are equal.

So we should only use the pooled version if we are quite sure that the population variances are indeed the same (or very close). But this can be difficult to know.

November 9, 2006
Inference with Two Categorical Variables

Two-Way Tables

To look for an association between two categorical variables, we first put the data into a table (called a two-way table). The values inside the table describe the joint distribution of the two variables. In the example below, right-handed men and women had their feet measured to see which foot (if any) was larger. The two categorical variables are gender (Male or Female) and foot comparison (left foot larger, both feet the same, or right foot larger). The joint distribution is displayed in the following table.

           LBig   Equal   RBig
  Men        2      10     28
  Women     55      18     14

In addition to the joint distribution, there are two other kinds of distributions that are of interest in this situation.

1. The marginal distributions give the distribution of one variable ignoring the other variable altogether. It is common to put this information in the "margin" of the table; that's where the name comes from. Here is our data set with the marginal distributions added.
           LBig   Equal   RBig   Total
  Men        2      10     28      40
  Women     55      18     14      87
  Total     57      28     42     127

   For the marginal distribution of gender, we see that 40 of the 127 subjects were men. For the marginal distribution of foot comparison, we see that 57 subjects had larger left feet, 28 had equally sized feet, and 42 had larger right feet.
2. The conditional distributions give the distribution of one variable for a fixed value of the other variable. For example, the conditional distribution of foot comparison among the men (i.e., conditional on being a man) appears in the first row of the table (2 with larger left feet, 10 with equally sized feet, and 28 with larger right feet, out of 40 men). Of course, each column also represents a conditional distribution (this time conditional on a particular foot comparison).

Important idea: Conditional distributions are found by looking at rows or columns of a two-way table, because a row or column represents selecting just one possible value for one variable and looking at the distribution of the other variable in that situation.

Often it is more useful to express the joint, marginal, and conditional distributions using percentages or proportions rather than counts. This especially makes it easier to compare them when the total counts differ. Here are the joint and marginal distributions again, described with percentages.

           LBig   Equal   RBig   Total
  Men       1.6     7.9   22.0    31.5
  Women    43.3    14.2   11.0    68.5
  Total    44.9    22.0   33.1   100.0

Of more interest to us are the conditional probabilities. We would like to know, for example, how the percentage of right-handed men with larger left feet compares to the percentage of right-handed women with larger left feet. We can do this by calculating tables of row- and column-percents (or proportions). The tables below show percentages.

  Row percents:                              Column percents:
           LBig   Equal   RBig   Total               LBig   Equal   RBig
  Men       5.0    25.0   70.0   100.0       Men      3.5    35.7   66.7
  Women    63.2    20.7   16.1   100.0       Women   96.5    64.3   33.3
                                             Total  100.0   100.0  100.0

Exercise 1. Express the percentages in the previous two tables using conditional probability notation. [P(A | B)]

Exercise 2. Answer the following questions referring to the previous two tables.

a) One of these tables is much more informative than the other. Which one, and why?
b) Informally/intuitively, what do these data suggest?
c) Can we be sure that this holds for the population as well as for our sample? Why or why not?
d) Do you think that this might hold for the population as well as for our sample? Why or why not?

The Chi-Square Test

The hypothesis test that we use in this situation is called the Pearson chi-square test. It follows the same outline as our other hypothesis tests, but we have some details to fill in.

1. State the null and alternative hypotheses. Details needed:
2. Compute the test statistic. Details needed:
3. Find the p-value. Details needed:
4. Draw a conclusion. Details needed:

Step 1: State the Hypotheses

Chi-square tests are generally used to answer one of two questions (or, the same question phrased two ways):

1. Does the percent in a certain category change from population to population? Example: Does the percent of deaths from heart disease change from country to country? Or we could say, is there an association between the percent of deaths from heart disease and which country a person is from?
2. Are these two categorical variables independent?
The Chi-Square Test

The hypothesis test that we use in this situation is called the Pearson chi-square test. It follows the same outline as our other hypothesis tests, but we have some details to fill in.

1. State the null and alternative hypotheses. Details needed:
2. Compute the test statistic. Details needed:
3. Find the p-value. Details needed:
4. Draw a conclusion. Details needed:

Step 1: State the Hypotheses

Chi-square tests are generally used to answer one of two questions (or, the same question phrased two ways):

1. Does the percent in a certain category change from population to population?
   Example: Does the percent of deaths from heart disease change from country to country? Or we could say, is there an association between the percent of deaths from heart disease and which country a person is from?
2. Are these two categorical variables independent?
   Example: Is alcohol consumption associated with oesophageal cancer?

In part the difference has to do with how the data were collected. If we independently sample from multiple populations, question 1 often fits the situation better. If we sample from one population and categorize the subjects in two ways, question 2 often fits better. So our null and alternative hypotheses are
• H0 :
• Ha :

Step 2: Compute the Test Statistic

Here is our table again. We want to measure how close these numbers are to what we expect if the null hypothesis is true. To answer that, we first need to ask "What do we expect?"

Observed Counts
           Men   Women   Total
  LBig       2      55      57
  Equal     10      18      28
  RBig      28      14      42
  Total     40      87     127

Exercise 3. The top left cell contains the number 2. What would we expect if there is no association between gender and foot comparison? Fill in the table below with the expected counts for each cell.

Expected Counts
           Men   Women
  LBig
  Equal
  RBig

We still need a way to compare the tables of observed and expected counts and to summarize the differences in a single number, our test statistic:

  X² = Σ (observed − expected)² / expected = 45.00

Step 3: Compute the p-value

So how big is 45? Is it unusually big? About what we would expect? The p-value will answer that for us. But to get that answer we first need to answer a different question, namely:

November 10, 2006    Math 143 – Fall 2006

Example 1. A random sample of high school and college students in 1973 was classified according to political views and frequency of marijuana use. The following two-way table summarizes the data.

                         Marijuana Use
                 Never   Rarely   Frequently
  Liberal          479      173          119
  Conservative     214       47           15
  Other            172       45           85

Example 2. A number of school children (elementary and middle) were asked which of three things they thought was most important to them: Grades, Popularity, or Sports. Here are the results.

            boy   girl
  Grades     27     20
  Popular    11     14
  Sports      9      3
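R Note: R's chisq.test function carries out all of these steps. A minimal sketch, reusing the feet matrix entered earlier:

  test <- chisq.test(feet)
  test$expected   # the expected counts (compare with Exercise 3)
  test            # X-squared = 45.0, df = 2, p-value far below 0.0001

(For tables larger than 2 × 2, chisq.test applies no continuity correction, so this matches the hand computation above.)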
November 13, 2006    Math 143 – Fall 2006

Review: We are looking at methods for investigating two or more variables at once.
• bivariate:
• multivariate:

The statistical procedures used depend upon the kinds of variables involved (categorical or quantitative):
• Chi-Square deals with
• Correlation/Regression deals with
• ANOVA deals with

Actually, each of these methods can be extended to deal with multivariate situations. Sometimes additional descriptions are used to distinguish the simpler bivariate versions from their related multivariate versions:

  Bivariate                        Multivariate
  Chi-square for Two-way Tables    Chi-square for Three-way Tables, etc.
  Simple (Linear) Regression       Multiple (Linear) Regression
  One-way ANOVA                    Two-way ANOVA

We will focus our attention on the bivariate cases, but will talk a little about the multivariate cases.

We use regression and/or correlation when we have two quantitative variables and want to see if there is an association between these two variables. Two variables (reflecting two measurements for the same individual) are associated if

Here's the plan:
1. Start with graphical displays.
2. Move on to numerical summaries.
3. Then look for patterns and deviations from those patterns and use a mathematical model to describe regular patterns.
4. Finally, use statistical inference to draw conclusions about the relationship between the two quantitative variables in the population from their relationship in a sample.

Scatter Plots

Scatter plots are a graphical display of the relationship between two quantitative variables.
• one dot per individual
• explanatory variable (x) on the horizontal axis, response variable (y) on the vertical axis

[Six small example scatter plots, labeled A through F, omitted.]

As usual, when looking at scatter plots, we are looking for the overall pattern and any striking deviations from that pattern (e.g., outliers). The pattern in a scatter plot can often be summarized in terms of
• Form
• Strength
• Direction
  – Positive association:
  – Negative association:

If you have more than one group, points may be plotted with different colors or symbols to see if group affects the relationship between the two continuous variables.

The Correlation Coefficient: r

Our goal is to come up with a number that measures

The correlation coefficient is defined as

  r = (1 / (n − 1)) Σ [ (xi − x̄) / sx ] [ (yi − ȳ) / sy ]

That is, it is the sum of products of z-scores, scaled by the size of the data set (n − 1). Let's see how to use this to compute the correlation coefficient from data. Then we will figure out how this number describes the strength and direction of a linear association between two variables.

Golf Scores Example

Here is a scatter plot for the golf scores of 12 players in two rounds. The first plot uses the raw scores, the second uses z-scores. Notice that the shapes of the two plots are identical.

[Two scatter plots omitted: round2 vs. round1, and z2 vs. z1.]

The standard deviation was 7.83 for round 1 and 7.84 for round 2. We can use this to calculate the z-scores for each value and then the correlation coefficient:

  round1   round2      z1      z2   z1*z2
      89       94   -0.09    0.77   -0.07
      90       85    0.04   -0.38   -0.02
      87       89   -0.34    0.13   -0.04
      95       89    0.68    0.13    0.09
      86       81   -0.47   -0.89    0.42
      81       76   -1.11   -1.53    1.69
     102      107    1.57    2.42    3.82
     105       89    1.96    0.13    0.25
      83       87   -0.85   -0.13    0.11
      88       91   -0.21    0.38   -0.08
      91       88    0.17    0.00    0.00
      79       80   -1.36   -1.02    1.39
  ======   ======   =====   =====   =====
    1076     1056    0.00    0.00    7.56

  mean(round1) = 89.667    sd(round1) = 7.83
  mean(round2) = 88        sd(round2) = 7.84

The correlation coefficient is

  r = (1/11)(7.56) = 0.687

since there are 12 pairs of data values and 11 = 12 − 1.

Here is the R regression output for this example:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  26.3320    20.6949   1.272   0.2320
round1        0.6877     0.2300   2.990   0.0136 *
---
Residual standard error: 5.974 on 10 degrees of freedom
Multiple R-Squared: 0.4721,   Adjusted R-squared: 0.4193
F-statistic: 8.942 on 1 and 10 DF,  p-value: 0.01357

Notice that r is not given directly, but r² is given (it's called R-Squared). There is also a lot of other stuff in the output. We will learn what those other numbers mean a little later.
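R Note: The same computation takes only a few lines in R. A quick sketch, entering the scores directly:

  round1 <- c(89, 90, 87, 95, 86, 81, 102, 105, 83, 88, 91, 79)
  round2 <- c(94, 85, 89, 89, 81, 76, 107, 89, 87, 91, 88, 80)
  z1 <- (round1 - mean(round1)) / sd(round1)   # z-scores for round 1
  z2 <- (round2 - mean(round2)) / sd(round2)
  sum(z1 * z2) / (length(round1) - 1)          # 0.687, as in the hand computation
  cor(round1, round2)                          # the built-in function gives the same value

In practice we just use cor (or read r² off the regression output).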
How Correlation Works

So how does this correlation coefficient work? Let's think first about when it will be positive and when it will be negative. The correlation coefficient will be positive if we have lots of positive products and few negative products in our sum. Positive products occur when both z-scores are positive or both are negative.

So a positive correlation coefficient indicates

Similarly, a negative correlation coefficient indicates

On the other hand, the correlation coefficient will be near zero when

Properties of r:
1.
2.
3.
4.

Another Example

Here is a scatter plot for two scores (fake data):

[Scatter plot omitted: score2 (vertical axis, roughly 40 to 80) vs. score1 (horizontal axis, roughly 36 to 66).]

The standard deviation of score1 is 9.89 and of score2 is 15.85. We can use this to calculate the z-scores and then the correlation coefficient as before:

  score1   score2      z1      z2   z1*z2
      46       72   -0.50    0.35   -0.18
      47       69   -0.39    0.16   -0.06
      52       72    0.11    0.35    0.04
      44       39   -0.70   -1.73    1.21
      55       67    0.41    0.04    0.02
      67       87    1.63    1.30    2.12
      49       62   -0.19   -0.28    0.05
      42       51   -0.90   -0.97    0.87
      39       54   -1.20   -0.78    0.94
      68       91    1.73    1.55    2.68
  ======   ======   =====   =====   =====
     509      664    0.00    0.00    7.69

The correlation coefficient is

  r = (1/9)(7.69) = 0.854

Here is some Minitab regression output for this example. Again, there are things here we haven't learned about yet, but you should be able to identify the correlation coefficient.

The regression equation is
score2 = -3.25020 + 1.36837 score1

S = 8.73901   R-Sq = 73.0 %   R-Sq(adj) = 69.6 %

Analysis of Variance
Source       DF        SS        MS         F       P
Regression    1   1649.44   1649.44   21.5979   0.002
Error         8    610.96     76.37
Total         9   2260.40
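R Note: To check this with R rather than by hand, a quick sketch:

  score1 <- c(46, 47, 52, 44, 55, 67, 49, 42, 39, 68)
  score2 <- c(72, 69, 72, 39, 67, 87, 62, 51, 54, 91)
  cor(score1, score2)      # about 0.854
  cor(score1, score2)^2    # about 0.73 -- the R-Sq in the Minitab output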
November 16, 2006    Math 143 – Fall 2006

Regression Lines

Correlation measures the direction and strength of a linear relationship. If we want to go beyond that and draw a line to show the relationship more specifically, we need linear regression. A regression line describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Note: when we did correlation, we did not need to be careful about explanatory and response variables, but for regression we must always identify one variable as the independent (explanatory) variable and the other as the dependent (response) variable.

Review of Lines

Slope
• Slope can be computed from any two points on a line.
• Definition: slope =
• We get the same number for slope no matter which two points we pick.

Example 1. (1, 3) and (3, 8) are on a line. What is the slope of the line?

Example 2. (1, 3) is on a line that has slope 2. If the x-coordinate of another point on the line is 3, what is the y-coordinate?

How do we determine y if we know x? All non-vertical lines can be described by an equation of the form

  y = (slope)x + (intercept)
  y = mx + b
  y = a + bx
  ŷ = a + bx

and we can determine the equation above if we know either
• the slope and one point, or
• two points.

Best Fit Lines

Now back to regression lines. Suppose I make a scatter plot and think a line would be a good way to describe the overall relationship. Which line should I use? The idea behind regression is to use the line that                    . The full name of the regression line is the least-squares regression line of y on x. The regression line is chosen to make the sum of the squared errors in predicting y values from x values according to the line as small as possible. These errors are usually called residuals:

  residual = observed y − predicted y = y − ŷ

Picture:

We will have more to say about residuals shortly. It is an interesting mathematical problem to determine which line minimizes the sum of squared residuals. It turns out to be a straightforward application of calculus. Amazingly, it is actually quite easy to determine the regression line from data. We will describe the regression line by giving one point and the slope. From that we can get the equation.

• The point                    is always on the regression line.
• The slope of the regression line is b =

Example 3. (golf scores data) r = 0.687, x̄ = 89.667, sx = 7.83, ȳ = 88, sy = 7.84. Find an equation for the regression line.

Example 4. Making Predictions. Use the regression line to predict the second-round scores of golfers who score 80, 90, and 105 in the first round.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  26.3320    20.6949   1.272   0.2320
round1        0.6877     0.2300   2.990   0.0136 *
---
Residual standard error: 5.974 on 10 degrees of freedom
Multiple R-Squared: 0.4721,   Adjusted R-squared: 0.4193
F-statistic: 8.942 on 1 and 10 DF,  p-value: 0.01357

[Scatter plot of round2 vs. round1 omitted.]

To sketch a line: make two predictions, plot them, and connect them. (You will be more accurate if you make predictions that are farther apart.) Go back and add the regression line to the scatter plot.

How well does the regression line fit?

A regression line can be made from any set of data. But some data sets are not very well described by the regression line. We want to develop some tools to help us measure the fit. The regression line is a mathematical model. It describes the relationship between x and y, but not perfectly. The vertical distances between the data points and the regression line are the errors of the model. (Remember: residual = observed − predicted.) The errors are called residuals because they represent the leftover (or residual) information that the model fails to predict. The residuals can help us assess how well the line describes the data.

Example 4 (continued). Find the missing residuals in the golf scores example.

        round 1   round 2   predicted round 2   residual
    1     89.00     94.00               87.54
    2     90.00     85.00               88.23
    3     87.00     89.00               86.17       2.83
    4     95.00     89.00               91.67      -2.67
    5     86.00     81.00               85.48      -4.48
    6     81.00     76.00               82.04      -6.04
    7    102.00    107.00               96.48      10.52
    8    105.00     89.00               98.55
    9     83.00     87.00               83.42       3.58
   10     88.00     91.00               86.85       4.15
   11     91.00     88.00               88.92      -0.92
   12     79.00     80.00               80.66      -0.66

What is the sum of all the residuals?

Residual Plots

Plots of residuals can be useful for detecting problems with a linear fit. Here is an example plotting the residuals against the round 1 scores:

[Residual plot omitted: resid (roughly -10 to 10) vs. round1 (80 to 105).]

Once we have a plot of residuals, we look for patterns. If the fit is good, we should see                    and                    when we look at residual plots.

Sketches of residual plots and what they indicate:
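R Note: If you want to reproduce the fitting, predicting, and residuals yourself, a sketch using the round1 and round2 vectors entered earlier:

  golf.fit <- lm(round2 ~ round1)
  coef(golf.fit)                  # intercept 26.33 and slope 0.688, as in the output above
  predict(golf.fit, newdata = data.frame(round1 = c(80, 90, 105)))
  resid(golf.fit)                 # the residuals; their sum is 0 (up to rounding)
  plot(round1, resid(golf.fit))   # a residual plot like the one above
  abline(h = 0)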
November 17, 2006    Math 143 – Fall 2006

More Regression Examples

Regression is pretty tedious to do by hand, so let's do some more examples using R. In each example below, do the following:

1. Make a scatter plot and decide whether a linear association seems appropriate. You can use either Graphs > xyplot or Graphs > scatterplot.
2. If there does not seem to be a linear relationship, try transforming one of the variables before continuing. Often transforming one or both variables using logarithms or square roots will lead to a relationship that is much closer to linear. Use Data > Manage. . . > Compute new variable. . . to do the transformation.
3. Guess the correlation coefficient. (We'll check your guess in a moment.)
4. Determine an equation for the regression line and the value of the correlation coefficient. Make a prediction or two using your regression equation. Use Statistics > Fit models > Linear regression. Can you find the information you need in the output?
5. Make a residual plot and note anything interesting you see there. You can add a bunch of information to your data set using Models > Add observation statistics to data. There are lots of things you can compute; you can keep the defaults if you like, or just choose the fitted values and the residuals. Once you have added these to your data set, you can make a residual plot by making a scatter plot of your x variable against the new residuals variable.

These examples use data from the Devore6 package. Load that package using Tools > Load package(s).

Example 1. A wastewater treatment plant filters the air discharged from the plant. Does the efficiency of the filter depend on the temperature? ex12.82 in the Devore6 package gives measurements of temperature (in Celsius) and efficiency (percentage of contaminants removed) at 32 different times.

Example 2. In order to make the perfect tortilla chip, a company does an experiment to see how frying time (in seconds, x in the data set) affects moisture content (as a percent, y in the data set). After all, we don't want those chips to be soggy – or too crispy either. The data are in ex13.15 of Devore6.

Example 3. Hydrostatic weighing is the customary way to determine body fat percentages, but it is cumbersome. A new method, called the Bod Pod, was developed. A number of football players were measured using both methods to see if the simpler Bod Pod method could be used in place of hydrostatic weighing. The measurements by both methods are given in ex12.84 of Devore6.

General Cautions about Correlation and Regression

1. Correlation only measures linear association. So we should always
2. Extrapolation (predicting outside the range of x's in the data) is dangerous.
3. Correlation and regression are not resistant (they can be greatly affected by outliers). Influential observations:
4. Association does not imply causation.

And of course, we still need to learn how to improve our estimates by forming confidence intervals or conducting hypothesis tests. (Stay tuned.)

Anscombe's Data

Anscombe created a data set with four x-variables and four y-variables to demonstrate the dangers of looking only at the numerical results of regression and ignoring the pictures.

1. Load this data set from the datasets package. It is called anscombe. (It comes after all the things that begin with capital letters.)
2. Without looking at the graphs first, for each pair (x1 with y1, x2 with y2, x3 with y3, x4 with y4), look at the numerical regression output. What do you notice?
3. Now go back and make the scatterplots. What happened?
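R Note: If you would rather work in the R console than in the menus, a sketch of the same exploration:

  data(anscombe)
  # Fit all four regressions; the numerical results are nearly identical
  for (i in 1:4) {
    f <- as.formula(paste0("y", i, " ~ x", i))
    print(coef(lm(f, data = anscombe)))
  }
  cor(anscombe$x1, anscombe$y1)   # compare with the other three pairs
  plot(anscombe$x1, anscombe$y1)  # now look at the pictures, one pair at a time

Each fit comes out close to ŷ = 3 + 0.5x, and each r is close to 0.82, even though the four scatterplots look completely different.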
November 20, 2006    Math 143 – Fall 2006

Inference for Regression

Simple linear regression is a set of procedures for dealing with data that have two quantitative values for each individual in the data set. We will call these x (explanatory) and y (response). The goal will be to use x to predict y.

Example 1. A tax consultant studied the current relation between the selling price and the assessed valuation of one-family dwellings in a large tax district. He obtained a random sample of recent sales of one-family dwellings and, after plotting the data, found a regression line of (measuring in thousands of dollars):

  (selling price) = 4.98 + 2.40 (assessed valuation)

At the same time, a second tax consultant obtained a second random sample of recent sales of one-family dwellings and, after plotting the data, found a regression line of:

  (selling price) = 1.89 + 2.63 (assessed valuation)

Both consultants attempted to model the "true" linear relationship they believed to exist between price and valuation. The regression lines they found were sample estimates of the "true" (but unknown) population relationship between selling price (y) and assessed valuation (x). But each one came up with a different result.

The fact that different samples lead to different regression lines tells us that unless we know something about the accuracy of our regression line estimate, it really isn't of much use to us. So we need to learn about inference for regression.

The Regression Model

The model for simple linear regression is

  yi = α + β xi + εi ,    εi ∼ N(0, σ)
  (data = fit + error)

The assumptions for the linear regression model are
1.
2.
3.

Example 2. We want to predict SAT scores from ACT scores. We sample the scores of 60 students who have taken both tests. Here is a scatter plot of their test scores:

[Scatter plot omitted: SAT (roughly 400 to 1400) vs. ACT (roughly 10 to 30).]

Parameter Estimates

The simple regression model has 3 parameters: α, β, and σ. Given a sample from the population, we estimate these parameters with a, b, and s.

• a and b come from the regression line:
  – b = slope = r (sy / sx)
  – a = intercept. We can solve for a using the fact that the point (x̄, ȳ) is on the regression line.

• s = √[ Σ (residual)² / (n − 2) ], where residual = observed − predicted = yi − ŷi.

• In practice, the values of a, b, s, and r are calculated by software. You should be able to identify each in the output below (r² is given rather than r):

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    253.2       62.7    4.04  0.00016 ***
ACT             31.2        2.9   10.78  1.8e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Residual standard error: 105 on 58 degrees of freedom
Multiple R-Squared: 0.667,   Adjusted R-squared: 0.661
F-statistic: 116 on 1 and 58 DF,  p-value: 1.80e-15
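R Note: Output like this could be produced with something along these lines (assuming a data frame scores with columns SAT and ACT, the same names used in the predict examples later in these notes):

  fit <- lm(SAT ~ ACT, data = scores)
  summary(fit)   # a and b are in the Estimate column; s is the
                 # "Residual standard error"; r^2 is "Multiple R-Squared"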
Since the model is based on normal distributions and we don't know σ . . .

1. Regression is going to be sensitive to outliers. Outliers with especially large or small values of the independent variable are especially influential. Some computer packages will help you try to identify potential problems. Here is output from Minitab:

Unusual Observations
Obs    ACT      SAT      Fit   SE Fit   Residual   St Resid
 15   10.0    500.0    565.2     35.0      -65.2      -0.66  X
 21   32.0   1440.0   1251.8     34.2      188.2       1.90  X
 25    7.0    490.0    471.6     43.1       18.4       0.19  X
 47   21.0    420.0    908.5     13.5     -488.5      -4.70  R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

2. We can check if the model is reasonable by looking at our residuals:
(a) Histograms and normal quantile plots indicate overall normality. We are looking for a roughly bell-shaped histogram or a roughly linear normal quantile plot.
(b) Plots of
   • residuals vs. x, or
   • residuals vs. order, or
   • residuals vs. fit (note: fit = ŷ, the predicted value)
indicate whether the standard deviation appears to remain constant throughout. We are looking to NOT see any clear pattern in these plots. A pattern would indicate that something other than randomness is influencing the residuals.

3. We can do inference for α, β, etc. using the t distributions; we just need to know the corresponding SE and degrees of freedom.

  parameter                     SE                                               df
  α                             SEa = s √( 1/n + x̄² / Σ(xi − x̄)² )               n − 2
  β                             SEb = s / √( Σ(xi − x̄)² )                        n − 2
  μ̂ (prediction of mean)        SEμ̂ = s √( 1/n + (x* − x̄)² / Σ(xi − x̄)² )        n − 2
  ŷ (individual prediction)     SEŷ = s √( 1 + 1/n + (x* − x̄)² / Σ(xi − x̄)² )    n − 2

We won't ever compute these SEs by hand, but notice that they are made up of pieces that look familiar (square roots, n in the denominator, squares of differences from the mean, all the usual stuff). Furthermore, just by looking at the formulas, we can learn something about the behavior of the confidence intervals and hypothesis tests involved. SEa and SEb are easy to identify in the computer output.

We can also find the (two-sided) p-values for two hypothesis tests.

(a) H0 : α = 0. This is usually not the most interesting thing to know. Remember, the intercept tells the (mean) y value associated with an x value of 0. For many situations, this is not even a meaningful value.

(b) H0 : β = 0. This is much more interesting, for two reasons. First, the slope is often a very interesting parameter to know because                    . Second, this is a measure of how useful the model is for making predictions, because β =          only if                    .

Confidence Intervals for α and β

Of course, we can also give a confidence interval for α (usually not interesting) or β (usually interesting). The form for each will be

  estimate ± t* SE

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    253.2       62.7    4.04  0.00016 ***
ACT             31.2        2.9   10.78  1.8e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Residual standard error: 105 on 58 degrees of freedom
Multiple R-Squared: 0.667,   Adjusted R-squared: 0.661
F-statistic: 116 on 1 and 58 DF,  p-value: 1.80e-15

Rcmdr will even output the confidence intervals for you (Models > Confidence Intervals):

                 2.5 %    97.5 %
(Intercept) 127.74399 378.62603
ACT          25.41013  37.00138
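R Note: To see where these numbers come from, a quick sketch of the slope interval computed by hand as estimate ± t* SE, using the values from the output above:

  b <- 31.2; SEb <- 2.9
  tstar <- qt(0.975, df = 58)    # df = n - 2 = 58
  b + c(-1, 1) * tstar * SEb     # about (25.4, 37.0), matching the Rcmdr output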
Confidence Intervals vs. Prediction Intervals

Recall that our goal was to make predictions of y from x. As you would probably expect, such predictions will also be described using confidence intervals. Actually, there are two kinds of predictions:

1. Confidence intervals for the mean response.
2. Prediction intervals are confidence intervals for a                    .

Notice that for predictions (confidence intervals and prediction intervals), the standard errors depend on x* (the x value for which you want a prediction made), so it is not possible for the computer to tell you what the SE is until you decide what prediction you want to make.

R can do the whole thing for you, but there is no option in Rcmdr to do it. Here is how R makes prediction and confidence intervals for the SAT scores corresponding to an ACT score of 25:

> predict(lm(SAT~ACT,data=scores),newdata=data.frame(ACT=25),interval='prediction')
          fit     lwr      upr
[1,] 1033.329 820.563 1246.095
> predict(lm(SAT~ACT,data=scores),newdata=data.frame(ACT=25),interval='confidence')
          fit     lwr      upr
[1,] 1033.329 998.171 1068.487

Notice that the prediction interval is wider.

Example: Skin Thickness and Body Density

There are many reasons why one would like to know the fat content of a human body. The most accurate way to estimate this is by determining the body density (weight per unit volume). Since fat is less dense than other body tissue, a lower density indicates a higher relative fat content. Body density is difficult to measure directly (the standard method requires weighing the subject underwater), so scientists have looked for other measurements that can accurately predict body density. One such measurement we will call skinfold thickness; it is actually the logarithm of the sum of four skinfold thicknesses measured at different points on the body.

Use the R output below to answer the question: How well does skinfold thickness predict body density?

[Omitted: a scatter plot of density vs. skthick, and the four standard diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage).]

Residuals:
      Min        1Q    Median        3Q       Max
-0.018967 -0.005092 -0.000498  0.004949  0.023679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.16300    0.00656   177.3   <2e-16 ***
skthick     -0.06312    0.00414   -15.2   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.00854 on 90 degrees of freedom
Multiple R-Squared: 0.72,   Adjusted R-squared: 0.717
F-statistic: 232 on 1 and 90 DF,  p-value: <2e-16

November 27, 2006    Math 143 – Fall 2006

The basic ANOVA situation

• Two variables: one categorical, one quantitative.
• Main question: Do the means of the quantitative variable depend on which group (given by the categorical variable) the individual is in? Or are they all the same?
• If the categorical variable has only 2 values:
  – 2-sample t-test
• ANOVA allows for 3 or more groups (sub-populations).

Example 1. Treating Blisters.
• Subjects: 25 patients with blisters
• Treatments: Treatment A, Treatment B, Placebo (P)
• Measurement: number of days until the blisters heal
• Data [and means]:
  – A: 5, 6, 6, 7, 7, 8, 9, 10 [7.25]
  – B: 7, 7, 8, 9, 9, 10, 10, 11 [8.875]
  – P: 7, 9, 9, 10, 10, 10, 11, 12, 13 [10.11]

Question: Are these differences significant? Or would we expect differences this large just by random chance?
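R Note: Before any formal test, look at a picture. A sketch, entering the data by hand (the vector names are made up):

  days <- c(5, 6, 6, 7, 7, 8, 9, 10,
            7, 7, 8, 9, 9, 10, 10, 11,
            7, 9, 9, 10, 10, 10, 11, 12, 13)
  treatment <- factor(rep(c("A", "B", "P"), times = c(8, 8, 9)))
  boxplot(days ~ treatment)   # side-by-side boxplots like the figure below

We will reuse these two vectors when we fit the ANOVA model later in these notes.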
[Side-by-side boxplots of days by treatment (A, B, P) omitted.]

Whether the differences between the groups are significant depends on
•
•
•

Some notation for ANOVA

• n = number of individuals all together
• I = number of groups
• xij = value for individual j in group i
• x̄ = sample mean of the quantitative variable for the entire data set
• s = sample standard deviation of the quantitative variable for the entire data set

Group i has
• ni = number of individuals in group i
• x̄i = sample mean for group i
• si = sample standard deviation for group i

From our example (I = 3; 3 groups):
• n1 = 8, n2 = 8, n3 = 9
• x̄1 = 7.25, x̄2 = 8.875, x̄3 = 10.11, etc.

The ANOVA model

  xij = μi + εij ,    εij ∼ N(0, σ)
  (DATA = FIT + ERROR)

Relationship to regression:
• FIT gives a mean for each level of the categorical variable, but the means need not fall in a pattern as in regression.
• ANOVA can be thought of as regression with a categorical predictor.

Assumptions of this model. It is assumed that each group (subpopulation) . . .
• is normally distributed about its group mean (μi), and
• has the same standard deviation.

Checking the assumptions

• Equal variances check.
  – Rule of thumb:

  days    treatment    N     Mean    Median    StDev
          A            8    7.250     7.000    1.669
          B            8    8.875     9.000    1.458
          P            9   10.111    10.000    1.764

• Overall normality check:

[Normal Q-Q plot of the blister residuals omitted.]

What does ANOVA do?

At its simplest (there are extensions), ANOVA tests the following hypotheses:
• H0 : The means of all the groups are equal. (μ1 = μ2 = · · · = μI)
• Ha : Not all the means are equal.
  – This doesn't say how or which ones differ.
  – We can follow up with "multiple comparisons" if we reject H0.

A quick look at the ANOVA table:

              Df   Sum Sq   Mean Sq   F value   Pr(>F)
  treatment    2    34.74     17.37      6.45   0.0063
  Residuals   22    59.26      2.69
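R Note: This table comes from R's aov function. A sketch, using the days and treatment vectors entered earlier:

  blister.fit <- aov(days ~ treatment)
  summary(blister.fit)   # reproduces the table above: F = 6.45, p = 0.0063

(In Rcmdr, Statistics > Means > One-way ANOVA does the same thing.)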
How ANOVA works

ANOVA measures two sources of variation in the data and compares their relative sizes.

• Variation BETWEEN groups: for each data value, look at the difference between its group mean and the overall mean.

  Σ (group mean − overall mean)² = Σi Σj (x̄i − x̄)² = SSG

• Variation WITHIN groups: for each data value, look at the difference between that value and the mean of its group.

  Σ (individual measurement − group mean)² = Σi Σj (xij − x̄i)² = SSE

• SSG and SSE are then adjusted to account for the sample sizes and the number of groups:
  – DFG = I − 1 = degrees of freedom for the numerator
  – DFE = n − I = degrees of freedom for the denominator
  – MSG = SSG/DFG;   MSE = SSE/DFE

The ANOVA F-statistic is the ratio of the between-group variation (explained by FIT) to the within-group variation (Residuals):

  F = Between / Within = MSG / MSE

• A large value of F is evidence against H0, since it indicates that there is more difference between groups than within groups.
• If H0 is true (and the model assumptions are also true), then the sampling distribution of F has an F-distribution. (We'll talk about degrees of freedom for F shortly.)

Computing F (an example)

Let's try an even smaller example. Suppose we have three groups:

  Group 1: 5.3, 6.0, 6.7        [x̄1 = 6.00]
  Group 2: 5.5, 6.2, 6.4, 5.7   [x̄2 = 5.95]
  Group 3: 7.5, 7.2, 7.9        [x̄3 = 7.53]

The overall mean is 6.44, and

  F = MSG/MSE = 2.56/0.25 ≈ 10.22

The ANOVA table revisited

The traditional way to summarize all these calculations is in an ANOVA table (which matches the information above up to round-off error):

              Df   Sum Sq   Mean Sq   F value   Pr(>F)
  group        2     5.13      2.56     10.22   0.0084
  Residuals    7     1.76      0.25
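R Note: We can let R check the arithmetic. A sketch:

  x <- c(5.3, 6.0, 6.7,  5.5, 6.2, 6.4, 5.7,  7.5, 7.2, 7.9)
  g <- factor(rep(1:3, times = c(3, 4, 3)))
  summary(aov(x ~ g))   # Sum Sq 5.13 and 1.76, F about 10.22, p = 0.0084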
Now let's fill in the ANOVA table for our blister example:

              Df   Sum Sq   Mean Sq   F value   Pr(>F)
  treatment          34.74                      0.0063
  Residuals          59.26

SST, MST, and R²

Let's add another row to our ANOVA table for the blister example:

              Df   Sum Sq   Mean Sq   F value   Pr(>F)
  treatment    2    34.74     17.37      6.45   0.0063
  Residuals   22    59.26      2.69
  Total       24    94.00      3.92

  DFT = 24,   SST = 94,   MST = 3.92,   √MST = 1.98

> numSummary(blister[,"days"], statistics=c("mean", "sd", "quantiles"))
 mean       sd 0% 25% 50% 75% 100%  n
  8.8 1.979057  5   7   9  10   13 25

  R² = SSG/SST = the proportion of variation explained by the groups

ANOVA for Regression

We can also make ANOVA tables for regression. Here is the output from our example where we wanted to predict body density from skinfold thickness.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.16300    0.00656   177.3   <2e-16 ***
skthick     -0.06312    0.00414   -15.2   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.00854 on 90 degrees of freedom
Multiple R-Squared: 0.72,   Adjusted R-squared: 0.717
F-statistic: 232 on 1 and 90 DF,  p-value: <2e-16

Analysis of Variance Table

Response: density
           Df  Sum Sq Mean Sq F value Pr(>F)
skthick     1 0.01691 0.01691     232 <2e-16 ***
Residuals  90 0.00656 0.00007
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  R² = SSG/SST = the proportion of variation explained by the linear fit

November 28, 2006    Math 143 – Fall 2006

Attracting Bugs (Example 22.6 in BPS3)

The data set has two variables, Color (B, G, W, Y) and NumTrap, with six observations of each color. Here are numerical summaries, by color and overall:

  Color     N   0%    25%    50%    75%   100%       mean         sd
  B         6    7  11.75   15.0  19.00     21   14.83333   5.344779
  G         6   15  26.75   34.5  38.50     41   31.50000   9.914636
  W         6   12  13.25   15.5  17.00     21   15.66667   3.326660
  Y         6   38  45.25   46.5  47.75     59   47.16667   6.794606
  Overall  24    7  14.75   21.0  39.50     59   27.29167  14.947674

[Side-by-side boxplots of NumTrap by Color omitted.]

Here is the regression-style output for the model NumTrap ~ Color:

Residuals:
     Min       1Q   Median       3Q      Max
-16.5000  -2.9167   0.1667   5.2083  11.8333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  14.8333     2.7696   5.356 3.05e-05
Color[T.G]   16.6667     3.9168   4.255 0.000387
Color[T.W]    0.8333     3.9168   0.213 0.833671
Color[T.Y]   32.3333     3.9168   8.255 7.16e-08

Residual standard error: 6.784 on 20 degrees of freedom
Multiple R-Squared: 0.8209,   Adjusted R-squared: 0.794
F-statistic: 30.55 on 3 and 20 DF,  p-value: 1.151e-07

Analysis of Variance Table

Response: NumTrap
           Df Sum Sq Mean Sq F value    Pr(>F)
Color       3 4218.5  1406.2  30.552 1.151e-07
Residuals  20  920.5    46.0

Tukey multiple comparisons of means, 95% family-wise confidence level:

$Color
           diff         lwr        upr     p adj
G-B  16.6666667    5.703670  27.629663 0.0020222
W-B   0.8333333  -10.129663  11.796330 0.9964823
Y-B  32.3333333   21.370337  43.296330 0.0000004
W-G -15.8333333  -26.796330  -4.870337 0.0032835
Y-G  15.6666667    4.703670  26.629663 0.0036170
Y-W  31.5000000   20.537004  42.462996 0.0000006

[Plot of the six family-wise 95% confidence intervals for the differences in mean levels of Color omitted.]

Simultaneous 95% confidence intervals: Tukey contrasts for factor Color (95% quantile: 2.799)

Contrast matrix:
                 ColorB  ColorG  ColorW  ColorY
ColorG-ColorB        -1       1       0       0
ColorW-ColorB        -1       0       1       0
ColorY-ColorB        -1       0       0       1
ColorW-ColorG         0      -1       1       0
ColorY-ColorG         0      -1       0       1
ColorY-ColorW         0       0      -1       1

Coefficients:
                Estimate    2.5 %   97.5 %  t value  Std.Err.  p raw  p Bonf  p adj
ColorG-ColorB     16.667    5.704   27.630    4.255     3.917  0.000   0.002   0.002
ColorW-ColorB      0.833  -10.130   11.796    0.213     3.917  0.834   1.000   0.996
ColorY-ColorB     32.333   21.370   43.296    8.255     3.917  0.000   0.000   0.000
ColorW-ColorG    -15.833  -26.796   -4.870   -4.042     3.917  0.001   0.004   0.003
ColorY-ColorG     15.667    4.704   26.630    4.000     3.917  0.001   0.004   0.004
ColorY-ColorW     31.500   20.537   42.463    8.042     3.917  0.000   0.000   0.000
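R Note: All of the output above can be generated with a few commands. A sketch, assuming the data are in a data frame bugs with columns NumTrap and Color (the data frame name is made up):

  fit <- aov(NumTrap ~ Color, data = bugs)
  anova(fit)            # F = 30.55 on 3 and 20 df
  TukeyHSD(fit)         # all pairwise comparisons except W-B are significant
  plot(TukeyHSD(fit))   # the family-wise confidence interval plot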
More Examples

c 2006 Randall Pruim ([email protected])