Statistics for Engineers
A Very Brief Introduction

M. Stob

November 7, 2007

Preface

These notes are for the statistics portion of the course Mathematics 232, Engineering Mathematics, taught at Calvin College and required of all students in the engineering program. The prerequisites for the course include two semesters of calculus and a course in differential equations and linear algebra. Mathematics 232 includes three units: linear algebra, statistics, and vector calculus. It isn't possible to do justice to the topic of statistics in the five weeks that are given over to this topic in Mathematics 232. Nevertheless this course is intended to give at least a broad overview of the central questions in statistical analysis. One important feature of the approach of these notes is to integrate the use of an industry-standard statistical computer program from the very beginning. Not only does this give the student some familiarity with the tools used in the so-called "real world," it also allows us to move more quickly through the central notions of statistical analysis. I hope that all students in this course find statistics useful and interesting enough to learn more statistics sometime down the road.

These notes are not intended to be self-contained. First, they assume that the student has access to a basic introduction to the R computer language. I refer explicitly in the text to the very nice introduction SimpleR, written by John Verzani and available on the web at http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf. Second, the notes assume that the student comes to class, and no apologies are made for leaving out extended discussions from the notes of issues treated carefully in class. Finally, the problems must be completed independently by the reader to ensure that the concepts are clearly understood.

These notes have been written expressly for Mathematics 232. It seems unlikely that one would find them ideal for some other purpose. Nevertheless, the notes are freely available to anyone for their personal, non-commercial use (i.e., don't sell these notes - they're not worth buying anyway). These notes are part of a larger project to organize the teaching of statistics at Calvin College. That means, among other things, that much of the material in these notes is not original but has been shamelessly plagiarized from Foundations and Applications of Statistics by Randall Pruim, the text for Mathematics 343-344. In turn, these notes will also morph into part of the text for Mathematics 243 at Calvin.

This is the first edition of these notes. Thus errors, typographical and otherwise, abound. I encourage readers to communicate them to me at [email protected]. I take full responsibility for the errors, even the ones that I plagiarized from Pruim.

Contents

1 Introduction
2 Data
   2.1 Data - Basic Notions
   2.2 Graphical and Numerical Summaries of Univariate Data
       2.2.1 Graphical Summaries
       2.2.2 Measures of the Center of a Distribution
       2.2.3 Measures of Dispersion
   2.3 The Relationship Between Two Variables
   2.4 Describing a Linear Relationship Between Two Quantitative Variables
   2.5 Describing a Non-linear Relationship Between Two Variables
   2.6 Data - Samples
   2.7 Data - Experiments
   2.8 Exercises
3 Probability
   3.1 Modelling Uncertainty
   3.2 Discrete Random Variables
       3.2.1 Random Variables
       3.2.2 The Binomial Distribution
       3.2.3 The Hypergeometric Distribution
   3.3 Continuous Random Variables
       3.3.1 pdfs and cdfs
       3.3.2 Uniform Distributions
       3.3.3 Exponential Distributions
       3.3.4 Weibull Distributions
   3.4 Mean and Variance of a Random Variable
       3.4.1 The Mean of a Discrete Random Variable
       3.4.2 The Mean of a Continuous Random Variable
       3.4.3 Transformations of Random Variables
       3.4.4 The Variance of a Random Variable
   3.5 The Normal Distribution
   3.6 Exercises
4 Inference
   4.1 Hypothesis Testing
   4.2 Inferences about the Mean
   4.3 The t-Distribution
   4.4 Inferences for the Difference of Two Means
   4.5 Regression Inference
   4.6 Exercises

1 Introduction

Kellogg's makes Raisin Bran and packages it in boxes that are labeled "Net Weight: 20 ounces". How might we test this claim? It seems obvious that we need to actually weigh some boxes.
However we certainly cannot require that every box that we weigh contains exactly 20 ounces. Surely some variation in weight from box to box is to be expected and should be allowed. So we are faced with several questions: How many boxes should we weigh? How should we choose these boxes? How much deviation in weight from the 20 ounces should we allow? These are the kind of questions that the discipline of statistics is designed to answer.

Definition 1.0.1 (Statistics). Statistics is the scientific discipline concerned with collecting, analyzing, and making inferences from data.

While we cannot tell the whole Raisin Bran story here, the answers to our questions as prescribed by NIST (National Institute of Standards and Technology) and developed from statistical theory are something like this. Suppose that we are at a Meijer's warehouse that has just received a shipment of 250 boxes of Raisin Bran. We first select twelve boxes out of the whole shipment at random. By at random we mean that no box should be any more likely to occur in the group of twelve than any other. In other words, we shouldn't simply take the first twelve boxes that we find. Next we weigh the contents of the twelve boxes. If any of the boxes is "too" underweight, we reject the whole shipment - that is, we disbelieve the claim of Kellogg's (and they are in trouble). If that is not the case, then we compute the average weight of the twelve boxes. If that average is not "too" far below 20 ounces, we do not disbelieve the claim.

Of course there are some details in the above paragraph. We'll address the issue of how to choose the boxes more carefully in Section 2.6. We'll address the issue of summarizing the data (in this case, using the average weight) in Section 2.2. The question of how far below 20 ounces Kellogg's should be allowed to be will be dealt with in Section 4.2.

Underlying our statistical techniques is the theory of probability which we take up in Chapter 3. The theory of probability is meant to supply a mathematical model for situations in which there is uncertainty. In the context of Raisin Bran, we will use probability to give a model for the variation that exists from box to box. We will also use probability to give a model of the uncertainty introduced because we are only weighing a sample of boxes.

If the whole course were only about Raisin Bran it wouldn't be worth it (except perhaps to Kellogg's), even an abbreviated course like this one. But you are probably sophisticated enough to be able to generalize this example. Indeed, the above story can be told in every branch of science (biological, physical, and social). Each time we have a hypothesis about a real-world phenomenon that is measurable but variable, we need to test that hypothesis by collecting data. We need to know how to collect that data, how to analyze it, and how to make inferences from it. So without further ado, let's talk about data.

2 Data

2.1 Data - Basic Notions

The OED defines data as "facts and statistics used for reference or analysis." (And the OED notes that while the word data is technically the plural of datum, it is often used with a singular verb and that usage is now generally deemed to be acceptable.) For our purposes, the sort of data that we will use comes to us in collections or datasets. A dataset consists of a set of objects, variously called individuals, cases, items, instances, units, or subjects, together with a record of the value of a certain variable or variables defined on the items.
Definition 2.1.1 (variable). A variable is a function defined on the set of objects.

Ideally, each individual has a value for each variable. However, there are often missing values.

Example 2.1.1 Calvin College maintains a dataset of all currently active students. The individuals in this dataset are the students. Many different variables are defined and recorded in this dataset. For example, every student has a GPA, a GENDER, a CLASS, etc. Not every student has an ACT score — there are missing values for this variable.

We will normally think of data as presented in a two-dimensional table. The rows of the table correspond to the individuals. (Thus the individuals need to be ordered in some way.) The columns of the table correspond to the variables. Each of the rows and the columns normally has a name. In R, the canonical way to store such data is in an object called a data.frame. A number of datasets are included with the basic installation of R. The following example shows how an included dataset is accessed in R.

> data(iris)      # the dataset called iris is loaded into a data.frame called iris
> dim(iris)       # list dimensions of iris data
[1] 150   5
> iris[1:5,]      # print first 5 rows (individuals), all columns
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
>

Notice that the data.frame has rows and columns. The individuals (rows) are, by default, numbered (they can also be named) and the variables (columns) are named. The numbers and names are not part of the dataset. Each column of a data.frame is a vector and behaves like the mathematical object called a vector. In the iris dataset, there are 150 individuals (plants) and five variables. Notice that four of the variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) are quantitative variables. That is, the value of the variable is a number. The fifth variable is categorical. A categorical variable usually has a finite number of possible values. The possible values of a categorical variable are often called its levels. In this example the variable Species is categorical with three levels. A categorical variable is often called a factor. Sometimes categorical variables use numbers for the category names. For example we might code gender by using 0 for males and 1 for females. We need to be careful not to treat these variables as quantitative simply because numbers are used. The following example shows how to look at pieces of the dataset.
> iris$Sepal.Length                  # returns a vector
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
  [the remaining rows of the 150-value printout are omitted here]
> iris$Species                       # a boring vector of this variable
  [1] setosa setosa setosa setosa setosa setosa
  [remaining rows omitted: individuals 1-50 are setosa, 51-100 versicolor, 101-150 virginica]
Levels: setosa versicolor virginica
> iris$Petal.Width[c(1:5,146:150)]   # selecting some individuals
 [1] 0.2 0.2 0.2 0.2 0.2 2.3 1.9 2.0 2.3 1.8
>

Accessing datasets in R

We have already seen the first way of accessing a dataset in R. There are a large number of datasets that are included with the standard distribution of R. Many of these are historically important datasets or datasets that are often used in statistics courses. A complete list of such datasets is available by typing data(). Many users of R have made other datasets available by creating a package. A package is a collection of R datasets and/or functions that a user can load. Some of these packages come with the standard distribution of R. Others are available from CRAN. To load a package, use library(package.name) or require(package.name). For example, the faraway package contains several datasets. One such dataset records various health statistics on 768 adult Pima Indians for a medical study of diabetes.
> library(faraway)
> data(pima)
> dim(pima)
[1] 768   9
> pima[1:5,]
  pregnant glucose diastolic triceps insulin  bmi diabetes age test
1        6     148        72      35       0 33.6    0.627  50    1
2        1      85        66      29       0 26.6    0.351  31    0
3        8     183        64       0       0 23.3    0.672  32    1
4        1      89        66      23      94 28.1    0.167  21    0
5        0     137        40      35     168 43.1    2.288  33    1
>

If the package is not included in the distribution of R installed on your machine, the package can be installed from a remote site. This can be done easily in both Windows and Mac implementations of R using menus.

Finally, datasets can be loaded from a file that is located on one's local computer or on the internet. Two things need to be known: the format of the data file and the location of the data file. The most common format of a data file is CSV (comma separated values). In this format, each individual is a line in the file and the values of the variables are separated by commas. The first line of such a file contains the variable names. There are no individual names. The R function read.csv reads such a file. Other formats are possible and the function read.table can be used with various options to read these. The following example shows how a file is read from the internet. The file contains the offensive statistics of all major league baseball teams for the complete 2007 season.

> bball=read.csv('http://www.calvin.edu/~stob/data/baseball2007.csv')
> bball[1:4,]
         CLUB LEAGUE    BA   SLG   OBP   G   AB   R    H   TB X2B X3B  HR RBI
1    New York      A 0.290 0.463 0.366 162 5717 968 1656 2649 326  32 201 929
2     Detroit      A 0.287 0.458 0.345 162 5757 887 1652 2635 352  50 177 857
3     Seattle      A 0.287 0.425 0.337 162 5684 794 1629 2416 284  22 153 754
4 Los Angeles      A 0.284 0.417 0.345 162 5554 822 1578 2317 324  23 123 776
  SH SF HBP  BB IBB   SO  SB CS GDP  LOB SHO   E  DP TP
1 41 54  78 637  32  991 123 40 138 1249   8  88 174  0
2 31 45  56 474  45 1054 103 30 128 1148   3  99 148  0
3 33 40  62 389  32  861  81 30 154 1128   7  90 167  0
4 32 65  40 507  55  883 139 55 146 1100   8 101 154  0
>

Creating datasets in R

Probably the best way to create a new dataset for use in R is to use an external program to create it. Excel, for example, can save a spreadsheet in CSV format. The editing features of Excel make it very easy to create such a dataset. Small datasets can be entered into R by hand. First, vectors can be created using the c() or scan() functions.

> x=c(1,2,3,4,5:10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> y=c('a','b','c')
> y
[1] "a" "b" "c"
> z=scan()
1: 2 3 4
4: 11 12 19
7: 4
8:
Read 7 items
> z
[1]  2  3  4 11 12 19  4
>

The scan() function prompts the user with the number of the next item to enter. Items are entered delimited by spaces or commas. We can use as many lines as we like and the input is terminated by a blank line. There is also a data editor available in the graphical user interfaces but it is quite primitive. A data.frame can be made from vectors of the same length.

> x=c('Tom','Dick','Harry')
> y=c(23,28,27)
> people=data.frame(names = x, ages = y)
> people
  names ages
1   Tom   23
2  Dick   28
3 Harry   27
>

2.2 Graphical and Numerical Summaries of Univariate Data

Now that we can get our hands on some data, we would like to develop some tools to help us understand the distribution of a variable in a data set. By distribution we mean two things: what values does the variable take on, and with what frequency. Simply listing all the values of a variable is not an effective way to describe a distribution unless the data set is quite small. For larger data sets, we require some better methods of summarizing a distribution.
2.2.1 Graphical Summaries

The type of summary that we generate will vary depending on the type of data that we are summarizing. A table is useful for summarizing a categorical variable. The following table is a useful description of the distribution of species of iris flowers in the iris dataset.

> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 

Tables can be generated for quantitative variables as well.

> table(iris$Sepal.Length)

4.3 4.4 4.5 4.6 4.7 4.8 4.9   5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9   6 6.1 
  1   3   1   4   2   5   6  10   9   4   1   6   7   6   8   7   3   6   6 
6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9   7 7.1 7.2 7.3 7.4 7.6 7.7 7.9 
  4   9   7   5   2   8   3   4   1   1   3   1   1   1   4   1 

The table function is more useful in conjunction with the cut() function. The second argument to cut() gives a vector of endpoints of half-open intervals. Note that the default behavior is to use intervals that are open to the left, closed to the right.

> table(cut(iris$Sepal.Length,c(4,5,6,7,8)))

(4,5] (5,6] (6,7] (7,8] 
   32    57    49    12 

The kind of summary in the above table is graphically presented by means of a histogram. There are two R commands that can be used to build a histogram: hist() and histogram(). hist() is part of the standard distribution of R. histogram() can only be used after first loading the lattice graphics package, which now comes standard with all distributions of R. The R functions are used as in the following excerpt which generates the two histograms in Figure 2.1. Notice that two forms of the histogram() function are given. The second form (the "formula" form) will be discussed in more detail in Section 2.3.

> bball=read.csv('http://www.calvin.edu/~stob/data/baseball2007.csv')
> hist(bball$HR)
> histogram(bball$HR)        # lattice histogram of a vector
> histogram(~HR,data=bball)  # formula form of histogram

[Figure 2.1: Homeruns in major leagues: hist() and histogram().]

Notice that the histograms produced differ in several ways. Besides aesthetic differences, the two histogram algorithms typically choose different break points. Also, the vertical scale of histogram() is in percentages of total while the vertical scale of hist() contains actual counts. As one might imagine, there are optional arguments to each of these functions that can be used to change such decisions. In these notes, we will always use histogram() and indeed we will assume that the lattice package has been loaded. Graphics functions in the lattice package often have several useful features. We will see some of these in later sections.

A histogram gives a shape to a distribution and distributions are often described in terms of these shapes. The exact shape depicted by a histogram will depend not only on the data but on various other choices, such as how many bins are used, whether the bins are equally spaced across the range of the variable, and just where the divisions between bins are located. But reasonable choices of these arguments will usually lead to histograms of similar shape, and we use these shapes to describe the underlying distribution as well as the histogram that represents it.
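For example, the following lines are one way to influence the binning (a sketch reusing the bball data read in above; the choice of roughly 10 bins is arbitrary).

hist(bball$HR, breaks = 10)               # suggest roughly 10 bins to hist()
histogram(~HR, data = bball, nint = 10)   # ask the lattice histogram for about 10 bins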
Some distributions are approximately symmetric, with the distribution of the larger values looking like a mirror image of the distribution of the lower values. We will call a distribution positively skewed if the portion of the distribution with larger values (the right of the histogram) is more spread out than the other side. Similarly, a distribution is negatively skewed if the distribution deviates from symmetry in the opposite manner. Later we will learn a way to measure the degree and direction of skewness with a number; for now it is sufficient to describe distributions qualitatively as symmetric or skewed. See Figure 2.2 for some examples of symmetric and skewed distributions.

[Figure 2.2: Skewed and symmetric distributions; the panels show negatively skewed, symmetric, and positively skewed examples.]

Notice that each of these distributions is clustered around a center where most of the values are located. We say that such distributions are unimodal. Shortly we will discuss ways to summarize the location of the "center" of unimodal distributions numerically. But first we point out that some distributions have other shapes that are not characterized by a strong central tendency. One famous example is eruption times of the Old Faithful geyser in Yellowstone National Park. The command

> data(faithful);
> histogram(faithful$eruptions,n=20);

produces the histogram in Figure 2.3 which shows a good example of a bimodal distribution. There appear to be two groups or kinds of eruptions, some lasting about 2 minutes and others lasting between 4 and 5 minutes.

[Figure 2.3: Old Faithful eruption times (based on the faithful data set).]

2.2.2 Measures of the Center of a Distribution

Qualitative descriptions of the shape of a distribution are important and useful. But we will often desire the precision of numerical summaries as well. Two aspects of unimodal distributions that we will often want to measure are central tendency (what is a typical value? where do the values cluster?) and the amount of variation (are the data tightly clustered around a central value, or more spread out?).

Two widely used measures of center are the mean and the median. You are probably already familiar with both. The mean is calculated by adding all the values of a variable and dividing by the number of values. Our usual notation will be to denote the n values as x_1, x_2, ..., x_n, and the mean of these values as x̄. Then the formula for the mean becomes

    \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.

The median is a value that splits the data in half - half of the values are smaller than the median and half are larger. By this definition, there could be more than one median (when there are an even number of values). This ambiguity is removed by taking the mean of the "two middle numbers" (after sorting the data). See Exercises 2.4 - 2.6 for some problems that explore aspects of the mean and median that may be less familiar.

The mean and median are easily computed in R. For example,

> mean(iris$Sepal.Length); median(iris$Sepal.Length);
[1] 5.843333
[1] 5.8

We can also compute the mean and median of the Old Faithful eruption times.

> mean(faithful$eruptions); median(faithful$eruptions);
[1] 3.487783
[1] 4

Notice, however, that in the Old Faithful eruption times histogram (Figure 2.3) there are very few eruptions that last between 3.5 and 4 minutes. So although these numbers are the mean and median, neither is a very good description of the typical eruption time(s) of Old Faithful.
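One crude way to describe the two kinds of eruptions separately is to split the data at a value in the gap between the two peaks. The 3-minute cutoff below is simply read off the histogram; it is only an illustration, not a standard procedure.

short <- faithful$eruptions[faithful$eruptions < 3]    # the shorter eruptions
long  <- faithful$eruptions[faithful$eruptions >= 3]   # the longer eruptions
c(mean(short), mean(long))                             # a typical value for each group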
It will often be the case that the mean and median are not very good descriptions of a data set that is not unimodal. In the case of our Old Faithful data, there seem to be two predominant peaks, but unlike in the case of the iris data, we do not have another variable in our data that lets us partition the eruption times into two corresponding groups. This observation could, however, lead to some hypotheses about Old Faithful eruption times. Perhaps eruption times are different at night than during the day. Perhaps there are other differences in the eruptions. Subsequent data collection (and statistical analysis of the resulting data) might help us determine whether our hypotheses appear correct.

One disadvantage of a histogram is that the actual data values are lost. For a large data set, this is probably unavoidable. But for more modestly sized data sets, a stem plot can reveal the shape of a distribution without losing the actual (perhaps rounded) data values. A stem plot divides each value into a stem and a leaf at some place value. The leaf is rounded so that it requires only a single digit. The values are then recorded as in Figure 2.4. From this output we can readily see that the shortest recorded eruption time was 1.60 minutes. The second 0 in the first row represents 1.70 minutes. Note that the output of stem() can be ambiguous when there are not enough data values in a row.

> stem(faithful$eruptions);

  The decimal point is 1 digit(s) to the left of the |

  16 | 070355555588
  18 | 000022233333335577777777888822335777888
  20 | 00002223378800035778
  22 | 0002335578023578
  24 | 00228
  26 | 23
  28 | 080
  30 | 7
  32 | 2337
  34 | 250077
  36 | 0000823577
  38 | 2333335582225577
  40 | 0000003357788888002233555577778
  42 | 03335555778800233333555577778
  44 | 02222335557780000000023333357778888
  46 | 0000233357700000023578
  48 | 00000022335800333
  50 | 0370

Figure 2.4: Stemplot of Old Faithful eruption times using stem().

Comparing mean and median

Why bother with two different measures of central tendency? The short answer is that they measure different things, and sometimes one measure is better than the other. If a distribution is (approximately) symmetric, the mean and median will be (approximately) the same. (See Exercise 2.4.) If the distribution is not symmetric, however, the mean and median may be very different. For example, if we begin with a symmetric distribution and add in one additional value that is very much larger than the other values (an outlier), then the median will not change very much (if at all), but the mean will increase substantially. We say that the median is resistant to outliers while the mean is not.

A similar thing happens with a skewed, unimodal distribution. If a distribution is positively skewed, the large values in the tail of the distribution increase the mean (as compared to a symmetric distribution) but not the median, so the mean will be larger than the median. Similarly, the mean of a negatively skewed distribution will be smaller than the median.

Whether a resistant measure is desirable or not depends on context. If we are looking at the income of employees of a local business, the median may give us a much better indication of what a typical worker earns, since there may be a few large salaries (the business owner's, for example) that inflate the mean. This is also why the government reports median household income and median housing costs.
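The resistance of the median (and the sensitivity of the mean) is easy to see with a small made-up example:

x <- c(23, 25, 26, 28, 30)     # a small, roughly symmetric data set
c(mean(x), median(x))          # the two measures agree closely
y <- c(x, 250)                 # add a single outlier
c(mean(y), median(y))          # the mean jumps; the median barely moves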
On the other hand, if we compare the median and mean of the value of raffle prizes, the mean is probably more interesting. The median is probably 0, since typically the majority of raffle tickets do not win anything. This is independent of the values of any of the prizes. The mean will tell us something about the overall value of the prizes involved. In particular, we might want to compare the mean prize value with the cost of the raffle ticket when we decide whether or not to purchase one.

The trimmed mean compromise

There is another measure of central tendency that is less well known and represents a kind of compromise between the mean and the median. In particular, it is more sensitive to the extreme values of a distribution than the median is, but less sensitive than the mean. The idea of a trimmed mean is very simple. Before calculating the mean, we remove the largest and smallest values from the data. The percentage of the data removed from each end is called the trimming percentage. A 0% trimmed mean is just the mean; a 50% trimmed mean is the median; a 10% trimmed mean is the mean of the middle 80% of the data (after removing the largest and smallest 10%). A trimmed mean is calculated in R by setting the trim argument of mean(), e.g. mean(x,trim=.10). Although a trimmed mean in some sense combines the advantages of both the mean and median, it is less common than either the mean or the median. This is partly due to the mathematical theory that has been developed for working with the median and especially the mean of sample data.

2.2.3 Measures of Dispersion

It is often useful to characterize a distribution in terms of its center, but that is not the whole story. Consider the distributions depicted in the histograms below.

[Two density histograms, labeled A and B, each centered near 10 but with very different spreads.]

In each case the mean and median are approximately 10, but the distributions clearly have very different shapes. The difference is that distribution B is much more "spread out". "Almost all" of the data in distribution A is quite close to 10; a much larger proportion of distribution B is "far away" from 10. The intuitive (and not very precise) statement in the preceding sentence can be quantified by means of quantiles. The idea of quantiles is probably familiar to you since percentiles are a special case of quantiles.

Definition 2.2.1 (Quantile). Let p ∈ [0, 1]. A p-quantile of a quantitative distribution is a number q such that the (approximate) proportion of the distribution that is less than q is p.

[Figure 2.5: An illustration of a method for determining quantiles from data. Arrows indicate the locations of the .25-quantile and the .5-quantile.]

So for example, the .2-quantile divides a distribution into 20% below and 80% above. This is the same as the 20th percentile. The median is the .5-quantile (and the 50th percentile). The idea of a quantile is quite straightforward. In practice there are a few wrinkles to be ironed out. Suppose your data set has 15 values. What is the .30-quantile? 30% of the data would be (.30)(15) = 4.5 values. Of course, there is no number that has 4.5 values below it and 10.5 values above it. This is the reason for the parenthetical word approximate in Definition 2.2.1. Different schemes have been proposed for giving quantiles a precise value, and R implements several such methods.
They are similar in many ways to the decision we had to make when computing the median of a variable with an even number of values. Two important methods can be described by imagining that the sorted data have been placed along a ruler, one value at every unit mark and also at each end. To find the p-quantile, we simply snap the ruler so that proportion p is to the left and 1 − p to the right. If the break point happens to fall precisely where a data value is located (i.e., at one of the unit marks of our ruler), that value is the p-quantile. If the break point is between two data values, then the p-quantile is a weighted mean of those two values. For example, suppose we have 10 data values: 1, 4, 9, 16, 25, 36, 49, 64, 81, 100. The 0-quantile is 1, the 1-quantile is 100, and the .5-quantile (median) is midway between 25 and 36, that is, 30.5. Since our ruler is 9 units long, the .25-quantile is located 9/4 = 2.25 units from the left edge. That would be one quarter of the way from 9 to 16, which is 9 + .25(16 − 9) = 9 + 1.75 = 10.75. (See Figure 2.5.) Other quantiles are found similarly. This is precisely the default method used by quantile().

> quantile((1:10)^2)
    0%    25%    50%    75%   100% 
  1.00  10.75  30.50  60.25 100.00 

A second scheme is just like this one except that the data values are placed midway between the unit marks. In particular, this means that the 0-quantile is not the smallest value. This could be useful, for example, if we imagined we were trying to estimate the lowest value in a population from which we only had a sample. Probably the lowest value overall is less than the lowest value in our particular sample. Other methods try to refine this idea, usually based on some assumptions about what the population of interest is like. Fortunately, for large data sets, the differences between the different quantile methods are usually unimportant, so we will just let R compute quantiles for us using the quantile() function. For example, here are the deciles and quartiles of the Old Faithful eruption times.

> quantile(faithful$eruptions,(0:10)/10);
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100% 
1.6000 1.8517 2.0034 2.3051 3.6000 4.0000 4.1670 4.3667 4.5330 4.7000 5.1000 
> quantile(faithful$eruptions,(0:4)/4);
     0%     25%     50%     75%    100% 
1.60000 2.16275 4.00000 4.45425 5.10000 

The latter of these provides what is commonly called the five-number summary. The 0-quantile and 1-quantile (at least in the default scheme) are the minimum and maximum of the data set. The .5-quantile gives the median, and the .25- and .75-quantiles (also called the first and third quartiles) isolate the middle 50% of the data. When these numbers are close together, then most (well, half, to be more precise) of the values are near the median. If those numbers are farther apart, then much (again, half) of the data is far from the center. The difference between the first and third quartiles is called the inter-quartile range and abbreviated IQR. This is our first numerical measure of dispersion.

The five-number summary is often presented by means of a boxplot. The standard R function is boxplot() and the lattice function is bwplot(). A boxplot of the Sepal.Width of the iris data is in Figure 2.6 and was generated by

> bwplot(iris$Sepal.Width)

The sides of the box are drawn at the quartiles. The median is represented by a dot in the box. In some boxplots, the whiskers extend out to the maximum and minimum values.
However the boxplot that we are using here attempts to identify outliers. Outliers are values that are unusually large or small and are indicated by a special symbol beyond the whiskers. The whiskers are then drawn from the box to the largest and smallest non-outliers. One common rule for automating outlier detection for boxplots is the 1.5 IQR rule. Under this rule, any value that is more than 1.5 IQR away from the box is marked as an outlier. Indicating outliers in this way is useful since it allows us to see if the whisker is long only because of one extreme value.

[Figure 2.6: Boxplot of Sepal.Width of iris data.]

Variance and Standard Deviation

Another important way to measure the dispersion of a distribution is by comparing each value with the mean of the distribution. If the distribution is spread out, these differences will tend to be large, otherwise these differences will be small. To get a single number, we could simply add up all of the deviations from the mean:

    total deviation from the mean = \sum_{i=1}^{n} (x_i - \bar{x}).

The trouble with this is that the total deviation from the mean is always 0 (see Exercise 2.9). The problem is that the negative deviations and the positive deviations always exactly cancel out. To fix this problem we might consider taking the absolute value of the deviations from the mean:

    total absolute deviation from the mean = \sum_{i=1}^{n} |x_i - \bar{x}|.

This number will only be 0 if all of the data values are equal to the mean. Even better would be to divide by the number of data values. Otherwise large data sets will have large sums even if the values are all close to the mean.

    mean absolute deviation = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|.

This is a reasonable measure of the dispersion in a distribution, but we will not use it very often. There is another measure that is much more common, namely the variance, which is defined by

    variance = \mathrm{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.

You will notice two differences from the mean absolute deviation. First, instead of using an absolute value to make things positive, we square the deviations from the mean. The chief advantage of squaring over the absolute value is that it is much easier to do calculus with a polynomial than with functions involving absolute values. Because the squaring changes the units of this measure, the square root of the variance, called the standard deviation, is commonly used in place of the variance.

    standard deviation = \mathrm{sd}(x) = \sqrt{\mathrm{Var}(x)}

The second difference is that we divide by n − 1 instead of by n. There is a very good reason for this, even though dividing by n probably would have felt much more natural to you at this point. We'll get to that very good reason later in the course. For now, we'll settle for a less good reason. If you know the mean and all but one of the values of a variable, then you can determine the remaining value, since the sum of all the values must be the product of the number of values and the mean. So once the mean is known, there are only n − 1 independent pieces of information remaining. That is not a particularly satisfying explanation, but it should help you remember to divide by the correct quantity. All of these quantities are easy to compute in R.
> x=c(1,3,5,5,6,8,9,14,14,20);
> mean(x);
[1] 8.5
> x - mean(x);
 [1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5  0.5  5.5  5.5 11.5
> sum(x - mean(x));
[1] 0
> abs(x - mean(x));
 [1]  7.5  5.5  3.5  3.5  2.5  0.5  0.5  5.5  5.5 11.5
> sum(abs(x - mean(x)));
[1] 46
> (x - mean(x))^2;
 [1]  56.25  30.25  12.25  12.25   6.25   0.25   0.25  30.25  30.25 132.25
> sum((x - mean(x))^2);
[1] 310.5
> n= length(x);
> 1/(n-1) * sum((x - mean(x))^2);
[1] 34.5
> var(x);
[1] 34.5
> sd(x);
[1] 5.87367
> sd(x)^2;
[1] 34.5

2.3 The Relationship Between Two Variables

Many scientific problems are about describing and explaining the relationship between two or more variables. In the next three sections, we begin to look at graphical and numerical ways to summarize such relationships. In this section, we consider the case where one or both of the variables are categorical.

We first consider the case when one of the variables is categorical and the other is quantitative. This is the situation with the iris data if we are interested in the question of how, say, Sepal.Length varies by Species. A very common way of beginning to answer this question is to construct side-by-side boxplots.

> bwplot(Sepal.Length~Species,data=iris)

[Figure 2.7: Box plot for iris sepal length as a function of Species.]

We see from these boxplots that the virginica variety of iris tends to have the longest sepal length though the sepal lengths of this variety also have the greatest variation.

The notation used in the first argument of bwplot() is called formula notation and is extremely important when considering the relationship between two variables. This formula notation is used throughout lattice graphics and in other R functions as well. The generic form of a formula is

    y ~ x | z

which can often be interpreted as "y modeled by x conditioned on z". For plotting, y will typically contain a variable presented on the vertical axis, and x a variable to be plotted along the horizontal axis. In this case, we are modeling (or describing) sepal length by species. In this example, there is no conditioning variable z. An example of the use of a conditioning variable occurs in histogram(). The same information in the boxplots above is contained in the side-by-side histograms of Figure 2.8.

> histogram(~Sepal.Length | Species,data=iris,layout=c(3,1))

In the case of a histogram, the values for the vertical axis are frequencies computed from the x variable, so y is omitted (or can be thought of as a frequency variable that is always included in a histogram implicitly). The condition z is a variable that is used to break the data into different groups. In the case of histogram(), the different groups are plotted in separate panels. When z is categorical there is one panel for each level of z. When z is quantitative, the data is divided into a number of sections based on the values of z.

[Figure 2.8: Sepal lengths of three species of irises.]

The formula notation is used for more than just graphics. In the above example, we would also like to compute summary statistics (such as the mean) for each of the species separately. There are two ways to do this in R. The first uses the aggregate() function, sketched briefly below. A much easier way uses the summary() function from the Hmisc package.
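For reference, here is a sketch of the aggregate() approach using its classic by= interface; the call shown is only one of several ways to use aggregate().

aggregate(iris$Sepal.Length, by = list(Species = iris$Species), FUN = mean)    # mean sepal length per species
aggregate(iris$Sepal.Length, by = list(Species = iris$Species), FUN = median)  # median per species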
The summary() function allows us to apply virtually any function that has vector input to each level of a categorical variable separately.

> require(Hmisc)     # load Hmisc package
Loading required package: Hmisc
...............................
> summary(Sepal.Length~Species,data=iris,fun=mean);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |
|       |virginica | 50|6.588000    |
+-------+----------+---+------------+
|Overall|          |150|5.843333    |
+-------+----------+---+------------+
> summary(Sepal.Length~Species,data=iris,fun=median);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.0         |
|       |versicolor| 50|5.9         |
|       |virginica | 50|6.5         |
+-------+----------+---+------------+
|Overall|          |150|5.8         |
+-------+----------+---+------------+
> summary(Sepal.Length~Species,iris);
Sepal.Length    N=150

+-------+----------+---+------------+
|       |          |N  |Sepal.Length|
+-------+----------+---+------------+
|Species|setosa    | 50|5.006000    |
|       |versicolor| 50|5.936000    |
|       |virginica | 50|6.588000    |
+-------+----------+---+------------+
|Overall|          |150|5.843333    |
+-------+----------+---+------------+

Notice that the default function used in summary() computes the mean. From now on we will assume that the lattice and Hmisc packages have been loaded and will not show the loading of these packages in our examples. If you try an example in this book and R reports that it cannot find a function, it is likely that you have failed to load one of these packages. You can set up R to automatically load these two packages every time you launch R if you like.

In the above example, we investigated the relationship between a categorical and a quantitative variable. We now consider an example where both variables are categorical. A 1981 paper investigating racial biases in the application of the death penalty reported on 326 cases in which the defendant was convicted of murder. For each case they noted the race of the defendant and whether or not the death penalty was imposed. We can use R to cross tabulate this data for us:

> deathpenalty=read.table('http://www.calvin.edu/~stob/data/deathPenalty.txt',header=T)
> deathpenalty[1:5,]
  Penalty Victim Defendant
1     Not  White     White
2     Not  Black     Black
3     Not  White     White
4     Not  Black     Black
5   Death  White     Black
> xtabs(~Penalty+Defendant,data=deathpenalty)
       Defendant
Penalty Black White
  Death    17    19
  Not     149   141
>

(Notice some R features. We have used read.table, which is suitable for reading files that are not CSV but rather in which the data is separated by spaces. However read.table() does not assume a header with variable names. Notice also that xtabs() uses the formula format in a similar way to histogram(), namely with no output variable in the formula. The output in xtabs() is counts.)

From the output, it does not look like there is much of a difference in the rates at which black and white defendants receive the death penalty although a white defendant is slightly more likely to receive the death penalty. However a different picture emerges if we take into account the race of the victim.
> xtabs(~Penalty+Defendant+Victim,data=deathpenalty)
, , Victim = Black

       Defendant
Penalty Black White
  Death     6     0
  Not      97     9

, , Victim = White

       Defendant
Penalty Black White
  Death    11    19
  Not      52   132

It appears that black defendants are more likely to receive the death penalty when the victim is black and also when the victim is white. This phenomenon is known as Simpson's Paradox.

2.4 Describing a Linear Relationship Between Two Quantitative Variables

Many data analysis problems amount to describing the relationship between two quantitative variables.

Example 2.4.1 Thirteen bars of 90-10 Cu/Ni alloys were submerged for sixty days in sea water. The bars varied in iron content. The weight loss due to corrosion for each bar was recorded. The R dataset below gives the percentage content of iron (Fe) and the weight loss in mg per square decimeter (loss).

> library(faraway)
> data(corrosion)
> corrosion[c(1:3,12:13),]
     Fe  loss
1  0.01 127.6
2  0.48 124.0
3  0.71 110.8
12 1.44  91.4
13 1.96  86.2
> xyplot(loss~Fe, data=corrosion)

It is evident from the plot (Figure 2.9) that the greater the percentage of iron, the less corrosion. The plot suggests that the relationship might be linear. In the second plot, a line is superimposed on the data. (How we choose the line is the subject of this chapter.) Note that to plot the relationship between two quantitative variables, we may use either plot from the base R package or xyplot from lattice. The function xyplot() uses the same formula notation as histogram().

[Figure 2.9: The corrosion data with a "good" line added on the right.]

What is the role of the line that we superimposed on the plot of the data in this example? Obviously, we do not mean to claim that the relationship between iron content and corrosion loss is completely captured by the line. But as a "model" of the relationship between these variables, the line has at least three possible important uses. First, it provides a succinct description of the relationship that is difficult to see in the unsummarized data. The line plotted has equation

    loss = 129.79 − 24.02 Fe.

Both the intercept and slope of this line have simple interpretations. For example, the slope suggests that every increase of 1% in iron content means a decrease in weight loss of 24.02 mg per square decimeter. Second, the model might be used for prediction in a situation where we have a yet untested object. We can easily use this line to make a prediction for the material loss in an alloy of 2% iron content. Finally, it might figure in a scientific explanation of the phenomenon of corrosion. All three uses of such a "model" will be illustrated in the examples of this chapter.

Example 2.4.2 The current world records for men's track appear in Table 2.1. The plot of record distances (in meters) and times (in seconds) looks roughly linear. We know of course (for physical reasons) that this relationship cannot be a linear one.
Nevertheless, it appears that a smooth curve might approximate the data very well and that this curve might have a relatively simple formula.

Distance   Time       Record Holder
100        9.77       Asafa Powell (Jamaica)
200        19.32      Michael Johnson (US)
400        43.18      Michael Johnson (US)
800        1:41.11    Wilson Kipketer (Denmark)
1000       2:11.96    Noah Ngeny (Kenya)
1500       3:26.00    Hicham El Guerrouj (Morocco)
Mile       3:43.13    Hicham El Guerrouj (Morocco)
2000       4:44.79    Hicham El Guerrouj (Morocco)
3000       7:20.67    Daniel Komen (Kenya)
5000       12:37.35   Kenenisa Bekele (Ethiopia)
10,000     26:17.53   Kenenisa Bekele (Ethiopia)

Table 2.1: Men's World Records in Track (IAAF)

[Scatter plot of the record times in seconds against distance in meters.]

Example 2.4.3 The R dataset trees contains the measurements of the volume (in cu ft), girth (diameter of tree in inches measured at 4 ft 6 in above the ground), and height (in ft) of 31 black cherry trees in a certain forest. Since girth is easily measured, we might want to use girth to predict volume of the tree. A plot shows the relationship.

> data(trees)
> trees[c(1:2,30:31),]
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
30  18.0     80   51.0
31  20.6     87   77.0
> xyplot(Volume~Girth,data=trees)

[Scatter plot of Volume against Girth for the trees data.]

These three examples share the following features. In each, we are given n observations (x_1, y_1), ..., (x_n, y_n) of quantitative variables x and y. In each case we would like to find a "model" that explains y in terms of x. Specifically, we would like to find a simple functional relationship y = f(x) between these variables. Summarizing, our goal is the following

Goal: Given (x_1, y_1), ..., (x_n, y_n), find a "simple" function f such that y_i is approximately equal to f(x_i) for every i.

The goal is vague. We need to make precise the notion of "simple" and also the measure of fit we will use in evaluating whether y_i is close to f(x_i). In the rest of this section, we make these two notions precise. The simplest functions we study are linear functions such as the function that we used in Example 2.4.1. For the remainder of this chapter we will investigate the problem of fitting linear functions to our data. Namely, we will be trying to find b_0 and b_1 so that y_i ≈ b_0 + b_1 x_i for all i. (Statisticians use b_0, b_1 or a, b for the intercept and slope rather than the m and b that are typical in mathematics texts. We will use b_0, b_1.)

Of course, in only one of our motivating examples does it seem sensible to use a line to approximate the data. So two important questions that we will need to address are: How do we tell if a line is an appropriate description of the relationship? and What do we do if a linear function is not the right relationship? We will address both questions later.

How shall we measure the goodness of fit of a proposed function f to the data? For each x_i the function f predicts a certain value ŷ_i = f(x_i) for y_i. Then r_i = y_i − ŷ_i is the "mistake" that f makes in the prediction of y_i. Obviously we want to choose f so that the values r_i are small in absolute value. Introducing some terminology, we will call ŷ_i the fitted or predicted value of the model and r_i the residual.
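To make this terminology concrete, the following sketch computes fitted values and residuals for the corrosion data of Example 2.4.1, using the line reported there as the candidate model; nothing is special about this particular choice of b_0 and b_1.

library(faraway); data(corrosion)       # the data from Example 2.4.1
f <- function(Fe) 129.79 - 24.02*Fe     # a candidate linear model
yhat <- f(corrosion$Fe)                 # fitted (predicted) values
r <- corrosion$loss - yhat              # residuals: observation minus prediction
cbind(corrosion$loss, yhat, r)[1:3,]    # observed, fitted, and residual for the first three bars
sum(r^2)                                # one way to aggregate the residuals into a single number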
The following is a succinct statement of the relationship:

    observation = predicted + residual

It will be impossible to choose a line so that all the values of r_i are simultaneously small (unless the data points are collinear). Various values of b_0, b_1 might make some values of r_i small while making others large. So we need some measure that aggregates all the residuals. Many choices are possible and R provides software to find the resulting line, but the canonical choice and the one we investigate here is the sum of squares of the residuals. Namely, our goal is now refined to the following

Goal: Given (x_1, y_1), ..., (x_n, y_n), find b_0 and b_1 such that if f(x) = b_0 + b_1 x and r_i = y_i − f(x_i), then \sum_{i=1}^{n} r_i^2 is minimized.

We call \sum_{i=1}^{n} r_i^2 the sum of squares of the residuals and denote it by SSResid or SSE (for sum of squares error). Before we discuss the solution of this problem, we show how to solve it in R using the data of Example 2.4.1. The R function lm finds the coefficients of the line that minimizes the sum of squares of the residuals. Note that it uses the same syntax for expressing the relationship between variables as does xyplot.

> lm(loss~Fe,data=corrosion)

Call:
lm(formula = loss ~ Fe, data = corrosion)

Coefficients:
(Intercept)           Fe  
     129.79       -24.02  

The problem of finding the line in question can be solved using multivariate calculus. We need to find b_0 and b_1 to minimize a certain function of b_0 and b_1. This is a straightforward minimization problem that is solved by finding partial derivatives. However we will take a different approach and find b_0 and b_1 by recasting the problem as a linear algebra problem.

Given the observations (x_1, y_1), ..., (x_n, y_n), we construct vectors x, y ∈ R^n by

    x = (x_1, ..., x_n)        y = (y_1, ..., y_n)

Given a vector x and a function y = f(x), it is obvious how to interpret f(x). Namely f(x) is the vector in R^n that results from applying f to each of the elements of the vector x. (Most scalar functions in R behave in precisely this manner when given a vector as an argument.) We define the vector ŷ by ŷ = f(x). Then the vector r = y − ŷ is precisely the vector (r_1, ..., r_n) of the residuals r_i that we defined above. It seems natural to choose the function f so that the length of the vector r is minimized. Minimizing the length of r = y − ŷ is the same as minimizing the sum of the squares of the residuals (since minimizing the length is the same as minimizing the square of the length). So finally we restate our goal

Goal: Given the vectors x and y, find b_0 and b_1 so that if ŷ = b_0 + b_1 x, the length of the vector r = y − ŷ is minimized.

Our goal is to find b_0, b_1 to minimize the length of the residual vector r. The resulting line is called the least-squares line. Given the vector x, define a matrix X, called the model matrix, by

    X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}

Also define the vector b = (b_0, b_1). We call b the coefficient vector. Then we have that

    ŷ = Xb        r = y − Xb

In the best case we can find b such that y = Xb. In fact we know necessary and sufficient conditions under which such a b can be found. In this particular case, we see that such a b can be found if and only if y lies in the column space of X. The column space of X is a two-dimensional subspace of R^n. Of course in general y will not be in this subspace. The vector that we seek is the vector in the column space of X (i.e., a vector of the form Xb) that is closest to the vector y.
The following picture illustrates the situation. The plane in the illustration represents the column space of X. Namely, the vectors in this plane are all vectors of form Xb as b ranges over all possible coefficient vectors. This plane lives in Rn and the vector y is illustrated in the picture as a vector not in the column space of X.

[Figure 2.10: The relationship among the data vector, the residual vector, and the fitted vector. Labels: observation y; residual y − ŷ; fit ŷ = Xb; model space.]

From the picture, we can see exactly what we need. We need to choose b so that the residual vector r = y − Xb is orthogonal to the column space of X. That is, Xb will be the projection of y onto the column space of X. The condition that r is orthogonal to the column space of X can be written as X^T r = X^T (y − Xb) = 0. To solve for b we find that we must solve the equation

X^T y = X^T X b.    (2.1)

Equation 2.1 is usually called the normal equations. The vector on the left of 2.1 is a vector in R². This equation will have a solution if the matrix X^T X has rank two. This will be true whenever X has rank 2. The matrix X will have rank 2 if its columns are independent. This will be true whenever our data vector x is not a constant vector. It is obvious that a constant vector x gives data inappropriate for our problem.

For numerical purposes, it is best to solve for b directly from equation 2.1. However, by finding the inverse of X^T X, we can find an explicit formula for b. We have

b = (X^T X)^(-1) X^T y.

With this expression for b, we can also find the vector ŷ:

ŷ = Xb = Hy    where H = X (X^T X)^(-1) X^T.

The matrix H in this equation is usually called the hat matrix. While there is no need to know explicit expressions for b0 and b1 (these are always computed using software), it is easy to show that

b1 = ∑ (xi − x̄) yi / ∑ (xi − x̄)²    b0 = ȳ − b1 x̄    (2.2)

where the sums run over i = 1, . . . , n. We illustrate the solution of the least squares problem in R with the following example. Note that R provides tools to do linear algebra calculations, but Octave might be a better vehicle (though we have no reason to do the calculations explicitly).

Example 2.4.4 A random sample of eighty seniors at a certain undergraduate college in Michigan was chosen and their ACT scores (the Composite score) and grade point averages were recorded. The population was all students who had senior status as of February 15, 2003 and who had taken the ACT test. There appears to be a modest positive relationship between ACT scores and GPA. The least-squares solution is found and graphed below.

> sr=read.csv('sr.csv')
> dim(sr)
[1] 80  2
> sr[1:3,]
  ACT   GPA
1  20 3.300
2  22 3.409
3  27 3.224
> l.sr=lm(GPA~ACT,data=sr)
> l.sr

Call:
lm(formula = GPA ~ ACT, data = sr)

Coefficients:
(Intercept)          ACT
    1.25622      0.07589

[Figure 2.11: GPA predicted by ACT for 80 seniors.]

The following code computes the solution to the normal equation in R. It also computes the hat matrix and the sum of the squares of the residuals. Note that t(X) provides the transpose, %*% is matrix multiplication, and solve() solves linear equations.
> x=sr$ACT
> y=sr$GPA
> X=cbind(1,x)           # a matrix with two columns, 1, x
> X[c(1:2,79:80),]       # the first and last two rows of X
       x
[1,] 1 20
[2,] 1 22
[3,] 1 32
[4,] 1 31
> solve(t(X)%*%X,t(X)%*%y)    # solve solves systems of linear equations
        [,1]
  1.25621992
x 0.07588661
> Hat = X%*%solve(t(X)%*%X)%*%t(X)   # solve with one argument computes inverses
> yhat = Hat %*% y
> sum( (y-yhat)^2 )           # sum of squares of residuals
[1] 9.973544
> anova(l.sr)                 # R computes sums of squares of residuals
Analysis of Variance Table

Response: GPA
          Df Sum Sq Mean Sq F value    Pr(>F)
ACT        1 6.3436  6.3436  49.611 6.509e-10 ***
Residuals 78 9.9735  0.1279

The analysis of variance table in the last example has an entry for SSResid. Another picture helps us understand the other sum of squares in that table. The idea behind this picture is the following. We are using the value of the variable x to help "explain" or "predict" the value of y. We wish to know how much x helps us to do that. Consider the vector ȳ = (ȳ, . . . , ȳ), that is, the constant vector whose entries are the average value of the yi. This vector is in the column space of our model matrix X. Now the vector y − ȳ, labeled the recentered observation in Figure 2.12, measures the deviation of the observation vector from this mean vector. This vector therefore represents the total variation in the yi. Now consider the right triangle in the figure determined by the vectors y − ȳ, ŷ − ȳ, and y − ŷ. This is a right triangle since the residual vector is orthogonal to any vector in the model space, including ȳ. The vector ŷ − ȳ is the variation in the yi that is explained by the vector ŷ, i.e., by the model. By the Pythagorean Theorem, we have

||y − ȳ||² = ||ŷ − ȳ||² + ||y − ŷ||²

Each of the lengths in this equation is a sum of squares of quantities defined from the data.

[Figure 2.12: Analysis of variance decomposition. Labels: observation y; overall mean ȳ; fit ŷ = Xb; recentered observation y − ȳ; recentered fit ŷ − ȳ; residual y − ŷ; model space.]

Namely,

||y − ȳ||² = ∑ (yi − ȳ)²    (SST, sum of squares total)
||ŷ − ȳ||² = ∑ (ŷi − ȳ)²    (SSR, sum of squares regression)
||y − ŷ||² = ∑ (yi − ŷi)²   (SSResid, sum of squares residual)

From these definitions, we get the following important relationship.

SST = SSR + SSResid

In the R output above, SSR is the entry in the column Sum Sq and the row labelled ACT. This equation is usually summarized by saying something like this: the total variation is the variation explained by x plus the error variation. The fraction SSR/SST can then be interpreted as the percentage of variation in the yi accounted for by the xi. This fraction is called R² and is usually expressed as a percentage. R computes this fraction and reports it in the summary of an lm object. For Example 2.4.4, we would say that "39% of the variation in GPAs is explained by ACT scores."

> summary(l.sr)

Call:
lm(formula = GPA ~ ACT, data = sr)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90882 -0.19068  0.03028  0.30582  0.53473

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.25622    0.29443   4.267 5.53e-05 ***
ACT          0.07589    0.01077   7.044 6.51e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3576 on 78 degrees of freedom
Multiple R-Squared: 0.3888, Adjusted R-squared: 0.3809
F-statistic: 49.61 on 1 and 78 DF, p-value: 6.509e-10
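Continuing the R session above (a quick sketch using the vectors y and yhat that were just computed), we can verify SST = SSR + SSResid and recover R² directly:

sst <- sum((y - mean(y))^2)      # total sum of squares
ssr <- sum((yhat - mean(y))^2)   # regression sum of squares
sse <- sum((y - yhat)^2)         # residual sum of squares
sst - (ssr + sse)                # essentially zero
ssr / sst                        # R-squared, about 0.389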
2.5 Describing a Non-linear Relationship Between Two Variables

A linear function is not always the appropriate model to describe the relationship between two variables. In this section we consider two different approaches to fitting a nonlinear model. We will continue to assume that we are given data (x1, y1), . . . , (xn, yn).

Approach 1 - Linearize

Example 2.5.1 Suppose we wish to fit a function y = b0 e^(b1 x) to the data. This equation transforms to

ln y = ln b0 + b1 x

We then can use standard linear regression with the data (xi, ln yi). This returns ln b0 and b1. However this choice of b0, b1 does not minimize the sum of the squares of the residuals ri = yi − b0 e^(b1 xi). Rather, it minimizes the sum of squares of ln yi − (ln b0 + b1 xi). In a given application, it might not be so clear that this is desirable.

Note that in the above example, though ln y is nonlinear in y, linear regression finds the coefficients b0 and b1. Generalizing this example, suppose that f is a possibly nonlinear function of one variable x that depends on two unknown parameters b0 and b1. The goal is to transform the data (x, y) to (g(x), h(y)) so that the equation y = f(x) is equivalent to h(y) = b0' + b1' g(x), where b0', b1' are known functions of b0 and b1.

Approach 2 - Nonlinear Least Squares

Example 2.5.2 Continuing Example 2.5.1, suppose we wish to fit y = b0 e^(b1 x) to data by minimizing the sum of the squares of the residuals ri = yi − b0 e^(b1 xi). This is a problem in minimizing a nonlinear function of two variables. Usually this requires an iterative method to approximate the solution.

The Approaches Compared in R

Example 2.5.3 In Example 2.4.3, the relationship between the volume V and girth G of a sample of cherry trees is nonlinear. Both the plot of the data and our geometrical intuition tell us this. Suppose that we assume the relationship has the form

V = b0 G^b1

This is not unreasonable, as we might expect volume to vary as approximately the square of the girth. Linearizing gives

ln V = ln b0 + b1 ln G

Regression yields ln b0 = −2.353 (b0 = .095) and b1 = 2.20. On the other hand, minimizing the sum of squares of residuals directly gives b0 = .087 and b1 = 2.24. SSE = 313.75 when minimized directly and SSE = 317.85 when linearized. Note that the nonlinear least-squares algorithm implemented in R is an iterative procedure that needs starting values for the unknowns.

> data(trees)
> attach(trees)
> logG=log(Girth)
> logV=log(Volume)
> l.trees=lm(logV~logG)
> l.trees

Call:
lm(formula = logV ~ logG)

Coefficients:
(Intercept)         logG
     -2.353        2.200

> fit1=predict(l.trees)
> sum( (Volume-exp(fit1))^2 )
[1] 317.8461
> nl.trees=nls(Volume~b0*Girth^b1,start=list(b0=.2, b1=2.2))
> nl.trees
Nonlinear regression model
  model: Volume ~ b0 * Girth^b1
   data: parent.frame()
     b0      b1
0.08661 2.23639
 residual sum-of-squares: 313.8

Number of iterations to convergence: 4
Achieved convergence tolerance: 4.831e-07
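As a quick visual comparison of the two fits (a sketch using base graphics rather than lattice, and the coefficients reported above), we can overlay both fitted curves on the scatterplot of the trees data:

plot(Volume ~ Girth, data = trees)
curve(0.095 * x^2.20, add = TRUE, lty = 2)   # linearized fit (dashed)
curve(0.087 * x^2.24, add = TRUE)            # nonlinear least-squares fit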
2.6 Data - Samples

In the next two sections, we consider the question of data collection. If we are to make decisions based on data, we need to be careful in their collection. In this section we consider one common way of generating data, that of sampling from a population. Returning to the Raisin Bran example, it is simply not feasible to weigh every box of Raisin Bran in the warehouse to determine whether Kellogg's is telling the truth in its claim that the net weight of the boxes is 20 ounces. Instead, NIST tells us to select a sample consisting of a relatively small number of boxes and weigh those. The hope is that this smaller sample is representative of the larger collection.

Definition 2.6.1 (Population). A population is a well-defined collection of individuals.

As with any mathematical set, sometimes we define a population by a census or enumeration of the elements of the population. The registrar can easily produce an enumeration of the population of all currently registered Calvin students. Other times, we define a population by properties that determine membership in the population. (In mathematics, we define sets like this all the time since many sets in mathematics are infinite and so do not admit enumeration.) For example, the set of all Michigan registered voters is a population even though a census of the population would be very difficult to produce. It is perfectly clear for an individual whether that individual is a Michigan registered voter or not.

Definition 2.6.2 (sample). A subset S of a population P is called a sample from P.

Quite typically, we are studying a population P but have only a sample S and have the values of one or several variables for each element of S. The canonical goal of (inferential) statistics is:

Goal: Given a sample S from population P and values of a variable X on elements of S, make inferences about the values of X on the elements of P.

Most commonly, we will be making inferences about parameters of the population.

Definition 2.6.3 (parameter). A parameter is a numerical characteristic of the population.

For example, we might want to know the mean value of a certain variable defined on the population. One strategy for estimating the mean of such a variable is to take a random sample and compute the mean of the sample elements. Such an estimate is called a statistic.

Definition 2.6.4 (statistic). A statistic is a numerical characteristic of a sample.

Obviously, our success at solving this problem will depend to a large extent on how representative S is of the whole population P with respect to the properties measured by X. In turn, the representativeness of the sample will depend on how the sample is chosen. A convenience sample is a sample chosen simply by locating units that conveniently present themselves. A convenience sample of Calvin students could be produced by grabbing the first 100 students that come through the doors of Johnny's. It's pretty obvious that in this case, and for convenience samples in general, there is no guarantee that the sample is likely to be representative of the whole population. In fact we can predict some ways in which a "Johnny's sample" would not be representative of the whole student population.

One might suppose that we could construct a representative sample by carefully choosing the sample according to the important characteristics of the units. For example, to choose a sample of 100 Calvin students, we might ensure that the sample contains 54 females and 46 males. Continuing, we would then ensure a representative proportion of first-year students, dorm-livers, etc. There are several problems with this strategy.
There are usually so many characteristics that we might consider that we would have to take too large a sample so as to get enough subjects to represent all the possible combinations of characteristics in the proportions that we desire. It might be expensive to find the individuals with the desired characteristics. We have no assurance that the subjects we choose with the desired combination of characteristics are representative of the group of all the individuals with those characteristics. Finally, even if we list many characteristics, it might be the case that the sample will be unrepresentative according to some other characteristic that we didn't think of, and that characteristic might turn out to be important for the problem at hand.

Statisticians have settled on using sampling procedures that employ chance mechanisms. The simplest such procedure (and also by far the most important) is known as simple random sampling.

Definition 2.6.5 (simple random sample). A simple random sample (SRS) of size k from a population is a sample that results from a procedure for which every subset of size k has the same chance to be the sample chosen.

For example, to pick a random sample of size 100 of Calvin students, we might write the names of all Calvin students on index cards and choose 100 of these cards from a well-mixed bag of all the cards. In practice, random samples are often picked by computers that produce "random numbers." (A computer can't really produce random numbers since a computer can only execute a deterministic algorithm. However computers can produce numbers that behave as if they are random.) In this case, we would number all students from 1 to 4,224 and then choose 100 numbers from 1 to 4224 in such a way that any set of 100 numbers has the same chance of occurring. The R command sample(1:4224,100,replace=F) will choose such a set of 100 numbers.

Now it is certainly possible that a random sample is unrepresentative in some significant way. Since all possible samples are equally likely to be chosen, by definition it is possible that we choose a bad sample. For example, a random sample of Calvin students might fail to have any seniors in it. However the fact that a sample is chosen by simple random sampling enables us to make quantitative statements about the likelihood of certain kinds of nonrepresentativeness. This in turn will enable us to make inferences about the population and to make statements about how likely it is that our inferences are accurate.

The concept of random sampling can be extended to produce samples other than simple random samples. For example, we might want to take into account at least some of the characteristics of the members of the population without falling prey to the basic problems with this approach that we described above. For example, we might want to ensure that our sample of Calvin students is at least representative as far as class level goes. In our sample of 100 students, we would then want to choose a sample according to the breakdowns in Table 2.2. Having defined the sizes of our subsamples, however, we would then proceed to choose simple random samples from each subpopulation.

Definition 2.6.6 (stratified random sample). A stratified random sample of size k from a population is a sample that results from a procedure that chooses simple random samples from each of a finite number of groups (strata) that partition the population.
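A minimal sketch of stratified sampling in R follows. It assumes a hypothetical data frame students with one row per student and a column class giving the class level; these names are made up for illustration only.

# allocate about 100 students proportionally to class level,
# then take a simple random sample within each class (stratum)
sizes <- round(100 * table(students$class) / nrow(students))
rows <- unlist(lapply(names(sizes), function(lev) {
  sample(which(students$class == lev), sizes[lev])
}))
stratified.sample <- students[rows, ]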
In the above example, we chose the random sample so that the number of individuals sampled from each stratum was proportional to the size of that stratum. While this procedure has much to recommend it, it is not necessary and sometimes not even desirable.

Class Level    Population   Sample
First-year          1,129       27
Sophomore           1,008       24
Junior                897       21
Senior              1,041       24
Other                 149        4
Total               4,224      100

Table 2.2: Population of Calvin Students and Proportionate Sample Sizes

For example, only 4 "other" students appear in our sample of size 100 from the whole population. This is fine if we are only interested in making inferences about the whole population, but often we would like to say something about the subgroups as well. For example, we might want to know how much Calvin students work in off-campus jobs, but we might expect and would like to discover differences among the class levels in this variable. For this purpose, we might choose a sample of 20 students from each of the five strata. (Of course we would have to be careful about how to combine our numbers when making inferences about the whole population.) We would say about this sample that we have "oversampled" one of the groups. In public opinion polls, it is often the case that small minority groups are oversampled. The sample that results will still be called a random sample.

Definition 2.6.7 (random sample). A random sample of size k from a population is a sample chosen by a procedure such that each element of the population has a fixed probability of being chosen as part of the sample.

While we need to give a definition of probability in order to make this definition precise, it is clear from the above examples what we mean. This definition differs from that of a simple random sample in two ways. First, it does not require that each element has the same likelihood of being chosen. Second, it does not require that this equal likelihood extend to groups of elements. A sampling method that we might employ given a list of Calvin students is to choose one of the first 422 students in the list and then choose every 422nd student thereafter. Obviously some subsets can never occur as the sample since two students whose names are next to each other in the list can never be in the same sample. Such a sample might indeed be representative however.
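A sketch of this "every 422nd student" scheme in R, assuming the students are numbered 1 to 4,224 as before:

start <- sample(1:422, 1)             # pick one of the first 422 students at random
systematic <- seq(start, 4224, by = 422)
systematic                            # the chosen student numbers, 422 apart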
It is very important to note that we cannot guarantee by using random sampling of whatever form that our sample is representative of the population along the dimension we are studying. In fact with random sampling, it is guaranteed that it is possible that we could select a really bad (unrepresentative) sample. What we hope to be able to do (and we will later see how to do it) is to be able to quantify our uncertainty about the representativeness of the sample. The next example gives us an idea of how this might work.

Example 2.6.1 The dataset http://www.calvin.edu/~stob/data/miaa05.csv contains the statistics on every basketball player who played for an MIAA Men's basketball team in 2005. This collection of players will be our population. Of course there is no reason to take a sample to answer a question about this population, but let's see what would happen if we did. Suppose that we are interested in the points per game (PTSG) of these players. In the code below, we first take a sample of size 5.

> miaa=read.csv('http://www.calvin.edu/~stob/data/miaa05.csv')
> miaa[1:5,]
  Number              Player GP GS Min AvgMin  FG FGA FGPct FG3 FG3A
1     14 Brian Schaefer..... 25 19 769   30.8 146 366 0.399  67  185
2     32 Billy Collins Jr... 25 19 641   25.6 119 285 0.418  41  131
3      5 Mike Lewis......... 25 18 553   22.1  99 162 0.611   0    2
4     30 Adam Novak......... 20 13 453   22.6  95 163 0.583   3    3
5     24 Jeff Nokovich...... 25 17 702   28.1  38 109 0.349   7   31
  FG3Pct FT FTA FTPct Off Def Tot RBG PF FO   A TO Blk Stl Pts PTSG
1  0.362 66  94 0.702  24  42  66 2.6 37  1  96 69   1  40 425 17.0
2  0.313 37  60 0.617  18  41  59 2.4 51  0  37 35   1  19 316 12.6
3  0.000 47  63 0.746  58  81 139 5.6 65  1  29 40   6  26 245  9.8
4  1.000 45  64 0.703  52  79 131 6.6 42  2  47 25   5  33 238 11.9
5  0.226 36  60 0.600  20  60  80 3.2 63  2 104 49   3  52 119  4.8
> ptsg=miaa$PTSG
> s=sample(ptsg,5,replace=F)
> mean(s)
[1] 6.7

The sample of size 5 that we chose has a mean of 6.7. It would be plausible to use this sample to estimate the mean of the entire population. But if we had chosen a different sample, we would have computed a different sample mean. In the code below, we show what might happen if we choose 1,000 different samples of size 5.

> r=replicate(1000,mean(sample(ptsg,5,replace=F)))
> h1=histogram(ptsg)
> h2=histogram(r)
> summary(r)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.300   4.000   5.430   5.591   6.925  14.620
> mean(ptsg)
[1] 5.593284

[Figure 2.13: PTSG of MIAA players and average PTSG of 1000 samples of size 5 (two histograms, percent of total).]

The population and the 1,000 simulated samples are in Figure 2.13. In this case, since we know the true mean of the population (5.59), we can see what would happen if we used a sample of size 5 to estimate this number. It is quite possible that our estimate would be relatively close to 5.59. However it is also possible that we would get a very unrepresentative sample - in some of the simulated samples, the mean is more than 10! In fact, in this case we could (in principle) list all possible samples of size 5 (there are only (134 choose 5) = 333,859,526 of them) and look at the distribution of the means in the population of all samples.

The above example illustrates a basic paradigm of statistical analysis. In it we have a variable defined on a population. The distribution of that variable is unknown (our example was artificial in that it was known). In random sampling from the population, we compute a statistic related to the variable. That statistic itself has a distribution, known as the sampling distribution of the statistic. We simulated the sampling distribution in the example above but conceptually at least we can envision the entire sampling distribution. If the distribution of the population variable is not known, we will not know the sampling distribution. However, if we make some assumptions about the unknown distribution of the population variable, we can draw some conclusions about the shape of the sampling distribution. Section 3.5 addresses this issue in the case that our statistic is the sample mean.

Even in the above example, it is possible to make some qualitative conclusions about the sampling distribution of the sample mean. It appears, for example, that the distribution of the sample mean is more symmetric than the distribution of the variable in the population. Also it appears that the variation in the sample mean is less than the variation in the variable itself. We'll make these tentative conclusions more precise later.
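One more quick simulation (a sketch continuing the session above) makes the second point vivid: sample means based on larger samples vary less, both less than small-sample means and much less than the population variable itself.

r5  <- replicate(1000, mean(sample(ptsg, 5,  replace = FALSE)))
r20 <- replicate(1000, mean(sample(ptsg, 20, replace = FALSE)))
sd(ptsg)   # variability of the population variable itself
sd(r5)     # variability of the sample mean when n = 5
sd(r20)    # variability of the sample mean when n = 20: smaller still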
2.7 Data - Experiments

The American Music Conference is an organization that promotes music education at all levels. On their website http://www.amc-music.com/research_briefs.htm they promote music education as having all sorts of benefits. For example, they quote a study performed at the University of Sarasota in which "middle school and high school students who participated in instrumental music scored significantly higher than their non-band peers in standardized tests". Does this mean that if we increase the availability of and participation in instrumental programs in the schools, standardized test scores would generally increase?

The problem with that conclusion is that there might be other factors that cause the higher test scores of the clarinetists. For example, it might be the case that students who play in bands are more likely to come from schools with more financial resources. They are also more likely to be in families that are actively involved in their education. It might be that music participation and higher test scores are both a result of these variables. Scientists have long known that to establish a causal relationship between two variables, it is necessary to construct an experiment in which conditions, such as the values of other variables, are controlled. The data above come from an observational study rather than an experiment. Even worse, the observational study was retrospective rather than prospective. (In a prospective study we could at least observe the conditions of the subjects along the way and record the possibly relevant variables, even if we didn't control them.)

The "gold standard" for establishing a cause and effect relationship between two variables is the randomized comparative experiment. In an experiment, we want to study the relationship between two or more variables. At least one variable is an explanatory variable, and the value of this variable can be controlled or manipulated. At least one variable is a response variable. The experimenter has access to a certain set of experimental units (subjects, individuals, cases), sets various values of the explanatory variable to create a treatment, and records the values of the response variables. The experiment is a comparative experiment since the goal is to compare the responses given different values of the explanatory variables, i.e., different treatments. In a randomized experiment we assign the individuals to the various treatments at random. For example, if we took 100 fifth graders and randomly chose 50 of them to be in the band and 50 of them not to receive any music instruction, we could begin to believe that differences in their test scores could be explained by the different treatments. Randomization here plays the same role that it did in the previous section - we are attempting to arrange that the group assigned any particular treatment is representative of the whole group of subjects.

Consider the chickwts data. In this experiment, the experimenter was attempting to determine which chicken feed caused the greatest weight gain. Feed was the explanatory variable and there were six treatments (six different feeds). Weight was the response variable. The first step in designing this experiment was to assign baby chicks at random to the six different feed groups.
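A quick look at the chickwts data in R (a sketch; the dataset ships with R, and the lattice package is assumed to be loaded as earlier) shows the six treatments and lets us compare the response across them:

data(chickwts)
table(chickwts$feed)                              # the six feeds and the number of chicks on each
aggregate(weight ~ feed, data = chickwts, mean)   # mean weight for each feed
histogram(~weight | feed, data = chickwts)        # weight distributions, one panel per feed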
Often clinical trials of pharmaceuticals or medical procedures are randomized comparative experiments. In testing a new drug, there are often two groups of subjects - those who receive the drug and those who do not. A group receiving "no" treatment is often called a control group. A control is simply a level of the explanatory variable that represents the status quo or no treatment at all. In pharmaceutical trials, the control group is often given a placebo. A placebo is a treatment that looks like the others but has no effect. In a pharmaceutical trial, for example, a placebo might be a pill that does nothing. One often finds drug documentation that refers to a "placebo-controlled, randomized, comparative experiment."

Randomization ensures that there is no bias in the assignment of subjects to the experimental treatments. In a medical study, this ensures that characteristics of the patients (e.g., age, severity of the disease, height, weight, eye color) are not the explanation of any relationship found between the explanatory and response variable. In the chickwts example, the differences in weight in the six groups of chickens are not due to the chickens themselves (if indeed they were randomly assigned to groups). Randomization in a pharmaceutical trial ensures that any difference between the placebo group and the drug group is not due to, say, age. But this is not enough to claim that the difference in treatments "causes" the difference in the "response."

In the example of music participation (explanatory variable) and test scores (response variable), we noted that there was a third variable (poverty) that was a better explanation of the differences in the test scores than music participation (or at least so we conjectured). In this example, poverty is a lurking variable. A lurking variable is any variable that has a significant effect on the response but that hasn't been included in the study variables. Lurking variables are a key reason that observational studies (particularly retrospective ones) fail to determine causality. It's pretty easy to construct examples of lurking variables. The more churches a city has, the more bars it has, but it is unlikely that increased church attendance causes increased drunkenness. The lurking variable here is the size of the city. But lurking variables often exist in experimental designs as well. For example, in the chickwts data, the chickens were probably located in six different areas. Perhaps the physical setup of these six different areas had some important effect on the eating patterns of the chicks. Lurking variables such as weather or soil conditions are a particular concern in agricultural experiments.

Ideally, if we know that a variable has an effect on the response variable we should control for it. A blocking variable is a variable other than the explanatory and response variables that is controlled in the experiment, usually because it is thought that it might have an effect on the response variable. The term comes originally from agricultural experiments where a plot of land was a block. Suppose that we are trying to determine the effect of fertilizer (explanatory variable) on yield (response variable). Suppose that we have three unimaginatively named fertilizers A, B, C. We could divide the plot of land that we are using as in the first diagram of Figure 2.14. But it might be the case that the further north in the plot, the better the soil conditions. Northernness would then be a lurking variable. Instead, we could divide the patch using the second diagram in Figure 2.14. Of course there still might be variations in the soil conditions across the three fertilizers. But we would at least be able to measure the effect of northernness.

[Figure 2.14: Two experimental designs for three fertilizers A, B, and C.]

In clinical trials, age and gender are often used as blocking variables.
We hope to uncover a relationship between the drug dosage and the response of the patient, but it might be the case that this relationship is different for males and females. Of course randomization will likely ensure that the gender breakdown of the two treatment groups is roughly the same, but if we think that gender is an important factor, we can ensure that the gender breakdown is exactly the same. This will help us decide how much of the variation between the treatment groups is due to gender and not the drug. (Note that in this example, gender is a different sort of variable than is drug. We cannot control it and assign subjects randomly to one of the two genders!)

We have only touched on the major issues in the subject of experimental design. There are many considerations beyond what we have described here. But the fundamental principles are the same:

1. Randomize. Randomly assigning individuals to treatments ensures that certain uncontrollable sources of variation are spread equally over the treatments. Further, randomization allows us to use statistical techniques to draw conclusions about the variation in the response variable.

2. Block. Variables that could affect the response or the relationship between the treatment and the response should be controlled if possible. Constructing blocks for the different levels of such a variable allows us to separate out the effects of the treatment variables and the blocking variables.

3. Replicate. Just like a larger random sample is better than a smaller one, assigning many subjects to each treatment allows us to separate out the normal variation in individuals from the variation caused by the treatment variables.

2.8 Exercises

2.1 Read Sections 1 and 2 of SimpleR. (Get this from http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.) Do problems 1, 2, 5, and 6 of Section 2 of SimpleR.

2.2 Load the built-in R dataset chickwts. (Use data(chickwts).)
a) How many individuals are in this dataset?
b) How many variables are in this dataset?
c) Classify the variables as quantitative or categorical.

2.3 The dataset singer comes with the lattice package. Make sure that you have loaded the lattice package and then load that dataset. The dataset contains the heights of 235 singers in the New York Choral Society. Make some comments about the nature of the distribution of heights. Use a histogram to inform those comments.

2.4 The distribution of a quantitative variable is symmetric about m if whenever there are k data values m + d there are also k values of m − d.
a) Show that if a distribution is symmetric about m then m is the median. (You may need to handle separately the cases where the number of values is odd and even.)
b) Show that if a distribution is symmetric about m then m is the mean.
c) Create a small distribution that is not symmetric about m, but the mean and median are both equal to m.

2.5 Describe some situations where the mean or median is clearly a better measure of central tendency than the other.

2.6 We could compute the mean absolute deviation from the median instead of from the mean. Show that the mean absolute deviation from the median is never larger than the mean absolute deviation from the mean.

2.7 Let SS(c) = ∑ (xi − c)². (SS stands for sum of squares.) Show that the smallest value of SS(c) occurs when c = x̄. This shows that the mean is a minimizer of SS. (Hint: use calculus.)
2.8 Sketch a boxplot of a distribution that is positively skewed.

2.9 Show that the total deviation from the mean, defined by

total deviation from the mean = ∑ (xi − x̄)  (the sum running over i = 1, . . . , n),

is 0 for any distribution.

2.10 Find a distribution with 10 values between 0 and 10 that has as large a variance as possible.

2.11 Find a distribution with 10 values between 0 and 10 that has as small a variance as possible.

2.12 Suppose that x1, . . . , xn are the values of some variable and a new variable y is defined by adding a constant c to each xi. In other words, yi = xi + c for all i.
a) How does ȳ compare to x̄?
b) How does Var(y) compare to Var(x)?

2.13 Repeat Problem 2.12 but with y defined by multiplying each xi by c. In other words, yi = cxi for all i.

2.14 The R dataset barley has the yield in bushels/acre of barley for various varieties of barley planted in 1931 and 1932. There are three categorical variables in play: the variety of barley planted, the year of the experiment, and the site at which the experiment was done (the site Grand Rapids is in Minnesota, not Michigan). By examining each of these variables one at a time, make some qualitative statements about the way each variable affected yield. (e.g., did the year in which the experiment was done affect yield?)

2.15 A dataset from the Data and Story Library on the result of three different methods of teaching reading can be found at http://www.calvin.edu/~stob/data/reading.csv. The data includes the results of various pre- and post-tests given to each student. There were 22 students taught by each method. Using the results of POST3, what can you say about the differences in reading ability of the three groups at the end of the course? Would you say that one of the methods is better than the other two? Why or why not?

2.16 The death penalty data illustrated Simpson's paradox. Construct your own illustration to conform to the following story: Two surgeons each perform the same kind of heart surgery. The result of the surgery could be classified as "successful" or "unsuccessful." They have each done exactly 200 surgeries. Surgeon A has a greater rate of success than Surgeon B. Now the surgical patient's case can be classified as either "severe" or "moderate." It turns out that when operating on severe cases, Surgeon B has a greater rate of success than Surgeon A. And when operating on moderate cases, Surgeon B also has a greater rate of success than Surgeon A. By the way, who would you want to be your surgeon?

2.17 Runner's World has an online calculator http://www.runnersworld.com/cda/trainingcalculator/0,7169,s6-238-277-279-0-0-0-0,00.html that can be used to predict a runner's time T2 in a race of distance D2 from the runner's time T1 in a race of distance D1. The formula used by the website is

T2 = T1 (D2/D1)^1.06

Investigate the accuracy of this formula when applied to the men's world record data and report on your findings. Are there any records that are particularly inconsistent with this formula?

2.18 A dataset containing some statistics on all baseball teams for the 1994-1998 baseball seasons is available at http://www.calvin.edu/~stob/data/team.csv. Suppose that you want to predict the number of runs scored (R) by a team just from knowing how many home runs (HR) the team has.
a) Write the linear regression of R on HR.
b) Compute the predicted values for each of the teams. (Use predict(l) in R.) Make some comments on the fit. (For example, are there any values not particularly well-fit?
Do you have any explanations for that?)

2.19 Suppose that we wish to fit a linear model without a constant: i.e., y = bx. Find the value of b that minimizes the sum of squares of residuals in this case. (Hint: there is only one variable here, b, so this is a straightforward Mathematics 161 max-min problem.) R will compute b in this case as well with the command lm(y~x-1). In this expression, 1 stands for the constant term and -1 therefore means leave it out. Alternatively we can write lm(y~x+0).

2.20 Data on the 2003 American League Baseball season is in the file http://www.calvin.edu/~stob/data/al2003.csv. Can we predict the number of wins (W) that a team will have from the number of runs (R) that the team scores?
a) Write W as a linear function of R.
b) A better model takes into account the runs that a team's opponent has scored as well. Write W − L as a function of R − OR (here L is losses and OR is opponents' runs scored). You will have to construct new vectors that have the values of W − L and R − OR. The function lm(W-L~R-OR,data=..) will not work!
c) Why might it make sense, from the meaning of the variables W − L and R − OR, to use a linear model without a constant term as in the previous problem? Write W − L as a linear function of R − OR without a constant term.
d) Compare the results of parts (b) and (c) as to the goodness of fit of the model.

2.21 Find a transformation that transforms the following nonlinear equations y = f(x) (that depend on parameters b0 and b1) to linear equations g(y) = b0' + b1' h(x).
a) y = b0 / (b1 + x)
b) y = b0 + b1 / x
c) y = 1 / (1 + b0 e^(b1 x))

2.22 The R dataset Puromycin gives the rate of reaction (in counts/min/min) as a function of the concentration of an enzyme (in ppm) for two different substrates - one treated with Puromycin and one not treated. The biochemistry suggests that these two variables are related by

rate = b0 conc / (b1 + conc)

Find the least squares estimates of b0 and b1 for the treated condition by both of the methods suggested in this section and compare the sums of squares of residuals.

2.23 Often, we take a sample by some convenient method (a convenience sample) and hope that the sample "behaves like" a random sample. For each of the following convenient methods for sampling Calvin students, indicate in what ways the sample is likely not to be representative of the population of all Calvin students.
a) The students in Mathematics 232A.
b) The students in Nursing 329.
c) The first 30 students who walk into the FAC west door after 12:30 PM today.
d) The first 30 students you meet on the sidewalk outside Hiemenga after 12:30 PM today.
e) The first 30 students named in the bod book.
f) The men's basketball team.

2.24 Suppose that we were attempting to estimate the average height of a Calvin student. For this purpose, which of the convenience samples in the previous problem would you suppose to be most representative?

2.25 Donald Knuth, the famous computer scientist, wrote a book entitled "3:16". This book was a Bible study book that studied the 16th verse of the 3rd chapter of each book of the Bible (that had a 3:16). Knuth's thesis was that a Bible study of random verses of the Bible might be edifying. The sample was of course not a random sample of Bible verses and Knuth had ulterior motives in choosing 3:16. Describe a method for choosing a random sample of 60 verses from the Bible.
Construct a method that is more complicated than simple random sampling that seeks to get a sample representative of all parts of the Bible.

2.26 Suppose that we wish to survey the Calvin student body to see whether the student body favors abolishing the Interim (we could only hope!). Suppose that instead of a simple random sample, we select a random sample of size 20 from each of the five groups of Table 2.2. Suppose that of the 20 students in each group, 9 of the first-year students, 10 of the sophomores, 13 of the juniors, 19 of the seniors and all 20 of the other students favor abolishing the Interim. How would you use these numbers to estimate the proportion of the whole student body that favors abolishing the Interim?

2.27 Consider the set of natural numbers P = {1, 2, . . . , 30} to be a population.
a) How many prime numbers are there in the population?
b) If a sample of size 10 is representative of the population, how many prime numbers would we expect to be in the sample? How many even numbers would we expect to be in the sample?
c) Using R, choose 5 different samples of size 10 from the population P. Record how many prime numbers and how many even numbers are in each sample. Make any comments about the results that strike you as relevant.

2.28 In a clinical trial called "Preemptive Analgesia With OxyContin Versus Placebo Before Surgery for Long Bone Fractures", which is currently being performed at the Rambam Health Care Campus, the researchers are attempting to determine whether pain medication provided before surgery for repair of fractures helps relieve pain (analgesia) after surgery. (Current clinical trials are described at the website http://www.clinicaltrials.gov/. There is also a lot of information at this website on how clinical trials are conducted.)
a) What is the explanatory variable and what is the response variable in this experiment?
b) What are the levels of the explanatory variable (i.e., the treatments) suggested by the title of the experiment?
c) Consider the response variable. How might it be measured? What are some of the difficulties in measuring it?
d) The experimenters want to enroll 80 subjects in the experiment. How do you think they should go about assigning the 80 subjects to the treatments?

2.29 The R dataset CO2 has the results of an experiment on the grass species Echinochloa crus-galli. Look at the data and the help document that accompanies the dataset (to get a description of a dataset use ?CO2).
a) What are the explanatory and response variables in this experiment according to the short description of the data?
b) What variables serve as blocking variables in this experiment?

2.30 Most clinical studies are double-blind randomized comparative experiments. Here blind refers to the fact that the subject does not know which treatment she is getting (e.g., the drug or the placebo) and double-blind refers to the fact that the clinician who is monitoring the response variable also does not know which treatment the patient is getting. Why is it desirable that the experiment be double-blind, if possible?

2.31 In 1957, the Joint Report of the Study Group on Smoking and Health concluded (in Science, vol. 125, pages 1129-1133) that smoking is an important health hazard because it causes an increased risk of lung cancer. However, for many years after that the tobacco industry denied this claim. One of their principal arguments was that the data indicating this relationship came from retrospective observational studies.
(Indeed, the data in the Joint Report came from 16 independent observational studies.)
a) One out of every ten males who smoke at least two packs a day dies of lung cancer. Only one out of every 275 males who do not smoke dies of lung cancer. Explain why the tobacco industry claimed that this does not prove that smoking causes lung cancer in some men.
b) There have been no randomized comparative experiments to investigate the relationship between smoking and lung cancer. Explain why not.
c) Much of the best evidence that smoking causes lung cancer comes from prospective observational studies. Explain why prospective observational studies might help to establish this link.

2.32 Some people claim that it is more difficult to make free-throws in the Calvin Field House while shooting at the south basket than at the north basket. Construct an experiment to test this claim. (You need not perform it!) Use the language of this section to describe the experiment carefully.

3 Probability

3.1 Modelling Uncertainty

Probability theory is the mathematical discipline concerned with modeling situations in which the outcome is uncertain. For example, in random sampling, we do not know which individuals from the population will actually end up in our sample. The basic notion is that of a probability.

Definition 3.1.1 (A probability). A probability is a number meant to measure the likelihood of the occurrence of some uncertain event (in the future).

Definition 3.1.2 (probability). Probability (or the theory of probability) is the mathematical discipline that
1. constructs mathematical models for "real-world" situations that enable the computation of probabilities ("applied" probability)
2. develops the theoretical structure that undergirds these models ("theoretical" or "pure" probability).

The setting in which we make probability computations is that of a random process. (What we call a random process is usually called a random experiment in the literature, but we use process here so as not to get the concept confused with that of a randomized experiment.) A random process has three key characteristics:
1. A random process is something that is to happen in the future (not in the past). We can only make probability statements about things that have not yet happened.
2. The outcome of the process could be any one of a number of outcomes, and which outcome will obtain is uncertain.
3. The process could be repeated indefinitely (under essentially the same circumstances), at least in theory.

Historically, some of the basic random processes that were used to develop the theory of probability were those originating in games of chance. Tossing a coin or dealing a poker hand from a well-shuffled deck are examples of such processes. For our purposes the two most important random processes are producing a random sample from a population and assigning subjects randomly to the treatments of a randomized comparative experiment. Essentially all the probability statements that we want to make in statistics come from these two situations (and their cousins).

Given a random process, the set (collection) of all possible outcomes will be referred to as the sample space. An event is simply a set of some of the outcomes. These two fundamental notions are illustrated by the following example of random sampling.

Example 3.1.1 Twenty-nine students are in a certain statistics class. It is decided to choose a simple random sample of 5 of the students. There are a boatload of possible outcomes.
(It can be shown that there are 118,755 different samples of 5 students out of 29.) One event of interest is the collection of all outcomes in which all 5 of the students are male. Suppose that 25 of the students in the class are male. Then it can be shown that 53,130 of the outcomes comprise this event.

Given a random process, our goal is to assign to each event E a number P(E) (called the probability of E) such that P(E) measures in some way the likelihood of E. In order to assign such numbers, however, we need to understand what they are intended to measure. Interpreting probability computations is fraught with all sorts of philosophical issues, but it is not too great a simplification at this stage to distinguish between two different interpretations of probability statements.

The frequentist interpretation. The probability of an event E, P(E), is the limit of the relative frequency with which E occurs in repeated trials of the process as the number of trials approaches infinity.

The subjectivist interpretation. The probability of an event E, P(E), is an expression of how confident the assignor is that the event will happen in the next trial of the process.

It is easy to think of examples of probability statements in the real world that are more naturally interpreted using either of these interpretations rather than the other. In this text, we will usually phrase our interpretations of probability statements using the frequentist interpretation. Mathematics cannot tell us which of these two interpretations is right or indeed how to assign probabilities in any particular situation. But mathematicians have developed some basic axioms to constrain our choice of probabilities. The three fundamental axioms of probability are

Axiom 3.1.3. For all events A, P(A) ≥ 0.

Axiom 3.1.4. P(S) = 1 (here S denotes the sample space).

Axiom 3.1.5. If A1 and A2 are disjoint events (i.e., have no outcomes in common) then P(A1 or A2) = P(A1) + P(A2).

If one interprets probabilities as limiting relative frequencies, it is easy to see that these three axioms should be true. The axioms do not tell us how to assign the probabilities in any particular case. They only provide some minimal constraints on this assignment. There are two important methods for assigning probabilities that we will use extensively.

The equally likely outcomes model

In some cases, we can list all the outcomes in such a way that it is plausible to suppose that each outcome is equally likely. For example, the very definition of choosing a random sample of size 5 from a class of 29 requires us to develop a method so that each sample of size 5 is equally likely to occur. In this case, it is easy to compute the probability of an event E. If there are N equally likely outcomes, the probability of each outcome should be 1/N. The probability of an event E is k/N where there are k outcomes in the event.

Example 3.1.2 A six-sided die is rolled. Then one of six possible outcomes occurs. From the symmetry of the die it is reasonable to assume that the six outcomes are equally likely. Therefore, the probability of each outcome is 1/6. If E is the event that is described by "the die comes up 1 or 2" then P(E) = 2/6 = 1/3.

In the more interesting and more useful Example 3.1.1, there are 118,755 possible different samples of five students from 29, and by the definition of simple random sampling these samples are equally likely to occur. Since 53,130 of these comprise the event E of getting all males in the sample, the probability of this event is 53130/118755 = 44.7%.
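The counts in Example 3.1.1 are binomial coefficients, so R can verify them, and the resulting probability, directly (a quick check):

choose(29, 5)                   # the number of possible samples: 118755
choose(25, 5)                   # the number of all-male samples: 53130
choose(25, 5) / choose(29, 5)   # probability of an all-male sample, about 0.447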
Example 3.1.3 Perhaps the canonical historical example of a random process for which it is possible to generate a list of equally likely outcomes is the process in which two dice are thrown and the number on each face is recorded. It is easy to see that there are 36 equally likely outcomes (list the pairs (i, j) of numbers where i is the number on the first die, j is the number on the second die, and i and j range from 1 to 6). One event related to this process is the event E that the throw results in a sum of 7 on the two dice. It is easy to see that there are 6 outcomes in E so that P(E) = 6/36 = 1/6.

Past performance as an indicator of the future

In some cases, we have data on many previous trials of the process. In this case we may estimate the probability of each outcome by the relative frequency with which it occurred in the previous trials. This method is used extensively in the insurance industry. For example, the probability that a male alive on his 55th birthday dies before his 56th is currently estimated to be 0.0081, or slightly less than 1%, based on the recent history of 55 year old males.

Example 3.1.4 In the 2007 baseball season, Manny Ramirez came to the plate 569 times. Of those 569 times, he had 89 singles, 33 doubles, 1 triple, 20 homeruns, 78 walks (and hit by pitch), and 348 outs. We might estimate the probability that Ramirez will hit a homerun in his next plate appearance to be 20/569 = .035.

For the purpose of investigating how random processes work, it is very useful to use R. In the following example, we simulate one, and then five, of Manny Ramirez's plate appearances.

> outcomes=c('Out','Single','Double','Triple','Homerun','Walk')
> ramirez=c(348,89,33,1,20,78)/569
> sum(ramirez)
[1] 1
> ramirez
[1] 0.611599297 0.156414763 0.057996485 0.001757469 0.035149385 0.137082601
> sample(outcomes,1,prob=ramirez)
[1] "Double"
> sample(outcomes,5,prob=ramirez,replace=T)
[1] "Out"    "Double" "Out"    "Out"    "Walk"

In the next example, we simulate the tossing of a coin 1,000 times. The graph provides some evidence that the limiting relative frequency of "Heads" is 0.5.

> coins=sample(c('H','T'),1000,replace=T)
> cumfrequency = cumsum(coins=='H')/c(1:1000)
> plot(cumfrequency,type='l')

[Figure: cumulative relative frequency of heads over the 1,000 tosses, settling down near 0.5.]

3.2 Discrete Random Variables

3.2.1 Random Variables

If the outcomes of a random process are numbers, we will call the random process a random variable. Since non-numerical outcomes can always be coded with numbers, restricting our attention to random variables results in no loss of generality. We will use upper-case letters to name random variables (X, Y, etc.) and the corresponding lower-case letters (x, y, etc.) to denote the possible values of the random variable. Then we can describe events by equalities and inequalities so that we can write such things as P(X = 3), P(Y = y) and P(Z ≤ z). Some examples of random variables include

1. Choose a random sample of size 12 from 250 boxes of Raisin Bran. Let X be the random variable that counts the number of underweight boxes and let Y be the random variable that is the average weight of the 12 boxes.
2. Choose a Calvin senior at random. Let Z be the GPA of that student and let U be the composite ACT score of that student.
3. Assign 12 chicks at random to two groups of six and feed each group a different feed.
Let D be the difference in average weight between the two groups.
4. Throw a fair die until all six numbers have appeared. Let T be the number of throws necessary.

We will consider two types of random variables, discrete and continuous.

Definition 3.2.1 (discrete random variable). A random variable X is discrete if its possible values can be listed x1, x2, x3, . . . .

In the examples above, the random variables X, U, and T are discrete random variables. Note that the possible values for X are 0, 1, . . . , 12 but that T has infinitely many possible values 1, 2, 3, . . . . The random variables Y, Z, and D above are not discrete. The random variable Z (GPA), for example, can take on all values between 0.00 and 4.00. (We should make the following caveat here however. All variables are discrete in the sense that there are only finitely many different measurements possible to us. Each measurement device that we use has divisions only down to a certain tolerance. Nevertheless it is usually more helpful to view these measurements as on a continuous scale rather than a discrete one. We learned that in calculus.)

Definition 3.2.2 (continuous random variable). A random variable X is continuous if its possible values are all x in some interval of real numbers.

In this section, we focus on properties of discrete random variables.

Example 3.2.1 Two dice are thrown and the sum X of the numbers appearing on their faces is recorded. X is a random variable with possible values 2, 3, . . . , 12. By using the equally likely outcomes method we can see that P(X = 7) = 1/6 and P(X ≤ 5) = 5/18.

If X is a discrete random variable, we will be able to compute the probability of any event defined in terms of X if we know all the possible values of X and the probability P(X = x) for each such value x.

Definition 3.2.3 (probability mass function). The probability mass function (pmf) of a random variable X is the function f such that for all x, f(x) = P(X = x).

We will sometimes write fX to denote the probability mass function of X when we want to make it clear which random variable is in question. The word mass is not arbitrary. It is convenient to think of probability as a unit mass that is divided into point masses at each possible outcome. The mass of each point is its probability. Note that mass obeys the probability axioms.

Example 3.2.2 Suppose that a student is chosen at random from the Calvin student body. We will code the class of the student by 1, 2, 3, 4 for the four standard classes and 5 for other. The coded class is a random variable X. Referring to Table 2.2, we see that the probability mass function of X is given by f(1) = 0.27, f(2) = 0.24, f(3) = 0.21, f(4) = 0.25, f(5) = 0.03, and f(x) = 0 otherwise.

[Figure 3.1: The probability histogram for the Calvin class random variable.]

One useful way of picturing a probability mass function is by a probability histogram. For the mass function in Example 3.2.2, we have the corresponding histogram in Figure 3.1. On the frequentist interpretation of probability, if we repeat the random process many times, the histogram of the results of those trials should approximate the probability histogram. The probability histogram is not a histogram of data from many trials however. It is a representation of what might happen in the next trial.
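A minimal simulation sketch of this point, using the class probabilities of Example 3.2.2: the relative frequencies from many simulated students settle down near the pmf.

classes <- 1:5
probs <- c(0.27, 0.24, 0.21, 0.25, 0.03)
many <- sample(classes, 10000, replace = TRUE, prob = probs)
table(many) / 10000    # relative frequencies, close to the probabilities above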
In other words, given a histogram of data obtained from successive trials of a random process, we will choose the pmf to fit the data. Of course, we might not ask for a perfect fit; instead we will choose a pmf f that fits the data approximately but has some simple form.

Several families of random variables are particularly important to us and provide models for many real-world situations. We examine two such families here. Each arises from a common kind of random process that will be important for statistical inference. The second of these arises from the very important case of simple random sampling from a population. We will first study a somewhat different case (which, among other uses, can be used to study sampling with replacement).

3.2.2 The Binomial Distribution

A binomial process is a process characterized by the following conditions:
1. The process consists of a sequence of finitely many (n) trials of some simpler process.
2. Each trial results in one of two possible outcomes, usually called success (S) and failure (F).
3. The probability of success on each trial is a constant denoted by π.
4. The trials are independent of one another; that is, the outcome of one trial does not affect the outcome of any other.
Thus a binomial process is characterized by two parameters, n and π. Given a binomial process, the natural random variable to observe is the number of successes.

Definition 3.2.4 (binomial random variable). Given a binomial process, the binomial random variable X associated with this process is defined to be the number of successes in the n trials of the process. If X is a binomial random variable with parameters n and π, we write X ∼ Binom(n, π).

Example 3.2.3
The following are all natural examples of binomial random variables.
1. A fair coin is tossed n = 10 times with the probability of a HEAD (success) being π = .5. X is the number of heads.
2. A basketball player shoots n = 25 freethrows with the probability of making each freethrow being π = .70. Y is the number of made freethrows.
3. A quality control inspector tests the next n = 12 widgets off the assembly line, each of which has a probability of 0.10 of being defective. Z is the number of defective widgets.
4. Ten Calvin students are randomly sampled with replacement. W is the number of males in the sample.

The probability mass function for a binomial distribution is given in the following theorem.

Theorem 3.2.5 (The Binomial Distribution). Suppose that X is a binomial random variable with parameters n and π. The pmf of X is given by

fX(x; n, π) = (n choose x) π^x (1 − π)^(n−x) = [n!/(x!(n − x)!)] π^x (1 − π)^(n−x)

Note the use of the semicolon in the definition of fX in the theorem. We will use a semicolon to separate the possible values of the random variable (x) from the parameters (n, π). For any particular binomial experiment, n and π are fixed. If n and π are understood, we might write fX(x) for fX(x; n, π).

For all but very small n, computing f by hand is tedious. We will use R to do this. Besides computing the mass function, R can be used to compute the cumulative distribution function FX, which is the useful function defined in the next definition.

Definition 3.2.6 (cumulative distribution function).
If X is any random variable, the cumulative distribution function of X (cdf) is the function FX given by FX (x) = P (X ≤ x) = 310 X y≤x fX (y) 3.2 Discrete Random Variables We will usually use the convention that the pmf of X is named by a lower-case letter (usually fX ) and the cdf by the corresponding upper-case letter (usually FX ). The R functions to compute the cdf, pdf, and also to simulate binomial processes are as follows if X ∼ Binom(n, π). function (& parameters) explanation rbinom(n,size,prob) makes n random draws of the random variable X and returns them in a vector. dbinom(x,n,size,prob) returns P(X = x) (the pmf). pbinom(q,n,size,prob) returns P(X ≤ q) (the cdf). Suppose that a manufacturing process produces defective parts with probability π = .1. If we take a random sample of size 10 and count the number of defectives X, we might assume that X ∼ Binom(10, 0.1). Some examples of R related to this situation are as follows. > defectives=rbinom(n=30, size=10,prob=0.1) > defectives [1] 2 0 2 0 0 0 0 2 0 1 1 1 0 0 2 2 3 1 1 2 1 1 0 2 0 1 1 0 1 1 > table(defectives) defectives 0 1 2 3 11 11 7 1 > dbinom(c(0:4),size=10,prob=0.1) [1] 0.34867844 0.38742049 0.19371024 0.05739563 0.01116026 > dbinom(c(0:4),size=10,prob=0.1)*30 # pretty close to table [1] 10.4603532 11.6226147 5.8113073 1.7218688 0.3348078 > pbinom(c(0:5),size=10,prob=0.1) # same as cumsum(dbinom(...)) [1] 0.3486784 0.7360989 0.9298092 0.9872048 0.9983651 0.9998531 > It is important to note that • R uses size for the number of trials (what we have called n) and n for the number of random draws. 311 3 Probability • pbinom() gives the cdf not the pdf. Reasons for this naming convention will become clearer later. • There are similar functions in R for many of the distributions we will encounter, and they all follow a similar naming scheme. We simply replace binom with the R-name for a different distribution. 3.2.3 The Hypergeometric Distribution The hypergeometric distribution arises from considering the situation of random sampling from a population in which there are just two types of individuals. (That is there is a categorical variable defined on the population with just two levels.) It is traditional to describe the distribution in terms of the urn model. Suppose that we have an urn with two different colors of balls. There are m white balls and n black balls. Suppose we choose k balls from the urn in such a way that every set of k balls is equally likely to be chosen (i.e., a random sample of balls) and count the number X of white balls. We say that X has the hypergeometric distribution with parameters m, n, and k and write X ∼ Hyper(m, n, k). Example 3.2.4 Remember our class of 29 intrepid souls, 25 of whom are male. Let’s call the females the white balls and the males the black balls. Recall that for some reason we wanted a sample of size 5. Let X be the number of females in our sample. Then X ∼ Hyper(4, 25, 5). There is a simple formula for the pmf of the hypergeometric distribution. This formula comes from careful counting of the equally likely outcomes. m n x k−x m+n k fX (x) = . R knows the hypergeometric distribution and the syntax is exactly the same as for the binomial distribution (except that the names of the parameters have changed). 312 3.3 Continuous Random Variables function (& parameters) explanation rhyper(nn,m,n,k) makes nn random draws of the random variable X and returns them in a vector. dhyper(x,m,n,k) returns P(X = x) (the pmf). phyper(q,m,n,k) returns P(X ≤ q) (the cdf). 
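It is worth checking that dhyper() really is computing the counting formula above. For the setting of Example 3.2.4 (m = 4 females, n = 25 males, a sample of k = 5), the formula can be evaluated directly with choose(); the result matches the dhyper() output in the computations that follow.

> x=c(0:5)
> choose(4,x)*choose(25,5-x)/choose(29,5)   # the counting formula evaluated by hand
[1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175
[6] 0.0000000000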
Some interesting computations related to Example 3.2.4 are below. > dhyper(x=c(0:5),m=4,n=25,k=5) [1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175 [6] 0.0000000000 > dhyper(x=c(0:5),k=5,m=4,n=25) # order of named arguments does not matter [1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175 [6] 0.0000000000 > phyper(q=c(0:5),m=4,n=25,k=5) [1] 0.4473917 0.8734790 0.9896846 0.9997895 1.0000000 1.0000000 > rhyper(nn=30,m=4,n=25,k=5) # note nn for number of random outcomes [1] 2 1 1 1 1 2 2 2 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 1 2 0 0 0 > dhyper(0:5,4,25,5) # default order of unnamed arguments [1] 0.4473916888 0.4260873226 0.1162056334 0.0101048377 0.0002105175 [6] 0.0000000000 > 3.3 Continuous Random Variables Recall that a continuous random variable X is one that can take on all values in an interval of real numbers. For example, the height of a randomly chosen Calvin student in inches could be any real number between, say, 36 and 80. Of course all continuous random variables are idealizations. If we measure heights to the nearest quarter inch, there are only finitely many possibilities for this random variable and we could, in principle, treat it as discrete. We know from calculus however that treating measurements as continuous valued functions often simplifies rather than complicates our techniques. In order to understand what kinds of probability statements that we would like to make about continuous random variables, it is helpful to keep in mind this idea of the finite precision of our measurements however. For example, a statement that a randomly chosen individual is 72 inches tall is 313 3 Probability 1.0 0.8 0.6 0.8 0.6 0.4 0.6 0.4 0.4 0.2 0.2 0.0 0.2 0.0 0 2 4 6 Time 8 0.0 0 2 4 6 8 Time 0 2 4 6 8 Time Figure 3.2: Discretized pmf for T . not a claim that the individual is exactly 72 inches tall but rather a claim that the height of the individual is in some small interval (maybe 71 34 to 72 14 if we are measuring to the nearest half inch). So probabilities of the form P (X = x) are not meaningful. Rather the appropriate probability statements will be of the form P (a ≤ X ≤ b). 3.3.1 pdfs and cdfs Recall the analogy of probability and mass. In the case of discrete random variables, we represented the probability P(X = x) by a point of mass P(X = x) at the point x and had total mass 1. In this case mass is continuous and the appropriate weighting of mass is a density function. In the following example, we can see how this works. Example 3.3.1 A Geiger counter emits a beep when a radioactive particle is detected. The rate of beeping determines how radioactive the source is. Suppose that we record the time T to the next beep. It turns out that T behaves like a random variable. Suppose that we measured T with increasing precision. We might get histograms that look like those in Figure 3.2 for the pmf of T . It’s pretty obvious that we want to replace these histograms by a smooth curve. In fact the pictures should remind us of the pictures drawn for the Riemann sums that define the integral. The analogue to a probability mass function for a continuous variable is a probability density function. Definition 3.3.1 (probability density function, continuous random variable). A probability density function (pdf) is a function f such that 314 3.3 Continuous Random Variables • f (x) ≥ 0 for all real numbers x, and • R∞ −∞ f (x) dx = 1. The continuous random variable X defined by the pdf f satisfies P(a ≤ X ≤ b) = Z b f (x) dx a for any real numbers a ≤ b. 
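R can check these two requirements numerically with integrate(), which approximates definite integrals. Here is a minimal sketch using a small made-up density (not one of the families studied below): f(x) = x/2 on [0, 2] and 0 elsewhere. The function is nonnegative, its total integral is 1, and the same kind of call computes P(a ≤ X ≤ b).

> f=function(x) x/2          # an illustrative density on [0,2]; zero elsewhere
> integrate(f,0,2)           # total area under f: equals 1, so f is a pdf
1 with absolute error < ...
> integrate(f,0.5,1.5)       # P(0.5 <= X <= 1.5)
0.5 with absolute error < ...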
The following simple lemma demonstrates one way in which continuous random variables are very different from discrete random variables. Lemma 3.3.2. Let X be a continuous random variable with pdf f . Then for any a ∈ R, 1. P(X = a) = 0, 2. P(X < a) = P(X ≤ a), and 3. P(X > a) = P(X ≥ a). Z Proof. a a f (x) dx = 0 . And P(X ≤ a) = P(X < a) + P(X = a) = P(X < a). Example 3.3.2 ( 3x2 Q. Consider the function f (x) = 0 P(X ≤ 1/2). x ∈ [0, 1] Show that f is a pdf and calculate otherwise. A. Let’s begin be looking at a plot of the pdf. 315 0.0 1.0 f (x) 2.0 3.0 3 Probability 0.0 0.2 0.4 0.6 0.8 1.0 x The rectangular region of the plot has an area of 3, so it is plausible that the area under the graph of the pdf is 1. We can verify this by integration. Z ∞ Z 1 1 f (x) dx = 3x2 dx = x3 0 = 1 , −∞ so f is a pdf and P(X ≤ 1/2) = 0 R 1/2 0 1/2 3x2 dx = x3 0 = 1/8. The cdf of a continuous random variable is defined the same way as it was for a discrete random variable, but we use an integral rather than a sum to get the cdf from the pdf in this case. Definition 3.3.3 (cumulative distribution function). Let X be a continuous random variable with pdf f , then the cumulative distribution function (cdf) for X is Z x F (x) = P(X ≤ x) = f (t) dt . −∞ Example 3.3.3 Q. Determine the cdf of the random variable from Example 3.3.2. A. For any x ∈ [0, 1], FX (x) = P(X ≤ x) = 316 Z 0 x x 3t2 dt = t3 0 = x3 . 3.3 Continuous Random Variables So 0 FX (x) = x3 1 x ∈ [−∞, 0) x ∈ [0, 1] x ∈ (1, ∞) . Notice that the cdf FX is an antiderivative of the pdf fX . This follows immediately from the Fundamental Theorem of Calculus. Notice also that P(a ≤ X ≤ b) = F (b) − F (a). Lemma 3.3.4. Let FX be the cdf of a continuous random variable X. Then the pdf fX satisfies fX (x) = d FX (x) . dx Just as the binomial and hypergeometric distributions were important families of discrete random variables, there are several important families of continuous random variables that are often used as models of real-world situations. We investigate a few of these in the next three subsections. 3.3.2 Uniform Distributions The continuous uniform distribution has a pdf that is constant on some interval. Definition 3.3.5 (uniform random variable). A continuous uniform random variable on the interval [a, b] is the random variable with pdf given by ( 1 x ∈ [a, b] f (x; a, b) = b−a 0 otherwise. It is easy to confirm that this function is indeed a pdf. We could integrate, or we could simply use geometry. The region under the graph of the uniform pdf is a rectangle with 1 width b − a and height b−a , so the area is 1. Example 3.3.4 317 3 Probability Q. Let X be uniform on [0, 10]. What is P(X > 7)? What is P(3 ≤ X < 7)? A. Again we argue geometrically. P(X > 7) is represented by a rectangle with base from 7 to 10 along the x-axis and a height of .1, so P(X > 7) = 3 · 0.1 = 0.3. Similarly P(3 ≤ X < 7) = 0.4. In fact, for any interval of width w contained in [0, 10], the probability that X falls in that particular interval is w/10. We could also compute these results by integrating, but this would be silly. Example 3.3.5 Q. Let X be uniform on the interval [0, 1] (which we denote X ∼ Unif(0, 1)) what is the cdf for X? Rx A. For x ∈ [0, 1], FX (x) = 0 1 dx = x, so 0 FX (x) = x 1 x ∈ (∞, 0) x ∈ [0, 1] x ∈ (1, ∞) . F (x) 0.4 0.0 0.4 0.0 f (x) 0.8 cdf for Unif(0,1) 0.8 pdf for Unif(0,1) 0.0 0.5 1.0 x 1.5 2.0 0.0 0.5 1.0 1.5 2.0 x Although it has a very simple pdf and cdf, this random variable actually has several important uses. 
One such use is related to random number generation. Computers are not able to generate truly random numbers. Algorithms that attempt to simulate randomness are called pseudo-random number generators. X ∼ Unif(0, 1) is a model for an idealized random number generator. Computer scientists compare the behavior of a pseudo-random number generator with the behavior that would be expected for X to test the quality of the pseudo-random number generator.

There are R functions for computing the pdf and cdf of a uniform random variable as well as a function to return random numbers. An additional function computes the quantiles of the uniform distribution. If X ∼ Unif(min, max), the following functions can be used.

function (& parameters)   explanation
runif(n,min,max)          makes n random draws of the random variable X and returns them in a vector.
dunif(x,min,max)          returns fX(x) (the pdf).
punif(q,min,max)          returns P(X ≤ q) (the cdf).
qunif(p,min,max)          returns x such that P(X ≤ x) = p.

Here are examples of computations for X ∼ Unif(0, 10).

> runif(6,0,10)    # 6 random values on [0,10]
[1] 5.449745 4.124461 3.029500 5.384229 7.771744 8.571396
> dunif(5,0,10)    # pdf is 1/10
[1] 0.1
> punif(5,0,10)    # half the distribution is below 5
[1] 0.5
> qunif(.25,0,10)  # 1/4 of the distribution is below 2.5
[1] 2.5

3.3.3 Exponential Distributions

In Example 3.3.1 we considered a “waiting time” random variable, namely the waiting time until the next radioactive event. Waiting times are important random variables in reliability studies. For example, a common characteristic of a manufactured object is MTF or mean time to failure. The model often used for the Geiger counter random variable is the exponential distribution. Note that a waiting time can be any x in the range 0 ≤ x < ∞.

Definition 3.3.6 (The exponential distribution). The random variable X has the exponential distribution with parameter λ > 0 (X ∼ Exp(λ)) if X has the pdf

fX(x) = λe^(−λx) for x ≥ 0, and fX(x) = 0 for x < 0.

It is easy to see that the function fX of the previous definition is a pdf for any value of λ. R refers to the value of λ as the rate, so the appropriate functions in R are rexp(n,rate), dexp(x,rate), pexp(x,rate), and qexp(p,rate). We will see later that rate is an apt name for λ, as λ will be the rate per unit time if X is a waiting time random variable.

Example 3.3.6
Suppose that a random variable T measures the time until the next radioactive event is recorded at a Geiger counter (time measured since the last event). For a particular radioactive material, a plausible model for T is T ∼ Exp(0.1) where time is measured in seconds. Then the following R session computes some important values related to T.

> pexp(q=0.1,rate=.1)   # probability waiting time less than .1
[1] 0.009950166
> pexp(q=1,rate=.1)     # probability waiting time less than 1
[1] 0.09516258
> pexp(q=10,rate=.1)
[1] 0.6321206
> pexp(q=20,rate=.1)
[1] 0.8646647
> pexp(100,rate=.1)
[1] 0.9999546
> pexp(30,rate=.1)-pexp(5,rate=.1)   # probability waiting time between 5 and 30
[1] 0.5567436
> qexp(p=.5,rate=.1)    # probability is .5 that T is less than 6.93
[1] 6.931472

The graphs in Figure 3.3 are the pdf and cdf of this random variable. All exponential distributions look the same except for the scale. The rate of 0.1 here means that we can expect that in the long run this process will average 0.1 counts per second.

3.3.4 Weibull Distributions

A very important generalization of the exponential distributions is the family of Weibull distributions.
They are often used by engineers to model phenomena such as failure, manufacturing or delivery times. They have also been used for applications as diverse as fading in wireless 320 3.3 Continuous Random Variables 0.08 0.8 0.06 0.6 y 1.0 y 0.10 0.04 0.4 0.02 0.2 0.0 0.00 0 10 20 30 x 40 50 0 10 20 30 40 50 x Figure 3.3: The pdf and cdf of the random variable T ∼ Exp(0.1). communications channels and wind velocity. The Weibull is a two-parameter family of distributions. The two parameters are a shape parameter α and a scale parameter λ. Definition 3.3.7 (The Weibull distributions). The random variable X has a Weibull distribution with shape parameter α > 0 and scale parameter β > 0 (X ∼ Weib(α, β)) if the pdf of X is ( α α−1 e−(x/β)α x ≥ 0 βα x fX (x; α, β) = 0 x<0 Notice that if X ∼ Weib(1, λ) then X ∼ Exp(1/λ). Varying α in the Weibull distribution changes the shape of the distribution while changing β changes the scale. The effect of fixing β (β = 5) and changing α (α = 1, 2, 3) is illustrated by the first graph in Figure 3.4 while the second graph shows the effect of changing β (β = 1, 3, 5) with α fixed at α = 2. The appropriate R functions to compute with the Weibull distribution are dweibull(x,shape,scale), pweibull(q,shape,scale), etc. Example 3.3.7 The Weibull distribution is sometimes used to model the maximum wind velocity measured during a 24 hour period at a specific location. The dataset http://www. calvin.edu/~stob/data/wind.csv gives the maximum wind velocity at the San Diego airport on each of 6,209 consecutive days. It is claimed that the maximum wind velocity measured on a day behaves like a random variable W that has a Weibull distribution 321 0.2 0.4 y21 0.10 0.0 0.00 0.05 y35 0.15 0.6 0.20 0.8 3 Probability 0 2 4 6 8 10 0 2 x 4 6 8 10 x Figure 3.4: Left: fixed β. Right: fixed α. with α = 3.46 and β = 16.90. The R code below investigates that model using this past data. (In fact, this model is not a very good one although the output below suggests that it might be plausible.) > w$Wind [1] 14 11 10 13 11 11 26 21 14 13 10 10 13 10 13 13 12 12 13 17 11 11 13 25 15 [26] 18 13 17 12 14 15 10 16 17 17 13 18 14 12 20 11 14 20 16 12 14 18 17 13 16 [51] 13 16 11 13 11 15 13 15 16 18 14 15 15 14 14 16 15 18 14 16 14 10 17 14 12 ............. > cutpts=c(0,5,10,15,20,25,30) > table(cut(w$Wind,cutpts)) (0,5] (5,10] (10,15] (15,20] (20,25] (25,30] 2 434 3303 1910 409 95 > length(w$Wind[w$Wind<12.5])/6209 [1] 0.2728298 # 27.3% days with max windspeed less than 12.5 > pweibull(12.5,3.46,16.9) [1] 0.2968784 # 29.7% predicted by Weibull model > length(w$Wind[w$Wind<22.5])/6209 [1] 0.951361 > pweibull(22.5,3.46,16.9) [1] 0.9322498 > simulation=rweibull(100000,3.46,16.9) # 100,000 simulated days 322 3.4 Mean and Variance of a Random Variable > mean(simulation) [1] 15.18883 > mean(w$Wind) [1] 15.32405 > sd(simulation) [1] 4.85144 > sd(w$Wind) [1] 4.239603 > # simulated days have mean about the same as actual # simulated days have greater variation 3.4 Mean and Variance of a Random Variable Just as numerical summaries of a data set can help us understand our data, numerical summaries of the distribution of a random variable can help us understand the behavior of that random variable. In this section we develop two of the most important numerical summaries of random variables: mean and variance. In each case, we will use our experience with data to help us develop a definition. 3.4.1 The Mean of a Discrete Random Variable Example 3.4.1 Q. Let’s begin with a motivating example. 
Suppose a student has taken 10 courses and received 5 A’s, 4 B’s and 1 C. Using the traditional numerical scale where an A is worth 4, a B is worth 3 and a C is worth 2, what is this student’s GPA (grade point average)? A. The first thing to notice is that 4+3+2 = 3 is not correct. We cannot simply add up 3 the values and divide by the number of values. Clearly this student should have GPA that is higher than 3.0, since there were more A’s than C’s. Consider now a correct way to do this calculation and some algebraic reformulations 323 3 Probability of it. GPA = 4+4+4+4+4+3+3+3+3+2 5·4+4·3+1·2 = 10 10 5 4 1 = ·4+ ·3+ ·2 10 10 10 5 4 1 =4· +3· +2· 10 10 10 = 3.4 Our definition of the mean of a random variable follows the example above. Notice that we can think of the GPA as a sum of terms of the form (grade)(proportion of students getting that grade) . Since the limiting proportion of outcomes that have a particular value is the probability of that value, we are led to the following definition. Definition 3.4.1 (mean). Let X be a discrete random variable with pmf f . The mean (also called expected value) of X is denoted as µX or E(X) and defined by X µX = E(X) = x · f (x) . x The sum is taken over all possible values of X. Example 3.4.2 Q. If we flip four fair coins and let X count the number of heads, what is E(X)? A. If we flip four fair coins and let X count the number of heads, then the distribution of X is described by the following table. (Note that X ∼ Binom(4, .5).) value of X probability 324 0 1 16 1 4 16 2 6 16 3 4 16 4 1 16 3.4 Mean and Variance of a Random Variable So the expected value is 0· 1 4 6 4 1 +1· +2· +3· +4· =2 16 16 16 16 16 On average we get 2 heads in 4 tosses. This is certainly in keeping with our informal understanding of the word average. More generally, the mean of a binomial random variable is found by the following Theorem. Theorem 3.4.2. Let X ∼ Binom(n, π). Then E(X) = nπ. Similarly, the mean of a hypergeometric random variable is just what we think it should be. Theorem 3.4.3. Let X ∼ Hyper(m, n, k). Then E(X) = km/(m + n). The following example illustrates the computation of the mean for a hypergeometric random variable. > x=c(0:5) > p=dhyper(x,m=4,n=25,k=5) > sum(x*p) [1] 0.6896552 > 4/29 * 5 [1] 0.6896552 3.4.2 The Mean of a Continuous Random Variable If we think of probability as mass, then the expected value for a discrete random variable X is the center of mass of a system of point masses where a mass fX (x) is placed at each possible value of X. The expected value of a continuous random variable should also be the center of mass where the pdf is now interpreted as density. 325 3 Probability Definition 3.4.4 (mean). Let X be a continuous random variable with pdf f . The mean of X is defined by Z ∞ µX = E(X) = xf (x) dx . −∞ Example 3.4.3 ( 3x2 Recall the pdf in Example 3.3.2: f (x) = 0 E(X) = Z 0 1 x ∈ [0, 1] . Then otherwise. x · 3x2 dx = 3/4 . The value 3/4 seems plausible from the graph of f . We compute the mean of two of our favorite continuous random variables in the next Theorem. Theorem 3.4.5. 1. If X ∼ Unif(a, b) then E(X) = (a + b)/2. 2. If X ∼ E(λ) then E(X) = 1/λ. Proof. The proof of each of these is a simple integral. These are left to the reader. Our intuition tells us that in a large sequence of trials of the random process described by X, the sample mean of the observations should be usually be close the mean of X. This is in fact true and is known as the Law of Large Numbers. 
We will not state that law precisely here but we will illustrate it using several simulations in R. > r=rexp(100000,rate=1) > mean(r) [1] 0.9959467 > r=runif(100000,min=0,max=10) > mean(r) [1] 5.003549 326 # should be 1 # should be 5 3.4 Mean and Variance of a Random Variable > r=rbinom(100000,size=100,p=.1) > mean(r) [1] 9.99755 > r=rhyper(100000,m=10,n=20,k=6) > mean(r) [1] 1.99868 # should be 10 # should be 2 3.4.3 Transformations of Random Variables After collecting data, we often transform it. That is we apply some function to all the data. For example, we saw the value of using a logarithmic transformation to linearize some bivariate relationships. Now consider the notion of transforming a random variable. Definition 3.4.6 (transformation). Suppose that t is a function defined on all the possible values of the random variable X. Then the random variable t(X) is the random variable that has outcome t(x) whenever x is the outcome of X. If the random variable Y is defined by Y = t(X), then Y itself has an expected value. To find the expected value of Y , we would need to find the pmf or pdf of Y , fY (y), and then use the definition of E(Y ) to compute E(Y ). There is an easier way to compute E(t(X)) however which is given in the following lemma. Lemma 3.4.7. If X is a random variable (discrete or continuous) and t a function defined on the values of X, then if Y = t(X) and X has pdf (pmf) fX (P t(x)fX (x) if X is discrete E(Y ) = R ∞x −∞ t(x)f (x) dx if X is continuous . We will not give the proof but it is easy to see that this lemma should be so (at least for the discrete case) by looking at an example. Example 3.4.4 Let X be the result of tossing a fair die. X has possible outcomes 1, 2, 3, 4, 5, 6. Let Y be the random variable |X − 2|. Then the lemma gives E(Y ) = 6 X x=1 |x − 2| · 1 1 1 1 1 1 11 1 =1· +0· +1· +2· +3· +4· = . 6 6 6 6 6 6 6 6 327 3 Probability But if we can also compute E(Y ) directly from the definition. Noting that the possible values of Y are 0, 1, 2, 3, 4, we have E(Y ) = 4 X y=0 yfY (y) = 0 · 1 2 1 1 1 11 +1· +2· +3· +4· = . 6 6 6 6 6 6 The sum that computes E(Y ) is clearly the same sum as E(X) but in a “different order” and with some terms combined since there are more than one x that produce a given value of Y . Example 3.4.5 Suppose that X ∼ Unif(0, 1) and that Y = X 2 . Then Z 1 E(Y ) = x2 · 1 dx = 1/3 . 0 This is consistent with the following simulation. > x=runif(1000,0,1) > y=x^2 > mean(y) [1] 0.326449 While it is not necessarily the case that E(t(X)) = t(E(X)) (see problem 3.23), the next proposition shows that the expectation function is a “linear operator.” Lemma 3.4.8. If a and b are real numbers, then E(aX + b) = a E(X) + b. 3.4.4 The Variance of a Random Variable We are now in a position to define the variance of a random variable. Recall that the variance of a set of n data points x1 , . . . , xn is almost the average of the squared-deviation from the sample mean. X Var(x) = (xi − x)2 /(n − 1) 328 3.5 The Normal Distribution The natural analogue for random variables is the following. Definition 3.4.9 (variance, standard deviation of a random variable). Let X be a random variable. The variance of X is defined by 2 σX = Var(X) = E((X − µX )2 ) . The standard deviation is the square root of the variance and is denoted σX . The following lemma records the variance of several of our favorite random variables. Lemma 3.4.10. 1. If X ∼ Binom(n, π) then Var(X) = nπ(1 − π). m n m+n−k 2. If X ∼ Hyper(m, n, k) then Var(X) = k m+n m+n m+n−1 . 3. 
If X ∼ Unif(a, b) then Var(X) = (b − a)2 /12. 4. If X ∼ E(λ) then Var(X) = 1/λ2 . 3.5 The Normal Distribution The most important distribution in statistics is called the normal distribution. Definition 3.5.1 (normal distribution). A random variable X has the normal distribution with parameters µ and σ if X has pdf f (x; µ, σ) = √ 1 2 2 e−(x−µ) /2σ 2πσ −∞<x<∞. We write X ∼ Norm(µ, σ) in this case. 329 3 Probability 0.4 f(x) 0.3 0.2 0.1 0.0 −3 −2 −1 0 1 2 3 x Figure 3.5: The pdf of a standard normal random variable. The mean and variance of a normal distribution are µ and σ 2 so that the parameters are aptly, rather than confusingly, named. R functions dnorm(x,mean,sd), pnorm(q,mean,sd), rnorm(n,mean,sd), and qnorm(p,mean,sd) compute the relevant values. If µ = 0 and sd = 1 we say that X has a standard normal distribution. Figure 3.5 provides a graph of the density of the standard normal distribution. Notice the following important characteristics of this distribution: it is unimodal, symmetric, and can take on all possible real values both positive and negative. The curve in Figure 3.5 suffices to understand all of the normal distributions due to the following lemma. Lemma 3.5.2. If X ∼ Norm(µ, σ) then the random variable Z = (X − µ)/σ has the standard normal distribution. Proof. To see this, we show that P(a ≤ Z ≤ b) is computed by the integral of the standard normal density function. Z µ+bσ 1 X −µ 2 2 √ P(a ≤ Z ≤ b) = P(a ≤ ≤ b) = P (µ + aσ ≤ X ≤ µ + bσ) = e−(x−µ) /2σ dx . σ 2πσ µ+aσ Now in the integral, make the substitution u = (x − µ)/σ. We have then that Z µ+bσ Z b 1 1 2 −(x−µ)2 /2σ 2 √ √ e−u /2 du . e dx = 2πσ 2π µ+aσ a 330 3.5 The Normal Distribution But the latter integral is precisely the integral that computes P(a ≤ U ≤ b) if U is a standard normal random variable. The normal distribution is used so often that it is helpful to commit to memory certain important probability benchmarks associated with it. The 68–95–99.7 Rule If Z has a standard normal distribution, then 1. P(−1 ≤ Z ≤ 1) ≈ 68% 2. P(−2 ≤ Z ≤ 2) ≈ 95% 3. P(−3 ≤ Z ≤ 3) ≈ 99.7%. If the distribution of X is normal (but not necessarily standard normal), then these approximations have natural interpretations using Lemma 3.5.2. For example, we can say that the probability that X is within one standard deviation of the mean is about 68%. Example 3.5.1 In 2000, the average height of a 19-year old United States male was 69.6 inches. The standard deviation of the population of males was 5.8 inches. The distribution of heights of this population is well-modeled by a normal distribution. Then the percentage of males within 5.8 inches of 69.6 inches was approximately 68%. In R, > pnorm(69.6+5.8,69.6,5.8)-pnorm(69.6-5.8,69.6,5.8) [1] 0.6826895 It turns out that the normal distribution is a good model for many variables. Whenever a variable has a unimodal, symmetric distribution in some population, we tend to think of the normal distribution as a possible model for that variable. For example, suppose that we take repeated measures of a difficult to measure quantity such as the charge of an electron. It might be reasonable to assume that our measurements center on the true value of the quantity but have some spread around that true value. And it might also be reasonable to 331 3 Probability assume that the spread is symmetric around the true value with measurements closer to the true value being more likely to occur than measurements that are further away from the true value. 
Then a normal random variable is a candidate (and often used) model for this situation. The most important use of the normal distribution stems from the way that it arises in the analysis of repeated trials of a random experiment. This is a result of what might be called the Fundamental Theorem of Statistics — The Central Limit Theorem. Before we state the theorem, we give two examples illustrating the principles. Example 3.5.2 Suppose that X is the result of tossing a single die and recording the number. Now suppose that we wish to toss the die 100 times and record the results, x1 , . . . , x100 that obtain. These data can be viewed as the result of performing 100 random processes represented by random variables X1 , . . . , X100 which all have the same distribution and are independent one from another. Consider now the sum y = x1 + · · · + x100 of the 100 tosses. (We’d expect this number to be in the ballpark of 350, wouldn’t we?) We can consider this number y to be the result of a random variable, namely Y = X1 + · · · + X100 . Y itself has a distribution and in theory we could write the pmf for Y . (Y is discrete with possible values 100, 101, . . . , 599, 600.) A simulation suggests what happens. > trials10000=replicate(10000,sum(sample(c(1:6),100,replace=T))) > summary(trials10000) Min. 1st Qu. Median Mean 3rd Qu. Max. 286.0 339.0 350.0 350.2 362.0 414.0 > histogram(trials10000,xlab="Sum of 100 dice rolls") Note that the histogram in Figure 3.6 suggests that Y has a distribution that is unimodal and symmetric. Example 3.5.3 The random variable in the previous example was discrete. Suppose instead that X is a continuous random variable. For example, suppose that X ∼ Exp(1). X might be a waiting time random variable that measures the time until the next radioactive event detected at a Geiger counter. Suppose that X1 , . . . , Xn are n independent trials of the random process X. This would be a natural model for the experiment in which we wait 332 3.5 The Normal Distribution Percent of Total 20 15 10 5 0 300 350 400 Sum of 100 dice rolls Figure 3.6: 10,000 trials of the sum of 100 dice. 0.20 0.08 0.15 0.10 Density Density Density 0.06 0.10 0.05 0.05 0.04 0.02 0.00 0.00 0 5 10 15 Sum of 5 Exp(1) Random Variables 20 0.00 0 5 10 15 20 Sum of 10 Exp(1) Random Variables 25 30 10 20 30 40 Sum of 20 Exp(1) Random Variables Figure 3.7: Sums of independent exponential random variables. for not just one radioactive event but for n in succession. In this case Y = X1 +· · ·+Xn is just the time until n events have happened. The histograms in Figure 3.7 shows what might happen if n = 5, n = 10, and n = 20. One can see that as the number of trials of the experiment increase, the distribution of the sum becomes more symmetric. To describe this situation in general, we note that the situation we are imagining is that we have n random variables X1 , . . . , Xn that have the same distribution and that are independent one from another (i.e., the dice don’t talk to each other). We will call such random variables i.i.d. (for independent and identically distributed). Random variables that arise from repeating a random process and observing the same random variable are the canonical example of i.i.d. random variables. In this situation, we sometimes refer to the original random variable that describes the distribution in question as being the population random variable. If X1 , . . . , Xn are i.i.d. random variables, X1 , . . . , Xn are 333 3 Probability usually called a random sample. 
Note that this is the same term that we used to describe a sample from a population. These meanings are related but different. Before we state the Central Limit Theorem, we consider the properties of Y = X1 + · · · + Xn in terms of those of X. Specifically we have Lemma 3.5.3. Suppose that X1 , . . . , Xn are random variables and that Y = X1 +· · ·+Xn . Then P 1. E(Y ) = ni=1 E(Xi ), and P 2. if in addition the Xi are independent, then Var(Y ) = ni=1 Var(Xi ), and 3. if in addition the Xi have normal distributions and independent, then Y has a normal distribution. In particular, if the Xi are i.i.d. with mean µ and variance σ 2 , then µY = nµ and Var(Y ) = nσ 2 . The lemma says that the sum of random variables that have normal distributions and are independent is normal. The Central Limit Theorem says that even if the Xi are not normal, if n is large the sum of the Xi is approximately normal. Theorem 3.5.4 (Central Limit Theorem). Suppose that X1 , . . . , Xn are i.i.d. random variables with common mean µ and variance σ 2 . Then as n gets large the random variable Yn = X1 + · · · + Xn has a distribution that approaches the normal distribution. Given i.i.d. random variables X1 , . . . , Xn , we will often be interested in the mean of the values of the random variables rather than the sum. Definition 3.5.5 (sample mean). Given i.i.d. random variables X1 , . . . , Xn (i.e., a random sample), the sample mean is the random variable Xn defined by Xn = (X1 + · · · + Xn )/n . 334 3.6 Exercises Corollary 3.5.6. Suppose that X1 , . . . , Xn are i.i.d. random variables with common mean µ and variance σ 2 . Then as n gets large the sample mean Xn has a distribution that is approximately normal with mean µ and variance σ 2 /n. Returning to Example 3.5.3, we have that if we find the sum Y of 10 different exponential random variables with λ = 1, the mean and variance of Y are each 10. (Recall that the mean and variance of X ∼ Exp(λ) are 1/λ and 1/λ2 respectively.) Corollary 3.5.6 is especially important since we are often interested in the mean of data values x1 , . . . , xn that can be modeled as resulting from repeating a random process n times. Example 3.5.4 In Example 2.6.1 we considered simple random samples of size 5 from a population of 134 MIAA basketball players. We observed the points per game of each of the 5 players in our sample. We could consider the sample of size 5 that we generated as the result of 5 random variables X1 , . . . , X5 . Now this sample does not fit exactly the framework of Corollary 3.5.6. Namely, these random variables are not independent. Once we choose the first player at random (X1 , a discrete random variable with 134 possibilities) the distribution of points per game of X2 changes. This is because we generally sample without replacement. We can rectify this in two ways. First, we may sample with replacement. That guarantees that the five random variables are i.i.d. Else, we can sample with replacement but believe that the random variables Xi are close enough to being independent so as not to affect the result too much. This is especially true if the sample size (5) is much smaller than the population size (134). 3.6 Exercises 3.1 Suppose that four coins (a penny, nickel, dime and quarter) are tossed and the face-up side of each is observed as heads or tails. a) How many equally likely outcomes are there? List them. b) In how many of these outcomes is exactly one head showing? 335 3 Probability c) What is the probability that exactly one head is showing? 
3.2 Suppose that ten coins are tossed. a) How many equally likely outcomes are there? Do not list them! b) In how many of these outcomes is exactly one head showing? c) What is the probability that exactly one head is showing? 3.3 Use R to simulate the rolling of a fair six-sided die. (e.g., sample(c(1:6),1) will do the trick). Roll the die 600 times. a) How many of each of the numbers 1 through 6 did you “expect” to occur? b) How many of each of the numbers 1 through 6 actually occurred? c) Are you surprised by the discrepancy between your answers to (a) and (b)? Why or why not? 3.4 Suppose that a small class of 10 students has 4 male students and 6 female students. A random sample of two students is chosen from this class. What is the probability that both of the students are male? (Hint: first find the number of equally likely outcomes.) 3.5 Toss a coin 1,000 times (a simulated coin, not a real one!). a) What is the number of heads in the 1,000 tosses. (You can do this very easily if you code heads as 1 and tails as 0.) b) Now repeat this procedure 10,000 times (that is toss 1,000 coins 10,000 times). You now have 10,000 different answers to part (a). Don’t write them all down but describe the distribution of these 10,000 numbers using the terminology and techniques for describing distributions. 3.6 Let E C be the event “E doesn’t happen.” For example, if we toss one die and E is the event that the die comes up 1 or 2, then E C is the event that the die doesn’t come up 336 3.6 Exercises 1 or 2 (and so E C is the event that the die comes up 3, 4, 5, or 6). Show from the axioms of probability that P (E C ) = 1 − P (E). 3.7 Suppose that you roll 5 standard dice. Determine the probability that all the dice are the same. 3.8 Suppose that you deal 5 cards from a standard deck of cards. Determine the probability that all the cards are of the same color. (A standard deck of cards has 52 cards in two colors. There are 26 red and 26 black cards.) 3.9 Acceptance sampling is a procedure that tests some of the items in a lot and decides to accept or reject the entire lot based on the results of testing the sample. Suppose that the test determines whether an item is “acceptable” or “defective”. Suppose that in a lot of 100 items, 4 are tested and that the lot is rejected if one or more of those four are found to be defective. a) If 10% of the lot of 100 are defective, what is the probability that the purchaser will reject the shipment? b) If 20% of the lot of 100 are defective, what is the probability that the purchaser will reject the shipment? 3.10 Suppose that there are 10,000 voters in a certain community. A random sample of 100 of the voters is chosen and are asked whether they are for or against a new bond proposal. a) If only 4,500 of the voters are for the bond proposal, what is the probability that fewer than half of the sampled voters are in favor of the bond proposal? b) Suppose instead that the sample consists of 2,000 voters. Answer the same question as in the previous part. 3.11 If the population is very large relative to the size of the sample, it seems like sampling with replacement should yield very similar results to that of sampling without replacement. Suppose that an urn contains 10,000 balls, 3,000 of which are white. a) If 100 of these balls are chosen at random with replacement, what is the probability that at most 25 of these are white? 
337 3 Probability b) If 100 of these balls are chosen at random without replacement, what is the probability that at most 25 of these are white? ( 2x 3.12 A random variable X has the triangular distribution if it has pdf fX (x) = 0 x ∈ [0, 1] otherwise. a) Show that fX is indeed a pdf. b) Compute P(0 ≤ X ≤ 1/2). c) Find the number m such that P(0 ≤ X ≤ m) = 1/2. (If is natural to call m the median of the distribution.) ( k(x − 2)(x + 2) 3.13 Let f (x) = 0 −2 ≤ x ≤ 2 otherwise. a) Determine the value of k that makes f a pdf. Let X be the corresponding random variable. b) Calculate P(X ≥ 0). c) Calculate P(X ≥ 1). d) Calculate P(−1 ≤ X ≤ 1). 3.14 Describe a random variable that is neither continuous nor discrete. Does your random variable have a pmf? a pdf? a cdf? 3.15 Show that if f and g are pdfs and α ∈ [0, 1], then αf + (1 − α)g is also a pdf. 3.16 Suppose that a number of measurements that are made to 3 decimal digits accuracy are each rounded to the nearest whole number. A good model for the “rounding error” introduced by this process is that X ∼ Unif(−.5, .5) where X is the difference between the true value of the measurement and the rounded value. a) Explain why this uniform distribution might be a good model for X. 338 3.6 Exercises b) What is the probability that the rounding error has absolute value smaller than .1? 3.17 If X ∼ Exp(λ), find the median of X. That is find the number m such that P(X ≤ m) = 1/2. 3.18 A part in the shuttle has a lifetime that can be modeled by the exponential distribution with parameter λ = 0.01, where the units are hours. The shuttle mission is scheduled for 200 hours. a) What is the probability that the part fails on the mission? b) The event that is described in part (a) is BAD. So the shuttle carries two replacements for the part (a total of three altogether). What is the probability that the mission ends without all three failing? 3.19 The lifetime of a certain brand of water heaters in years can be modeled by a Weibull distribution with α = 2 and β = 25. a) What is the probability that the water heater fails within its warranty period of 10 years? b) What is it probability that the water heater lasts longer than 30 years? c) Using a simulation, estimate the average life of one of these water heaters. 3.20 Prove Theorem 3.4.5. 3.21 Suppose that you have an urn containing 100 balls, some unknown number of which are red and the rest are black. You choose 10 balls without replacement and find that 4 of them are red. a) How many red balls do you think are in the urn? Give an argument using the idea of expected value. b) Suppose that there were only 20 red balls in the urn. How likely is it that a sample of 10 balls would have at least 4 red balls. 339 3 Probability 3.22 The file http://www.calvin.edu/~stob/data/scores.csv contains a dataset that records the time in seconds between scores in a basketball game played between Kalamazoo College and Calvin College on February 7, 2003. a) This waiting time data might be modeled by an exponential distribution. Make some sort of graphical representation of the data and use it to explain why the exponential distribution might be a good candidate for this data. b) If we use the exponential distribution to model this data, which λ should we use? (A good choice would be to make the sample mean equal to the expected value of the random variable.) c) Your model of part (b) makes a prediction about the proportion of times that the next score will be within 10, 20, 30 and 40 seconds of the previous score. 
Test that prediction against what actually happened in this game. 3.23 Show that it is not necessarily the case that E(t(X)) = t(E(X)). 3.24 Let X be the random variable that results form tossing a fair six-sided die and reading the result (1–6). Since E(X) = 3.5, the following game seems fair. I will pay you 3.52 and then we will roll the die and you will pay me the square of the result. Is the game fair? Why or why not? 3.25 In this problem we compare sampling with replacement to sampling without replacement. You will recall that the former is modeled by the binomial distribution and the latter by the hypergeometric distribution. Consider the following setting. There are 4,224 students at Calvin and we would like to know what they think about abolishing the interim. We take a random sample of size 100 and ask the 100 students whether or not they favor abolishing the interim. Suppose that 1,000 students favor abolishing the interim and the other 3,224 misguidedly want to keep it. a) Suppose that we sample these 100 students with replacement. What is the mean and the variance of the random variable that counts the number of students in the sample that favor abolishing the interim? b) Now suppose that we sample these 100 students without replacement. What is the mean and the variance of the random variable that counts the number of students in the sample that favor abolishing the interim? 340 3.6 Exercises c) Comment on the similarities and differences between the two. Give an intuitive reason for any difference. 3.26 Scores on IQ tests are scaled so that they have a normal distribution with mean 100 and standard deviation 15 (at least on the Stanford-Binet IQ Test). a) MENSA, a society supposedly for persons of high intellect, requires a score of 130 on the Stanford-Binet IQ test for membership. What percentage of the population qualifies for MENSA? b) One psychology text labels those with IQs of between 80 and 115 as having “normal intelligence.” What percentage of the population does this range contain? c) The top 25% of scores on an IQ test are in what range? d) If two different individuals are chosen at random, what is the probability that the sum of their IQ scores is greater than 240? 3.27 In this problem we investigate the accuracy of the Central Limit Theorem by simulation. Suppose that X is a random variable that is exponential with parameter λ = 1/2. 2 = 4. Suppose that we repeat the random experiment n times to get Then µX = 2 and σX independent random variables X1 , . . . , Xn each of which is exponential with λ = 1/2. (The R function rexp(n,lambda=.5) will simulate this experiment.) a) From Lemma 3.5.3 X = (X1 + · · · + X5 )/5 has what mean and variance? b) If n = 5, what does the Central Limit Theorem predict for P (1.5 < X < 2.5)? c) Simulate the distribution of X by taking many samples of size 5. Compute the proportion of your samples for which (1.5 < x < 2.5) and compare to part (b). d) Repeat parts (b) and (c) for n = 30. 341 4 Inference 4.1 Hypothesis Testing Suppose that a real-world process is modeled by a binomial distribution for which we know n but do not know π. Examples abound. Example 4.1.1 1. We have said that a fair coin is equally likely to be heads or tails when tossed. But now suppose we have a coin and toss it 100 times. How do we know it is fair? That is, how do we know π = 0.5? 2. A factory produces the ubiquitous widget. It claims that the probability that any widget is defective is less than 0.1%. We receive a shipment of widgets. 
We wonder whether the claim about the defective rate is really true. If we test 100 widjits, this is an example of a binomial experiment with n = 100 and π unknown. 3. A National Football League team is trying to decide whether to replace its field goal kicker with a new one. The current kicker makes about 30% of his kicks from 45 yards out. The team tests the new kicker by asking him to try 20 kicks from 45 yards out. This might be modeled by a binomial distribution with n = 20 and π unknown. The team is hoping that π > .3. 4. A standard test for ESP works as follows. A card with one of five printed symbols is selected without the person claiming to have ESP being able to see it. The purported psychic is asked to name what symbol is on the card while the experimenter looks at it and “thinks” about it. A typical experiment consists of 25 trials. This is an example of a binomial experiment with n = 25 and unknown π. The experimenter usually believes that π = .2. 401 4 Inference In each of the instances of Example 4.1.1 we have a hypothesis about π that we could be considered to be testing. In the four cases we could be considered to be testing the hypotheses π = .5, π ≤ 0.001, π ≤ 0.3, and π = .2. A hypothesis proposes a possible state of affairs with respect to a probability distribution governing an experiment that we are about to perform. There are a variety of kinds of hypotheses that we might want to test. 1. A hypothesis stating a fixed value of a parameter: π = .5. 2. A hypothesis stating a range of values of a parameter: π ≤ .3. 3. A hypothesis about the nature of the distribution itself: X has a binomial distribution. To test the hypothesis that the coin is fair (π = .5) we must actually collect data. Suppose that we toss the coin n = 100 times and get x = 40 heads. What should we conclude about our hypotheses? The first thing to note is that we cannot conclude anything with certainty in this case. Any value of x = 0, 1, . . . , 100 is consistent with both π = 0.5 and any other value of π. However, if the coin really is fair, some results for x are more surprising than others. In this case, for example, if the our hypothesis is true, then P(X ≤ 40) = 0.02844, so we would only get 40 or fewer heads about 2.8% of the times that did this test. In other words, getting only 40 heads is pretty unusual, but not extremely unusual. This gives us some evidence to suggest that the coin in biased. After all, one of two things must be true. Either • the coin is fair (π = 0.50) and we were just “unlucky” in our particular 100 tosses, or • the coin is not fair, in which case the probability calculation we just did doesn’t apply to the coin. That in a nutshell is the logic of a statistical hypothesis test. We will learn a number of hypothesis tests, but they all follow the same basic outline. Step 1: State the null and alternative hypotheses In a typical hypothesis test, we pit two hypotheses against each other. 402 4.1 Hypothesis Testing 1. Null Hypothesis. The null hypothesis, usually denoted H0 , is generally a hypothesis that the data analysis is intended to investigate. It is usually thought of as the “default” or “status quo” hypothesis that we will accept unless the data gives us substantial evidence against it. 2. Alternate Hypothesis. The alternate hypothesis, usually denoted H1 or Ha , is the hypothesis that we are wanting to put forward as true if we have sufficient evidence against the null hypothesis. 
In the example of the supposedly fair coin, it is clear that the hypotheses should be

H0: π = 0.5
Ha: π ≠ 0.5

The null hypothesis simply says that the coin is fair while the alternate hypothesis says that it is not. We want to choose between these two hypotheses. In this example, the alternate hypothesis is two-sided. There are also situations when we wish to consider a one-sided alternate hypothesis. Consider the ESP example. Our null hypothesis is surely that the subject cannot do better than chance (π = .2) but our alternate hypothesis is that the subject can do better than chance (π > .2). In our particular test, we do not allow for the possibility that the subject somehow typically does worse than chance (although this is logically possible).

Step 2: Calculate a test statistic

In our example, we compute the number of heads (40). This is the number that we will use to test our hypothesis. The number 40 in this instance is called a statistic. Since we use this statistic to test our hypothesis, we will sometimes call it a test statistic. In fact we will use the term statistic in two different ways. In this case, the number 40 is a specific value that is computed from the data. But also, 40 is the value of a certain random variable that is computed from the experiment of tossing a coin 100 times. We will refer to both the random variable and its value as statistics. In keeping with our notation for random variables and data, upper-case letters will denote random variables and lower-case letters their particular values. A test statistic should be some number that measures in some way how true the null hypothesis looks. In this case, a number near 50 is in keeping with the null hypothesis. The farther x is from 50, the stronger the evidence against the null hypothesis.

Step 3: Compute the p-value

Now we need to evaluate the evidence that our test statistic provides. To do this requires that we think about our statistic as a random variable. In the case of the supposedly fair coin, our test statistic X ∼ Binom(100, π). As a random variable, our test statistic has a distribution. The distribution of the test statistic is called its sampling distribution. Now we can ask probability questions about our test statistic. The general form of the question is “How unusual would my test statistic be if the null hypothesis were true?” To answer it, we need to know the distribution of X when the null hypothesis is true. In this case, X ∼ Binom(100, 0.5). So how unusual is it to get only 40 heads? Assuming that the null hypothesis is true (i.e., that the coin is fair),

P(X ≤ 40) = pbinom(40,100,.5) = 0.0284 ,
P(X ≥ 60) = 1 - pbinom(59,100,.5) = 0.0284 .

So the probability of getting a test statistic at least as extreme (unusual) as 40 is 0.0568. This probability is called a p-value. There is some subtlety to the above computation and we shall return to it.

Step 4: Draw a conclusion

Drawing a conclusion from a p-value is a judgment call and it is a scientific rather than mathematical decision. Our p-value is 0.0568. This means that if we tossed a fair coin 100 times and repeated that experiment many times, between 5 and 6% of these repetitions would give fewer than 41 or more than 59 heads. So our result of 40 is a bit on the unusual side, but not extremely so. Our data provide some evidence to suggest that the coin may not be fair, but the evidence is far from conclusive. If we are really interested in the coin, we probably need to gather more data.
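For reference, base R also bundles this computation into the function binom.test(), which carries out an exact binomial test. For the coin data, its two-sided p-value is (in this symmetric case) just the sum of the two tail probabilities computed above, about 0.057.

> binom.test(x=40, n=100, p=0.5)   # exact test of H0: pi = 0.5 against a two-sided alternative

The alternative argument ("two.sided", "less", or "greater") selects a two-sided or one-sided alternate hypothesis; the hand calculation above makes clear exactly what the reported p-value means.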
Other hypothesis tests will proceed in a similar fashion. The details of how to compute a test statistic and how to convert it into a p-value will change from test to test, but the interpretation of the p-value is always the same. The p-value measures how surprising the value of the test statistic would be if the null hypothesis were true. The next example illustrates the steps of the hypothesis testing paradigm in a case where the alternate hypothesis is one-sided. Example 4.1.2 404 4.1 Hypothesis Testing A company receives a shipment of printed circuit boards. The claim of the manufacturer is that the defective rate is at most 1%. If 100 boards are tested, should we dispute the claim of the manufacturer if we find 3 defective boards in this test? In this situation, the pair of hypotheses to test are H0 : π = 0.01 Ha : π > 0.01 The following R session is relevant to this example. > 1-pbinom(c(0:5),100,.01) [1] 0.633968 0.264238 0.079373 0.018374 0.003432 0.000535 From this computation, we find that even if the null hypothesis is true, we could expect to find 3 or more defective boards 7.9% of the time if we test 100. This result doesn’t seem surprising enough to reject the null hypothesis or the shipment. (But perhaps you disagree!) In this example, we have illustrated how we proceed when the alternate hypothesis is one-sided. Namely, we only consider results to favor the alternate hypothesis when they are in the correct direction of the null hypothesis. That is, we wouldn’t consider having too few defectives as evidence against the null hypothesis in favor of the alternate hypothesis. It is often the case that we must make a decision based on our hypothesis test. In Example 4.1.2, for example, we must finally decide whether to reject the shipment. There are of course two different kinds of errors that we could make. Definition 4.1.1 (Type I and Type II errors). A Type I error is the error of rejecting H0 even though it is true. A Type II error is the error of not rejecting H0 even though it is false. Of course, if we reject the null hypothesis, we cannot know whether we have made a Type I error. Similarly, if we do not reject the null hypothesis, we cannot know whether we have made a Type II error. Whether we have committed such an error depends on the true value of π which we cannot ever know simply from data. What we can do however is to compute the probability that we will make such an error given our decision rule and our true state of nature. To illustrate the computation of these two kinds of errors, let’s return to the computation of the p-value in the case of the (un)fair coin. Suppose that we decide that whenever we 405 4 Inference toss a coin 100 times, we will consider it unfair if we have 40 or fewer or 60 or more heads. Then the p-value computation (recall the p-value was 0.0568) tells us that If the null hypothesis is true, our decision rule will make a Type I error with probability 5.68% Is this the right decision rule to use? If we instead we decide to reject the null hypothesis only if X ≤ 39 or X ≥ 61 we find that we will make a type I error with probability only pbinom(39,100,.5) + (1-pbinom(60,100,.5))=0.035. Which decision rule should we use? A common convention is to make some canonical choice of a probability of Type I error that we are willing to tolerate. A probability of Type I error of 5% is often chosen. If 5% were the greatest type I error probability we were willing to tolerate then we would not reject a null hypothesis if our p-value was greater than 5%. 
In the coin example, 40 heads would be acceptable but 39 would not. The choice of 5% is conventional but somewhat arbitrary. It is usually better to report the result of a hypothesis test as a p-value rather than simply reporting that the null hypothesis is rejected.

We usually denote by α the probability of a Type I error that we are willing to accept in our decision rule. Notice that if we lower α it becomes more difficult to reject the null hypothesis. This means that if the null hypothesis is false, the probability of a Type II error increases with decreasing α. (Oddly enough, the probability of a Type II error is named β.) We cannot compute the probability of a Type II error, however, without knowing the true value of π. Consider the case of the (un)fair coin. Suppose we choose α = .05 and so we choose to reject the null hypothesis only if X ≤ 39 or X ≥ 61. What is the probability that we make a Type II error if the true value of π is 0.55? The probability that we reject the null hypothesis in this case is computed by

> pbinom(39,100,.55) + (1-pbinom(60,100,.55))
[1] 0.1351923

Notice that we will reject the null hypothesis only 13.5% of the time, so that the probability that we make a Type II error is 86.5%! Obviously, our test is very conservative and will not detect an unfair coin very often. That is the penalty we pay for wanting to be reasonably sure that we do not make a Type I error. The next example illustrates these considerations in the case of a one-sided alternate hypothesis.

Example 4.1.3
As described in Example 4.1.1, the conventional test for ESP is a card test. The subject is asked to guess what is on 25 consecutive cards, each of which contains one of five symbols. The appropriate pair of hypotheses in this case are

H0: π = 0.2
Ha: π > 0.2

The following computation in R will help us develop our test.

> 1-pbinom(c(5:10),25,.2)
[1] 0.38331 0.21996 0.10912 0.04677 0.01733 0.00555

Obviously, our decision rule should say to reject the null hypothesis if the number of successes is too large. Note that P(X ≥ 9) = 4.7% if the null hypothesis is true. Therefore, if we choose α to be 5%, as is the custom, we should reject the null hypothesis in favor of the alternate hypothesis if the number of successes is at least 9. If we follow this rule, the probability that we will make a Type I error is 4.7%. What if the null hypothesis is false? For example, what if the true value of π is 0.3? (This is a rather modest case of ESP, but such a person would be interesting!) In this case, our decision rule would reject the null hypothesis with probability 1-pbinom(8,25,.3) = .323. Note that even if our subject has ESP, our test could very well fail to detect it.

What one should notice in our treatment of decision rules is the asymmetry between the two hypotheses. We are generally not willing to tolerate a large probability of a Type I error – we often set α = 5%. However, this seems to lead to a rather large probability of a Type II error in the case that the null hypothesis is false. This asymmetry is intentional, however, as the null hypothesis usually has a preferred status as the "innocent until proven guilty" hypothesis.

4.2 Inferences about the Mean

One of the most important problems in inferential statistics is that of making inferences about the (unknown) mean of a population.

Example 4.2.1
1. What is the average height of a Calvin College student?
It not being feasible to measure each student, we might take a random sample of Calvin students and compute the sample mean x̄ of these students. How close is x̄ likely to be to the true mean?

2. We have a number of chickens that we feed a diet of sunflower seeds. The average weight of the chickens after 30 days is 330 grams. How close is this number to the average weight of the (theoretical) population of "all" similar chickens?

3. We take a number of measurements of the speed of light. How close is the average of these measurements likely to be to the "true" value?

In this section, we will conceptualize the above examples as instances of this question. Given i.i.d. random variables X1, . . . , Xn with unknown mean µX, what can we infer about µX from a particular outcome x1, . . . , xn?

Estimates and Estimators

We will call x̄ an estimate of µX and X̄ an estimator of µX. The difference is that X̄ is a random variable – you can think of it as a procedure for producing an estimate – and x̄ is a number. The estimator X̄ has two very important properties that make it a desirable estimator. The first is that E(X̄) = µX. In other words, in the long run, the sample mean averages out to the population mean. Because of this, we say that the estimator X̄ is unbiased. An unbiased estimator doesn't have a tendency to under- or over-estimate the quantity in question. The general definition is this.

Definition 4.2.1 (unbiased estimator). Suppose that θ is a parameter of a distribution and that Y is a statistic computed from a random sample X1, . . . , Xn from that distribution. Then Y is an unbiased estimator of θ if E(Y) = θ.

It turns out that the sample variance S² is an unbiased estimator of σX², which is the real reason we use n − 1 rather than n in the definition of S².

The second important property is that X̄ is likely to be close to µX if n is large. Formally, we can say that for every ε > 0,

lim(n→∞) P(|X̄n − µX| > ε) = 0.

While we will not prove this, it follows from the fact that the variance of X̄n is σ²/n. These two properties together suggest that X̄ is a good choice for an estimator of µ.

The Idea of a Confidence Interval

While the estimator X̄ may be a good procedure to use, we recognize that in any particular instance the estimate x̄ will not be equal to µX. We next use the Central Limit Theorem to say something about how close to µX the estimate is likely to be. The Central Limit Theorem allows us to say that X̄ is approximately normally distributed with mean µX and variance σ²/n. Thus the following random variable has a distribution that is approximately standard normal:

Z = (X̄ − µ)/(σ/√n)

Therefore we can write

P(−1.96 < (X̄ − µ)/(σ/√n) < 1.96) ≈ .95.

(The number 2 in the 68%-95%-99.7% law is actually 1.96.) Using algebra, we find that

P(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) ≈ .95.

What this probability statement says is that the interval

(X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n)

is likely to contain the true mean of the distribution. This interval is a random interval.

Definition 4.2.2 (confidence interval). Suppose that X1, . . . , Xn is a random sample from a distribution that is normal with mean µ and variance σ². Suppose that x1, . . . , xn is the observed sample. The interval

(x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n)

is called an approximate 95% confidence interval for µ.
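To see what the "95%" in this definition refers to, one can simulate the procedure: draw many samples from a normal distribution with known µ and σ, form the interval each time, and count how often it covers µ. A minimal sketch follows; the choices µ = 23, σ = 0.1, and n = 40 are arbitrary here (they anticipate the example below) and the name "covered" is introduced just for this illustration.

> mu <- 23; sigma <- 0.1; n <- 40
> covered <- replicate(10000, {
+   x <- rnorm(n, mu, sigma)
+   mean(x) - 1.96*sigma/sqrt(n) < mu & mu < mean(x) + 1.96*sigma/sqrt(n)
+ })
> mean(covered)    # proportion of intervals that contain mu; should be close to 0.95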
How does this notion of a confidence interval help us? Actually not much, since this interval is defined in terms of σ, the standard deviation of the original distribution. But σ is not likely to be known (after all, we don't even know the mean µ of the original distribution). Let's set that issue aside and consider an example.

Example 4.2.2
A machine creates rods that are to have a diameter of 23 millimeters. It is known that the standard deviation of the actual diameters of parts created over time is 0.1 mm. A random sample of 40 parts is measured precisely to determine if the machine is still producing rods of diameter 23 mm. The data and 95% confidence interval are given by

> x
 [1] 22.958 23.179 23.049 22.863 23.098 23.011 22.958 23.186 23.015 22.995
[11] 23.166 22.883 22.926 23.051 23.146 23.080 22.957 23.054 23.019 23.059
[21] 23.040 23.057 22.985 22.827 23.172 23.039 23.029 22.889 23.089 22.894
[31] 22.837 23.045 22.957 23.212 23.092 22.886 23.018 23.031 23.073 23.117
> mean(x)
[1] 23.024
> c(mean(x)-(1.96)*.1/sqrt(40),mean(x)+(1.96)*.1/sqrt(40))
[1] 22.993 23.055

It appears that the process could still be producing rods of average diameter 23 mm. We use the term confidence interval for this interval since we are reasonably confident that the true mean of the rods is in the interval (22.993, 23.055). We even have a number that quantifies that confidence, 95%. But we need to be very careful in what we are saying. We are not saying that

(BAD - DO NOT SAY) the probability that the true mean is in the interval (22.993, 23.055) is 95%.

There is no probability after the data are collected. Either the mean is in the interval or it isn't. Rather we are making a statement before the data are collected:

If we are to generate a 95% confidence interval for the mean from a random sample of size 40 from a normal distribution with standard deviation 0.1, then the probability is 95% that the resulting confidence interval will contain the mean.

On the frequentist conception of probability we could say

If we generate many 95% confidence intervals by this procedure, approximately 95% of them will contain the mean of the population.

After the data are collected, a good way of describing the confidence interval that results is

Either the population mean is in (22.993, 23.055) or something surprising happened.

Notice that the confidence interval says something about the precision of our estimate. A wide confidence interval means that our estimate is not very precise.

But σ Isn't Known!

Using the Central Limit Theorem, we have seen that

P(X̄ − 1.96 σ/√n < µ < X̄ + 1.96 σ/√n) ≈ .95.     (4.1)

The next step is to make another approximation. We need to get rid of σ. Since the sample variance S² is an unbiased estimate of σ², the trick is to use S = √S², the sample standard deviation, to estimate σ. Thus we have

P(X̄ − 1.96 S/√n < µ < X̄ + 1.96 S/√n) ≈ .95.

Now, after the experiment, we have values for both X̄ and S. We illustrate the procedure for getting our new confidence interval using Example 4.2.2. Note that the following R code computes a 95% confidence interval for µX.

> sd(x)
[1] 0.098755
> c( mean(x) - 1.96* sd(x)/sqrt(40), mean(x) + 1.96 * sd(x)/sqrt(40))
[1] 22.993 23.054

Removing the Approximations

Our new 95% confidence interval for the mean,
(x̄ − 1.96 s/√n, x̄ + 1.96 s/√n),

makes two approximations:

• We use the CLT to say that we can use the normal distribution (that's where the 1.96 comes from).
• We use S instead of σ simply because we do not know σ.

The CLT Approximation

There are two ways of getting around the fact that we use the CLT in our approximation. First, we could assume that the underlying distribution is normal. Then there is no need to approximate, since the distribution of X̄ is exactly normal. Or we could use facts about the particular distribution in question. For example, if X is binomial, we could use similar facts about the binomial distribution to develop a different kind of confidence interval. In general, however, we are just going to have to be content with the fact that our confidence intervals are approximate and hope that our sample size n is large enough.

The Approximation of Using S for σ

The bottom line here is that we will change the 1.96 used in our current approximation to a slightly larger number to compensate for the approximation that results from not knowing σ. It seems right to do this: if we are less sure that we are using the right endpoints for the interval, we should make the interval a little wider to ensure that we have a 95% chance of capturing the mean. How much wider we should make the interval is a somewhat tricky (and long) story that we will tell in the next section.

Before we modify our intervals to take into account the approximation of σ by S, we note that we could modify our confidence intervals in a number of ways. For example, the number 95% is not sacred. It should be clear how to generate a 68% confidence interval or even an 80% confidence interval. We merely need to look up the appropriate fact about the standard normal distribution. A second way in which we might modify our intervals is to make them one-sided. For example, if we wanted a lower bound for our rod diameters, since qnorm(.05,0,1) = -1.644854 we could use

P(X̄ − 1.64 S/√n < µ < ∞) ≈ .95.

4.3 The t-Distribution

In the last section, we left the problem of finding a confidence interval for µ at the point where we had a perfectly reasonable, but approximate, confidence interval. There were two approximations: the use of the CLT and the approximation of σ by S. We focus on the latter problem here. We will begin by assuming that the random sample X1, . . . , Xn consists of normal random variables so that we need not concern ourselves with the CLT approximation. Then the question is, what is the effect of replacing (X̄ − µ)/(σ/√n) by (X̄ − µ)/(S/√n)? The t-distribution holds the key.

Definition 4.3.1 (t-distribution). A random variable T has a t distribution (with parameter ν ≥ 1, called the degrees of freedom of the distribution) if it has pdf

f(t) = [Γ((ν + 1)/2) / (√(πν) Γ(ν/2))] · 1/(1 + t²/ν)^((ν+1)/2),   −∞ < t < ∞.

(The Γ function in the definition of the pdf above is an important function from analysis that is a continuous extension of the factorial function. But in this instance, it doesn't really matter what it is, since its purpose is simply to be a constant that ensures that the integral of the density is 1.)

Some properties of the t-distribution include

1. f is symmetric about t = 0 and unimodal. In fact f looks bell-shaped.
2. The mean of T is 0 if ν > 1 (and does not exist if ν = 1).
3. The variance of T is ν/(ν − 2) if ν > 2.
4. For large ν, T is approximately standard normal.
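Property 4 can be checked numerically by comparing t critical values with the corresponding standard normal critical value, using R's quantile function qt (discussed further just below). A quick sketch; the degrees of freedom shown are arbitrary choices:

> qt(0.975, df = c(5, 10, 30, 100, 1000))   # t critical values; these decrease toward the normal value as df grows
> qnorm(0.975)                              # the standard normal critical value, 1.959964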
R knows the t-distribution of course, and the appropriate functions are dt(x,df), pt(), qt(), and rt(). The graphs of the normal distribution and two t-distributions are shown below; they were produced by

> x=seq(-3,3,.01)
> y=dt(x,3)
> z=dt(x,10)
> w=dnorm(x,0,1)
> plot(w~x,type="l",ylab="density")
> lines(y~x)
> lines(z~x)

[Figure: the standard normal density plotted together with the t densities for 3 and 10 degrees of freedom, density against x for −3 ≤ x ≤ 3.]

The importance of the t-distribution is contained in the following theorem.

Theorem 4.3.2. If X1, . . . , Xn are i.i.d. normal random variables with mean µ and variance σ², then the random variable

(X̄ − µ)/(S/√n)

has a t distribution with n − 1 degrees of freedom.

It is now clear how to generate an exact confidence interval for µ in the case that the data come from a normal distribution. For any number β, let t_{β,ν} be the unique number such that P(T > t_{β,ν}) = β, where T is a random variable that has a t distribution with ν degrees of freedom. Then we have

Theorem 4.3.3. If x1, . . . , xn are the observed values of a random sample from a normal distribution with unknown mean µ and t* = t_{α/2,n−1}, the interval

(x̄ − t* s/√n, x̄ + t* s/√n)

is a 100(1 − α)% confidence interval for µ.

In Example 4.2.2, where we considered the diameter of manufactured rods, we had n = 40. If we assume that the measurements come from a normal distribution, we would use the t-distribution with ν = 39. To find a 95% confidence interval we need t_{.025,39}. R of course computes this as qt(.975,39) = 2.022691. So the effect of not knowing σ in this case is to use 2.02 in determining the width of the confidence interval rather than 1.96.

Notice that in this confidence interval there are three components. The first is an estimate x̄ of the quantity it is a confidence interval for. Second, there is a number t* determined from the t-distribution by the level of confidence and the degrees of freedom. This number is usually referred to as a critical value. Finally, there is an estimate s/√n of the standard deviation of the estimator. The number σ/√n is often called the standard error (of the estimator or of the mean) and is often denoted σe. The estimate s/√n of this standard error is often denoted se. Therefore we have that the confidence interval is of the form

(estimate) ± (critical value) · (estimate of standard error).

Many other confidence intervals in statistics have the same form. The critical values and estimates change based on the situation, but the general form of the interval is the same. Because of the importance of confidence intervals for µ that are generated by the t-distribution, there is a function in R that does the table lookup and the arithmetic for us. We illustrate in the next example.

Example 4.3.1
Returning to the iris data, we might want to know the average sepal width of virginica irises. There is a lot to ignore in the following output, but note that two confidence intervals are generated (95% and 90%) and that the t-distribution is used with 49 degrees of freedom (as n = 50).
> data(iris)
> sw=iris$Sepal.Width[iris$Species=="virginica"]
> hist(sw)
> t.test(sw)

        One Sample t-test

data:  sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 2.882347 3.065653
sample estimates:
mean of x
    2.974

> t.test(sw,conf.level=.9)

        One Sample t-test

data:  sw
t = 65.208, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 2.897536 3.050464
sample estimates:
mean of x
    2.974

Now that we have exact confidence intervals in the case that the data come from a normal distribution even if σ is unknown, we turn to the case where the underlying distribution is unknown. In this case we advocate using the t-distribution just as above, recognizing that the result is just an approximation. We illustrate in the following example.

Example 4.3.2
Thirty seniors are chosen at random from the collection of 1,333 seniors at a certain midwest college. The average GPA of the thirty seniors chosen is 3.2891. What inferences can we make about the mean GPA of the 1,333 seniors? We first simplify and assume that the 30 seniors represent the result of thirty i.i.d. random variables. Though sampling was without replacement, this seems like a relatively harmless assumption. We next realize that the underlying distribution of GPAs is not likely to be normal but rather to be negatively skewed. (This does not mean that we expect to find negative GPAs!) The technology of the last section suggests using the normal distribution with s in place of σ. Using the t-distribution instead produces

> sr=read.csv('http://www.calvin.edu/~stob/data/actgpa.csv')
> sr$GPA
 [1] 3.992 2.533 3.377 3.009 3.509 3.969 3.917 3.547 3.416 3.287 4.000 3.446
[13] 3.905 2.926 3.100 3.446 2.785 3.663 3.368 3.352 3.929 2.750 3.620 3.765
[25] 2.763 1.986 2.836 2.696 3.119 2.662
> t.test(sr$GPA)

        One Sample t-test

data:  sr$GPA
t = 35.1095, df = 29, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 3.097500 3.480700
sample estimates:
mean of x
   3.2891

In this case, the effect of using the t-distribution is to use 2.045 in place of 1.96 in the computation of the width of the confidence interval. It seems prudent to use a wider confidence interval since we are only approximating the "true" 95% confidence interval in light of the fact that we are using the CLT.

Most statisticians recommend the approach of the last example. Namely, when constructing an approximate confidence interval in the case when our data are not from a normal distribution, we use the t-distribution with its slightly wider intervals rather than the narrower intervals that would be constructed using the normal distribution. Statisticians have found that when this is done, the confidence intervals constructed work well for a wide variety of underlying non-normal distributions. That is, 95% confidence intervals produced from the t-distribution tend to be approximately 95% confidence intervals even though the distributional hypothesis is not satisfied. We say that this method of producing confidence intervals is robust, meaning that it is not particularly sensitive to departures from the hypothesis (normality) on which it is based. (Older books suggest that one could use the normal distribution instead of the t-distribution if n ≥ 30, but this was a computational simplification. R knows all the t-distributions and can use one as easily as another.)
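The robustness claim can be explored by simulation. The sketch below draws repeated samples of size 30 from a right-skewed exponential distribution, builds the t-based 95% interval each time with t.test, and records how often the interval covers the true mean. The exponential distribution and the sample size are arbitrary choices for illustration.

> true.mean <- 1                 # the mean of the Exp(rate = 1) distribution
> covered <- replicate(10000, {
+   x <- rexp(30, rate = 1)
+   ci <- t.test(x)$conf.int
+   ci[1] < true.mean & true.mean < ci[2]
+ })
> mean(covered)                  # typically close to, though a bit below, the nominal 0.95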
There are two very important cautions to be made here. Although the t-distribution works well over a wide range of distributions and sample sizes, it is still an approximation and in particular can give poor results if the sample size is small and the underlying distribution is quite skewed. And the t-distribution will often fail disastrously if the independence assumption is violated.

4.4 Inferences for the Difference of Two Means

In this section we consider the problem of making inferences about the difference of two unknown means. We first give some examples.

Example 4.4.1
1. One might hypothesize that females get better grades at Calvin than males on average. One way of stating this claim precisely is to claim that the average GPA of females is greater than the average GPA of males. Since Calvin does not publish the average GPA by gender, we might test this claim by choosing a random sample of males and a separate random sample of females and comparing the two sample means.

2. One might claim that Tylenol is better than ibuprofen for treating pain from fractures in young children. To test this, one might assign children with leg fractures at random to treatment by Tylenol or ibuprofen. One would then compare the averages of some measure of pain relief in the two groups.

3. Kaplan claims to be able to raise SAT scores by 100 points with its tutoring program. To test the claim, they take a number of individuals who have already taken the SAT test and subject them to their program. The students then take the SAT test after the program and their before and after scores are compared.

In the first case of the example, it is easy to see that we are choosing a random sample from each of two different populations. The second case is somewhat different. The "populations" of ibuprofen and Tylenol takers are really theoretical and not actual populations. But we can still think of the results as random samples from these theoretical populations (e.g., the population of all children with similar injuries who might be given ibuprofen), in part because we randomized the assignment of individuals to the two groups. The third case of the example is clearly different. The before and after scores do not represent two independent populations since we measured these scores on the same individuals.

In this section we address the issue of determining whether there is a difference in means between two populations, and we consider the situation that arises in the first two cases of the example. We will call this the "two independent samples" case.

Assumptions for two independent samples:
1. X1, . . . , Xm is a random sample from a population with mean µX and variance σX².
2. Y1, . . . , Yn is a random sample from a population with mean µY and variance σY².
3. The two samples are independent of one another.
4. The samples come from normal distributions.

We first write a confidence interval for the difference in the two means, µX − µY. Just as did our confidence intervals for one mean µ, our confidence interval will have the form

(estimate) ± (critical value) · (estimate of standard error).

The natural choice for an estimator of µX − µY is X̄ − Ȳ. To write the other two pieces of the confidence interval, we need to know the distribution of X̄ − Ȳ. The necessary fact is this:

(X̄ − Ȳ − (µX − µY)) / √(σX²/m + σY²/n) ∼ Norm(0, 1).
Analogously to confidence intervals for a single mean, it seems like the right way to proceed is to estimate σX by sX and σY by sY and to investigate the random variable

(X̄ − Ȳ − (µX − µY)) / √(SX²/m + SY²/n).     (4.2)

The problem with this approach is that the distribution of this quantity is not known in general (unlike the case of the single mean, where the analogous quantity has a t-distribution). We need to be content with an approximation.

Lemma 4.4.1 (Welch). The quantity in Equation 4.2 has a distribution that is approximately a t-distribution with degrees of freedom ν, where ν is given by

ν = (SX²/m + SY²/n)² / [ (SX²/m)²/(m − 1) + (SY²/n)²/(n − 1) ].     (4.3)

(It isn't at all obvious from the formula, but it is good to know that min(m − 1, n − 1) ≤ ν ≤ m + n − 2.)

We are now in a position to write a confidence interval for µX − µY. An approximate 100(1 − α)% confidence interval for µX − µY is

(x̄ − ȳ) ± t* √(sX²/m + sY²/n)     (4.4)

where t* is the appropriate critical value t_{α/2,ν} from the t-distribution with ν degrees of freedom given by (4.3). We note that ν is not necessarily an integer, and we leave it to R to compute both the value of ν and the critical value t*.

Example 4.4.2
The t-test is due to "Student" (a pseudonym of William Sealy Gosset, whose employer, Guinness Brewery, did not allow him to publish under his own name). In a famous paper in 1908 addressing the issue of inference about means, Student considered data from a sleep experiment. Two different soporifics were tried on a number of subjects and the amount of extra sleep that each subject attained was recorded. The question is whether one soporific worked better than the other.

> sleep
   extra group
1    0.7     1
2   -1.6     1
3   -0.2     1
4   -1.2     1
5   -0.1     1
.................
> t.test(extra~group,data=sleep)

        Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.0794
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

We see that both groups averaged extra sleep, and the gain was larger for the subjects in group 2. However, it does not appear that we could say one drug was clearly better than the other (after all, 0 is in the confidence interval, so the mean difference could be 0). A 95% confidence interval for the difference in mean effect of the two drugs is (−3.37, 0.21). We can see that the degrees of freedom is 17.776, and we can be grateful that we didn't have to compute it or the critical value. Note too that R refers to this as a Welch test.

We should remark at this point that older books (and the Fundamentals of Engineering Exam) suggest an alternate approach to the problem of writing confidence intervals for µX − µY. These books suggest that we assume that the two standard deviations σX and σY are equal. In this case the exact distribution of our quantity is known. The problem with this approach is that there is usually no reason to suppose that σX and σY are equal, and if they are not, the proposed confidence interval procedure is not as robust as the one we are using. In these notes we take the approach of not even mentioning what this alternate procedure is, since it has fallen into disfavor.
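Equation 4.3 can be checked by hand against the output above. A minimal sketch using the built-in sleep data (the names g1, g2, vx, and vy are introduced only for this computation); it should reproduce the 17.776 degrees of freedom reported by t.test:

> g1 <- sleep$extra[sleep$group == "1"]
> g2 <- sleep$extra[sleep$group == "2"]
> vx <- var(g1)/length(g1); vy <- var(g2)/length(g2)    # the two pieces S^2/m and S^2/n
> (vx + vy)^2 / (vx^2/(length(g1) - 1) + vy^2/(length(g2) - 1))   # Welch nu, about 17.78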
Hypotheses and Cautions

Confidence intervals generated by Equation 4.4 are probably the most common confidence intervals in the statistical literature. But those who generate such intervals are not always sensitive to the hypotheses that are necessary to be confident about the confidence intervals generated. It should first be noted that the confidence intervals constructed are based on the hypothesis that the two populations are normally distributed. It is often apparent from even a cursory examination of the data that this hypothesis is unlikely to be true. However, if the sample sizes are large enough, we can rely on the Central Limit Theorem to tell us that our results are approximately true. There are a number of different rules of thumb as to what "large enough" means, but n, m > 15 for distributions that are relatively symmetric and n, m > 40 for most distributions are common rules of thumb. A second approximation concerns the approximation made in computing the Welch interval. The rule of thumb here is that we can be surer of confidence intervals in which the quotients sX²/m and sY²/n are similar in size than of those in which they are quite different.

Turning Confidence Intervals into Hypothesis Tests

It is often the case that we are interested in testing a hypothesis about µX − µY rather than computing a confidence interval for that quantity. For example, the null hypothesis µX − µY = 0 in the context of an experiment is a claim that there is no difference in the two treatments represented by X and Y. Hypothesis testing of this sort has fallen into disfavor in many circles since the knowledge that µX − µY ≠ 0 is of rather limited interest unless the size of this quantity is known. A confidence interval answers that question more directly. Nevertheless, since the literature is still littered with such hypothesis tests, we give an example here.

Example 4.4.3
Returning to our favorite chicks, we might want to know whether the effect of a diet of horsebean seed is really different from that of a diet of linseed. Suppose that x1, . . . , xm are the weights of the m chickens fed horsebean seed and y1, . . . , yn are the weights of the n chickens fed linseed. The hypothesis that we really want to test is H0: µX − µY = 0. We note that if the null hypothesis is true, then

T = (X̄ − Ȳ) / √(SX²/m + SY²/n)

has a distribution that is approximately a t-distribution, with the Welch formula giving the degrees of freedom. Thus the obvious strategy is to reject the null hypothesis if the absolute value of T is too large. Fortunately, R does all the appropriate computations. Notice that the mean weight of the two groups of chickens differs by 58.5 but that a 95% confidence interval for the true difference in means is (−99.1, −18.0). On this basis we expect to conclude that the linseed diet is superior, i.e., that there is a difference in the mean weights of the two populations. This is verified by the hypothesis test of H0: µX − µY = 0, which results in a p-value of 0.007. That is, this great a difference in mean weight would have been quite unlikely to occur if there were no real difference in the mean weights of the populations.
> hb=chickwts$weight[chickwts$feed=="horsebean"]
> ls=chickwts$weight[chickwts$feed=="linseed"]
> t.test(hb,ls)

        Welch Two Sample t-test

data:  hb and ls
t = -3.0172, df = 19.769, p-value = 0.006869
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -99.05970 -18.04030
sample estimates:
mean of x mean of y
   160.20    218.75

Variations

One-sided confidence intervals and one-sided tests are possible, as are intervals of different confidence levels. All that is needed is an adjustment of the critical values (for confidence intervals) or p-values (for tests).

Example 4.4.4
A random dot stereogram is shown to two groups of subjects and the time it takes for the subject to see the image is recorded. Subjects in one group (VV) are told what they are looking for, but subjects in the other group (NV) are not. The quantity of interest is the difference in average times. If µX is the theoretical average of the population of the NV group and µY is the average of the VV group, then we might want to test the hypotheses

H0: µX − µY = 0
Ha: µX > µY

> rds=read.csv('http://www.calvin.edu/~stob/data/randomdot.csv')
> rds
       Time Treatment
1  47.20001        NV
2  21.99998        NV
3  20.39999        NV
......................
77  1.10000        VV
78  1.00000        VV
> t.test(Time~Treatment,data=rds,conf.level=.9,alternative="greater")

        Welch Two Sample t-test

data:  Time by Treatment
t = 2.0384, df = 70.039, p-value = 0.02264
alternative hypothesis: true difference in means is greater than 0
90 percent confidence interval:
 1.099229      Inf
sample estimates:
mean in group NV mean in group VV
        8.560465         5.551429

From this we see that a lower bound on the difference µX − µY is 1.10 at the 90% level of confidence. And we see that the p-value for the result of this hypothesis test is 0.023. We would probably conclude that, on average, those getting no information take longer than those who do.

4.5 Regression Inference

In Section 2.4, we tried to describe the relationship between two quantitative variables by fitting a line to data that came to us in pairs (x1, y1), . . . , (xn, yn). In this section, we describe a statistical model that attempts to account for both the linear relationship in the data and also the fact that the data are not exactly collinear. What results is known as the standard linear model.

The standard linear model is given by the following equation that relates the values of x and y:

Yi = β0 + β1 xi + εi,

where
1. β0, β1 are (unknown) parameters,
2. εi is a random variable with mean 0 and (unknown) variance σ²,
3. thus Yi is a random variable with mean β0 + β1 xi and variance σ²,
4. the random variables εi (and hence the variables Yi) are independent,
5. the random variables εi are normally distributed.

We can write this model more succinctly in terms of linear algebra. Let β = (β0, β1). Then the model says that Y = Xβ + ε, where ε is a random vector. There are three unknown parameters to estimate in this model: β0, β1, and σ².

Estimating β0 and β1

One obvious choice for the estimates of β0 and β1 is given by the coefficients b0, b1 of the least squares regression line. It turns out that there are good statistical reasons for using b0, b1 to estimate β0, β1.

Lemma 4.5.1. The estimates b0 and b1 are unbiased estimates of β0 and β1 respectively. Therefore, ŷi = b0 + b1 xi is an unbiased estimate of β0 + β1 xi.
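The unbiasedness in Lemma 4.5.1 can be illustrated by simulation: generate data from the standard linear model many times and average the fitted slopes. This is only a sketch; the values β0 = 2, β1 = 0.5, σ = 1, and the x values are arbitrary choices, not part of the model.

> x <- seq(1, 10, length = 20)                   # fixed values of the predictor
> b1.values <- replicate(5000, {
+   y <- 2 + 0.5*x + rnorm(length(x), sd = 1)    # Y_i = beta0 + beta1*x_i + eps_i
+   coef(lm(y ~ x))[2]                           # the fitted slope b1
+ })
> mean(b1.values)                                # should be very close to the true slope 0.5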
Since b0 and b1 are the estimates, we will use B0, B1 for the estimators (just as we used X̄ and x̄ for the estimator and estimate of the mean). Unbiased estimators are not much good to us if they have large variance. It is fairly easy to show (using equation 2.2, say) that

Var(B1) = σ² / Σ(xi − x̄)²

Var(B0) = σ² Σxi² / (n Σ(xi − x̄)²) = σ² (1/n + x̄²/Σ(xi − x̄)²)

An inspection of these formulas for the variances of the coefficients shows that the variances of the estimators decrease as the number of observations increases (provided that the values xi are not all identical). The variance depends not only on the error variance but also on the spread of the independent variables xi. Qualitatively, at least, the variances of the estimators behave as we would want them to. But could we find estimators with even smaller variance? The following famous theorem says that the least-squares estimates of β0 and β1 are the best estimators in a certain precise sense.

Theorem 4.5.2 (Gauss-Markov Theorem). Assume that E(εi) = 0, Var(εi) = σ², and the random variables εi are independent. Then the estimators B0 and B1 are the unbiased estimators of minimum variance among all unbiased estimators that are linear in the random variables Yi. (We say that these estimators are BLUE, which stands for Best Linear Unbiased Estimator.)

Estimating σ²

The random variables εi have mean 0, variance σ², and are independent. Thus E(εi²) = σ². So we could estimate σ² by

Σ(i=1 to n) (yi − (β0 + β1 xi))² / n.     (4.5)

This fraction would give an unbiased estimate of σ². It is not much good, however, as we do not know β0 and β1. Substituting estimates for β0 and β1 and changing the denominator of the fraction gives us the estimate we need:

MSE = Σ(i=1 to n) (yi − (b0 + b1 xi))² / (n − 2) = SSResid / (n − 2).

This estimate, denoted MSE, is called the mean squared error. The justification for using n − 2 in the denominator rather than n, which would be more natural, is the same as that for using n − 1 in the definition of s² in Section 4.2. Namely, the use of n − 2 ensures that MSE is an unbiased estimate of σ². Notice that the denominator in each case accounts for the number of parameters estimated (one in the case of s² and two in the case of MSE).

Example 4.5.1
A class taught at a college in the midwest took three tests and a final exam. There were 32 students in the class. The final exam scores are related to the scores on Test 1. The result of a regression analysis appears below.

> class=read.csv('http://www.calvin.edu/~stob/data/m222.csv')
> class[1:3,]
  Test1 Test2 Test3 Exam
1    98   100    98  181
2    93    91    89  168
3   100    99    99  193
> l.class=lm(Exam~Test1,data=class)
> summary(l.class)

Call:
lm(formula = Exam ~ Test1, data = class)

Residuals:
     Min       1Q   Median       3Q      Max
-33.6930 -10.1574  -0.9462   8.5918  44.0759

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  23.9652    22.7916   1.051    0.301
Test1         1.6044     0.2729   5.880 1.95e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.86 on 30 degrees of freedom
Multiple R-Squared: 0.5354, Adjusted R-squared: 0.5199
F-statistic: 34.57 on 1 and 30 DF, p-value: 1.952e-06

> p.class=predict(l.class)
> mse=sum( (class$Exam-p.class)^2/30 )
> rse=sqrt(mse)
> rse
[1] 18.86495

Notice that R computes √MSE, which is called the residual standard error. The residual standard error is used as an estimate for σ (although it is not an unbiased estimate of σ).
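As a check, the same number can be read off the fitted model object rather than recomputed from the residuals: summary() of an lm fit stores the residual standard error in its sigma component. A small sketch continuing the session above:

> summary(l.class)$sigma    # the residual standard error again, about 18.86
> df.residual(l.class)      # the 30 degrees of freedom in the denominator of MSE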
In keeping with our previous use of s to denote the estimate of the standard deviation of an unknown distribution, we will generally use se to denote the residual standard error (and Se to denote the corresponding estimator).

What we have done until now does not depend on the normality assumption on the random variables εi, but only on the fact that they are independent with mean 0 and common variance σ². In order to make inferences about the parameters β0 and β1, we need to assume something about the distribution of the εi, and so we now assume also that the random variables εi are normally distributed. This in turn implies that the random variables Yi are normally distributed with E(Yi) = β0 + β1 xi and Var(Yi) = σ². Under this assumption, it turns out that the estimators B0 and B1 are normally distributed as well. So we have that

B0 ∼ N(β0, σ²(1/n + x̄²/Σ(xi − x̄)²))

B1 ∼ N(β1, σ²/Σ(xi − x̄)²)

We will primarily be concerned with constructing confidence intervals and hypothesis tests for β1, the slope in the regression line. The reason for this is that the slope tells us the direction and size of the supposed linear relationship between x and y. The same reasoning can be used to write confidence intervals and tests for β0.

Our procedure for writing a confidence interval for β1 is very similar to that of constructing a confidence interval for the mean µ of an unknown distribution. Just as in that case, the unknown standard deviation σ is a nuisance parameter and we must substitute an estimate of σ for it. In this case we use se. This in turn means that the sampling distribution of our statistic becomes a t-distribution rather than a normal distribution. The resulting fact is that the statistic

T = (B1 − β1) / (Se/√(Σ(xi − x̄)²))

has a t-distribution with n − 2 degrees of freedom. (Here n − 2 matches the denominator in the definition of se.) We define s_{b1} = se/√(Σ(xi − x̄)²). This number s_{b1} is called the estimate of the standard error of b1. We now have the following result.

Confidence Intervals for β1
A 100(1 − α)% confidence interval for β1 is given by

(b1 − t_{α/2,n−2} s_{b1}, b1 + t_{α/2,n−2} s_{b1})

Example 4.5.2
In Example 2.4.1 we used linear regression to write a relationship between iron content and material loss in certain Cu/Ni alloy bars. The dataset was the corrosion dataset in R. In what follows, we write a 95% confidence interval for the slope of the regression line.

> summary(l.corrosion)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  129.787      1.403   92.52  < 2e-16 ***
Fe           -24.020      1.280  -18.77 1.06e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.058 on 11 degrees of freedom
Multiple R-Squared: 0.9697, Adjusted R-squared: 0.967
F-statistic: 352.3 on 1 and 11 DF, p-value: 1.055e-09

> qt(.975,11)
[1] 2.200985
> c(-24.020 - qt(.975,11)*1.280, -24.020+qt(.975,11)*1.280)
[1] -26.83726 -21.20274

The confidence interval constructed, (−26.84, −21.20), is a 95% confidence interval for the slope of the "true" linear relationship between x and the mean of y. To interpret this, we might say something like "We are 95% confident that an increase in iron content of 1% results in an average loss of between 21.2 and 26.8 milligrams per square decimeter of material." Notice the high R² value in this model. A very high percentage of the variation in loss due to corrosion is explained by the percentage iron content of the bar.

4.6 Exercises

4.1 A basketball player claims to be a 90% free-throw shooter.
Namely, she claims to be able to make 90% of her free-throws. Should we doubt her claim if she makes 14 out of 20 in a session at practice? Set this problem up as a hypothesis testing problem and answer the following questions.
a) What are the null and alternate hypotheses?
b) What is the p-value of the result 14?
c) If the decision rule is to reject her claim if she makes 15 or fewer free-throws, what is the probability of a Type I error?

4.2 In Example 4.1.1(c), we are trying to decide whether to fire the old kicker and hire a new one on the basis of a trial of 20 kicks. Suppose that we decide to hire the new kicker if he makes 8 or more kicks.
a) Suppose that he makes exactly 8 kicks. What is the p-value of this result?
b) What is α, the probability of a Type I error, for this decision rule?
c) If the kicker truly has a 35% chance of making each kick, what is the probability of a Type II error (i.e., that we don't believe that he is better than the old kicker)?

4.3 Nationally, 79% of students report that they have cheated on an exam at some point in their college career. You can't believe that the number is this high at your own institution. Suppose that you take a random sample of size 50 from your student body. Since 50 is so small compared to the size of the student body, you can treat this sampling situation as sampling with replacement for the purposes of doing a statistical analysis.
a) Write an appropriate set of hypotheses to test the claim that 79% of students cheat.
b) Construct a decision rule so that the probability of a Type I error is less than 5%.

4.4 In this problem, you will develop a hypothesis test for a random variable other than a binomial one. Suppose that you believe the waiting time until you are served at the MacDonald's on 28th Street is a random variable with an exponential distribution but with unknown λ. The sign at the drive-up window says that the average wait time is 1 minute. You actually wait 2 minutes. Your friend in the car says that this is outrageous and that the claim on the sign must be wrong.
a) Write a pair of hypotheses about λ that captures the discussion between you and your friend.
b) What is the p-value of the single data point of a 2 minute wait?
c) Write a sentence that explains clearly to your friend the meaning of that p-value. Remember that your friend has not yet been fortunate enough to take a statistics course.
d) How long would you have to have waited to be suspicious of MacDonald's claim? There are many right answers to this question, but any answer needs statistical justification.

4.5 In Example 4.2.2 we generated an approximate 95% confidence interval for µ assuming that σ is known.
a) Construct instead a 90% confidence interval for µ.
b) Construct both 90% one-sided confidence intervals for µ.
c) Describe clearly a situation in which you would want a one-sided confidence interval rather than a two-sided one.

4.6 Suppose that we are in a situation where we would want to construct a confidence interval for µ and we knew σ = 0.3. How large a sample should we take to ensure that a 95% confidence interval would estimate µ to within 0.1?

4.7 Suppose that X1, . . . , Xn are i.i.d. from an exponential distribution with parameter λ unknown. In this problem we write a confidence interval for λ using X̄.
a) Rewrite Equation 4.1 in this case by substituting for µ and σ the appropriate expressions involving λ.
b) Solve the inequality that results in part (a) for an inequality of the form a < λ < b where a and b do not involve λ.
c) Suppose that n = 30 and x̄ = 4.23. Using (b), write an approximate 95% confidence interval for λ. Note that this confidence interval relies on the CLT but makes no other approximation.

4.8 The chickwts dataset presents the results of an experiment in which chickens are fed six different feeds. Suppose that the chickens were assigned to the feed groups at random, so that we can assume that the chickens can be thought of as coming from one population. For each feed, we can assume that the chickens fed that feed are a random sample of the (theoretical) population that would result from feeding all chickens that feed.
a) Write 95% confidence intervals for the mean weight of chickens fed each of the six feeds.
b) From an examination of the six resulting confidence intervals, is there convincing evidence that some diets are better than others?
c) Since you no doubt used the t-distribution to generate the confidence intervals in (a), you might wonder whether that is appropriate. Are there any features in the data that suggest that this might not be appropriate?

4.9 The dataframe in http://www.calvin.edu/~stob/data/miaa05.csv contains statistics on each of the 134 players in the MIAA 2005 Men's Basketball season. Choose 10 different random samples of size 15 from this dataset.
a) From each, compute a 90% confidence interval for the mean PTSG (points per game) of all players.
b) Of the 10 confidence intervals you computed in part (a), how many actually did contain the true mean? (Which you can compute since you have the population in this instance.)
c) How many of the 10 confidence intervals in part (a) would you have expected (before you actually generated them) to contain the true mean?
d) In light of your answer in (c), are you surprised by your answer in (b)?

4.10 The dataset http://www.calvin.edu/~stob/data/reading.csv contains the results of an experiment done to test the effectiveness of three different methods of reading instruction. We are interested here in comparing the two methods DRTA and Strat. Let's suppose, for the moment, that students were assigned randomly to these two different treatments.
a) Use the scores on the third posttest (POST3) to investigate the difference between these two teaching methods by constructing a 95% confidence interval for the difference in the means of posttest scores.
b) Your confidence interval in part (a) relies on certain assumptions. Do you have any concerns about these assumptions being satisfied in this case?
c) Using your result in (a), can you make a conclusion about which method of reading instruction is better?

4.11 Surveying a choir, you might expect that there would not be a significant height difference between sopranos and altos but that there would be between sopranos and basses. The dataset singer from the lattice package contains the heights of the members of the New York Choral Society together with their singing parts.
a) Decide whether these differences do or do not exist by computing relevant confidence intervals.
b) These singers aren't random samples from any particular population. Explain what your conclusion in (a) might be about.

4.12 Returning to the sport of baseball one last time, let's reexamine the results of the 1994–1998 baseball seasons in http://www.calvin.edu/~stob/data/team.csv. Earlier, we tried to predict R (runs) by HR (homeruns).
Let's refine that analysis here.
a) Instead of predicting R from HR, use regression to write a linear relationship to predict RG (runs per game) from HRG (homeruns per game).
b) Interpret the slope and intercept of the line in part (a) informally.
c) Write a 95% confidence interval for the slope of the line in (a).

4.13 The dataset http://www.calvin.edu/~stob/data/lakemary.csv contains the age and length (in mm) of 78 bluegills captured from Lake Mary, Minnesota. (Richard Frie, J. Amer. Stat. Assoc., (81), 922-929.)
a) Write a linear function to predict the length from the age.
b) Interpret the slope and intercept of the line in (a).
c) Write a 95% confidence interval for the slope of the regression line.
d) Do you have any comments about the data or the model?

4.14 The dataset http://www.calvin.edu/~stob/data/home.csv contains the prices of homes in a certain community at two different points in time.
a) Write a linear function to predict the old price from the new.
b) Write a 90% confidence interval for the slope of the line in (a).
c) Write a sentence explaining what the confidence interval in (b) means.