Math 243 – Spring 2007 January 29, 2007 Important Informtion 1. Course Web Page: http://www.calvin.edu/ rpruim/courses/m243/S07/. 2. Provide exam “black out dates” (with reasons) via email by Friday. Introduction to Statistics 1. What is statistics? 2. 8 steps in a statistical study (Lady Tasting Tea Example) c 2007 Randall Pruim ([email protected]) 1 January 30, 2007 Math 243 – Spring 2007 2 Announcements • Send exam “blackout dates” via email by Friday. • Meet in Mac Lab on Thursday. (NH 067, in the basement.) Read SimpleR, pages 1–7 BEFORE class on Thursday. • Problem Set 2 is due on Friday. • Current plan: problem sets will usually be due Tuesdays and Fridays. Looking at Data Goals: Today we will learn 1. how data sets are organized and the terminology used to describe and talk about data sets. 2. some common graphical displays of variables (stemplot and histograms) and what they tell us about the distribution of a variable. 3. how to use R to make these graphical displays for us. 1. Structure of a data set (a) units (individuals, subjects, cases, ...) (b) variables (measurements made on individuals) 2. Types of Variables (a) categorical (qualitative) – places units into categories (b) quantitative (numerical) – measurement on some scale • • • • • nominal: ordinal: interval: ratio: discrete vs. continuous: 3. Distribution of a variable (what values and how often) 4. Graphing distributions (a) stemplots (b) histograms c 2007 Randall Pruim ([email protected]) January 30, 2007 Math 243 – Spring 2007 3 Useful R Commands • Getting Started, Getting Data ◦ getCalvin(’m243’) – load functions, data, etc. for this class. ◦ source(’foo.R’) – execute the commands in the file foo.R. · From your own machine you can use source(’http://www.calvin.edu/~rpruim/R/courses/m243.R’) to do the same thing as getCalvin(’m243’). · require(Devore6) – load package of data sets from textbook. (Other packages can be loaded similarly. getCalvin(’m243’) will load a few for you, including Devore6.) · data(xmp01.05), data(ex01.05), data(faithful) – attach data set for an Example 1.5, Exercise 1.5, or a named data set. (The data type most commonly used for a data set in R is called a data frame. A data frame can contain many variables, each of which is represented as a vector, which is much like an array.) · names(faithful) – see the names of the variables in this data set. • Graphical Summaries ◦ stem(xmp01.05$bingePct) – basic stemplot of bingePct variable. Notice $ to get at a single variable (as a vector). ◦ stemplot(xmp01) – fancier stemplot defined in m243.R; by default makes a stemplot for each numeric variable in data frame. ◦ hist(faithful$eruptions) – make a histogram. ◦ histogram(~eruptions,data=faithful) – make a histogram of eruptions times for Old Faithful. (Requires lattice package.) • Getting Some Help ◦ ?hist, ?faithful – get documentation for the hist() command or the faithful data frame. (Works with most other built-in or prepaaged commands as well. Won’t necessarily work with utilities I provide.) ◦ args(hist) – list the arguments of the hist() function. ◦ apropos(plot) – list all functions that have plot in their name. Useful for recalling the name of a function or checking to see if a function exists. c 2007 Randall Pruim ([email protected]) February 2, 2007 Math 243 – Spring 2007 4 Quatiles and the Five Number Summary 1. What is a median? A1. the middle value of a sorted data set (or average of two middle values if even number data values) A2. 
a value with half the distribution above and half the distribution below 2. q-quantile is a value such that the proportion of the distribution below the value is q (and 1-q is above) • percentile is the same thing expressed as a percentage instead of as a proportion. • quartiles, quintiles, deciles also common • median = .5-quantile = 50th percentile 3. Different methods for determining quantiles from data • median of each half ◦ variations depending on whether we include or exclude the median of the original data • ruler method: ◦ imagine the data arranged along a ruler and determine location of quantiles by measuring ◦ variations depending on where the data are “positioned” along the ruler. 4. Boxplots are a graphical version of the five-number summary (the five quartiles). Useful R Commands • quantile() and fivenum() give (different) versions of quantiles. • boxplot() and bwplot() produce boxplots. bwplot() requires the lattice package (which is installed as part of the basic R installation but may need to be loaded). • summary() – summarize an object The summary() command does different things depending on the type of object it is given. For a data frame it gives a summary of each variable. For quantitative variables it gives the five-number summary plus the mean. For a categorical variables it provides (a portion of) the distribution as a table. summary() can be used on many other kinds of objects as well. > summary(1:10) Min. 1st Qu. 1.00 3.25 Median 5.50 Mean 3rd Qu. 5.50 7.75 Max. 10.00 • The colon operator can be used to make a consecutive sequence. seq() gives finer control: > 1:10 [1] 1 2 3 4 5 > seq(0,100,by=20) [1] 0 20 40 60 6 7 8 9 10 80 100 c 2007 Randall Pruim ([email protected]) February 2, 2007 Math 243 – Spring 2007 5 • table(x) gives a table of the values in the vector x. • cut() makes a categorical variable (which R calls a factor ) from a quantitative variable by placing each value into a bin. Bin divisions are specified with the break argument. This is much like the first step in forming a histogram. Can you figure out this example? > table(cut((1:10)^2,breaks=c(0,20,40,60,80,100))) (0,20] 4 (20,40] 2 (40,60] 1 (60,80] (80,100] 1 2 • cumsum() is sometimes handy. Here is an example > cumsum(1:10) [1] 1 3 6 10 15 21 28 36 45 55 c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 February 5, 2007 6 Randomness & Probability 1. Calculation of variance and standard deviation 2. Two key properties of randomness (a) (b) 3. Important terminology • outcome: • sample space: • event: 4. Random Variables • Notation: P(X = 8), P(X ≤ 7), etc. • Important example: sampling distributions 5. Two types of probability (a) (b) Before class tomorrow: Find a penny (the newest American penny you can locate) and flip/spin/tip it 50 times, keeping track of how many heads and tails you get for each. Useful R Commands R has several utilities for working with important random variables. The Lady Tasting Tea situation that we talked about on day one is an example of a binomial random variable. We’ll learn more about this random variable in the coming days. • rbinom(n,size,prob) – simulates X n times, where X is a binomial random variable with success probability prob and size trials. To simulate 1000 Ladies Tasting Tea who are just guessing: rbinom(1000,10,.5). • dbinom(q,size,prob) – calculates P(X = q) where X is a binomial random variable with success probability prob and size trials. 
For the Lady Tasting Tea, dbinom(8,10,.5) gives the probability of getting exactly 8 of 10 correct just by guessing. We’ll learn how to do this calculation by hand tomorrow. • pbinom(q,size,prob) – calculates P(X ≤ q) where X is a binomial random variable with success probability prob and size trials. For the Lady Tasting Tea, 1 - pbinom(7,10,.5) gives the probability of getting at least 8 of 10 correct just by guessing. • There are versions of the functions above for many other random variables, too. The names are all rdist, ddist, pdist, where dist is replaced by some version of the name of the distribution. Stay tuned. c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 February 6, 2007 7 Probability calculations using the Theoretical Method 1. Three Probability Axioms (a) P(E) ∈ [0, 1] (b) P(S) = 1 (where S is the sample space) (c) Probability of disjoint union is sum of probabilities. • P(A ∪ B) = P(A) + P(B) provided A ∩ B = ∅. • P(A1 ∪ A2 ∪ · · · ∪ Ak ) = P(A1 ) + P(A2 ) + · · · + P(Ak ) provided Ai ∩ Aj = ∅ whenever i 6= j. P∞ • P (∪∞ i=1 Ai ) = i=1 P(Ai ) provided Ai ∩ Aj = ∅ whenever i 6= j. 2. More Probability Rules and Examples (a) Equally Likely Rule: If every outcome in a sample space is equally likely, then P(E) = |E| N (E) size of event = = |S| N (S) size of sample space • probability of heads is .5 if heads and tails are equally likely • many other applications (fair dice, shuffled cards, etc.) • easy to apply if we can determine sizes of E and S (see item 3 below) (b) Complement Rule: P(E) = 1 − P(E 0 ). • Every probability problem is really two probability problems; we can choose to solve the “easier” of the two. (c) Inclusion/Exclusion: P(A ∪ B) = P(A) + P(B) − P(A ∩ B) 3. Counting without counting To use the equally likely rule, we need to be able to count the number of outcomes in our event and in the sample space. For small sets, we can do this by inspection, but for larger sets, we need to “count without counting”. (a) Bijection Rule: Don’t double count, don’t skip (b) Sum Rule: Divide and Conquor (c) Difference Rule: Compensate for double counting (d) Product Rule: One stage at a time • Permutations (e) Division Rule: Another way to compensate for “double” counting • Combinations (binomial coefficients) • Lady Tasting Tea (guessing) c 2007 Randall Pruim ([email protected]) February 6, 2007 Math 243 – Spring 2007 8 Examples: Calculating Theoretical Probabilities 1. Equally Likely Rule: If every outcome in a sample space is equally likely, then P(E) = |E| N (E) size of event = = |S| N (S) size of sample space • Only applies when all outcomes are equally likely. • Easy way to determine probabilities if the counting problems are easy; can be challenging when counting is more difficult. • Two easy examples: ◦ Probabilty of rolling doubles (with two fair 6-sided dice) ◦ Daily 3 Lottery game 2. Permutations and Combinations as systematic counting methods (a) Probability of getting a (5-card) flush • ordered hand method (using multiplication principle) Number of hands = 52 · 51 · 50 · 49 · 48 Number of flushes = 52 · 12 · 11 · 10 · 9 • unordered hand method Each unordered hand has 5 · 4 · 3 · 2 · 1 orderings. So 52 · 51 · 50 · 49 · 48 Number of hands = 5·4·3·2·1 52 · 12 · 11 · 10 · 9 Number of flushes = 5·4·3·2·1 (b) Permutations: Pk,n = number of ways to order k objects from a set of n objects. n! Pk,n = n(n − 1)(n − 2) · · · (n − k + 1) = (n − k)! n (c) Combinations: k = number of ways to select k objects from a set of n objects (order doesn’t matter). 
Pk,n n n! = = k k! k!(n − k)! choose a suit, then 5 cards from that suit z }| { 13 4 5 • Probability of getting a 5-card flush = 52 5 | {z } choose 5 cards from 52 Useful R Commands • factorial(n) – compute n! (n factorial) • choose(n,k) – compute nk (n choose k) c 2007 Randall Pruim ([email protected]) February 8, 2007 Math 243 – Spring 2007 9 More Probability Problems 1. Lady Tasting Tea Suppose the lady is just guessing, let X be the number of correct guesses. 10 number of ways to guess 8 correctly in 10 tries 8 • P(X = 8) = = 10 number of ways to guess 2 • R can easily make a table for different numbers of correct guesses: > cbind(5:10,dbinom(5:10,10,.5),1/dbinom(5:10,10,.5)) [1,] [2,] [3,] [4,] [5,] [6,] 5 6 7 8 9 10 0.2460937500 4.063492 # about 1 time in 4 0.2050781250 4.876190 0.1171875000 8.533333 0.0439453125 22.755556 0.0097656250 102.400000 0.0009765625 1024.000000 # 1 time in 1024 Not very likely to 8 or more correct by guessing. 2. Probability of rolling 5 dice and getting at least two that match. 6·5·4·3·2 65 • P(at least two match) = 1 − P(all different). • Think complement: P(all different) = 3. Let X be the number of suits in a 5-card hand, what is the distribution of X? > > > > > p<-rep(NA,4) p[1] <- choose(4,1) * choose(13,5) / choose(52,5) p[2] <- choose(4,2) * ( choose(26,5) - (choose(13,5)+choose(13,5))) / choose(52,5) p[4] <- choose(4,1) * choose(13,2) * 13 * 13 * 13 / choose(52,5) p[3] <- 1 - sum(p[-3]) # sum of all probabilities must be 1 > rbind(1:4,p) 1.000000000 2.0000000 3.0000000 4.0000000 p 0.001980792 0.1459184 0.5883553 0.2637455 Conditional Probability 1. Q: If a family has two kids and at least one is a boy, what is the probability that both are boys? GG GB BG BB 2. General Formala: Probability of A given B = P(A | B) = probability=1/3 P(A ∩ B) P(B) • a Venn diagram is useful to picture this c 2007 Randall Pruim ([email protected]) February 9, 2007 Math 243 – Spring 2007 10 Conditional Probability 1. Definition: P (A | B) = P(A∩B) P(B) 2. Example: Suppose items are produced on two assembly lines. Some are good and some are defective (bad). One day the production data were as follows: Assembly Line 1 Assembly Line 2 • P(Bad) = 3/18; Bad 2 1 Good 6 9 P(AL 1) = 8/18 • P(Bad | AL 1) = 2/8 = 2/18 8/18 ; P(AL 1 | Bad) = 2/3 = 2/18 3/18 3. Turning the formula around: P (A ∩ B) = P(A) · P(B | A) = P(B) · P(A | B) (a) Example: Yahtzee; roll 5 dice • P(5 different numbers) = 1 · 56 · 46 · 36 · 26 = 0.0925926 • P(large straight | 5 different numbers) = 2/6 = 1/3 (6 possible missing numbers – two yield large straight.) • So P(large straight) = P(5 different)·P(large straight | 5 different) = ( 56 · 64 · 36 · 26 ) 31 = .0308642 (b) Example: Medical Tests. Suppose a test correctly identifies diseased people 98% of the time, and correctly identifies healthy people 99%. Furthermore assume that one person in 1000 has the disease. If a random person is tested and the test comes back positive, what is the probability that the person has the disease? • P(D) = 0.001; P(H) = 0.999 • P(+ | D) = 0.98; P(− | H) = .99 P(− | D) = 0.02; P(+ | H) = .01 P(D) · P(+ | D) .001 · .98 P(D ∩ +) = = = 0.0893 • P(D | +) = P(+) P(D ∩ +) + P(H ∩ +) .001 · .98 + .999 · .01 (Wow!) Indpendence 1. If P (A | B) = P(A) then we say that A and B are independent. • Intuition: knowing whether or not B occurs gives no information to change the probabilities that A occurs; the two events are independent. 2. If A and B are independent, then P (A ∩ B) = P(A) · P(B). 3. 
Two ways independence shows up in statistics: (a) As a hypothesis to be tested using data. (E.g., does it appear that smoking and getting lung cancer are independent?) (b) As an assumption (because it makes probability calculations easier). For example, we will usually assume some sort of indpendence in our sampling methods. (Stay tuned.) c 2007 Randall Pruim ([email protected]) February 12, 2007 Math 243 – Spring 2007 11 Discrete Distributions 1. Can be described by giving a table of probabilities. 2. Can be described by giving the pmf (probability mass function, also called distribution function), a function giving values for P(X = x). 3. Can be described by giving the cdf (cummulative distribution function), a function giving values for P(X ≤ x). 4. Often we can give a single formula for all the distributions in a family of distributions. Such formulas will mention parameters – numbers that distinguish the members of the family from one another. (See examples below.) Some Important Families of Distributions 1. Geometric Distributions • • • • two outcomes for each trial (S and F) probablity of S on each trial is constant p each trial independent of others repeat until we get S If we let X count the number of failures before success, then X is geometric with parameter p. • pdf = g(x; p) = (1 − p)x p 2. Binomial Distributions • • • • two outcomes for each trial (S and F) probablity of S on each trial is constant p each trial independent of others repeat predetermined number of times (n) If we let X count the number of successes in n tries, then X is binomial with parameters n and p. n x • pdf = b(x; n, p) = p (1 − p)n−x x We can use binomial distributions to model the lady tasting tea under the assumption that her probability of being correct is p. Useful R Commands • dbinom(x,size,prob) = pmf for a binomial distribution (think d for d istribution). • pbinom(x,size,prob) = P(X ≤ x) = cdf for a binomial distribution. • dgeom(), pgeom() do the same for geometric distribution. • Similar functions exist for many families of distributions. c 2007 Randall Pruim ([email protected]) February 13, 2007 Math 243 – Spring 2007 12 Describing Discrete Distributions Let X be a discrete random variable with pmf f . 1. Pictures: Probability histograms and line graphs (pages 103–104) 2. Expected Value (mean) • motivating example: computing GPA = “average grade” X X • E(X) = µX = x · f (x) = value · probability • Easy to do in R for familiar finite distributions. Example: > sum(0:10 * dbinom(0:10,size=10,prob=.8)) # = 8 P • Expected Value of a function: E(h(X)) = h(x) · f (x). 3. Variance (and standard deviation) P 2 = E (X − µ )2 = • Var(X) = σX (x − µX )2 f (x) X • Alternate Formula: Var(X) = E(X 2 ) − [E(X)]2 Proof: Expand algebraically; split into 3 sums; simplify each sum. 4. Expected Value of Linear Function: E(aX + b) = a E(X) + b 5. Variance of Linear Function: Var(aX + b) = a2 E(X) Examples 1. Daily 3 Lottery played straight W = winnings ($) probability 0 .999 500 .001 2. Bernoulli random variable (with paramter p) value of X probability 0 1−p 1 p 3. Searching for either one of two out of 5 possibilities. (Sock drawer, for example.) Y = number of attempts one is found. value of Y probability 1 .4 2 .3 3 .2 4 .1 4. 
Daily 3 Lottery played boxed (123)

Z = winnings ($):   0      41     291
probability:        .994   .005   .001

c 2007 Randall Pruim ([email protected]) February 13, 2007 Math 243 – Spring 2007 13

Important Named Discrete Distributions

In the table below, C(n, x) denotes the binomial coefficient "n choose x" (choose(n,x) in R), and q = 1 − p throughout.

family (R name): pdf; expected value (mean); variance; notes
• Bernoulli: f(1) = p, f(0) = 1 − p; mean p; variance pq
• binomial (binom): b(x; n, p) = C(n, x) p^x q^(n−x); mean np; variance npq
• geometric (geom): g(x; p) = q^x p; mean q/p = 1/p − 1; variance q/p^2
• negative binomial (nbinom): nb(x; p, r) = C(x + r − 1, r − 1) q^x p^r; mean r · q/p = r · (1/p − 1); variance rq/p^2
• hypergeometric [text book]: h(x; n, M, N) = C(M, x) C(N − M, n − x) / C(N, n); mean np; variance (N − n)/(N − 1) · npq; here p = M/N
• hypergeometric [R: dhyper(x,m,n,k)]: C(m, x) C(n, k − x) / C(m + n, k); mean kp; variance (m + n − k)/(m + n − 1) · kpq; here p = m/(m + n)

• Both R and our text count the number of failures to get the value for the random variable called geometric or negative binomial. Some authors count instead the number of trials (successes + failures).
• The book and R parameterize the hypergeometric distributions in different (but equivalent) ways. It is traditional to describe the hypergeometric model in terms of selecting objects (usually balls or marbles) from an urn. Each object has one of two colors. Given this description, here are the two parameterizations:

                                        book    R
number of objects in urn                 N      n + m
number of objects with target color      M      m
number of objects with other color       N − M  n
number of objects drawn from urn         n      k

In both cases the random variable counts the number of items drawn from the urn that have the "target" color.

c 2007 Randall Pruim ([email protected]) February 15, 2007 Math 243 – Spring 2007 14

More Examples of Discrete Distributions

1. Lady Tasting Tea (again)
Suppose that we are once again in the business of testing whether the lady can really tell the difference between the two infusion orders. This time we adopt a different strategy. We will again prepare 10 cups of tea, but this time we make sure that 5 cups are prepared each way. Does this matter? Is it a better idea than our other method? Worse?
Suppose the lady is just guessing. She knows there are five cups of each type, so she is asked to place a marker by the 5 cups that had tea put in them first. Let X be the number of these that are correctly placed. Let's work out the pmf for X:

value of X:    0   1   2   3   4   5
probability:

What is P(X ≥ 4)? How does it compare with getting 8 or more correct in the binomial situation?

2. The hypergeometric distribution
The random variable in item 1 is an example of a hypergeometric distribution. It is traditional to describe this distribution in terms of an urn model. Suppose
• m white balls are in the urn,
• n black balls are in the urn,
• k balls are selected at random from the urn, and
• X = the number of white balls selected (among the k);
then X has a hypergeometric distribution. The pmf for X (using R's parameterization) is h(x; m, n, k) =

3. Going to a Play
Here are the data from our class survey about going to a play after losing either a ticket or cash.

               attend play?
               no    yes
lost cash       9     13
lost ticket    14      9

The proportion of survey participants who would buy a new ticket differs depending on what was lost, but is this just random chance, or are people who lose cash really more likely to buy a ticket than those who lose a ticket?
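Returning to the question in item 1 above: a quick sketch, using the built-in dhyper() and pbinom() functions (used again in the R code on the next page), compares P(X ≥ 4) under this design with guessing 8 or more of 10 correct under the original binomial design.

> sum(dhyper(4:5, 5, 5, 5))   # P(X >= 4) for the hypergeometric design
[1] 0.1031746
> 1 - pbinom(7, 10, 0.5)      # P(8 or more of 10 correct) for the binomial design
[1] 0.0546875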
c 2007 Randall Pruim ([email protected]) February 15, 2007 Math 243 – Spring 2007 15 Useful R Code > read.csv(’http://www.calvin.edu/~rpruim/data/survey/littleSurvey.csv’) -> survey > dim(survey) [1] 279 27 > names(survey) # output surpressed > table(survey$section) 143A 21 143B 16 143C 20 > summary(survey) 143D 110 243A dcm07 45 67 # output surpressed > levels(survey$playVer) [1] "v1" "v2" > levels(survey$playVer) = c(’lostCash’,’lostTicket’) > xtabs(~play+playVer,survey) playVer play lostCash lostTicket no 61 69 yes 103 44 > xtabs(~play+playVer,survey,subset=section==’243A’) playVer play lostCash lostTicket 0 0 no 9 14 yes 13 9 > round(dhyper(0:5,5,5,5),digits=4) [1] 0.0040 0.0992 0.3968 0.3968 0.0992 0.0040 > round(dhyper(0:22,22,23,22),digits=4) [1] 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0044 0.0203 0.0635 [10] 0.1382 0.2124 0.2317 0.1797 0.0987 0.0381 0.0102 0.0018 0.0002 [19] 0.0000 0.0000 0.0000 0.0000 0.0000 > plot(dhyper(0:22,22,23,22)) > xyplot(dhyper(0:22,22,23,22)~0:22,groups=dhyper(0:22,22,23,22) <= dhyper(9,22,23,22),pch=16) c 2007 Randall Pruim ([email protected]) February 15, 2007 Math 243 – Spring 2007 > fisher.test(xtabs(~play+playVer,survey,subset=section==’243A’)) Fisher’s Exact Test for Count Data data: xtabs(~play + playVer, survey, subset = section == "243A") p-value = 0.238 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.1141432 1.7066640 sample estimates: odds ratio 0.4533593 > fisher.test(xtabs(~play+playVer,survey)) Fisher’s Exact Test for Count Data data: xtabs(~play + playVer, survey) p-value = 0.0001362 alternative hypothesis: true odds ratio is not equal to 1 95 percent confidence interval: 0.2236475 0.6366559 sample estimates: odds ratio 0.3790412 > binom.test(8,10,.5) Exact binomial test data: 8 and 10 number of successes = 8, number of trials = 10, p-value = 0.1094 alternative hypothesis: true probability of success is not equal to 0.5 95 percent confidence interval: 0.4439045 0.9747893 sample estimates: probability of success 0.8 > 1- pbinom(7,10,.5) [1] 0.0546875 > 1- pbinom(7,10,.5) + pbinom(2,10,.5) [1] 0.109375 c 2007 Randall Pruim ([email protected]) 16 February 16 & 19, 2007 Math 243 – Spring 2007 17 Rules for Expected Value and Variance The following three “rules” can be proven by some straightforward algebra applied to the definitions. 1. E(aX + b) = a E(X) + b 2. Var(aX + b) = a2 Var(X) 3. Var(X) = E(X 2 ) − [E(X)]2 Continuous Distributions 1. Definitions (a) continuous vs. discrete (b) pdf (probability density function) • Key Idea: Rb • definition of probability using pdf: P(a ≤ X ≤ b) = a f (x) dx Rx (c) cdf (cummulative density function): F (x) = P(X ≤ x) = −∞ f (t) dt • FTC ⇒ pdf is derivative of cdf; cdf is an antiderivative of pdf 2. Examples (a) uniform distribution: pdf is constant on an interval, what constant? (b) f (x) = kx2 on [0, 1], what is k? (c) f (x) = ke−x on [0, ∞), what is k? 3. Expected Value and Variance • Key Idea: R∞ • Expected Value: E(X) = −∞ xf (x) dx R∞ • Variance: Var(X) = −∞ (x − µX )2 f (x) dx • Three Rules for Expected Value and Variance still true. 
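As a check on example (b) above (and a preview of the integrate() command listed just below), numerical integration in R can confirm the constant and the resulting mean and variance. A minimal sketch, using k = 3, the value that makes the density integrate to 1:

> f <- function(x) { 3 * x^2 }                              # pdf from example (b)
> integrate(f, 0, 1)$value                                  # total area; should be 1
[1] 1
> mu <- integrate(function(x) { x * f(x) }, 0, 1)$value     # E(X)
> mu
[1] 0.75
> integrate(function(x) { (x - mu)^2 * f(x) }, 0, 1)$value  # Var(X)
[1] 0.0375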
Useful R Commands • f <- function(x) { x^2 } – define a function • integrate(f,0,3) – do numerical integration give text report that includes the estimate and an indication of its accuracy • integrate(f,0,3)$value – do numerical integration, and retun the result as a number • integrate(function(x) {exp(-x)},0,Inf) – numerical approximation of indefinite integral • plot(f,xlim=c(0,3)) – make a sketch of the graph of f c 2007 Randall Pruim ([email protected]) February 20, 2007 Math 243 – Spring 2007 18 Quantiles of Continuous Distributions 1. To find the 90th percentile of a continuous distribution, solve for q in the equation below: Z q P (X ≤ q) = f (x) dx = pdist(q) = .90 −∞ 2. Other quantiles are found by replacing .90 with the appropriate proportion. 3. Median = 50th percentile = .5-quantile. 4. R has this built in for commonly used disributions. Example: qexp(.90) gives the 90th percentile of the exponential distribution with mean and variance equal to 1. The Normal Distributions Undoubtedly the most important family of continuous distributions for statistics is the family of normal distributions. This distribution (and some of its cousins) will occupy us nearly every day for the remainder of the semester. 1. Reason for importance: Central Limit Theorem (stay tuned) 2. pdf: f (x; µ, σ) = 2 2 √1 e−(x−µ) /2σ σ 2π = dnorm(x,mean=0,sd=1) • Nasty to integrate; usually we can only approximate it numerically • f (x) has symmetric, bell-shaped graph (the bell curve) ◦ maximum when x = µ ◦ inflection points at µ ± σ • E(X) = µ; Var(X) = σ 2 3. Linear transformations of normal random variables are normal (a) We can determine the mean and variance using our rules. • If X ∼ N (µ, σ), and Y = X − µ then Y ∼ N (0, σ). • If X ∼ N (µ, σ), and Z = X−µ σ , then Z ∼ N (0, 1). (b) One distribution to rule them all. Questions about X ∼ N (µ, σ) can be converted into questions about the standard normal distribution: Z ∼ N (0, 1). For example, P(a ≤ X ≤ b) = P a−µ b−µ ≤Z≤ σ σ (c) 68-95-99.7 Rule P(µ − σ ≤ X ≤ µ + σ) = P (−1 ≤ Z ≤ 1) ≈ .68 P(µ − 2σ ≤ X ≤ µ + 2σ) = P (−2 ≤ Z ≤ 2) ≈ .95 P(µ − 3σ ≤ X ≤ µ + 3σ) = P (−3 ≤ Z ≤ 3) ≈ .997 c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 February 20, 2007 19 4. Some Notation (a) cdf for standard normal distribution: Φ(x) = pnorm(x) (b) “backwards quantile”: zα = qnorm(1 − α), that is P(Z ≤ −zα ) = P(Z ≥ zα ) = α 5. Computers, Calculators, and Normal Probability Tables • In the old days, when computers and calculators were not so readily available to do calclulations for us, normal probabilities were obtained from reference tables (as were values of things like sine, cosine, logarithms, etc.) These probability tables still appear in the backs of most statistics books. See pages 740–741 of our book, for example. • Some calculators are also capable of providing you with values equivalent to those given by pnorm() and qnorm(). c 2007 Randall Pruim ([email protected]) February 22, 2007 Math 243 – Spring 2007 20 An Important Situation Let’s suppose that • some event is happening “at random times”, and that • the probability of an event occuring in any small time interval depends only on (and is proportional to) the length of the interval, not on when it occurs. We are interested in two random variables associated with such a situation: • X counts the number of occurances in a fixed amount of time. [X is discrete positive integer values.] • Y measures the time until the next occurance. [Y is continuous on the interval [0, ∞).] Poisson Distributions 1. 
What pmf would be a reasonable model for X? • If we divide our time into n pieces, then the probability of an occurance in one of these subintervals is proportional to 1/n, so let’s call the probability λ/n. • If n is large, then the probability of having two occurances in one subinterval is very small – so let’s just pretend it can’t happen. • If we assume that occurances in each interval are independent from one another, then a good approximation for X is X ≈ Binom(n, λ/n) because we have n independent subintervals with probability λ/n of an occurance in each one. • This approximation should get better and better as n → ∞. When n is very large, then n! λx λ n λ −x · · 1− · 1− x! (n − x)! nx n n x x n λ ≈ · · e−λ · 1 x! nx λx = e−λ x! x n λ λ n−x P(X = x) ≈ 1− = x n n λx be the pmf for a Poisson random variable with rate parameter x! λ. (We’ll see why it is called a rate parameter in just a moment.) • So let’s let p(x; λ) = e−λ • We should check that this definition is legitimate. ∞ X λx ◦ Recall from calculus that = eλ x! [Think: Taylor series.] x=0 ◦ So ∞ X x=0 e−λ λx =1 x! [Good thing.] c 2007 Randall Pruim ([email protected]) February 22, 2007 Math 243 – Spring 2007 21 2. Mean and Variance of X • E(X) = ∞ X xe x=0 −λ λ x x! = ∞ X xe x=1 −λ λ x x! =λ ∞ X ∞ e −λ x=1 X λx−1 λx =λ e−λ =λ (x − 1)! (x)! x=0 ◦ If on average we expect λ occurances in our fixed amount of time, then the rate of occurances is λ per unit time. Thus λ is a rate. ∞ ∞ ∞ ∞ X X X λx X 2 −λ λx λx−1 λx • E(X 2 ) = x2 e−λ = x e =λ xe−λ =λ (x + 1)e−λ = λ(λ + 1) x! x! (x − 1)! (x)! x=0 x=1 x=1 x=0 • So Var(X) = λ(λ + 1) − λ · λ = λ 3. Example: Football fumbles • Is a Poisson distribution a good model? 4. Example: Customers come to a small business at an average rate of 6 per hour. Let’s assume that a Poisson distribution is a good model. • How unusual is it to go 20 minutes without any customers? • How unusual is it to have 10 or more customers in an hour? • Rank in order of likelihood the following three events: (a) fewer than six customers in an hour, (b) exactly six customers in an hour, (c) more than six customers in an hour. Exponential Distributions 1. What is the pdf for Y ? Suppose events occur at a rate of λ per unit time; recall that Y = time until next occurrance. • Let’s get the cdf first: P(Y ≤ y) = 1 − P(Y > y) = 1 − P(0 occurances in time [0, y]) = 1 − e−λy where we get the last probability using the Poisson distribution! • Differentiate to get the pdf. P(Y = y) = d 1 − e−λy = λe−λy dy 2. Some integration by parts shows that • E(X) = 1 λ • Var(X) = 1 λ2 3. Example: Football fumbles again. • How long should we expect to wait until the next fumble? • What is the probability of having no fumbles in the first half of a game? c 2007 Randall Pruim ([email protected]) February 22, 2007 Math 243 – Spring 2007 22 Imporant Distributions family pdf or pmf binomial hypergeometric expected value variance dbinom(x,size=n,prob=p) x x n−x = p q n np npq dhyper(x,m,n,k) n m kp m+n−k · kpq m+n−1 = x n−x n+m k notes q =1−p p= m n+m q =1−p negative binomial dnbinom(x,size=r,prob=p) x+r−1 x r = q p r−1 Poisson dpois(x,lambda=λ) = e−λ uniform dunif(x,a,b) = normal dnorm(x,mean=µ,sd=σ) exponential dexp(x,rate=λ) = λe−λx 1 b−a λx x! rq p2 q =1−p λ λ λ = rate b+a 2 (b − a)2 12 µ σ2 1/λ 1/λ2 r· q p λ = rate • gamma – a generalization of exponential with two parameters (α = shape and β = scale = 1/rate). Possible values are [0, ∞). The distributions are all right skewed. The exponential distributions arise when α = 1 and β = 1/λ. 
• Weibull – another generalization of exponential, again with two parameters (α = shape and β = scale = 1/rate). Possible values are [0, ∞). Not all Weibull distributions are right skewed. The Weibull distributions have proved to be good models of “time to failure” data (how long maufactured parts last before failure, for example). The exponential distributions arise when α = 1 and β = 1/λ. • beta – the standard Beta distributions are continuous distribution that gives values between 0 and 1. (They can be shifted or stretched to give values on any other interval, if desired). There are two shape parameters (α = shape1 and β = shape2). The uniform distribution on [0, 1] is a special case when α = β = 1. c 2007 Randall Pruim ([email protected]) February 27, 2007 Math 243 – Spring 2007 23 Some Properties of Expected Value & Variance Let’s suppose that X and Y are discrete random variables with pmf’s p(x) and q(y). Recall that X and Y are independent if and only if P (X = x & Y = y) = p(x) · q(y). Then the following can be verified by looking at the sums involved and using some algebra: 1. For any rv’s X and Y , E(X + Y ) = E(X) + E(Y ). (Note: independence not required.) X (x + y)P (X = x & Y = y) E(X + Y ) = (1) x,y = X x · P (X = x & Y = y) + y · P (X = x & Y = y) (2) x,y = XX x = y y y · P (X = x & Y = y) (3) x X X X X x P (X = x & Y = y) + y P (X = x & Y = y) x = XX x · P (X = x & Y = y) + X y y x · P (X = x) + X x (4) x y · P (Y = y) (5) y = E(X) + E(Y ) (6) 2. If X and Y are independent, then E(XY ) = E(X) · E(Y ). X E(XY ) = (xy)P (X = x &Y = y) (7) x,y = X (xy)p(x)q(y) (8) x,y = XX x = X xy · p(x)q(y) xp(x) X x = X (9) y y · q(y) (10) y xp(x) E(Y ) (11) x = E(X) · E(Y ) (12) 3. If X and Y are independent, then Var(X + Y ) = Var(X) + Var(Y ). Var(X + Y ) = E((X + Y )2 ) − [E(X + Y )]2 2 (13) 2 2 = E(X + 2XY + Y ) − [E(X) + E(Y )] 2 2 (14) 2 2 = E(X ) + E(2XY ) + E(Y ) − [(E(X)) + 2 E(X) E(Y ) + (E(Y )) ] 2 2 2 2 = E(X ) + 2 E(X) E(Y ) + E(Y ) − [E(X) + 2 E(X) E(Y ) + E(Y ) ] 2 2 2 (15) (16) 2 (17) 2 = E(X ) − E(X) + E(Y ) − E(Y ) (18) = Var(X) + Var(Y ) (19) = E(X ) + E(Y ) − E(X) − E(Y ) 2 2 2 c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 February 27, 2007 24 Continuous Variables The results above are still true for continuous variables. The proofs are almost the same, after replacing sums with integrals and pmf’s with pdf’s. Application: Mean and Variance of Binomial Distributions If X ∼ Bin(n, p), we can use these properties to determine the mean and variance of X by writing X = X1 + X2 + · · · Xn , where each Xi is an independent Bernoulli random variable with probability p of success. Recall that E(X) = p and Var(X) = p(1 − p). E(X) = E(X1 + X2 + · · · + Xn ) = E(X1 ) + E(X2 ) + · · · + E(Xn ) = p + p + · · · + p = np Var(X) = Var(X1 + X2 + · · · + Xn ) = Var(X1 ) + Var(X2 ) + · · · + Var(Xn ) = np(1 − p) An Extra Property for Normal Distributions If X and Y are independent and normally distributed, then X + Y will also be normal (and we can calculate the mean and variance using the methods just described). Example: If X ∼ N(10, 3) and Y ∼ N(12, 4), what is P(X ≥ Y )? c 2007 Randall Pruim ([email protected]) March 2, 2007 Math 243 – Spring 2007 The Distribution of X Consider the following common sampling situation: • population with µ and variance σ 2 • simple random sample (SRS): X1 , X2 , . . . , Xn • X= X1 + X2 + · · · + Xn 1 = (X1 + X2 + · · · + Xn ) n n It is important to distinguish three distributions: 1. Population distribution 2. 
Each particular sample has a distribution 3. The sampling distribution: distribution of X over many samples. We want to know about the sampling distribution of X. • First apply our rules for means: 1 1 E(X) = E (X1 + X2 + · · · + Xn ) = E (X1 + X2 + · · · + Xn ) n n 1 = (E(X1 ) + E(X2 ) + · · · + E(Xn )) n 1 = (µ + µ + · · · + µ) n 1 nµ = µ = n • Then apply our rules for variances, remembering that for an SRS the Xi ’s are indpendent: 1 1 Var(X) = Var (X1 + X2 + · · · + Xn ) = 2 Var (X1 + X2 + · · · + Xn ) n n 1 = (Var(X1 ) + Var(X2 ) + · · · + Var(Xn )) n2 1 = σ2 + σ2 + · · · + σ2 2 n 1 1 nσ 2 = σ 2 = n2 n • And if the population is normal, then the distribution of X is normal, too. • Better still, Central Limit Theorem. For any population with mean µ and standard deviation σ, √ if n is large enough, then the ditribution of X ≈ N(µ, σ/ n). c 2007 Randall Pruim ([email protected]) 25 March 5, 2007 Math 243 – Spring 2007 26 An Important General Setting 1. We are interested in information about some unknown parameter(s) of a population or process. • We will often make some assumptions about the population distribution. Examples: ◦ perhaps some parameters are known (e.g., we know the standard deviation but not the mean) ◦ maybe we assume we know the family of distriubtions it comes from (e.g., the population is normal, but we don’t know the mean and standard deviation) 2. We obtain some information by means of a simple random sample (SRS) or by repetition of the process. • X1 , X2 , . . . , Xn – notation for random sample thought of as random variables. Each Xi has the same distriubtion as the population and is independent of the others. (We say they are i.i.d., independent and identically distributed. • x1 , x2 , . . . , xn – notation for a particular sample (our data). 3. Big question: What does our sample data reveal about the unknown parameters of the population? (a) Estimation: Can we give good estimates for the unknown parameters, along with some indication of accuracy? • What proportion of cups does the lady tasting tea correctly identify? ◦ Unknown paramter: success probability. ◦ Desired answer: “probably” between and . (b) Hypothesis testing: Can we answer, with a specified degree of confidence, some yes/no question about the unknown parameters? • Does the lady tasting tea get the answer correct more than half the time? (Unknown paramter: success probability.) • We will develop formal procedures for many situations of this sort over the next weeks. For today, we want to think a bit more generally about the process. Estimators and Estimates Let θ be an unknown parameter we want to know about. • Statistic: a function from a sample to R. ◦ We said earlier that a statistic is a number that describes a sample. This definition makes that idea a bit more precise. ◦ Examples: mean, median, variance, standard deviation, quantiles, etc. • Estimator: a statistic applied to X1 , X2 , . . . , Xn . • Estimate: a statistic applied to x1 , x2 , . . . , xn . • An estimator is a random variable, an estimate is a number. ◦ estimator: X = X1 +X2n+···Xn ◦ estimate: x = x1 +x2n+···xn c 2007 Randall Pruim ([email protected]) March 5, 2007 Math 243 – Spring 2007 27 What makes a good estimator? 1. Unbiased: E(θ̂) = θ – it’s correct “on average” 2. Low Variance: • Standard deviation of an estimator is called standard error and is denoted σθ̂ . • If we can only estimate the standard error, we have an estimated standard error and is denoted σ̂θ̂ or sθ̂ . 3. 
Known Distribution (or known good approximation to it) – that way we can make probability claims about our estimator. Estimators (µ̂) for the Populaion Mean (µ) The sample mean (X) is a very nice estimator for µ: • Always unbiased. • Sampling distribution is approximately normal if n is large enough; exactly normal for any n if the population is normal. (And easy formalas for mean and variance, too!) • If the population is normal, it is the MVUE (minimum variance unbiased estimator). But other estimators may be better in some situations: • median may be better if the population distribution is symmetric and has heavy tails (since outlying values have a big effect on the mean and are likely to be in our data). • mean of extremes (mean of maximum and minimum) is best for a population that is uniform. • trimmed mean works well (but usually not best) in a very wide range of situations. (But it is more complicated to study mathematically.) Other Examples 1. Estimating maximum possible of uniform distribution. 2. Estimating variance (a) if the mean is known (b) if the mean is unknown c 2007 Randall Pruim ([email protected]) March 6, 2007 Math 243 – Spring 2007 Sampling Distributions • Yesterday: General ideas about estimation (estimators and estimates) • Next two weeks: Two important examples ◦ Estimating µ, a population mean (quantitative data) · Estimator: X · Sampling Distribution: X ∼ N(µ, √σn ) ◦ Estimating p, a population proportion (categorical data) · Estimator: p̂ = X /n · Sampling Distributions: X ∼ Binom(n, p); q p̂ ≈ N p, pq n c 2007 Randall Pruim ([email protected]) 28 March 8–9, 2007 Math 243 – Spring 2007 29 Confidence Intervals For the Mean of a Normal Population Assumptions 1. Data are SRS: X1 , X2 , . . . , Xn 2. Population is normal 3. The population variance σ 2 is known. (The population mean µ is unknown.) Key Idea Since • X is “usually” “close to” µ, and √ • we can quantify this probabilistically since we know X ∼ N(µ, σ/ n), and • whenever X is close to µ, µ must be close to X, we can try to find a number m such that P(|X − µ| ≤ m) = 1 − α (quantifying “probably” and “close to”). This leads to an interval of the form (X − m, X + m) = X ± m around X where 1 − α is the confidence level (α is the probably-part) and m is the margin of error (m is the close-to-part). The Magic Formula If we specify a level of confidence of 1 − α, then P(|X − µ| ≤ z∗ SE) = 1 − α √ where z∗ = zα/2 is “some number of standard deviations”, and SE = σ/ n is the standard error (standard deviation of the sampling distribution). The critical number zα/2 is chosen so that P(Z > zα/2 ) = α/2 that is, we are choosing the critical number so that the fraction α of the standard normal distribution lies within ±zα/2 . (Draw a quick picutre when making these calculations.) So our confidence interval becomes: σ x ± zα/2 √ n or more simply written: x ± z∗ SE c 2007 Randall Pruim ([email protected]) March 8–9, 2007 Math 243 – Spring 2007 30 Robustness Issues What if our assumptions are not met? 1. If the population is not normal. . . If n is “large enough” then the sampling distribution for X is still approximately normal by the Central Limit Theorem, so the methods are still approximately correct. “Large enough” depends on how the population differs from normality. A good rule of thumb is that 30 is large enough for nearly all distributions. 2. If the population variance is unknown (usually the case). . . we could try plugging in s for σ in all our formulas, BUT . . . 
That would not be quite right, even if the population is normal. The problem is that the sampling distribution of X −µ √ s/ n would no longer be normal. The good news is that the distribution it has is known. It is the t distribution with n − 1 degrees of freedom. Our new CI becomes x ± t∗ eSE or s x ± tn−1,α/2 √ n 3. If we have both problems (don’t know σ and popoulation is not normal). . . The t intervals are also robust and can be used in most situations where n > 30 and for even smaller n if the population is unimodal and not terribly skewed. 4. If the sample is not an SRS, then things can become much more difficult (depending on how it differs from an SRS). To say these confidence intervals are robust means that the stated confidence level is quite close to the actual coverage rate of the interval. Sometimes this is verified theoretically (by folks who know more statistics that we have learned); sometimes it is verified emprically (by running simulations). More Variations 1. One-sided and lop-sided confidence intervals 2. Determining sample size for specified confidence and margin of error. 3. Confidence Intervals for population proportions (method 1). c 2007 Randall Pruim ([email protected]) March 12, 2007 Math 243 – Spring 2007 31 Things we have not covered There are some topis from Chapter 7 that we have not yet covered: • tolerance and prediction intervals • an improved method of confidence intervals for proportions (Box 7.10 on page 295). Hypothesis Testing Last week we talked about confidence intervals. This week we will focus on the other main category of inference procedure – hypothesis testing. Actually, we have discussed most of the ideas involved already. The lady tasting tea is an example of hypothesis testing. • Hypothesis: • Statisitcal Hypothsesis: ◦ µ = 100: The population mean is 100. ◦ µ > 100: The population mean greater than 100. ◦ µM > µF : The mean weight of male ogres is more than the mean weight of female ogres. ◦ The population is normal. • Our goal with hypothesis testing is to assess the evidence provided by data with respect to some claim (hypothesis) about a population. ◦ A hypothesis test is a formal procedure for comparing observed (sample) data with a hypothesis whose truth we want to ascertain. ◦ The results of a test are expressed in terms of a probability that measures how well the data and the hypothesis agree. In other words, it helps us decide if a hypothesis is reasonable or unreasonable based on the likelihood of getting sample data similar to our data. Just like confidence intervals, hypothesis testing is based on our knowledge of sampling distributions. c 2007 Randall Pruim ([email protected]) March 12, 2007 Math 243 – Spring 2007 32 4 step procedure for testing a hypothesis Step 1: Identify parameters and state the null and alternate hypotheses. The hypothesis to be tested is called the null hypothesis, designated H0 . The alternate hypothesis describes what you will believe if you reject the null hypothesis. It is designated Ha . (It may well be that H0 is a straw man and that we hope or suspect is true instead of H0 .) Example 1. The average LDL of healthy middle-aged women is 108.4. We suspect that smokers have a higher LDL (LDL is supposed to be low). Example 2. The average Beck Depression Index (BDI) of women at their first post-menopausal visit is 5.1. We wonder if women who have quit smoking between their baseline and post visits have a BDI different from the other healthy women. 
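Written out in the notation used in the remaining steps, the hypotheses for these two examples are (Example 1 is one-sided because we suspect a higher LDL; Example 2 is two-sided because a difference in either direction would be of interest):

Example 1:  H0 : µ = 108.4  vs.  Ha : µ > 108.4, where µ is the mean LDL of smokers.
Example 2:  H0 : µ = 5.1    vs.  Ha : µ ≠ 5.1, where µ is the mean BDI of women who quit smoking.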
Step 2: Compute the test statistic A test statistic measures how well the sample data agrees with the null hypothesis. When we are testing a hypothesis about the mean of a population, the test statistic has the form (estimate) - (hypothesis value) SE or eSE We use this statistic because of what we know about sampling distributions when the population is ∼ N(µ, σ): √ • X ∼ N(µ, σ/ n) • X −µ √ ∼ N(0, 1) σ/ n • X −µ √ ∼ tn−1 . s/ n These results are quite robust, so we can apply them in many situations where we are not willing to assume the population is normal – especially if the sample size is large. In each of these cases, the larger the test statistic, the stronger the evidence against the null hypothesis. This is true of most test statistics. For our HWS examples (LDL: 159 smokers, x̄ = 115.7, sd=29.8; BDI: 27 quitters, x̄ = 5.6, sd=5.1): LDL: t = x̄−µ √0 s/ n = 115.7−108.4 √ 29.8/ 159 BDI: t = x̄−µ √0 s/ n = 5.6−5.1 √ 5.1/ 27 = 3.089 = 0.509 c 2007 Randall Pruim ([email protected]) March 12, 2007 Math 243 – Spring 2007 33 Step 3: Compute the p-value The p-value (for a given hypothesis test and sample) is the probability, if the null hypothesis is true, of obtaining a test statistic as extreme or more extreme than the one actually observed. How do we compute the p-value? Here is how hypothesis testing works based on the t-distribution: Suppose that an SRS of size n is drawn from a population having unknown mean µ. To test the hypothesis H0 : µ = µ0 based on an SRS of size n, compute the one-sample t statistic t= x̄ − µ0 √ s/ n The p-value for a test of H0 against • Ha : µ > µ0 is P (T > t) • Ha : µ < µ0 is P (T < t) • Ha : µ 6= µ0 is P (T > |t|) + P (T < −|t|) = 2P (T > |t|) where T is a random variable with a t distribution (df = n − 1). These p-values are exact if the population distribution is normal and are approximately correct for large enough n in other cases. The first two kinds of alternatives lead to one-sided or one-tailed tests. The third alternative leads to two-sided or two-tailed test. Example 3. Find the p-value for testing our hypotheses regarding LDL in the HWS. t = 3.09, one-tailed alternate hypothesis, n = 159 smokers. >1- pt(3.09,df=158) [1] 0.0011829 Example 4. Find the p-value for testing our hypothesis regarding BDI in the HWS. t=.51, two-tailed alternate hypothesis, n = 27 quitters. >2 * (1 - pt(.51,df=26)) [1] 0.61435 Note: This is exactly how we evaluated the claims of the lady tasting tea. c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 March 12, 2007 34 Step 4: State a conclusion A decision rule is simply a statement of the conditions under which the null hypothesis is or is not rejected. This condition generally means choosing a significance level α. If p is less than the significance level α, we reject the null hypothesis, and decide that the sample data do not support our null hypothesis and the results of our study are then called statistically significant at significance level α. If p > α, then we do not reject the null hypothesis. This doesn’t necessarily mean that the null hypothesis is true, only that our data do not give strong enough evidence to reject it. It is rather like a court trial. If there is strong enough evidence, we convict (guilty), but if there is reasonable doubt, we find the defendant “not guilty”. This doesn’t mean the defendant is innocent, only that we don’t have enough evidence to claim with confidence that he or she is guilty. 
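A side note on computation: the examples above start from summary statistics, but when the raw data are available, R's built-in t.test() function carries out Steps 2 and 3 in a single call. A minimal sketch, where ldl is a hypothetical numeric vector holding the 159 smokers' LDL values:

> t.test(ldl, mu = 108.4, alternative = "greater")  # one-sample t test of H0: mu = 108.4 vs Ha: mu > 108.4

The output reports the t statistic, degrees of freedom, and p-value, so the pt() calculation shown in Example 3 is done automatically.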
Looking back at our previous examples, if we select a significance level of α = 0.05, what do we conclude? • LDL (p-value < 0.0025) Since 0.0025 < 0.05, we the null hypothesis and conclude that the smokers have a significantly higher LDL. ”Smokers have higher LDL (t = 3.09, df = 158, p < .0025)”. • BDI (p-value > .5) This p-value is higher than .05, so we the null hypothesis. It is quite possible that the null hypothesis is true and our data differed from the hypothesized value just based on random variation. In this case we say that the difference between the mean of 5.6 (those who quit smoking) and 5.1 (average healthy women) is not statistically significant. ”Those who quit smoking do not have a significantly different average BDI than middle-aged healthy women (t = .51, df = 26, p > .50).” It’s good to report the actual p-value, rather than just the level of significance at which it was or was not significant. This is because: Example 5. Suppose that it costs $60 for an insurance company to investigate accident claims. This cost was deemed exorbitant compared to other insurance companies, and cost-cutting measures were instituted. In order to evaluate the impact of these new measures, a sample of 26 recent claims was selected at random. The sample mean and sd were $57 and $10, respectively. At the 0.01 level, is there a reduction in the average cost – or can the difference of three dollars ($60-$57) be attributed to the sample we happened to pick? Hypotheses: Test statistic: p-value: Conclusion: c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 March 12, 2007 35 Example 6. The amount of lead in a certain type of soil, when released by a standard extraction method, averages 86 parts per million (ppm). A new extraction method is tried, and they wondered if it would extract a significantly different amount of lead. 41 specimens were obtained, with a mean of 83 ppm lead and a sd of 10 ppm. Hypotheses: Test statistic: p-value: Conclusion: Example 7. The final stage of a chemical process is sampled and the level of impurities determined. The final stage is recycled if there are too many impurities, and the controls are readjusted if there are too few impurities (which is an indication that too much catalyst is being added). If it is concluded that the mean impurity level=.01 gram/liter, the process is continued without interruption. A sample of n = 100 specimens is measured, with a mean of 0.0112 and a sd of 0.005 g/l. Should the process be interrupted? Use a level of significance of .05. Hypotheses: Test statistic: p-value: Conclusion: c 2007 Randall Pruim ([email protected]) March 13, 2007 Math 243 – Spring 2007 36 Warm-up problem Example 1. A sample of 20 mini-boxes of raisins reveals a mean count of 24.2 raisins per box with a standard deviation of 1.5. Is this enough evidence to doubt the raisin company’s claim that there is an average of 25 raisins per box? Statistical Significance If the p-value is small enough, we say that we have evidence against the null hypothesis that is statistically significant. The cut-off point for statistical significance is called the significance level and is denoted α. So if α = 0.05, we will call any test with a p-value smaller than 0.05 statistically significant. The Rejection Region Approach 0. Determine α (significance level). 1. State null and alternative hypotheses. 2. Determine the rejection region (for which values of our test statistic will we reject the null hypothesis?) 3. Compute test statistic from data. 4. 
Compare test statistic with rejection region.

Example 2. Repeat Example 20 using the rejection region approach.

The two approaches yield identical conclusions. The rejection region approach is useful for thinking about statistical power (see below).

Type I Error, Type II Error, and the Power of a Test

Four possible situations:

                      H0 is true    H0 is false
reject H0
do not reject H0

Two of these are good situations. Two of them are errors. These errors are referred to as type I and type II errors.
• P(reject H0 when H0 is true) = α. So α is the type I error rate.
• β = P(don't reject H0 when H0 is false) can't be computed. Why?
◦ β = type II error rate can only be computed if                    .
◦ 1 − β is called the power of the test (against a particular alternative).

c 2007 Randall Pruim ([email protected]) March 13, 2007 Math 243 – Spring 2007 37

Power Calculations
In general, power calculations are difficult. But when σ is known, power calculations are straightforward.

Example 3. Suppose σ = 2; H0 : µ = 0; Ha : µ > 0; α = 0.05.

Using power.t.test()
R provides a function to determine any one of power, sample size, effect size, standard deviation, or significance level if all the others are specified.

Example 4. Suppose σ is unknown; H0 : µ = 0; Ha : µ > 0; α = 0.05. Let's guess that s will be approximately 2.
• What is the power of this test against an alternative of Ha : µ = 1 when the sample is of size 20?

> power.t.test(sd=2,sig.level=.05,delta=1,n=20,type='one.sample')

     One-sample t test power calculation

              n = 20
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.5644829
    alternative = two.sided

• That's not so good (almost 50% of the time our test will fail to detect an effect of this magnitude). How large must the sample size be to have 90% power?

> power.t.test(sd=2,sig.level=.05,delta=1,power=.9,type='one.sample')

     One-sample t test power calculation

              n = 43.99552
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

c 2007 Randall Pruim ([email protected]) March 13, 2007 Math 243 – Spring 2007 38

• We can also look at power graphically by varying one quantity. Let's see how sample size affects power:

> xyplot(power.t.test(n=5:1000,sd=2,delta=.5,type='one.sample')$power~5:1000,
    type='l',lwd=2,xlab="sample size",ylab="power",
    main="Power of t test when sd=2 and effect size is .5")

• What about effect size and power?

> xyplot(power.t.test(n=40,sd=2,delta=seq(0,3,by=.05),
    type='one.sample')$power~seq(0,3,by=.05),
    type='l',lwd=2,xlab="effect size",ylab="power",
    main="Power of t test when sd=2 and n=40")

[Figures: two power curves for the one-sample t test, showing power as a function of sample size (sd = 2) and power as a function of effect size (sd = 2, n = 40).]

c 2007 Randall Pruim ([email protected]) March 14, 2007 Math 243 – Spring 2007 39

Hypothesis Tests for Proportions

1. State Null and Alternative Hypotheses:
• H0 : p = p0
• Ha : p ≠ p0 [ or p < p0 or p > p0 ]
Note: If H0 is true, and X = # of "successes", then
• X ∼ Binom(n, p)
• X ≈ N(np, √(np(1 − p)))  [provided n is "large enough"]
• X/n ≈ N(p, √(p(1 − p)/n))  [provided n is "large enough"]

2. Calculate the test statistic
• x = sample count (of successes)
• p̂ = x/n = sample proportion
• z = (p̂ − p0)/SE, where SE = √(p0(1 − p0)/n)

3. Calculate p-value. If H0 is true, then
(a) Z ≈ N(0, 1)  [z-test, prop.test()]
• Rule of Thumb: approximation is good enough provided np0 ≥ 10 and n(1 − p0) ≥ 10.
(b) X ∼ Binom(n, p0)  [binom.test()]

4. Draw a conclusion.
(Is the p-value small enough to reject the null hypothesis?) Example 1. You are assigned the task of controlling an incoming supply of parts for quality. Your company will accept a shipment of parts if no more than 5% are defective. You can’t inspect every part, so you sample 200. You find 16 to be defective. What should you do? Example 2. You are playing a game that involves rolling dice. In the first 20 rolls, a 6 comes up 7 times. Should you suspect loaded dice? c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 March 14, 2007 40 Comparing significance tests and confidence intervals Example 3. The drained weights for a sample of 30 cans of fruit have mean 12.087 oz, and standard deviation 0.2 oz. Test the hypothesis that on average, a 12-oz. drained weight standard is being maintained. Now let’s compute a 95% CI. What is the relationship between confidence intervals and two-sided hypothesis tests? Example 4. Compute a 95% confidence interval for the percentage of defective parts in example 21. c 2007 Randall Pruim ([email protected]) Math 243 – Spring 2007 March 14, 2007 41 Better Confidence Intervals for Proportions The simple approximate confidence intervals for proportions that we developed the other day: r ± p̂ |{z} estimate p̂(1 − p̂) {zn } crit.val. | z∗ |{z} (20) eSE are not very good – especially for small and moderate sample sizes. • problem: coverage rates are not very accurate • cause: eSE is a poor estimator for SE unless n is large. A Better Approach We could derive a 1 − α confidence interval in a different way by asking the question: What values of p0 would we not reject if we tested the null hypothesis H0 : p = p0 (with a two-sided alternative)? • These values are “plausible values” for p. • p0 is plausible if the test statistics isn’t too big; that is, if |p̂ − p0 | ≤ zα/2 zq p̂0 (1−p0 ) n • The corresponding equation is quadratic in p0 , so we can solve using the quadratic formula. I’ll spare you the gory details; here’s the solution: p̂ + p0 = z∗ 2n ± z∗ q p̂q̂ n + z∗2 4n2 1 + z∗2 /n (21) • For large n, some terms have negligible magnitude, and this is very close to our easier formula (22). • For a 95% confidence interval, z∗ ≈ 2, and (21) is very nearly (see problem 7.27) r p̃ ± z∗ p̃(1 − p̃) n (22) x+2 where p̃ = n+4 . This is sometimes called the plus four confidence interval or the Wilson confidence interval. It has been shown that this confidence interval is very accurate even for quite small sample sizes, and since it is now more difficult to do than the (22), it is recommended that it be used for all sample sizes when the confidence level is near 95% (say between 90% and 99%). • For other confidence levels and computer calculations, (21) can be used. c 2007 Randall Pruim ([email protected]) March 26, 2007 Math 243 – Spring 2007 42 Review & Preview: Design of Statistical Studies It is going to become increasing important to pay attention to study design. In general, by study design we mean answers to questions like the following: • What population are we studying? How will we sample? • What variables will be measured? How will they be measured? • Observational study or experiment? • What statistical analysis will we do? What assumptions are required for this analysis? How will we check the assumptions? 
Before spring break the situations we considered all had the following simple univariate study design:
• random sample (SRS) from a population
• one variable recorded for each individual (two flavors: quantitative or categorical)
  ◦ for quantitative variables there were typically some normality assumptions (less important as sample size gets larger).
• inference via confidence intervals and/or hypothesis tests

The rest of the semester will consist of variations on these themes that are appropriate for other study designs. What study designs are appropriate in the following examples? For which ones do we already have the necessary tools?

Example 1. What is the average length of an eruption of Old Faithful? (data(faithful))

Example 2. Are healthy women who quit smoking more likely to gain weight or lose weight? How much weight?

Example 3. If school kids are asked which is more important to them – good grades, athletic ability, or popularity –
1. What percentage would choose athletic ability?
2. Which response would be selected most often?
3. Would the responses be different for boys and girls?
4. Would the responses be different in rural, urban, and suburban schools?
5. Would the responses be different for kids of different races?

> read.table('http://www.calvin.edu/~rpruim/data/dasl/schoolkids.txt',header=T) -> kids
> source('http://www.calvin.edu/~rpruim/R/courses/m243.R') # or getCalvin('m243')
> getData('dasl/schoolkids') -> kids

c 2007 Randall Pruim ([email protected])

March 26, 2007 Math 243 – Spring 2007 43

Example 4. Gosset (the man who invented Student's t) did an experiment to test whether kiln dried seeds have better yield (lbs/acre) than regular seeds. How would you design such an experiment?

> getData('dasl/gosset') -> corn

Example 5. Tire treadwear can be measured two different ways: by weighing the tire and by measuring the depth of the grooves.
1. Do these methods give comparable results?
2. Can we predict one measurement from the other? (With what degree of accuracy?)

> getData('dasl/tirewear') -> tire

Paired Study Design

For several of our examples a paired study design is appropriate:
• random sample (SRS) from a population
• two quantitative measurements for each individual
• interested in the difference between the two measurements (how much improvement, weight loss, change, etc.)

Key idea: Once we form A − B, we are back to a univariate situation, so we already know what to do!

A Correction

The power.t.test() examples from March 13 have been updated. I thought that one-sample tests were the default, but it turns out that two-sample tests (one of our topics for this week) are the default. The updated examples are available online.

c 2007 Randall Pruim ([email protected])

March 26, 2007 Math 243 – Spring 2007 44

Normal Quantile Plots, Normal Probability Plots (Sec. 4.6)

We have been making normality assumptions for many of our procedures. How does one check that these assumptions are reasonable? One useful graphical check is called a normal quantile plot. The idea is to compare data to the theoretical quantiles of a standard normal distribution.

Example 6. Sample data (in sorted order): 83 92 96 96 97 99 100 101 107 108
• Using our ruler method with data centered in each unit, we see that 83 corresponds to the 0.05-quantile of our data (see second row of table below).
• What values would we expect for the 0.05-quantile, the 0.15-quantile, etc. if this data came from a normal distribution? We can use qnorm(seq(.05,.95,by=.1),mean=mean(x),sd=sd(x)) to find out.
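For the sample data in Example 6 that computation looks like the sketch below (x simply holds the ten sorted values); the first result should reproduce the "normal quantiles (y)" row of the table that follows, and the second the "st normal quantiles (z)" row.

> x <- c(83, 92, 96, 96, 97, 99, 100, 101, 107, 108)
> qnorm(seq(.05, .95, by = .1), mean = mean(x), sd = sd(x))   # theoretical quantiles for a fitted normal
> qnorm(seq(.05, .95, by = .1))                               # standard normal quantiles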
• If there is a good fit, plotting the data against the normal quantiles should give approximately the straight line with slope = 1 and intercept = 0 (y = x).
• Since all normal distributions are linear transformations of a standard normal distribution (y = µ + σz), we could use qnorm(seq(.05,.95,by=.1)) instead. Now a good fit will still be roughly linear (y = µ + σz); the slope and intercept will be the mean and standard deviation of a normal distribution (if we put our data on the vertical axis). Notice that the graphs look identical except for the scale.

                              1      2      3      4      5      6      7      8      9     10
    data (x)              83.00  92.00  96.00  96.00  97.00  99.00 100.00 101.00 107.00 108.00
    probability            0.05   0.15   0.25   0.35   0.45   0.55   0.65   0.75   0.85   0.95
    normal quantiles (y)  86.08  90.45  93.05  95.13  97.00  98.80 100.67 102.75 105.35 109.72
    st normal quantiles (z) −1.64 −1.04  −0.67  −0.39  −0.13   0.13   0.39   0.67   1.04   1.64

[Figure: two normal quantile plots of the data – one plotting data (x) against the theoretical quantiles (y), one against the standard normal quantiles (z).]

Of course, R can do this all automatically for you:1  require(lattice); qqmath(x)  or  qqnorm(x).

1 Actually, R uses a slightly different ruler method for small data sets, but the basic idea is the same.

c 2007 Randall Pruim ([email protected])

March 27, 2007 Math 243 – Spring 2007 45

Variations on the theme

The same basic ideas that we used when computing confidence intervals and evaluating hypothesis tests for means of a quantitative variable can be applied in a number of related situations. The best way to think of these different situations is as variations on an inference theme. To make this easier, we will use a systematic notation scheme throughout:

• parameters (population)
  ◦ π, proportion (of a categorical variable)
  ◦ µ, mean (of a quantitative variable)
  ◦ σ, standard deviation
• statistics (sample)
  ◦ n, sample size
  ◦ X, count (of a categorical variable)
  ◦ p = X/n, proportion (of a categorical variable)
  ◦ x̄, mean (of a quantitative variable)
  ◦ s, standard deviation
• sampling distribution
  ◦ SE, standard deviation of the sampling distribution (σx̄ or σp̂ are also used for this)
  ◦ eSE, estimated standard error of the sampling distribution (an estimate for SE)
  ◦ µp, µx̄, mean of the sampling distribution (for p̂ and x̄, respectively)

Subscripts will be used to indicate

The procedures involving the z (normal) and t distributions are all very similar.
• To do a hypothesis test, compute

    t or z = (data estimate − hypothesized value) / (SE or eSE),

  and compare with the appropriate distribution (using tables or computer).
• To compute a confidence interval, first determine the critical value for the desired level of confidence (z* or t*); then the confidence interval is

    data estimate ± (critical value)(SE or eSE).

c 2007 Randall Pruim ([email protected])

March 27, 2007 Math 243 – Spring 2007 46

Two Sample Procedures

A two-sample problem is one in which:
1. the goal is to compare the responses in two groups
2. each group is considered to be a sample from a distinct population
3. the responses in each group are independent of those in the other group

The difference between two-sample problems and matched pairs problems is that in matched pairs we have two measurements that are dependent in some way, whereas in two-sample problems we have independent measurements from two distinct populations.

Suppose we want to compare µ1 with µ2. We do this by drawing a sample from each population and calculating x̄1 and x̄2.
In order to know what x̄1 and x̄2 tell us about the difference between µ1 and µ2, we need to know about the sampling distribution for X̄1 − X̄2. Assuming each population has a normal distribution with mean µi and standard deviation σi, we already know that

    X̄1 ∼ N( µ1, σ1/√n1 )   and   X̄2 ∼ N( µ2, σ2/√n2 )

Using our rules for combining means and variances, we see that

    X̄1 − X̄2 ∼ N( µ1 − µ2, SE ),   where SE = √( σ1²/n1 + σ2²/n2 )

Of course, we won't usually know σ1 and σ2, so we need to estimate SE using

    eSE = √( s1²/n1 + s2²/n2 )

Unfortunately the t statistic computed from this does not have a t-distribution with n1 + n2 − 2 degrees of freedom, as we might have hoped. Why? A t distribution replaces a N(0,1) distribution only when a single population standard deviation σ is replaced by a single sample standard deviation s. In this case, we replaced two standard deviations (σ1 and σ2) with their estimates (s1 and s2). The resulting distribution is still approximately a t-distribution, but with the following degrees of freedom:

    df = ν = (s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1−1) + (s2²/n2)²/(n2−1) ]
           = (eSE1² + eSE2²)² / ( eSE1⁴/df1 + eSE2⁴/df2 )

Some algebra shows that
• If n1 = n2 and s1 = s2, then ν = n1 + n2 − 2
• ν ≤ n1 + n2 − 2 (sum of degrees of freedom)
• ν ≥ min(n1 − 1, n2 − 1) (smaller degrees of freedom)

c 2007 Randall Pruim ([email protected])

March 27, 2007 Math 243 – Spring 2007 47

Now that we know the distribution involved and a value for eSE, we are all set to do hypothesis testing or to compute CIs.

Example 1. An agronomist has developed a new plant food. She hopes it will improve yield. To find out, she treated 48 plants with the new food and obtained a mean yield of 24.4 lbs (s = 4.8 lbs). 45 identical plants were untreated and had a mean yield of 22.3 lbs (s = 2.3 lbs). Do the data provide sufficient evidence to determine that the new plant food is better than no treatment? (A short R sketch of these calculations appears at the end of this page.)

Example 2. The weather bureau measured the ozone level at 5 random locations in Orange City before a cool front moved through and at 5 different random locations afterward. Test whether there is a significant drop in the ozone level after the front has moved through.

    Time           n    mean     variance
    Before front   5    0.122    0.00067
    After front    5    0.094    0.00016

Question. How could the design of these studies be changed to make them matched pairs designs? What would be the advantages/disadvantages of such changes?

c 2007 Randall Pruim ([email protected])
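Here is a minimal sketch of how the eSE, the t statistic, and the Welch degrees of freedom defined above can be computed directly from the summary statistics in Example 1 (the plant food study); the object names are just placeholders.

# summary statistics from Example 1
n1 <- 48; xbar1 <- 24.4; s1 <- 4.8      # treated plants
n2 <- 45; xbar2 <- 22.3; s2 <- 2.3      # untreated plants
eSE <- sqrt(s1^2/n1 + s2^2/n2)          # estimated standard error of xbar1 - xbar2
t.stat <- (xbar1 - xbar2) / eSE         # two-sample t statistic
nu <- (s1^2/n1 + s2^2/n2)^2 /
      ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))   # degrees of freedom formula from above
1 - pt(t.stat, df = nu)                 # one-sided p-value for Ha: mu1 > mu2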
March 27, 2007 Math 243 – Spring 2007 48

More Examples

Some of these are two-sample designs, others are paired designs.

Example 3. In testing food products for palatability, General Foods employed a 7-point scale from −3 (terrible) to +3 (excellent) with 0 representing "average". Their standard method for testing palatability was to conduct a taste test with 50 persons – 25 men and 25 women. Does the amount of liquid used in the samples matter?

> source('http://www.calvin.edu/~rpruim/R/courses/m243.R') # or getCalvin('m243')
> getData('dasl/tastetest') -> taste
> t.test(score~liq,data=taste)
> t.test(score~liq,data=taste,paired=T)

Example 4. Gosset (the man who invented Student's t) did an experiment to test whether kiln dried seeds have better yield (lbs/acre) than regular seeds. How would you design such an experiment?

> getData('dasl/gosset') -> corn
> with(corn,t.test(reg,kiln))
> with(corn,t.test(reg,kiln,paired=T))

Example 5. Tire treadwear can be measured two different ways: by weighing the tire and by measuring the depth of the grooves.
1. Do these methods give comparable results?
2. Can we predict one measurement from the other? (With what degree of accuracy?)

> getData('dasl/tirewear') -> tire
> with(tire,t.test(weight,groove))
> with(tire,t.test(weight,groove,paired=T))

Example 6. Two varieties of oats were compared in an experiment to determine which variety had the higher yield. Since soil type also affects yield, the experimenter blocked out its effect by planting each variety of oats in seven different types of soil. With the data paired by soil types as given below, does it appear that variety A has the higher mean yield?

    Yield
    Soil type     1     2     3     4     5     6     7
    A            71.2  72.6  47.8  76.9  42.5  49.6  62.8
    B            65.2  60.7  42.8  73.0  41.7  56.6  57.3
    x = A − B     6.0  11.9   5.0   3.9   0.8  −7.0   5.5

    mean(x) = 3.7286      sd(x) = 5.7792

    H0 :                Ha :
    eSE =               p-value =
    95% CI for difference:

c 2007 Randall Pruim ([email protected])

March 30, 2007 Math 243 – Spring 2007 49

Letting R do the work

R can, of course, do all of the procedures we have been learning (and many more). Some useful R commands for this are
• prop.test(), binom.test()
• t.test(), power.t.test()
• qqmath() (and xqqmath(), which adds a reference line) [require(lattice)]
• histogram() and bwplot() [require(lattice)]
• summary() for computing numerical summaries of data [require(Hmisc)]
• table() and xtabs() for tabulating counts of data

R Formulas

Lattice graphics and many statistical procedures in R are based on a formula interface. Simple formulas in R have the following form: y~x|z or ~x|z. Often the |z is not needed. Often this can be thought of as "y is modeled by x (conditioned on z)". When y is missing, it usually means that R will be calculating something for you. For example, in a histogram, R computes the heights of the bars (the y-coordinate). In xtabs(), R computes the frequencies for the cross-table.

Some R Examples

• histogram( ~x|z, data=mydata ) will make a histogram of x with a separate panel for each level of a categorical variable z. All the variables are taken from the data frame mydata.
• bwplot( ~x|z, data=mydata ) will make a boxplot of x with separate plots for each level of a categorical variable z. bwplot( y~x, data=mydata ) will make side-by-side boxplots (one of x and y should be categorical, the other quantitative).
• summary(y~x, data=mydata, fun=favstats) will compute summary statistics of y for each level of x. favstats is a little function I wrote. You can put things like mean, sd, quantile in there as well. (favstats does all three.)
• t.test(y~x,data=mydata) will do a two-sample t-test if x has two levels. If you add paired=TRUE it will do a paired t-test, assuming that the order in the data set determines the pairs.
• t.test(mydata$x) or with(mydata, t.test(x)) will do a one-sample t-test. t.test(mydata$x, mydata$y) or with(mydata, t.test(x,y)) will do a two-sample or paired t-test (depending on how you set paired).
• You can do pooled t-tests (remember, these assume that σ1 = σ2 and are not very robust against violations of that assumption) by adding var.equal=TRUE to a t.test().
• Other useful parameters for t.test() include alternative, mu, and conf.level. See ?t.test for details.

c 2007 Randall Pruim ([email protected])

March 30, 2007 Math 243 – Spring 2007 50

• prop.test() and binom.test() work with summarized data. For example, prop.test(25,100) will do a 1-proportion test and CI where your data had 25 successes and a sample size of 100. See the help for these functions for more details.
• For help in tabulation, table(x) and xtabs(~x+y,data=mydata) are useful commands.
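To see the formula interface in action on something concrete, here is a small sketch using R's built-in sleep data set (20 observations; extra is the increase in hours of sleep, group identifies which of two drugs was given to the same 10 subjects). Since the subjects are matched across the two groups, the paired version is shown with with().

> require(lattice)
> data(sleep)
> bwplot(extra ~ group, data = sleep)        # side-by-side boxplots
> t.test(extra ~ group, data = sleep)        # two-sample (Welch) t test
> with(sleep, t.test(extra[group == 1], extra[group == 2], paired = TRUE))   # paired version

The same pattern (a formula plus data=) works for histogram(), xtabs(), and the other commands listed above.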
More Examples

Here are some questions that we can now answer using the procedures we have been developing.

Example 1. On average, how much weight did a healthy non-smoker gain during the course of a study?

Example 2. Are healthy women who quit smoking more likely to gain weight or lose weight? How much weight?

Example 3. Are women who quit smoking more likely to gain weight than smokers who do not quit? Do they gain more weight on average? If so, how much more?

Example 4. If school kids are asked which is more important to them – good grades, athletic ability, or popularity –
1. What percentage would choose athletic ability?
2. Which response would be selected most often?
3. Would the responses be different for boys and girls?
4. Would the responses be different in rural, urban, and suburban schools?
5. Would the responses be different for kids of different races?

> source('http://www.calvin.edu/~rpruim/R/courses/m243.R') # or getCalvin('m243')
> getData('dasl/schoolkids') -> kids

Example 5. In testing food products for palatability, General Foods employed a 7-point scale from −3 (terrible) to +3 (excellent) with 0 representing "average". Their standard method for testing palatability was to conduct a taste test with 50 persons – 25 men and 25 women. Does the amount of liquid used in the samples matter?

> getData('dasl/tastetest') -> taste
> t.test(score~liq,data=taste)
> t.test(score~liq,data=taste,paired=T)

Example 6. Gosset (the man who invented Student's t) did an experiment to test whether kiln dried seeds have better yield (lbs/acre) than regular seeds. How would you design such an experiment?

> getData('dasl/gosset') -> corn
> with(corn,t.test(reg,kiln))
> with(corn,t.test(reg,kiln,paired=T))

c 2007 Randall Pruim ([email protected])

March 30, 2007 Math 243 – Spring 2007 51

Example 7. Tire treadwear can be measured two different ways: by weighing the tire and by measuring the depth of the grooves.
1. Do these methods give comparable results?
2. Can we predict one measurement from the other? (With what degree of accuracy?)

> getData('dasl/tirewear') -> tire
> with(tire,t.test(weight,groove))
> with(tire,t.test(weight,groove,paired=T))

Example 8. Does spending a weekend with a group of men banging on drums increase one's masculinity?

> getData('m243/drumbeating') -> drums

R Output

Here is a log of the R commands as done in class today (with my typos removed):

> source('http://www.calvin.edu/~rpruim/R/courses/m243.R') # or getCalvin('m243')
> getData('dasl/schoolkids') -> kids
Getting data from: http://www.calvin.edu/~rpruim/data/dasl/schoolkids.txt
# Reference:
#   Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a Social
#   Determinant for Children," Research Quarterly for Exercise and Sport, 63, 418-424

> table(kids$Goals)
 Grades Popular  Sports
    247     141      90

> prop.test(90,247+141+90)

        1-sample proportions test with continuity correction

data:  90 out of 247 + 141 + 90, null probability 0.5
X-squared = 184.5377, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1548111 0.2268756
sample estimates:
        p
0.1882845

> prop.test(90,247+141+90,p=1/3)

        1-sample proportions test with continuity correction

data:  90 out of 247 + 141 + 90, null probability 1/3
X-squared = 44.6049, df = 1, p-value = 2.411e-11
alternative hypothesis: true p is not equal to 0.3333333
95 percent confidence interval:
 0.1548111 0.2268756
sample estimates:
        p
0.1882845

> xtabs(~Race+Goals,data=kids)
       Goals
Race    Grades Popular Sports
  Other     22       9      5
  White    225     132     85

> prop.test(c(5,85),c(22+9+5, 225+132+85))

        2-sample test for equality of proportions with continuity correction

data:  c(5, 85) out of c(22 + 9 + 5, 225 + 132 + 85)
X-squared = 0.3212, df = 1, p-value = 0.5709
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.18723283  0.08039522
sample estimates:
   prop 1    prop 2
0.1388889 0.1923077

> xtabs(~wt.gain+smoke.status,data=hws)
       smoke.status
wt.gain new smoker nonsmoker quitter smoker
    no           2       142       4     42
    yes          4       150      20     43

> prop.test(c(20,43), c(24,85))

        2-sample test for equality of proportions with continuity correction

data:  c(20, 43) out of c(24, 85)
X-squared = 6.9395, df = 1, p-value = 0.008431
alternative hypothesis: two.sided
95 percent confidence interval:
 0.1176301 0.5372719
sample estimates:
   prop 1    prop 2
0.8333333 0.5058824

> xtabs(~wt.gain+smoke.status,data=hws)
       smoke.status
wt.gain new smoker nonsmoker quitter smoker
    no           2       142       4     42
    yes          4       150      20     43

c 2007 Randall Pruim ([email protected]) 52

March 30, 2007 Math 243 – Spring 2007

> x=c(4,42,20,43); dim(x) = c(2,2)
> x
     [,1] [,2]
[1,]    4   20
[2,]   42   43
> fisher.test(x)

        Fisher's Exact Test for Count Data

data:  x
p-value = 0.004687
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.04756274 0.69157777
sample estimates:
odds ratio
 0.2075003

> summary(wt.chg~smoke.status,data=hws,fun=favstats)
wt.chg    N=407
+------------+----------+---+-----+------+-----+------+----+---------+---------+
|            |          |N  |0%   |25%   |50%  |75%   |100%|mean     |sd       |
+------------+----------+---+-----+------+-----+------+----+---------+---------+
|smoke.status|new smoker|  6| -2.0| 2.500| 6.25|10.000|13.0| 6.000000| 5.648008|
|            |nonsmoker |292|-32.0|-0.125| 5.00|10.625|63.0| 5.637671|10.794352|
|            |quitter   | 24| -3.5| 6.375|13.25|16.875|33.5|12.395833| 8.362675|
|            |smoker    | 85|-48.0| 1.000| 5.00|10.500|40.0| 5.597647|12.346572|
+------------+----------+---+-----+------+-----+------+----+---------+---------+
|Overall     |          |407|-48.0| 0.000| 5.50|11.000|63.0| 6.033170|11.043234|
+------------+----------+---+-----+------+-----+------+----+---------+---------+

> bwplot(wt.chg~smoke.status,hws)
> t.test(wt.chg~smoke.status,hws,subset=smoke.status %in% c('smoker','quitter'))

        Welch Two Sample t-test

data:  wt.chg by smoke.status
t = 3.1333, df = 54.383, p-value = 0.002784
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.449030 11.147342
sample estimates:
mean in group quitter  mean in group smoker
            12.395833              5.597647

> qqmath(~wt.chg|smoke.status,hws)

c 2007 Randall Pruim
([email protected]) 53

April 5, 2007 Math 243 – Spring 2007 54

Statistics with Two or More Variables

Explanatory and Response Variables
• Response variable (also called dependent variable):
• Explanatory variable (also called independent variable):

If we are interested in determining a causal relationship, the ideal situation is to have an experiment where the researchers determine the values of the explanatory variable(s) and measure the response, but the terms explanatory and response are also used in observational studies.

Roadmap

Which statistical procedures we will use will depend upon the kind of variables (categorical vs. quantitative, response vs. explanatory). We're going to focus on 4 main situations (and some variations on those themes):

    Situation                                          Procedure
    two categorical variables                          Chi-square
    two quantitative variables                         simple linear regression
    categorical explanatory, quantitative response     1-way ANOVA
    quantitative explanatory, categorical response     logistic regression

Extensions

These methods can be extended to deal with more than 2 variables as well. In these situations we are typically interested in one response variable and several explanatory variables. Often (but not always) we are primarily interested in one of the explanatory variables and the others are included as covariates. Some reasons for including covariates in a study:
•
•
•

Examples

    response                    explanatory                        possible covariates
    clicking speed              hand used
    drug effect                 dosage
    getting disease or not      version of a gene
    winning or losing a game    rating difference between teams

c 2007 Randall Pruim ([email protected])

April 5, 2007 Math 243 – Spring 2007 55

The Cartoon Guide to One-way ANOVA

The basic ANOVA situation
• Main Question: Do the means of the quantitative variable depend on which group (given by the categorical variable) the individual is in? Or are they all the same?
• If the categorical variable has only 2 values:
• ANOVA allows for 3 or more groups/treatments/sub-populations

Example 1. Treating Blisters.
• Subjects: 25 patients with blisters
• Treatments: Treatment A, Treatment B, Placebo (P)
• Measurement: # of days until blisters heal
• Data [and means]:
  ◦ A: 5,6,6,7,7,8,9,10 [7.25]
  ◦ B: 7,7,8,9,9,10,10,11 [8.875]
  ◦ P: 7,9,9,10,10,10,11,12,13 [10.11]

Question: Are these differences significant? Or would we expect differences this large just by random chance?

[Figure: two panels (Ex_1 and Ex_2) of side-by-side boxplots of days by treatment (A, B, P).]

Whether differences between the groups are significant depends on
•
•
•
We need to develop a test statistic that takes these things into account.

c 2007 Randall Pruim ([email protected])

April 5, 2007 Math 243 – Spring 2007

Some notation for ANOVA
• n = number of individuals all together
• I = number of groups
• xij = value for individual j in group i
• x̄ = sample mean of quant. variable for entire data set (grand mean)
• s = sample s.d. of quant.
variable for entire data set

Group i has
• ni = # of individuals in group i
• x̄i = sample mean for group i (group mean)
• si = sample standard deviation for group i

From our example (I = 3; 3 groups)
• n1 = 8, n2 = 8, n3 = 9
• x̄1 = 7.25, x̄2 = 8.875, x̄3 = 10.11

> summary(days~treatment,data=blisters,fun=favstats) # requires m243 stuff
days    N=25
+---------+-+--+--+----+---+-----+----+--------+--------+
|         | |N |0%|25% |50%|75%  |100%|mean    |sd      |
+---------+-+--+--+----+---+-----+----+--------+--------+
|treatment|A| 8|5 |6.00| 7 | 8.25|10  | 7.25000|1.669046|
|         |B| 8|7 |7.75| 9 |10.00|11  | 8.87500|1.457738|
|         |P| 9|7 |9.00|10 |11.00|13  |10.11111|1.763834|
+---------+-+--+--+----+---+-----+----+--------+--------+
|Overall  | |25|5 |7.00| 9 |10.00|13  | 8.80000|1.979057|
+---------+-+--+--+----+---+-----+----+--------+--------+

The ANOVA model

    xij  =  µi  +  εij ,        εij ∼ N(0, σ)
    DATA    FIT    ERROR

Assumptions of this model

It is assumed that each group (subpopulation) . . .
• is normally distributed about its group mean (µi)
• has the same standard deviation
That is, Xij ∼ N(µi, σ)

c 2007 Randall Pruim ([email protected]) 56

April 5, 2007 Math 243 – Spring 2007 57

Checking the assumptions

• Equal Variances Check.
  ◦ Rule of Thumb:

+---------+-+--+--+----+---+-----+----+--------+--------+
|         | |N |0%|25% |50%|75%  |100%|mean    |sd      |
+---------+-+--+--+----+---+-----+----+--------+--------+
|treatment|A| 8|5 |6.00| 7 | 8.25|10  | 7.25000|1.669046|
|         |B| 8|7 |7.75| 9 |10.00|11  | 8.87500|1.457738|
|         |P| 9|7 |9.00|10 |11.00|13  |10.11111|1.763834|
+---------+-+--+--+----+---+-----+----+--------+--------+

• Overall Normality Check:
  ◦ uses residual = xij − x̄i (how different is each data value from its group mean?)

[Figure: "Normal Q-Q Plot for Blister Residuals" – sample quantiles of the residuals vs. theoretical quantiles.]

What does ANOVA do?

At its simplest (there are extensions) ANOVA tests the following hypotheses:
• H0 : The means of all the groups are equal. (µ1 = µ2 = · · · = µI)
• Ha : Not all the means are equal.
  ◦ doesn't say how or which ones differ
  ◦ can follow up with "multiple comparisons" if we reject H0

A quick look at the ANOVA table

               Df   Sum Sq   Mean Sq   F value   Pr(>F)
    treatment   2    34.74     17.37      6.45   0.0063
    Residuals  22    59.26      2.69

Conclusion:

c 2007 Randall Pruim ([email protected])

April 5, 2007 Math 243 – Spring 2007 58

How ANOVA works – some details

ANOVA measures two sources of variation in the data and compares their relative sizes
• variation BETWEEN groups (treatment effect)
  ◦ for each data value look at the difference between its group mean and the overall mean

    Σ (group mean − overall mean)² = Σ (x̄i − x̄)² = SSTr

• variation WITHIN groups ("error")
  ◦ for each data value we look at the difference between that value and the mean of its group

    Σ (individual measurement − group mean)² = Σ (xij − x̄i)² = SSE

• SSTr and SSE are then adjusted to account for sample sizes and the number of groups:
  ◦ DFTr = I − 1 = degrees of freedom for numerator
  ◦ DFE = n − I = degrees of freedom for denominator
  ◦ MSTr = SSTr/DFTr;   MSE = SSE/DFE

The ANOVA F-statistic is the ratio of the Between Group Variation (explained by FIT) to the Within Group Variation (Residuals):

    F = Between / Within = explained variation / unexplained variation = MSTr / MSE

• A large value of F is evidence against H0, since it indicates that
• In fact, if H0 is true (and the model assumptions are also true), then E(MSTr) = E(MSE) = σ², so F will tend to be about 1.
• When H0 is not true (but model assumptions are still true), then E(MSTr) > E(MSE) = σ², so F will tend to be > 1.

c 2007 Randall Pruim ([email protected])

April 5, 2007 Math 243 – Spring 2007 59

Computing F (an example)

Let's try an even smaller example. Suppose we have three groups

    Group 1: 5.3, 6.0, 6.7          [x̄1 = 6.00]
    Group 2: 5.5, 6.2, 6.4, 5.7     [x̄2 = 5.95]
    Group 3: 7.5, 7.2, 7.9          [x̄3 = 7.53]

overall mean: 6.44;    F = 2.5528/0.25025 = 10.21575

The ANOVA table revisited

The traditional way to summarize all these calculations is in an ANOVA table (which matches the information above up to round-off error):

               Df   Sum Sq   Mean Sq   F value   Pr(>F)
    group       2     5.13      2.56     10.22   0.0084
    Residuals   7     1.76      0.25

Now let's fill in the ANOVA table for our blister example:

               Df   Sum Sq   Mean Sq   F value   Pr(>F)
    treatment         34.74                       0.0063
    Residuals         59.26

c 2007 Randall Pruim ([email protected])

April 5, 2007 Math 243 – Spring 2007 60

SST, MST, and R²

Let's add another row to our ANOVA table for the blister example:

               Df   Sum Sq   Mean Sq   F value   Pr(>F)
    treatment   2    34.74     17.37      6.45   0.0063
    Residuals  22    59.26      2.69
    Total      24    94.00      3.92

    DFT = 24      SST = 94      MST = 3.92      √MST = 1.98

> favstats(blisters$days)
      0%      25%      50%      75%     100%     mean       sd
5.000000 7.000000 9.000000 10.00000 13.00000 8.800000 1.979057

    R² = SSTr/SST = proportion of variation explained by groups

Getting R to do ANOVA

For simple ANOVA, the key function is lm() (stands for linear model). The basic form is

model <- lm(response~explanatory,data=mydata) # saving results in a variable called model

Here's how to analyze our blister data in R. (We'll learn some more ANOVA-related R commands next week.)

> model <- lm(days~treatment,data=blisters)
> bwplot(days~treatment,data=blisters) # make side-by-side boxplots
> anova(model)
Analysis of Variance Table

Response: days
          Df Sum Sq Mean Sq F value   Pr(>F)
treatment  2 34.736  17.368  6.4474 0.006256 **
Residuals 22 59.264   2.694

> plot(model) # some diagnostic plots
> qqmath(resid(model)) # normal quantile plot of residuals
> summary(model)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.2500     0.5803  12.494 1.83e-11 ***
treatmentB    1.6250     0.8206   1.980  0.06033 .
treatmentP    2.8611     0.7975   3.588  0.00164 **

Residual standard error: 1.641 on 22 degrees of freedom
Multiple R-Squared: 0.3695,     Adjusted R-squared: 0.3122
F-statistic: 6.447 on 2 and 22 DF,  p-value: 0.006256

c 2007 Randall Pruim ([email protected])

April 9, 2007 Math 243 – Spring 2007 61

Dot notation
• x·· = grand mean (all items in all groups)
• xi· = mean for group i

Where's the difference?

If the ANOVA test gives a small p-value, the natural follow-up question is: which groups differ from which? An easy idea is to do a bunch of 2-sample t tests (on each pair of groups) to answer this question. This doesn't quite work because of . . .

The problem of multiple comparisons

Example 1. If you do a study with 5 groups, how many pairs of groups are there?

Example 2. If you do 10 independent hypothesis tests with α = 0.05, and the null hypothesis is true for each test, what is the probability that at least one will be "significant at the α = 0.05 level" just by random chance? What if we make 10 confidence intervals?
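For the first of these questions, the computation is quick to do in R (assuming, as stated, that the 10 tests are independent and each uses α = 0.05):

# P(at least one of 10 independent tests rejects) when each null hypothesis is true
1 - (1 - 0.05)^10
# the same calculation as the number of tests grows
k <- 1:20
1 - (1 - 0.05)^k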
One solution to this problem was proposed by Tukey. We won't go into details, but here are the key ideas:
• compare each pair of groups by forming confidence intervals
• adjust the critical value (t*) to widen the intervals so that the probability that all the confidence intervals formed by a random sample correctly contain the parameter is the desired confidence level (95%, for example). [simultaneous confidence intervals]
• consider two groups to be significantly different if the confidence interval for that pair does not include 0.
• if each group has J observations, the CI's have the following form

    xi· − xj· ± Q* √(MSE/J)

  where Q* is our adjusted critical value and J is the size of each group.
• the distribution for Q is called the Studentized range distribution.
• this can be adjusted to work for groups of similar but not exactly equal size.

R can automate this:
• TukeyHSD(aov(response~explanatory,data=mydata))
• TukeyHSD(aov(model)) where model is the result of lm().
• simint() in the multcomp package can handle Tukey's method as well as several other methods for simultaneous confidence intervals: simint(response~explanatory,data=mydata,type="Tukey")

c 2007 Randall Pruim ([email protected])

April 9, 2007 Math 243 – Spring 2007 62

Attracting Bugs

> getData('m243/bugs') -> bugs
> summary(NumTrap~Color,bugs,fun=favstats)
NumTrap    N=24
+-------+-+--+--+-----+----+-----+----+--------+---------+
|       | |N |0%|25%  |50% |75%  |100%|mean    |sd       |
+-------+-+--+--+-----+----+-----+----+--------+---------+
|Color  |B| 6| 7|11.75|15.0|19.00|21  |14.83333| 5.344779|
|       |G| 6|15|26.75|34.5|38.50|41  |31.50000| 9.914636|
|       |W| 6|12|13.25|15.5|17.00|21  |15.66667| 3.326660|
|       |Y| 6|38|45.25|46.5|47.75|59  |47.16667| 6.794606|
+-------+-+--+--+-----+----+-----+----+--------+---------+
|Overall| |24| 7|14.75|21.0|39.50|59  |27.29167|14.947674|
+-------+-+--+--+-----+----+-----+----+--------+---------+

[Figure: side-by-side boxplots of NumTrap by Color (B, G, W, Y).]

> bwplot(NumTrap~Color,bugs)
> bugs.lm <- lm(NumTrap~Color,bugs)
> bugs.aov <- aov(NumTrap~Color,bugs)
> summary(bugs.lm)

Residuals:
     Min       1Q   Median       3Q      Max
-16.5000  -2.9167   0.1667   5.2083  11.8333

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  14.8333     2.7696   5.356 3.05e-05 ***
ColorG       16.6667     3.9168   4.255 0.000387 ***
ColorW        0.8333     3.9168   0.213 0.833671
ColorY       32.3333     3.9168   8.255 7.16e-08 ***

Residual standard error: 6.784 on 20 degrees of freedom
Multiple R-Squared: 0.8209,     Adjusted R-squared: 0.794
F-statistic: 30.55 on 3 and 20 DF,  p-value: 1.151e-07

> summary(bugs.aov)
          Df Sum Sq Mean Sq F value    Pr(>F)
Color      3 4218.5  1406.2  30.552 1.151e-07 ***
Residuals 20  920.5    46.0

> anova(bugs.lm)
Analysis of Variance Table

Response: NumTrap
          Df Sum Sq Mean Sq F value    Pr(>F)
Color      3 4218.5  1406.2  30.552 1.151e-07 ***
Residuals 20  920.5    46.0

> plot(bugs.lm)
> plot(bugs.aov)

[Figure: diagnostic plots for the bugs model, including residuals plotted against the order of observation.]

> TukeyHSD(bugs.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = NumTrap~Color, data = bugs)

$Color
          diff        lwr       upr     p adj
G-B  16.6666667   5.703670 27.629663 0.0020222
W-B   0.8333333 -10.129663 11.796330 0.9964823
Y-B  32.3333333  21.370337 43.296330 0.0000004
W-G -15.8333333 -26.796330 -4.870337 0.0032835
Y-G  15.6666667   4.703670 26.629663 0.0036170
Y-W  31.5000000  20.537004 42.462996 0.0000006

> plot(TukeyHSD(bugs.aov))
> simint(NumTrap~Color,bugs,type='Tukey')

        Simultaneous confidence intervals: Tukey contrasts

95 % confidence intervals

               Estimate   2.5 %  97.5 %
ColorG-ColorB    16.667   5.704  27.630
ColorW-ColorB     0.833 -10.130  11.796
ColorY-ColorB    32.333  21.370  43.296
ColorW-ColorG   -15.833 -26.796  -4.870
ColorY-ColorG    15.667   4.704  26.630
ColorY-ColorW    31.500  20.537  42.463

> plot(simint(NumTrap~Color,bugs,type='Tukey'))

c 2007 Randall Pruim ([email protected])

April 9, 2007 Math 243 – Spring 2007 63

[Figures: plot of the Tukey HSD intervals ("95% family-wise confidence level", differences in mean levels of Color), plot of the simint Tukey contrasts ("95% two-sided confidence intervals"), and a residuals vs. fitted plot for the bugs model.]

ChickWeight

> data(ChickWeight)
> bwplot(weight~Diet,data=ChickWeight,subset=Time==21)
> chick.lm <- lm(weight~Diet,data=ChickWeight,subset=Time==21)
> xyplot(weight~Time|Chick,data=ChickWeight,type='b',lwd=2)
> xyplot(weight~Time|Diet,groups=Chick,data=ChickWeight,type='b',lwd=2)

c 2007 Randall Pruim ([email protected])

April 10, 2007 Math 243 – Spring 2007 64

Checking Model Assumptions

The ANOVA assumptions are about the population, not about the samples, so we can't directly check them. But we can check to see if our data look like a reasonable sample from a population with
• normal distributions for each group
• the same variance in each group

We do this by looking at the residuals (xij − xi·).

plot(model) – where model is the result of lm() – will show some diagnostic plots.

Fixing Problems

Dealing with outliers
• Outliers that can be determined to be clear errors of some sort should be fixed, if possible.
• If we are sure the value is wrong, but have no information about what the correct value is, we may choose to remove the value from the data.
• We can't just remove outliers because that makes the statistics work better.
• Winsorizing: moving the top α of the data to the 1 − α-quantile and the bottom α of the data to the α-quantile (where α is not too big) is a way to make extreme data values less influential (see the sketch below).
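Winsorizing is easy to do by hand. Here is a minimal sketch; the function name winsorize and the default α = 0.05 are just choices made for this illustration.

# clamp the bottom alpha of the data up to the alpha-quantile and
# the top alpha of the data down to the (1 - alpha)-quantile
winsorize <- function(x, alpha = 0.05) {
  lo <- quantile(x, alpha)
  hi <- quantile(x, 1 - alpha)
  pmin(pmax(x, lo), hi)
}
winsorize(c(1, 2, 3, 4, 100))   # the extreme value 100 gets pulled in toward the rest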
Data transformations
• Sometimes applying a function to the data before analysis improves normality and/or homoscedasticity (equal variance in the different groups). Common transformations include powers, roots, logarithms, and exponentials.
• Some transformations are suggested by the context of the problem. Square root often works well for Poisson data.
• Sometimes transformations to fix heteroscedasticity hurt normality and vice versa.
• If the assumptions of ANOVA don't seem to be met and no (reasonable) transformation fixes the problem, then we need to turn to other methods.

ANOVA with 2 groups vs. 2-sample t

ANOVA with two groups is exactly the same thing as a two-sample t-test that assumes a common variance in the two groups. (Recall that this is not as robust as the two-sample t-test that does not make this assumption.) In fact, in this case F = t².

c 2007 Randall Pruim ([email protected])

April 10, 2007 Math 243 – Spring 2007 65

A little linear algebra

The Big Picture

[Figure: a vector diagram showing the observation y, the overall mean y··, the fit yi· (in the model space), the corrected observation y − y··, the treatment vector yi· − y··, and the fitted error y − yi·.]

Key ideas:
• The important triangle is the red-green-blue one showing y − y·· = (yi· − y··) + (y − yi·)
• It is pretty easy to show that yi· − y·· ⊥ y − yi·.
• This means that (by the Pythagorean identity)

    |y − y··|² = |yi· − y··|² + |y − yi·|²

• But this is the same as

    Σij (Yij − Ȳ··)²  =  Σij (Ȳi· − Ȳ··)²  +  Σij (Yij − Ȳi·)²
         SST                  SSTr                 SSE

• Degrees of freedom are just dimensions of subspaces:
  ◦ Since the sum of the components in y − y·· must be 0 (why?), y − y·· lives in an (n − 1)-dimensional subspace. (Now you know why degrees of freedom is n − 1.)
  ◦ yi· lives in an I-dimensional space called the model space, since there are I group means to specify. yi· is the closest vector in the model space to y. (It can be obtained by projection onto the model space.)
  ◦ This means that yi· − y·· lives in an (I − 1)-dimensional space. This explains why DFTr = I − 1.
  ◦ y − yi· lives in an (n − I)-dimensional space. This explains why DFE = n − I.
• Our test statistic is

    F = (SSTr/DFTr) / (SSE/DFE)

c 2007 Randall Pruim ([email protected])

April 10, 2007 Math 243 – Spring 2007 66

The primary distinctions between different types of ANOVA are
• how one further subdivides the green treatment vector, and
• what restrictions are placed on the model space (which affects the purple fit vector).
These restrictions will typically specify proposed (types of) relationships between the group means.

An ANOVA example using linear algebra

Here is a very small example.

        pollution   location
    1      124.00   Hill Suburb
    2      110.00   Hill Suburb
    3      107.00   Plains Suburb
    4      115.00   Plains Suburb
    5      126.00   Central City
    6      138.00   Central City

The data consist of 6 measurements of air pollution – 2 each at 3 locations in a metropolitan area. In our data, the largest values occur in the central city, but perhaps that is just due to random chance. We can use the method just described to test the null hypothesis that there is no difference in air quality between the 3 locations. This can be done easily in R.

> getCalvin('m344'); pol <- getData('m344/airpollution')
> pol.lm <- lm(pollution~location,data=pol)
> anova(pol.lm)
Analysis of Variance Table

Response: pollution
          Df Sum Sq Mean Sq F value Pr(>F)
location   2    468     234    3.48   0.17
Residuals  3    202      67

SSTr and SSE are located in the second column; these are divided by the appropriate degrees of freedom (subspace dimension) in the third column to give MSTr and MSE, from which the test statistic F is computed in the fourth column. The p-value is also listed.
In this case the evidence is not strong enough to reject the hypothesis that air quality is the same at all three locations. This isn't too surprising given such a small sample size. Notice that SST is missing from this table and that R uses the word 'residuals' instead of 'error'.

c 2007 Randall Pruim ([email protected])

April 10, 2007 Math 243 – Spring 2007 67

Show me the vectors!

A suitable set of vectors to use as a basis for R⁶ in this case is

    (1, 1, 1, 1, 1, 1),  (1, 1, −1, −1, 0, 0),  (1, 1, 1, 1, −2, −2),
    (1, −1, 0, 0, 0, 0),  (0, 0, 1, −1, 0, 0),  (0, 0, 0, 0, 1, −1)

We would, of course, normalize to unit length to get our vectors U1, . . . , U6. A little matrix algebra gives

    y = (124, 110, 107, 115, 126, 138)
      = (120, 120, 120, 120, 120, 120)
        + (3, 3, −3, −3, 0, 0) + (−6, −6, −6, −6, 12, 12)
        + (7, −7, 0, 0, 0, 0) + (0, 0, −4, 4, 0, 0) + (0, 0, 0, 0, −6, 6)

And our Pythagorean decomposition is

    |y − (120, . . . , 120)|²  =  |(−3, −3, −9, −9, 12, 12)|²  +  |(7, −7, −4, 4, −6, 6)|²
              670              =             468               +            202

with the two pieces breaking down further along the basis vectors as

    670 = 36 + 432 + 98 + 32 + 72 = (36 + 432) + (98 + 32 + 72) = 468 + 202

Notice how these results correspond to the R output above.

c 2007 Randall Pruim ([email protected])

Math 243 – Spring 2007 Friday the Thirteenth of April, 2007 68

A new situation

For ANOVA, we had a categorical explanatory variable and a quantitative response variable. Regression (at least our first form of regression) deals with two quantitative variables.

Taking a look

As with ANOVA, we'll begin with pictures:

    situation     plot                    R command
    ANOVA         side-by-side boxplots   bwplot(y~x, data=mydata)
    regression    scatter plot            xyplot(y~x, data=mydata)
                                          xyplot(y~x|z, data=mydata)
                                          xyplot(y~x, data=mydata, groups=z)

Examples

> data(iris) ; data(ToothGrowth) ; data(iris) # built-in datasets
> require(MASS) ; data(GAGurine) ; data(Animals) # datasets in package MASS

The simple linear regression model

The ANOVA model was quite simple – we were just trying to tell if a bunch of means were the same or different. Now things are a bit more complicated. There are many different relationships that could exist between two quantitative variables. The model for simple linear regression is

    Yi  =  β0 + β1 xi  +  εi ,        εi ∼ N(0, σ)
    data       fit         error

The assumptions for the linear regression model are
1.
2.
3.

c 2007 Randall Pruim ([email protected])

Math 243 – Spring 2007 Friday the Thirteenth of April, 2007 69

The best fit line

When this sort of linear relationship between our variables seems plausible, how do we decide which line is the best fit for our data? In other words, how do we estimate the parameters β0 and β1, and how reliable are those estimates?

The most common answer to this question is the least squares line, that is, the line that minimizes the sum of the squared residuals.2 That is, we want to choose b0 and b1 so that

    SSE = Σi [yi − (b0 + b1 xi)]² = Σi residuali²

is as small as possible. So to find b0 and b1 we need to solve an optimization problem. Fortunately, calculus provides a nice solution to this problem. Take a couple partial derivatives, do a little algebra, and we learn . . .
• ȳ = b0 + b1 x̄, so (x̄, ȳ) is always on the regression line, and b0 = ȳ − b1 x̄.
• b1 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)²  =  Sxy/Sxx  =  [ Σi ((yi − ȳ)/sy)·((xi − x̄)/sx) / (n − 1) ] · (sy/sx)  =  r · (sy/sx)

So for a given value of x, our regression line prediction for y is

    ŷ = b0 + b1 x = β̂0 + β̂1 x

Correlation Coefficient

The expression

    r = [ Σi ((xi − x̄)/sx)·((yi − ȳ)/sy) ] / (n − 1)

is called the correlation coefficient. Notice that the terms in the numerator are products of z-scores for x and y.
1. The value of r is a measure of the strength and direction of the association between x and y.
2. r is symmetric in x and y, so it doesn't depend on which variable you consider to be explanatory and which to be response.
3. −1 ≤ r ≤ 1
4. r² = SSR/SST = 1 − SSE/SST, where SSR = Σi (ŷi − ȳ)², SSE = Σi (yi − ŷi)², and SST = Σi (yi − ȳ)² (just like in ANOVA).

2 Linear algebra note: in terms of linear algebra, this is the closest point in the model space to the data point.

c 2007 Randall Pruim ([email protected])

Friday the Thirteenth of April, 2007 Math 243 – Spring 2007 70

Estimating σ²

We now have our least squares estimates for β0 and β1; what about the other parameter in the model, namely σ²?

    σ̂² = SSE/(n − 2) = Σi (yi − ŷi)² / (n − 2)

Why n − 2?
• This is the correct denominator to make the estimate unbiased: E(σ̂²) = σ².
• This is the correct degrees of freedom3, since (roughly) we lose a degree of freedom for estimating each of β0 and β1.

As usual, we will let s = √σ̂².

R and regression

The basic command for doing regression in R is the same as the one for doing ANOVA. In fact, ANOVA is really just a special case of regression (more on that idea later).

Example 1. We want to predict ACT scores from SAT scores. We sample scores from 60 students who have taken both tests.

> getData("m243/act-sat") -> act
> xyplot(ACT~SAT,act,panel=panel.lm)
> act.lm <- lm(ACT~SAT,data=act)
> plot(act.lm)
> anova(act.lm)
Analysis of Variance Table

Response: ACT
          Df Sum Sq Mean Sq F value    Pr(>F)
SAT        1 874.37  874.37  116.16 1.796e-15 ***
Residuals 58 436.56    7.53

> summary(act.lm)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.427642   1.691981  -0.844    0.402
SAT          0.024498   0.001807  13.556   <2e-16 ***

Residual standard error: 2.744 on 58 degrees of freedom
Multiple R-Squared: 0.667,      Adjusted R-squared: 0.6612
F-statistic: 116.2 on 1 and 58 DF,  p-value: 1.796e-15

> act[47,]
   Student SAT ACT
47      47 420  21
> actAdj <- act[-47,]; lm.actAdj <- lm(ACT~SAT,actAdj)
> summary(lm.actAdj)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.626282   1.844230   0.882    0.382
SAT         0.021374   0.001983  10.778 1.80e-15 ***

Residual standard error: 2.333 on 57 degrees of freedom
Multiple R-Squared: 0.7632,     Adjusted R-squared: 0.7591
F-statistic: 183.8 on 1 and 57 DF,  p-value: < 2.2e-16

[Figure: scatterplot of ACT vs. SAT with the fitted regression line.]

3 Linear algebra note: degrees of freedom equals dimension of subspace.

c 2007 Randall Pruim ([email protected])

Friday the Thirteenth of April, 2007 Math 243 – Spring 2007 71

Since the model is based on normal distributions and we don't know σ . . .

1. Regression is going to be sensitive to outliers. Outliers with especially large or small values of the independent variable are especially influential.
2. We can check if the model is reasonable by looking at our residuals:
(a) Histograms and normal quantile plots indicate overall normality. We are looking for a roughly bell-shaped histogram or a roughly linear normal quantile plot.
(b) Plots of
    • residuals vs. x, or
    • residuals vs. order, or
    • residuals vs. fit   [note: fit = ]
indicate whether the standard deviation appears to remain constant throughout. We are looking to NOT see any clear pattern in these plots. A pattern would indicate something other than randomness is influencing the residuals.

3. We can do inference for β0, β1, etc. using the t distributions; we just need to know the corresponding eSE and degrees of freedom.

    parameter   estimator    eSE                                         df
    β0          β̂0 = b0     eSE_b0 = s √( 1/n + x̄² / Σ(xi − x̄)² )       n − 2
    β1          β̂1 = b1     eSE_b1 = s / √( Σ(xi − x̄)² )                n − 2

We won't ever compute these eSE's by hand, but notice that they are made up of pieces that look familiar (square roots, n in the denominator, squares of differences from the mean, all the usual stuff). Furthermore, just by looking at the formulas, we can learn something about the behavior of the confidence intervals and hypothesis tests involved.

eSE_b0 and eSE_b1 are easy to identify in the computer output. We can also find the (two-sided) p-values for two hypothesis tests.
(a) H0 : β0 = 0 [usually not interesting]
(b) H0 : β1 = 0. This is much more interesting for two reasons. First, the slope is often a very interesting parameter to know because                     . Second, this is a measure of how useful the model is for making predictions because β1 = 0 only if                     .

4. Confidence intervals for β0 (usually not interesting) and β1 (usually interesting) have a familiar form:

    estimate ± t* · eSE

c 2007 Randall Pruim ([email protected])

April 16, 2007 Math 243 – Spring 2007 72

Inference for Regression

Four inference situations: β0 (usually least interesting), β1 (and model utility), predicting the mean response for a given value of the explanatory variable, predicting an individual response for a given value of the explanatory variable. All of these are based on the normal and t-distributions (because of the model). In each of the cases below

    (estimator − parameter) / eSE  ∼  t with n − 2 degrees of freedom

    parameter                    estimator                          eSE                                                 df
    β0                           β̂0 = b0                           eSE_b0 = s √( 1/n + x̄² / Σ(xi − x̄)² )               n − 2
    β1                           β̂1 = b1                           eSE_b1 = s / √( Σ(xi − x̄)² )                        n − 2
    µ_y|x*                       µ̂_y|x* = ŷ = β̂0 + β̂1 x*          eSE_µ̂ = s √( 1/n + (x* − x̄)² / Σ(xi − x̄)² )         n − 2
    ŷ (individual prediction)    ŷ = β̂0 + β̂1 x*                   eSE_ŷ = s √( 1 + 1/n + (x* − x̄)² / Σ(xi − x̄)² )     n − 2

Confidence and prediction intervals in R

> act.lm <- lm(ACT~SAT,data=act)
> predict(act.lm,newdata=data.frame(SAT=1000),interval='confidence')
          fit      lwr      upr
[1,] 22.99997 22.21076 23.78917
> predict(act.lm,newdata=data.frame(SAT=1000),interval='prediction')
          fit      lwr      upr
[1,] 22.99997 17.45177 28.54816
> xyplot(ACT~SAT,data=act,panel=panel.lmbands)

Model Checking for Regression
• If a line doesn't fit, don't fit a line. (Anscombe's examples – see the sketch below.)
• Look at various residual plots.

Where does regression go from here?
• Regression with transformed data
• More than one explanatory variable
• Categorical explanatory variables (ANOVA is regression)
• Robust regression
• Logistic regression (allows for categorical response)

c 2007 Randall Pruim ([email protected])
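The Anscombe examples mentioned above are built into R as the data frame anscombe; here is a minimal sketch (assuming, as elsewhere in these notes, that the lattice package is loaded). All four x-y pairs give essentially the same fitted line, but only some of them actually look like lines.

> data(anscombe)
> coef(lm(y1 ~ x1, data = anscombe))    # roughly intercept 3 and slope 0.5
> coef(lm(y2 ~ x2, data = anscombe))    # nearly identical coefficients ...
> xyplot(y2 ~ x2, data = anscombe)      # ... but this relationship is clearly curved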
April 16, 2007 Math 243 – Spring 2007 73

Example: Skin Thickness and Body Density

There are many reasons why one would like to know the fat content of a human body. The most accurate way to estimate this is by determining the body density (weight per unit volume). Since fat is less dense than other body tissue, a lower density indicates a higher relative fat content. Body density is difficult to measure directly (the standard method requires weighing the subject underwater), so scientists have looked for other measurements that can accurately predict body density. One such measurement we will call skinfold thickness; it is actually the logarithm of the sum of four skinfold thicknesses measured at different points on the body.

Use the R output below to answer the question: How well does skinfold thickness predict body density?

[Figures: scatterplot of density vs. skthick, and the diagnostic plots Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

Residuals:
      Min        1Q    Median        3Q       Max
-0.018967 -0.005092 -0.000498  0.004949  0.023679

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.16300    0.00656   177.3   <2e-16
skthick     -0.06312    0.00414   -15.2   <2e-16

Residual standard error: 0.00854 on 90 degrees of freedom
Multiple R-Squared: 0.72,      Adjusted R-squared: 0.717
F-statistic: 232 on 1 and 90 DF,  p-value: <2e-16

Analysis of Variance Table
          Df  Sum Sq Mean Sq F value Pr(>F)
skthick    1 0.01691 0.01691     232 <2e-16
Residuals 90 0.00656 0.00007

c 2007 Randall Pruim ([email protected])

April 20, 2007 Math 243 – Spring 2007 74

Regression with transformed data

Reasons to transform data
• better fit
• better residual behavior
• theoretical model

Some common transformations

If we transform the data by x′ = f(x) and y′ = g(y), then we can fit the model

    y′ = β0 + β1 x′ + ε′

and then back-transform to get a model in terms of the original variables.

    transformation               linear form                back transformation        residuals
    x′ = log(x), y′ = y          y = β0 + β1 log(x)         y = β0 + β1 log(x) + ε     ε = ε′ ∼ N(0, σ)
    y′ = log(y), x′ = x          log(y) = β0 + β1 x         y = β0 · e^(β1 x) · ε      log(ε) = ε′ ∼ N(0, σ)
    x′ = log(x), y′ = log(y)     log(y) = β0 + β1 log(x)    y = β0 · x^β1 · ε          log(ε) = ε′ ∼ N(0, σ)
    x′ = 1/x, y′ = y             y = β0 + β1/x              y = β0 + β1/x + ε          ε = ε′ ∼ N(0, σ)
    ...

Some examples

c 2007 Randall Pruim ([email protected])

April 23, 2007 Math 243 – Spring 2007 75

Logistic regression: Regression with a categorical response

The logistic regression model
• Idea: predict p(x) = probability of success for a given value of the explanatory variable x.
• Problem: p(x) ∈ [0, 1], but the range of β0 + β1 x is (−∞, ∞)
• A Fix: transform p(x) to something with range (−∞, ∞)
  ◦ odds: p(x)/(1 − p(x)) ∈ [0, ∞)   [assuming p(x) ≠ 1]
  ◦ log odds: log( p(x)/(1 − p(x)) ) ∈ (−∞, ∞)   [assuming p(x) ≠ 0 and p(x) ≠ 1]
• logistic model (fit and back transformation)

    log( p(x)/(1 − p(x)) ) = β0 + β1 x
    p(x) = e^(β0 + β1 x) / ( 1 + e^(β0 + β1 x) )

Fitting the logistic regression model

Logistic regression is not fit using a least squares method. Instead it uses the maximum likelihood method.
• data: (1, S), (2, S), (3, F)
• likelihood:

    L(b0, b1) = p(1) · p(2) · (1 − p(3))
              = [ e^(b0 + b1·1) / (1 + e^(b0 + b1·1)) ] · [ e^(b0 + b1·2) / (1 + e^(b0 + b1·2)) ] · [ 1 − e^(b0 + b1·3) / (1 + e^(b0 + b1·3)) ]

• We want to find b0 and b1 that make L(b0, b1) as large as possible.
  ◦ maximum likelihood estimates for logistic regression are approximated by numerical methods.

For simple linear regression, the least squares estimates and the maximum likelihood estimates are the same.

Using R

R is happy to fit a logistic regression model for you.

Example 1. Space Shuttle O-rings. Here is an example using the space shuttle O-ring data (Example 13.6). Note: the data in the book do not match the electronic data. As far as I can tell, the printed data in the book are incorrect.

c 2007 Randall Pruim ([email protected])

April 23, 2007 Math 243 – Spring 2007 76

> data(xmp13.06)
> glm(Failure~Temperature,data=xmp13.06,family='binomial') -> glm.shuttle
> summary(glm.shuttle)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.74641    6.02142   1.951   0.0511 .
Temperature -0.18843    0.08909  -2.115   0.0344 *

> predict(glm.shuttle,newdata=data.frame(Temperature=31))
[1] 5.905015
> predict(glm.shuttle,newdata=data.frame(Temperature=31),type='response')
[1] 0.9972817
> temps <- seq(30,100,by=2)
> xyplot(predict(glm.shuttle,type='response',newdata=data.frame(Temperature=temps))~temps)

Example 2. Baseball

How well does average margin of victory predict winning percentage for major league baseball teams?

> getData("m243/runswins04") -> bb
> bb$runmargin = (bb$R - bb$OR) / (bb$G)
# data has summarized data for each team, so different syntax here:
> glm(cbind(W,L)~runmargin,data=bb,family='binomial') -> glm.bb
> summary(glm.bb)
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.001175   0.029057   -0.04    0.968
runmargin    0.447539   0.041646   10.75   <2e-16 ***

> bb$winP <- bb$W/bb$G
> bb$predWinP <- predict(glm.bb,newdata=data.frame(runmargin=bb$runmargin),type='response')
> bb[,c(1,22,23)]
         TEAM      winP  predWinP
1     Anaheim 0.5679012 0.5696953
2      Boston 0.6049383 0.6221895
3   Baltimore 0.4814815 0.5079932
4   Cleveland 0.4938272 0.5003968
. . .
28   New York 0.4382716 0.4672926
29   Montreal 0.4135802 0.4082120
30  Milwaukee 0.4161491 0.4150606

Note: In this particular example, a simple linear regression works quite well too, because winning percentages for baseball teams are quite close to .500, and the middle of the curves modeled by logistic regression is pretty flat.

c 2007 Randall Pruim ([email protected])

April 26, 2007 Math 243 – Spring 2007 77

Interpreting the parameters in logistic regression

Once again, the more interesting parameter is β1. According to the logistic regression model,

    β1 = log( p(x+1)/(1 − p(x+1)) ) − log( p(x)/(1 − p(x)) )
       = log[ ( p(x+1)/(1 − p(x+1)) ) / ( p(x)/(1 − p(x)) ) ]
       = log odds ratio

and

    e^β1 = odds ratio

That is, for each increase of 1 unit in the explanatory variable, the odds increase by a factor of e^β1.
We can use the output from R to construct a confidence interval for this odds ratio:

> glm(Failure~Temperature,data=xmp13.06,family='binomial') -> glm.shuttle
> summary(glm.shuttle)
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 11.74641    6.02142   1.951   0.0511 .
Temperature -0.18843    0.08909  -2.115   0.0344 *

> coef(glm.shuttle)[2] + c(-1,1) * 0.08909 * qnorm(.975)
[1] -0.36304535 -0.01381897
> exp(coef(glm.shuttle)[2] + c(-1,1) * 0.08909 * qnorm(.975))
[1] 0.6955549 0.9862761

From which we determine the confidence intervals:

    95% CI for β1:  −0.18843 ± 1.96 · 0.08909 = (−0.363, −0.0138)
    95% CI for the odds ratio (e^β1):  (e^−0.363, e^−0.0138) = (0.696, 0.986)

Relative risk

If we have two probabilities p and q (that don't necessarily add to 1), then
• odds ratio: ( p/(1 − p) ) / ( q/(1 − q) )
• relative risk: p/q

Notice that the odds ratio satisfies

    ( p/(1 − p) ) / ( q/(1 − q) ) = (p/q) · ( (1 − q)/(1 − p) )

So relative risk and odds ratio are nearly the same when (1 − q)/(1 − p) ≈ 1. This is often the case when p and q are small – rates of diseases, for example. Since relative risk is easier to interpret (I'm 3 times as likely . . . ) than odds ratio (my odds are 3 times greater that . . . ), we can use relative risk to help us get a feeling for odds ratios in some situations.

c 2007 Randall Pruim ([email protected])

April 26, 2007 Math 243 – Spring 2007 78

Multiple Regression

Multiple regression handles more than one explanatory variable. A typical model with k explanatory variables (often called predictors or regressors) has the form

    Y = β0 + β1 x1 + β2 x2 + · · · + βk xk + ε ,    where ε ∼ N(0, σ).

This model is called the general additive model.

Higher order terms and interaction

Many interesting regression models are formed by using transformations or combinations of explanatory variables as predictors. Suppose we have two predictors. Here are some possible models:
• First-order model (same as above): Y = β0 + β1 x1 + β2 x2 + ε
• Second-order, no interaction: Y = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x2² + ε
• First-order, plus interaction: Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + ε
• Complete second-order model: Y = β0 + β1 x1 + β2 x2 + β3 x1² + β4 x2² + β5 x1 x2 + ε

Categorical predictors

We can even handle categorical predictors! We do this by introducing dummy variables (less pejoratively called indicator variables). If a variable v has only two possible values – A and B – we can build an indicator variable for v as follows:

    x = 1 if v = B,   x = 0 if v ≠ B

That is, we simply code the possibilities as 0 and 1. If we have more than two possible values (levels), we introduce multiple dummy variables4 (one less than the number of levels). For a variable with three levels (A, B, and C), one standard encoding (and the one that R uses by default) is

    x1 = 1 if v = B, 0 if v ≠ B        x2 = 1 if v = C, 0 if v ≠ C

We don't need to have 3 dummy variables, since the intercept term captures the effect of one of the levels. Notice that ANOVA is simply a first-order linear model with this sort of encoding. R conveniently takes care of the recoding for us.

4 There are models that do other things. Coding with a single variable with values 0, 1, 2 is possible, for example, but it is a very different model. In this alternative there is an implied order among the groups and the effect of moving from category 0 to category 1 is the same size as the effect of moving from category 1 to category 2. This model should only be used if these assumptions make sense.

c 2007 Randall Pruim ([email protected])
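A quick way to see the dummy variables R creates is to look at the model matrix. A minimal sketch with a made-up three-level factor v (levels A, B, C, as in the description above):

> v <- factor(c("A", "A", "B", "B", "C", "C"))
> model.matrix(~ v)    # columns (Intercept), vB, vC -- the 0/1 encoding described above

The column for level A is absorbed into the intercept, which is why only two dummy variables appear.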
Interpreting the parameters

• βi for i > 0 can be thought of as adjustments to the baseline effect given by β0. (This is especially useful when the predictors are categorical and the intercept has a natural interpretation.)
• When there are no interaction or higher order terms in the model, the parameter βi can be interpreted as the amount we expect the response to change if we increase xi by 1 and leave all other predictors fixed. The effects due to different predictors are additive and do not depend on the values of the other predictors. Graphically this gives us parallel lines or planes.
• With higher order terms in the model, the dependence of the response on one predictor (with all other predictors fixed) may not be linear.
• With interaction terms, the model is no longer additive: the effect of changing one predictor may depend on the values of the other predictors. Graphically, our curves are no longer parallel.

Fitting the model

The model can again be fit using least squares (or maximum likelihood) estimators.⁵ As you can imagine, the formulas for estimated standard errors become quite complicated, but statistical software will easily output this information for you, so we will focus on using the output from R and interpreting the results.

Example 1. Let's fit models with and without an interaction term for the data in Example 13.13. The variables are strength of concrete, % limestone (x1), and water-cement ratio (x2).

> summary(lm(strength~x1*x2,xmp13.13))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.217     30.304   0.205   0.8455
x1             5.779      2.079   2.779   0.0389 *
x2            51.333     50.434   1.018   0.3555
x1:x2         -9.357      3.461  -2.704   0.0426 *

Residual standard error: 2.423 on 5 degrees of freedom
Multiple R-Squared: 0.8946, Adjusted R-squared: 0.8314
F-statistic: 14.15 on 3 and 5 DF, p-value: 0.00706

> summary(lm(strength~x1+x2,xmp13.13))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  84.8167    12.2415   6.929 0.000448 ***
x1            0.1643     0.1431   1.148 0.294673
x2          -79.6667    20.0349  -3.976 0.007313 **

Residual standard error: 3.47 on 6 degrees of freedom
Multiple R-Squared: 0.7406, Adjusted R-squared: 0.6541
F-statistic: 8.565 on 2 and 6 DF, p-value: 0.01746

> anova(lm(strength~x1*x2,xmp13.13))
Response: strength
          Df  Sum Sq Mean Sq F value   Pr(>F)
x1         1  15.870  15.870  2.7037 0.161039
x2         1 190.403 190.403 32.4376 0.002328 **
x1:x2      1  42.903  42.903  7.3090 0.042605 *
Residuals  5  29.349   5.870

> anova(lm(strength~x1+x2,xmp13.13))
Response: strength
          Df  Sum Sq Mean Sq F value   Pr(>F)
x1         1  15.870  15.870  1.3179 0.294673
x2         1 190.403 190.403 15.8117 0.007313 **
Residuals  6  72.252  12.042

Model utility test

• H0: βi = 0 for all i > 0 (all but the intercept coefficient).
  ◦ If H0 is true, then there is random variation about an overall mean response (β0), but this variation is not explained by any of our predictors – so our model is useless for predicting the response.
• Ha: βi ≠ 0 for at least one i > 0.
• Test statistic: F = MSM/MSE, where SSE = Σ(yi − ŷi)², SST = Σ(yi − ȳ)², and SSM = SST − SSE. (A short sketch after this section computes F by hand.)
• Degrees of freedom for a model with k predictors (k + 1 parameters):
  ◦ intercept (β0): 1
  ◦ model (β1 . . . βk): k
  ◦ residuals: n − 1 − k

⁵ If there are k predictors (k + 1 parameters including β0), then the least squares approach leads to a system of k + 1 equations in k + 1 unknowns, so standard linear algebra can be used to find the estimates.
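To see where the model utility F statistic comes from, here is a minimal sketch that recomputes it by hand for the interaction model in Example 1. It assumes the xmp13.13 data frame has been loaded as in that example; lm.full, Fstat, and the other names are just convenient labels. The result should match the F-statistic of 14.15 on 3 and 5 df reported by summary().

> lm.full <- lm(strength~x1*x2,xmp13.13)
> SSE <- sum(resid(lm.full)^2)                                   # unexplained variation
> SST <- sum((xmp13.13$strength - mean(xmp13.13$strength))^2)    # total variation
> SSM <- SST - SSE                                               # variation explained by the model
> k <- 3                                                         # predictors: x1, x2, x1:x2
> n <- nrow(xmp13.13)
> Fstat <- (SSM/k) / (SSE/(n - 1 - k))                           # compare with summary(lm.full)
> 1 - pf(Fstat, k, n - 1 - k)                                    # p-value for the model utility test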
April 27, 2007  Math 243 – Spring 2007

Fitting the multiple regression model (cont'd)

Individual parameters

These should look familiar by now. Estimates and estimated standard errors for each parameter are part of the summary() output. We can test hypotheses of the form H0: βi = 0 or build confidence intervals for βi from this information.

Note that these tests depend on the model as well as the parameter being tested. That is, we are testing whether or not βi = 0 in a model with a specific set of parameters. The parameter βi may be significantly non-zero in some models, but not in others.

There are methods similar to the multiple comparisons method of Tukey (that we used in the ANOVA context) that allow one to construct simultaneous confidence intervals or confidence ellipsoids for sets of parameters. We won't cover these in this course.

The other F tests

In our simple model (two first-order predictors, no interaction) the F-tests that appear in the ANOVA table are equivalent to the t tests in the coefficients table. In fact, F = t² and the degrees of freedom for the denominator of F are the same as the degrees of freedom for t. In more complex situations the F-tests and p-values are not as easy to interpret and this table is not so directly useful, but you can see how the degrees of freedom of the model utility test correspond to a dividing up of SST. This output will be more interesting when we have two categorical predictors (stay tuned).

Model Comparison

Multiple regression has opened up a world in which there is (seemingly) no end to the number of models one could choose to fit to a data set. How does one choose which one to actually use in a particular analysis?

The coefficient of determination: r²

As in simple linear regression, r² = SSM/SST measures the fraction of total variation explained by the model. A larger value of r² indicates that more of the variation is explained by the model, so generally bigger is better. But there are some difficulties in interpreting this number:

• The absolute size of a "good" r² is highly dependent on the area of application. If there is a lot of variation in the population at each possible setting of the explanatory variables, then it is impossible to get high r² values.
• If we add parameters to a model, r² will always increase. (Technically it could stay the same, but it can't decrease.) So when comparing models with different numbers of parameters, the model with more parameters has an edge. The "adjusted r²" shown in R output is one way to compensate for this. (See page 594 of Devore6 for the formula used.)

Individual parameters

If the p-value from a test for an individual parameter is not small, then we don't have convincing evidence that the parameter is non-zero. Such parameters become candidates for removal from the model.

Comparing nested models

Two models are said to be nested if the smaller model (the one with fewer parameters) is obtained from the larger model by setting some of the parameters to 0. For this situation there is a formal test:

• H0: βi = 0 for all i > l (the last k − l parameters in some ordering)
  ◦ If H0 is true, then the "reduced" model is correct.
• Ha: βi ≠ 0 for at least one i > l.
  ◦ If Ha is true, then the "full" model is better because at least some of the extra parameters are not 0.
• Test statistic: F = MSEdiff / MSEfull

The idea of this test is to take the unexplained variation from the reduced model and split it into two pieces: the portion explained by the full model but not by the reduced model (SSEdiff = SSEreduced − SSEfull) and the portion unexplained even in the full model (SSEfull). The degrees of freedom for the numerator is the difference in the number of parameters for the two models. The degrees of freedom for the denominator is the residual degrees of freedom for the full model.

> lm(strength~x1*x2,xmp13.13) -> lm.full
> lm(strength~x1+x2,xmp13.13) -> lm.noint
> anova(lm.noint,lm.full)
Analysis of Variance Table
Model 1: strength ~ x1 + x2
Model 2: strength ~ x1 + x2 + x1:x2
  Res.Df    RSS Df Sum of Sq     F  Pr(>F)
1      6 72.252
2      5 29.349  1    42.903 7.309 0.04260 *

Here RSS is what we have typically called SSE and stands for Residual Sum of Squares. Notice that

    F = [(RSSreduced − RSSfull)/dfdiff] / [RSSfull/dffull] = (42.903/1) / (29.349/5) = 7.309

In this case we get a small p-value because a large portion of the variation that the smaller model cannot explain is explained by the larger model. So it seems we should include the interaction term.

Now let's compare the full model to a model without a term for x2:

> lm(strength~x1+x1:x2,xmp13.13) -> lm.nox2
> summary(lm.nox2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.0167     1.6200  22.850  4.6e-07 ***
x1            3.7478     0.5863   6.392  0.00069 ***
x1:x2        -5.9725     0.9628  -6.203  0.00081 ***

Residual standard error: 2.43 on 6 degrees of freedom
Multiple R-Squared: 0.8728, Adjusted R-squared: 0.8304
F-statistic: 20.58 on 2 and 6 DF, p-value: 0.002058

> summary(lm.full)   # output shown below
> anova(lm.nox2, lm.full)
Analysis of Variance Table
Model 1: strength ~ x1 + x1:x2
Model 2: strength ~ x1 + x2 + x1:x2
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1      6 35.430
2      5 29.349  1     6.081 1.036 0.3555

The evidence suggests that dropping x2 from the model (but leaving the interaction term x1:x2) is appropriate.

Relationship to 1-parameter tests

If we compare two models, one being our (larger) model of interest and the other the same model with one parameter removed, we get another way to do our test for a single parameter. These tests are equivalent, and F = t².

> lm(strength~x1*x2,xmp13.13) -> lm.full
> lm(strength~x2+x1:x2,xmp13.13) -> lm.nox1
> anova(lm.nox1,lm.full)
Analysis of Variance Table
Model 1: strength ~ x2 + x1:x2
Model 2: strength ~ x1 * x2
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1      6 74.694
2      5 29.349  1    45.345 7.7251 0.03893 *

> lm(strength~x1+x2,xmp13.13) -> lm.noint
> lm(strength~x2,xmp13.13) -> lm.x2
> anova(lm.x2,lm.noint)
Analysis of Variance Table
Model 1: strength ~ x2
Model 2: strength ~ x1 + x2
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1      7 88.122
2      6 72.252  1    15.870 1.3179 0.2947

> summary(lm.full)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    6.217     30.304   0.205   0.8455
x1             5.779      2.079   2.779   0.0389 *
x2            51.333     50.434   1.018   0.3555
x1:x2         -9.357      3.461  -2.704   0.0426 *

Residual standard error: 2.423 on 5 degrees of freedom
Multiple R-Squared: 0.8946, Adjusted R-squared: 0.8314
F-statistic: 14.15 on 3 and 5 DF, p-value: 0.00706

> summary(lm.noint)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  84.8167    12.2415   6.929 0.000448 ***
x1            0.1643     0.1431   1.148 0.294673
x2          -79.6667    20.0349  -3.976 0.007313 **

Residual standard error: 3.47 on 6 degrees of freedom
Multiple R-Squared: 0.7406, Adjusted R-squared: 0.6541
F-statistic: 8.565 on 2 and 6 DF, p-value: 0.01746

Notice that this indicates that it might be OK to drop x1 from the x1+x2 model, but not from the larger model. Whether or not a parameter can be dropped depends on the rest of the model.

Diagnostics

The same sorts of diagnostic checks can be done for multiple regression as for simple linear regression. Most of these involve looking at the residuals. As before, plot(lm(...)) will make several types of residual plots. Consideration of the normality and homoscedasticity assumptions of regression should also be a part of selecting a model.

The final choice

So how do we choose a model? There are no hard and fast rules, but here are some things that play a role:

• A priori theory. Some models are chosen because some scientific theory predicts a relationship of a certain form. Statistics is used to find the most likely parameters in a model of this form. If there are competing theories, we can fit multiple models and see which seems to fit better.
• Previous experience. Models that have worked well in other similar situations may work well again.
• The data. Especially in new situations, we may only have the data to go on. Regression diagnostics, adjusted r², various hypothesis tests, and other methods like the commonly used information criteria AIC and BIC can help us choose between models. In general, it is good to choose the simplest model that works well.

There are a number of methods that have been proposed to automate the process of searching through many models to find the "best" one. One commonly used one is called stepwise regression. Stepwise regression works by repeatedly dropping or adding a single term from the model until there are no such single-parameter changes that improve the model (based on some criterion; AIC is the default in R). The function step() will do this in R. If the number of parameters is small enough, one could try all possible subsets of the parameters. This could find a "better" model than the one found by stepwise regression.

AIC: Akaike's Information Criterion

    AIC = 2k + n ln(RSS/n)

where k is the number of parameters and RSS is the residual sum of squares (SSE). Smaller is better. There are theoretical reasons for this particular formula (based on likelihood methods), but notice that the first addend increases with each parameter in the model and the second decreases; so this forms a kind of balance between the expected gains from adding new parameters and the costs of complication and potential for over-fitting. The scale of AIC is only meaningful relative to a fixed data set. There are several other criteria that have been proposed.

April 30, 2007  Math 243 – Spring 2007

More regression examples

Example 1. Where regression got its name. In the 1890's Karl Pearson gathered data on over 1100 British families. Among other things he measured the heights of parents and children. The analysis below, from his data, is for the heights of mothers and daughters.
Can you see why these data (and this analysis) led him to coin the phrase "regression toward the mean"?

> getCalvin('m243')
> require(alr3)
> data(heights)
> xyplot(Dheight~Mheight,heights)
> summary(lm(Dheight~Mheight,heights))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.91744    1.62247   18.44   <2e-16 ***
Mheight      0.54175    0.02596   20.87   <2e-16 ***

Residual standard error: 2.266 on 1373 degrees of freedom
Multiple R-Squared: 0.2408, Adjusted R-squared: 0.2402
F-statistic: 435.5 on 1 and 1373 DF, p-value: < 2.2e-16

Example 2. Rats were given a dose of a drug proportional to their body weight. The rats were then slaughtered and the amount of drug in the liver and the weight of the liver were measured.

> require(alr3); data(rat)
> summary(lm(y~BodyWt*LiverWt,rat))

Call:
lm(formula = y ~ BodyWt * LiverWt, data = rat)

Residuals:
      Min        1Q    Median        3Q       Max
-0.133986 -0.043370 -0.007227  0.036389  0.184029

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     1.573822   1.468416   1.072    0.301
BodyWt         -0.007759   0.008568  -0.906    0.379
LiverWt        -0.177329   0.198202  -0.895    0.385
BodyWt:LiverWt  0.001095   0.001138   0.962    0.351

Residual standard error: 0.09193 on 15 degrees of freedom
Multiple R-Squared: 0.1001, Adjusted R-squared: -0.07985
F-statistic: 0.5563 on 3 and 15 DF, p-value: 0.6519

> plot(lm(y~BodyWt*LiverWt,rat))

None of the parameters looks significantly non-zero. What is the interpretation?

Example 3. Home field advantage? A professor from the University of Minnesota ran tests to see if adjusting the air conditioning in the Metrodome could affect the distance a batted ball travels. Notice that Cond is categorical and indicates whether there was an (artificial) headwind or tailwind.

> require(alr3); data(domedata)
> summary(domedata)
   Cond       Velocity         Angle
 Head:19   Min.   :149.3   Min.   :48.30
 Tail:15   1st Qu.:154.1   1st Qu.:49.50
           Median :155.5   Median :50.00
           Mean   :155.2   Mean   :49.98
           3rd Qu.:156.3   3rd Qu.:50.60
           Max.   :160.9   Max.   :51.00
 ...

> lm.dome01 <- lm( Dist~Velocity+Angle+BallWt+BallDia+Cond, data=domedata)
> summary(lm.dome01)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  181.7443   335.6959   0.541  0.59252
Velocity       1.7284     0.5433   3.181  0.00357 **
Angle         -1.6014     1.7995  -0.890  0.38110
BallWt        -3.9862     2.6697  -1.493  0.14659
BallDia      190.3715    62.5115   3.045  0.00502 **
CondTail       7.6705     2.4593   3.119  0.00418 **

Residual standard error: 6.805 on 28 degrees of freedom
Multiple R-Squared: 0.5917, Adjusted R-squared: 0.5188
F-statistic: 8.115 on 5 and 28 DF, p-value: 7.81e-05

It looks like the angle and ball weight don't matter much. Velocity and direction of wind do (this makes sense). So does ball diameter; that's a bit unfortunate. Let's do some model comparisons.

> step(lm.dome01,direction="both")
Start:  AIC= 135.8
 Dist ~ Velocity + Angle + BallWt + BallDia + Cond

            Df Sum of Sq     RSS    AIC
- Angle      1     36.67 1333.24 134.75
<none>                   1296.57 135.80
- BallWt     1    103.24 1399.81 136.40
- BallDia    1    429.46 1726.03 143.53
- Cond       1    450.46 1747.03 143.94
- Velocity   1    468.68 1765.26 144.29

Step:  AIC= 134.75
 Dist ~ Velocity + BallWt + BallDia + Cond

            Df Sum of Sq     RSS    AIC
<none>                   1333.24 134.75
- BallWt     1    111.05 1444.30 135.47
+ Angle      1     36.67 1296.57 135.80
- BallDia    1    408.92 1742.16 141.84
- Velocity   1    481.14 1814.38 143.22
- Cond       1    499.48 1832.72 143.56

Call:
lm(formula = Dist ~ Velocity + BallWt + BallDia + Cond, data = domedata)

Coefficients:
(Intercept)    Velocity      BallWt     BallDia    CondTail
    133.824       1.750      -4.127     184.842       7.990

> lm.dome02 <- lm(Dist~Velocity+Cond,data=domedata)
> summary(lm.dome02)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -27.9887    83.9460  -0.333   0.7411
Velocity      2.4388     0.5404   4.513 8.63e-05 ***
CondTail      6.5118     2.5935   2.511   0.0175 *

Residual standard error: 7.497 on 31 degrees of freedom
Multiple R-Squared: 0.4513, Adjusted R-squared: 0.4159
F-statistic: 12.75 on 2 and 31 DF, p-value: 9.118e-05

> anova(lm.dome02,lm.dome01)
Analysis of Variance Table
Model 1: Dist ~ Velocity + Cond
Model 2: Dist ~ Velocity + Angle + BallWt + BallDia + Cond
  Res.Df     RSS Df Sum of Sq      F  Pr(>F)
1     31 1742.45
2     28 1296.57  3    445.88 3.2096 0.03812 *

> lm.dome03 <- lm(Dist~Velocity+Cond+BallDia,data=domedata)
> anova(lm.dome02,lm.dome03)
Analysis of Variance Table
Model 1: Dist ~ Velocity + Cond
Model 2: Dist ~ Velocity + Cond + BallDia
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)
1     31 1742.45
2     30 1444.30  1    298.15 6.193 0.01860 *

Two-way ANOVA

Two-way ANOVA is used to measure the dependence of a quantitative response on two categorical predictors. It can be thought of as multiple regression with 2 categorical variables and interaction.

In the simplest design (often attainable in experimental situations), we measure the response at every combination of the two categorical predictors the same number of times. The number of times per combination must be at least two; otherwise the model has as many parameters as the data set has values, so although we can fit the model, it will always fit perfectly and there will be no degrees of freedom left over to estimate the accuracy of the predictions. If the number of replicates at each combination of explanatory variables is not equal, then a more complicated analysis is necessary.

Example 4. More rats. Rats were given three types of poison and 4 types of treatment. The time until death was recorded. An initial fit reveals violations of our homoscedasticity assumption, so we'll try some transformations before fitting the model.

> lm(time~poison*treat,rats) -> lm.rats
> plot(lm.rats)                              # reveals some problems
> lm(log(time)~poison*treat,rats) -> lm.rats2
> plot(lm.rats2)                             # better, but not perfect
> lm(I(1/time)~poison*treat,rats) -> lm.rats3
> plot(lm.rats3)                             # looking pretty good

> anova(lm.rats3)
Analysis of Variance Table
Response: I(1/time)
             Df Sum Sq Mean Sq F value    Pr(>F)
poison        2 34.877  17.439 72.6347 2.310e-13
treat         3 20.414   6.805 28.3431 1.376e-09
poison:treat  6  1.571   0.262  1.0904    0.3867

> summary(lm.rats3)
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       2.48688    0.24499  10.151 4.16e-12
poisonII          0.78159    0.34647   2.256 0.030252
poisonIII         2.31580    0.34647   6.684 8.56e-08
treatB           -1.32342    0.34647  -3.820 0.000508
treatC           -0.62416    0.34647  -1.801 0.080010
treatD           -0.79720    0.34647  -2.301 0.027297
poisonII:treatB  -0.55166    0.48999  -1.126 0.267669
poisonIII:treatB -0.45030    0.48999  -0.919 0.364213
poisonII:treatC   0.06961    0.48999   0.142 0.887826
poisonIII:treatC  0.08646    0.48999   0.176 0.860928
poisonII:treatD  -0.76974    0.48999  -1.571 0.124946
poisonIII:treatD -0.91368    0.48999  -1.865 0.070391

Residual standard error: 0.49 on 36 degrees of freedom
Multiple R-Squared: 0.8681, Adjusted R-squared: 0.8277
F-statistic: 21.53 on 11 and 36 DF, p-value: 1.289e-12

The final model has a nice interpretation, since the reciprocal of time is a rate of death. Let's look at the ANOVA table. It displays the results of three hypothesis tests.

• A test for interaction. This is a model comparison between models with and without the (6) interaction terms. Since it is not significant, we can proceed to look at main effects. If the p-value had been small, then we could not really interpret the main effects in isolation.
• Two "main effects" tests. These show a highly significant main effect for both factors: it matters both which poison and which treatment method are used, and the effects of these are largely independent. (The "better" treatments for one poison are better for the other poisons as well, etc.)

Although the table does not show a row for residuals, each of these tests uses a statistic of the form F = MSeffect / MSresiduals. Notice that we are dividing our variation into three pieces rather than just two.

We can follow up by fitting a model without interaction and by making interaction plots. (The interaction plots could also have been used from the start as a quick look, much like we used side-by-side boxplots for 1-way ANOVA.)

> lm(I(1/time)~poison+treat,rats) -> lm.rats4
> summary(lm.rats4)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.6977     0.1744  15.473  < 2e-16 ***
poisonII      0.4686     0.1744   2.688  0.01026 *
poisonIII     1.9964     0.1744  11.451 1.69e-14 ***
treatB       -1.6574     0.2013  -8.233 2.66e-10 ***
treatC       -0.5721     0.2013  -2.842  0.00689 **
treatD       -1.3583     0.2013  -6.747 3.35e-08 ***

Residual standard error: 0.4931 on 42 degrees of freedom
Multiple R-Squared: 0.8441, Adjusted R-squared: 0.8255
F-statistic: 45.47 on 5 and 42 DF, p-value: 6.974e-16

> with(rats,interaction.plot(treat,poison,1/time))
> with(rats,interaction.plot(poison,treat,1/time))

More Examples

Example 5. Gas pump emissions.
> require(alr3); data(sniffer)

Example 6.
> getData('m243/ratsleep') -> sleep

Example 7.
> getData('m243/drumbeating') -> drum

May 7–9, 2007  Math 243 – Spring 2007

Chi-squared goodness of fit

Example 1. Golfballs

> golfballs <- c(137, 138, 107, 104)
> chisq.test(golfballs)
Chi-squared test for given probabilities
data: golfballs
X-squared = 8.4691, df = 3, p-value = 0.03725

# extra information displayed using tool from m243
> show.chisq.test(golfballs)
Chi-squared test for given probabilities
data: golfballs
X-squared = 8.4691, df = 3, p-value = 0.03725
137.00 (121.50) [1.98] < 1.41>
138.00 (121.50) [2.24] < 1.50>
107.00 (121.50) [1.73] <-1.32>
104.00 (121.50) [2.52] <-1.59>
key: observed (expected) [contribution to X-squared] <residual>
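The X-squared statistic in this output is easy to recompute by hand. Here is a minimal sketch using the golfballs vector above (expected and X2 are just convenient names); the results should agree with the 8.4691 and 0.03725 reported by chisq.test().

> expected <- sum(golfballs) * rep(1/4, 4)        # 486 observations split equally over the 4 categories: 121.5 each
> X2 <- sum((golfballs - expected)^2 / expected)  # sum of the [contribution] terms shown above
> X2
> 1 - pchisq(X2, df = 3)                          # df = number of categories - 1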
Example 2. M&M's

> mm <- c(114,123,137,84,79,74)
> nullProb <- c(.24,.20,.16,.14,.13,.13)
> show.chisq.test(mm,p=nullProb)
Chi-squared test for given probabilities
data: mm
X-squared = 23.4223, df = 5, p-value = 0.0002802
114.00 (146.64) [ 7.2652] <-2.695>
123.00 (122.20) [ 0.0052] < 0.072>
137.00 ( 97.76) [15.7506] < 3.969>
 84.00 ( 85.54) [ 0.0277] <-0.167>
 79.00 ( 79.43) [ 0.0023] <-0.048>
 74.00 ( 79.43) [ 0.3712] <-0.609>
key: observed (expected) [contribution to X-squared] <residual>

Chi-squared for two-way tables

Example 3. Smoking

> getData('m243/familySmoking') -> smoke
> names(smoke)
[1] "Student" "Parents"
> xtabs(~Student+Parents,data=smoke)
              Parents
Student        BothSmoke NeitherSmokes OneSmokes
  DoesNotSmoke      1380          1168      1823
  Smokes             400           188       416

> xtabs(~Student+Parents,data=smoke) -> smokeTab
> show.chisq.test(smokeTab)
Pearson's Chi-squared test
data: smokeTab
X-squared = 37.5663, df = 2, p-value = 6.96e-09
1380.00 (1447.51) [ 3.1488] <-1.774>
1168.00 (1102.71) [ 3.8655] < 1.966>
1823.00 (1820.78) [ 0.0027] < 0.052>
 400.00 ( 332.49) [13.7086] < 3.703>
 188.00 ( 253.29) [16.8288] <-4.102>
 416.00 ( 418.22) [ 0.0118] <-0.109>
key: observed (expected) [contribution to X-squared] <residual>

Example 4. On time arrival of airlines

> getData('m243/airlineArrival') -> air
> xtabs(~Result+Airline,air) -> airTab
> airTab
         Airline
Result    Alaska AmericaWest
  Delayed    501         787
  OnTime    3274        6438

# row percentages are less interesting (why?)
> col.perc(airTab)
         Airline
Result       Alaska AmericaWest
  Delayed 0.1327152   0.1089273
  OnTime  0.8672848   0.8910727

> chisq.test(airTab)
Pearson's Chi-squared test with Yates' continuity correction
data: airTab
X-squared = 13.3426, df = 1, p-value = 0.0002594

> show.chisq.test(airTab,correct=F)
Pearson's Chi-squared test
data: airTab
X-squared = 13.5717, df = 1, p-value = 0.0002296
 501.00 ( 442.02) [7.87] < 2.81>
 787.00 ( 845.98) [4.11] <-2.03>
3274.00 (3332.98) [1.04] <-1.02>
6438.00 (6379.02) [0.55] < 0.74>
key: observed (expected) [contribution to X-squared] <residual>

Chi-squared applied to 2 × 2 tables tends to give p-values that are too small when the cell counts are small. Yates proposed a "continuity correction" that subtracts 0.5 from |observed − expected| before squaring. His correction tends to produce p-values that are a bit too large. When cell counts are large, there is very little difference, but the Yates p-value will always be larger than the uncorrected p-value.

This is another example that shows Simpson's paradox:

> col.perc(xtabs(~Result+Airline,air, subset=Airport=="LosAngeles"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.1109123   0.1442663
  OnTime  0.8890877   0.8557337

> col.perc(xtabs(~Result+Airline,air, subset=Airport=="Phoenix"))
         Airline
Result        Alaska AmericaWest
  Delayed 0.05150215  0.07897241
  OnTime  0.94849785  0.92102759

> col.perc(xtabs(~Result+Airline,air, subset=Airport=="SanFrancisco"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.1685950   0.2873051
  OnTime  0.8314050   0.7126949

> col.perc(xtabs(~Result+Airline,air, subset=Airport=="Seattle"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.1421249   0.2328244
  OnTime  0.8578751   0.7671756

> col.perc(xtabs(~Result+Airline,air, subset=Airport=="SanDiego"))
         Airline
Result       Alaska AmericaWest
  Delayed 0.0862069   0.1450893
  OnTime  0.9137931   0.8549107

A mosaic plot provides a visual representation of what is happening in the data.
> levels(air$Airport)
[1] "LosAngeles"   "Phoenix"      "SanDiego"
[4] "SanFrancisco" "Seattle"
> levels(air$Airport)[4] <- "SanF"
> levels(air$Airport)[3] <- "SanD"
> levels(air$Airport)[1] <- "LAX"
> levels(air$Airline)
[1] "Alaska"      "AmericaWest"
> levels(air$Airline)[2] = "AmWest"
> mosaic(~Airport+Result+Airline,air,shade=T)
> levels(air$Airline)[1] = "AK"
> levels(air$Airline)[2] = "AW"
> levels(air$Result)
[1] "Delayed" "OnTime"
> levels(air$Result)[1] = "Lt"
> levels(air$Result)[2] = "OT"
> require(vcd); mosaic(~Airport+Airline+Result,air)
> mosaicplot(~Airport+Airline+Result,air)

Example 5. Gun Deaths

> guns <- getData('m243/gunDeaths')
> names(guns)
[1] "Death"   "Firearm"
> xtabs(~Death+Firearm,guns) -> gunsTab
> gunsTab
           Firearm
Death       Handgun Rifle Shotgun Unknown
  Homicides     468    15      28      13
  Suicides      124    24      22       5

> gunsTab[,-4]
           Firearm
Death       Handgun Rifle Shotgun
  Homicides     468    15      28
  Suicides      124    24      22

> show.chisq.test(gunsTab[,-4])
Pearson's Chi-squared test
data: gunsTab[, -4]
X-squared = 42.6264, df = 2, p-value = 5.544e-10
468.00 (444.22) [ 1.27] < 1.13>
 15.00 ( 29.26) [ 6.95] <-2.64>
 28.00 ( 37.52) [ 2.41] <-1.55>
124.00 (147.78) [ 3.83] <-1.96>
 24.00 (  9.74) [20.90] < 4.57>
 22.00 ( 12.48) [ 7.26] < 2.69>
key: observed (expected) [contribution to X-squared] <residual>

Entering a table by hand

> x <- c(468,15,28,124,24,22)
> matrix(x,nrow=2,byrow=T) -> xt
> xt
     [,1] [,2] [,3]
[1,]  468   15   28
[2,]  124   24   22
> chisq.test(xt)
Pearson's Chi-squared test
data: xt
X-squared = 42.6264, df = 2, p-value = 5.544e-10
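As a closing sketch, the X-squared statistic for a two-way table can also be recomputed by hand from the table xt entered above: the expected counts come from the row and column totals, and the statistic compares observed with expected cell by cell. The names expected, X2, and df are just convenient labels; the values should agree with the 42.6264 on 2 df reported by chisq.test() and show.chisq.test().

> expected <- outer(rowSums(xt), colSums(xt)) / sum(xt)   # expected count = (row total)(column total)/(grand total)
> X2 <- sum((xt - expected)^2 / expected)                  # Pearson's X-squared statistic
> X2
> df <- (nrow(xt) - 1) * (ncol(xt) - 1)                    # (rows - 1)(columns - 1) degrees of freedom
> 1 - pchisq(X2, df)                                       # p-value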