Math 143: Introduction to Biostatistics
R Pruim
Spring 2012
Last Modified: February 24, 2012
Contents

0 What Is Statistics?
  0.1 Some Definitions of Statistics
  0.2 A First Example: The Lady Tasting Tea
  0.3 Coins and Cups

1 Statistics and Samples
  1.1 Data
  1.2 Samples and Populations
  1.3 Types of Statistical Studies
  1.4 Important Distinctions

2 Graphical Summaries of Data
  2.1 Getting Started With RStudio
  2.2 Data in R
  2.3 Graphing the Distribution of One Variable
  2.4 Looking at multiple variables at once
  2.5 Exporting Plots
  2.6 A Few Bells and Whistles
  2.7 Getting Help in RStudio
  2.8 Graphical Summaries – Important Ideas

3 Numerical Summaries
  3.1 Tabulating Data
  3.2 Working with Pre-Tabulated Data
  3.3 Summarizing Distributions of Quantitative Variables
  3.4 Measures of Center
  3.5 Measures of Spread
  3.6 Summary Statistics and Transforming Data
  3.7 Summarizing Categorical Variables
  3.8 Relationships Between Two Variables

5 Probability
  5.1 Key Definitions and Ideas
  5.2 Computing Probabilities
  5.3 Probability Axioms and Rules

6 Hypothesis Testing
  6.1 The Four Step Process
  6.2 Some Additional Details
  6.3 More Examples
  6.4 What's Next?

7 Inference for a Proportion
  7.1 Binomial Distributions
  7.2 The Binomial Test
  7.3 More Examples

8 Goodness of Fit Testing
  8.1 Testing Simple Hypotheses
  8.2 Testing Complex Hypotheses
  8.3 Reporting Results of Hypothesis Tests

R Commands
0 What Is Statistics?
0.1 Some Definitions of Statistics
This is a course primarily about statistics, but what exactly is statistics? In other words, what is this course
about?[1] Here are some definitions of statistics from other people:
• a collection of procedures and principles for gaining information in order to make decisions when faced
with uncertainty (J. Utts [Utt05]),
• a way of taming uncertainty, of turning raw data into arguments that can resolve profound questions (T.
Amabile [fMA89]),
• the science of drawing conclusions from data with the aid of the mathematics of probability (S. Garfunkel
[fMA86]),
• the explanation of variation in the context of what remains unexplained (D. Kaplan [Kap09]),
• the mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of a population’s characteristics by inference from sampling (American Heritage Dictionary [AmH82]).
While not exactly the same, these definitions highlight four key elements of statistics.
Data – the raw material
Data are the raw material for doing statistics. We will learn more about different types of data, how to collect
data, and how to summarize data as we go along.
Information – the goal
The goal of doing statistics is to gain some information or to make a decision. Statistics is useful because it
helps us answer questions like the following:
• Which of two treatment plans leads to the best clinical outcomes?

• Are men or women more successful at quitting smoking? And does it matter which smoking cessation program they use?

• Is my cereal company complying with regulations about the amount of cereal in its cereal boxes?

[1] As we will see, the words statistic and statistics get used in more than one way. More on that later.
In this sense, statistics is a science – a method for obtaining new knowledge.
Uncertainty – the context
The tricky thing about statistics is the uncertainty involved. If we measure one box of cereal, how do we know
that all the others are similarly filled? If every box of cereal were identical and every measurement perfectly
exact, then one measurement would suffice. But the boxes may differ from one another, and even if we measure
the same box multiple times, we may get different answers to the question "How much cereal is in the box?"
So we need to answer questions like "How many boxes should we measure?" and "How many times should we
measure each box?" Even so, there is no answer to these questions that will give us absolute certainty. So we
need to answer questions like "How sure do we need to be?"
Probability – the tool
In order to answer a question like "How sure do we need to be?", we need some way of measuring our level of
certainty. This is where mathematics enters into statistics. Probability is the area of mathematics that deals
with reasoning about uncertainty.
0.2 A First Example: The Lady Tasting Tea
There is a famous story about a lady who claimed that tea with milk tasted different depending on whether the
milk was added to the tea or the tea added to the milk. The story is famous because of the setting in which
she made this claim. She was attending a party in Cambridge, England, in the 1920s. Also in attendance were
a number of university dons and their wives. The scientists in attendance scoffed at the woman and her claim.
What, after all, could be the difference?
All the scientists but one, that is. Rather than simply dismiss the woman’s claim, he proposed that they decide
how one should test the claim. The tenor of the conversation changed at this suggestion, and the scientists
began to discuss how the claim should be tested. Within a few minutes cups of tea with milk had been prepared
and presented to the woman for tasting.
Let’s take this simple example as a prototype for a statistical study. What steps are involved?
1. Determine the question of interest.
Just what is it we want to know? It may take some effort to make a vague idea precise. The precise
questions may not exactly correspond to our vague questions, and the very exercise of stating the question
precisely may modify our question. Sometimes we cannot come up with any way to answer the question
we really want to answer, so we have to live with some other question that is not exactly what we wanted
but is something we can study and will (we hope) give us some information about our original question.
In our example this question seems fairly easy to state: Can the lady tell the difference between the two
tea preparations? But we need to refine this question. For example, are we asking if she always correctly
identifies cups of tea or merely if she does better than we could do ourselves (by guessing)?
2. Determine the population.
Just who or what do we want to know about? Are we only interested in this one woman or women in
general or only women who claim to be able to distinguish tea preparations?
3. Select measurements.
We are going to need some data. We get our data by making some measurements. These might be physical
measurements with some device (like a ruler or a scale). But there are other sorts of measurements too,
like the answer to a question on a form. Sometimes it is tricky to figure out just what to measure. (How
do we measure happiness or intelligence, for example?) Just how we do our measuring will have important
consequences for the subsequent statistical analysis. The recorded values of these measurements are called
variables (because the values vary from one individual to another).
In our example, a measurement may consist of recording for a given cup of tea whether the woman’s claim
is correct or incorrect.
4. Determine the sample.
Usually we cannot measure every individual in our population; we have to select some to measure. But
how many and which ones? These are important questions that must be answered. Generally speaking,
bigger is better, but it is also more expensive. Moreover, no size is large enough if the sample is selected
inappropriately.
Suppose we gave the lady one cup of tea. If she correctly identifies the mixing procedure, will we be
convinced of her claim? She might just be guessing; so we should probably have her taste more than one
cup. Will we be convinced if she correctly identifies 5 cups? 10 cups? 50 cups?
What if she makes a mistake? If we present her with 10 cups and she correctly identifies 9 of the 10, what
will we conclude? A success rate of 90% is, it seems, much better than just guessing, and anyone can
make a mistake now and then. But what if she correctly identifies 8 out of 10? 80 out of 100?
And how should we prepare the cups? Should we make 5 each way? Does it matter if we tell the woman
that there are 5 prepared each way? Should we flip a coin to decide even if that means we might end up
with 3 prepared one way and 7 the other way? Do any of these differences matter?
5. Make and record the measurements.
Once we have the design figured out, we have to do the legwork of data collection. This can be a time-consuming and tedious process. In the case of the lady tasting tea, the scientists decided to present her
with ten cups of tea which were quickly prepared. A study of public opinion may require many thousands
of phone calls or personal interviews. In a laboratory setting, each measurement might be the result of a
carefully performed laboratory experiment.
6. Organize the data.
Once the data have been collected, it is often necessary or useful to organize them. Data are typically
stored in spreadsheets or in other formats that are convenient for processing with statistical packages.
Very large data sets are often stored in databases.
Part of the organization of the data may involve producing graphical and numerical summaries of the
data. These summaries may give us initial insights into our questions or help us detect errors that may
have occurred to this point.
7. Draw conclusions from data.
Once the data have been collected, organized, and analyzed, we need to reach a conclusion. Do we believe
the woman’s claim? Or do we think she is merely guessing? How sure are we that this conclusion is
correct?
Eventually we will learn a number of important and frequently used methods for drawing inferences from
data. More importantly, we will learn the basic framework used for such procedures so that it should
become easier and easier to learn new procedures as we become familiar with the framework.
8. Produce a report.
Typically the results of a statistical study are reported in some manner. This may be as a refereed article
in an academic journal, as an internal report to a company, or as a solution to a problem on a homework
assignment. These reports may themselves be further distilled into press releases, newspaper articles,
advertisements, and the like. The mark of a good report is that it provides the essential information about
each of the steps of the study.
As we go along, we will learn some of the standard terminology and procedures that you are likely to see
in basic statistical reports and will gain a framework for learning more.
At this point, you may be wondering who the innovative scientist was and what the results of the experiment
were. The scientist was R. A. Fisher, who first described this situation as a pedagogical example in his 1925 book
on statistical methodology [Fis25]. Fisher developed statistical methods that are among the most important
and widely used methods to this day, and most of his applications were biological.
0.3
Coins and Cups
You might also be curious about how the experiment came out. How many cups of tea were prepared? How
many did the woman correctly identify? What was the conclusion?
Fisher never says. In his book he is interested in the method, not the particular results. But let’s suppose we
decide to test the lady with ten cups of tea. We’ll flip a coin to decide which way to prepare the cups. If we
flip a head, we will pour the milk in first; if tails, we put the tea in first. Then we present the ten cups to the
lady and have her state which ones she thinks were prepared each way.
It is easy to give her a score (9 out of 10, or 7 out of 10, or whatever it happens to be). It is trickier to figure
out what to do with her score. Even if she is just guessing and has no idea, she could get lucky and get quite a
few correct – maybe even all 10. But how likely is that?
Let’s try an experiment. I’ll flip 10 coins. You guess which are heads and which are tails, and we’ll see how you
do.
Comparing with your classmates, we will undoubtedly see that some of you did better and others worse.
Now let’s suppose the lady gets 9 out of 10 correct. That’s not perfect, but it is better than we would expect
for someone who was just guessing. On the other hand, it is not impossible to get 9 out of 10 just by guessing.
So here is Fisher’s great idea: Let’s figure out how hard it is to get 9 out of 10 by guessing. If it’s not so hard
to do, then perhaps that’s just what happened, so we won’t be too impressed with the lady’s tea tasting ability.
On the other hand, if it is really unusual to get 9 out of 10 correct by guessing, then we will have some evidence
that she must be able to tell something.
But how do we figure out how unusual it is to get 9 out of 10 just by guessing? We’ll learn another method
later, but for now, let’s just flip a bunch of coins and keep track. If the lady is just guessing, she might as well
be flipping a coin.
So here’s the plan. We’ll flip 10 coins. We’ll call the heads correct guesses and the tails incorrect guesses.
Then we’ll flip 10 more coins, and 10 more, and 10 more, and . . . . That would get pretty tedious. Fortunately,
computers are good at tedious things, so we’ll let the computer do the flipping for us using a tool in the mosaic
package. This package is already installed in our RStudio server. If you are running your own installation of R,
you can install mosaic using the following command:
install.packages("mosaic")
The rflip() function can flip one coin
require(mosaic)
rflip()
Flipping 1 coins [ Prob(Heads) = 0.5 ] ...
H
Result: 1 heads.
or a number of coins
rflip(10)
Flipping 10 coins [ Prob(Heads) = 0.5 ] ...
H T H H H T H H H H
Result: 8 heads.
and show us the results.
Typing rflip(10) a bunch of times is almost as tedious as flipping all those coins. But it is not too hard to
tell R to do() this a bunch of times.
do(2) * rflip(10)

   n heads tails
1 10     3     7
2 10     3     7
Let’s get R to do() it for us 10,000 times and make a table of the results.
results <- do(10000) * rflip(10)
table(results$heads)

   0    1    2    3    4    5    6    7    8    9   10
   5  102  467 1203 2048 2470 2035 1140  415  108    7

perctable(results$heads)      # the table in percents

    0     1     2     3     4     5     6     7     8     9    10
 0.05  1.02  4.67 12.03 20.48 24.70 20.35 11.40  4.15  1.08  0.07

proptable(results$heads)      # the table in proportions (i.e., decimals)

     0      1      2      3      4      5      6      7      8      9     10
0.0005 0.0102 0.0467 0.1203 0.2048 0.2470 0.2035 0.1140 0.0415 0.0108 0.0007
We could also use mtable() for this.
mtable(~ heads, data = results)

    0    1    2    3    4    5    6    7    8    9   10 Total
    5  102  467 1203 2048 2470 2035 1140  415  108    7 10000

mtable(~ heads, data = results, format = "percent")

    0     1     2     3     4     5     6     7     8     9    10  Total
 0.05  1.02  4.67 12.03 20.48 24.70 20.35 11.40  4.15  1.08  0.07 100.00

mtable(~ heads, data = results, format = "proportion")

     0      1      2      3      4      5      6      7      8      9     10  Total
0.0005 0.0102 0.0467 0.1203 0.2048 0.2470 0.2035 0.1140 0.0415 0.0108 0.0007 1.0000
You might be surprised to see that the number of correct guesses is exactly 5 (half of the 10 tries) only about 25% of
the time. But most of the results are quite close to 5 correct. About 66% of the results are 4, 5, or 6, for example,
and about 89% of the results are between 3 and 7 (inclusive). But getting 8 correct is a bit unusual, and getting 9 or
10 correct is even more unusual.
So what do we conclude? It is possible that the lady could get 9 or 10 correct just by guessing, but it is not
very likely (it only happened in about 1.2% of our simulations). So one of two things must be true:
• The lady got unusually “lucky”, or
• The lady is not just guessing.
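For comparison, the exact chance of guessing this well can be computed from the binomial distribution (a topic we return to in Chapter 7) using base R's dbinom() function:

```r
# exact probability of 9 or 10 correct guesses in 10 tries,
# when each guess is correct with probability 1/2
dbinom(9, 10, 0.5) + dbinom(10, 10, 0.5)   # 11/1024, about 1.1%
```

This agrees well with the roughly 1.2% observed in our simulation.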
Although Fisher did not say how the experiment came out, others have reported that the lady correctly identified
all 10 cups! [Sal01]
1 Statistics and Samples
1.1 Data
Imagine data as a 2-dimensional structure (like a spreadsheet).
• Rows correspond to observational units (people, animals, plants, or other objects we are collecting
data about).
• Columns correspond to variables (measurements collected on each observational unit).
Observational units go by many names, depending on the kind of thing being studied. Popular names include
subjects, individuals, and cases. Whatever you call them, it is important that you always understand what your
observational units are.
Variable terminology
categorical variable a variable that places observation units into one of two or more categories (examples: color,
sex, case/control status, species, etc.)
These can be further sub-divided into ordinal and nominal variables. If the categories have a natural and
meaningful order, we will call them ordered or ordinal variables. Otherwise, they are nominal variables.
quantitative variable a variable that records measurements along some scale (examples: weight, height, age,
temperature) or counts something (examples: number of siblings, number of colonies of bacteria, etc.)
Quantitative variables can be continuous or discrete. Continuous variables can (in principle) take on any
real-number value in some range. Values of discrete variables are limited to some list and “in-between
values” are not possible. Counts are a good example of discrete variables.
response variable a variable we are trying to predict or explain
explanatory variable a variable used to predict or explain a response variable
Distributions
The distribution of a variable answers two questions:
• What values can the variable have?
• With what frequency does each value occur?
The frequency may be described in terms of counts, proportions (often called relative frequency),
or densities (more on densities later).
A distribution may be described using a table (listing values and frequencies) or a graph (e.g., a histogram).
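As a quick preview of tools covered in the next chapters, the distribution of a categorical variable can be tabulated with table(). Here we use the substance variable from the HELPrct data frame in the mosaic package (introduced in Chapter 2):

```r
require(mosaic)            # provides the HELPrct data frame
table(HELPrct$substance)   # how many subjects reported each substance
```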
1.2 Samples and Populations
population the collection of animals, plants, objects, etc. that we want to know about
sample the (smaller) set of animals, plants, objects, etc. about which we have data
parameter a number that describes a population or model.
statistic a number that describes a sample.
Much of statistics centers around this question:
What can we learn about a population from a sample?
Estimation
Often we are interested in knowing (approximately) the value of some parameter. A statistic used for this
purpose is called an estimate. For example, if you want to know the mean length of the tails of lemurs (that’s
a parameter ), you might take a sample of lemurs and measure their tails. The mean length of the tails of the
lemurs in your sample is a statistic. It is also an estimate, because we use it to estimate the parameter.
Statistical estimation methods attempt to
• reduce bias, and
• increase precision.
bias the systematic tendency of sample estimates to either overestimate or underestimate population parameters;
that is, a systematic tendency to be off in a particular direction.
precision the measure of how close estimates are to the thing being estimated (called the estimand).
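A small simulation can illustrate these two ideas. The sketch below uses an artificial population (invented just for illustration): sample means computed from larger samples cluster more tightly around the population mean, that is, they are more precise.

```r
set.seed(123)
population <- rnorm(100000, mean = 50, sd = 10)  # an artificial population

# 1000 sample means, first from samples of size 10, then from samples of size 40
means10 <- replicate(1000, mean(sample(population, 10)))
means40 <- replicate(1000, mean(sample(population, 40)))

mean(means10); mean(means40)  # both close to 50: little or no bias
sd(means10); sd(means40)      # the size-40 means vary less: better precision
```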
Sampling
Sampling is the process of selecting a sample. Statisticians use random samples
• to avoid (or at least reduce) bias, and
• so they can quantify sampling variability (the amount samples differ from each other), which in turn allows
us to quantify precision.
The simplest kind of random sample is called a simple random sample (aren’t statisticians clever about naming
things?). A simple random sample is equivalent to putting all individuals in the population into a big hat,
mixing thoroughly, and selecting some out of the hat to be in the sample. In particular, in a simple random
sample, every individual has an equal chance to be in the sample; in fact, every subset of the population of a
fixed size has an equal chance to be in the sample.
Other sampling methods include
convenience sampling using whatever individuals are easy to obtain
volunteer sampling using people who volunteer to be in the sample
systematic sampling sampling done in some systematic way (every tenth unit, for example).
stratified sampling sampling separately in distinct sub-populations (called strata)
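In R, base functions make simple random sampling easy. Here is a sketch with a small, hypothetical population of 20 individuals labeled 1 through 20 (our "big hat"):

```r
sample(1:20, 5)   # draw a simple random sample of 5 of the 20 individuals
```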
1.3 Types of Statistical Studies
Statisticians use the word experiment to mean something very specific. In an experiment, the researcher
determines the values of one or more (explanatory) variables, typically by random assignment. If there is no
such assignment by the researcher, the study is an observational study.
1.4 Important Distinctions
When learning the vocabulary in this section, it is useful to focus on the distinctions being made:
categorical vs. quantitative
(nominal vs. ordinal)
(discrete vs. continuous)
experiment vs. observational study
population vs. sample
parameter vs. statistic
biased vs. unbiased
2 Graphical Summaries of Data
2.1 Getting Started With RStudio
You can access the Calvin RStudio server via links from our course web site (http://www.calvin.edu/~rpruim/courses/m143/S12/) or from http://dahl.calvin.edu.
2.1.1 Logging in and changing your password
You will be prompted to login. Your login and password are both your Calvin userid. Once you are logged in,
you will see something like Figure 2.1.
You should change your password. Here’s how.
1. From the Tools menu, select Shell
2. Type yppasswd
3. You will be prompted for your old password, then your new password twice.
4. If you give a sufficiently strong new password (at least six letters, at least one capital, etc.) you will receive
notice that your password has been reset. If there was a problem, you will see a message about it and can
try again.
5. Once you have reset your password, click on Close to close the shell and get back to RStudio.
2.1.2 Using R as a calculator
Notice that RStudio divides its world into four panels. Several of the panels are further subdivided into multiple
tabs. The console panel is where we type commands that R will execute.
R can be used as a calculator. Try typing the following commands in the console panel.
5 + 3
[1] 8
[Figure 2.1: Welcome to RStudio.]
15.3 * 23.4
[1] 358
sqrt(16)
[1] 4
You can save values to named variables for later reuse
product = 15.3 * 23.4      # save result
product                    # show the result

[1] 358

product <- 15.3 * 23.4     # <- is assignment operator, same as =
product

[1] 358

15.3 * 23.4 -> newproduct  # -> assigns to the right
newproduct

[1] 358
.5 * product               # half of the product
[1] 179
log(product)               # (natural) log of the product

[1] 5.881

log10(product)             # base 10 log of the product

[1] 2.554

log(product, base = 2)     # base 2 log of the product
[1] 8.484
The semi-colon can be used to place multiple commands on one line. One frequent use of this is to save and
print a value all in one go:
product <- 15.3 * 23.4; product    # save result and show it
[1] 358
2.1.3 Loading packages
R is divided up into packages. A few of these are loaded every time you run R, but most have to be selected.
This way you only have as much of R as you need.
In the Packages tab, check the boxes next to the following packages to load them:
• mosaic (a package from Project MOSAIC)
• fastR (a package from a different text book)
• abd (a package for our text book)
Four Things to Know About R
1. R is case-sensitive
If you mis-capitalize something in R it won’t do what you want.
2. Functions in R use the following syntax:
functionname(argument1, argument2, ...)
• The arguments are always surrounded by (round) parentheses and separated by commas.
Some functions (like data()) have no required arguments, but you still need the parentheses.
• If you type a function name without the parentheses, you will see the code for that function – which
probably isn’t what you want at this point.
3. TAB completion and arrows can improve typing speed and accuracy.
If you begin a command and hit the TAB key, R will show you a list of possible ways to complete the
command. If you hit TAB after the opening parenthesis of a function, it will show you the list of arguments
it expects. The up and down arrows can be used to retrieve past commands.
4. If you get into some sort of mess typing (usually indicated by extra '+' signs along the left edge), you can
hit the escape key to get back to a clean prompt.
2.2 Data in R

2.2.1 Data in Packages
Most often, data sets in R are stored in a structure called a data frame. There are a number of data sets built
into R and many more that come in various add on packages. The abd package, for example, contains all the
data sets from our text book.
You can see a list of them using
data(abd)
To make things even easier, there is a function called abdData() that can be used to provide information about
where the data sets are used in the book.
abdData("human")       # find all data sets that have 'human' in their name

               name chapter    type number sub
24 HumanGeneLengths       4 Example      1
43    HumanBodyTemp      11 Example      3
abdData(chapter = 2)   # find all data sets in chapter 2

                   name chapter    type number sub
1            TeenDeaths       2 Example      1   a
2           DesertBirds       2 Example      1   b
3        SockeyeFemales       2 Example      1   c
4       GreatTitMalaria       2 Example      3
5            Hemoglobin       2 Example      4
6               Guppies       2 Example      5   a
7                  Lynx       2 Example      5   b
8     EndangeredSpecies       2 Problem      6
9       ShuttleDisaster       2 Problem     10
10          Convictions       2 Problem     16
11 ConvictionsAndIncome       2 Problem     17
12            Fireflies       2 Problem     18
13     NeotropicalTrees       2 Problem     23
You can find a longer list of all data sets available in any loaded package using
data()
2.2.2 The HELPrct data set
The HELPrct data frame from the mosaic package contains data from the Health Evaluation and Linkage to
Primary Care randomized clinical trial. You can find out more about the study and the data in this data frame
by typing
?HELPrct
Among other things, this will tell us something about the subjects in this study:
Eligible subjects were adults, who spoke Spanish or English, reported alcohol, heroin or cocaine
as their first or second drug of choice, resided in proximity to the primary care clinic to which
they would be referred or were homeless. Patients with established primary care relationships they
planned to continue, significant dementia, specific plans to leave the Boston area that would prevent
research participation, failure to provide contact information for tracking purposes, or pregnancy
were excluded.
Subjects were interviewed at baseline during their detoxification stay and follow-up interviews were
undertaken every 6 months for 2 years.
It is often handy to look at the first few rows of a data frame. It will show you the names of the variables and
the kind of data in them:
head(HELPrct)

  age anysubstatus anysub cesd d1 daysanysub dayslink drugrisk e2b female    sex g1b
1  37            1    yes   49  3        177      225        0  NA      0   male yes
2  37            1    yes   30 22          2       NA        0  NA      0   male yes
3  26            1    yes   39  0          3      365       20  NA      0   male  no
4  39            1    yes   15  2        189      343        0   1      1 female  no
5  32            1    yes   39 12          2       57        0   1      0   male  no
6  47            1    yes    6  1         31      365        0  NA      1 female  no
  homeless i1 i2 id indtot linkstatus link    mcs   pcs pss_fr racegrp satreat sexrisk
1   housed 13 26  1     39          1  yes 25.112 58.41      0   black      no       4
2 homeless 56 62  2     43         NA <NA> 26.670 36.04      1   white      no       7
3   housed  0  0  3     41          0   no  6.763 74.81     13   black      no       2
4   housed  5  5  4     28          0   no 43.968 61.93     11   white     yes       4
5 homeless 10 13  5     38          1  yes 21.676 37.35     10   black      no       6
6   housed  4  4  6     29          0   no 55.509 46.48      5   black      no       5
  substance treat
1   cocaine   yes
2   alcohol   yes
3    heroin    no
4    heroin    no
5   cocaine    no
6   cocaine   yes
That’s plenty of variables to get us started with exploration of data.
2.2.3 Using your own data
We will postpone for now a discussion about getting your own data into RStudio, but any data you can get into
a reasonable format (like csv) can be imported into RStudio pretty easily.
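As a preview, reading a csv file usually takes just one command. The file name below is hypothetical:

```r
MyData <- read.csv("mydata.csv")  # read a comma-separated file into a data frame
head(MyData)                      # inspect the first few rows
```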
2.3 Graphing the Distribution of One Variable
Recall that a distribution is described by what values occur and with what frequency. That is, the distribution
answers two questions:
• What values?
• How often?
Statisticians have devised a number of graphs to help us see distributions visually.
The general syntax for making a graph of one variable in a data frame is
plotname(~ variable, data = dataName)
In other words, there are three pieces of information we must provide to R in order to get the plot we want:
• The kind of plot (histogram(), bargraph(), densityplot(), bwplot(), etc.)
• The name of the variable
• The name of the data frame this variable is a part of.
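As a quick sketch of this template using only the built-in iris data and the lattice package (which the mosaic package used in these notes builds on; bargraph() comes from mosaic, so it is not shown here):

```r
library(lattice)  # histogram() and densityplot() live here

# The one-variable template: plotname(~ variable, data = dataName)
histogram(~ Sepal.Length, data = iris)    # distribution of a quantitative variable
densityplot(~ Sepal.Length, data = iris)  # a smoothed alternative
```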
2.3.1
Histograms (and density plots) for quantitative variables
Histograms are a way of displaying the distribution of a quantitative variable.
Here are a couple examples:
histogram(~ weight, data = Mosquitoes)
histogram(~ age, data = HELPrct)

[Histograms (Percent of Total) of weight for the Mosquitoes data and of age for the HELPrct data.]
We can control the (approximate) number of bins using the nint argument, which may be abbreviated as n.
The number of bins (and to a lesser extent the positions of the bins) can make a histogram look quite different.
histogram(~ age, data = HELPrct, n = 8)
histogram(~ age, data = HELPrct, n = 15)
histogram(~ age, data = HELPrct, n = 30)
[Three histograms of age with approximately 8, 15, and 30 bins.]
The mhistogram() function in the mosaicManip package can be used to experiment with the bins using sliders,
and the xhistogram() function in the mosaic package lets you describe the bins in terms of center and width
instead of in terms of the number of bins. This is especially nice for count data.
xhistogram(~ age, data = HELPrct, width = 10)
xhistogram(~ age, data = HELPrct, width = 5)
xhistogram(~ age, data = HELPrct, width = 2)
[Three histograms of age on the density scale, with bin widths 10, 5, and 2.]
R also provides a “smooth” version called a density plot; just change the function name from histogram() to
densityplot().
densityplot(~ weight, data = Mosquitoes)
densityplot(~ age, data = HELPrct)
[Density plots of weight (Mosquitoes) and age (HELPrct); the individual observations are marked along the horizontal axis.]

2.3.2
The shape of a distribution
If we make a histogram of our data, we can describe the overall shape of the distribution. Keep in mind that
the shape of a particular histogram may depend on the choice of bins. Choosing too many or too few bins can
hide the true shape of the distribution. (When in doubt, make more than one histogram.)

(The xhistogram() function has some eXtra features, like the width argument, but otherwise is very similar to histogram().)
Here are some words we use to describe shapes of distributions.
symmetric The left and right sides are mirror images of each other.
skewed The distribution stretches out farther in one direction than in the other. (We say the distribution is
skewed toward the long tail.)
uniform The heights of all the bars are (roughly) the same. (So the data are equally likely to be anywhere
within some range.)
unimodal There is one major “bump” where there is a lot of data.
bimodal There are two “bumps”.
outlier An observation that does not fit the overall pattern of the rest of the data.
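One way to get a feel for this vocabulary is to histogram simulated data whose shapes we know in advance (these are made-up values, not one of our course data sets):

```r
library(lattice)
set.seed(123)  # make the simulation reproducible
symmetric <- rnorm(500)                       # unimodal and symmetric
skewed    <- rexp(500)                        # skewed toward the long right tail
bimodal   <- c(rnorm(250, 0), rnorm(250, 5))  # two "bumps"
histogram(~ bimodal)  # try symmetric and skewed as well
```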
We’ll learn about another graph used for quantitative variables (a boxplot, bwplot() in R) soon.
2.3.3
Bar graphs for categorical variables
Bar graphs are a way of displaying the distribution of a categorical variable.
bargraph(~ substance, data = HELPrct)
bargraph(~ substance, data = HELPrct, horizontal = TRUE)

[Vertical and horizontal bar graphs of Freq for substance: alcohol, cocaine, heroin.]
Statisticians rarely use pie charts because they are harder to read.
2.4
Looking at multiple variables at once
2.4.1
Conditional plots
The formula for a lattice plot can be extended to create multiple panels based on a “condition”, often given
by another variable. The general syntax for this becomes
plotname(~ variable | condition, data = dataName)
For example, we might like to see how the ages of men and women compare in the HELP study, or whether the
distribution of weights of male mosquitoes is different from the distribution for females.
xhistogram(~ age | sex, HELPrct, width = 5)
xhistogram(~ weight | sex, Mosquitoes, width = 0.04)
[Histograms of age by sex (HELPrct) and of weight by sex (Mosquitoes), with one panel per sex.]
We can do the same thing for bar graphs.
bargraph(~ substance | sex, data = HELPrct)
[Bar graphs of substance in two panels, one for each sex.]
2.4.2
Grouping
Another way to look at multiple groups simultaneously is by using the groups argument. Exactly what groups
does depends a bit on the type of graph, but it will put the information in one panel rather than in multiple panels.
Using groups with histogram() doesn’t work so well because it is difficult to overlay histograms. Density
plots work better for this.
Here are some examples. We use auto.key=TRUE to build a simple legend so we can tell which groups are which.
bargraph(~ substance, groups = sex, data = HELPrct, auto.key = TRUE)
densityplot(~ age, groups = sex, data = HELPrct, auto.key = TRUE)
(The mosaic function xhistogram() does do something meaningful with groups in some situations.)
[A grouped bar graph of substance by sex and overlaid density plots of age by sex, each with a simple legend.]
We can even combine grouping and conditioning in the same plot.
densityplot(~ age | sex, groups = substance, data = HELPrct,
auto.key = TRUE)
[Density plots of age, grouped by substance, in separate panels for females and males.]
densityplot(~ age | substance, groups = sex, data = HELPrct,
auto.key = TRUE, layout = c(3, 1))
[Density plots of age, grouped by sex, in three side-by-side panels, one per substance.]
This plot shows that for each substance, the age distributions of men and women are quite similar, but that the
distributions differ from substance to substance.
2.4.3
Scatterplots
The most common way to look at two quantitative variables is with a scatter plot. The lattice function for
this is xyplot(), and the basic syntax is
xyplot(yvar ~ xvar, data = dataName)
Notice that now we have something on both sides of the ~ since we need to tell R about two variables.
xyplot(mcs ~ age, data = HELPrct)
[Scatterplot of mcs vs. age for the HELPrct data.]
Grouping and conditioning work just as before. With large data sets, it can be helpful to make the dots semi-transparent
so it is easier to see where there are overlaps. This is done with alpha. We can also make the dots
smaller (or larger) using cex.
xyplot(mcs ~ age | sex, groups = substance, data = HELPrct,
alpha = 0.6, cex = 0.5, auto.key = TRUE)
[Scatterplots of mcs vs. age in female and male panels, grouped by substance, with semi-transparent dots.]
2.5
Exporting Plots
You can save plots to files or copy them to the clipboard using the Export menu in the Plots tab. It is quite
simple to copy the plots to the clipboard and then paste them into a Word document, for example. You can
even adjust the height and width of the plot first to get it the shape you want.
R code and output can be copied and pasted as well. It’s best to use a fixed width font (like Courier) for R code
so that things align properly.
2.6
A Few Bells and Whistles
There are lots of arguments that control how these plots look. Here are just a few examples, some of which we
have already seen.
2.6.1
auto.key
auto.key=TRUE turns on a simple legend for the grouping variable. (There are ways to have more control, if
you need it.)
xyplot(Sepal.Length ~ Sepal.Width, groups = Species, data = iris,
auto.key = TRUE)
[Scatterplot of Sepal.Length vs. Sepal.Width grouped by Species, with a legend.]
2.6.2
alpha, cex
Sometimes it is nice to have elements of a plot be partly transparent. When such elements overlap, they get
darker, showing us where data are “piling up.” Setting the alpha argument to a value between 0 and 1 controls the
degree of transparency: 1 is completely opaque, 0 is invisible. The cex argument controls “character expansion”
and can be used to make the plotting “characters” larger or smaller by specifying the scaling ratio.
Here is another example using data on 150 iris plants of three species.
xyplot(Sepal.Length ~ Sepal.Width, groups = Species, data = iris,
auto.key = list(columns = 3), alpha = 0.5, cex = 1.3)
Last Modified: February 24, 2012
Math 143 : Spring 2012 : Pruim
Graphical Summaries of Data
27
setosa
versicolor
●
virginica
Sepal.Length
8
7
6
5
2.0
2.5
3.0
3.5
4.0
4.5
Sepal.Width
main, sub, xlab, ylab
You can add a title or subtitle, or change the default labels of the axes.
xyplot(Sepal.Length ~ Sepal.Width, groups = Species, data = iris,
    main = "Some Iris Data", sub = "(R. A. Fisher analyzed this data in 1936)",
    xlab = "sepal width (cm)", ylab = "sepal length (cm)", alpha = 0.5,
    auto.key = list(columns = 3))

[The same scatterplot with the title "Some Iris Data", a subtitle, and relabeled axes.]
layout
layout can be used to control the arrangement of panels in a multi-panel plot. The format is
layout = c(cols, rows)
where cols is the number of columns and rows is the number of rows. (Columns first because that is the
x-coordinate of the plot.)
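A small sketch with the built-in iris data (three species, so asking for three columns and one row puts the panels side by side):

```r
library(lattice)
# One panel per species, arranged in 3 columns x 1 row:
histogram(~ Sepal.Length | Species, data = iris, layout = c(3, 1))
```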
lty, lwd, pch, col
These can be used to change the line type, line width, plot character, and color. To specify multiples (one for
each group), use the c() function (see below).
densityplot(~ age, data = HELPrct, groups = sex, lty = 1,
lwd = c(2, 4))
histogram(~ age, data = HELPrct, col = "green")
# There are 25 numbered plot symbols
[Density plot of age with one line per sex (line widths 2 and 4); histogram of age with green bars.]
xyplot(mcs ~ age, data = HELPrct, groups = sex,
    pch = c(1, 2), col = c("brown", "darkgreen"), cex = 0.75)
[Scatterplot of mcs vs. age with different symbols and colors for each sex.]
Note: If you change these settings this way, they will not match what is generated in the legend using auto.key=TRUE.
So it can be better to set these things in a different way if you are using groups. See below.
You can see a list of the hundreds of available color names using
colors()
2.6.3
trellis.par.set()
Default settings for lattice graphics are set using trellis.par.set(). Don’t like the default font sizes? You
can change to a 7 point (base) font using
trellis.par.set(fontsize = list(text = 7))
# base size for text is 7 point
Nearly every feature of a lattice plot can be controlled: fonts, colors, symbols, line thicknesses, etc.
Rather than describe them all here, we’ll mention only that groups of these settings can be collected into a
theme. show.settings() will show you what the theme looks like.
trellis.par.set(theme = col.whitebg())  # a theme in the lattice package
show.settings()

[show.settings() displays samples of the theme’s symbols, lines, strips, and colors.]
trellis.par.set(theme = col.abd())  # a theme in the abd package
show.settings()

[show.settings() output for the abd theme.]
trellis.par.set(theme = col.fastR())  # a theme in the fastR package
show.settings()

[show.settings() output for the fastR theme.]
trellis.par.set(theme = col.fastR(bw = TRUE))  # black and white version of previous theme
show.settings()

[show.settings() output for the black-and-white fastR theme.]
trellis.par.set(theme = col.abd())           # back to the abd theme
trellis.par.set(fontsize = list(text = 9))   # and back to a 9 point font
Want to save your settings?
# save current settings
mySettings <- trellis.par.get()
# switch to abd defaults
trellis.par.set(theme = col.abd())
# switch back to my saved settings
trellis.par.set(mySettings)
2.7
Getting Help in RStudio
2.7.1
The RStudio help system
There are several ways to get RStudio to help you when you forget something. Most objects in packages have
help files that you can access by typing something like:
?bargraph
?histogram
?HELPrct
You can search the help system using
help.search("Grand Rapids")
# Does R know anything about Grand Rapids?
This can be useful if you don’t know the name of the function or data set you are looking for.
2.7.2
Tab completion
As you type the name of a function in RStudio, you can hit the tab key and it will show you a list of all the
ways you could complete that name. After you type the opening parenthesis, hitting the tab key will give you
a list of all the arguments and (sometimes) some helpful hints about what they are.
2.7.3
History
If you know you have done something before, but can’t remember how, you can search your history. The history
tab shows a list of recently executed commands. There is also a search bar to help you find things from longer
ago.
2.7.4
Error messages
When things go wrong, R tries to help you out by providing an error message. If you can’t make sense of the
message, you can try copying and pasting your command and the error message and sending them to me in an email.
One common error message is illustrated below.
fred <- 23
frd
Error: object 'frd' not found
The object frd is not found because it was mistyped. It should have been fred. If you see an “object not found”
message, check your typing and check to make sure that the necessary packages have been loaded.
2.8
Graphical Summaries – Important Ideas
2.8.1
Patterns and Deviations from Patterns
The goal of a statistical plot is to help us see
• potential patterns in the data, and
• deviations from those patterns.
2.8.2
Different Plots for Different Kinds of Variables
Graphical summaries can help us see the distribution of a variable or the relationships between two (or more)
variables. The type of plot used will depend on the kinds of variables involved. There is a nice summary of
these on page 48. You can use demo() to see how to get R to make the plots in this section.
Later, when we do statistical analysis, we will see that the analysis we use will also depend on the kinds of
variables involved, so this is an important idea.
2.8.3
Side-by-side Plots and Overlays Can Reveal Importance of Additional Factors
The lattice graphics plots make it particularly easy to generate plots that divide the data into groups and
either produce a panel for each group (using |) or display each group in a different way (different colors or
symbols, using the groups argument). These plots can reveal the possible influence of additional variables –
sometimes called covariates.
2.8.4
Area = (relative) frequency
Many plots are based on the key idea that our eyes are good at comparing areas. Plots that use area (e.g.,
histograms, mosaic plots, bar charts, pie charts) should always obey this principle:

Area = (relative) frequency

Plots that violate this principle can be deceptive and distort the true nature of the data.
An Example: Histogram with unequal bin widths
It is possible to make histograms with bins that have different widths. But in this case it is important that the
height of the bars is chosen so that area (NOT height) is proportional to frequency. Using height instead of area
would distort the picture.
When unequal bin sizes are specified, histogram() by default chooses the density scale:
histogram(~ Sepal.Length, data = iris, breaks = c(4, 5, 5.5,
5.75, 6, 6.5, 7, 8, 9))
[Histogram of Sepal.Length on the density scale, with unequal bin widths.]
The density scale is important. It tells R to use a scale such that the area (height × width) of the rectangles is
equal to the relative frequency. For example, the bar from 5.0 to 5.5 has width 1/2 and height about 0.36, so the
area is 0.18, which means approximately 18% of the sepal lengths are between 5.0 and 5.5.
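We can check this reading of the plot directly in base R, using the fact that these breaks give right-closed intervals like (5.0, 5.5]:

```r
# Proportion of iris sepal lengths falling in the bin (5.0, 5.5]
mean(iris$Sepal.Length > 5.0 & iris$Sepal.Length <= 5.5)  # 0.18
```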
It would be incorrect to choose type="count" or type="proportion" since this distorts the picture of the data.
Fortunately, R will warn you if you try:
histogram(~ Sepal.Length, data = iris, breaks = c(4, 5, 5.5,
5.75, 6, 6.5, 7, 8, 9), type = "count")
Warning message: type='count' can be misleading in this context
[Count-scale histogram with unequal bins – “Never do this!”]
Notice how different this looks. Now the heights are equal to the relative frequency, but this makes the wider
bars have too much area.
Exercises
The solutions to these exercises should be done in Word (or something similar). Include both the plots and the
code you used to make them as well as any required discussion. Once you get the plots figured out, feel free to
use some of the bells and whistles to make the plots even better.
1 Use R’s help system to find out what the i1 and i2 variables are in the HELPrct data frame. Make histograms
for each variable and comment on what you find out. How would you describe the shape of these distributions?
Do you see any outliers (observations that don’t seem to fit the pattern of the rest of the data)?
2 Compare the distributions of i1 and i2 among men and women.
3 Compare the distributions of i1 and i2 among the three substance groups.
4 Where do the data in the CPS data frame (in the mosaic package) come from? What are the observational
units? How many are there?
5 Choose a quantitative variable that interests you in the CPS data set. Make an appropriate plot and comment
on what you see.
6 Choose a categorical variable that interests you in the CPS data set. Make an appropriate plot and comment
on what you see.
7 Create a plot that displays two or more variables from the CPS data. At least one should be quantitative and
at least one should be categorical. Comment on what you can learn from your plot.
3
Numerical Summaries
3.1
Tabulating Data
A table is one kind of numerical summary of a data set. In fact, you can think of histograms and bar graphs
as graphical representations of summary tables. But sometimes it is nice to have the table itself. R provides
several ways of obtaining such tables.
3.1.1
Tabulating a categorical variable
The formula interface
There are several functions for tabulating categorical variables. xtabs() uses a syntax that is very similar to
bargraph(). We’ll call this method the formula interface. (R calls anything with a wiggle (~) a formula.)
xtabs(~ substance, data = HELPrct)

substance
alcohol cocaine  heroin 
    177     152     124 
The $-interface
table() and its cousins use the $ operator which selects one variable out of a data frame.
Mosquitoes$sex   # general syntax: dataframe$variable

 [1] female female female male   ...   (20 values in all)
Levels: female male
We’ll call this interface the $-interface.
table(HELPrct$substance)

alcohol cocaine  heroin 
    177     152     124 

perctable(HELPrct$substance)   # display percents instead of counts

alcohol cocaine  heroin 
  39.07   33.55   27.37 

proptable(HELPrct$substance)   # display proportions instead of counts

alcohol cocaine  heroin 
 0.3907  0.3355  0.2737 
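If the mosaic functions are not loaded, base R’s prop.table() produces the same proportions (and multiplying by 100 gives percents); a small made-up vector illustrates:

```r
x <- c("alcohol", "alcohol", "cocaine", "heroin")  # hypothetical tiny sample
tab <- table(x)
prop.table(tab)        # proportions, like proptable()
100 * prop.table(tab)  # percents, like perctable()
```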
Two interfaces
Some functions in R require the formula interface, some require the $-interface, and some allow you to use either
one. For example, histogram() will also work like this.
histogram(HELPrct$age)

[Histogram of age; note that the horizontal axis is now labeled “HELPrct$age”.]
But notice that the output is not quite as nice, since the default label for the horizontal axis now shows both
the data frame name and the variable name with a $ between. My advice is to use formula interfaces whenever
they are available.
3.1.2
Tabulating a quantitative variable
Although xtabs() and table() work with quantitative variables as well as categorical variables, this is only
useful when there are not too many different values for the variable.
(One of the things that the mosaic package does is provide a formula interface for many functions that only had
a $-interface before.)
xtabs(~ age, data = HELPrct)
age
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
1 2 3 8 5 8 7 13 18 15 18 18 20 28 35 18 25 23 20 18 27 10 20 10 13 7 13 5 14 5
49 50 51 52 53 54 55 56 57 58 59 60
8 2 1 1 3 1 2 1 2 2 2 1
Tabulating in bins (optional)
When there are many different values, it is more convenient to group them into bins. We just have to tell R
what the bins are. For example, suppose we wanted to group the 20s, 30s, 40s, etc. together.
binnedAge <- cut(HELPrct$age, breaks = c(10, 20, 30, 40,
50, 60, 70))
xtabs(~ binnedAge)   # no data frame given because it’s not in a data frame

binnedAge
(10,20] (20,30] (30,40] (40,50] (50,60] (60,70] 
      3     113     224      97      16       0 
table(binnedAge)   # no data frame given because it’s not in a data frame

binnedAge
(10,20] (20,30] (30,40] (40,50] (50,60] (60,70] 
      3     113     224      97      16       0 
mtable(~ binnedAge)   # no data frame given because it’s not in a data frame

(10,20] (20,30] (30,40] (40,50] (50,60] (60,70]  Total 
      3     113     224      97      16       0    453 
That’s not quite what we wanted: 30 is in with the 20s, for example. Here’s how we fix that.
binnedAge <- cut(HELPrct$age, breaks = c(10, 20, 30, 40, 50, 60, 70),
    right = FALSE)
xtabs(~ binnedAge)   # no data frame given because it’s not in a data frame

binnedAge
[10,20) [20,30) [30,40) [40,50) [50,60) [60,70) 
      1      97     232     105      17       1 
table(binnedAge)   # no data frame given because it’s not in a data frame

binnedAge
[10,20) [20,30) [30,40) [40,50) [50,60) [60,70) 
      1      97     232     105      17       1 
mtable(~ binnedAge)   # no data frame given because it’s not in a data frame

[10,20) [20,30) [30,40) [40,50) [50,60) [60,70)  Total 
      1      97     232     105      17       1    453 
We won’t use this very often, since typically seeing this information in a histogram is more useful.
Labeling a histogram
The xhistogram() function offers you the option of adding the counts to the graph.
xhistogram(~ age, data = HELPrct, label = TRUE, type = "count",
width = 10, center = 5, ylim = c(0, 300), right = FALSE)
[Histogram of age on the count scale, with each bar labeled by its count: 1, 97, 232, 105, 17, 1.]
3.1.3
Cross-tables: Tabulating two or more variables
xtabs() can also compute cross tables for two (or more) variables.
xtabs(~ sex + substance, data = HELPrct)

        substance
sex      alcohol cocaine heroin
  female      36      41     30
  male       141     111     94
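A cross table is often easier to interpret as conditional proportions; base R’s prop.table() with a margin argument does this (sketched here with the built-in mtcars data rather than HELPrct):

```r
tab <- table(mtcars$cyl, mtcars$am)  # cylinders vs. transmission type
prop.table(tab, margin = 1)          # proportions within each row; rows sum to 1
```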
We can arrange the table differently by converting it to a data frame.
as.data.frame(xtabs(~ sex + substance, data = HELPrct))
     sex substance Freq
1 female   alcohol   36
2   male   alcohol  141
3 female   cocaine   41
4   male   cocaine  111
5 female    heroin   30
6   male    heroin   94

3.2
Working with Pre-Tabulated Data
Sometimes data arrive pre-tabulated. We can use barchart() instead of bargraph() to graph pre-tabulated
data.
TeenDeaths

                         cause deaths
1                    Accidents   6688
2                     Homicide   2093
3                      Suicide   1615
4              Malignant tumor    745
5                Heart disease    463
6     Congenital abnormalities    222
7  Chronic respiratory disease    107
8      Influenza and pneumonia     73
9     Cerebrovascular diseases     67
10                 Other tumor     52
11            All other causes   1653
barchart(deaths ~ cause, data = TeenDeaths)
barchart(cause ~ deaths, data = TeenDeaths)
[Bar charts of the TeenDeaths data: vertical (deaths by cause, with crowded x-axis labels) and horizontal (cause by deaths), causes in alphabetical order.]
Notice that by default the causes are displayed in alphabetical order. R assumes that categorical data are
nominal unless you say otherwise. Here’s one way to make it ordinal using the order in which things appear in
the TeenDeaths data frame.
barchart(ordered(cause, levels = cause) ~ deaths, TeenDeaths)
(bargraph() converts raw data into a summary table and then calls barchart() to do the plotting.)
[Horizontal bar chart with the causes ordered as they appear in the TeenDeaths data frame.]
3.3
Summarizing Distributions of Quantitative Variables
Important Note
Numerical summaries are a convenient way to describe a distribution, but remember that numerical summaries
do not tell you everything there is to know about a distribution any more than knowing a few facts (e.g., age,
employment, sex) tells you everything there is to know about a person.
Notation
In statistics n (or sometimes N ) almost always means the number of observations (i.e., the number of rows in
a data frame).
If y is a variable in a data set with n observational units, we can denote the n values of y as
• y_1, y_2, y_3, ..., y_n (in the original order of the data).
• y_(1), y_(2), y_(3), ..., y_(n) (in sorted order from smallest to largest).
The text uses capital letters for this, but it is more common to use lower case letters for particular data
sets. (Capitals get used for something called random variables.)
The symbol Σ represents summation (adding up a bunch of values).

3.4
Measures of Center
Measures of center attempt to give us a sense of what is a typical value for the distribution.
    mean of y = y-bar = (y_1 + y_2 + ... + y_n) / n = (sum of values) / (number of values)

    median of y = the “middle” number (after putting the numbers in increasing order)
• The mean is the “balancing point” of the distribution.
• The median is the 50th percentile: half of the distribution is below the median, half is above.
• If the distribution is symmetric, then the mean and median are the same.
• In a skewed distribution, the mean is pulled toward the tail of the distribution (relative to the median).
• A few very large or very small values can change the mean a lot, so the mean is sensitive to outliers and
is a better measure of center when the distribution is symmetric than when it is skewed.
• The median is a resistant measure – it is not affected much by a few very large or very small values.
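A tiny made-up example makes the sensitivity of the mean (and the resistance of the median) concrete:

```r
x <- c(1, 2, 3, 4, 100)  # four small values and one outlier
mean(x)    # 22 -- dragged far toward the outlier
median(x)  # 3  -- essentially unaffected
```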
3.5
Measures of Spread
    variance of y = s_y^2 = [ sum_{i=1}^{n} (y_i - y-bar)^2 ] / (n - 1)

    standard deviation of y = s_y = sqrt(s_y^2) = square root of variance

    interquartile range = IQR = Q3 - Q1 = difference between first and third quartiles
• Roughly, the standard deviation is the “average deviation from the mean”. (That’s not exactly right
because of the squaring involved.)
• The mean and standard deviation are especially useful for describing normal distributions and other
unimodal, symmetric distributions that are roughly “bell-shaped”. (We’ll learn more about normal distributions later.)
• Like the mean, the variance and standard deviation are sensitive to outliers and less suited for summarizing
skewed distributions.
• The book shows examples of computing the variance and standard deviation by hand. This is mostly
important so we understand how these measures work, but we will typically let R do the calculations for
us.
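In R these summaries are var(), sd(), and IQR(); a small made-up example connects them (note that R’s var() computes the sample version of the formula):

```r
y <- c(2, 4, 4, 4, 5, 5, 7, 9)
var(y)                      # sample variance
sd(y)                       # standard deviation
all.equal(sd(y)^2, var(y))  # TRUE: sd is the square root of the variance
IQR(y)                      # third quartile minus first quartile
```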
To get a numerical summary of a variable (a statistic), we need to tell R the variable and the data frame involved.
There are several ways we can do this in R. Here are several ways to get the mean, for example:
mean(HELPrct$age)
# this is the most traditional way; should always work
[1] 35.65
mean(~ age, data=HELPrct)
# similar to our plotting methods; only works for some functions
[1] 35.65
with(HELPrct, mean(age))
# this can be handy when you need several variables
[1] 35.65
Using the first style, we can now compute several different statistics.
mean(HELPrct$age)
[1] 35.65
sd(HELPrct$age)
[1] 7.71
var(HELPrct$age)
[1] 59.45
median(HELPrct$age)
[1] 35
IQR(HELPrct$age)
[1] 10
favstats(HELPrct$age)

   0%   25%   50%   75%  100%  mean   sd   var
19.00 30.00 35.00 40.00 60.00 35.65 7.71 59.45
# compute favorite stats for each abused substance
summary(age ~ substance, HELPrct, fun = favstats)

age    N=453

+---------+-------+---+--+---+----+-----+----+-----+-----+-----+
|         |       |N  |0%|25%|50% |75%  |100%|mean |sd   |var  |
+---------+-------+---+--+---+----+-----+----+-----+-----+-----+
|substance|alcohol|177|20|33 |38.0|43.00|58  |38.20|7.652|58.56|
|         |cocaine|152|23|30 |33.5|37.25|60  |34.49|6.693|44.79|
|         |heroin |124|19|27 |33.0|39.00|55  |33.44|7.986|63.78|
+---------+-------+---+--+---+----+-----+----+-----+-----+-----+
|Overall  |       |453|19|30 |35.0|40.00|60  |35.65|7.710|59.45|
+---------+-------+---+--+---+----+-----+----+-----+-----+-----+
For the individual statistic-computing functions, you can also use the following method:
mean(age ~ sex, data = HELPrct)
     sex     S   N Missing
1 female 36.25 107       0
2   male 35.47 346       0

sd(age ~ sex, data = HELPrct)
     sex     S   N Missing
1 female 7.585 107       0
2   male 7.750 346       0
3.5.1 A word of caution
None of these measures (especially the mean and median) is a particularly good summary if the distribution is
not unimodal. The histogram below shows the lengths of eruptions of the Old Faithful geyser at Yellowstone
National Park.
favstats(faithful$eruptions)
   0%   25%   50%   75%  100%  mean    sd   var
1.600 2.163 4.000 4.454 5.100 3.488 1.141 1.303
histogram(~ eruptions, data = faithful, n = 20, main = "Old Faithful Eruption Times",
xlab = "duration (min)")
[Histogram titled "Old Faithful Eruption Times"; x-axis: duration (min), y-axis: Percent of Total]
Notice that the mean and median do not represent typical eruption times very well. Nearly all eruptions are
either quite a bit shorter or quite a bit longer. (This is especially true of the mean.)
3.5.2 Box plots
Boxplots (also called box-and-whisker plots) are a graphical representation of a 5-number summary of a quantitative variable. The five numbers are the five quantiles:
• Q0 , the minimum
• Q1 , the first quartile (25th percentile)
• Q2 , the median (50th percentile)
• Q3 , the third quartile (75th percentile)
• Q4 , the maximum
bwplot(~ age, data = HELPrct)
[Boxplot of age, with the horizontal axis running from 20 to 60]
Boxplots really shine when comparing multiple groups. Here is one way to do so that should look familiar from
what we know about histograms:
bwplot(~ age | substance, data = HELPrct, layout = c(1, 3))
[Boxplots of age in three stacked panels, one for each substance (alcohol, cocaine, heroin), with age running from 20 to 60]
But bwplot() has a better way. Put the quantitative variable on one side of the wiggle (~) and the categorical variable on
the other. The placement determines which goes along the vertical axis and which along the horizontal axis –
just like it did for xyplot().
bwplot(substance ~ age, data = HELPrct)
bwplot(age ~ substance, data = HELPrct)
[Two boxplots of the HELPrct data: on the left, substance on the vertical axis and age on the horizontal; on the right, age on the vertical axis and substance on the horizontal]
And we can combine this idea with conditioning:
bwplot(substance ~ age | sex, data = HELPrct)
[Boxplots of age by substance in two side-by-side panels, one for females and one for males]
3.5.3 Small data sets
When we have relatively small data sets, it does not make sense to use a boxplot. Then it is better to display
all the data. xyplot() allows you to put a categorical variable along one axis and a quantitative variable along
the other. For some data sets, either option can produce a plot that gives a good picture of the data.
xyplot(sex ~ weight, data = Mosquitoes, cex = 0.5, alpha = 0.6)
bwplot(sex ~ weight, data = Mosquitoes)
[Side by side: a dot plot and a boxplot of weight by sex for the Mosquitoes data, with weight running from 0.15 to 0.45]
R actually provides a separate function for this type of scatter plot (where one variable is categorical).
stripplot(substance ~ age, data = HELPrct)
stripplot(substance ~ age, data = HELPrct, jitter = TRUE, alpha = 0.6)
[Strip plots of age by substance: unjittered on the left, jittered with transparency on the right]

3.6 Summary Statistics and Transforming Data
statistic             original data   shifted data          rescaled data
                                      (original data + k)   (original data × c)
mean                  ȳ               ȳ + k                 ȳ · c
standard deviation    s               s                     s · |c|
variance              s²              s²                    s² · c²
median                m               m + k                 m · c
interquartile range   IQR             IQR                   IQR · |c|
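The pattern in this table is easy to check in R. Here is a quick sketch using a small made-up data vector (the particular numbers are arbitrary):

```r
x <- c(2, 4, 6, 8, 20)   # a small made-up sample

mean(x);  mean(x + 5);  mean(x * 3)   # 8, 13, 24: shifting shifts, rescaling rescales
var(x);   var(x + 5);   var(x * 3)    # 50, 50, 450: shifting leaves spread alone
IQR(x);   IQR(x + 5);   IQR(x * 3)    # 4, 4, 12
```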
3.7 Summarizing Categorical Variables
The most common summary of a categorical variable is the proportion:

p̂ = (number in one category) / n

Proportions can be expressed as fractions, decimals or percents. For example, if there are 10 observations in
one category and n = 25 observations in all, then

p̂ = 10/25 = 2/5 = 0.40 = 40%

If we code our categorical variable using 1 for observations in “the one category” and 0 for observations in any
other category, then a proportion is a sample mean:

(1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0) / 25 = 10/25
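We can check this in R with a sketch of the same example (the 0/1 vector below is hypothetical data matching the counts above):

```r
x <- c(rep(1, 10), rep(0, 15))   # 10 observations in "the one category", 15 others
mean(x)                          # 0.4, the sample proportion
```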
3.8 Relationships Between Two Variables
It is also possible to give numerical summaries of the relationship between two variables. The most common
one is the correlation coefficient, which we will learn about later.
6 Hypothesis Testing
hypothesis A statement that can be true or false
statistical hypothesis A hypothesis about the parameters of a population or random process
hypothesis testing Using data to assess statistical hypotheses
6.1 The Four Step Process
We will see many hypothesis tests over the rest of the semester. The details differ, but they all follow the
same 4-step outline:
1. State the Null and Alternative Hypotheses.
2. Compute a Test Statistic.
3. Compute a p-value.
4. Draw a conclusion.
Let’s illustrate with Example 6.2 (about the “handedness” of toads).
Step 1: State the Null and Alternative Hypotheses
Population of interest: all European Bufo bufo toads.
Parameter of interest: p = the proportion of toads that exhibit right-handed behavior.
Here are some example hypotheses about the parameter p:
H1 : p = 0.5
Toads are equally likely to be left- or right-“handed”.
H2 : p > 0.5
More toads are right-handed than left-handed.
H3 : p < 0.5
More toads are left-handed than right-handed.
H4 : p ≠ 0.5
Toads are not equally likely to be left-handed or right-handed (but we aren’t making any claim about
which “hand” is more likely to be preferred).
Hypothesis testing makes an important distinction between two hypotheses:
null hypothesis a specific claim about the value of a parameter. It is made for purposes of argument – we are
looking to see whether we have enough evidence in the data to reject the null hypothesis. Abbreviation:
H0
alternative hypothesis typically a more general hypothesis representing what happens if the null hypothesis is
false. Abbreviation: Ha or HA .
Students often get confused about which is the null and which is the alternative. If it helps, think about the
following courtroom analogy. The null hypothesis is that the defendant is innocent, and that is the operating
assumption under which the proceedings take place. Of course, the plaintiff does not believe this hypothesis, so
the plaintiff brings data hoping to show that the null hypothesis of innocence is false. The jury will assess the
data and render a verdict of “guilty” or “not guilty”:
• guilty: the evidence is strong enough to reject the assumption of innocence.
The data are not consistent (within reasonable doubt) with an innocent defendant.
• not guilty: the evidence is not strong enough to reject the assumption of innocence.
The data could reasonably have happened even if the defendant was innocent.
In our example,
H0 : p = 0.50
Ha : p ≠ 0.50 (Or possibly p > 0.50 or p < 0.50; more on that choice later.)
Step 2: Compute a Test Statistic
In a statistical hypothesis test, we summarize all the evidence with a single number called a test statistic. The
toad researchers tested a sample of 18 toads and found 14 of them to be right-handed. So x = 14 is a good test
statistic. If the null hypothesis is true, we would expect roughly half (≈ 9) of the toads to be right-handed. The
farther our count is from 9, the stronger the evidence against the null hypothesis.
Note: We could also have used p̂ = 14/18.
We’ll return to this idea later.
Step 3: Compute a p-value
Since it would be possible to see 14 out of 18 toads be right-handed just by chance, we can’t know for sure if
H0 is true or false. What we really want to know is this:
Assuming the null hypothesis is true (since that is our working assumption),
What are the chances of seeing so many right-handed toads in a random sample of 18?
This is a probability question. To answer it, we need to compare our test statistic to the null distribution:
null distribution the distribution of the test statistic assuming the null hypothesis is true.
p-value the probability of obtaining a test statistic that is at least as unusual as the test statistic calculated
from our data, assuming the null hypothesis is true.
Empirical p-values
If we can simulate what happens when the null hypothesis is true, we can calculate approximate p-values using
a computer to perform the simulations many times.
rflip(18)
Flipping 18 coins [ Prob(Heads) = 0.5 ] ...
H H H H T T T H H H T T H T T T T T
Result: 8 heads.
rtoads <- do(10000) * rflip(18) # repeat 10,000 times
proptable(rtoads$heads) # table of proportions
     2      3      4      5      6      7      8      9     10     11     12     13
0.0005 0.0031 0.0120 0.0331 0.0670 0.1258 0.1671 0.1822 0.1652 0.1209 0.0724 0.0350
    14     15     16
0.0124 0.0025 0.0008
xhistogram(~ heads, data = rtoads, width = 1, xlab = "number of heads")
proptable(rtoads$heads >= 14)
 FALSE   TRUE
0.9843 0.0157

proptable(rtoads$heads <= 4)
 FALSE   TRUE
0.9844 0.0156
[Histogram of the simulated null distribution; x-axis: number of heads, y-axis: Density]
From this we see that the probability of having 14 or more right-handed toads (assuming toads are equally likely
to be left- and right-handed) is approximately 0.0157 = 1.57%.
Similarly, the probability of having 14 or more left-handed toads (i.e., 4 or fewer right-handed toads) is approximately 0.0156 = 1.56%.
p-value = Pr(X ≥ 14) ≈ 0.0157, if Ha is p > 0.50
p-value = Pr(X ≥ 14 or X ≤ 4) ≈ 0.0157 + 0.0156 = 0.0313, if Ha is p ≠ 0.50
p-value = Pr(X ≤ 14) ≈ 0.9967, if Ha is p < 0.50
Theoretical p-values
If we can work out the probabilities for the null distribution theoretically, we can compute p-values without
simulations. We will learn how to do this soon.
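As a preview, here is a sketch of the exact version of the toad calculation, using dbinom(), which gives Binom(n, p) probabilities (this anticipates material we have not covered yet):

```r
# Exact null distribution for the toads: X ~ Binom(18, 0.5)
p_upper <- sum(dbinom(14:18, size = 18, prob = 0.5))   # Pr(X >= 14)
p_lower <- sum(dbinom(0:4, size = 18, prob = 0.5))     # Pr(X <= 4)
p_upper             # about 0.0154, close to the simulated 0.0157
p_upper + p_lower   # two-sided p-value, about 0.0309
```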
Step 4: Draw a conclusion
The smaller the p-value, the stronger the evidence against the null hypothesis. In this case, our (two-sided)
p-value is ≈ 0.03. What does this mean? One of the following:
1. The null hypothesis is false.
2. The null hypothesis is true, but we got rather unusual data.
Seeing such a large difference in the number of left- and right-handed toads would happen only about 3%
of the time if there really are equal numbers of left- and right-handed toads.
We can’t know which of these is the case, but when the p-value is small enough, then we reject the null hypothesis
based on our data because our data would be very unlikely if the null hypothesis were true.
6.2 Some Additional Details

6.2.1 Errors in Hypothesis Testing
There are two types of error we can make when making a decision about a null hypothesis:
                 H0 is true     H0 is false
reject H0        Type I error   good case
don’t reject H0  good case      Type II error
6.2.2 How small is small?
Here is a table that can help you interpret p-values.

p-value                   interpretation
0.20 ≤ p-value            no evidence against the null hypothesis
0.10 ≤ p-value ≤ 0.20     perhaps enough evidence to warrant collecting more data
0.05 ≤ p-value ≤ 0.10     suggestive evidence against the null hypothesis (probably want to confirm with additional data)
0.01 ≤ p-value ≤ 0.05     evidence against the null hypothesis
0.001 ≤ p-value ≤ 0.01    strong evidence against the null hypothesis
p-value ≤ 0.001           very strong evidence against the null hypothesis
Often researchers will decide in advance what threshold (denoted α and called the significance level) will be used
for the boundary between rejecting and not rejecting the null hypothesis. The most commonly used threshold
is α = 0.05. Another common threshold is α = 0.01.
The choice of α determines the probability of type I error, so it should be selected in consideration of the
consequences of making such an error.
6.2.3 What if the p-value is large?
When the p-value is large, our data are consistent with the null hypothesis (in the sense that they are not that
unlikely when the null hypothesis is true). But this does not imply that the null hypothesis is indeed true.
Especially when sample sizes are small, we may not have enough power to tell one way or the other.
For example, if we tested 4 toads and found 3 to be right-handed, the p-value would be 0.625.

number of right-handed toads       0       1      2      3      4
probability (assuming p = 0.50)    0.0625  0.25   0.375  0.25   0.0625
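The probabilities in this table can be reproduced with dbinom() (a sketch; dbinom() gives Binom(n, p) probabilities):

```r
dbinom(0:4, size = 4, prob = 0.5)
# [1] 0.0625 0.2500 0.3750 0.2500 0.0625

# two-sided p-value for seeing 3 (or something as extreme) out of 4:
sum(dbinom(c(0, 1, 3, 4), size = 4, prob = 0.5))   # 0.625
```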
This is consistent with a null hypothesis of equal numbers of left- and right-handed toads, but it is also consistent
with many other hypotheses like p = 0.75 or even p = 0.85. There just isn’t enough data to say much about the value
of p.
Soon we will learn about confidence intervals, which will give another way to express what we can learn about
p from our sample data.
6.2.4 Which Alternative?
Because 1-sided p-values are smaller than 2-sided p-values, it can be tempting to use them. To avoid this temptation, some statisticians recommend always using a two-sided alternative. Backing off this extreme position,
here are two questions you must answer whenever you want to use a 1-sided alternative.
Can/should I use a 1-sided alternative?
1. Can I justify the choice of a 1-sided alternative without reference to the data?
If the answer is No, you must use a 2-sided alternative.
2. If the data are unusual in the “wrong direction”, what will I conclude?
If you feel like you should reject the null hypothesis in this case, you must use a 2-sided alternative.
In the case of the toads, if we don’t know whether we should expect more left-handed or more right-handed toads
before we collect the data, we must use a 2-sided alternative. Even supposing we suspected more right-handed
toads (based on analogy to humans or other animals, perhaps) we must ask what we would do if we find many
more left-handed toads in our data. If we would find that scientifically interesting (and somewhat surprising),
we should be doing a 2-sided test.
6.2.5 Statistical Significance and Scientific Significance
When the p-value is below our threshold for rejecting the null hypothesis, we say that our results are statistically
significant. This simply means that data at least as unusual as our data are unlikely to occur by chance. It does
not necessarily mean that anything interesting is going on biologically.
6.3 More Examples

6.3.1 Doris and Buzz
In the 1960s, Jarvis Bastian (U. California) performed some experiments with two dolphins, Doris and Buzz.
He set up four underwater levers, two for each dolphin, and two lights – one that flashed and the other that
shone continuously.
In the first phase of the training, to get a fish reward, the dolphins had to press the right lever when the
continuous light was lit and the left lever when the flashing light was lit. In the second training phase, Buzz
had to push his lever first and both dolphins got a reward if Buzz was correct.
Finally, the tank was divided in two with an opaque barrier. Now only Doris could see the lights, but to get
fish, Buzz had to push the correct lever before Doris pressed hers. Would Doris be able to communicate to Buzz
which lever he should push?
1. State Hypotheses.
• H0 : Buzz is just guessing, so p = 0.50
• Ha : Buzz is not just guessing, so p > 0.50
We can choose to do a one-sided test here since we can distinguish between the two alternatives prior to
collecting data. Buzz being usually correct is different from Buzz being usually incorrect and we would
probably interpret these two situations differently. Some researchers might still argue for a 2-sided test on
the grounds that if Buzz does significantly worse than he would by just guessing, that too is an indication
that something is going on.
Last Modified: February 24, 2012
Math 143 : Spring 2012 : Pruim
Hypothesis Testing
57
2. Compute the test statistic.
In an early round of testing (more testing was done later), Buzz was correct 15 times out of 16, so x = 15
is our test statistic.
3. Compute the p-value (empirically for now)
rdolphins <- do(5000) * rflip(16)
xhistogram(~ heads, rdolphins, width = 1, xlab = "Correct Pushes by Buzz")
proptable(rdolphins$heads >= 15)
 FALSE   TRUE
0.9998 0.0002
[Histogram of the simulated null distribution; x-axis: Correct Pushes by Buzz, y-axis: Density]
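For comparison, the exact probability can be sketched with dbinom():

```r
# Pr(X >= 15) for X ~ Binom(16, 0.5): 17/65536, about 0.00026
sum(dbinom(15:16, size = 16, prob = 0.5))
```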
4. Draw a Conclusion.
It would be pretty unusual for Buzz to be correct 15 or more times out of 16 if he were just guessing,
so we have quite strong evidence that something else is going on.
Notice that statistics alone does not say what the something else might be. One natural conclusion is
that the dolphins are “talking” to each other and that Doris is telling Buzz which lever to push. But the
statistics don’t say that.
Subsequent experiments demonstrated that Buzz did not do so well when Doris wasn’t on the other side
of the barrier (so Buzz isn’t psychic or getting cues from the researchers or something like that) but that
the type of communication going on was probably not equivalent to human speech. It turns out that Doris
made one kind of noise when the light was steady and another when the light was flashing. She did this
whether or not Buzz was there; in fact, she had done this already when they were being trained with the
lights. Buzz had learned to associate the two sounds with the two lights.
When the researchers switched which kind of light went with which lever, the two dolphins were readily
able to adapt to the new rules when they could see the lights, but Buzz’s ability to choose the correct
lever when only Doris could see the lights fell off dramatically (to just over 50%).
It is important to know what statistics can tell you and what it can’t. In general, statistics is better at
pointing out that something is going on than at determining just what that something is. Careful study
designs and additional (non-statistical) information are combined to draw scientific conclusions.
6.3.2 Tim or Bob?
Quickly decide which of these is Tim and which is Bob:
If we collect data from a number of people, we can test the hypotheses
• H0 : p = 0.50 (People are equally likely to assign the names either way around.)
• Ha : p ≠ 0.50 (People are more likely to assign the names one way around than the other.)
Data from our class: 26 said Tim is on the left; 6 said Tim is on the right.
What can we conclude?
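One way to investigate (a sketch, again using exact binomial probabilities via dbinom()):

```r
# 26 of 32 students put Tim on the left; under H0, X ~ Binom(32, 0.5)
p_one_side <- sum(dbinom(26:32, size = 32, prob = 0.5))
2 * p_one_side   # two-sided p-value, about 0.0005: strong evidence against H0
```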
6.3.3 Helper-Hinderer Experiment
Do infants prefer toys that were helpful in a previously seen video? To test this, infants were shown videos
in which simple toys (geometric shapes with faces) were shown helping or hindering each other. They were
subsequently given a choice between the helper toy and the hinderer toy.
1. Is this an experiment or an observational study? (If you don’t have enough information to say, what
additional information would you need?)
2. What are the hypotheses for this study?
3. What conclusions could you draw from a small p-value?
4. What conclusions could you draw from a large p-value?
6.4 What’s Next?
1. We will learn some probability and learn how to calculate p-values for the tests in this chapter without
simulations.
2. We will learn how to deal with null hypotheses like p = 0.25 or p = 1/6 instead of just p = 1/2.
3. Eventually we will learn how to test hypotheses about different parameters altogether.
The good news is that the 4-step outline is the same for every test we will see.
5 Probability
5.1 Key Definitions and Ideas
random process A repeatable process that has multiple unpredictable potential outcomes.
Although we sometimes use language that suggests that a particular result is random, it is really the
process that is random, not its results.
outcome A potential result of a random process.
sample space The set of all potential outcomes of a random process.
event A subset of the sample space. That is, a set of outcomes (possibly all or none of the outcomes).
trial One repetition of a random process.
mutually exclusive events Events that cannot happen on the same trial.
probability A numerical value between 0 and 1 assigned to an event to indicate how often the event occurs (in
the long run).
random variable A random process that results in a numerical value.
Examples:
• Roll a die and record the number.
• Roll two dice and record the sum.
• Flip 100 coins and count the number of heads.
• Sample 1000 people and ask them if they approve of the job the president is doing.
Note: Statisticians use capital letters for random variables.
probability distribution The distribution of a random variable. (Remember that a distribution describes what
values? and with what frequency? )
5.2 Computing Probabilities
There are two main ways to calculate the probability of an event A, denoted Pr(A).
5.2.1 Empirical Method: Just Do It
Since random processes are repeatable, we could simply repeat the process over and over and keep track of how
often the event A occurs. For example, we could flip a coin 10,000 times and see what fraction are heads.1
Empirical Probability = (number of times A occurred) / (number of times the random process was repeated)
Now that we have computers, there is a new wrinkle on empirical probability. If we can simulate our random
process in the computer, then we can repeat the process many times very quickly. We did this on the first day
with our Lady Tasting Tea example.
5.2.2 Theoretical Method: Just Think About It
The theoretical method combines
1. Some basic facts that are true about every probability situation,
2. Some assumptions about the particular situation at hand, and
3. Mathematical reasoning (arithmetic, algebra, logic, etc.).
5.3 Probability Axioms and Rules

5.3.1 The Three Axioms
Let S be the sample space and let A and B be events.
1. Probability is between 0 and 1: 0 ≤ Pr(A) ≤ 1.
2. The probability of the sample space is 1: Pr(S) = 1.
3. Additivity: If A and B are mutually exclusive, then Pr(A or B) = Pr(A) + Pr(B).
5.3.2 Other Probability Rules
These rules all follow from the axioms.
The Addition Rule
If events A and B are mutually exclusive, then
Pr(A or B) = Pr(A) + Pr(B) .
More generally,
Pr(A or B) = Pr(A) + Pr(B) − Pr(A and B) .
1 This has actually been done a couple of times in history, including once by a mathematician who was a prisoner of war and
had lots of time on his hands.
The Complement Rule
Pr(not A) = 1 − Pr(A)
The Equally Likely Rule
If the sample space consists of n equally likely outcomes, then the probability of an event A is given by

Pr(A) = (number of outcomes in A) / n = |A| / |S| .
One of the most common mistakes in probability is to apply this rule when the outcomes are not equally likely.
Examples
1. Coin Toss: Pr(heads) = 1/2 if heads and tails are equally likely.
2. Rolling a Die: Pr(even) = 3/6 if the die is fair (each of the six numbers equally likely to occur).
3. Sum of two Dice: the sum is a number between 2 and 12, but these numbers are NOT equally likely. But
there are 36 equally likely combinations of two dice:
1,1 2,1 3,1 4,1 5,1 6,1
1,2 2,2 3,2 4,2 5,2 6,2
1,3 2,3 3,3 4,3 5,3 6,3
1,4 2,4 3,4 4,4 5,4 6,4
1,5 2,5 3,5 4,5 5,5 6,5
1,6 2,6 3,6 4,6 5,6 6,6
Let X be the sum of two dice.
• Pr(X = 3) = 2/36 = 1/18
• Pr(X = 7) = 6/36 = 1/6
• Pr(doubles) = 6/36 = 1/6
4. Punnett Squares

       A    a
  A    AA   Aa
  a    Aa   aa

In an Aa × Aa cross, if A is the dominant allele, then the probability of the dominant phenotype is 3/4, and
the probability of the recessive phenotype is 1/4.
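The dice calculations above are easy to verify in R by listing all 36 equally likely combinations (a quick sketch):

```r
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)   # the 36 equally likely pairs

mean(rolls$die1 + rolls$die2 == 3)   # Pr(X = 3) = 2/36
mean(rolls$die1 + rolls$die2 == 7)   # Pr(X = 7) = 6/36
mean(rolls$die1 == rolls$die2)       # Pr(doubles) = 6/36
```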
5.3.3 Conditional Probability and Independence
Example. Q. Suppose a family has two children and one of them is a boy. What is the probability that the
other is a girl?
A. We’ll make the simplifying assumption that boys and girls are equally likely (which is not exactly true).
Under that assumption, there are four equally likely families: BB, BG, GB, and GG. But only three of these
have at least one boy, so our sample space is really {BB, BG, GB}. Of these, two have a girl as well as a boy. So
the probability is 2/3 (see Figure 5.1).
[Diagram showing the four families GG, GB, BG, and BB, with the three families containing a boy grouped together; probability = 2/3]
Figure 5.1: Illustrating the sample space for Example 5.3.3.
We can also think of this in a different way. In our original sample space of four equally likely families,
Pr(at least one girl) = 3/4 ,
Pr(at least one girl and at least one boy) = 2/4 , and
(2/4) / (3/4) = 2/3 ;
so 2/3 of the time when there is at least one boy, there is also a girl. We will denote this probability as
Pr(at least one girl | at least one boy). We’ll read this as “the probability that there is at least one girl given
that there is at least one boy”. See Figure 5.2 and Definition 5.3.3.
Figure 5.2: A Venn diagram illustrating the definition of conditional probability. Pr(A | B) is the ratio of the
area of the football-shaped region that is both shaded and striped (A and B) to the area of the shaded circle
(B).
Let A and B be two events such that Pr(B) ≠ 0. The conditional probability of A given B is defined by

Pr(A | B) = Pr(A and B) / Pr(B) .
If Pr(B) = 0, then Pr(A | B) is undefined.
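We can check the two-children example against this definition by enumerating the sample space (a sketch):

```r
families <- c("BB", "BG", "GB", "GG")   # four equally likely families
boy  <- grepl("B", families)            # at least one boy
girl <- grepl("G", families)            # at least one girl

# Pr(at least one girl | at least one boy) = Pr(girl and boy) / Pr(boy)
sum(girl & boy) / sum(boy)              # 2/3
```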
Example. A class of 5th graders was asked what color should be used for the class T-shirt, red or purple. The
table below contains a summary of the students’ responses:
Color     Girls   Boys
Red         7      10
Purple      9       8
Q. Suppose we randomly select a student from this class. Let R be the event that a child prefers a red T-shirt.
Let B be the event that the child is a boy, and let G be the event that the child is a girl. Express each of the
following probabilities in words and determine their values:
• Pr(R),
• Pr(B | R),
• Pr(G | R),
• Pr(R | B),
• Pr(R | G),
• Pr(B | G).
A. The conditional probabilities can be computed in two ways. We can use the formula from the definition of
conditional probability directly, or we can consider the conditioning event to be a new, smaller sample space and
read the conditional probability from the table.
• Pr(R) = 17/34 = 1/2 because 17 of the 34 kids prefer red.
This is the probability that a randomly selected student prefers red.
• Pr(R | B) = (10/34) / (18/34) = 10/18 because 10 of the 18 boys prefer red.
This is the probability that a randomly selected boy prefers red.
• Pr(B | R) = (10/34) / (17/34) = 10/17 because 10 of the 17 students who prefer red are boys.
This is the probability that a randomly selected student who prefers red is a boy.
• Pr(R | G) = (7/34) / (16/34) = 7/16 because 7 of the 16 girls prefer red.
This is the probability that a randomly selected girl prefers red.
• Pr(G | R) = (7/34) / (17/34) = 7/17 because 7 of the 17 kids who prefer red are girls.
This is the probability that a randomly selected kid who prefers red is a girl.
• Pr(B | G) = 0 / (16/34) = 0 because none of the girls are boys.
This is the probability that a randomly selected girl is a boy.
One important use of conditional probability is as a tool to calculate the probability of an intersection.
Let A and B be events with non-zero probability. Then
Pr(A and B) = Pr(A) · Pr(B | A)
= Pr(B) · Pr(A | B) .
This follows directly from the definition of conditional probability by a little bit of algebra and can be
generalized to more than two events.
Example. Q. If you roll two standard dice, what is the probability of doubles? (Doubles is when the two
numbers match.)
A. Let A be the event that we get a number between 1 and 6 on the first die. So Pr(A) = 1. Let B be the event that
the second number matches the first. Then the probability of doubles is Pr(A and B) = Pr(A) · Pr(B | A) = 1 · 1/6 = 1/6,
since regardless of what is rolled on the first die, 1 of the 6 possibilities for the second die will match it.
Example. Q. A 5-card hand is dealt from a standard 52-card deck. What is the probability of getting a flush
(all cards the same suit)?
A. Imagine dealing the cards in order. Let Ai be the event that the ith card is the same suit as all previous
cards. Then

Pr(flush) = Pr(A1 and A2 and A3 and A4 and A5)
          = Pr(A1) · Pr(A2 | A1) · Pr(A3 | A1 and A2) · Pr(A4 | A1 and A2 and A3) · Pr(A5 | A1 and A2 and A3 and A4)
          = 1 · 12/51 · 11/50 · 10/49 · 9/48 .
Example. Q. In a bowl are 4 red Valentine hearts and 2 blue Valentine hearts. If you reach in without looking
and select two of the Valentines, let X be the number of blue Valentines. Fill in the following probability table.

value of X     0    1    2
probability

A. Pr(X = 2) = Pr(first is blue and second is blue) = Pr(first is blue) · Pr(second is blue | first is blue) = 2/6 · 1/5 = 2/30 .
Similarly, Pr(X = 0) = Pr(first is red and second is red) = Pr(first is red) · Pr(second is red | first is red) = 4/6 · 3/5 = 12/30 .
Finally, Pr(X = 1) = 1 − Pr(X = 0) − Pr(X = 2) = 1 − 14/30 = 16/30 .
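We can double-check this table by enumerating all ordered draws of two Valentines (a sketch; the labels "r" and "b" are just for illustration):

```r
hearts <- c("r", "r", "r", "r", "b", "b")         # 4 red, 2 blue
pairs  <- expand.grid(first = 1:6, second = 1:6)
pairs  <- subset(pairs, first != second)          # 30 equally likely ordered draws

X <- (hearts[pairs$first] == "b") + (hearts[pairs$second] == "b")
table(X) / 30    # Pr(X = 0) = 12/30, Pr(X = 1) = 16/30, Pr(X = 2) = 2/30
```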
We can represent this using a tree diagram as well.
[Tree diagram: the first split is blue (2/6) vs. red (4/6); after a blue first pick, the second is blue with probability 1/5 or red with probability 4/5; after a red first pick, blue with probability 2/5 or red with probability 3/5. The branch probabilities multiply to 2/6 · 1/5 = 2/30, 2/6 · 4/5 = 8/30, 4/6 · 2/5 = 8/30, and 4/6 · 3/5 = 12/30.]
The edges in the tree represent conditional probabilities, which we can multiply together to get the probability that
all events on a particular branch happen. The first level of branching represents what kind of Valentine is selected
first; the second level represents the second selection.
Example. Q. Suppose a test correctly identifies diseased people 99% of the time and correctly identifies healthy
people 98% of the time. Furthermore assume that in a certain population, one person in 1000 has the disease.
If a random person is tested and the test comes back positive, what is the probability that the person has the
disease?
A. We begin by introducing some notation. Let D be the event that a person has the disease. Let H be the
event that the person is healthy. Let + be the event that the test comes back positive (meaning it indicates
disease – probably a negative from the perspective of the person tested). Let − be the event that the test is
negative.
• Pr(D) = 0.001, so Pr(H) = 0.999.
• Pr(+ | D) = 0.99, so Pr(− | D) = 0.01.
Pr(+ | D) is called the sensitivity of the test. (It tells how sensitive the test is to the presence of the
disease.)
• Pr(− | H) = 0.98, so Pr(+ | H) = 0.02.
Pr(− | H) is called the specificity of the test.
• Pr(D | +) = Pr(D and +) / Pr(+)
            = [Pr(D) · Pr(+ | D)] / [Pr(D and +) + Pr(H and +)]
            = (0.001 · 0.99) / (0.001 · 0.99 + 0.999 · 0.02) = 0.0472.
A tree diagram is a useful way to visualize these calculations.
[Tree diagram: H with probability 0.999 splits into + (0.02, branch probability 0.01998) and − (0.98, branch probability 0.979); D with probability 0.001 splits into + (0.99, branch probability 0.00099) and − (0.01, branch probability 0.00001).]
This low probability surprises most people the first time they see it. This means that if the test result of a
random person comes back positive, the probability that that person has the disease is less than 5%, even
though the test is “highly accurate”. This is one reason why we do not routinely screen an entire population for
a rare disease – such screening would produce many more false positives than true positives.
Of course, if a doctor orders a test, it is usually because there are some other symptoms. This changes the a
priori probability that the patient has the disease.
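The posterior probability above is simple arithmetic and is easy to verify in any language; here is a minimal Python sketch of the same calculation (Python is used purely as a calculator here; the course software is R):

```python
# Bayes' rule for the medical test example:
# Pr(D | +) = Pr(D) * Pr(+ | D) / [ Pr(D) * Pr(+ | D) + Pr(H) * Pr(+ | H) ]
prior = 0.001        # Pr(D): one person in 1000 has the disease
sensitivity = 0.99   # Pr(+ | D)
specificity = 0.98   # Pr(- | H), so Pr(+ | H) = 1 - specificity

true_pos = prior * sensitivity                 # Pr(D and +)
false_pos = (1 - prior) * (1 - specificity)    # Pr(H and +)
posterior = true_pos / (true_pos + false_pos)  # Pr(D | +)
print(round(posterior, 4))  # 0.0472
```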
7 Inference for a Proportion

7.1 Binomial Distributions

7.1.1 The Binomial Model
We will say that X is a binomial random variable if:
1. The random process consists of n trials (n is known in advance).
2. Each trial has two outcomes (traditionally called success and failure).
3. The probability of success (p) is the same for each trial.
4. The trials are independent.
5. The random variable X counts the number of successes.
We will abbreviate this: X ∼ Binom(n, p).
The binomial model is a model of randomly sampling from a large population and recording the result of a
two-level categorical variable for each observational unit.
7.1.2 Some Simple Examples of Binomial Distributions
The simplest binomial distribution is the Binom(1,.5)-distribution. This models (among other things) one toss
of a fair coin. The probabilities are easy to work out:
value        0    1
probability  0.5  0.5
Things are not much harder if we allow any probability p. The Binom(1, p) distribution is summarized in the
following table:
value        0      1
probability  1 − p  p

To keep the notation simpler, sometimes we will let q = 1 − p. Then the Binom(1, p)-distribution becomes

value        0  1
probability  q  p
Let’s try n = 2.
• Pr(X = 2) = p · p = p²
• Pr(X = 0) = q · q = q²
• so Pr(X = 1) = 1 − p² − q² = 1 − p² − (1 − p)² = 1 − p² − 1 + 2p − p² = −2p² + 2p = 2(p)(1 − p) = 2pq.
Alternatively, we can use Pr(X = 1) = Pr(SF or FS) = pq + qp = 2pq.

value        0   1    2
probability  q²  2pq  p²
Time for n = 3. Pr(X = 0) = q³ (fail three times in a row) and Pr(X = 3) = p³ (succeed three times in a row) are relatively simple. For Pr(X = 1) we observe that there are 3 possible ways to obtain one success: FFS, FSF, SFF. Each of these has probability pq². A similar argument gives us Pr(X = 2) and allows us to complete the table.
value        0   1     2     3
probability  q³  3pq²  3p²q  p³

7.1.3 The General Formula

For X ∼ Binom(n, p), the general formula is

    Pr(X = k) = ___ · p^___ · q^___

We just need to fill in the three boxes. The two exponents tell us how many successes (k) and how many failures (n − k) we are asking about. The other box counts the number of ways to rearrange k successes and n − k failures. Mathematicians have developed special notation for this important number as well as a formula for computing it based on factorials:

    (n choose k) = n! / (k! (n − k)!)

Putting this all together we get

    Pr(X = k) = (n choose k) · p^k · q^(n−k)

(n choose k) is read “n choose k” and counts the number of ways to select a subset of size k from a set of size n.
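The general formula translates directly into code. Here is a sketch in Python rather than R, purely as a cross-check (the helper names `choose` and `binom_pmf` are ours, not part of any library):

```python
from math import factorial

def choose(n, k):
    """Number of ways to select k successes among n trials: n! / (k! (n-k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ Binom(n, p): (n choose k) * p^k * q^(n-k)."""
    q = 1 - p
    return choose(n, k) * p**k * q**(n - k)

print(choose(5, 2))                     # 10
print(round(binom_pmf(2, 5, 0.25), 4))  # 0.2637, matching dbinom(2, 5, 0.25)
```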
Example. Q. Compute (5 choose 2).
A. Let the 5 choices be A, B, C, D, and E. Here is the list of subsets of size 2:

AB  AC  AD  AE  BC  BD  BE  CD  CE  DE

There are ten, which matches our formula:

    (5 choose 2) = 5! / (2! 3!) = (5 · 4 · 3 · 2 · 1) / (2 · 1 · 3 · 2 · 1) = (5 · 4) / (2 · 1) = 10.

Example. Let X ∼ Binom(5, 0.25), then Pr(X = 2) = (5 choose 2) (0.25)² (0.75)³ = 10 (0.25)² (0.75)³ = 0.2637.
dbinom(2, 5, 0.25)
[1] 0.2637
factorial(5)/(factorial(2) * factorial(3)) * 0.25^2 * 0.75^3
[1] 0.2637
choose(5, 2) * 0.25^2 * 0.75^3
[1] 0.2637
7.1.4 Binomial Distributions in R
Let X ∼ Binom(18, 0.5), then
• rbinom(100, 18, 0.5) simulates a Binom(18, 0.5)-random variable 100 times.
rbinom(100, 18, 0.5)
[1] 10 11 11 12  9  7  8  9 10 14  5 ...
(100 simulated values; your results will vary since they are random)
• dbinom(14, 18, 0.5) computes Pr(X = 14) when X ∼ Binom(18, 0.5).
dbinom(14, 18, 0.5)
[1] 0.01167
• pbinom(14, 18, 0.5) computes Pr(X ≤ 14) when X ∼ Binom(18, 0.5).
pbinom(14, 18, 0.5)
[1] 0.9962
We can use pbinom() to compute the p-value for our toad study data. Recall that the 2-sided p-value is
Pr(X ≥ 14 or X ≤ 4) = Pr(X ≤ 4) + Pr(X ≥ 14).
pbinom(4, 18, 0.5) + 1 - pbinom(13, 18, 0.5)  # p-value from the toads study
[1] 0.03088
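The same p-value can be assembled directly from the binomial formula; a Python sketch of the sum (the helper `binom_pmf` is ours, mirroring what R's dbinom() computes):

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr(X = k) for X ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two-sided p-value for the toad study: Pr(X <= 4) + Pr(X >= 14), X ~ Binom(18, 0.5)
n, p = 18, 0.5
p_value = sum(binom_pmf(k, n, p) for k in range(0, 5)) + \
          sum(binom_pmf(k, n, p) for k in range(14, n + 1))
print(round(p_value, 5))  # 0.03088
```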
The plotDist() function can be used to display a binomial distribution graphically:
plotDist("binom", params = c(18, 0.4))
plotDist("binom", params = c(18, 0.4), kind = "histogram")
[Figure: the Binom(18, 0.4) distribution plotted as probability points (left) and as a probability histogram (right), for values 0 to 18.]

7.2 The Binomial Test
The test that uses the binomial distribution to compute a p-value for the hypotheses
• H0 : p = p0
• Ha : p ≠ p0 (or p < p0 or p > p0)
is called the binomial test.
R automates the entire process for us if we provide a summary of the data, the value p0 , and whether we want
a 1-sided or 2-sided alternative:
binom.test(14, 18, p = 0.5)  # 2-sided by default
Exact binomial test
data: x and n
number of successes = 14, number of trials = 18, p-value = 0.03088
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5236 0.9359
sample estimates:
probability of success
0.7778
There is more information in this output than we need at this point, but notice the statement of the alternative hypothesis (so we know we have done a 2-sided test) and the p-value. If you only want the p-value, you can get it with
pval(binom.test(14, 18, p = 0.5))
p.value
0.03088
or
pval(binom.test(14, 18, p = 0.5), verbose = TRUE)
Method: Exact binomial test
Null Hypothesis: probability of success = 0.5
Alt. Hypothesis: probability of success <> 0.5
number of successes = 14
(number of trials = 18)
p-value = 0.03088
p.value
0.03088
If we want to do a 1-sided test, R will happily do that for us as well:
pval(binom.test(14, 18, p = 0.5, alternative = "greater"))  # 1-sided test
p.value
0.01544
pval(binom.test(14, 18, p = 0.5, alternative = "less"))  # the other 1-sided test
p.value
0.9962
The word ’exact’ in the output indicates that this method is not using simulations or approximations but is
calculating the p-value exactly from the sampling distribution. Later we will learn some approximate methods
for this same situation.
7.3 More Examples
Example. Here is R code for Example 7.2 in the text.
binom.test(10, 25, p = 0.061)
Exact binomial test
data: x and n
number of successes = 10, number of trials = 25, p-value = 9.94e-07
alternative hypothesis: true probability of success is not equal to 0.061
95 percent confidence interval:
0.2113 0.6133
sample estimates:
probability of success
0.4
Note that this p-value doesn’t match the one in the book. That is because R is using a more accurate method
for two-sided tests. See the footnote on page 160.
8 Goodness of Fit Testing
Goodness of fit tests test how well a distribution fits some hypothesis.
8.1 Testing Simple Hypotheses

8.1.1 Golfballs in the Yard Example
Allan Rossman used to live along a golf course. One summer he collected 500 golf balls in his back yard and tallied the number on each golf ball to see if the numbers were equally likely to be 1, 2, 3, or 4.
Step 1: State the Null and Alternative Hypotheses.
• H0 : The four numbers are equally likely (in the population of golf balls driven 150–200 yards and sliced).
• Ha : The four numbers are not equally likely (in the population).
If we let pi be the proportion of golf balls with the number i on them, then we can restate the null hypothesis
as
• H0 : p1 = p2 = p3 = p4 = 0.25.
• Ha : The four numbers are not equally likely (in the population).
Step 2: Compute a test statistic. Here are Allan’s data:
golfballs
  1   2   3   4 
137 138 107 104
(Fourteen golf balls didn’t have any of these numbers and were removed from the data leaving 486 golf balls to
test our null hypotheses.)
Although we made up several potential test statistics in class, the standard test statistic for this situation is
The Chi-squared test statistic:

    χ² = Σ (observed − expected)² / expected
There is one term in this sum for each cell in our data table, and
• observed = the tally in that cell (a count from our raw data)
• expected = the number we would “expect” if the percentages followed our null hypothesis exactly.
(Note: the expected counts might not be whole numbers.)
In our particular example, we would expect 25% of the 486 (i.e., 121.5) golf balls in each category, so

    χ² = (137 − 121.5)²/121.5 + (138 − 121.5)²/121.5 + (107 − 121.5)²/121.5 + (104 − 121.5)²/121.5 = 8.469
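For readers who want to verify the arithmetic outside of R, the statistic is a one-liner; a Python sketch (Python used purely as a calculator here):

```python
observed = [137, 138, 107, 104]
n = sum(observed)                # 486 golf balls
expected = [n / 4] * 4           # H0: all four numbers equally likely, so 121.5 each

chi_squared = sum((o - e)**2 / e for o, e in zip(observed, expected))
print(round(chi_squared, 3))  # 8.469
```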
Step 3: Compute a p-value Our test statistic will be large when the observed counts and expected counts are
quite different. It will be small when the observed counts and expected counts are quite close. So we will reject
when the test statistic is large. To know how large is large enough, we need to know the sampling distribution.
If H0 is true and the sample is large enough, then the sampling distribution for the Chi-squared test
statistic will be approximately a Chi-squared distribution.
• The degrees of freedom for this type of goodness of fit test is one less than the number of cells.
• The approximation gets better and better as the sample size gets larger.
The mean of a Chi-squared distribution is equal to its degrees of freedom. This can help us get a rough idea
about whether our test statistic is unusually large or not.
How unusual is it to get a test statistic at least as large as ours (8.47)? We compare to a Chi-squared distribution
with 3 degrees of freedom. The mean value of such a statistic is 3, and our test statistic is almost 3 times as
big, so we anticipate that our value is rather unusual, but not extremely unusual. This is the case:
1 - pchisq(8.469, df = 3)
[1] 0.03725
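Because the chi-squared distribution is only an approximation, it is reassuring to estimate this tail probability empirically as well, in the same spirit as statTally. A Python sketch (the 5000-replicate count is an arbitrary choice; the course would do this in R):

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is reproducible

def chi_squared(observed, expected):
    return sum((o - e)**2 / e for o, e in zip(observed, expected))

n, cells, reps = 486, 4, 5000
expected = [n / cells] * cells
observed_stat = chi_squared([137, 138, 107, 104], expected)  # 8.469

# Simulate 486 golf balls with equally likely numbers, tally, recompute the statistic
count_larger = 0
for _ in range(reps):
    tally = Counter(random.randrange(cells) for _ in range(n))
    sim = chi_squared([tally[i] for i in range(cells)], expected)
    if sim >= observed_stat:
        count_larger += 1

# Estimate of Pr(chi-squared stat >= 8.469); compare to 0.03725 from pchisq above
print(count_larger / reps)
```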
Step 4: Interpret the p-value and Draw a Conclusion. Based on this smallish p-value, we can reject our null
hypothesis at the usual α = 0.05 level. We have sufficient evidence to cast doubt on the hypothesis that all the
numbers are equally common among the population of golfballs.
Automation. Of course, R can automate this whole process for us if we provide the data table and the null
hypothesis. The default null hypothesis is that all the probabilities are equal, so in this case it is enough to give
R the golf balls data.
chisq.test(golfballs)
Chi-squared test for given probabilities
data: golfballs
X-squared = 8.469, df = 3, p-value = 0.03725
8.1.2 Plant Genetics Example
A biologist is conducting a plant breeding experiment in which plants can have one of four phenotypes. If these
phenotypes are caused by a simple Mendelian model, the phenotypes should occur in a 9:3:3:1 ratio. She raises
41 plants with the following phenotypes.
phenotype
1
2
3
4
count
20
10
7
4
Should she worry that the simple genetic model doesn’t work for her phenotypes?
plants <- c(20, 10, 7, 4)
chisq.test(plants, p = c(9/16, 3/16, 3/16, 1/16))
Warning message: Chi-squared approximation may be incorrect
Chi-squared test for given probabilities
data: plants
X-squared = 1.97, df = 3, p-value = 0.5786
Things to notice:
• plants <- c( 20, 10, 7, 4 ) is the way to enter this kind of data by hand.
• This time we need to tell R what the null hypothesis is. That’s what p=c(9/16, 3/16, 3/16, 1/16) is
for.
• The Chi-squared distribution is only an approximation to the sampling distribution of our test statistic,
and the approximation is not very good when the expected cell counts are too small. This is the reason
for the warning.
8.1.3 Small Cell Counts
The chi-squared approximation is not very good when expected cell counts are too small. Our rule of thumb
will be this:
The chi-squared approximation is good enough when
• All the expected counts are at least 1.
• Most (at least 80%) of the expected counts are at least 5.
Notice that this depends on expected counts, not observed counts.
Our expected count for phenotype 4 is only (1/16) · 41 = 2.562, which means that only 75% of our expected counts are at least 5, so our approximation will be a little crude. (The other expected counts are well within the safe range.) If you are too lazy to calculate all those expected counts yourself, you can use xchisq.test() in the mosaic package, which prints out some additional information about our test. If we are getting sick of typing those null probabilities over and over, we could save them, too:
nullHypothesis <- c(9/16, 3/16, 3/16, 1/16)
xchisq.test(plants, p = nullHypothesis)
Warning message: Chi-squared approximation may be incorrect
Chi-squared test for given probabilities
data: plants
X-squared = 1.97, df = 3, p-value = 0.5786
20.00
(23.06)
[0.407]
<-0.64>
10.00
( 7.69)
[0.696]
< 0.83>
7.00
( 7.69)
[0.061]
<-0.25>
4.00
( 2.56)
[0.806]
< 0.90>
key:
observed
(expected)
[contribution to X-squared]
<residual>
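The expected counts printed above are easy to reproduce, and the rule of thumb can be checked mechanically; a Python sketch for the plant data:

```python
null_probs = [9/16, 3/16, 3/16, 1/16]   # Mendelian 9:3:3:1 ratio
n = 41                                  # plants raised

expected = [n * p for p in null_probs]
print(expected)  # [23.0625, 7.6875, 7.6875, 2.5625]

# Rule of thumb: all expected counts >= 1, and at least 80% of them >= 5
all_at_least_1 = all(e >= 1 for e in expected)
frac_at_least_5 = sum(e >= 5 for e in expected) / len(expected)
print(all_at_least_1, frac_at_least_5)  # True 0.75, so the approximation is a bit crude
```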
When the expected cell counts are not large enough, we can ask R to use the empirical method (just like we did
with statTally) instead.
chisq.test(plants, p = nullHypothesis, simulate = TRUE)
Chi-squared test for given probabilities with simulated p-value (based on 2000
replicates)
data: plants
X-squared = 1.97, df = NA, p-value = 0.5967
This doesn’t change our overall conclusion in this case. These data are not inconsistent with the simple genetic
model. (The simulated p-value is not terribly accurate either if we use only 2000 replicates. We could increase
that for more accuracy. If you repeat the code above several times, you will see how the empirical p-value varies
from one computation to the next.)
8.1.4 If we had more data . . .
Interestingly, if we had 10 times as much data in the same proportions, the conclusion would be very different.
morePlants <- c(200, 100, 70, 40)
chisq.test(morePlants, p = nullHypothesis)
Chi-squared test for given probabilities
data: morePlants
X-squared = 19.7, df = 3, p-value = 0.0001957
We’ve seen this same effect with the binomial test in an exercise.
8.1.5 When there are only two categories
We can redo our toads example using a goodness of fit test instead of the binomial test.
binom.test(14, 18)
Exact binomial test
data: x and n
number of successes = 14, number of trials = 18, p-value = 0.03088
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
0.5236 0.9359
sample estimates:
probability of success
0.7778
chisq.test(c(14, 4))
Chi-squared test for given probabilities
data: c(14, 4)
X-squared = 5.556, df = 1, p-value = 0.01842
Both of these test the same hypotheses:
• H0 : pL = pR = 0.5
• Ha : pL ≠ pR
where pL and pR are the proportions of left- and right-handed toads in the population.
Although both tests test the same hypotheses and give similar p-values (in this example), the binomial test is
generally used because
• The binomial test is exact for all sample sizes while the Chi-squared test is only approximate, and the
approximation is poor when sample sizes are small.
• Later we will learn about the confidence interval associated with the binomial test. This will be another
advantage of the binomial test over the Chi-squared test.
8.2 Testing Complex Hypotheses
In the examples above, our null hypothesis told us exactly what percentage to expect in each category. Such
hypotheses are called simple hypotheses. But sometimes we want to test complex hypotheses – hypotheses that
do not completely specify all the proportions, but instead express the type of relationship that must exist among
the proportions. Let’s illustrate with an example.
Example 8.5: Number of Boys in 2-Child Families
Q. Does the number of boys in a 2-child family follow a binomial distribution (with some parameter p)?
Data: number of boys in 2444 2-child families.
boys <- c(`0` = 530, `1` = 1332, `2` = 582)
boys
   0    1    2 
 530 1332  582
n <- sum(boys)
n
[1] 2444
p.hat = (1332 + 2 * 582)/(2 * (530 + 1332 + 582))
p.hat
[1] 0.5106
p0 <- dbinom(0:2, 2, p.hat)
p0 # binomial proportions
[1] 0.2395 0.4998 0.2608
expected <- p0 * sum(boys)
expected # expected counts
[1]  585.3 1221.4  637.3
observed <- boys
observed
   0    1    2 
 530 1332  582
This is a 1-degree-of-freedom test:

    df = 3 cells − 1 − 1 parameter estimated from the data = 1

We reduce the degrees of freedom because observed counts will fit expected counts better if we use the data to calculate the expected counts.
X.squared <- sum((boys - expected)^2/expected)
X.squared
[1] 20.02
1 - pchisq(X.squared, df = 1)
[1] 7.658e-06
That’s a small p-value, so we reject the null hypothesis: It doesn’t appear that a binomial distribution is a good
model. Why not? Let’s see how the expected and observed counts differ:
cbind(expected, observed)

  expected observed
0    585.3      530
1   1221.4     1332
2    637.3      582
Any theories? How might we check if your theory is correct?
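All of the steps above (estimating p.hat, computing the binomial proportions, the expected counts, the statistic, and the df = 1 p-value) fit in a few lines; a Python sketch using only the standard library, where erfc supplies the chi-squared tail probability for df = 1:

```python
from math import comb, erfc, sqrt

boys = {0: 530, 1: 1332, 2: 582}
n = sum(boys.values())                      # 2444 families

# Estimate p from the data: total boys / total children
p_hat = sum(k * c for k, c in boys.items()) / (2 * n)

# Binomial proportions and expected counts under Binom(2, p_hat)
p0 = [comb(2, k) * p_hat**k * (1 - p_hat)**(2 - k) for k in range(3)]
expected = [n * p for p in p0]

x_squared = sum((boys[k] - expected[k])**2 / expected[k] for k in range(3))

# df = 3 cells - 1 - 1 estimated parameter = 1; for df = 1,
# Pr(Chisq(1) >= x) = erfc(sqrt(x / 2))
p_value = erfc(sqrt(x_squared / 2))
print(round(p_hat, 4), round(x_squared, 2))  # 0.5106 20.02
```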
Don’t make this mistake
chisq.test(boys, p = p0)  # THIS IS INCORRECT: wrong degrees of freedom
Chi-squared test for given probabilities
data: boys
X-squared = 20.02, df = 2, p-value = 4.492e-05
This would be correct if you were testing that the distribution of boys is Binom(2, 0.5106), but that isn’t what
we are testing. We are testing whether the distribution of boys is Binom(2, p) for some p.
8.3 Reporting Results of Hypothesis Tests
Key ingredients
1. Null and alternative hypotheses
2. Test statistic
3. p-value
4. Sample size
Examples
Results of hypothesis tests are typically reported very succinctly in the literature. But if you know what to look
for, you should be able to find it.
Example 1: Reporting Significance but not p-value
Abstract. Ninety-six F2 .F3 bulked sampled plots of Upland cotton, Gossypium hirsutum L., from
the cross of HS46 × MARCABUCAG8US-1-88, were analyzed with 129 probe/enzyme combinations
resulting in 138 RFLP loci. Of the 84 loci that segregated as codominant, 76 of these fit a normal
1 : 2 : 1 ratio (non-significant chi square at P = 0.05). Of the 54 loci that segregated as dominant
genotypes, 50 of these fit a normal 3: 1 ratio (non-significant chi square at P = 0.05). These 138
loci were analyzed with the MAPMAKER3 EXP program to determine linkage relationships among
them. There were 120 loci arranged into 31 linkage groups. These covered 865 cM, or an estimated
18.6% of the cotton genome. The linkage groups ranged from two to ten loci each and ranged in size
from 0.5 to 107 cM. Eighteen loci were not linked.
Notes:
• “non-significant chi square at P = 0.05” should say “at α = 0.05” (or perhaps P > 0.05).
• The p-values are not listed here because there would be 84 of them. What we know is that 76 were larger
than 0.05 and 8 were less than 0.05. Presumably they will discuss those 8 in the paper.
Source: Zachary W. Shappley, Johnie N. Jenkins, William R. Meredith, Jack C. McCarty, Jr. An RFLP linkage
map of Upland cotton, Gossypium hirsutum L. Theor Appl Genet (1998) 97 : 756–761. http://ddr.nal.usda.
gov/bitstream/10113/2830/1/IND21974273.pdf
Example 2: Details not in abstract but are in the paper

Abstract. Inheritance of two mutant foliage types, variegated and purple, was investigated for diploid, triploid, and tetraploid tutsan (Hypericum androsaemum). The fertility of progeny was evaluated by pollen viability tests and reciprocal crosses with diploids, triploids, and tetraploids and germinative capacity of seeds from successful crosses. Segregation ratios were determined for diploid crosses in reciprocal di-hybrid F1, F2, BCP1, and BCP2 families and selfed F2s with the parental phenotypes. F2 tetraploids were derived from induced autotetraploid F1s. Triploid segregation ratios were determined for crosses between tetraploid F2s and diploid F1s. Diploid di-hybrid crosses fit the expected 9 : 3 : 3 : 1 ratio for a single, simple recessive gene for both traits, with no evidence of linkage. A novel phenotype representing a combination of parental phenotypes was recovered. Data from backcrosses and selfing support the recessive model. Both traits behaved as expected at the triploid level; however, at the tetraploid level the number of variegated progeny increased, with segregation ratios falling between random chromosome and random chromatid assortment models. We propose the gene symbols var (variegated) and pl (purple leaf) for the variegated and purple genes, respectively. Triploid pollen stained moderately well (41%), but pollen germination was low (6%). Triploid plants were highly infertile, demonstrating extremely low male fertility and no measurable female fertility (no viable seed production). The present research demonstrates the feasibility of breeding simultaneously for ornamental traits and non-invasiveness.

There are many sentences of this type in the paper, and some of the results are summarized in a table.
Source: Richard T. Olsen, Thomas G. Ranney, Dennis J. Werner, J. AMER. SOC. HORT. SCI. 131(6):725–
730. 2006. Fertility and Inheritance of Variegated and Purple Foliage Across a Polyploid Series in Hypericum
androsaemum L. http://www.ces.ncsu.edu/fletcher/mcilab/publications/olsen-etal-2006c.pdf
R Commands
Here is a list of the most important R commands we have seen so far. For a more complete list, see the index.
Find data from the book
abdData(3)  # all data in chapter 3
Creating tables
mtable(~ sector, data = CPS)
mtable(~ sector, data = CPS, format = "proportion")
mtable(~ sector & sex, data = CPS)
mtable(~ sector | sex, data = CPS)
Numerical summaries
# These work with sd(), var(), median(), max(), min(), etc. too
mean(CPS$age)
mean(~ age, data = CPS)
mean(age ~ sex, data = CPS)
Plots
histogram( ~ age, data=CPS )              # histogram
xhistogram( ~ age, data=CPS, width=5 )    # histogram with some extra features
densityplot( ~ age, data=CPS )            # smooth version of histogram
xyplot( wage ~ age, data=CPS )            # scatter plot
xyplot( wage ~ age | sex, data=CPS )      # scatter plots for each sex
bwplot( ~ age, data=CPS )                 # boxplot
bwplot( age ~ sex, data=CPS )             # boxplots for each sex side-by-side
bwplot( sex ~ age, data=CPS )             # boxplots for each sex above and below
bargraph( ~ sector, data=CPS )                   # bargraph
bargraph( ~ sector, data=CPS, horizontal=TRUE )  # bargraph
densityplot( ~ age | sex, data=CPS )             # separate panels for each sex
densityplot( ~ age, group=sex, data=CPS )        # overlaid plots for each sex

[Figures: sample output from the plotting commands above (histograms, density plots, scatter plots, boxplots, and bar graphs of the CPS variables age, wage, and sector).]
Probability functions
dbinom(92, 100, .92)      # P(X = 92) for X ~ Binom(100, .92)
pbinom(92, 100, .92)      # P(X <= 92) for X ~ Binom(100, .92)
1 - pbinom(91, 100, .92)  # P(X >= 92) for X ~ Binom(100, .92)  **note 91**
pchisq(5.2, df=3)         # P(X <= 5.2) for X ~ Chisq(3)
1 - pchisq(5.2, df=3)     # P(X >= 5.2) for X ~ Chisq(3)
Hypothesis tests (super short-cuts)
binom.test(14,18)                             # Toads example
chisq.test(c(20,30,15), p=c(1/4, 1/2, 1/4))   # chi-squared goodness of fit for 1:2:1 ratio
xchisq.test(c(20,30,15), p=c(1/4, 1/2, 1/4))  # more verbose output
Mathematical functions
sqrt(10)   # square root
log(10)    # natural log
log10(10)  # log base 10
sin(pi)    # sine
exp(3)     # e to the power 3
Bibliography
[AmH82] American Heritage Dictionary, Houghton Mifflin Company, 1982.
[Fis25] R. A. Fisher, Statistical methods for research workers, Oliver & Boyd, 1925.
[Fis70] ———, Statistical methods for research workers, 14th ed., Oliver & Boyd, 1970.
[fMA86] Consortium for Mathematics and Its Applications, For all practical purposes: Statistics, Intellimation, 1985–1986.
[fMA89] ———, Against all odds: Inside statistics, Annenberg Media, 1989.
[Kap09] Daniel T. Kaplan, Statistical modeling: A fresh approach, 2009.
[Sal01] D. Salsburg, The lady tasting tea: How statistics revolutionized science in the twentieth century, W.H. Freeman, New York, 2001.
[Utt05] Jessica M. Utts, Seeing through statistics, 3rd ed., Thompson Learning, 2005.
Index
CPS, 35
HELPrct, 19, 35
IQR(), 44
TeenDeaths, 41
abdData(), 18
abd, 18
barchart(), 41
bargraph(), 20, 37, 41
binom.test(), 70, 77
bwplot(), 20, 22, 46
c(), 28
chisq.test(), 74, 77
choose(), 69
colors(), 28
cut(), 39
data(), 17
dbinom(), 55, 69, 78
demo(), 32
densityplot(), 20, 21
do(), 9
factorial(), 69
histogram(), 20, 21, 23, 33
lattice, 22, 25, 33
log(), 15
log10(), 15
mean(), 43
median(), 44
mhistogram(), 21
mosaicManip, 21
mosaic, 8, 19, 21, 23, 35, 38
mtable(), 10, 39
pbinom(), 70
perctable, 37
plotDist(), 70
proptable, 37
pval(), 70
rflip(), 8
sd(), 44
show.settings(), 29
sqrt(), 15
stripplot(), 47
sum(), 74
summary(), 44
table(), 37–39
trellis.par.set(), 29
var(), 44
xchisq.test(), 76
xhistogram(), 21, 23, 40
xtabs(), 37–40
xyplot(), 25, 46, 47
cards, 64
conditional probability, 62
data, 5
dice, 64
Fisher, R. A., 6, 8
lady tasting tea, 6
medical testing, 65
population, 6
randomness, 6
sample, 7