Elementary Statistics
Brian E. Blank
February 22, 2016
first president university press
6. Obtaining Data
6.1 Random Digits and Simulation
Many of the games we play rely on “randomness”. When we throw a die, we obtain a face with n dots where
n = 1, 2, 3, 4, 5, or 6. Any one of these six faces can arise and they all occur with equal likelihood. The rolling
of a die is a random number generator that returns one of the six digits, 1, 2, 3, 4, 5, or 6 at random.
The die was invented in ancient times using the primitive technology at hand for the sole purpose of being a
random number generator. Games of chance that use dice depend on that randomness.
We are familiar with other activities in which randomness is essential. The state lottery is a form of
voluntary self-taxation that relies on randomness. We shuffle a deck of cards to give them a random order.
Engineers, scientists, and mathematicians frequently turn to random number generators to solve problems
that would otherwise be difficult or even intractable. Instead of resorting to dice or game spinners, they use
tables of random digits that have been created by computers, or, more frequently, they use the random number
generators that most mathematical software packages provide.
In a table of random numbers, the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 appear. In a typical presentation,
five are printed in a row with no space in between them. Then there is a space and the next group of five
digits begins. The groupings have no mathematical significance—they allow for ease of reading and serve
no other purpose. Every entry of the table can be filled with any of the ten digits: each digit has a one in
ten chance of filling the entry. Imagine ten slips of paper with the digits 0 through 9. They are placed
in a hat and shaken well; one is pulled out, its number is recorded, and then it is returned to the hat. The
next digit is generated in the same way. The selection of each digit is in no way influenced by any of the
previously selected digits. A table of random digits is similar except that it is generated by a computer using
a mathematical algorithm instead of slips of paper and a hat.
Let’s see how a table of random digits can solve a fairly tricky mathematical puzzle known as the “collector
problem.” You may well be familiar with the problem in a nonmathematical way. In its abstract form, a
collector hopes to acquire a complete set of r different collectible items. The distribution of these collectible
items is through an arbitrarily large supply of boxes that are identical in all but one way: in each box, one
collectible item is placed. In the present discussion, we will assume that, for any box, each collectible item is
as likely to be found as any other. The question we pose is, What is the probability p that n boxes contain a
complete set of the r collectible items?
The collector problem has been used to sell Happy Meals, bubble gum packs, boxes of cereal, and many
other items. For example, suppose that Weetabix cereal boxes contain cards of hockey greats Gordie Howe,
Wayne Gretzky, and Bobby Orr. If we purchase 5 boxes, what is the probability that we have the complete
set of 3 cards?
It turns out that, for general r and n, mathematicians have obtained an exact algebraic formula for the
solution to the collector problem, as posed (i.e., with equal probabilities for each of the r collectible items):
p = \sum_{j=0}^{r-1} (-1)^j \binom{r}{j} \left( 1 - \frac{j}{r} \right)^{n} .    (6.1.1)
For r = 3 and n = 5, this formula gives p = 50/81, or p ≈ 0.6173. It takes quite a bit of ingenuity to deduce
formula (6.1.1). By contrast, we can obtain the number 0.617 with ridiculous ease by using random numbers
to simulate cereal box purchases. To do so, we will use the table of random digits shown in Figure 6.1.1.
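Formula (6.1.1) is easy to check numerically. The following Python sketch (an illustration added here for concreteness; the function name collector_probability is not from the text) evaluates the sum exactly with rational arithmetic:

```python
from fractions import Fraction
from math import comb

def collector_probability(r, n):
    """Evaluate formula (6.1.1): the probability that n boxes
    contain a complete set of r equally likely collectible items."""
    return sum(Fraction((-1) ** j * comb(r, j)) * Fraction(r - j, r) ** n
               for j in range(r))  # j runs from 0 to r - 1

p = collector_probability(3, 5)
print(p, float(p))  # 50/81, about 0.6173
```

For r = 3 and n = 5 this reproduces p = 50/81 exactly, matching the value quoted above.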
Figure 6.1.1 Table of Random Digits
Because any one of the cards has a 1/3 chance of being in any box, we will select a group of digits and
divide them into three equal subsets, each subset representing one of the hockey players. We could, for
example, use the digit 1 to represent a Gordie Howe card, 2 Wayne Gretzky, and 3 Bobby Orr. If so, then we
would disregard the digits 0, 4, 5, 6, 7, 8, and 9. Instead, we could use 6 of the digits, assigning two to each of
the players. In this case we would disregard the remaining four digits. It would be even more efficient to use
nine of the digits and disregard only one. Thus, we will assign 1, 2, 3 to Gordie Howe (G), 4, 5, 6 to Wayne
Gretzky (W), and 7, 8, 9 to Bobby Orr (B). Selecting a sequence of n random digits will simulate buying a
sequence of n cereal boxes.
Using the first block of five random digits from Figure 6.1.1, namely 19223, we see that our first five boxes
of cereal result in the cards GBGGG and failure to achieve a complete set. We will denote this trial by
19223 → GBGGG. The next group of random digits representing the purchase of 5 boxes of Weetabix is
9503405. Remember, we are discarding the digit 0. This group represents the cards BWGWW, which results
in a complete collection. We will denote this by 9503405 → BWGWW. For ease of lookup, we will skip
the rest of the digits in the block we just started and move on to the fourth block of random digits. The
result is 28713 → GBBGG. Continuing in this way, we obtain 964091 → BWWBG, 42544 → WGWWW,
82853 → BGBWG, 73676 → BGWBW, 471509 → WBGWB, 019272 → GBGBG, and 42648 → WGWWB.
If we count the number of times we obtained a complete set, namely 6, and the total number of trials, namely
10, then we find that the relative frequency of obtaining a complete set is 6/10, or 0.6. Our approximation to
p ≈ 0.6173 based on our simulation with 10 trials is therefore 0.6.
Nowadays, when nearly everyone walks around with a device that can produce a stream of random digits
(and can even be used to take a selfie of the experimenter awaiting the outcome of the simulation), it is not
often that tables are the source of random digits. For example, here is Maple code for a program that will
perform a user-inputted number of simulations:
weetabix := proc()
local N, j, k, G, W, B, collection, rn, newRandom, completeSet:
N := args[1]:
rn := rand(1..9): # Generates a random digit from 1 through 9 inclusive
completeSet := 0:
for j from 1 to N do
collection := {}:
for k from 1 to 5 do
newRandom := rn():
if newRandom <= 3 then collection := {op(collection),G}:
elif newRandom <= 6 then collection := {op(collection),W}:
else collection := {op(collection),B}:
end if:
end do:
if {G,W,B} subset collection then completeSet := completeSet+1:
end if:
end do:
return evalf(completeSet/N);
end proc:
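For readers who prefer Python to Maple, here is an equivalent rendering of the same simulation. It is a sketch under the same digit assignments (1–3 for G, 4–6 for W, 7–9 for B); the names simulate and trials are mine, not from the text:

```python
import random

def simulate(trials, boxes=5):
    """Estimate the probability that `boxes` cereal boxes yield all
    three cards, using random digits 1-9 split evenly among G, W, B."""
    complete = 0
    for _ in range(trials):
        collection = set()
        for _ in range(boxes):
            d = random.randint(1, 9)  # a random digit from 1 through 9
            collection.add("G" if d <= 3 else "W" if d <= 6 else "B")
        if collection == {"G", "W", "B"}:
            complete += 1
    return complete / trials

random.seed(1)  # fixed seed so the run is reproducible
print(simulate(100_000))  # close to 50/81 ≈ 0.6173
```

With 100,000 trials the estimate should land within a few thousandths of 50/81.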
Figure 6.1.2 shows a screen capture in which this program was used to run a simulation of 162,000 trials. The
approximation to p = 50/81 ≈ 0.6172839506 that was generated, namely 0.6173395062, is quite good—it
is off by only 0.009%. And, as you can see, the simulation ran for less than four seconds. No selfie was taken,
however.
Figure 6.1.2 Screen Capture of Collector Problem Simulation Using Maple
Random Digits in R (Optional)
To obtain one or more random digits in R, we use R’s random sample function, sample. It should be mentioned
at the outset that sample is not specifically about digits, and, in its most basic call, sample seems to be slightly
misnamed. If v is a vector of objects, then sample(v) returns a vector of the same length consisting of the
entries of v randomly reordered. In other words, sample(v) is a random permutation of the entries of v.1
Thus, sample(c(1,2,3,4,5)) returns a random permutation of the first five positive integers.2
The following two lines of R code create a vector named Hamlet of six character strings, followed by all
720 permutations (not all distinct) of Hamlet’s entries:
> Hamlet = c("not", "be", "or", "be", "to", "to")
> for(j in 1:720) {print(c(j,sample(Hamlet)))}
Of the 720 lines of output, here are the first seventeen:
[1] "1"  "to"  "be"  "or"  "be"  "to"  "not"
[1] "2"  "to"  "be"  "be"  "not" "to"  "or"
[1] "3"  "not" "to"  "be"  "or"  "be"  "to"
[1] "4"  "be"  "or"  "be"  "to"  "not" "to"
[1] "5"  "to"  "to"  "not" "be"  "or"  "be"
[1] "6"  "to"  "or"  "not" "be"  "to"  "be"
[1] "7"  "to"  "not" "be"  "be"  "to"  "or"
[1] "8"  "to"  "not" "to"  "be"  "or"  "be"
[1] "9"  "be"  "not" "or"  "be"  "to"  "to"
[1] "10" "be"  "to"  "not" "be"  "or"  "to"
[1] "11" "to"  "or"  "to"  "be"  "not" "be"
[1] "12" "to"  "to"  "be"  "be"  "or"  "not"
[1] "13" "to"  "to"  "or"  "be"  "be"  "not"
[1] "14" "or"  "be"  "to"  "not" "be"  "to"
[1] "15" "not" "to"  "be"  "or"  "to"  "be"
[1] "16" "to"  "be"  "not" "be"  "or"  "to"
[1] "17" "to"  "be"  "or"  "not" "to"  "be"
You know what they say about enough monkeys with enough keyboards.3 The reader will note how difficult it
is for the human eye to judge randomness. Of the 17 rearrangements printed, 11 begin with “to”. Because “to”
appears twice among Hamlet’s entries, as does “be”, we expect it to be the first entry of the 17 rearrangements
about twice as many times as “not” and “or”. Yet “not” is a first entry only twice and “or” only once.
1 The analogous function in Maple is randperm.
2 We will mention the colon operator shortly, but, anticipating that discussion, it is appropriate to point out right away that sample(1:n, n) is a convenient way to produce a random permutation of 1, 2, 3, . . . , n when n is not a small positive integer.
3 http://en.wikipedia.org/wiki/Infinite_monkey_theorem
Recall that if v is a vector, then length(v) is the number of entries of v. For example, length(Hamlet)
is 6 not because “Hamlet” has 6 letters but because it has the 6 entries “not”, “be”, “or”, “be”, “to”, “to”.
If k is a positive integer not exceeding length(v), then the call sample(v, k) will randomly select k of the
entries of v. Notice that sample(v) and sample(v, length(v)) are equivalent calls: the default is to return
a vector of the same length as the first argument if there is no contraindication.
The reason that k cannot exceed length(v) in the call sample(v, k) is that the sampling is done
without replacement: once an entry of v has been selected to be an entry of sample(v, k), it is withdrawn from the pool of entries of v that are available for subsequent selection. In mathematical terms,
if 1 ≤ j < ℓ ≤ k ≤ length(v), and if vm is an entry of v that has been selected as the j th entry of
sample(v,k), then vm is no longer available to be the ℓth entry of sample(v,k). Thus, in the permutations
of ⟨1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5⟩ generated by sample(c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,5), 10), the
number 1 may appear as many as five times (because that is how often it appears in the list that is sampled
from) but no more than that (because every time it is drawn from the list, it is not replaced). Similarly, 2
may appear at most 4 times, 3 may appear at most 3 times, 4 may appear at most twice, and 5 may appear
at most once. It will be noted that if k > length(v), then the call sample(v,k) necessarily leads to an
error: when the length(v)th entry has been chosen, the remaining pool of entries is empty and continued
sampling is impossible.
There are certainly times when we wish to sample with replacement. Consequently, there is an optional
parameter, replace, that can be added to the argument list of sample. The default value of replace
is FALSE. When replace = FALSE is included as an argument of sample, sampling is performed without
replacement, as described above. Because FALSE is the default value of replace, the calls sample(v,k)
and sample(v,k,replace=FALSE) are equivalent. On the other hand, the inclusion of replace = TRUE
as an argument of sample overrides the default specification, FALSE: when sample(v,k,replace=TRUE) is
implemented, sampling is performed with replacement. That means that after an entry has been drawn
in the sample, it is returned to the pool and may be selected again in continued sampling. Thus, the call
sample(Hamlet,replace=TRUE) might result in the vector ⟨‘‘not’’, ‘‘not’’, ‘‘not’’, ‘‘not’’, ‘‘not’’, ‘‘not’’⟩.
It is easy to see that, in the call sample(v,k,replace = TRUE), there is no requirement that k not exceed
length(v): the pool from which entries are selected never gets smaller when sampling is performed with
replacement.
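The distinction between the two sampling modes is not peculiar to R. As an aside, in Python rather than R and purely for comparison, the standard library separates the two modes into different functions: random.sample draws without replacement and random.choices draws with replacement.

```python
import random

pool = [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5]

# Without replacement: each draw removes the chosen entry, so no value
# can appear in the sample more often than it appears in the pool.
without = random.sample(pool, 10)

# With replacement: every draw sees the full pool, so the sample may
# even be longer than the pool.
with_repl = random.choices(pool, k=20)

print(without, with_repl)
```

As in R, asking random.sample for more entries than the pool holds raises an error, while random.choices has no such restriction.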
Now that all the preparation is in place, let us finally take up the topic of random digits. Suppose that m
and n are integers with m < n. Then m:n is the simplest R code that generates the sequence m, m + 1, . . . , n
consisting of the consecutive integers beginning with m and ending with n. In particular, 0:9 generates the
sequence of all base 10 digits in increasing order. To obtain one or more random digits in R, we use the call
> sample(0:9, N, replace = TRUE)
where N is the desired number of random digits.
6.2 Sampling
In 1935 George Horace Gallup (1901–1984) formed a polling company, the American Institute of Public
Opinion, dedicated to predicting American elections. To this day, the surveys undertaken by this company
are known as Gallup Polls. Gallup became famous in 1936 when he correctly predicted the victory of Franklin
Delano Roosevelt over Alf Landon in the presidential election that year. Gallup’s prediction was based on
a sample of 50,000 voters. By contrast, the Literary Digest conducted a large-scale survey. The magazine
mailed 10,000,000 ballots to a list of voters obtained from telephone directories, drivers’ registrations, magazine
subscription lists, country club memberships, and the like. Of the ballots sent out, some 2,400,000 were
returned. Based on those returns, the Literary Digest predicted that Landon would win with 57% of the vote.
As mentioned, Landon actually lost.
Sampling Bias
A sampling procedure that underestimates or overestimates a characteristic of the population is said to be
biased. As in the ordinary English usage of the term, the meaning of “bias” in statistics has a
connotation of one-sidedness. Although there is a sense of willfulness associated with the term “bias”
in its ordinary English usage, bias in statistics is an unintentional flaw in a sample survey that is designed to
accurately reveal some aspect of a population.4 The Literary Digest survey suffered from two types of bias:
selection bias and nonresponse bias. Selection bias describes a survey in which certain types of surveyees
are disproportionately chosen. The Literary Digest compiled its survey invitations from lists that tended to
include more affluent members of society. In Depression era America, telephone connectedness was a luxury
for many, to say nothing of car ownership or club membership. Nonresponse bias can occur when surveyees
elect not to participate in a survey: it may happen that those who respond and those who do not respond
represent very different segments of the population with regard to the focus of the survey.5
The moral of the Literary Digest survey is that a large survey size will not necessarily overcome a poor
method of sampling. In the other direction, as Gallup showed in 1936, a well-designed survey can be accurate
even if the surveyees make up only a small fraction of the population.
A sample frame is a list of the population from which the sample is drawn. For an election, the names on
the voter registries for the polling stations would be the sample frame. The sample bias in the Literary Digest
survey could have been avoided by a random selection of names from the sample frame. Such a survey is
called a simple random sample (SRS). The randomization procedure need not be elaborate. For example,
if a sample frame consists of 10,000 persons and a survey is to question 600 members of that frame, then each
member of the frame is assigned a unique number from 1 to 10,000, and a random number generator is run
until it arrives at 600 distinct numbers in that range. One criticism of this procedure is that all aspects of
the selection are left to chance.
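The simple random sample procedure just described can be sketched in a few lines of Python (the frame size of 10,000 and sample size of 600 are taken from the example above):

```python
import random

# Label the 10,000 members of the frame 1 through 10,000, then draw
# 600 distinct labels at random (sampling without replacement).
frame = range(1, 10_001)
chosen = random.sample(frame, 600)

print(len(chosen), len(set(chosen)))  # 600 600: all labels distinct
```

Because the draw is without replacement, the 600 labels are automatically distinct, exactly as the procedure requires.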
Quota Sampling and Stratified Sampling
In the 1948 election, the Gallup Poll predicted that Thomas Dewey would receive 50% of the presidential vote
compared to Harry Truman’s 44%. Never heard of President Dewey? In fact, the vote split was almost exactly
the reverse of the Gallup prediction. Gallup himself attributed his poll’s failure to timing. In his opinion, the
poll was conducted too soon before the election: undecided voters decided after his poll and decided voters
had time to switch their preferences between the poll and the election. A different opinion is that the failure
of the 1948 Gallup poll was the result of human bias. Gallup’s polling strategy was quota sampling. In
quota sampling, the population is partitioned into subpopulations called strata. A predetermined number
(i.e. quota) of surveyees is then selected from each stratum. For example, in a presidential election, the
country might be divided into geographic regions. Then the population of each region is subdivided into
subpopulations according to race (e.g., Black, Hispanic, White Non-Hispanic, Other). Each subpopulation is
then subdivided by gender. Further subdivision might be done according to other criteria (urban/rural, etc.).
The subpopulations that result when the subdividing is over are the strata. In quota sampling, the number
sampled from each stratum is not left to chance as in a simple random sample.
So what went wrong with the 1948 Gallup poll? Some analysts believe that human bias led to nonrepresentative
sampling within the strata: although they were assigned quotas, Gallup pollsters were otherwise allowed to
use their discretion in selecting surveyees. In stratified sampling, a strategy of selection that eliminates
human bias, such as simple random sampling, is employed within each
4 One can imagine intentionally biased surveys that are designed for the purpose of advocacy rather than discovery.
Such surveys are not considered here.
5 The volunteer bias that results when the surveyees in a sample consist entirely of volunteers can be even worse.
stratum. In an election poll, the proportion of surveyees that come from a stratum will, for each stratum,
equal the proportion of the members of the entire population that come from that stratum. On the other
hand, in surveys conducted for market research, a particular demographic is often targeted and quotas that
are appropriate for the targeted market are set.
Cluster Sampling
There are situations in which simple random sampling is not practical. In evaluating the manuscript of an
elementary textbook, a publisher will need to determine if the text is at an appropriate reading level for the
grade for which the text is intended. To do so, the publisher will test vocabulary, sentence length, sentence
structure, and so on. If every word and every sentence were numbered, then it might be possible to make a
selection by simple random sampling. But manuscripts do not come with such counts. Instead the publisher
will assume that the writing level is homogeneous and do a simple random sampling of the pages. The pages
are clusters of words and sentences. Then each word and each sentence of the chosen clusters is examined.
This procedure is known as cluster sampling. Superficially, it might seem that stratified sampling and
cluster sampling are similar. In fact, there is a crucial difference. A population is divided into strata because
it is presumed that there might be significant differences between the subpopulations of different strata. By
contrast, clusters are presumed to be more or less the same in their properties.
6.3 Experimental Design
In broad outline, a statistical experiment involves experimental units, which, if they are human, are more
commonly called subjects or participants), one or more explanatory variables that are “applied” (in some
sense) to the experimental units, and one or more response variables that reflect the effects, if any, of the
experiment on the experimental units. The experiment seeks to determine if the explanatory variables are
factors in the values of the response variables.
A value of an explanatory variable is called a level. Taken together, the levels of all the explanatory
variables constitute a treatment. For example, in a study of the combination of the blood thinner warfarin
and the hypertension drug lisinopril, the levels of warfarin might be 5 mg, 7.5 mg, and 10 mg, and the levels
of lisinopril might be 5 mg and 10 mg. There are then 6 treatments: warfarin 5 mg & lisinopril 5 mg, warfarin
5 mg & lisinopril 10 mg, warfarin 7.5 mg & lisinopril 5 mg, warfarin 7.5 mg & lisinopril 10 mg, warfarin 10 mg
& lisinopril 5 mg, and warfarin 10 mg & lisinopril 10 mg.
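Because treatments are just combinations of levels, one from each explanatory variable, they can be enumerated mechanically. A small Python sketch of the warfarin/lisinopril example:

```python
from itertools import product

# One level from each explanatory variable constitutes a treatment.
warfarin = ["5 mg", "7.5 mg", "10 mg"]
lisinopril = ["5 mg", "10 mg"]

treatments = list(product(warfarin, lisinopril))
print(len(treatments))  # 6 treatments: 3 warfarin levels x 2 lisinopril levels
```

The count 3 × 2 = 6 matches the six treatments listed above.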
A Compromised Study
Almost every music educator in the United States is aware of the results of a 1981 Mission Viejo High School
study: The average GPA of music students was 3.59 compared to an average GPA of 2.91 for students who
did not study music. The same study also reported that 16% of music students had a 4.0 GPA compared
to only 5% of the students who did not study music.6 Evidently, the study of music imparts a discipline, a
work ethic, a skill of organization, and a sense of achievement that allows its participants to excel in other
academic subjects.
Not so fast!
In this observational study, there were two response variables, the numerical variable GPA and the
categorical variable Straight A Student (with two possible values, Yes or No). There was one explanatory
variable: the categorical variable Music Student (with two possible values, Yes or No). The experimental
units were the students of one graduating class. The study was retrospective: it was not planned in advance,
and the data used was collected before the study was conceived.
6 http://www.childrensmusicworkshop.com/advocacy/studentdevelopment.html, retrieved October 9, 2014.
The Mission Viejo study is open to several criticisms that cast doubt on its conclusions. There was no
control for other factors that might have influenced academic performance. For example, can we rule out the
possibility that the students who studied music were better students even before their music studies began?
Maybe their aptitude for learning made them more open to the challenge of learning an instrument. The
study of music is not without expense. Is it not possible that the students who studied music came from
more affluent families? There is a positive association between affluence and academic achievement that has
nothing to do with music. Maybe the music students chose easier electives so that they would have more time
to put in the countless hours of dreary practice almost every music student must endure.
In a prospective observational study, we can take steps to minimize the possible influences of factors we
are not studying. The Mission Viejo educators could have taken some of these steps in their retrospective
study, but other possibly influential factors were beyond their control. In the next subsection we will see how
a proper experimental design can obviate many of the criticisms to which the Mission Viejo study is subject.
Four Principles of Experimental Design
If you are to perform statistical experiments, then there are no options: you must become a control freak.
Whether you allow this trait to carry over into your nonprofessional activities is up to you, but, as far as
experimentation is concerned, it is a must. The four principles of experimental design are:
• Control
• Randomize
• Replicate
• Block
Actually, randomization, replication, and blocking are specific strategies for implementing certain types of
control, so perhaps we might be justified in repeating the real estate agent’s slogan: successful experimental
design comes down to four things: control, control, control, control.
Suppose that we wish to conduct a Mission Viejo type study that is not subject to the criticisms that can
be directed at that study. We will begin by selecting students for our study. (These will be our experimental
units.) If music education begins in grade m, we will select the subjects of our experiment when they are
in grade m − 1. For example, if music classes begin in grade 6, we will troll the current grade 5 class for
our subjects. We might have all students in grade 5 reply to the statement, I would like to study music next
year, with a Likert scale number. Our subjects will be taken from the group who answer, Neither agree nor
disagree. Using the academic records of the students in our sample frame, we can place students of equal
academic promise in the music and non-music groups. In this way we can control an important factor that
was not controlled in the Mission Viejo study.
There will be factors that we cannot control and factors we might not anticipate or know. The affluence
of a student’s family is a factor that we cannot reliably ascertain (because we cannot ask students to bring in
the financial records of their parents), and therefore its influence is beyond our control. To equalize (as best
we can) such factors, we randomize7 selection. Chances are, when randomization is employed, both groups
will get affluent and non-affluent students.
Can gender be a factor in this study? No need to take a chance. Randomization might place more
females and fewer males in one of the groups. We therefore ensure that this does not happen. When a
group of experimental units is more homogeneous than the entire collection of experimental units, we consider
the homogeneous group to be a block. If our experiment involves t treatments and our block consists of
b experimental units, we can randomly assign about b/t experimental units from the block to each of the t
treatments. A variable (often categorical) that determines a block is called a blocking variable. In our design
7 The strategy of randomization has been attributed to Johns Hopkins statistician Charles Sanders Peirce (1839–1914).
of a Mission Viejo type of experiment, we might make Gender a blocking variable. If we know that some
students in the sample frame of this experiment intend to do varsity sports, which can bring pressures that
impact academic pursuits, then we can treat athletic inclinations as another blocking variable. If a livestock
feed company is conducting an animal diet experiment, the company may block according to the ranches that
are participating. If a psychologist is studying the motor skills of the elderly, then age group (such as [78,80),
[80,82), [82,84), ... ) might be a blocking variable.
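The within-block randomization described above, about b/t units from a block of size b to each of t treatments, can be sketched in Python. The function name and unit labels here are hypothetical illustrations, not from the text:

```python
import random

def assign_block(block, treatments):
    """Randomly spread the units of one homogeneous block across the
    treatments, about len(block)/len(treatments) units per treatment."""
    units = list(block)
    random.shuffle(units)  # the randomization step within the block
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(units):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical block of 12 units and 4 treatments: 3 units per treatment.
groups = assign_block([f"unit{i}" for i in range(12)], ["T1", "T2", "T3", "T4"])
print({t: len(g) for t, g in groups.items()})
```

Running this once per block (females, males, athletes, and so on) completes the blocked randomization.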
The results of an experiment can depend to a certain degree on chance. After all, it is randomization that
assigns the experimental units to the treatments, and different assignments will lead to different results (but
not to different conclusions if the experimental design is sound). If the average GPA of the music group turns
out to be 3.6 and the average GPA of the non-music group is 3.4, how can we be sure that the difference
between these two average GPAs is meaningful and not due to chance? We will deal with this question in a
more precise way later in these notes. For now we simply note that sample size helps us control the variance
of the response variables. In general, the larger the sample size, the more confident we can be about our
results. That is where replication comes in. We assign multiple experimental units to every treatment.
In a typical statistical experiment, one treatment group will have no treatment: the factors that are under
study are not applied at all. This group of experimental units is called the control group. In some studies,
the existence of a control group involves some subterfuge. For example, the subjects in the control group of
an experiment trialing a new drug receive a placebo. This procedure implements one aspect of control: an
equalization of the perceived prognoses among the treated and non-treated subjects.
In the Mission Viejo type experiment that we have been considering, we have not been specific about levels,
other than that one group, the control group, will have no music instruction. We might design our experiment so
that there is more than one treatment group in addition to the control group. On the theory that if some
music is good, then more music is better, we might incorporate two positive levels of music instruction in our
experimental design: some students in the music group will do performance only, whereas others will augment
performance with a music theory class.
Testing a Claim
Suppose that we wish to test the claim of a plant nutrient company that its product improves both the taste
and the yield of fruits and vegetables that are reaped from plants that receive the nutrient. How do we test
the claim? Here is one experimental design to do that.
Experimental units We will purchase seeds of one variety of one fruit or vegetable—say the charentais
variety of cantaloupe. Perhaps the strain of one seed producer is more sensitive to the nutrient under study.
To control for this possible factor, we will buy a seed packet from each of three different seed producers.
We will sort the seeds from each packet and select sixteen seeds from each packet (48 in total) that are the
same size and free from any discoloration or other irregularities. The first eight seeds from each package that
germinate and survive infant mortality (24 in total) will be our experimental units.
Explanatory variables and Treatments There is only one factor we are studying in this experiment: the
plant nutrient in question. (In a different experiment, we might have included irrigation and hours of direct
sunlight as factors in our study.) We will set four levels: (1) no fertilizer at all for the control group, (2)
half the recommended application of the nutrient, (3) the recommended application of the nutrient, and (4)
3/2 the recommended application of the nutrient. (Because there is only one factor in this experiment, each
treatment is nothing more than the level of the factor.)
Response variables There are two response variables. As the cantaloupes become ripe, we will pick them
and weigh them, keeping a running total of the weight of the fruit produced by each plant in our study. For
the taste variable, we will pick 5 judges to taste each fruit and respond on a Likert scale to the statement,
This cantaloupe is delicious.
Experimental design
• Randomize We will randomly divide the eight seedlings from each packet into four groups of 2.
• Replicate We will assign six experimental units to each of the four treatments by assigning to each
treatment one group of 2 seedlings from each of the three seed packets.
• Control We will control for known growth factors by planting the seedlings so that they receive equal hours
of sun. We will water and weed equally.
• Block In the experimental design, we have blocked according to seed producer.
Blinding
Humans are prone to errors in judgment and subtle biases. Suppose that an experimenter, when receiving
a potential subject’s consent to become a participant, looks at the second hand of a clock to determine if
the subject will receive a new wonder drug (second hand in the interval (0,30]) or a placebo (second hand
in the interval (30,60]). Is that a valid method of randomization? If the experimenter is aware of the group
that will receive the sugar pill, perhaps not. If the experimenter is aware of the treatments for each group,
if the experimenter has formed a favorable opinion of the participant, and if the second hand points to the
31, will the experimenter ruthlessly deny the participant the benefits of the new wonder drug just because
the experimenter was a fraction of a second slow glancing at the clock? It would be better if the technician
created two groups without knowing which treatment would be applied to the groups. Such a strategy is
called blinding. Charles Sanders Peirce, to whom randomization has been attributed, has also been credited
with the proposal for blinding as part of experimental design.
There are two classes of people whose actions can affect the outcomes of experiments: those who can
influence the results (subjects, technicians, administrators, ... ) and those who evaluate the results (judges,
lead scientists or physicians, ... ). If all the individuals in one class are blinded, then the experiment is
said to be single-blind. If all the individuals in both classes are blinded, then the experiment is said to be
double-blind.
Lurking Variables & Confounding Variables & Not Confusing the Two
We have already introduced the terms “lurking” and “confounding.” It is easy to confuse these two types of
potentially problematic variables. Suppose that X and Y are explanatory variables for response variable Z.
If X and Y are associated with each other, as well as being associated with Z, then X and Y are said to be
confounding variables. It is then difficult to determine which of them causes the response of Z. Suppose that
a bank, seeking to increase its profits, institutes a sequence of fee increases to raise its revenue. Suppose
that, at the same times, the bank decreases the availability of customer representatives in order to reduce its
expenditures. Suppose that, as a result of these actions, the number of the bank’s depositors decreases. What
caused this decrease? The cause cannot be pinpointed: the cost of the fees and the availability of customer
representatives are confounding variables—they are both associated with the number of customers as well as with each other.
Notice that confounding variables are to be found among the explanatory variables. A lurking variable is
not: it lurks. Although there is an association between two confounding variables, there need be no causal
relationship between the confounding variables, as the bank example illustrates: although the motivation for
the fee increases and the employee availability decreases was the same, the bank made independent decisions
to take each action. A lurking variable X is not included among the explanatory variables, but it is causal for
both the explanatory variable Y and the response variable Z. It will then seem that explanatory variable Y
causes response Z whereas lurking X is really the cause. At a beach in Florida, the number of jellyfish stings
and the weight of the litter picked up from the beach are strongly correlated. Are the jellyfish punishing
humans for their careless discarding of waste? It is likely that the number of people visiting the beach is a
lurking variable that explains both the number of jellyfish stings and the weight of the litter.
In a British study that began in 1970, each woman in a group of 1314 was asked if she was a smoker.8
8 David R. Appleton, Joyce M. French, and Mark P.J. Vanderpump, Ignoring a Covariate: An Example of Simpson’s Paradox,
The American Statistician 50 (Nov. 1996), 340-341.
Twenty years later, the mortality of the subjects was analyzed. The result: 24% of the smokers had died in
the intervening twenty years compared to 31% of the nonsmokers. Do these numbers mean that smoking is
good for your health?
Look at the year the study began and project backward. At one time cigarette smoking was not yet very
common, and smoking was far less common among women than among men. The oldest women in the study tended
to be nonsmokers. But, because of their greater age, they tended to die in greater numbers (of causes
unrelated to smoking). Age Group was associated with the categorical variables Smoker (Yes/No) and Twenty Year
Survivor (Yes/No). It would have been a lurking variable had it not been considered when the experiment
was designed. But it was. So Age Group and Smoker (Yes/No) were confounding variables for Twenty Year
Survivor (Yes/No).
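The reversal at work here is an instance of Simpson's paradox, and it can be illustrated numerically. The counts below are hypothetical (they are not the data from the Appleton, French, and Vanderpump study); they merely show how a comparison can point one way within every age group and the other way in the aggregate.

```python
# Hypothetical counts illustrating Simpson's paradox (NOT the actual study data).
# Each tuple: (deaths, total) for that group.
strata = {
    "young": {"smoker": (50, 500), "nonsmoker": (10, 200)},
    "old":   {"smoker": (80, 100), "nonsmoker": (350, 500)},
}

def rate(deaths, total):
    return deaths / total

# Within each age group, smokers have the HIGHER mortality rate.
for age, groups in strata.items():
    s = rate(*groups["smoker"])
    n = rate(*groups["nonsmoker"])
    print(f"{age}: smoker {s:.2f} vs nonsmoker {n:.2f}")

# Aggregated over age groups, the comparison reverses: most smokers are young
# (low mortality), most nonsmokers are old (high mortality).
agg = {}
for status in ("smoker", "nonsmoker"):
    deaths = sum(strata[age][status][0] for age in strata)
    total = sum(strata[age][status][1] for age in strata)
    agg[status] = rate(deaths, total)
print(f"aggregate: smoker {agg['smoker']:.2f} vs nonsmoker {agg['nonsmoker']:.2f}")
```

The lesson matches the study: once Age Group is accounted for, smoking is associated with higher mortality; only the unstratified totals suggest otherwise.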
Exercises
1. (Washington University exam, Fall 2007. No doubt before Tiger and Lance fell from grace.)
Suppose that 20% of Weetabix cereal boxes contain a Tiger Woods card, 30% a Lance Armstrong card,
and the rest contain a Serena Williams card. Suppose you buy five boxes of cereal. Use the ten sets of
five numbers from the following line taken from a table of random digits to simulate ten runs of buying
five boxes of cereal:
77007 26962 55466 12521 48125 12280 54985 26239 76044 54398
Assign the digits 0 and 1 to Tiger; 2, 3, and 4 to Lance; and the remaining digits to Serena. Based on
this simulation, estimate the probability that you end up with a complete set of their cards.
A) 0.0
B) 0.1
C) 0.2
D) 0.3
E) 0.4
F) 0.5
G) 0.6
H) 0.7
I) 0.8
J) 0.9
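Readers who want to check their hand tally of Exercise 1 can do so with a short script. The digit-to-card mapping follows the exercise statement; a "success" is a group of five digits that yields all three cards.

```python
# The line of random digits given in Exercise 1, split into ten runs of five.
line = "77007 26962 55466 12521 48125 12280 54985 26239 76044 54398"

def card(d):
    # Per the exercise: 0-1 -> Tiger, 2-4 -> Lance, 5-9 -> Serena.
    if d in "01":
        return "Tiger"
    if d in "234":
        return "Lance"
    return "Serena"

runs = line.split()
successes = sum(1 for run in runs
                if {card(d) for d in run} == {"Tiger", "Lance", "Serena"})
print(successes / len(runs))  # estimated probability of a complete set
```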
2. Suppose that N different letters are written. They are to be sent to N different people. The envelopes
are addressed, but the letters are placed at random into the envelopes. How many letters do you expect
will be sent to their intended destinations? This is a famous problem in probability. A theoretical
answer will be discussed later in these notes. For now, let us do a simulation for the case N = 5. In the
accompanying table, random permutations of the numbers 1, 2, 3, 4, 5 have been paired. For each pair,
count the number of matches. Calculate the average number of matches for the thirty pairs. The next
time you go to a New Year’s Eve party and five of your friends celebrate too much, grabbing a random
toque9 on departure, how many do you expect to go home wearing their own hat?
24351 21453
21354 21345
15243 14532
43521 43251
45213 23415
43125 34152
15342 52314
15432 24531
25143 12534
12354 24315
24531 15324
23451 31254
24351 34152
12354 13524
51342 12534
24531 15423
14235 53142
42513 51342
23514 53421
51324 25341
14532 43521
42351 25413
25143 34125
41352 35412
14235 53124
15324 31245
15423 12453
31254 12435
51243 15432
52413 51432
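After counting matches in the table by hand, the result can be cross-checked with a Monte Carlo simulation using a pseudorandom generator instead of the printed pairs. (The theoretical expected number of matches, discussed later in these notes, is 1 regardless of N.)

```python
import random

def average_matches(n, trials, seed=0):
    """Average number of fixed points in a random permutation of 0..n-1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        # A "match" is a letter landing in its own envelope: position i holds i.
        total += sum(1 for i, p in enumerate(perm) if i == p)
    return total / trials

print(average_matches(5, 100_000))  # close to 1.0
```

Your table-based average over thirty pairs will be noisier than this 100,000-trial estimate, but both should hover near 1.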
3. A baseball loving couple decide to have children until they either have a girl or they have three children:
if a childbirth results in either or both outcomes, then they will not have additional children. How many
children can they expect to have? How many boys? Girls? Let us do a simulation. For each block of 5
random digits found in lines 131, 132, 133, and 134 of the Random Digit Table in Figure 6.1.1, use the
first one, two, or three digits to simulate the couple’s family-forming strategy. To be definite, use the
9 http://www.urbandictionary.com/define.php?term=toque
digits 0-4 to simulate the birth of a girl. Calculate the average number of children in the 32 simulations,
the average number of boys, and the average number of girls.
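The table-based averages of Exercise 3 can be compared with a larger simulation. The sketch below uses a pseudorandom generator rather than the printed digit table, so it approximates the theoretical values instead of reproducing the 32 tabulated runs.

```python
import random

def family(rng):
    """One family under the stop-at-a-girl-or-at-three-children rule."""
    kids = []
    while len(kids) < 3:
        # Each birth is a girl with probability 1/2 (digits 0-4 in the table).
        kids.append("G" if rng.random() < 0.5 else "B")
        if kids[-1] == "G":
            break
    return kids

rng = random.Random(1)
trials = 100_000
fams = [family(rng) for _ in range(trials)]
print(sum(len(f) for f in fams) / trials)        # about 1.75 children
print(sum(f.count("G") for f in fams) / trials)  # about 0.875 girls
print(sum(f.count("B") for f in fams) / trials)  # about 0.875 boys
```

The exact values follow directly: the family has 1, 2, or 3 children with probabilities 1/2, 1/4, and 1/4, giving an expected 1.75 children, of whom 0.875 are expected to be girls and 0.875 boys.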
4. Raw Bits, Weetabix’s fiercest rival, came out with its own promotional campaign, offering its own
collectible cards of three of hockey’s near-greats: Mike Modano, Brett Hull, and Randy Wilson. Boxes
of Raw Bits contained these three cards in the following percentages: 50% for Mike Modano, who was
definitely a Star; 30% for Brett Hull, whom some have compared favorably with his unquestionably great
father, Bobby, and whom others described as an over-paid under-achiever; and 20% for Randy Wilson,
who remains unknown to those who did not see him in action. Suppose a collector decides to attempt
acquisition of the set of three by purchasing only 5 boxes—there is only so much roughage a human
can digest. Simulate the collector’s chances. Assign digit 0, 1, 2, 3, 4 to Mike Modano, digits 5, 6, 7 to
Brett Hull, and digits 8, 9 to Randy Wilson. Start with the first group of 5 random digits in the first
line (labelled 101) of the Table in Figure 6.1.1 and continue until the end of the fifth line (labelled 105).
Estimate the probability of the collector’s success.
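The digit-table estimate of Exercise 4 can be checked two ways: a Monte Carlo simulation with a pseudorandom generator (in place of the printed table lines 101-105), and the exact answer by inclusion-exclusion over the cards that fail to appear.

```python
import random

def complete_set_prob(trials=200_000, seed=2):
    """Estimate P(all three cards appear in 5 boxes), with P(M,H,W) = 0.5, 0.3, 0.2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        cards = set()
        for _ in range(5):
            u = rng.random()
            cards.add("M" if u < 0.5 else "H" if u < 0.8 else "W")
        hits += cards == {"M", "H", "W"}
    return hits / trials

# Exact value by inclusion-exclusion: subtract the chance each card is missing,
# add back the chance each pair is missing (all five boxes hold the third card).
exact = 1 - 0.5**5 - 0.7**5 - 0.8**5 + 0.2**5 + 0.3**5 + 0.5**5
print(exact)                 # about 0.507
print(complete_set_prob())   # close to the exact value
```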
5. If five random digits are successively generated, how likely is it that there are no repetitions? That is,
what is the probability p that five different digits compose a list of five random digits? The question
just posed is not to be answered precisely for now. Instead, use the blocks of five random digits found in
the first fifteen rows (those rows labeled 101–115) of the random digit table in Figure 6.1.1 to estimate
p.
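Although Exercise 5 asks only for a table-based estimate, the estimate can be compared against both a larger simulation and the exact value: a block of five digits is repetition-free when it is one of the 10 · 9 · 8 · 7 · 6 ordered selections out of 10^5 equally likely blocks.

```python
import math
import random

# Exact probability that five random digits are all different:
exact = math.perm(10, 5) / 10**5   # 30240 / 100000 = 0.3024

def estimate(trials=100_000, seed=3):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        block = [rng.randrange(10) for _ in range(5)]
        hits += len(set(block)) == 5
    return hits / trials

print(exact)       # 0.3024
print(estimate())  # close to 0.3024
```

A fifteen-row table supplies 150 blocks, so an estimate from it will typically fall within a few percentage points of 0.3024.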
6. (Washington University exam, Spring 2014) Twenty dogs and twenty cats were subjects in an experiment to test the effectiveness of a new flea control chemical. Ten of the dogs were randomly assigned
to an experimental group that wore a collar containing the chemical, while the others wore a similar
collar without the chemical. The same was done with the cats. After 30 days veterinarians were asked
to inspect the animals for fleas and evidence of flea bites. This experiment is...
A. completely randomized with one factor: the type of collar
B. completely randomized with one factor: the species of animal
C. randomized block, blocked by species
D. randomized block, blocked by type of collar
E. completely randomized with two factors
7. To check the effect of cold temperatures on a battery’s ability to start a car, researchers purchased
a battery from Sears and one from NAPA. They disabled a car so it would not start, put the car in a
warm garage, and installed the Sears battery. They tried to start the car repeatedly, keeping track of
the total time that elapsed before the battery could no longer turn the engine over. Then they moved
the car outdoors where the temperature was below zero. After the car had chilled there for several hours
the researchers installed the NAPA battery and repeated the test. Is this a good experimental design?
A. Yes
B. No, because the car and the batteries were not chosen at random.
C. No, because they should have tested other brands of batteries, too.
D. No, because they should have tested more temperatures.
E. No, because temperature is confounded with brand.
8. Read the exercise and solution in Chapter 5 concerning the correlation between maternal age and the
risk of a Down syndrome birth. Suppose that an obstetrician notices that Down syndrome occurs
proportionately more frequently in mothers’ second births than in first births, proportionately more in
third births than in second births, and so on. The obstetrician might conclude that increasing birth
13
order might be causally related to increasing Down syndrome risk. On the other hand, what might be
a lurking variable that explains the apparent correlation?
9. In a comparative meta-analysis of the efficacy of Drug A and Drug B in treating a specific malady,
researchers noted that the level of improvement not only depended on the treatment but also on the
gender of the subject. In what way might gender have been confounding?