Chapter 13 Slides

13 Collecting Statistical Data
13.1 The Population
13.2 Sampling
13.3 Random Sampling
• Polls, studies, surveys and other data
collecting tools collect data from a
small part of a larger group so that we
can learn something about the larger
group.
• This is a common and important goal
of statistics: Learn about a large group
by examining data from some of its
members.
• Data: collections of observations (such as
measurements, genders, survey responses)
• Statistics: the science of planning studies and
experiments, obtaining data, and then organizing,
summarizing, presenting, analyzing, interpreting,
and drawing conclusions based on the data
• Population: the complete collection of all
individuals (scores, people, measurements, and so
on) to be studied; the collection is complete in the
sense that it includes all of the individuals to be
studied
• Census: collection of data from every member of
a population
• Sample: subcollection of members selected from a
population
A Survey
• The practical alternative to a census is to collect
data only from some members of the population
and use that data to draw conclusions and make
inferences about the entire population.
• Statisticians call this approach a survey (or a poll
when the data collection is done by asking
questions).
• The subgroup chosen to provide the data is called
the sample, and the act of selecting a sample is
called sampling.
• The first important step in a survey is to distinguish
between the population to which the survey applies
(the target population) and the actual subset of the
population from which the sample will be drawn,
called the sampling frame.
• The ideal scenario is when the sampling frame is the
same as the target population, which would mean that
every member of the target population is a candidate
for the sample. When this is impossible (or
impractical), an appropriate sampling frame must be
chosen.
CASE STUDY 2
THE 1936 LITERARY
DIGEST POLL
The U.S. presidential election of 1936
pitted Alfred Landon, the Republican
governor of Kansas, against the
incumbent Democratic President,
Franklin D. Roosevelt. At the time of the
election, the nation had not yet emerged
from the Great Depression, and economic
issues such as unemployment and
government spending were the dominant
themes of the campaign.
The Literary Digest had used polls to
accurately predict the results of every
presidential election since 1916, and their
1936 poll was the largest and most
ambitious poll ever.
The sampling frame for the Literary
Digest poll consisted of an enormous list
of names that included:
(1) every person listed in a telephone
directory anywhere in the United States,
(2) every person on a magazine
subscription list, and
(3) every person listed on the roster of a
club or professional association.
From this sampling frame a list of about
10 million names was created, and every
name on this list was mailed a mock
ballot and asked to mark it and return it to
the magazine.
• Based on the poll results, the Literary
Digest predicted a landslide victory for
Landon with 57% of the vote, against
Roosevelt’s 43%.
• In the actual election, Roosevelt won in a
landslide with 62% of the vote, against
38% for Landon.
• The difference between the poll’s
prediction and the actual election
results was 19 percentage points, the
largest error ever in a major public
opinion poll.
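The size of the miss can be checked directly from the vote shares quoted in the case study. The Python sketch below is only an illustration; the dictionary and function names are made up for this example.

```python
# Vote shares in percent, as quoted in the case study.
predicted = {"Landon": 57, "Roosevelt": 43}
actual = {"Landon": 38, "Roosevelt": 62}

def prediction_errors(predicted, actual):
    """Absolute error, in percentage points, for each candidate."""
    return {name: abs(predicted[name] - actual[name]) for name in predicted}

errors = prediction_errors(predicted, actual)
print(errors)
```

The error is the same size for both candidates because every point wrongly credited to Landon was taken from Roosevelt.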
• For the same election, a young pollster
named George Gallup was able to
predict accurately a victory for
Roosevelt using a sample of “only”
50,000 people.
• What went wrong with the Literary
Digest poll and why was Gallup able to
do so much better?
• The first thing seriously wrong with the
Literary Digest poll was the sampling
frame, consisting of names taken from
telephone directories, lists of magazine
subscribers, rosters of club members,
and so on. Telephones in 1936 were
something of a luxury, and magazine
subscriptions and club memberships
even more so, at a time when 9 million
people were unemployed.
• When it came to economic status the
Literary Digest sample was far from
being a representative cross section of
the voters. This was a critical problem,
because voters often vote on
economic issues, and given the
economic conditions of the time, this
was especially true in 1936.
• When the choice of the sample has a
built-in tendency (whether intentional
or not) to exclude a particular group or
characteristic within the population, we
say that a survey suffers from
selection bias.
• Selection bias must be avoided, but it
is not always easy to detect it ahead of
time. Even the most scrupulous
attempts to eliminate selection bias
can fall short.
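To see how a skewed sampling frame distorts a result, consider a small made-up electorate. The group sizes and preferences below are hypothetical numbers chosen for illustration, not 1936 data, and the sketch assumes the sampling frame contains only the voters who own a telephone.

```python
# Hypothetical electorate (illustrative numbers only, not 1936 data).
# Each group: (number of voters, fraction favoring candidate A).
electorate = {
    "has_telephone": (400, 0.60),
    "no_telephone": (600, 0.25),
}

def support_for_a(groups):
    """Percentage favoring candidate A among all voters in the given groups."""
    total = sum(size for size, _ in groups.values())
    in_favor = sum(size * share for size, share in groups.values())
    return 100 * in_favor / total

# Target population: every voter. Sampling frame: telephone owners only.
true_support = support_for_a(electorate)
frame_support = support_for_a({"has_telephone": electorate["has_telephone"]})

print(f"whole electorate: {true_support:.1f}%")
print(f"sampling frame:   {frame_support:.1f}%")
```

In this toy setup even a census of the sampling frame would overstate support for candidate A (60% against a true 39%), so a larger sample from the same frame would not help.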
• The second serious problem with the
Literary Digest poll was the issue of
nonresponse bias.
• In a typical survey it is understood that
not every individual is willing to
respond to the survey request (and in a
democracy we cannot force them to do
so).
• Those individuals who do not respond
to the survey request are called
nonrespondents, and those who do are
called respondents.
• The percentage of respondents out of
the total sample is called the response
rate.
• For the Literary Digest poll, out of a
sample of 10 million people who were
mailed a mock ballot only about 2.4
million mailed a ballot back, resulting
in a 24% response rate.
• When the response rate to a survey is
low, the survey is said to suffer from
nonresponse bias.
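The response rate is a simple ratio. A minimal Python sketch using the Literary Digest figures follows; the function name is an assumption made for this example.

```python
def response_rate(respondents, sample_size):
    """Fraction of the sample that responded."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return respondents / sample_size

# Figures quoted in the case study: about 2.4 million ballots returned
# out of about 10 million mailed.
rate = response_rate(2_400_000, 10_000_000)
print(f"response rate: {rate:.0%}")
```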
• One of the significant problems with
the Literary Digest poll was that the
poll was conducted by mail. This
approach is the most likely to magnify
nonresponse bias, because people
often consider a mailed questionnaire
just another form of junk mail.
The Literary Digest story has two morals:
(1) You’ll do better with a well-chosen
small sample than with a badly
chosen large one, and
(2) watch out for selection bias and
nonresponse bias.
Examples
• Page 516, problems 17, 18, 19
(Solutions on following slides)
NOTE: students should omit problem 28 from
homework exercises
Solutions
17(a) the citizens of Cleansburg
17(b) the sampling frame is limited to that part of the
target population that passes by a city street corner
between 4:00 pm and 6:00 pm
18(a) 475
18(b) yes, this survey is subject to nonresponse bias.
The response rate is 475/(1313 + 475) ≈ 0.266.
19(a) the choice of street corner could make a difference
in responses collected
19(b) interviewer D. We are assuming that people who
live or work downtown are more likely to answer yes
than people in other parts of town.
19(c) yes. People on the street between 4 pm and 6 pm are
not representative of the population at large. Also,
the five street corners were chosen by the
interviewers and the passers-by are unlikely to
represent a cross-section of the target population.
19(d) omit
Convenience Sampling
• One commonly used shortcut in sampling is
known as convenience sampling. In convenience
sampling the selection of which individuals are in
the sample is dictated by what is easiest or
cheapest for the data collector, with no attempt
to get a representative sample.
• A classic example of convenience sampling is
when interviewers set up at a fixed location such as
a mall or outside a supermarket and ask passersby
to be part of a public opinion poll.
• A different type of convenience sampling occurs
when the sample is based on self-selection: the
sample consists of those individuals who volunteer
to be in it.
• Self-selection is the reason why many Area Code
800 polls are not to be trusted. Convenience
sampling is not always bad: at times there is no
other choice or the alternatives are so expensive
that they have to be ruled out.
Quota Sampling
• Quota sampling is a systematic effort to force the
sample to be representative of a given population
through the use of quotas: the sample should have
so many women, so many men, so many blacks, so
many whites, so many people living in urban areas,
so many people living in rural areas, and so on. The
proportions in each category in the sample should
be the same as those in the population.
• If we can assume that every important
characteristic of the population is taken into
account when the quotas are set up, it is
reasonable to expect that the sample will be
representative of the population and produce
reliable data.
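Setting quotas amounts to multiplying each category's share of the population by the sample size. The Python sketch below shows the idea for a single characteristic; the shares and the sample size are hypothetical values, not figures from these slides.

```python
def quotas(population_shares, sample_size):
    """Number of sample members required from each category."""
    return {
        category: round(share * sample_size)
        for category, share in population_shares.items()
    }

# Hypothetical population breakdown (illustrative values only).
shares = {"urban": 0.75, "rural": 0.25}
print(quotas(shares, 1000))
```

With shares that do not divide evenly, the rounded quotas may not add up exactly to the sample size and would need a small adjustment.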
Random Sampling
• The best alternative to human selection is to let the
laws of chance determine the selection of a sample.
• Sampling methods that use randomness as part of
their design are known as random sampling
methods, and any sample obtained through random
sampling is called a random sample (or a
probability sample).
Simple Random Sampling
• The most basic form of random sampling is called
simple random sampling. It is based on the same
principle as a lottery: any set of numbers of a given
size has the same chance of being chosen as any
other set of numbers of that size.
• In theory, simple random sampling is easy to
implement. We put the name of each individual in
the population in “a hat,” mix the names well, and
then draw as many names as we need for our
sample. Of course “a hat” is just a metaphor.
• These days, the “hat” is a computer database
containing a list of members of the population. A
computer program then randomly selects the
names.
• This is a fine idea for small, compact populations,
but a hopeless one when it comes to national
surveys and public opinion polls. For most public
opinion polls, especially those done on a regular
basis, the time and money needed to do this are
simply not available.
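For a small population, the computer version of the "hat" takes only a few lines. The Python sketch below uses the standard library's `random.sample`, which draws without replacement so that every subset of the requested size is equally likely. The list of names and the seed are placeholders invented for this example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population list standing in for the database of names.
names = [f"person_{i}" for i in range(1, 101)]
sample = simple_random_sample(names, 10, seed=13)
print(sample)
```

Fixing the seed is useful for checking work; in a real survey the draw would be left unseeded.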
13 Collecting Statistical Data
13.1 The Population
13.2 Sampling
13.3 Random Sampling
13.4 Sampling: Terminology and Key
Concepts
13.5 The Capture-Recapture Method
13.6 Clinical Studies
Survey
In a survey, we use a subset of the population, called a
sample, as the source of our information, and from
this sample, we try to generalize and draw conclusions
about the entire population.
• Parameter: a numerical measurement describing
some characteristic of a population
(population → parameter)
• Statistic: a numerical measurement describing
some characteristic of a sample
(sample → statistic)
Sampling Error
• We will use the term sampling error to describe the
difference between a parameter and a statistic used
to estimate that parameter.
• In other words, the sampling error measures how
much the data from a survey differ from the data
that would have been obtained if a census had been
used.
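A tiny worked case makes the definition concrete. In the Python sketch below the population values and the chosen sample are hypothetical, and the mean is used as the characteristic being measured. Any other numerical summary would work the same way.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical population of 10 measurements (illustrative values only).
population = [52, 47, 61, 58, 49, 55, 63, 50, 46, 59]
# One possible sample of size 4 drawn from that population.
sample = [47, 58, 63, 50]

parameter = mean(population)  # what a census would report
statistic = mean(sample)      # what the survey reports
sampling_error = statistic - parameter

print(parameter, statistic, sampling_error)
```

In practice the parameter is unknown, which is why the sampling error can usually only be estimated.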
• Sampling error can be attributed to two factors:
chance error and sampling bias.
Chance Error
• Chance error is the result of the basic fact that a
sample can only give us approximate information
about the population.
• Different samples are likely to produce different
statistics for the same population, even when the
samples are chosen in exactly the same way, a
phenomenon known as sampling variability.
• While sampling variability, and thus chance error, are
unavoidable, with careful selection of the sample and
the right choice of sample size they can be kept to a
minimum.
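Sampling variability is easy to observe by simulation. The Python sketch below draws many simple random samples from the same hypothetical population (the whole numbers 1 to 1000) and measures how much the sample mean moves from one sample to the next, once for a small sample size and once for a large one. The population, sample sizes, and seed are all assumptions made for illustration.

```python
import random
import statistics

def sample_means(population, n, repetitions, seed):
    """Means of `repetitions` simple random samples of size n."""
    rng = random.Random(seed)
    return [sum(rng.sample(population, n)) / n for _ in range(repetitions)]

# Hypothetical population: the whole numbers 1..1000 (parameter: mean = 500.5).
population = list(range(1, 1001))

small = sample_means(population, n=10, repetitions=200, seed=1)
large = sample_means(population, n=500, repetitions=200, seed=1)

# Spread of the statistic from sample to sample (sampling variability).
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
print(f"n = 10:  spread of sample means = {spread_small:.1f}")
print(f"n = 500: spread of sample means = {spread_large:.1f}")
```

The sample means differ from run to run in both cases, but the spread is much smaller for the larger sample size.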
Sampling Bias
• Sampling bias is the result of choosing a bad sample
and is a much more serious problem than chance
error.
• Getting a sample that is representative of the entire
population can be very difficult and can be affected
by many subtle factors.
• As opposed to chance error, sampling bias can be
eliminated by using proper methods of sample
selection.
Sampling Proportion
• The size of the sample is denoted by the letter n.
• The size of the population is denoted by the letter N.
• The ratio n/N is called the sampling proportion.
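As a quick numerical illustration, the Python sketch below computes a sampling proportion for n = 680 and N = 8325, the values that appear in the solution to problem 5(a) in these slides. The function name is an assumption made for this example.

```python
def sampling_proportion(n, N):
    """Ratio n/N of sample size to population size."""
    if N <= 0 or not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N and N > 0")
    return n / N

proportion = sampling_proportion(680, 8325)
print(f"n/N = {proportion:.4f} (about {proportion:.1%})")
```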
Examples
• Page 515, problems 5, 6, 7, 8
5(a) 680/8325
5(b) 45%
6(a) the registered voters in Cleansburg
6(b) the 680 registered voters polled by telephone
6(c) simple random sampling
7. Smith 3%, Jones 3%, and Brown 0%
8. Chance. The sample was chosen randomly to eliminate
selection bias, and there was a 100% response rate to
eliminate nonresponse bias.
Finding the N-value
• Finding the exact N-value of a large and elusive
population can be extremely difficult and
sometimes impossible.
• In many cases, a good estimate is all we really
need, and such estimates are possible through
sampling methods.
• The simplest sampling method for estimating the
N-value of a population is called the
capture-recapture method.
THE CAPTURE-RECAPTURE METHOD
■ Step 1. Capture (sample):
Capture (choose) a sample of size n1, tag (mark,
identify) the animals (objects, people), and
release them back into the general population.
■ Step 2. Recapture (resample):
After a certain period of time, capture a new
sample of size n2 and take an exact head count
of the tagged individuals (i.e., those that were
also in the first sample). Call this number k.
■ Step 3. Estimate:
The N-value of the population can be estimated
to be approximately
N ≈ (n1 · n2)/k.
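The three steps can be tried out on a simulated population whose true size is known, which shows how close the estimate typically lands. The Python sketch below is a minimal illustration; the population size, sample sizes, seed, and function names are all assumptions made for this example, and both samples are drawn as simple random samples.

```python
import random

def capture_recapture_estimate(n1, n2, k):
    """Estimate the population size N as (n1 * n2) / k."""
    if k <= 0:
        raise ValueError("no tagged individuals recaptured; estimate undefined")
    return n1 * n2 / k

def simulate(true_n, n1, n2, seed=0):
    rng = random.Random(seed)
    population = range(true_n)                # individuals labeled 0..true_n-1
    tagged = set(rng.sample(population, n1))  # Step 1: capture and tag
    second = rng.sample(population, n2)       # Step 2: recapture
    k = sum(1 for individual in second if individual in tagged)
    return k, capture_recapture_estimate(n1, n2, k)  # Step 3: estimate

k, estimate = simulate(true_n=1000, n1=200, n2=150, seed=0)
print(f"tagged in second sample: {k}, estimated N: {estimate:.0f}")
```

The estimate is undefined when no tagged individuals turn up in the second sample, which is why the function rejects k = 0.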
Capture-Recapture Method
The capture-recapture method is based on the
assumption that both the captured and recaptured
samples are representative of the entire population.
• Under this assumption, the proportion of tagged
individuals in the recaptured sample is
approximately equal to the proportion of
tagged individuals in the population.
• In other words, the ratio k/n2 is approximately
equal to the ratio n1/N. From this we can solve for N
and get
N ≈ (n1 · n2)/k
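The step from the two ratios to the estimate is a cross-multiplication. Written out in LaTeX, using the same symbols as the slides:

```latex
\[
\frac{k}{n_2} \approx \frac{n_1}{N}
\quad\Longrightarrow\quad
k \cdot N \approx n_1 \cdot n_2
\quad\Longrightarrow\quad
N \approx \frac{n_1 \cdot n_2}{k}
\]
```

The last step divides by k, so the estimate only makes sense when at least one tagged individual appears in the second sample.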
Example 13.6 Small Fish in a Big Pond
A large pond is stocked with catfish. As part of a
research project we need to estimate the number of
catfish in the pond. An actual head count is out of the
question (short of draining the pond), so our best bet is
the capture-recapture method.
Step 1. For our first sample we capture a predetermined
number n1 of catfish, say n1 = 200. The fish are tagged
and released unharmed back in the pond.
Step 2. After giving enough time for the released fish to
mingle and disperse throughout the pond, we capture a
second sample of n2 catfish. While n2 does not have to
equal n1, it is a good idea for the two samples to be of
approximately the same order of magnitude. Let’s say
that n2 = 150. Of the 150 catfish in the second sample, 21
have tags (were part of the original sample).
Assuming the second sample is representative of the
catfish population in the pond, the ratio of tagged fish in
the second sample (21/150) is approximately the same
as the ratio of tagged fish in the pond (200/N). This gives
the approximate proportion
21/150 ≈ 200/N
which in turn gives
N ≈ (200 · 150)/21 ≈ 1428.57
Obviously, the value N = 1428.57 cannot be taken
literally, since N must be a whole number. Besides, even
in the best of cases, the computation is only an estimate.
A sensible conclusion is that there are approximately N
= 1400 catfish in the pond.
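The arithmetic of the example, including the final rounding, can be reproduced in a few lines of Python. Rounding to the nearest hundred is one reasonable reading of "approximately 1400"; the slides do not prescribe a rounding rule.

```python
n1 = 200  # catfish tagged in the first sample
n2 = 150  # catfish in the second sample
k = 21    # tagged catfish found in the second sample

estimate = n1 * n2 / k
rounded = round(estimate, -2)  # nearest hundred (an assumed rounding rule)

print(f"raw estimate: {estimate:.2f}")
print(f"reported:     about {rounded:.0f} catfish")
```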