Why was the Literary Digest poll so wrong?

Statistics 1
Suppose we want to find out how many residents of Imperial Valley believe in the
statewide legalization of marijuana provided the revenue from the legalization resulted
in free college tuition for all students in California. Which sampling technique do you
believe will generate the most accurate results?
 Student A: Asked all 5000 local high school students and Imperial Valley College
students and all 9652 Imperial Valley College students.
 Student B: Randomly selected 500 most likely voters how they would vote on this
measure.
Chapter Problem
Why was the Literary Digest poll so wrong?
Founded in 1890, the Literary Digest
magazine was famous for its success in
conducting polls to predict winners in
presidential elections. The magazine
correctly predicted the winners in the
presidential elections of 1916, 1920, 1924,
1928 and 1932. In the 1936 presidential
contest between Alf Landon and Franklin D.
Roosevelt, the magazine sent out ten million
ballots and received 1,293,669 ballots for
Landon and 972,897 ballots for Roosevelt,
so it appeared that Landon would capture
57% of the vote. The size of this poll is
extremely large when compared to the sizes
of other typical polls, so it appeared that
the poll would correctly predict the winner
once again. James A. Farley, Chairman of the
Democratic National Committee at the time,
praised the poll by saying this: “Any sane
person cannot escape the implication of such
a gigantic sampling of popular opinion as is
embraced in The Literary Digest straw vote.
I consider this conclusive evidence as to the
desire of the people of this country for a
change in the National Government. The
Literary Digest poll is an achievement of no
little magnitude. It is a poll fairly and
correctly conducted.” Well, Landon received
16,679,583 votes to the 27,751,597 votes
cast for Roosevelt. Instead of getting 57%
of the vote as suggested by The Literary
Digest poll, Landon received only 37% of the
vote. The Literary Digest magazine suffered
a humiliating defeat and soon went out of
business.
In that same 1936 presidential election,
George Gallup used a much smaller poll of
50,000 subjects, and he correctly predicted
that Roosevelt would win. How could it
happen that the larger Literary Digest poll
could be wrong by such a large margin? What
went wrong? As you learn about the basics
of statistics in this chapter, we will return
to the Literary Digest poll and explain why it
was so wrong in predicting the winner of the
1936 presidential contest.
Statistics 2
Polling Data



New Hampshire Primary:November 10, 2011 – January 9, 2012
Predictions 2012 Presidential Elections
2008 Presidential Election Predictions and Final Results
www.realclearpolitics.com
Statistics 3
2012 Predictions: Rasmussen reports
www.rasmussenreports.com
www.realclearpolitics.com
Statistics 4
1.1
What can we learn from the Literary Digest and Gallup polls?
 What conjectures can we make as to why the predicted Literary Digest poll so far
off from the actual results?
o Receiving 2.3 million respondents vs. a sample of 50,000?
 Respondents constitute a sample, whereas the population consists of the entire
collection of all adults eligible to vote.
DEFINITIONS: Data, Statistics, population, census, sample





Data are a collection of observations (such as measurement, genders, survey responses.
Statistics is the science of
o planning studies and experiments,
o obtaining data,
o organizing the data,
o summarizing, presenting, analyzing, and interpreting the data
o and finally, drawing conclusions based on the data so that people can make
informed decisions.
A population is the complete collection of all individuals to be studies
o All individuals including, but are not limited to scores, people, measurements
o The collection is complete in the sense that the set includes all individuals that are
to be studied.
A census is the collection of data from every member of the population.
A sample is a subcollection of members selected from the population
Example 1:
The Literary Digest poll was a ______________________ of 2.3 million respondents. What
would the population consist of?
Remember—garbage in, garbage out! Sample data must be collected through a
process of ________________________ selection. If sample data are not
collected in an appropriate way, the data may be completely ________________!
Statistics 5
1.2
STATISTICAL THINKING
Key Concept…
 When conducting a statistical analysis of data we have collected or analyzing a statistical
analysis done by someone else, we should not rely on blind acceptance of mathematical
calculations. It is imperative that we consider these factors:
Context of the data
Source of the data
Sampling method
Conclusions
Practical implications

In learning to think statistically, common sense and practical considerations are usually
more important than the implementation of cookbook formulas and calculations.
Example 2: Here is some data! Your grade in this dependent upon doing the following:


In the next 2 minutes, please summarize the data.
By class Thursday, present a power point to our class summarizing, analyzing and
interpreting the data.
650
1050
967
500
1700
2000
1100
1300
1400
2250
800
3500
1200
1250
24249
20666
19413
21992
21399
22022
25859
20390
23738
23294
19063
30131
18698
25348
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2250
3000
1750
1525
1500
1500
1250
1200
1600
425
1450
900
675
1450
25642
23074
28349
24644
23245
24378
23246
23695
23258
19325
20397
17256
19545
20780

At this point, you should be saying to yourself: _________________________

Why?
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Statistics 6
Example 2:


In the next 2 minutes, please summarize the data.
By class Thursday, present a power point to our class summarizing, analyzing and
interpreting the data.
Description: These data for the 1991 season of the National Football League were reported by
the Associated Press. QB is the salary (in thousands) of quarterback, TOTAL: total team
salaries (in thousands of dollars) and NFC 1 means National football conference, NFC 0 means
American Football Conferrence.
Number of cases: _______
Variable Names:
Statistics 7
Example 3: Refer to the data in the table below.
 The x-values are weights (in pounds) of cars;
 The y-values are the corresponding highway fuel consumption amounts (in mi/gal).
Car Weights and Highway Fuel Consumption Amounts
WEIGHT
4035
3315
4115
3650
3565
FUEL
CONSUMPTION
26
31
29
29
30
a. Context of the data.
i. Are the x-values matched with the corresponding y-values? That is, is each x-value
somehow associated with the corresponding y-value in some meaningful way?
ii.
If the x and y values are matched, does it make sense to use the difference
between each x-value and the y-value that is in the same column? Why or why not?
b. Conclusion. Given the context of the car measurement data, what issue can be addressed
by conducting a statistical analysis of the values?
c. Source of the data. Comment on the source of the data if you are told the car
manufacturers supplied the values. Is there an incentive for car manufacturers to report
values that are not accurate?
d. Conclusion. If we use statistical methods to conclude that there is a correlation between
the weights of cars and the amounts of fuel consumption, can we conclude that adding
weight to a car causes it to consume more fuel?
Statistics 8
Example 4: Form a conclusion about statistical significance. Do not make any formal
calculations. Either use results provided or make subjective judgements about the
results.
One of Gregor Mendel’s famous hybridization experiments with peas yielded 580 offspring with
152 of those peas (or 26%) having yellow pods. According to Mendel’s theory, 25% of the
offspring should have yellow pods. Do the results of the experiment differ from Mendel’s
claimed rate of 25% by an amount that is statistically significant?
1.3
TYPES OF DATA
Key Concept…
A goal of statistics is to make inferences, or generalizations, about a population. We discussed
the meaning of population and sample, now we need to learn to learn how distinguish between
cases when we have data of the entire population or data from a sample only and from different
types of numbers.
Parameter
Statistic
Quantitative Data
Categorical Data
DEFINITIONS: Parameter and a statistic

A parameter is a numerical measurement describing some characteristic of a
________________________.

A statistic is a numerical measurement describing some characteristic of a
_____________________.
Statistics 9
Example 5: Determine whether the given value is a statistic or a parameter.
a. 45% of the students in a calculus class failed the first exam.
b. 25 calculus students were randomly selected from all the sections of calculus I. 38% of
these student failed the first exam.
DEFINITIONS: Quantitative and Categorical data

Quantitative (aka numerical) data consist numbers representing counts or
measurements.
o Examples: grades on exam 1 of statistics students, length of hair, age (in years)
of respondents

Categorical (aka qualitative or attribute) data consist names or labels that are not
numbers representing counts or measurements.
o Examples: Student ID numbers, number on the Jersey of a soccer player, social
security numbers, phone numbers
Example 6: Please provide two examples each of:
a. Quantitative data
b. Categorical data
Statistics 10
DEFINITIONS: Discrete and continuous data

Discrete data result when the number of possible values is either a finite number or a
countable number.
o Examples: The number of students in this room.

Continuous (aka numerical) data result from infinitely many possible values that
correspond to some continuous scale that cover a range of values without gaps,
interruptions or jumps.
o Example: Age, volume of water in a jug, length of your hair
Example 7: Please provide two examples each of:
a. Discrete data
b. Continuous data
Definitions: Levels of Measurement




,
Nominal level of measurement --characterized by data that consists of names, labels, or
categories only.
 The data cannot be arranged in an ordering scheme (such as low to high).
o Yes/No/undecided; Political Party
Ordinal level of measurement – Data that can be arranged in some order, but
differences (obtained by subtraction) between data values either cannot be determined
or are meaningless.
o Course grades, ranks of colleges
Ratio level of measurement – similar to interval level, with the additional property that
there is also a natural zero starting point (where zero indicates that none of the quantity
is present). For values at this level, differences and ratios are both meaningful.
o Distances, Prices
Interval level of measurement – similar to ordinal level, with the additional property
that the difference between any two data values is meaningful. However, data at this
level do not have a natural zero starting point.
o Temperatures, years
Statistics 11
Example 8: Please provide an example of each level of measurement:
(a) Nominal
(b) Ordinal
(c) Ratio
(d) Interval
1.4 & 1.5: CRITICAL THINKING & COLLECTING SAMPLE DATA
Key Concepts…
We focus on meaning of information obtained by studying data as well as the methods used in
collecting sample data.
Goals in this section are to:
Learn how to interpret information based on data.
Thinking carefully about the context of the data, the source of the data, the method used
in data collection, the conclusions reached and practical implications.
Misuse of statistics
Evil intent on the part of a dishonest person
Unintentional errors on the part of people who do not know better.
σ If sample data are not collected in the appropriate way, the data may be so completely useless that no
amount of statistical torturing can salvage them.
"Lies, damned lies, and statistics" is a phrase describing the persuasive power of numbers, particularly
the use of statistics to bolster weak arguments, and the tendency of people to disparage statistics that
do not support their positions. It is also sometimes colloquially used to doubt statistics used to prove an
opponent's point.
Statistics 12
Sampling Techniques

Voluntary Response Sample (self-selected sample) – respondents themselves
decide whether to be included
o

Literary Digest study
Simple Random Sample – Every subject in the population has an equally likely
chance of being selected (each subject has the same probability of
being selected). Any pair (or triples, quads, etc) has an equally likely
chance of being selected.
o

Drawing names from a hat
o
First draw: Population size “N” – each subject has probability 1/N
o
Second draw: Population size “N – 1” – each subject has prob 1/(N-1)
Random Sample – members from a population are selected in such a way that
each individual member in the population has either an equal or unequal
chance of being selected; that is the probability of selected different
subjects need not be equal.
o

NBA draft
Systematic Sample – select a starting point and then select every kth element in
the population

o
Put 30 numbers in a hat (there are 30 students on my roster).
o
Choose every 6th student to survey.
Convenience Sampling – use results that are very easy to obtain.
o
It is convenient for me to take my sample at the grocery store at 10 a.m. close to
the 55+ living.
BTW – I am asking people whether they believe that abortion should be
legal/illegal.

Stratified Sampling – subdivide the population into at least two different
subgroups (strata) so that subjects within the same subgroup share the
same characteristics (such as gender or age bracket), then we draw a
sample from each subgroup (or stratum)
o
Suppose a group want to learn the chances of a bond issue passing. A community
is composed of two distinct income groups – low and high. Population of community
is stratified into low and high income and a simple random sample from the low
and high income groups.

Cluster sampling – first divide the population into sections (or clusters), then
randomly select some of those clusters, and then survey every member in the
cluster.
o

Randomly select airline flights – then survey every passenger on the plane.
Multistage sampling – use a different sampling method in different stages
Statistics 13
Example 9: Determine what sampling technique: Voluntary, Random, simple random,
systematic, cluster, convenience, or stratified sample would be useful in each situation.
(a)
(b)
An experimenter wants to estimate the average water consumption per family in a
city. He chooses a random starting point on every block and uses the water bill
of every 4th house.
____________________
Six different health plans were randomly selected and all of their members were
surveyed about customer satisfaction.
____________________
(c)
I surveyed all of my students to obtain sample data consisting of the number of
credit cards students possess.
_____________________
(d)
In a study of college programs, 820 students are randomly selected from those
majoring in communications, 1463 were randomly selected from those majoring in
mathematics and 760 were randomly selected from those majoring in history.
_____________________
(e)
Wes set up a booth outside the cafeteria and waited for students to approach
him. He asked students whether professor’s at IVC should be allowed to assign
homework.
_____________________
(f)
On your desk, you have five flash cards. Each flash card either has A, B, C, D
or F on it. Draw your flash card and that is your grade for the course.
_____________________
Statistics 14
Misuse of Data

Reported Results – take the data yourself rather than allowing someone to report their
data.
o
I will measure the height of each of you rather than you report to me how tall
you are.

Small Samples – conclusions shall not be reached on samples that are “not large enough”.
The question arises – how large of sample is large enough – and this brings us to a
discussion on the normal distribution.
o

I will ask five students on their feeling about euthanasia.
Misused Percentages – misleading or unclear percentages. (% review at end of section)
o
Continental Airlines claimed that they have improved 100% on lost luggage in the
last six months. The New York Times interpreted (correctly) to mean that no
baggage is now being lost – an accomplishment not yet enjoyed by Continental
Airlines.

Loaded Questions – if survey questions are not worded carefully, the results of a study
can be misleading. Survery questions can be “loaded” or intentionally worded to
elicit a desired response.

Order of questions -- results from a poll conducted in Germany
o
Would you say that traffic contributes more or less to air pollution than industry?

o
Would you say that industry contributes more or less to air pollution than traffic?


Traffic presented first, 45% blamed traffic, 27% blamed industry
Industy presented first, 24% blamed traffic and 57% blamed industry.
Missing Data
o
Sometimes missing data is out of control of experimenter (subjects dropping out
of a study for reasons unrelated to study)

o
Low income people less likely to respond to a questions such as income
o
U.S. Census – missing data from homeless people –
o
people that do not respond to phone surveys (or don’t have land lines)
Self-Interest study
o
Aware of monetary gains from results.

Our survey 250 people from hiring professionals demonstrated that you
are more likely to get hired if you purchase Tistoni shoes for men (price
tag: $38,000). (By the way, I work for Tistoni and as part of my
marketing, I took every one of these executives to dinner).

Deliberate Distortions – see graphs next page

Small samples – make sure your sample size is large enough!
Statistics 15
Example 10: Do you want to work for company A or company B. The x-axis represents
time and the y-axis represents salary. All aspects of both Company A and
Company B are equivalent.
Company B
Salary
Salary
Company A
Salary
Salary
Bias – everything sampling technique will generate some bias. Goal of the researcher should
be to minimize bias. These are two biases that can be minimized. The non-response
bias – at least report the non-response, or take another random sample from the
differing population.

Non-response bias – answers of those that respond potentially differ from
respondents that did not respond.

Response Bias – respondents answering questions in the way that they believe
the surveyor wants them to respond.
Correlation vs. Causality
Example 11: It is a known fact that as ice cream sales increase, shark attacks increase.
Therefore, we can conclude that sharks love to eat people that eat ice cream!! Obviously,
‘we’ ice cream eaters taste better!
Is this a valid conclusion? Why/ why not?
Statistics 16
Correlation vs. Causality and Confounding
Finding a statistical association between two variables and to conclude that one of the
variables causes (or directly affects) the other variable.

Two variables may seem to be linked (smoking and pulse rate), but the increase in pulse
rate may/may not be caused by smoking.

The relationship of the shark attacks vs. ice cream sales as well as smoking vs. pulse rate
are correlations.

Even though we may find a number of cases to be true – we cannot conclude that one
variable caused the other
o
Need to consider confounding variables.
Confounding Variables – not able to distinguish among the affects of different factors.
Moral of the story – CORRELATION DOES NOT IMPLY CAUSALITY!!!
Example 12: David’s study demonstrated that tall people read better.
(a)
Does being tall CAUSE people to read better?
(b)
Is it possible that there is a correlation between being tall and reading better?
(c)
What are some possible confounding variables?
Statistics 17
Studies

Observational Study – we observe and measure specific characteristics, but we
do not attempt to modify the subjects being studied.
o
Cross-Sectional Study – Data are observed, collected and measured at one
particular point in time. Aims to provide data on the entire population.
o
Retrospective study (Observational Study)

Case-Control study – Data are observed, collected and measured
from a particular subset of the population.

Used frequently in epidemiology – compare subjects with a
particular condition “the cases” with those that do not have
the condition “the controls” but who are otherwise similar.

Data are collected from the past by going back in time (through
examination of records, interviews, and so on.

o
A type of case-control study (look back historically)
Prospective (longitudinal) study

Data are collected in the future from groups sharing common
factors (called cohorts).

Experiment – we apply some treatment and then proceed to observe its effect
on the subjects (Subjects in experiments are called experimental units)
**Note: In an observational study, no treatment is given.
Example 13: Identify whether each study is cross-sectional, retrospective or prospective.
(a)
Qualcomm funded a project that studied the affects of 4th – 6th gradestudents who were
taught by a math specialist ability to communicate mathematics over a period of 4 years.
(b)
Physicians at the Mount Sinai Medical Center plan to study emergency personnel who worked
at the site of the terrorist attacks in New York City on September 11, 2001. They plan to
study these workers from now until several years into the future.
(c)
University of Toronto researchers studied 669 traffic crashes involving drivers with cell
phones. They found that cell phone use quadruples the risk of a collision.
Statistics 18
Experimental Design
 Designing an experiment is important. A faulty design can result in ‘GIGA’ (as your author
suggests) – garbage in, garbage out!
 Famous example: The Salk Vaccine Experiment
o 1954 – test the effectiveness of the Salk vaccine in preventing polio.
o 200,745 children given a treatment
 201,229 injected with placebo (essentially, a sugar pill that has no affect)
 200,745 injected with Salk vaccine injection
o Result: 115 injected with placebo developed paralytic polio
33 developed paralytic polio of those injected with vaccine.
Types of Experiments

Randomization – subjects assigned to different groups through a process of random
selection. Equivalent to flipping a coin!
o
The children in the Salk Vaccine experiment were assigned treatment or control
group based on random selection.
o

Use chance as a way to create similar groups.
Replication – replication of an experiment on more than one subject.
o
Sample sizes should be large enough so that the erratic behavior characteristic of
small samples do not disguise the true effects of differing treatments.
o
Larger sample sizes increase the change of recognizing different treatment effects;
but large sample sizes do not necessarily indicate a good sample.

Completely Randomized Experimental Design – assign subjects to different
treatment groups through a process of random selection (Salk Vaccine)

Randomized Block Design – Group subjects that are similar but blocks differ in ways
that might affect the outcome of an experiment (i.e. gender and assigning
treatment for heart medication ). A block is a group of subjects that are
similar.

Rigorously Controlled Design – subjects are assigned to different treatment groups
in ways that are important to experiment

Matched Pairs Design – Compare exactly two treatment groups by using subjects matched in
pairs that somehow are related and/or share similar characteristics. Can be subject (twin
studies)…Coke/Pepsi anyone??
Statistics 19
Blinding & Placebo Affect

Blinding is a technique used in an experiment in which the subject doesn’t know
whether he or she is receiving the treatment or the placebo.

In a double-blind experiment, both the subject and the investigator do not know
whether the subject received the treatment or the placebo.

Placebo affect occurs when an untreated subject reports an improvement in
symptoms.
** Note: Salk Vaccine was a double-blind experiment.
Sampling Errors
 There will always be sampling error, no matter how well the experimental design is
planned.
 If we randomly sample 1000 IVC students and asked if they obtained a high school
diploma or a GED the result will differ slightly from another 1000 students asked the
same question. (we often hear the term “margin of error” in reporting statistics. This will
be discussed later).

Sampling Error – the difference between a sample result and the true population
result; such an error results from change sample fluctuations.

Non-sampling error – occurs when the sample data are incorrectly collected,
recorded, or analyzed (such as selected a biased sample, using a defective
measurement instrument, or copying the data incorrectly).
o
The student that ends up with 1005% in the class had an incorrect test score
input in the grade book!
Statistics 20
Example 14: Choose something you want to study and design an experiment.
Statistics 21
Homework Chapter 1
1.1
1.2
1.3
1.4
1.5
NA
1-18, 23, 26, 28
1-32, 34
1,3, 4, 5, 6, 8, 9, 10, 12, 13, 15-19, 21, 24, 25, 28, 30
1-4,6, 9, 11, 12, 13, 15, 16, 18, 19, 21-26, 27, 29, 31
PERCENTAGE REVIEW
“of” often means multiply
Percent means per hundred so
n% 
Percentage of: Change the % to
n
100
1
then multiply.
100
Fraction to percentage: Divide by denominator and multiply by 100.
Decimal to percentage: Multiply the decimal by 100 and put in the percent symbol.
Percentage to decimal: Remove the percent symbol and divide by 100.
Perform the indicated operation.
a. 12% of 1200
c. 12% of 1200
b. Write 5/8 as a percentage.
d. Write 5/8 as a percentage.