Chapter 5 - Data Collection and Sampling

October 8, 2012
RANDOM SAMPLING
& SAMPLING DISTRIBUTIONS
Chapter 11
The foundation for inferential statistics
Backtracking
A couple things we skipped with Chapter 10
Probability
The likelihood that an event will occur
Ranges from 0 to 1.0
Quick check
I roll a die with 6 sides.
What is the probability it lands with the “2” face up?
a) Who knows?
b) 1 out of six
c) 2 out of six
d) 2
e) 50-50
(If this is hard, read chapter 10).
Key Statistical Concepts…
Population
Sample
Subset
Population parameters
Sample statistics
Sampling…
Inferential statistics permits us to draw conclusions about a
population based on a sample.
Sampling (i.e. selecting a sub-set of a whole population) is
done for reasons of:
cost (it’s less expensive to sample 1,000 television viewers
than 100 million TV viewers) and
practicality (e.g. performing a crash test on every automobile
produced is impractical).
We want the sample to be a good representation of the
population.
The risks and rewards of sampling
Risks
1. The sample might not represent the larger population.
2. We might reach inaccurate conclusions.
Rewards
1. The sample might represent the larger population.
2. We might reach accurate conclusions at a very low cost.
Best approach: Realize there is some error in sampling, and
Estimate this error.
Different types of error in research
Inferential statistics will not address:
Nonsampling error:
unrelated to the sampling strategy
e.g., lousy measures, sloppy ratings
Biased sampling
Inferential statistics will address:
Sampling error:
differences between the sample and the population that exist only because of the observations that happened to be selected for the sample.
Random
We cannot get rid of this type of error.
Selection Bias or Biased Samples…
…occurs when the sampling plan is such that some members
of the target population are less likely to be selected for
inclusion in the sample.
e.g., I am interested in how my students are doing, but I
sample those in the front row; I am interested in depression,
but I only sample those in the hospital
We need to avoid this.
To do so, we use random sampling strategies.
Random sampling
Every observation in the population has an equal
chance of being sampled.
Every sample of a given size (n) has an equal chance
of being drawn from the population.
How do we do this?
Ideal: use random numbers from a site like
www.random.org to guide selection
Random Sampling



Sampling without replacement: no element may
appear more than once in a sample.
Sampling with replacement: an element may
appear more than once in a sample.
We will assume sampling with replacement.
Differences tend to be negligible when sample size
(n) is small relative to the size of the population (N),
which is often quite large.
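A minimal Python sketch of the two sampling schemes, assuming a made-up population of 10 numbered elements and a fixed seed (both are illustrative choices, not part of the lecture):

```python
import random

# Hypothetical population of N = 10 labeled elements (illustrative only).
population = list(range(1, 11))
rng = random.Random(2012)  # fixed seed so the run is repeatable

# Without replacement: no element can appear more than once.
without = rng.sample(population, k=5)

# With replacement: the same element may be drawn again.
with_repl = rng.choices(population, k=5)

print("without replacement:", without)
print("with replacement:   ", with_repl)
```

With only 10 elements, repeats under sampling with replacement are common; with a very large N they become rare, which is why the two schemes give nearly the same results when n is small relative to N.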
Simple Random Sampling…
A government income tax auditor must choose a sample of 5
of 11 returns to audit.

Person    Generate Random #
baker     0.87487
george    0.89068
ralph     0.11597
mary      0.58635
sally     0.34346
joe       0.24662
andrea    0.47609
mark      0.08350
greg      0.53542
aaron     0.37239
kim       0.73809

Selected  Person    Sorted Random #
1         mark      0.08350
2         ralph     0.11597
3         joe       0.24662
4         sally     0.34346
5         aaron     0.37239
-         andrea    0.47609
-         greg      0.53542
-         mary      0.58635
-         kim       0.73809
-         baker     0.87487
-         george    0.89068
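The auditor's procedure can be sketched in Python as follows. The names and random numbers are those from the example above; the helper name `draw_sample` is made up for illustration.

```python
import random

people = ["baker", "george", "ralph", "mary", "sally", "joe",
          "andrea", "mark", "greg", "aaron", "kim"]

# Random numbers from the example, in the same order as `people`.
keys = [0.87487, 0.89068, 0.11597, 0.58635, 0.34346, 0.24662,
        0.47609, 0.08350, 0.53542, 0.37239, 0.73809]

def draw_sample(names, random_keys, k):
    """Sort names by their random key and keep the k smallest."""
    ranked = sorted(zip(random_keys, names))
    return [name for _, name in ranked[:k]]

selected = draw_sample(people, keys, 5)
print(selected)
# ['mark', 'ralph', 'joe', 'sally', 'aaron']

# With fresh random keys, the selected returns will generally differ.
fresh_keys = [random.random() for _ in people]
print(draw_sample(people, fresh_keys, 5))
```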
Let’s watch what happens with random sampling.
Assume I want to do a study to understand the population
of people who are in the front row of my statistics class.
What data do we want to gather?
I will gather one sample from the front row using simple
random sampling, and then a second sample.
Sample 1
Let’s assign everyone in the front row a random number
from random.org
Random Integer Generator
Here are your random numbers:
32 30 95 4 88 32 13 60 51 91 6 88 75 58
Now let’s draw the two people with the lowest numbers as
our sample (4 and 6).
Sample 2
Let’s assign everyone in the front row a NEW random
number from random.org
Here are your random numbers:
75 94 71 57 14 26 41 6 75 38 41 88 67 23
Now let’s draw the two people with the lowest numbers as
our sample (6 and 14).
Gathering our data
● Let’s compute the mean for sample 1 and for sample 2.
● What do we expect will happen?
Stratified random sampling
Our sample might be more representative with stratified
random sampling
Define strata within the population (e.g., males, females)
Conduct random sampling from each stratum
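One way this could look in Python, as a sketch only: the roster, the stratum labels, and the helper `stratified_sample` are all hypothetical, and the same number of people is drawn from each stratum for simplicity.

```python
import random

def stratified_sample(units, stratum_of, per_stratum, rng):
    """Draw `per_stratum` units at random (without replacement) from each stratum."""
    strata = {}
    for unit in units:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {label: rng.sample(members, per_stratum)
            for label, members in strata.items()}

# Hypothetical class roster: (student id, stratum label).
roster = [(i, "female" if i % 2 == 0 else "male") for i in range(1, 21)]
rng = random.Random(11)  # fixed seed so the run is repeatable

sample = stratified_sample(roster, lambda u: u[1], per_stratum=3, rng=rng)
for label, members in sample.items():
    print(label, [student_id for student_id, _ in members])
```

Because every stratum contributes to the sample, no group can be missed by chance, which is how stratification can improve representativeness.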
Sampling Error…
Sampling error refers to differences between the sample and
the population that exist only because of the observations
that happened to be selected for the sample.
Every sample is likely to differ slightly from the population,
due to random variation.
Individual differences, timing effects, etc., all can lead to
subtle differences across samples.
Sampling error
Increasing the sample size will reduce sampling error.
It still exists though. We need to estimate the sampling
error.
This is the goal of inferential statistics.
Give it a shot… explain to your neighbor:
● What is sampling error?
● Is it influenced by the quality of your measures?
The goal of inferential statistics
Even when we use careful random sampling, careful
measurement, etc., we still have sampling error.
How can we estimate the magnitude of sampling error?
That is, how do we know how accurate our sample is as an
estimate of the population?
Random Sampling
Population (μ, σ)
Sample 1: $\bar{X}$, $s_X$
Sample 2: $\bar{X}$, $s_X$
The values of these statistics will
vary from sample to sample.
Sampling distributions




The distribution we would observe if we took all
possible samples of a given size from a population.
We want to estimate the population mean, standard
deviation, and other parameters.
There is a sampling distribution for each of these:
E.g., Sampling distribution of the mean
Sampling Distribution of the Mean
Proper analysis and interpretation of a sample
statistic requires knowledge of its distribution.
Process of Inferential Statistics:
1. "Start here": Population (parameter μ)
2. Select a random sample
3. Sample (statistic $\bar{x}$)
4. Use $\bar{x}$ to estimate μ
Developing a sampling distribution
The sampling distribution of the means– the distribution of
means if we took all possible samples of a given size from
the population, and calculated their means
Sampling distribution of the mean
We use this to estimate:
the level of error we would observe if we used sample
means to estimate the population mean
the probability of observing a sample mean that was
highly discrepant from the population mean
Sampling Distribution of the Mean
Some interesting characteristics which are captured
in the Central Limit Theorem…
Central Limit Theorem:
Given any population with mean (μ) and standard
deviation (σ), the sampling distribution of the mean
for sample size n will have:
● a mean equal to μ
● a standard deviation (known as the standard error of
the mean) $\sigma_{\bar{X}} = \sigma / \sqrt{n}$
and, as n → ∞, will approach a normal distribution.
Central Limit Theorem
Let’s break down the implications for
understanding the mean, the variability, and the
shape of the sampling distribution of the mean
The Mean of
the Sampling Distribution of the Mean
With repeated sampling (actually an infinite number of
times):
● The mean of the sampling distribution of the mean is
the same as the population mean (μ) from which the
scores were drawn.
● True regardless of n, σ, and the shape of the population.
Standard deviation of
the sampling distribution of the mean
The standard deviation of the distribution of
sample means tells us how close the sample
mean is likely to be to the population mean,
or the spread of the sample means.
● Very important tool for all of inferential
statistics
Sampling Distribution of the Mean
The standard deviation of the sampling distribution of
the mean is the standard error of the mean.
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
Sampling Distribution of the Mean
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
The sample means vary less when:
● the scores from the population vary less.
● sample size (n) is greater.
Note that there will be many random sampling
distributions of the mean, each based on a different
sample size (n).
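To see how the standard error shrinks as n grows, here is a short Python sketch. The value σ = 1 and the sample sizes are assumptions chosen for illustration.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 1.0  # assumed population standard deviation
for n in (2, 8, 25, 100):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.4f}")
# n =   2: standard error = 0.7071
# n =   8: standard error = 0.3536
# n =  25: standard error = 0.2000
# n = 100: standard error = 0.1000
```

Note that quadrupling the sample size only halves the standard error, because n enters through its square root.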
Shape of the sampling distribution of the
mean
Even when the underlying population distribution is
NOT normally distributed,
if your sample size is large enough,
the sampling distribution of the mean will be approximately normal.
This is a huge advantage in conducting inferential
statistics.
Central Limit Theorem
Allows one to study populations with differently shaped
distributions
Creates the potential for applying the normal distribution
to many problems when sample size is sufficiently large
A brief detour: the Monte Carlo
simulation
We can use a Monte Carlo simulation to
demonstrate what would happen if we took many
samples from a population with known
characteristics.
● Several factors will affect the simulation.
● Let’s look at two of these:
  ● Distribution of the population
  ● Sample size
Monte Carlo simulations
Assume a population distribution. Draw many
samples (of size n) from this population and plot
the samples (or their descriptive statistics)
across repeats.
● NOTE: The MCS is for demonstration purposes
only. We would never do a Monte Carlo
simulation to analyze the results of a real
experiment.
● Remember, in a real experiment we usually don't
know the population parameters; we only have
the sample statistics.
Sampling by Monte Carlo simulation
Consider a roulette wheel with numbers 00, 0,
and 1-36.
● Spin the wheel n times, and make a frequency
histogram of the results.
[Figure: frequency histogram of the spins; x-axis: number on wheel; y-axis: n.]
Sampling by Monte Carlo simulation
Draw one sample of size n, calculate the sample
mean.
● Now repeat this procedure 10000 times.
● What does the distribution of means look like?
[Figure: each sample yields a mean (M); keep going to 10,000 samples to build the sampling distribution of the mean.]
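The roulette simulation can be sketched in Python roughly as follows. Two assumptions are mine, not the lecture's: "00" is scored as 0 so that a mean can be computed, and each sample uses n = 30 spins.

```python
import random
import statistics

# 38 pockets: "00", "0", and 1-36. Assumption: "00" is scored as 0.
pockets = [0, 0] + list(range(1, 37))
rng = random.Random(8)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Spin the wheel n times (with replacement) and return the mean result."""
    return statistics.fmean(rng.choices(pockets, k=n))

n = 30            # spins per sample (hypothetical choice)
repeats = 10_000  # number of samples, as on the slide
means = [sample_mean(n) for _ in range(repeats)]

pop_mean = statistics.fmean(pockets)
pop_sd = statistics.pstdev(pockets)
print("population mean:     ", round(pop_mean, 3))
print("mean of sample means:", round(statistics.fmean(means), 3))
print("SD of sample means:  ", round(statistics.pstdev(means), 3))
print("sigma / sqrt(n):     ", round(pop_sd / n ** 0.5, 3))
```

The population here is roughly flat rather than bell-shaped, yet a histogram of `means` should look close to normal, centered near the population mean.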
Sampling Distribution of the Mean
To visualize the relationship between the
sample mean and the underlying population, we
can do the following experiment many times,
with different numbers of observations.
[Figure: Population → sample of size n → take mean (M); the collected means form the sampling distribution of the mean.]
The Sampling Distribution Of the Mean
is distributed approximately normally, given a large enough sample size, even if the
population and samples from it are not distributed normally!
[Figure: Population → sample of size n → take mean (M); the collected means form the sampling distribution of the mean.]
The Sampling Distribution Of the Mean
Consider a population with mean 0 and standard
deviation 1.
[Figure: density plot of a sample of 100000 from this population; x-axis from -4 to 4; M = 0, SD = 1.]
The Sampling Distribution Of the Mean
for large samples
When the size of the sample is large, the
sample means will be fairly close to the
population mean.
[Figure: frequency histograms of three samples of 100: M = -0.01, SD = 1.1; M = -0.06, SD = 0.94; M = 0.03, SD = 1.]
The Sampling Distribution Of the Mean
for large samples
When the size of the sample is large:
The means across samples will be pretty similar– they don’t vary much.
The mean of the Sampling Distribution Of the Mean = population mean.
[Figure: frequency histogram, DOSM (distribution of sample means) for sample size 100; M = -0.00052, SD = 0.0999769.]
The Sampling Distribution Of the Mean
for large samples
When the sample size is large
the Sampling Distribution Of the Mean will have a small SD.
[Figure: Population → sample of size 20 → take mean (M); the collected means form the distribution of sample means.]
The Sampling Distribution Of the Mean for small samples
When the sample size is small:
the sample means vary a lot
the standard deviation of the sampling
distribution is larger
[Figure: frequency histograms of three samples of 8: M = 0.9, SD = 0.41; M = -0.3, SD = 0.62; M = -0.4, SD = 1.3.]
The Sampling Distribution Of the Mean for small samples
When the sample size is small:
mean of the Sampling Distribution Of the Mean is still = population mean,
but the SD of the Distribution of Sample Means is large.
[Figure: frequency histogram, DOSM for sample size 8; M = -0.00034, SD = 0.353312.]
The Sampling Distribution Of the Mean for very small samples
When the sample size is very small,
the mean of the Distribution of Sample Means still = population mean,
the standard deviation of the Distribution of Sample Means is huge.
[Figure: frequency histogram, DOSM for sample size 2; M = -0.001, SD = 0.705306.]
Sample size affects the SD of the Distribution
of Sample Means
The standard deviation of the Distribution of Sample Means falls as sample size increases:
[Figure: SD of DOSM (y-axis, 0.1 to 0.7) plotted against sample size (x-axis, 0 to 100).]
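A simulation along these lines can be sketched in Python. It assumes a normal population with mean 0 and SD 1, and 5,000 repeats per sample size; the repeat count and the function name `simulate_dosm` are my choices, not taken from the slides.

```python
import math
import random
import statistics

rng = random.Random(2012)  # fixed seed so the run is repeatable

def simulate_dosm(n, repeats=5_000):
    """Draw `repeats` samples of size n from N(0, 1); return (mean, SD) of the sample means."""
    means = [statistics.fmean([rng.gauss(0, 1) for _ in range(n)])
             for _ in range(repeats)]
    return statistics.fmean(means), statistics.pstdev(means)

results = {n: simulate_dosm(n) for n in (2, 8, 100)}
for n, (m, sd) in results.items():
    print(f"n = {n:>3}: M = {m:+.4f}, SD = {sd:.4f}, "
          f"sigma/sqrt(n) = {1 / math.sqrt(n):.4f}")
```

The simulated SDs should land close to σ/√n (about 0.71, 0.35, and 0.10), in line with the values reported for the DOSM plots above.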
Optimal sample size
In any single sample, the sample mean is more
likely to be close to the population mean if the
sample size is larger.
Always use the largest sample size that you can
afford!
If you must use a small sample, remember that
the sample statistics might not accurately
reflect the population parameters.
Sampling distribution of the mean
● Assume you have taken all samples of size 25
from a population of 400. The population is
skewed, with a mean of 0 and a SD of 1.
● What will the shape of the sampling
distribution of the means be?
a) Skewed
b) Normal
c) Who knows?
What does all this allow us to do?
If we know the distribution of means for samples drawn
from a given population,
we can estimate the probability that a sample was drawn
from a population with a certain mean and SD.
This is inferential statistics.
Converting Sample Mean to z
$z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma / \sqrt{n}}$
Example 1
Assume a normally distributed population with μ = 70
and σ = 20. Your sample size is 25. What is the
probability of obtaining a random sample with a
mean of 80 or higher?
Convert sample mean to z.
$z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma / \sqrt{n}} = \frac{80 - 70}{20 / \sqrt{25}} = 2.5$
Refer to z table.
The probability of
obtaining such a
mean or larger in
random sampling is
.0062.
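The same calculation can be checked in Python. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8+) in place of the printed z table; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 70, 20, 25   # population mean, population SD, sample size
sample_mean = 80

standard_error = sigma / sqrt(n)           # 20 / 5 = 4
z = (sample_mean - mu) / standard_error    # (80 - 70) / 4 = 2.5
p_upper = 1 - NormalDist().cdf(z)          # area to the right of z

print(f"z = {z:.2f}, P(mean >= {sample_mean}) = {p_upper:.4f}")
# z = 2.50, P(mean >= 80) = 0.0062
```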