Sampling

STK 4600: Statistical methods for
social sciences.
Survey sampling and statistical demography
Surveys for households and individuals
Survey sampling: 4 major topics
1. Traditional design-based statistical inference (7 weeks)
2. Likelihood considerations (1 week)
3. Model-based statistical inference (3 weeks)
4. Missing data – nonresponse (2 weeks)
Statistical demography
• Mortality
• Life expectancy
• Population projections
• 2 weeks
Course goals
• Give students knowledge about:
– planning surveys in social sciences
– major sampling designs
– basic concepts and the most important estimation methods in
traditional applied survey sampling
– Likelihood principle and its consequences for survey
sampling
– Use of modeling in sampling
– Treatment of nonresponse
– A basic knowledge of demography
But first: Basic concepts in sampling
Population (target population): The universe of all units of interest for a certain study
• Denoted U = {1, 2, ..., N}, with N being the size of the population. All units can be identified and labeled
• Ex: Political poll – all adults eligible to vote
• Ex: Employment/unemployment in Norway – all persons in Norway, age 15 or more
• Ex: Consumer expenditure: unit = household
Sample: A subset of the population, to be observed. The sample should be "representative" of the population
Sampling design:
• The sample is a probability sample if all units in the sample have been
chosen with certain probabilities, and such that each unit in the population
has a positive probability of being chosen to the sample
• We shall only be concerned with probability sampling
• Example: simple random sample (SRS). Let n denote the sample size.
Every possible subset of n units has the same chance of being the sample.
Then all units in the population have the same probability n/N of being
chosen to the sample.
• The probability distribution for SRS on all subsets of U is an example of
a sampling design: The probability plan for selecting a sample s from the
population:
p(s) = 1/\binom{N}{n} if |s| = n
p(s) = 0 if |s| \neq n
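The SRS design can be made concrete on a tiny population. A minimal sketch (in Python rather than the course's R; the population size is invented): enumerate all subsets of size n, give each p(s) = 1/\binom{N}{n}, and check that every unit's inclusion probability comes out to n/N.

```python
# Sketch: the SRS design on a small population U = {1, ..., 5} with n = 2.
# Every subset s with |s| = n gets p(s) = 1/C(N, n); all others get 0.
from itertools import combinations
from math import comb
from fractions import Fraction

N, n = 5, 2
U = range(1, N + 1)
samples = list(combinations(U, n))   # all subsets s with |s| = n
p = Fraction(1, comb(N, n))          # p(s) = 1/C(N, n), exact arithmetic

# Inclusion probability of unit i: sum of p(s) over samples containing i
for i in U:
    pi_i = sum(p for s in samples if i in s)
    assert pi_i == Fraction(n, N)    # every unit has pi_i = n/N = 2/5
```

The assertions pass because each unit belongs to \binom{N-1}{n-1} of the \binom{N}{n} equally likely samples.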
Basic statistical problem: Estimation
• A typical survey has many variables of interest
• Aim of a sample is to obtain information regarding totals or
averages of these variables for the whole population
• Examples : Unemployment in Norway– Want to
estimate the total number t of individuals unemployed.
For each person i (at least 15 years old) in Norway:
yi  1 if person i is unemployed , 0 otherwise
Then :
t  iN1 yi
7
• In general, variable of interest: y, with y_i equal to the value of y for unit i in the population; the total is denoted t = \sum_{i=1}^N y_i
• The typical problem is to estimate t or t/N
• Sometimes it is also of interest to estimate ratios of totals.
Example – estimating the rate of unemployment:
y_i = 1 if person i is unemployed, 0 otherwise
x_i = 1 if person i is in the labor force, 0 otherwise
with totals t_y, t_x
Unemployment rate: t_y / t_x
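The ratio-of-totals definition can be illustrated with a toy sketch (Python, with invented 0/1 data):

```python
# Toy sketch (invented data): the unemployment rate as a ratio of
# population totals t_y / t_x, with y and x the indicators defined above.
y = [1, 0, 0, 1, 0, 0, 0, 0]   # y_i = 1 if person i is unemployed
x = [1, 1, 0, 1, 1, 0, 1, 0]   # x_i = 1 if person i is in the labor force
t_y, t_x = sum(y), sum(x)      # population totals
rate = t_y / t_x               # unemployment rate
```

Here every unemployed person is also in the labor force, so the rate t_y/t_x is a proportion between 0 and 1.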
Sources of error in sample surveys
1. Target population U vs frame population UF
Access to the population is through a list of units – a register UF. U and UF may not be the same. Three possible errors in UF:
– Undercoverage: Some units in U are not in UF
– Overcoverage: Some units in UF are not in U
– Duplicate listings: A unit in U is listed more than once in UF
• UF is sometimes called the sampling frame
2. Nonresponse - missing data
• Some persons cannot be contacted
• Some refuse to participate in the survey
• Some may be ill and incapable of responding
• In postal surveys: can be as much as 70% nonresponse
• In telephone surveys: 50% nonresponse is not uncommon
• Possible consequences:
– Bias in the sample, not representative of the population
– Estimation becomes more inaccurate
• Remedies:
– imputation, weighting
3. Measurement error – the correct value of yi is
not measured
– In interviewer surveys:
• Incorrect marking
• Interviewer effect: people may say what they think the interviewer wants to hear – underreporting of alcohol use, tobacco use
• Misunderstanding of the question, or not remembering correctly
4. Sampling «error»
– The error (uncertainty, tolerance) caused by
observing a sample instead of the whole
population
– To assess this error – margin of error: measure sample-to-sample variation
– The design approach deals with calculating sampling errors for different sampling designs
– One such measure: 95% confidence interval: if we draw repeated samples, then 95% of the calculated confidence intervals for a total t will actually include t
• The first 3 errors: nonsampling errors
– Can be much larger than the sampling error
• In this course:
– Sampling error
– nonresponse bias
– Shall assume that the frame population is
identical to the target population
– No measurement error
Summary of basic concepts
• Population, target population
• Unit
• Sample
• Sampling design
• Estimation
– estimator
– measure of bias
– measure of variance
– confidence interval
• Survey errors:
– register/frame population
– measurement error
– nonresponse
– sampling error
Example – Psychiatric Morbidity Survey
1993 from Great Britain
• Aim: Provide information about prevalence of
psychiatric problems among adults in GB as well
as their associated social disabilities and use of
services
• Target population: Adults aged 16-64 living in
private households
• Sample: through several stages, 18,000 addresses were chosen and 1 adult in each household was chosen
• 200 interviewers, each visiting 90 households
Result of the sampling process
• Sample of addresses: 18,000
– Vacant premises: 927
– Institutions/business premises: 573
– Demolished: 499
– Second home/holiday flat: 236
• Private household addresses: 15,765
– Extra households found: 669
• Total private households: 16,434
– Households with no one 16-64: 3,704
• Eligible households: 12,730
• Nonresponse: 2,622
• Sample: 10,108 households with responding adults aged 16-64
Why sampling ?
• reduces costs for acceptable level of accuracy
(money, manpower, processing time...)
• may free up resources to reduce nonsampling error
and collect more information from each person in
the sample
– ex:
400 interviewers at $5 per interview: lower sampling error
200 interviewers at $10 per interview: lower nonsampling error
• much quicker results
When is sample representative ?
• Balance on gender and age:
– proportion of women in sample ≈ proportion in population
– proportions of age groups in sample ≈ proportions in population
• An ideal representative sample:
– A miniature version of the population:
– implying that every unit in the sample represents the
characteristics of a known number of units in the
population
• Appropriate probability sampling ensures a representative sample "on the average"
Alternative approaches for statistical inference
based on survey sampling
• Design-based:
– No modeling, only stochastic element is the
sample s with known distribution
• Model-based: The values yi are assumed to be
values of random variables Yi:
– Two stochastic elements: Y = (Y1, …,YN) and s
– Assumes a parametric distribution for Y
– Example : suppose we have an auxiliary
variable x. Could be: age, gender, education. A
typical model is a regression of Yi on xi.
• Statistical principles of inference imply that the
model-based approach is the most sound and valid
approach
• Start with learning the design-based approach
since it is the most applied approach to survey
sampling used by national statistical institutes and
most research institutes for social sciences.
– It is the easier route: no modeling is needed. All statisticians working with survey sampling in practice need to know this approach
Design-based statistical inference
• Can also be viewed as a distribution-free
nonparametric approach
• The only stochastic element: Sample s, distribution
p(s) for all subsets s of the population U={1, ..., N}
• No explicit statistical modeling is done for the
variable y. All yi’s are considered fixed but
unknown
• Focus on sampling error
• Sets the sample survey theory apart from usual
statistical analysis
• The traditional approach, started by Neyman in 1934
Estimation theory-simple random sample
SRS of size n: each sample s of size n has p(s) = 1/\binom{N}{n}
Can be performed in principle by drawing one unit at a time at random without replacement
Estimation of the population mean of a variable y:
\mu = \sum_{i=1}^N y_i / N
A natural estimator – the sample mean: \bar{y}_s = \sum_{i \in s} y_i / n
Desirable properties:
(I) Unbiasedness: an estimator \hat{\mu} is unbiased if E(\hat{\mu}) = \mu
\bar{y}_s is unbiased under the SRS design
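The unbiasedness of the sample mean can be checked exactly on a small example. A Python sketch (the y-values are invented): average the sample mean over all \binom{N}{n} equally likely samples and compare with \mu.

```python
# Sketch: E(ybar_s) = mu verified exactly for SRS on a tiny invented
# population, by averaging the sample mean over all C(N, n) samples.
from itertools import combinations

y = [3.0, 7.0, 4.0, 10.0, 6.0]       # invented y-values
N, n = len(y), 3
mu = sum(y) / N                      # population mean

samples = list(combinations(range(N), n))
means = [sum(y[i] for i in s) / n for s in samples]
E_ybar = sum(means) / len(samples)   # each sample has p(s) = 1/C(N, n)
assert abs(E_ybar - mu) < 1e-9       # the sample mean is design-unbiased
```

The randomness here lies entirely in which sample is drawn; the y-values are fixed, exactly as in the design-based approach.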
The uncertainty of an unbiased estimator is measured by its estimated sampling variance or standard error (SE):
Var(\hat{\mu}) = E(\hat{\mu} - \mu)^2, if E(\hat{\mu}) = \mu
\hat{V}(\hat{\mu}) is an (unbiased) estimate of Var(\hat{\mu})
SE(\hat{\mu}) = \sqrt{\hat{V}(\hat{\mu})}
Some results for SRS:
(1) Let \pi_i be the probability that unit i is in the sample. Then \pi_i = n/N = f, the sampling fraction
(2) E(\bar{y}_s) = \mu
(3) Let \sigma^2 be the population variance: \sigma^2 = \sum_{i=1}^N (y_i - \mu)^2 / (N - 1)
Var(\bar{y}_s) = (\sigma^2 / n)(1 - f)
Here, the factor (1 - f) is called the finite population correction
• usually unimportant in social surveys:
n = 10,000 and N = 5,000,000: 1 - f = 0.998
n = 1000 and N = 400,000: 1 - f = 0.9975
n = 1000 and N = 5,000,000: 1 - f = 0.9998
• effect of changing n much more important than effect of changing n/N
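A small Python sketch of the variance formula Var(\bar{y}_s) = (\sigma^2/n)(1 - f), reproducing the finite population corrections above and the point that n matters far more than n/N (the \sigma^2 value is a placeholder):

```python
# Sketch: the SRS variance formula and the finite population correction.
def fpc(n, N):
    return 1 - n / N                  # the factor (1 - f), f = n/N

def var_ybar(sigma2, n, N):
    return sigma2 / n * fpc(n, N)     # Var(ybar) = (sigma^2/n)(1 - f)

# The corrections quoted on the slide:
assert abs(fpc(10_000, 5_000_000) - 0.998) < 1e-9
assert abs(fpc(1_000, 400_000) - 0.9975) < 1e-9
assert abs(fpc(1_000, 5_000_000) - 0.9998) < 1e-9

# Halving n roughly doubles the variance; shrinking N tenfold barely matters:
r_n = var_ybar(1.0, 500, 5_000_000) / var_ybar(1.0, 1_000, 5_000_000)
r_N = var_ybar(1.0, 1_000, 500_000) / var_ybar(1.0, 1_000, 5_000_000)
```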
An unbiased estimator of \sigma^2 is given by the sample variance
s^2 = \sum_{i \in s} (y_i - \bar{y}_s)^2 / (n - 1)
The estimated variance: \hat{V}(\bar{y}_s) = (1 - f) s^2 / n
Usually we report the standard error of the estimate:
SE(\bar{y}_s) = \sqrt{\hat{V}(\bar{y}_s)}
Confidence intervals for \mu are based on the Central Limit Theorem:
For large n, N - n: Z = (\bar{y}_s - \mu) / (\sigma \sqrt{(1 - f)/n}) \sim N(0, 1)
Approximate 95% CI for \mu:
(\bar{y}_s - 1.96 \cdot SE(\bar{y}_s),\ \bar{y}_s + 1.96 \cdot SE(\bar{y}_s)) = \bar{y}_s \pm 1.96 \cdot SE(\bar{y}_s)
Example – Student performance in California
schools
• Academic Performance Index (API) for all California
schools
• Based on standardized testing of students
• Data from all schools with at least 100 students
• Unit in population = school (Elementary/Middle/High)
• Full population consists of N = 6194 observations
• Concentrate on the variable: y = api00 = API in 2000
• Mean(y) = 664.7 with min(y) = 346 and max(y) = 969
• Data set in R: apipop and y = apipop$api00
Histogram of y population with fitted normal density
[Figure: histogram of y with fitted normal density; x-axis y from 400 to 1000, y-axis density up to about 0.0030]
Histogram for sample mean and fitted normal density
y = api scores from 2000. Sample size n = 10, based on 10,000 simulations
R-code:
b = 10000
N = 6194
n = 10
ybar = numeric(b)
for (k in 1:b){
  s = sample(1:N, n)
  ybar[k] = mean(y[s])
}
hist(ybar, seq(min(ybar) - 5, max(ybar) + 5, 5), prob = TRUE)
x = seq(mean(ybar) - 4*sqrt(var(ybar)), mean(ybar) + 4*sqrt(var(ybar)), 0.05)
z = dnorm(x, mean(ybar), sqrt(var(ybar)))
lines(x, z)
Histogram and fitted normal density
api scores. Sample size n = 10, based on 10000 simulations
[Figure: histogram of ybar with fitted normal density; x-axis ybar from 500 to 800, y-axis density up to about 0.010]
y = api00 for 6194 California schools
10000 simulations of SRS. Confidence level of the approximate 95% CI

n      Conf. level
10     0.915
30     0.940
50     0.943
100    0.947
1000   0.949
2000   0.951
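The coverage experiment behind a table like this can be reproduced in spirit with any skewed population. A Python sketch (the gamma-shaped population is an invented stand-in for the apipop data, which is not reproduced here), drawing an SRS, forming \bar{y}_s \pm 1.96 \cdot SE, and counting how often the interval covers \mu:

```python
# Sketch: estimated confidence level of the approximate 95% CI under SRS,
# for a synthetic right-skewed population (stand-in for the api data).
import random
import statistics as st

random.seed(1)
N = 6194
pop = [random.gammavariate(4, 100) for _ in range(N)]   # invented data
mu = sum(pop) / N

def coverage(n, reps=2000):
    hits = 0
    for _ in range(reps):
        s = random.sample(pop, n)
        ybar = st.mean(s)
        se = (st.variance(s) * (1 - n / N) / n) ** 0.5   # sqrt((1-f) s^2/n)
        hits += ybar - 1.96 * se <= mu <= ybar + 1.96 * se
    return hits / reps

cov10, cov100 = coverage(10), coverage(100)   # coverage rises toward 0.95
```

For skewed populations the normal approximation is poor at n = 10 and improves with n, which is exactly the pattern in the table.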
For one sample of size n = 100:
\bar{y}_s = 654.5, SE(\bar{y}_s) = 12.6
Approximate 95% CI for the population mean:
654.5 \pm 1.96 \cdot 12.6 = 654.5 \pm 24.7 = (629.8, 679.2)
R-code:
s = sample(1:6194, 100)
ybar = mean(y[s])
se = sqrt(var(y[s]) * (6194 - 100)/(6194 * 100))
ybar       # [1] 654.47
var(y[s])  # [1] 16179.28
se         # [1] 12.61668
The absolute value of the sampling error is not informative when not related to the value of the estimate.
For example, SE = 2 is small if the estimate is 1000, but very large if the estimate is 3.
The coefficient of variation for the estimate:
CV(\bar{y}_s) = SE(\bar{y}_s) / \bar{y}_s
In the example: CV(\bar{y}_s) = 12.6/654.5 = 0.019 = 1.9%
• A measure of the relative variability of an estimate
• It does not depend on the unit of measurement
• More stable over repeated surveys; can be used for planning, for example determining sample size
• More meaningful when estimating proportions
Estimation of a population proportion p
with a certain characteristic A
p = (number of units in the population with A)/N
Let yi = 1 if unit i has characteristic A, 0 otherwise
Then p is the population mean of the yi’s.
Let X be the number of units in the sample with
characteristic A. Then the sample mean can be
expressed as
pˆ  y s  X / n
34
Then under SRS:
E(\hat{p}) = p
and
Var(\hat{p}) = \frac{p(1-p)}{n}\left(1 - \frac{n-1}{N-1}\right)
since the population variance equals \sigma^2 = \frac{N p(1-p)}{N-1}
The sample variance is s^2 = \frac{n\,\hat{p}(1-\hat{p})}{n-1}
So the unbiased estimate of the variance of the estimator:
\hat{V}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)
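These proportion formulas can be sketched numerically in Python (the numbers are the poll example that follows: X = 280 of n = 1000, with N = 4,000,000):

```python
# Sketch: p-hat, its standard error, and the approximate 95% CI,
# using V-hat(p-hat) = p-hat(1-p-hat)/(n-1) * (1 - n/N).
n, N = 1000, 4_000_000
X = 280                                                  # "successes"
p_hat = X / n                                            # 0.28
se = (p_hat * (1 - p_hat) / (n - 1) * (1 - n / N)) ** 0.5
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
assert abs(se - 0.0142) < 5e-4                           # SE about 0.014
assert abs(ci[0] - 0.252) < 1e-3 and abs(ci[1] - 0.308) < 1e-3
```

With N this large, the finite population correction (1 - n/N) is essentially 1 and could be dropped.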
Examples
A political poll: suppose we have a random sample of 1000 eligible voters in Norway, with 280 saying they will vote for the Labor party. Then the estimated proportion of Labor votes in Norway is given by:
\hat{p} = 280/1000 = 0.28
SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\left(1 - \frac{n}{N}\right)} \approx \sqrt{\frac{0.28 \cdot 0.72}{999}} \approx 0.0142
A confidence interval requires the normal approximation. Can use the guideline from the binomial distribution, when N - n is large: np \ge 5 and n(1-p) \ge 5
In this example: n = 1000 and N = 4,000,000
Approximate 95% CI: \hat{p} \pm 1.96 \cdot SE(\hat{p}) = 0.280 \pm 0.028 = (0.252, 0.308)
Ex: Psychiatric Morbidity Survey 1993 from Great Britain
p = proportion with psychiatric problems
n = 9792 (partial nonresponse on this question: 316)
N \approx 40,000,000
\hat{p} = 0.14
SE(\hat{p}) = \sqrt{(1 - 0.00024) \cdot 0.14 \cdot 0.86 / 9791} \approx 0.0035
95% CI: 0.14 \pm 1.96 \cdot 0.0035 = 0.14 \pm 0.007 = (0.133, 0.147)
General probability sampling
• Sampling design: p(s) - known probability of selection for
each subset s of the population U
• Actually: The sampling design is the probability distribution
p(.) over all subsets of U
• Typically, for most s: p(s) = 0. In SRS of size n, all s with size different from n have p(s) = 0.
• The inclusion probability:
\pi_i = P(\text{unit } i \text{ is in the sample}) = P(i \in s) = \sum_{s: i \in s} p(s)
Illustration
U = {1,2,3,4}
Sample of size 2; 6 possible samples
Sampling design:
p({1,2}) = ½, p({2,3}) = 1/4, p({3,4}) = 1/8, p({1,4}) = 1/8
The inclusion probabilities:
\pi_1 = \sum_{s: 1 \in s} p(s) = p(\{1,2\}) + p(\{1,4\}) = 5/8
\pi_2 = \sum_{s: 2 \in s} p(s) = p(\{1,2\}) + p(\{2,3\}) = 3/4 = 6/8
\pi_3 = \sum_{s: 3 \in s} p(s) = p(\{2,3\}) + p(\{3,4\}) = 3/8
\pi_4 = \sum_{s: 4 \in s} p(s) = p(\{3,4\}) + p(\{1,4\}) = 2/8
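The inclusion probabilities of this illustration can be computed directly from the design. A Python sketch:

```python
# Sketch: inclusion probabilities from the design p(s) on U = {1, 2, 3, 4}.
from fractions import Fraction as F

design = {
    frozenset({1, 2}): F(1, 2),
    frozenset({2, 3}): F(1, 4),
    frozenset({3, 4}): F(1, 8),
    frozenset({1, 4}): F(1, 8),
}   # every other subset has p(s) = 0

# pi_i = sum of p(s) over all samples s containing unit i
pi = {i: sum(p for s, p in design.items() if i in s) for i in range(1, 5)}
assert pi == {1: F(5, 8), 2: F(3, 4), 3: F(3, 8), 4: F(1, 4)}
assert sum(pi.values()) == 2   # fixed sample size n = 2
```

The final assertion anticipates the next slide: with a fixed sample size n, the inclusion probabilities sum to n.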
Some results
(I) \pi_1 + \pi_2 + \dots + \pi_N = E(n); n is the sample size
(II) If the sample size is determined to be n in advance:
\pi_1 + \pi_2 + \dots + \pi_N = n
Proof:
Let Z_i = 1 if unit i is included in the sample, 0 otherwise
\pi_i = P(Z_i = 1) = E(Z_i)
n = \sum_{i=1}^N Z_i \Rightarrow E(n) = \sum_{i=1}^N E(Z_i) = \sum_{i=1}^N \pi_i
Estimation theory
probability sampling in general
Problem: Estimate a population quantity for the variable y
For the sake of illustration: the population total t = \sum_{i=1}^N y_i
An estimator of t based on the sample: \hat{t}
Expected value: E(\hat{t}) = \sum_s \hat{t}(s) p(s)
Variance: Var(\hat{t}) = E[\hat{t} - E\hat{t}]^2 = \sum_s [\hat{t}(s) - E\hat{t}]^2 p(s)
Bias: E(\hat{t}) - t
\hat{t} is unbiased if E(\hat{t}) = t
Let \hat{V}(\hat{t}) be an (unbiased if possible) estimate of Var(\hat{t})
The standard error of \hat{t}: SE(\hat{t}) = \sqrt{\hat{V}(\hat{t})}
Coefficient of variation of \hat{t}: CV(\hat{t}) = SE(\hat{t}) / \hat{t}
CV is a useful measure of uncertainty, especially when the standard error increases as the estimate increases
Margin of error: 2 \cdot SE(\hat{t})
Because, typically, P(\hat{t} - 2\,SE(\hat{t}) \le t \le \hat{t} + 2\,SE(\hat{t})) \approx 0.95 for large n, N - n
Since \hat{t} is approximately normally distributed for large n, N - n,
\hat{t} \pm 2 \cdot SE(\hat{t}) is approximately a 95% CI
Some peculiarities in the estimation theory
Example: N = 3, n = 2, simple random sample
s_1 = \{1,2\}, s_2 = \{1,3\}, s_3 = \{2,3\}
p(s_k) = 1/3 for k = 1, 2, 3
Let \hat{t}_1 = 3\bar{y}_s, unbiased
Let \hat{t}_2 be given by:
\hat{t}_2(s_1) = 3 \cdot \tfrac{1}{2}(y_1 + y_2) = \hat{t}_1(s_1)
\hat{t}_2(s_2) = 3 \cdot (\tfrac{1}{2} y_1 + \tfrac{2}{3} y_3) = \hat{t}_1(s_2) + \tfrac{1}{2} y_3
\hat{t}_2(s_3) = 3 \cdot (\tfrac{1}{2} y_2 + \tfrac{1}{3} y_3) = \hat{t}_1(s_3) - \tfrac{1}{2} y_3
\hat{t}_2 is also unbiased:
E(\hat{t}_2) = \sum_s \hat{t}_2(s) p(s) = \tfrac{1}{3} \sum_{k=1}^3 \hat{t}_2(s_k) = \tfrac{1}{3} \cdot 3t = t
Var(\hat{t}_1) - Var(\hat{t}_2) = \tfrac{1}{6} y_3 (3y_2 - 3y_1 - y_3)
\Rightarrow Var(\hat{t}_1) > Var(\hat{t}_2) if y_3 > 0 and 3y_2 > 3y_1 + y_3
If the y_i are 0/1 variables, this happens when y_1 = 0, y_2 = y_3 = 1
For this set of values of the y_i's:
\hat{t}_1(s_1) = 1.5, \hat{t}_1(s_2) = 1.5, \hat{t}_1(s_3) = 3: never correct (the true total is t = 2)
\hat{t}_2(s_1) = 1.5, \hat{t}_2(s_2) = 2, \hat{t}_2(s_3) = 2.5
\hat{t}_2 clearly has less variability than \hat{t}_1 for these y-values
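The N = 3 example can be checked numerically. A Python sketch with exact fractions (samples are written as 0-based index tuples):

```python
# Sketch: the two unbiased estimators of the example; for y = (0, 1, 1)
# the second has smaller variance, so neither dominates for all y.
from fractions import Fraction as F

samples = [(0, 1), (0, 2), (1, 2)]      # s1, s2, s3, each with p(s) = 1/3

def t1(s, y):                           # t1 = 3 * sample mean
    return F(3, 2) * sum(y[i] for i in s)

def t2(s, y):                           # t1 plus the +/- y3/2 adjustment
    adjust = {(0, 1): F(0), (0, 2): F(1, 2), (1, 2): F(-1, 2)}
    return t1(s, y) + adjust[s] * y[2]

y = (0, 1, 1)
t = sum(y)                              # true total t = 2

for est in (t1, t2):
    assert sum(est(s, y) for s in samples) / 3 == t      # both unbiased
v1 = sum((t1(s, y) - t) ** 2 for s in samples) / 3       # Var(t1) = 1/2
v2 = sum((t2(s, y) - t) ** 2 for s in samples) / 3       # Var(t2) = 1/6
assert v1 - v2 == F(1, 3)               # = (1/6) y3 (3 y2 - 3 y1 - y3)
```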
Let y be the population vector of the y-values.
This example shows that N\bar{y}_s is not uniformly best (minimum variance for all y) among linear design-unbiased estimators.
The example shows that the "usual" basic estimators do not have the same properties in design-based survey sampling as they do in ordinary statistical models.
In fact, we have the following much stronger result:
Theorem: Let p(.) be any sampling design. Assume each
yi can take at least two values. Then there exists no
uniformly best design-unbiased estimator of the total t
Proof:
Let \hat{t} be unbiased, and let y^0 be one possible value of y.
Then there exists an unbiased \hat{t}_0 with Var(\hat{t}_0) = 0 when y = y^0:
\hat{t}_0(s, y) = \hat{t}(s, y) - \hat{t}(s, y^0) + t_0, where t_0 is the total for y^0
1) \hat{t}_0 is unbiased: E(\hat{t}_0) = t - \sum_s \hat{t}(s, y^0) p(s) + t_0 = t
2) When y = y^0: \hat{t}_0 = t_0 for all samples s \Rightarrow Var(\hat{t}_0) = 0
This implies that a uniformly best unbiased estimator must have variance equal to 0 for all values of y, which is impossible