ST640: Fall 2010
Examples

1 A few motivating examples.
• National Household Pesticide Usage Study.
Population of interest: all households in the U.S.
Goal: Estimate the proportion of households storing pesticides and, of those, what
proportion store them in something other than the original container.
(Center for Rural Environmental Health, Colorado State University.)
This study is representative of many national studies where the goal is to sample all
households, or all individuals, in the country. This includes all of the major national
government surveys for labor statistics and health (NHANES, CSFII).
How should we sample here?
• Sampling hunters in Massachusetts.
Population: All licensed hunters in MA in 198?.
Goal: Estimate the total number of individuals "harvested" for each of a number of
species (pheasants, deer, etc.). Paper copies of licenses were stored in bundles by town.
(Rich Kennedy, Dept. of Forestry and Wildlife)
• Forest regeneration studies in the Quabbin reservoir.
Population: MDC land area around the Quabbin reservoir, which is broken up into five
"Blocks." More recently, some studies have been interested in looking at regeneration
in "openings" that occur from logging.
Goal: Estimate the average number of young trees per acre (e.g., less than 4.5" in
diameter). These values are of interest for various reasons, including assessing the
impact of the deer hunt on regeneration and the nature of regeneration after logging.
(Tom Kyker-Snowman, MDC Forester).
• The accuracy of satellite imagery in classifying land use.
Population: A portion of Bourne, MA for which different regions were classified by
habitat using aerial photos and infrared levels.
Goal: Estimate what proportion of area in each class is accurately classified. Done by
getting ground truth on a sample and attaching an accuracy level to the map.
(Janice Stone).
• Movement of graduating Umass students.
Ellyse Fosse: Honors project. Wanted to sample all seniors at Umass to estimate
what proportion planned to remain and work in Massachusetts and compare these
proportions across majors and other groupings. This was motivated by comments by
then Gov. Mitt Romney when questioning some higher education expenditures by
saying that a lot of Umass students from MA leave the state to work after getting
educated.
• Sprinkler safety in dorms.
A fire occurred in a hotel where the sprinklers were of the same type as those used in
a number of UMass dorms. In the hotel fire, the sprinklers failed. There was concern
about the safety of these sprinklers in general, and in the UMass dorms in particular.
The recommendation made to the University was to sample 1% of the sprinklers in the
dorm; so in one dorm with 698 sprinklers, 7 were sampled. The small recommended
number was in part due to the fact that the sampling is destructive and costly. This is
a common occurrence in many quality control situations. What can be said about the
proportion of defective sprinklers in the dorm with this sample? Is that sample size
sufficient?
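The exact calculation behind that question can be sketched directly (this sketch is not part of the original notes): with 0 failures observed in an SRS of n = 7 from N = 698 sprinklers, the one-sided exact 95% upper bound on the number of defective sprinklers is the largest count still consistent, at the 5% level, with seeing no failures under the hypergeometric distribution.

```python
from math import comb

def hyper_p0(M, N, n):
    """P(no defectives in an SRS of n from N containing M defectives)."""
    return comb(N - M, n) / comb(N, n)   # comb returns 0 when N - M < n

N, n = 698, 7
# Largest M whose chance of yielding an all-clean sample is still >= 0.05:
# the one-sided exact 95% upper confidence bound for the number defective.
upper = max(M for M in range(N - n + 1) if hyper_p0(M, N, n) >= 0.05)
print(upper, upper / N)
```

The bound comes out around 240 defective sprinklers, over a third of the dorm, so a 1% sample says very little here.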
2 Example 1: Estimating egg mass and defoliation means with an SRS.
This data is from a study looking at gypsy moth egg mass densities and defoliation over a
number of different "plots". Each plot, viewed as a population, is 60 ha (hectares) in size.
On each plot a number of subplots are sampled, each subplot being .1 ha in size. So, with a
plot being the population, there are 60/.1 = 600 "units". For illustration here the samples are
treated as an SRS without replacement. This particular data is for different plots in the State
of Maryland in 1987. See Buonaccorsi (1994) in "Case Studies in Biometry" (N. Lange, ... J.
Greenhouse, eds.) for more background.
The analysis below shows estimated means, standard errors and approximate confidence
intervals for each of defoliation and egg mass densities. This is done first using proc
surveymeans in SAS, which accounts for the finite population, and then using a standard
univariate analysis for each plot which ignores the finite population correction factor. Since
the sample sizes are small relative to the population size of 600, the correction factor doesn't
matter much here.
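The formula surveymeans is using can be coded directly; a minimal Python sketch (not part of the notes), checked against the plot 460 summary values that appear in this handout (n = 15, N = 600, sample SD 50.265296):

```python
import math

def srs_mean_se(s, n, N):
    """SE of the sample mean under SRS without replacement:
    sqrt((1 - n/N) * s^2 / n), where (1 - n/N) is the fpc."""
    return math.sqrt((1.0 - n / N) * s**2 / n)

# Plot 460: n = 15 of N = 600 subplots, sample SD of egg counts s = 50.265296.
se_with_fpc = srs_mean_se(50.265296, 15, 600)   # what proc surveymeans reports
se_no_fpc = 50.265296 / math.sqrt(15)           # what proc means reports
```

With n/N = 2.5% the fpc multiplier is sqrt(0.975), so the two standard errors barely differ, as the output shows.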
data dd;
infile ’87md.dat’;
input plot $ subplot egg def;
proc sort;
by plot;
proc surveymeans total=600;
var egg def;
/* With no other options this assumes
an SRS without replacement */
by plot;
/* Analyze assuming "random sample" that does not
account for the finite population */
proc means mean stderr clm;
class plot;
var egg def;
run;
*** Partial output for plot 460 and plot 4A. ***

------------------------------ plot=460 ------------------------------

The SURVEYMEANS Procedure
Number of Observations    15

                               Std Error     Lower 95%     Upper 95%
Variable    N         Mean       of Mean   CL for Mean   CL for Mean
--------------------------------------------------------------------
egg        15    54.200000     12.815186     26.714159     81.685841
def        15    18.666667      1.894855     14.602606     22.730727
--------------------------------------------------------------------

------------------------------ plot=4A -------------------------------

Number of Observations    15

                               Std Error     Lower 95%     Upper 95%
Variable    N         Mean       of Mean   CL for Mean   CL for Mean
--------------------------------------------------------------------
egg        15    21.133333      5.327775      9.706392     32.560275
def        15    13.333333      2.293884      8.413441     18.253226
--------------------------------------------------------------------

The MEANS Procedure

         N                                      Lower 95%     Upper 95%
plot   Obs   Variable         Mean   Std Error  CL for Mean   CL for Mean
-------------------------------------------------------------------------
460     15   egg        54.2000000  12.9784437   26.3640068    82.0359932
             def        18.6666667   1.9189944   14.5508329    22.7825004
4A      15   egg        21.1333333   5.3956479    9.5608196    32.7058470
             def        13.3333333   2.3231068    8.3507647    18.3159020
-------------------------------------------------------------------------
3 Simulations of estimated mean and CI with SRS
This illustrates the performance of the sample mean and confidence intervals through
simulation, using the program simmean.sas. Here we use the Parvin Lake data as the population,
with the variable "number of fish caught". It examines the behavior over different sample
sizes in two settings. The first uses all 137 days of the season as the population. The second
eliminates opening day, which is an outlier. Sample sizes of 10, 20 and 50 are shown here.
We know exactly what the expected value and variance (or standard error) of the sample
mean are, so we don't need to simulate for that purpose. The value of the simulation is to
look at the performance of the confidence intervals and the distribution of the sample mean.
Using all N = 137 days, ȳU = 55.55 (population mean) and S² = 4597.54 (population
"variance").
The means of COVER, COVERC and COVERCT are estimated coverage rates of the approximate
normal CI, the Chebychev CI using the sample variance, and the Chebychev CI using the true
variance (just to see how much the estimation of the variance matters).
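The bookkeeping in simmean.sas can be sketched in Python. The Parvin Lake data file isn't reproduced in these notes, so the population below is an artificial skewed stand-in; the coverage logic is the same.

```python
import math
import random
import statistics

random.seed(1)
# Artificial skewed population standing in for the Parvin Lake catches.
pop = [x * x / 100 for x in range(137)]
N = len(pop)
ybar_U = statistics.mean(pop)
S2 = statistics.variance(pop)           # population "variance" (N-1 divisor)
n, reps = 20, 2000
k_cheb = math.sqrt(1 / 0.05)            # Chebychev multiplier for 95%
true_se = math.sqrt((1 - n / N) * S2 / n)
cover = coverc = coverct = 0
for _ in range(reps):
    samp = random.sample(pop, n)        # SRS without replacement
    ybar = statistics.mean(samp)
    se = math.sqrt((1 - n / N) * statistics.variance(samp) / n)
    cover += abs(ybar - ybar_U) <= 1.96 * se            # normal-approx CI
    coverc += abs(ybar - ybar_U) <= k_cheb * se         # Chebychev, est. SE
    coverct += abs(ybar - ybar_U) <= k_cheb * true_se   # Chebychev, true SE
cover, coverc, coverct = cover / reps, coverc / reps, coverct / reps
```

Because the Chebychev multiplier (4.47) is much larger than 1.96, the Chebychev intervals are wider and conservative, as in the output below.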
sample size 10
true SE of the mean = 20.644505

Variable     N          Mean       Std Dev       Minimum       Maximum
YBAR      1000    54.6009000    19.4458693    20.2000000   140.1000000
SAMPLES2  1000       4077.89      10329.66   125.9555556      47356.27
COVER     1000     0.8500000     0.3572501             0     1.0000000
COVERC    1000     0.9850000     0.1216133             0     1.0000000
COVERCT   1000     1.0000000             0     1.0000000     1.0000000
LENGTH    1000    56.6748086    50.9823842    13.3945562   259.7217930
LENGTHC   1000   129.3174014   116.3287464    30.5629474   592.6186286
LENGTHCT  1000   184.6500660             0   184.6500660   184.6500660

sample size 20
true SE of the mean = 14.011368

Variable     N          Mean       Std Dev       Minimum       Maximum
YBAR      1000    55.2352000    13.6250699    30.4000000   112.8000000
SAMPLES2  1000       4305.13       7264.58   391.0947368      24074.46
COVER     1000     0.8820000     0.3227695             0     1.0000000
COVERC    1000     0.9960000     0.0631505             0     1.0000000
COVERCT   1000     1.0000000             0     1.0000000     1.0000000
LENGTH    1000    42.8292971    31.4861507    16.0190487   125.6822775
LENGTHC   1000    97.7254893    71.8433336    36.5513674   286.7747757
LENGTHCT  1000   125.3214872             0   125.3214872   125.3214872

sample size 50
true SE of the mean = 7.6414757

Variable     N          Mean       Std Dev       Minimum       Maximum
YBAR      1000    55.8094600     7.7107916    38.5200000    74.5600000
SAMPLES2  1000       4637.26       4157.18   737.3322449      10631.07
COVER     1000     0.8630000     0.3440194             0     1.0000000
COVERC    1000     0.9960000     0.0631505             0     1.0000000
COVERCT   1000     1.0000000             0     1.0000000     1.0000000
LENGTH    1000    26.9867472    13.2999150    11.9956570    45.5492192
LENGTHC   1000    61.5768470    30.3470006    27.3710178   103.9316552
LENGTHCT  1000    68.3474366             0    68.3474366    68.3474366

With N = 136 (no opening day) ȳU = 50.76 and S² = 1464.

sample size 10
true SE of the mean = 11.649923

Variable     N          Mean       Std Dev       Minimum       Maximum
YBAR      1000    50.7963000    11.3441035    19.9000000    89.1000000
SAMPLES2  1000       1429.51   711.7743356   157.8333333       4784.04
COVER     1000     0.8960000     0.3054133             0     1.0000000
COVERC    1000     0.9880000     0.1089397             0     1.0000000
COVERCT   1000     1.0000000             0     1.0000000     1.0000000
LENGTH    1000    43.6492419    11.3983191    14.9897113    82.5261354
LENGTHC   1000    99.5963935    26.0080457    34.2026831   188.3035097
LENGTHCT  1000   104.2000826             0   104.2000826   104.2000826

sample size 20
true SE of the mean = 7.9040885

Variable     N          Mean       Std Dev       Minimum       Maximum
YBAR      1000    50.6555000     7.9726585    28.3500000    81.0500000
SAMPLES2  1000       1442.51   485.9184976   259.2078947       3192.01
COVER     1000     0.9180000     0.2745020             0     1.0000000
COVERC    1000     0.9970000     0.0547174             0     1.0000000
COVERCT   1000     1.0000000             0     1.0000000     1.0000000
LENGTH    1000    30.2929706     5.2582564    13.0330704    45.7356427
LENGTHC   1000    69.1208021    11.9979947    29.7381295   104.3570259
LENGTHCT  1000    70.6963167             0    70.6963167    70.6963167

sample size 50
true SE of the mean = 4.3042949

Variable     N          Mean       Std Dev       Minimum       Maximum
YBAR      1000    50.6693000     4.2748816    37.3000000    63.2800000
SAMPLES2  1000       1457.60   247.0899739   633.2575510       2154.05
COVER     1000     0.9410000     0.2357426             0     1.0000000
COVERC    1000     1.0000000             0     1.0000000     1.0000000
COVERCT   1000     1.0000000             0     1.0000000     1.0000000
LENGTH    1000    16.7685381     1.4415388    11.0933584    20.4597825
LENGTHC   1000    38.2615102     3.2892225    25.3122034    46.6839849
LENGTHCT  1000    38.4987841             0    38.4987841    38.4987841
4 Example 2: Estimating a proportion with an SRS.
As discussed, there are many situations involving relatively small sample sizes because of the
expense, in either time or money, of obtaining an observation, or because testing is destructive.
This shows some results with very small sample sizes using both approximate normal based
confidence intervals and exact confidence intervals based on the hypergeometric. Output was
obtained using the program exactp.sas. x is the number of successes in the sample. Notice
what happens with 0 successes in the sample to the usual approach of using the standard error
and an approximate C.I.
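The exact intervals can be constructed by inverting the hypergeometric tail probabilities at level α/2 in each tail. A Python sketch (exactp.sas itself is not reproduced here) that recovers the N = 20, n = 5 rows:

```python
from math import comb

def hyper_pmf(k, t, N, n):
    """P(X = k) when X counts successes in an SRS of n from N with t successes."""
    return comb(t, k) * comb(N - t, n - k) / comb(N, n)

def exact_ci_total(x, N, n, alpha=0.05):
    """Exact (Clopper-Pearson style) CI for the population total t of successes:
    keep every t whose upper and lower tail probabilities at x exceed alpha/2."""
    feasible = range(x, N - n + x + 1)        # t must allow x successes in n draws
    lo = min(t for t in feasible
             if sum(hyper_pmf(k, t, N, n) for k in range(x, n + 1)) > alpha / 2)
    hi = max(t for t in feasible
             if sum(hyper_pmf(k, t, N, n) for k in range(0, x + 1)) > alpha / 2)
    return lo, hi

for x in (2, 1, 0):
    lo, hi = exact_ci_total(x, 20, 5)
    print(x, (lo, hi), (lo / 20, hi / 20))    # CI for total and for p = t/N
```

Note that with x = 0 the exact interval is [0, 9] for the total, while the Wald interval collapses to [0, 0].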
 N   n   x   phat         SE   Approx CI for p       Exact CI for p   Exact CI for total
20   5   2    0.4   0.212132   [-0.01578, 0.81577]   [0.1, 0.8]       [2, 16]
20   5   1    0.2  0.1732051   [-0.13948, 0.53948]   [0.05, 0.65]     [1, 13]
20   5   0    0.0        0.0   [0, 0]                [0.0, 0.45]      [0, 9]
5 Simulation in estimating a proportion with an SRS
Here are some results illustrating the performance of the sample proportion and confidence
intervals through simulation (using the program simbin2.sas). We know exactly what the
expected value and variance (or standard error) of the sample proportion are, so we don't
need to simulate for that purpose. The value of the simulation here is mainly to look at
the performance of the confidence intervals. In general, the behavior of the approximate
confidence intervals depends on the true p as well as both the population and sample size.
In all but the last setting here the approximate CI does badly.
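A Python sketch of the kind of simulation simbin2.sas runs (it is not the class program itself), for the N = 100, n = 20, p = 0.05 setting:

```python
import math
import random

random.seed(7)
N, n, reps = 100, 20, 2000
t = 5                                   # 5 successes in the population: p = 0.05
pop = [1] * t + [0] * (N - t)
p = t / N
S2 = N / (N - 1) * p * (1 - p)          # population s2 = 0.0479798
true_se = math.sqrt((1 - n / N) * S2 / n)
k_cheb = math.sqrt(1 / 0.05)            # Chebychev multiplier for 95%
cover = coverct = 0
for _ in range(reps):
    x = sum(random.sample(pop, n))      # SRS without replacement
    phat = x / n
    s2 = n / (n - 1) * phat * (1 - phat)    # sample variance of 0/1 data
    se = math.sqrt((1 - n / N) * s2 / n)
    cover += abs(phat - p) <= 1.96 * se             # normal-approximation CI
    coverct += abs(phat - p) <= k_cheb * true_se    # Chebychev with true SE
cover, coverct = cover / reps, coverct / reps
# The normal CI fails whenever the sample contains no successes (se = 0 there),
# which happens often with p = 0.05 and n = 20.
```

The estimated normal-approximation coverage lands far below 95%, matching the poor COVER values in the output.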
YBAR IS THE SAMPLE PROPORTION.
YUBAR = true mean = proportion p
MEAN OF COVER = proportion of successful CI's using normal approximation
MEAN OF COVERC = proportion of successful CI's using Chebychev with estimated SE
MEAN OF COVERCT = proportion of successful CI's using Chebychev with true SE
MEAN OF COVERE = proportion of successful CI's using exact hypergeometric CI.

population size = 20   sample size = 5
true mean = 0.05   population s2 = 0.05
true SE of the mean = 0.0866025

                     The MEANS Procedure

Variable     N        Mean      Std Dev     Minimum     Maximum
YBAR      1000   0.0500000    0.0866459           0   0.2000000
SAMPLES2  1000   0.0500000    0.0866459           0   0.2000000
COVER     1000   0.2500000    0.4332294           0   1.0000000
COVERC    1000   0.2500000    0.4332294           0   1.0000000
COVERCT   1000   1.0000000            0   1.0000000   1.0000000
COVERCE   1000   1.0000000            0   1.0000000   1.0000000
LENGTH    1000   0.1697379    0.2941417           0   0.6789514
LENGTHC   1000   0.3872983    0.6711561           0   1.5491933
LENGTHCT  1000   0.7745967            0   0.7745967   0.7745967
LENGTHE   1000   0.4875000    0.0649844   0.4500000   0.6000000

population size = 100   sample size = 20
true mean = 0.05   population s2 = 0.0479798
true SE of the mean = 0.0438086

                     The MEANS Procedure

Variable     N        Mean      Std Dev     Minimum     Maximum
YBAR      1000   0.0484000    0.0433511           0   0.2000000
SAMPLES2  1000   0.0465053    0.0400235           0   0.1684211
COVER     1000   0.6650000    0.4722266           0   1.0000000
COVERC    1000   0.6650000    0.4722266           0   1.0000000
COVERCT   1000   1.0000000            0   1.0000000   1.0000000
COVERCE   1000   0.9940000    0.0772656           0   1.0000000
LENGTH    1000   0.1354827    0.1011848           0   0.3217409
LENGTHC   1000   0.3091369    0.2308778           0   0.7341303
LENGTHCT  1000   0.3918359            0   0.3918359   0.3918359
LENGTHE   1000   0.2066100    0.0485733   0.1500000   0.3400000

population size = 200   sample size = 40
true mean = 0.02   population s2 = 0.0196985
true SE of the mean = 0.0198487

Variable     N        Mean      Std Dev     Minimum     Maximum
YBAR      1000   0.0196000    0.0201556           0   0.1000000
SAMPLES2  1000   0.0192923    0.0195169           0   0.0923077
COVER     1000   0.5790000    0.4939666           0   1.0000000
COVERC    1000   0.5790000    0.4939666           0   1.0000000
COVERCT   1000   1.0000000            0   1.0000000   1.0000000
COVERCE   1000   0.9970000    0.0547174           0   1.0000000
LENGTH    1000   0.0575566    0.0511733           0   0.1684271
LENGTHC   1000   0.1313294    0.1167644           0   0.3843076
LENGTHCT  1000   0.1775319            0   0.1775319   0.1775319
LENGTHE   1000   0.1041200    0.0279098   0.0750000   0.1850000

population size = 500   sample size = 40
true mean = 0.2   population s2 = 0.1603206
true SE of the mean = 0.0607238

Variable     N        Mean      Std Dev     Minimum     Maximum
YBAR      1000   0.1970500    0.0571911   0.0250000   0.4000000
SAMPLES2  1000   0.1589269    0.0352541   0.0250000   0.2461538
COVER     1000   0.9340000    0.2484063           0   1.0000000
COVERC    1000   0.9980000    0.0446990           0   1.0000000
COVERCT   1000   1.0000000            0   1.0000000   1.0000000
COVERCE   1000   0.9740000    0.1592148           0   1.0000000
LENGTH    1000   0.2354034    0.0274412   0.0939966   0.2949479
LENGTHC   1000   0.5371303    0.0626137   0.2144761   0.6729956
LENGTHCT  1000   0.5431298            0   0.5431298   0.5431298
LENGTHE   1000   0.2492600    0.0248359   0.1240000   0.3040000
6 Duck Example: Estimating population size with variable-size units using
ratio and regression estimators.
This example is typical of settings where there are variable-size units and the size of the unit
can be used as an auxiliary variable. Here there is an area of 1303 square miles divided
into N = 140 transects, each 1/2 mile wide but of varying lengths. The goal is to estimate
the number of black-belly ducks in the area. Define xi = length of the ith transect and yi
= # of black-belly tree ducks in transect i. The main objective is to estimate ty = 140ȳU
= total number of ducks in the area. The area of the ith transect is (1/2) ∗ xi, and we know
that 1303 = Σi (1/2)xi = tx /2, so we know tx = 2606 (without having to know all of the
individual xi ) and x̄U = 2606/140 = 18.614.
The number of ducks per square mile is ty /1303 = ty /(tx /2) = 2 ∗ B where B = ȳU /x̄U .
Ratio Estimation.
The ratio estimator of ȳU is ȳˆr = x̄U ∗ (ȳ/x̄) and the ratio estimator of the total is t̂yr =
140 ∗ ȳˆr = 2606(ȳ/x̄).
A simple random sample of n = 14 transects yielded the following data and estimates.
Using just the y value leads to an estimated mean of ȳ = 21.07 and an estimated total of
t̂y = 2950 with an estimated standard error for the estimated total of 1072.92. Using a ratio
estimator the estimated ratio of means is B̂ = 1.22, the estimated mean y is ȳˆr = 22.65 and
an estimated total of t̂yr = 3171.49 with an estimated SE of the total of 1098.16 or 1180.6.
The estimated SE of the ratio estimator is larger than that obtained with the use of ȳ (but
these are estimates so the data analysis does not tell us which estimator is actually better.)
Note that there is one outlier (obs 8) that destroys the value of using the ratio or regression
estimator here.
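These calculations are easy to reproduce directly; a Python sketch (not part of the original program) using the transect lengths and duck counts listed with this example:

```python
# Transect lengths (x) and duck counts (y) for the n = 14 sampled transects.
x = [6.2, 12.4, 2.3, 22.8, 20.7, 24.8, 22.8,
     18.6, 22.8, 18.6, 16.6, 18.6, 16.6, 18.6]
y = [3, 3, 30, 3, 15, 16, 4, 114, 24, 48, 0, 8, 0, 27]
n, N, tx = len(x), 140, 2606.0
xbar_U = tx / N                 # 18.614, known without seeing every x_i
xbar = sum(x) / n
ybar = sum(y) / n               # plain SRS estimate of the mean
B_hat = ybar / xbar             # estimated ratio of means
ybar_ratio = xbar_U * B_hat     # ratio estimator of the mean
t_ratio = N * ybar_ratio        # = 2606 * (ybar / xbar), estimated total
```

The key point is that the ratio estimator only needs the known total tx = 2606, not the individual lengths of the unsampled transects.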
Obs   length   ducks      Obs   length   ducks
  1      6.2       3        8     18.6     114
  2     12.4       3        9     22.8      24
  3      2.3      30       10     18.6      48
  4     22.8       3       11     16.6       0
  5     20.7      15       12     18.6       8
  6     24.8      16       13     16.6       0
  7     22.8       4       14     18.6      27

Estimated mean and total without using x
     YBAR     SEYBAR    THAT      SETHAT
21.071429  7.6636848    2950   1072.9159

Ratio Estimators using xubar in variance of Bhat
     BHAT    SEBHAT1      YRHAT   SEYRHAT1     TYHATR  SETYHATR1
1.2169967  0.4213959  22.653524  7.8439846  3171.4934  1098.1578

Ratio Estimators using xbar in variance of Bhat
     BHAT    SEBHAT2      YRHAT   SEYRHAT2     TYHATR  SETYHATR2
1.2169967  0.4530354  22.653524  8.4329306  3171.4934  1180.6103
In SAS surveymeans you can get B̂ and its estimated standard error using
proc surveymeans total=140;
var ducks length;
ratio ducks/length;
run;
                            Ratio Analysis
Numerator   Denominator    N      Ratio    Std Err        95% CL for Ratio
ducks       length        14   1.216997   0.453035   0.23827320   2.19572020
The duck example using a regression estimator. The regression fit is:

The REG Procedure
Dependent Variable: ducks

                     Parameter    Standard
Variable      DF      Estimate       Error   t Value   Pr > |t|
Intercept      1      18.00461    24.91561      0.72     0.4838
length         1       0.17713     1.35473      0.13     0.8981

*Carrying out analysis in IML *

                                        B
estimated regression coefficients   18.004608
                                     0.1771266

Estimated mean and total using regression estimator
     YREG     SEYREG      TREG     SETREG
21.301693  7.6582319  2982.237  1072.1525
Once you have the estimated regression coefficients you can also create new variables
ye = ducks + 0.1771266*((2606/140) - length) and te = 140*ye so that running these through
surveymeans gives us the estimated mean and total and standard errors.
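The regression estimator can also be verified by hand: fit y on x by least squares, then adjust the sample mean using the known population mean of x, ȳreg = ȳ + b1(x̄U − x̄), which is exactly what the ye variable above computes on average. A Python sketch with the duck data:

```python
# Duck data from the listing in this example.
x = [6.2, 12.4, 2.3, 22.8, 20.7, 24.8, 22.8,
     18.6, 22.8, 18.6, 16.6, 18.6, 16.6, 18.6]
y = [3, 3, 30, 3, 15, 16, 4, 114, 24, 48, 0, 8, 0, 27]
n, N = len(x), 140
xbar_U = 2606 / N
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx                           # OLS slope
b0 = ybar - b1 * xbar                    # OLS intercept
ybar_reg = ybar + b1 * (xbar_U - xbar)   # regression estimator of the mean
t_reg = N * ybar_reg                     # regression estimator of the total
```

With the outlier flattening the fitted slope, the regression estimate stays close to the plain sample mean here.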
The SURVEYMEANS Procedure
Std Error
Lower 95%
Upper 95%
Variable
N
Mean
of Mean
CL for Mean
CL for Mean
-----------------------------------------------------------------------------ye
14
21.301693
7.658232
4.757089
37.846297
te
14
2982.237041
1072.152467
665.992457
5298.481626
-----------------------------------------------------------------------------12
Computing variances for the ratio, estimated slope, and ratio and regression
estimators using the Jackknife.
The example below, which continues to use the duck data, was done using a customized
IML program (ratreg.sas) that does both ratio and regression estimators for an SRS. The
standard errors of the estimates are calculated using the approximate variances resulting
from the delta method (see class discussion and text) and using the delete-one jackknife.
The most recent version of SAS (9.2) now has a "varmethod=" option which allows for using
the jackknife. This, in combination with a ratio command, will supply a jackknife estimate
of either the variance or standard error of B̂ = estimator of ȳU /x̄U . Some of the analyses
given earlier are repeated here for easy comparison.
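The delete-one jackknife for B̂ is easy to code directly. A Python sketch of the pseudo-value calculation (the class program also applies a finite population correction to the standard error, omitted here):

```python
import math

# Duck data from the listing in this example.
x = [6.2, 12.4, 2.3, 22.8, 20.7, 24.8, 22.8,
     18.6, 22.8, 18.6, 16.6, 18.6, 16.6, 18.6]
y = [3, 3, 30, 3, 15, 16, 4, 114, 24, 48, 0, 8, 0, 27]
n = len(x)
sx, sy = sum(x), sum(y)
b_hat = (sy / n) / (sx / n)
# Leave-one-out ratios and the usual pseudo values n*b_hat - (n-1)*b_(i).
b_loo = [(sy - y[i]) / (sx - x[i]) for i in range(n)]
pseudo = [n * b_hat - (n - 1) * b for b in b_loo]
jack_ratio = sum(pseudo) / n          # jackknife estimate of the ratio
se_pseudo = math.sqrt(sum((p - jack_ratio) ** 2 for p in pseudo)
                      / (n * (n - 1)))  # SE from the pseudo values, no fpc
```

The jackknife point estimate comes out close to, but not identical to, the original B̂, and its SE is similar in size to the delta-method SE.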
Estimated mean and total without using x
     YBAR     SEYBAR    THAT      SETHAT
21.071429  7.6636848    2950   1072.9159

Ratio Estimators using xubar in variance of Bhat
     BHAT    SEBHAT1      YRHAT   SEYRHAT1     TYHATR  SETYHATR1
1.2169967  0.4213959  22.653524  7.8439846  3171.4934  1098.1578

Ratio Estimators using xbar in variance of Bhat
     BHAT    SEBHAT2      YRHAT   SEYRHAT2     TYHATR  SETYHATR2
1.2169967  0.4530354  22.653524  8.4329306  3171.4934  1180.6103

                                        B
estimated regression coefficients   18.004608
                                     0.1771266

Estimated mean and total using regression estimator
     YREG     SEYREG      TREG     SETREG
21.301693  7.6582319  2982.237  1072.1525
Jackknife estimate of ratio 1.2069393
Jackknife SE of ratio, based on pseudo values 0.4544981
Jackknife SE of ratio, using original bhat in se calculation 0.4545058
Jackknife ratio estimate of yubar 22.466313
Jackknife SE of est. mean, based on pseudo values 8.4601578
Jackknife SE of est. mean using original bhat in se calculation 8.4603012
Jackknife regression estimate of yubar 21.09824
Jackknife SE based on pseudo values 8.0239971
Jackknife SE using original bhat in se calculation 8.0241757
pratio and preg contain the Pseudo values from deleting one observation at a time.
Surveymeans with N = 140 was run on the pseudo values.
                               Std Error
Variable     N        Mean       of Mean        95% CL for Mean
PRATIO      14    1.206939      0.454498   0.22505583   2.1888228
PREG        14   21.098240      8.023997   3.76344817  38.4330318

Plot of ducks*length. Legend: A = 1 obs, B = 2 obs, etc.
[Text scatter plot omitted: one transect (114 ducks) sits far above the rest,
and the remaining points show no strong trend of ducks with length.]
7 Allocation in Stratified Sampling: Example using the agricultural farm data.
Here the objective is to estimate the total acres of farmland in the country. The strata are
the states and the counties are the sampling units within the strata. We will use the 1992
sample standard deviations as the Sk 's in determining the optimal allocation and in finding
the variances of the estimates based on different sampling schemes. In the program below
the total sample size n is specified. Both the proportional and the optimal (Neyman)
allocation often give non-integers, so we round up (unless that exceeds the size for that
state), which means the actual sample sizes are often bigger than desired. This could be
modified to use rounding to the nearest integer instead of always rounding up. With the
optimal allocation there is no guarantee that the sample size for stratum k is less than Nk .
Here, if it comes out larger than Nk , I just set the sample size for that stratum equal to
Nk . This happens with larger total n, and in this case the actual sample size used under the
optimal allocation is less than the specified n. This could also be modified. The purpose here
was to give a general idea of the improvements from using optimal rather than proportional
allocation and a comparison to the use of an SRS.
Finally, I've left the sample sizes of 1 in there in some places. This is fine in terms of
computing the variance of the estimator when we treat the Sk 's as known and compare
variances, but we wouldn't really use samples of size 1 from a stratum since we can't get an
estimated variance within that stratum.
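The allocation arithmetic can be sketched in Python. The stratum sizes below match AK, AL, AR and AZ from the farm data; the first two SDs are the 1992 values shown in this handout, while the last two SDs are MADE-UP stand-ins for illustration only.

```python
import math

N_k = [4, 67, 75, 15]                            # stratum (state) sizes
S_k = [58992.47, 53670.31, 60000.0, 300000.0]    # last two are hypothetical
n = 100                                          # desired total sample size
N = sum(N_k)
w = [Nk * Sk for Nk, Sk in zip(N_k, S_k)]        # Neyman weights N_k * S_k
prop = [math.ceil(n * Nk / N) for Nk in N_k]     # proportional, rounded up
opt = [min(Nk, math.ceil(n * wk / sum(w)))       # Neyman, capped at N_k
       for Nk, wk in zip(N_k, w)]
```

As in the notes, rounding up makes the achieved totals exceed n, and capping at Nk can make the achieved optimal total fall short of n.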
DESCRIPTION OF THE POPULATION

Analysis Variable : ACRES92

STATE   Obs    N          Mean      Std Dev    Minimum      Maximum
--------------------------------------------------------------------
AK        4    4      59876.00     58992.47     210.00    141338.00
AL       67   67     126131.69     53670.31   35748.00    233422.00
WY       23   23    1429394.39    783353.88   62307.00   2720903.00
--------------------------------------------------------------------
Obs   STATE   _TYPE_   _FREQ_   bign         ybar           s
  1   AK           1        4      4     59876.00    58992.47
  2   AL           1       67     67    126131.69    53670.31
 50   WY           1       23     23   1429394.39   783353.88

TOTALN = 3058
total sample size desired  100

STATE       ALLOC   PROPSS   OPTSS    N
AK      0.0003797        1       1    4
AL      0.0057868        3       1   67
AR      0.0136038        3       2   75
AZ      0.0550905        1       6   15
WA      0.0300641        2       4   39
WI      0.0148585        3       2   70
WV      0.0040224        2       1   55
WY      0.0289943        1       3   23
sample sizes used
TOTALSS   SUMPROP   SUMOPT
    100       128      129

variances for estimating the mean acres/county
  VARYBAR    VARPROP     VAROPT
1.74988E9  655566777  321906326

variances for estimating the total acres
     VART   VARTPROP    VARTOPT
1.6364E16  6.1304E15  3.0103E15

standard errors for estimating mean acres/county
   SEYBAR     SEPROP      SEOPT
41831.617  25604.038  17941.748

SEYBAR above uses the 100. To make it more comparable to the designs
actually used for proportional and optimal:
SEYBAR with n = 128 = 36798.862 and
SEYBAR with n = 129 = 36649.698.

RATIOOP
ratio of se optimal to se proportional 0.700739
RATIOOY
ratio of se optimal to se for SRS 0.428904

standard errors for estimating total acres
      SET   SETPROP    SETOPT
127921085  78297149  54865866
******* n = 2000 *****
The total sample size used for the optimal allocation is only 1515. This is
because the allocation for a number of the states is bigger than Nk for that
state and so it was truncated to Nk.
total sample size desired  2000

STATE       ALLOC   PROPSS   OPTSS    N
AK      0.0003797        3       1    4
AL      0.0057868       44      12   67
AZ      0.0550905       10      15   15
WV      0.0040224       36       9   55
WY      0.0289943       16      23   23

sample sizes used
TOTALSS   SUMPROP   SUMOPT
   2000      2021     1515

standard errors for estimating mean acres/county
   SEYBAR     SEPROP      SEOPT
5594.1409  3993.0167  2053.3689

RATIOOP
ratio of se optimal to se proportional 0.51424
RATIOOY
ratio of se optimal to se for SRS 0.3670571

standard errors for estimating total acres
      SET    SETPROP     SETOPT
 17106883   12210645  6279202.2
8 Stratified Sampling in Surveymeans: Estimating egg mass densities with
stratified samples.
h is used here as in Lohr's book. In other places we've used k.
The estimate of the population mean is ȳˆstr = Σ_{h=1}^{H} Nh ȳh /N . If the Nh are all equal
then N = H Nh and so the estimate is Σh ȳh /H, which is the mean of the strata means. But
if you run surveymeans without any weighting it gives you as an estimate just the overall
sample mean ȳ. The stratified example given in the SAS documentation is only correct
without the use of any weighting because the sampling is proportional to size.
That is, all the sampling rates are the same, in which case we know that ȳˆstr = ȳ.
They fail to mention that with unequal sampling rates you need to weight. This
can be done by creating a weight of Nh /nh for each observation in stratum h.
This is demonstrated in the egg mass/defoliation example using the egg values. With no
weighting surveymeans will just give the overall mean. The sampling rate in one of the strata
is different, so we need to weight. There is also an IML program that does the calculation
directly as a double check on what surveymeans is doing.
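The distinction is easy to see numerically; a Python sketch using the strata (plot) means for the egg counts from this example, where every stratum has Nh = 600 subplots and C2 was sampled at twice the rate of the others:

```python
# Plot means of egg counts; C2 had n_h = 30 subplots sampled, the rest 15.
means = {'460': 54.2000, '4A': 21.1333, 'C13': 41.0667, 'C2': 9.0000,
         'D1': 29.4667, 'H6': 15.8000, 'N22': 54.9867, 'S42': 18.2000}
n_h = {h: 30 if h == 'C2' else 15 for h in means}
N_h = {h: 600 for h in means}
N = sum(N_h.values())
# Stratified estimate: weight each stratum mean by N_h/N (here, just the
# mean of the strata means since the N_h are equal).
ybar_str = sum(N_h[h] * means[h] for h in means) / N
# Unweighted pooling over-weights C2, which has a low mean:
ybar_overall = sum(n_h[h] * means[h] for h in means) / sum(n_h.values())
```

Because C2 has a low mean and twice the sampling rate, the unweighted overall mean is pulled below the stratified estimate.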
option linesize=70;
data dd;
infile ’87md.dat’;
input plot $ subplot egg def;
wt=600/15;
if plot = ’C2’ then wt=600/30;
run;
proc means; var egg; run;
proc means nway;
class plot; var egg; output out=mout mean=megg;
run;
proc print data=mout;
proc means data=mout; var megg; run;
/*Here we create a file called totals that has in the
strata sizes */
data totals; set mout;
_total_=600; keep plot _total_;
run;
proc print data=totals;
run;
proc surveymeans data=dd total=totals mean clm sum clsum;
stratum plot/list;
var egg;
weight wt;
run;
THE OVERALL MEAN

  N         Mean      Std Dev   Minimum       Maximum
135   28.0948148   44.6275086         0   414.6000000
Obs   plot   _TYPE_   _FREQ_      megg
  1   460         1       15   54.2000
  2   4A          1       15   21.1333
  3   C13         1       15   41.0667
  4   C2          1       30    9.0000
  5   D1          1       15   29.4667
  6   H6          1       15   15.8000
  7   N22         1       15   54.9867
  8   S42         1       15   18.2000
THE MEAN OF THE STRATA MEANS.

N         Mean      Std Dev     Minimum      Maximum
8   30.4816667   17.6933031   9.0000000   54.9866667
USING SURVEYMEANS WITH WEIGHTING

Obs   plot   _total_
  1   460        600
  2   4A         600
  3   C13        600
  4   C2         600
  5   D1         600
  6   H6         600
  7   N22        600
  8   S42        600
Stratum          Population   Sampling
  Index   plot        Total       Rate   N Obs   Variable    N
---------------------------------------------------------------
      1   460           600      2.50%      15   egg        15
      2   4A            600      2.50%      15   egg        15
      3   C13           600      2.50%      15   egg        15
      4   C2            600      5.00%      30   egg        30
      5   D1            600      2.50%      15   egg        15
      6   H6            600      2.50%      15   egg        15
      7   N22           600      2.50%      15   egg        15
      8   S42           600      2.50%      15   egg        15
---------------------------------------------------------------
                        Std Error
Variable        Mean      of Mean   95% CL for Mean
-----------------------------------------------------------
egg        30.481667     4.036904   22.4933621   38.4699713
-----------------------------------------------------------

Variable         Sum      Std Dev   95% CL for Sum
-----------------------------------------------------------
egg           146312        19377   107968.138   184655.862
-----------------------------------------------------------
*** FROM AN IML PROGRAM TO DO ANALYSIS FOR STRATIFIED RANDOM SAMPLING.
    THIS IS STRAT.SAS ON THE WEB SITE. ***

 N        MEAN          S2          SD     FPC
15   54.2            2526.6   50.265296   0.975
15   21.133333   436.69524    20.897254   0.975
15   41.066667   1374.9238    37.079965   0.975
30    9          144.75862    12.031568   0.95
15   29.466667   769.55238    27.740807   0.975
15   15.8        279.88143    16.729657   0.975
15   54.986667   10361.958   101.79371    0.975
15   18.2        225.74286    15.024742   0.975

    YBHAT    SEYBHAT    THAT      SETHAT
30.481667  4.0369041  146312   19377.139
9 Single stage cluster example. Ice Cream data.
We'll illustrate the computing for one-stage cluster sampling using the example from the SAS
documentation, but for now we'll use just the grade 8 data. There is a total of K = 1025
students broken up into 252 study groups. Three study groups are sampled and each student
in a sampled study group is observed, with the response being the amount spent per week on
ice cream. The sample totals, sizes and means for the sampled psu's are:
ts1 = 68, Ms1 = 4, ȳ1 = 17
ts2 = 38, Ms2 = 3, ȳ2 = 12.667
ts3 = 33, Ms3 = 2, ȳ3 = 16.5
with t̄ = 46.3333 and M̄ = 3.
The estimated total is
t̂ = N t̄ = 252 ∗ 46.3333 = 11676
and the estimated mean is
ȳˆ = t̂/1025 = 11.391,
which uses the knowledge of the number of ssu's. The ratio estimator of ȳU (which
doesn't use the 1025) is
ȳˆratio = 46.3333/3 = 15.4444.
If we create a file having the tsi 's and Msi 's in it, we would just analyze this as an SRS using
surveymeans with weight 252/3 and total = 252. As discussed with an SRS, the sum will give t̂,
while with a weight of 252/(3*1025) the sum will give ȳˆunb . The associated standard errors
are SE(t̂) = 2737.68 and SE(ȳˆunb ) = 2.6709. One can also use the ratio option to get the ratio
estimate of the mean with weight 252/3. This is done near the end.
Working with the original data directly:
With no weighting the estimate of the mean is the ratio estimator.
With weight = 252/3 and total = 252 the estimate of the sum is t̂, while with a weight of
252/(3*1025) the estimate of the sum is ȳˆunb .
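The three estimators can be computed in a few lines; a Python sketch using the psu totals and sizes from this example:

```python
# psu totals t_si and sizes M_si for the three sampled grade-8 study groups.
ts = [68, 38, 33]
m = [4, 3, 2]
n, N_psu, K = len(ts), 252, 1025
t_bar = sum(ts) / n
t_hat = N_psu * t_bar            # unbiased estimate of total spending
ybar_unb = t_hat / K             # per-student mean, uses K = 1025
ybar_ratio = sum(ts) / sum(m)    # ratio estimator, doesn't use K
```

Note that ȳˆunb and ȳˆratio differ substantially here because the sampled groups average 3 students while 1025/252 ≈ 4.07 students per group overall.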
Obs   Grade   StudyGroup   Spending     wt        wt2
  1       8           59         20   84.0   0.081951
  2       8           59         13   84.0   0.081951
  3       8           59         17   84.0   0.081951
  4       8          143         12   84.0   0.081951
  5       8          143         16   84.0   0.081951
  6       8           59         18   84.0   0.081951
  7       8          143         10   84.0   0.081951
  8       8          156         19   84.0   0.081951
  9       8          156         14   84.0   0.081951
                 N
StudyGroup     Obs   N         Mean      Std Dev      Minimum      Maximum
---------------------------------------------------------------------------
        59       4   4   17.0000000    2.9439203   13.0000000   20.0000000
       143       3   3   12.6666667    3.0550505   10.0000000   16.0000000
       156       2   2   16.5000000    3.5355339   14.0000000   19.0000000
---------------------------------------------------------------------------
Obs   StudyGroup   _TYPE_   _FREQ_   ts   m
  1           59        1        4   68   4
  2          143        1        3   38   3
  3          156        1        2   33   2
grade 8 unweighted

The SURVEYMEANS Procedure
Number of Clusters        3
Number of Observations    9

                              Std Error
Variable   N        Mean        of Mean   95% CL for Mean
-----------------------------------------------------------------
Spending   9   15.444444       1.435506   9.26796013   21.6209288
-----------------------------------------------------------------

grade 8 wt = N/n

                        Std Error
Variable        Mean      of Mean   95% CL for Mean             Sum
-----------------------------------------------------------------------
Spending   15.444444     1.435506   9.26796013   21.6209288   11676
-----------------------------------------------------------------------
Variable       Std Dev   95% CL for Sum
-----------------------------------------------------------------------
Spending   2737.681501   -103.29278   23455.2928
-----------------------------------------------------------------------

grade 8 wt = N/nK

                        Std Error
Variable        Mean      of Mean   95% CL for Mean             Sum
-----------------------------------------------------------------------
Spending   15.444444     1.435506   9.26796013   21.6209288   11.391220
-----------------------------------------------------------------------
Variable    Std Dev    95% CL for Sum
-----------------------------------------------------------------------
Spending   2.670909    -0.1007734   22.8832125
-----------------------------------------------------------------------

working with ts wt = 252/3

Number of Observations    3

Variable          Sum       Std Dev   95% CL for Sum
-----------------------------------------------------------------
ts              11676   2737.681501   -103.29278   23455.2928
m          756.000000    144.623650    133.73466   1378.2653
-----------------------------------------------------------------

               Ratio Analysis
Numerator   Denominator       Ratio    Std Err
------------------------------------------------
ts          m             15.444444   1.435506
------------------------------------------------

working with ts wt = 252/(3*1025)

                        Std Error
Variable        Mean      of Mean   95% CL for Mean              Sum
-----------------------------------------------------------------------
ts         46.333333    10.863815   -0.4098920   93.0765587   11.391220
-----------------------------------------------------------------------
Variable    Std Dev    95% CL for Sum
-----------------------------------------------------------------------
ts         2.670909    -0.1007734   22.8832125
-----------------------------------------------------------------------

option ls=80 nodate;
data IceCreamStudy;
input Grade StudyGroup Spending @@;
wt=252/3;
wt2=252/(3*1025);
datalines;
7 34 7
7 34 7
7 412 4
9 27 14
7 34 2
9 230 15
9 27 15
7 501 2
9 230 8
9 230 7
7 501 3
8 59 20
7 403 4
7 403 11
8 59 13
8 59 17
8 143 12
8 143 16
8 59 18
9 235 9
8 143 10
9 312 8
9 235 6
9 235 11
9 312 10
7 321 6
8 156 19
8 156 14
7 321 3
7 78 1
7 78 6
7 321 12
7 78 10
7 412 6
7 489 2
7 489 2
7 156 2
7 489 9
7 156 1
9 301 8
;
data grade8;
set IceCreamStudy;
if grade = 8;
proc print;
run;
proc means data=grade8 nway;
class StudyGroup;
var spending;
output out=gr8out sum=ts n=m;
proc print data=gr8out;
run;
/* Using the original data */
title ’grade 8 unweighted’;
proc surveymeans data=grade8 total=252;
/* the sum will not be useful here */
cluster StudyGroup;
var spending;
run;
title ’grade 8 wt = N/n ’;
proc surveymeans data=grade8 total=252 mean sum clm clsum;
cluster StudyGroup;
var spending;
weight wt;
run;
title ’grade 8 wt = N/nK ’;
/* here the sum will give the unbiased estimator of the
per ssu mean */
proc surveymeans data=grade8 total=252 mean sum clm clsum;
cluster StudyGroup;
var spending;
weight wt2;
run;
proc means data=gr8out;
var ts m;
run;
data tvalues;
set gr8out;
wt = 252/3;
wt2 = 252/(3*1025);
run;
title ’working with ts wt = 252/3’;
proc surveymeans data = tvalues total=252 sum clsum;
var ts m;
ratio ts/m;
weight wt;
run;
title ’working with ts wt = 252/(3*1025)’;
proc surveymeans data = tvalues total=252 mean sum clm clsum;
var ts;
weight wt2;
run;
10
Ice Cream data with stratification and cluster sampling.
The numbers of study groups are 608, 252 and 403 for grades 7, 8 and 9 respectively. The
numbers sampled are 8, 3 and 5. If we assign weight Nh/nh to an observation from stratum
h, then the estimate of the mean is the ratio estimator, and the estimate of the total is the
unbiased estimate of the total. With weight Nh/(nh K) for an observation from stratum h, the
estimate of the mean is still the ratio estimator, but the sum is now the unbiased estimate
of ȳu.
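The weighting above can be sketched directly. Below is a minimal Python sketch (not one of the posted SAS programs) of the stratified one-stage cluster ratio estimator, with the cluster totals and cluster sizes read off the ice cream data listing:

```python
# Sketch (Python, not one of the posted SAS programs) of the stratified
# one-stage cluster ratio estimator.  Cluster totals t_i and sizes m_i are
# read off the ice cream data listing; stratum h contributes weight N_h/n_h.
strata = {
    # grade: (N_h, n_h, [(t_i, m_i), ...])
    7: (608, 8, [(16, 3), (10, 2), (5, 2), (15, 2),
                 (21, 3), (17, 3), (13, 3), (3, 2)]),
    8: (252, 3, [(68, 4), (38, 3), (33, 2)]),
    9: (403, 5, [(29, 2), (30, 3), (26, 3), (18, 2), (8, 1)]),
}

# numerator = estimated total spending, denominator = estimated total students
num = sum(N / n * sum(t for t, _ in cl) for N, n, cl in strata.values())
den = sum(N / n * sum(m for _, m in cl) for N, n, cl in strata.values())
ybar_ratio = num / den  # ratio estimate of mean spending per student, about 8.92
```

The numerator matches the weighted sum (28223) and the ratio matches the weighted mean (8.923860) in the surveymeans output below.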
Obs    grade    _total_
  1        7        608
  2        8        252
  3        9        403

Stratum               Population    Sampling
  Index    Grade           Total        Rate    N Obs    Variable     N    Clusters
-------------------------------------------------------------------------------
      1        7             608       1.32%       20    Spending    20           8
      2        8             252       1.19%        9    Spending     9           3
      3        9             403       1.24%       11    Spending    11           5
-------------------------------------------------------------------------------

                               Std Error
Variable     N        Mean       of Mean    95% CL for Mean
----------------------------------------------------------------------------
Spending    40    8.750000      0.634549    7.37913979  10.1208602
----------------------------------------------------------------------------

weighted by N_h/n_h

                          Std Error
Variable        Mean        of Mean    95% CL for Mean              Sum
----------------------------------------------------------------------------
Spending    8.923860       0.650859    7.51776370  10.3299565    28223
----------------------------------------------------------------------------

Variable        Std Dev    95% CL for Sum
------------------------------------------------
Spending    3456.556840    20755.1629  35690.0371
------------------------------------------------

weighted by N_h/n_hK

                          Std Error
Variable        Mean        of Mean    95% CL for Mean              Sum
----------------------------------------------------------------------------
Spending    8.923860       0.650859    7.51776370  10.3299565  7.055650
----------------------------------------------------------------------------

Variable     Std Dev    95% CL for Sum
------------------------------------------------
Spending    0.864139    5.18879074  8.92250926
------------------------------------------------

option ls=80 nodate;
/* The ice cream example with grades as strata */
data ic;
input Grade StudyGroup Spending @@;
if grade= 7 then wt = 608/8;
if grade = 8 then wt=252/3;
if grade = 9 then wt = 403/5;
wtk = wt/4000;
datalines;
7 34 7
7 34 7
7 412 4
9 27 14
7 34 2
9 230 15
9 27 15
7 501 2
9 230 8
9 230 7
7 501 3
8 59 20
7 403 4
7 403 11
8 59 13
8 59 17
8 143 12
8 143 16
8 59 18
9 235 9
8 143 10
9 312 8
9 235 6
9 235 11
9 312 10
7 321 6
8 156 19
8 156 14
7 321 3
7 321 12
7 489 2
7 489 9
7 78 1
7 78 10
7 489 2
7 156 1
7 78 6
7 412 6
7 156 2
9 301 8
;
data Stsizes;
input grade _total_;
datalines;
7 608
8 252
9 403
;
proc print;
run;
title ’unweighted’;
proc surveymeans data=ic total=Stsizes;
strata grade/list;
cluster StudyGroup;
var spending;
run;
title ’weighted by N_h/n_h ’;
proc surveymeans data=ic total=Stsizes mean sum clm clsum;
strata grade/list;
cluster StudyGroup;
var spending;
weight wt;
run;
title ’weighted by N_h/n_hK’;
proc surveymeans data=ic total=Stsizes mean sum clm clsum;
strata grade/list;
cluster StudyGroup;
var spending;
weight wtk;
run;
11
Ozone levels and systematic sampling
Ozone example from Problem 5-22 in Lohr. The data were stretched out so each case is an
hourly reading, with 24 successive hours per date (I'm not sure what is going on with the dates
in her file but have assumed they are in chronological order). With two days eliminated this gives
17472 records. There were 264 missing values. Here is the information on each cluster if we sample
a random hour and then observe ozone at that hour every day over the two years.
hour    _TYPE_    _FREQ_      n       mean      sum        var
   1         1       728    720    24.2319    17447    114.003
   2         1       728    722    23.8407    17213    116.123
   3         1       728    723    23.7801    17193    116.615
   4         1       728    722    23.6440    17071    120.759
   5         1       728    721    23.6519    17053    118.774
  ...
  20         1       728    721    28.0888    20252    129.820
  21         1       728    722    26.7355    19303    118.874
  22         1       728    722    25.6731    18536    112.478
  23         1       728    703    25.0853    17635    115.873
  24         1       728    717    24.7294    17731    116.471
a) Population information.

Mean             27.60350
Median           29.00000
Std Deviation    11.43614
Variance        130.78536
b) We choose every 24th value after a random starting point from 1 to 24. This means the sample
size (not accounting for missing values) is 728. This is equivalent to having 24 clusters, each of size
728 (see above), and we pick one at random. Note that the sample will have values from the 728
dates, all at the same hour, but with some missing. The sample I chose had 20 missing values.
c) In analyzing as an SRS, I used 17208 since that was the number of nonmissing values.
proc surveyselect method=sys sampsize=728 out=sysout;
proc surveymeans total=17208 data=sysout mean clm;
var ozone;
                        Std Error     Lower 95%     Upper 95%
Variable      Mean        of Mean   CL for Mean   CL for Mean
------------------------------------------------------------------------
ozone    30.080508       0.383182     29.328198     30.832819
------------------------------------------------------------------------
The interval misses the true population mean.
d) Now we have one-stage cluster sampling with n = 4 sampled clusters out of 96. Surveyselect
states that it chooses independent samples, so the sampling is actually done with replacement.
This means there is no need for the finite population correction factor that comes from specifying
total=96. But if you reran until you got 4 distinct clusters, this would be equivalent to sampling
without replacement. Here there were four distinct clusters. If you did put in total=96, this leads
to a minor change in the standard error, namely .965381. With the clusters of equal size
you can run the analysis without any weighting if you ignore missing values. This should be rerun
with weighting to account for the missing values.
proc surveyselect data=a method=sys sampsize=182 out=sysout2
rep=4; run;
proc surveymeans data=sysout2 mean clm;
cluster replicate; var ozone; run;
Number of Clusters          4
Number of Observations    728

                        Std Error     Lower 95%     Upper 95%
Variable      Mean        of Mean   CL for Mean   CL for Mean
-----------------------------------------------------------------
ozone    25.303621       0.986145     22.165269     28.441974
-----------------------------------------------------------------

Looking at the effect of treating a systematic sample (one cluster of 728) as if
it were an SRS.
To illustrate, I filled in all of the missing values with 20. The average of varsrs below is the expected
value of the estimator of V(ybarhat) obtained by treating the systematic sample as an SRS. The
variance of vsys is the true variance of ybarhat under systematic sampling.

Variable       N          Mean       Variance
varsrs        24     0.1566136    0.000193907
vsys          24    26.9098379     10.8536016
The real variance of ȳ̂u is 10.8536016. The expected value of the estimator of the variance of ȳ̂u
assuming it was an SRS is .1566136, obviously dramatically wrong. This seems a little strange but
it can be cross-checked through a simulation: run 10000 simulations with a systematic sample,
getting ybarhat = ȳ̂u and varsrs, the estimator of the variance of ȳ̂u assuming it is an SRS. Note
that the variance of ybarhat and the mean of varsrs agree with the theoretical calculations.
Any difference is due to sampling error (from 10000 rather than infinitely many simulations).
Variable        N          Mean       Variance      Std Dev
YBARHAT     10000    27.4863152     10.8206840    3.2894808
VARSRS      10000     0.1633585    0.000203398    0.0142617
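The phenomenon can be reproduced on synthetic data. The sketch below (Python; the population is made up, not the ozone series) builds a population with a strong period-24 cycle, so that each systematic sample at interval 24 is internally homogeneous and the SRS formula grossly understates the true variance of ybarhat over the 24 possible samples:

```python
import math
import statistics

# Synthetic check (invented periodic population, NOT the ozone data): with a
# period-24 cycle and sampling interval 24, each systematic sample is nearly
# constant, so the SRS variance formula comes out far too small.
k, n = 24, 30
N = k * n
y = [10 + 5 * math.sin(2 * math.pi * (i % k) / k) + 0.001 * i for i in range(N)]
ybar_u = sum(y) / N

ybars, varsrs = [], []
for start in range(k):                       # all k possible systematic samples
    s = y[start::k]
    ybars.append(sum(s) / n)
    varsrs.append((1 - n / N) * statistics.variance(s) / n)  # SRS formula

true_var = sum((yb - ybar_u) ** 2 for yb in ybars) / k   # true V(ybar-hat)
avg_srs = sum(varsrs) / k                    # E of the SRS-based estimator
```

Here avg_srs is orders of magnitude below true_var, the same direction of failure as in the ozone example.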
12
Clarification on page 31 and model/design based approach with an SRS
Top of page 31. The notation is potentially confusing and I inadvertently left off the key line when
I rewrote these notes. Here is a more detailed explanation.

In our notation based on the model, Ȳ = Σ_{i∈S} Y_i/n. From the model based perspective, the S,
designating the units in the sample, is fixed. When we include the model and the design, the S is
random and we can think of Ȳ as random over the model and the design.

If we condition on the realized values Y = y = (Y_1 = y_1, . . . , Y_N = y_N), then

    E_D(Ȳ | Y = y) = ȳ_u  and  V_D(Ȳ | Y = y) = (N − n)S²/(Nn),

where S² = Σ_{i=1}^N (y_i − ȳ_u)²/(N − 1). This expected value and variance are over the design (here, a
simple random sample), which is why I've written D. We can write S²_RV = Σ_{i=1}^N (Y_i − Ȳ_u)²/(N − 1)
to describe S² as random over the model.

Over the model and the design, with M indicating mean and variance over the model,

    E(Ȳ) = E_M(E_D(Ȳ | Y)) = E_M(Ȳ_u) = μ

and

    V(Ȳ) = E_M(V_D(Ȳ | Y)) + V_M(E_D(Ȳ | Y)) = E_M((N − n)S²_RV/(Nn)) + V_M(Ȳ_u)
         = (N − n)σ²/(Nn) + σ²/N = σ²/n,

    Cov(Ȳ, Ȳ_u) = E_M(Cov_D(Ȳ, Ȳ_u | Y)) + Cov_M(E_D(Ȳ | Y), E_D(Ȳ_u | Y)) = 0 + V_M(Ȳ_u) = σ²/N  (page 29).

The missing line is: V(Ȳ − Ȳ_u) = V(Ȳ) + V(Ȳ_u) − 2 Cov(Ȳ, Ȳ_u) = (N − n)σ²/(Nn). It is the square
root of an estimate of V(Ȳ − Ȳ_u) which gets used for the standard error when predicting Ȳ_u with
Ȳ. This variance is estimated by (N − n)s²/(Nn), exactly as when using just the design based
approach.
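The design expectations used above can be checked by brute force. The sketch below (Python, with a small invented population, not course data) enumerates all C(N, n) simple random samples and recovers E_D(Ȳ | Y = y) = ȳ_u and V_D(Ȳ | Y = y) = (N − n)S²/(Nn) exactly:

```python
from itertools import combinations
from statistics import mean

# Brute-force check of the design mean and variance of the sample mean under
# SRS, for a small toy population (invented values).
y = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0]
N, n = len(y), 3
ybar_u = mean(y)
S2 = sum((v - ybar_u) ** 2 for v in y) / (N - 1)

sample_means = [mean(s) for s in combinations(y, n)]   # all C(6,3) = 20 samples
E_D = mean(sample_means)                               # design expectation
V_D = mean((m - E_D) ** 2 for m in sample_means)       # design variance
# E_D equals ybar_u and V_D equals (N - n) * S2 / (N * n), up to rounding
```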
13
Analysis of the egg-defoliation data as a two-stage cluster sample
We revisit the egg-defoliation data. Now suppose that the plots in the sample are an SRS of 8 plots
out of a population of 50 plots (so the total population has K = 50 ∗ 600 subplots). The subplots
are treated as an SRS within a plot, so we have two-stage cluster sampling. If ȳ_i, s²_i and m_i are
the sample mean, variance and sample size for cluster i respectively, then the estimated total is

    t̂ = (50/8) Σ_i 600 ȳ_i

and the estimated variance is

    V̂(t̂) = 50² ((50 − 8)/(50 ∗ 8)) s²_t + (50/8) Σ_i 600² ((600 − m_i)/(600 m_i)) s²_i.

As seen below with direct calculations this leads to

    t̂_unb = 914450 with an estimated SE of 178690.70.

Since M is constant (= 600) over all clusters, the ratio estimator of the mean
per ssu is the same as the unbiased estimator, which is

    ȳ̂_unb = 914450/30000 = 30.48167 with SE = 178690.70/30000 = 5.957.
Computing. Using surveymeans with the weight for an observation in cluster i being N M_i/(n m_i),
the estimate of the sum gives the correct estimator of the total and the ratio estimator is the unbiased
estimator of the mean per ssu. (If you ran surveymeans with weight wt2 = wt/30000, the analysis
for either the mean or the sum is the same as given for the mean below.) As shown in class,
however, the estimated standard errors from surveymeans ignore the second term, which arises
from the second stage sampling. For example, the standard error here of 171999 for the estimated
total is too small; the correct SE is 178690.7.
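The direct calculation is easy to reproduce outside SAS. The sketch below (Python, using the per-plot summaries printed below rather than the raw 87md.dat file) computes t̂_unb and the two-term SE:

```python
import math

# Two-stage estimates from the per-plot summary statistics (N = 50 psus,
# n = 8 sampled, M = 600 ssus per plot, m_i subplots measured in plot i).
plots = [  # (plot, mean, std dev, m_i)
    ("460", 54.2000000,  50.2652962, 15),
    ("4A",  21.1333333,  20.8972543, 15),
    ("C13", 41.0666667,  37.0799651, 15),
    ("C2",   9.0000000,  12.0315677, 30),
    ("D1",  29.4666667,  27.7408071, 15),
    ("H6",  15.8000000,  16.7296572, 15),
    ("N22", 54.9866667, 101.7937050, 15),
    ("S42", 18.2000000,  15.0247415, 15),
]
N, n, M = 50, 8, 600

t_i = [M * ybar for _, ybar, _, _ in plots]      # estimated plot totals
t_hat = (N / n) * sum(t_i)                       # unbiased estimate of t
tbar = sum(t_i) / n
st2 = sum((t - tbar) ** 2 for t in t_i) / (n - 1)
term1 = N ** 2 * (1 - n / N) * st2 / n           # first-stage contribution
term2 = (N / n) * sum(M ** 2 * (M - m) / (M * m) * sd ** 2
                      for _, _, sd, m in plots)  # second-stage contribution
se = math.sqrt(term1 + term2)                    # about 178690.7
```

This reproduces t̂_unb = 914450 and SE ≈ 178690.7 from the printed summaries (up to rounding of the displayed means and standard deviations).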
plot    Obs     N          Mean        Std Dev            Sum
----------------------------------------------------------------------
460      15    15    54.2000000     50.2652962    813.0000000
4A       15    15    21.1333333     20.8972543    317.0000000
C13      15    15    41.0666667     37.0799651    616.0000000
C2       30    30     9.0000000     12.0315677    270.0000000
D1       15    15    29.4666667     27.7408071    442.0000000
H6       15    15    15.8000000     16.7296572    237.0000000
N22      15    15    54.9866667    101.7937050    824.8000000
S42      15    15    18.2000000     15.0247415    273.0000000
----------------------------------------------------------------------
Obs    plot    _TYPE_    _FREQ_       megg     sdegg     tegg     m
  1     460         1        15    54.2000    50.265    813.0    15
.....
The SURVEYMEANS Procedure

Number of Clusters          8
Number of Observations    135

                       Std Error    Lower 95%    Upper 95%
Variable     Mean        of Mean  CL for Mean  CL for Mean       Sum    Std Dev    95% CL for Sum
--------------------------------------------------------------------------------------------------
egg     30.481667      5.733285    16.924601    44.038733    914450     171999    507738  1321162
--------------------------------------------------------------------------------------------------

If we run the analysis with the main plots treated as strata rather than clusters then (see page 19
of the notes) we get an SE for the estimated total of 19377. We know from our work with stratified
data that this SE, call it SE_S, is (Σ_i 600² ((600 − m_i)/(600 m_i)) s²_i)^{1/2}. If we multiply its
square by (50/8) we get the second term in V̂(t̂). So,

    SE(t̂) = (V̂(t̂))^{1/2} = (171999² + (50/8) ∗ 19377²)^{1/2} = 178690.7.
You can also easily program up the direct calculations in either STATA or R. See the posted program
for an example.
/* Analysis of the defoliation data as a two-stage cluster sample
Each subplot consists of samples of .1 ha. Each plot/psu
is 60 ha in size and so consists of M = 600 ssu’s. */
data dd;
infile ’87md.dat’;
input plot $ subplot egg def;
wt=50*600/(8*15); /* weights for running cluster analysis */
if plot=’C2’ then wt=50*600/(8*30);
run;
proc surveymeans data=dd total=50 mean clm sum clsum;
cluster plot;
weight wt; var egg; run;
14
Two-stage cluster sampling; worm fragments in cans of corn.
This data is from Exercise 5.13. Programs for the analyses below are posted. Running the data first
with clustering and then with stratification leads to

    V̂(t̂) = 8531.2921802² + (580/12) ∗ 95.5405672²

and an estimated standard error of SE(t̂) = 8557.1101. This can also be obtained directly from the
two-stage cluster sampling program. Since K = 24 ∗ 580 we get ȳ̂ = t̂/K = 3.611 (also given in both
surveymeans runs), V̂(ȳ̂) = V̂(t̂)/K² = .3779 and SE(ȳ̂) = .6147.

From the analysis of variance, MSB_data = 13.808 and MSW_data = 4.53 with corresponding sums
of squares as given. 4.53 is the unbiased estimate of MSW. The estimate of MSB is

    MSB_E = (13.808 − (1 − 3/24) ∗ 4.53) ∗ (24/3) = 78.77.

This is quite different than MSB_data = 13.808.

Using the expression in part d) of problem 5-11 (see also page 41 of the notes) we can calculate
V̂(ȳ̂) = .3779.
Our estimate of the variance of the mean per can if we took a general m and n is (see (5.34))

    V(ȳ̂) = (1 − n/580) ∗ 78.77/(n ∗ 24) + (1 − m/24) ∗ 4.5277/(n ∗ m).
This can be used to explore the effect of varying n and m over a fixed total number nm of ssu's in
the sample, or over a fixed cost C = n c1 + c2 n m. For the latter, the optimal m is (approximately)
found using the result on page 156 of the text. See the variances at the end for different combinations
of m and n with mn = 36.
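A quick sketch (Python; plugging the MSB and MSW estimates above into the variance expression) reproduces the nm = 36 variances tabled at the end:

```python
# Estimated V(ybar-hat) for general n cases and m cans per case, using the
# MSB and MSW estimates from the worms data (N = 580 cases, M = 24 cans).
N_psu, M_ssu = 580, 24
MSB, MSW = 78.77, 4.5277778

def v_ybar(n, m):
    return ((1 - n / N_psu) * MSB / (n * M_ssu)
            + (1 - m / M_ssu) * MSW / (n * m))

combos = {(n, m): v_ybar(n, m) for n, m in [(12, 3), (9, 4), (6, 6), (3, 12)]}
# these agree with the VYBAR column at the end of the section
```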
The ANOVA Procedure

                           Sum of
Source             DF     Squares    Mean Square    F Value    Pr > F
Model              11   151.8888889   13.8080808       3.05    0.0108
Error              24   108.6666667    4.5277778
Corrected Total    35   260.5555556

surveymeans with clustering on case

                        Std Error
Variable      Mean        of Mean    95% CL for Mean                Sum
worms     3.611111       0.612880    2.26217092  4.96005130       50267

Variable        Std Dev    95% CL for Sum
worms       8531.292180    31489.4192  69043.9142

surveymeans stratified on case

case    N Obs     N         Mean      Std Dev           Sum
  1         3     3    4.3333333    3.0550505    13.0000000
  2         3     3    3.3333333    1.1547005    10.0000000
  3         3     3    1.0000000    1.0000000     3.0000000
  4         3     3    5.0000000    1.7320508    15.0000000
  5         3     3    7.0000000    2.6457513    21.0000000
  6         3     3    3.3333333    3.5118846    10.0000000
  7         3     3    3.6666667    2.3094011    11.0000000
  8         3     3    1.6666667    1.5275252     5.0000000
  9         3     3    5.0000000    2.0000000    15.0000000
 10         3     3    2.3333333    1.5275252     7.0000000
 11         3     3    6.6666667    2.5166115    20.0000000
 12         3     3    0            0                     0

                        Std Error
Variable      Mean        of Mean    95% CL for Mean                 Sum
worms     3.611111       0.331738    2.92643736  4.29578486    1040.000000

Variable      Std Dev    95% CL for Sum
worms       95.540567    842.813961  1237.18604

     THAT      VTHAT1       VTHAT2       VTHAT       SETHAT
50266.667    72782946    441186.67    73224133    8557.1101

Estimated variance of ybar-hat over different n and m with nm = 36.

 n     m        VYBAR
12     3    0.3778964
 9     4    0.463825
 6     6    0.6356822
 3    12    1.1512537
15
Sampling with PPS.
Illustration of sampling with probability proportional to size for a one-stage sample, using the Ag
data for Alabama with the cluster being a county and the ssu's being farms (in 92) in the county.
The objective is to estimate total acres in 92, which for the purposes here we pretend we have only
for the sampled counties, while the number of farms is assumed to be known beforehand for all
counties.
option ls=80 nodate;
/* read in data from agpop.txt. Note the :$25. format to handle
the county name */
data a;
infile ’agpop2.txt’ dlm=’,’ firstobs=2;
input COUNTY :$25. STATE $ ACRES92
ACRES87
ACRES82
FARMS92
FARMS87
FARMS82
LARGEF92
LARGEF87
LARGEF82
SMALLF92
SMALLF87
SMALLF82
REGION $;
if state=’AL’;
/* convert missing values */
array numval[*] _numeric_;
do i = 1 to dim(numval);
if numval[i]=-99 then numval[i]=.;
end;
keep county state farms92 acres92;
run;
proc means mean sum n;
var farms92 acres92;
run;
proc surveyselect data=a out=cout method=pps sampsize=4 jtprobs;
size farms92;
run;
proc print data=cout;
run;
proc surveymeans data=cout mean sum;
var acres92;
weight SamplingWeight;
run;
IML CODE TO DO ANALYSIS. SEE POSTED FILE.
The MEANS Procedure

Variable           Mean           Sum     N
----------------------------------------------
FARMS92     565.7462687      37905.00    67
ACRES92       126131.69    8450823.00    67
----------------------------------------------

The SURVEYSELECT Procedure

Selection Method      PPS, Without Replacement
Size Measure          FARMS92
Input Data Set        A
Random Number Seed    83578
Sample Size           4
Output Data Set       COUT

                                                         Selection    Sampling
Obs    COUNTY            STATE    ACRES92    FARMS92          Prob      Weight
  1    RUSSELL COUNTY    AL        112620        213      0.022477     44.4894
  2    DALE COUNTY       AL        134555        403      0.042527     23.5143
  3    PICKENS COUNTY    AL        106206        404      0.042633     23.4561
  4    BALDWIN COUNTY    AL        167832        941      0.099301     10.0704

Obs    Unit      JtProb_1      JtProb_2      JtProb_3      JtProb_4
  1       1             0    .000733392    .000735212    .001712462
  2       2    .000733392             0    .001399373    .003259432
  3       3    .000735212    .001399373             0    .003267641
  4       4    .001712462    .003259432    .003267641             0

The SURVEYMEANS Procedure

Number of Observations    4
Sum of Weights            101.530173

                        Std Error
Variable      Mean        of Mean         Sum    Std Dev
------------------------------------------------------------------------
ACRES92     121695    8826.356920    12355673    2831163
------------------------------------------------------------------------

IML results:

     T           PI                         JPROB
112620    0.0224772            0    0.0007334    0.0007352    0.0017125
134555    0.0425274    0.0007334            0    0.0013994    0.0032594
106206    0.0426329    0.0007352    0.0013994            0    0.0032676
167832    0.0993009    0.0017125    0.0032594    0.0032676            0

     THT        SE1          SE         VAR1         VAR2          VAR
12355673    6529858    3161750.2    4.2639E13    -3.264E13    9.9967E12
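As a quick cross-check outside SAS (a Python sketch; the selection probabilities as printed are rounded, so the Horvitz-Thompson total only approximately reproduces 12355673):

```python
# Horvitz-Thompson estimate of total acres from the printed pps sample:
# t-hat_HT = sum of y_i / pi_i over the four sampled counties.
sample = [  # (ACRES92, selection probability pi_i, as printed)
    (112620, 0.022477),   # Russell
    (134555, 0.042527),   # Dale
    (106206, 0.042633),   # Pickens
    (167832, 0.099301),   # Baldwin
]
t_ht = sum(y / p for y, p in sample)  # close to the SAS sum of 12355673
```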
Continuing with pps sampling for the ag data. The program below simulates pps sampling by using
surveyselect with reps=200, which generates 200 pps samples. These are analyzed in surveymeans
for each rep, and then an iml program gets the HT estimator, the estimated standard error that
would come from SAS, and the estimated SE from our general approach for using the HT estimator.
Note that the latter approach can lead to a negative estimated variance, in which case the SE is set
to 0. The full program is on the web.
option ls=80 nodate;
%macro simpps(nreps,ss);
35
READ DATA ETC. AS BEFORE
proc surveyselect data=a out=cout method=pps sampsize=4 jtprobs reps=&nreps;
size farms92; run;
*proc print data=cout; run;
proc surveymeans data=cout sum total=67;
var acres92;
weight SamplingWeight;
by replicate; run;
IML program omitted here. It creates a temporary SAS file with the estimate,
variances and se’s for each rep.
proc print data=new;
run;
proc means n mean std var;
var tht se var sasvar sasse;
run;
data se0; set new; if se=0;
proc print; run;
%mend;
%simpps(200,4);
-------------------------- Sample Replicate Number=1 --------------------------

The SURVEYMEANS Procedure

Variable        Sum    Std Dev
----------------------------------------
ACRES92     5698397    1355252
----------------------------------------

-------------------------- Sample Replicate Number=2 --------------------------

Variable        Sum    Std Dev
----------------------------------------
ACRES92     7885011    2697267
----------------------------------------

Obs           THT           SE1            SE         VAR1          VAR2              VAR          SASVAR         SASSE
  1    5698397.48    2903953.32    1117614.00    8.4329E12    -7.1839E12    1249061061641    1.8367089E12    1355252.34
  2    7885010.83    4486693.74    2512434.68     2.013E13    -1.3818E13    6312328042973    7.2752503E12    2697267.19
From proc means on the simulation results.

Variable      N            Mean         Std Dev        Variance
---------------------------------------------------------------
THT         200      8389386.06      2233962.33    4.9905877E12
SE          200      1978946.91      1264255.14    1.5983411E12
VAR         200    5.4666143E12     5.779498E12    3.3402597E25
SASVAR      200    5.0891309E12     5.264758E12    2.7717677E25
SASSE       200      1958429.19      1122490.94    1.2599859E12
---------------------------------------------------------------
There were 18 cases where VAR (the estimated variance of the HT estimator) was negative, leading
to an SE of 0.
A run with 1000 simulations was done leading to:

Variable       N            Mean          Std Dev        Variance
----------------------------------------------------------------
THT         1000      8455585.13       2277713.11     5.187977E12
SE          1000      1979823.45       1191698.30    1.4201448E12
VAR         1000    5.3011582E12     5.3527862E12    2.8652321E25
SASVAR      1000    4.9047734E12     5.0047122E12    2.5047145E25
SASSE       1000      1925196.50       1095258.61    1.1995914E12
----------------------------------------------------------------
72 of the 1000 cases led to an SE of 0.
With 5000 reps:

Variable       N            Mean          Std Dev        Variance
----------------------------------------------------------------
THT         5000      8514163.91       2370980.32    5.6215477E12
SE          5000      1995240.18       1189328.22    1.4145016E12
VAR         5000    5.3617235E12     5.3534185E12    2.8659089E25
SASVAR      5000    4.8888844E12     4.9823656E12    2.4823967E25
SASSE       5000      1923108.90       1091226.25    1.1907747E12
----------------------------------------------------------------
357 of the cases had SE = 0.
In the class notes it is shown that the SAS approach for the variance and SE in this case is reasonable
if n/N is small, the selection probabilities are small and the sampling of psu's is like with-replacement
sampling. With a sample of 4 psu's out of 67 and the selection probabilities here, it looks like this is
working fairly well. It does not suffer from the problem that the unbiased estimator of V(t̂_HT) has
of sometimes being negative.
16
Bootstrapping the duck data.
See Section 6 for the analysis of the ratio of means and the ratio estimates of ȳ_u and t, where we
used the delta method and jackknife to get standard errors. Here we bootstrap, using 10,000 bootstrap
samples. First we generate the samples by sampling 14 times with replacement from the original
n = 14 observations. Then we create a population of 140 by replicating the sample 10 times, and
each bootstrap sample consists of a simple random sample without replacement of 14 out of the 140.
The b at the end of a variable name indicates a value computed from the bootstrap sample: xbarb = x̄,
ybarb = ȳ, bhatb = ȳ/x̄, yubhatb = ratio estimate of ȳ_u, thatb = ratio estimate of t, thatbunb =
unbiased estimator of t.
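The first (with-replacement) scheme is simple to sketch. The x and y arrays below are placeholders (the actual 14 duck-data pairs are not reproduced here), so the numbers the sketch produces are only illustrative:

```python
import random

# Sketch of the with-replacement bootstrap for B-hat = ybar/xbar.
# x, y are HYPOTHETICAL stand-ins for the n = 14 duck-data pairs.
random.seed(1)
x = [18, 14, 21, 17, 16, 19, 22, 15, 20, 17, 18, 16, 19, 21]  # hypothetical
y = [22, 15, 30, 20, 18, 24, 33, 16, 27, 19, 23, 17, 25, 31]  # hypothetical
n, B = len(x), 10000

bhatb = []
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]   # resample with replacement
    xb = sum(x[i] for i in idx) / n
    yb = sum(y[i] for i in idx) / n
    bhatb.append(yb / xb)                           # B-hat for this resample

mb = sum(bhatb) / B
se_boot = (sum((b - mb) ** 2 for b in bhatb) / (B - 1)) ** 0.5  # bootstrap SE
```

The without-replacement scheme differs only in how each resample is drawn: an SRS of 14 from the replicated population of 140.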
bootstrapping the duck data with replacement

Variable        N          Mean      Std Dev        Minimum        Maximum
XBARB       10000    17.3222507    1.6596200     10.1214286     21.8928571
YBARB       10000    21.0941929    7.7781307      2.6428571     60.0000000
BHATB       10000     1.2275736    0.4644919      0.1637893      3.2814852
YUBHATB     10000    22.8500553    8.6460525      3.0487738     61.0815655
THATB       10000       3199.06      1210.47    426.8348827        8551.55

bootstrapping the duck data without replacement

Variable        N          Mean      Std Dev        Minimum        Maximum
XBARB       10000    17.3140521    1.5750321     11.2857143     21.7428571
YBARB       10000    20.9071143    7.3492109      4.6428571     54.5000000
BHATB       10000     1.2171495    0.4413816      0.2357635      3.4400361
YUBHATB     10000    22.6560202    8.2158764      4.3885020     64.0328314
THATB       10000       3171.89      1150.24    614.3997098        8964.73
THATBUNB    10000       2927.00      1028.89    650.0000000        7630.00
17
Two-phase ratio estimation using data from Nurses Health study.
Here we use an initial sample of 173 nurses. This was actually a validation data set to estimate the
relationship between measures of diet intake using an x = FFQ (food frequency questionnaire) and y
= diet record. We will work with saturated fat intake with variables fqsat and drsat. We'll suppose
that our objective is to estimate the population mean ȳ_u, defined in terms of the diet record. We'll
treat the n_1 = 173 as an SRS from 1000 and suppose that fqsat is measured on all 173 and then
drsat is measured on a random subsample of size n_2 = 50. We examine using a two-phase ratio
estimator.

Since we took an SRS of 173 and then a sub-SRS of 50, the final sample is like an SRS of size 50
from N. (Why?). If we analyze just the 50 we get

    ȳ̂_u = 24.6152 and SE = 0.9131477.

Using the summary statistics below we have x̄_1 = 21.92, x̄_2 = 20.486, ȳ_2 = 24.615 and s_y = 6.6247
(the sample standard deviation of y from the 50 stage-2 values).
(sample variance of y from the 50 stage 2 values.)
(1)
With SRS’s and notation on page 54 of notes, π i = n1 /N and θi = n2 /n2 and so their product is
(2)
(2)
(2)
n2 /N . This leads to tby = N ȳ2 , tbx = N x̄2 and tbx = N x̄2 so the two-phase ratio estimator of t y
is
tb(2)
yr = N x̄1
and of the mean is
(2)
ȳ2
= 26332.96
x̄2
ȳˆur = tb(2)
yr /1000 = 26.33.
We can estimate the variance of tbyr as described in exercise 12.2. This leads to
ˆ
Vb (tb(2)
yr ) = 987213.17with SE = 993.58602 and SE( ȳ ur ) = 0.993586.
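The point estimate is a one-liner from the summary statistics below (a Python sketch; N = 1000 as assumed above):

```python
# Two-phase ratio estimate of t_y and of ybar_u from the summary statistics.
N = 1000
xbar1 = 21.9156069   # fqsat mean over all 173 phase-1 nurses
xbar2 = 20.4860000   # fqsat mean over the 50 phase-2 nurses
ybar2 = 24.6152000   # drsat mean over the 50 phase-2 nurses

t_yr = N * xbar1 * ybar2 / xbar2   # about 26332.96
ybar_ur = t_yr / N                 # about 26.33
```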
Notice that you could argue for estimating the variance more directly. We can write

    t̂_yr^(2) = (N/n_1) n_1 x̄_1 (ȳ_2 / x̄_2) = (N/n_1) t̂_2,

where t̂_2 is the estimate of the total of the y's over the first-stage sample obtained by applying ratio
estimation at the second stage.

Given stage 1, we know E(t̂_yr^(2) | stage 1) ≈ N ȳ_1 and we know how to obtain V(t̂_2 | stage 1) (see
page 68 of the book). The unconditional variance is then V(N ȳ_1) + (N/n_1)² E(V(t̂_2 | stage 1)). We
can estimate E(V(t̂_2 | stage 1)) using V̂(t̂_2 | stage 1) from a ratio analysis at the second stage, and
V(N ȳ_1) with N²(1 − n_1/N) s²_y/n_1. Combined, this gives exactly the estimator in exercise 12.2.
statistics for whole sample

Variable      N          Mean      Std Dev    Std Error
fqsat       173    21.9156069    9.2753950    0.7051952
drsat        50    24.6152000    6.6246695    0.9368697

statistics for validation sample

Variable      N          Mean      Std Dev    Std Error
fqsat        50    20.4860000    6.6459443    0.9398785
drsat        50    24.6152000    6.6246695    0.9368697

     THAT       SETHAT       YURHAT    SEYURHAT
26332.961    993.58603    26.332961    0.993586
Notice that we have not really gained anything with the ratio approach here. The standard error
for the ratio estimator of ȳ_u is actually bigger than from just using the SRS of 50. A better approach
would be to consider a regression estimator, especially since it doesn't appear that a regression
through the origin is appropriate. Even if it were, the ratio approach is not always best.
18
Coverage and Future Topics
I tried to strike a balance between applications and doing enough theory for you
to work and read on your own in the future.
• More on model based approaches. Often a useful approach. Based on
regression and related techniques.
• More applications to complex surveys. You have all the tools to handle
almost any scenario.
• More on variance estimation techniques for handling non-linear functions,
including estimating means using ratio estimators and (later) estimating
regression coefficients and other non-linear functions of totals.
• More complicated non-response models (e.g., callbacks)
• Randomized response for sensitive questions. Easy for you to read on
your own.
• Estimating regression models (linear and non-linear) and categorical
data analysis using data from complex surveys. This often ends up
involving non-linear functions of estimated totals. This is one place where
model based approaches also come into play. In general the modeling
issues can be a bit subtle.
• Measurement error in either a quantitative variable or misclassification
of a categorical variable. (“Measurement Error in Surveys”, Biemer et
al.)
Contents

1  A few motivating examples.
2  Example 1: Estimating egg mass and defoliation means with an SRS.
3  Simulations of estimated mean and CI with SRS
4  Example 2: Estimating a proportion with an SRS.
5  Simulation in estimating a proportion with an SRS
6  Duck Example: Estimating population size with variable size units using ratio and regression estimators.
7  Allocation in Stratified Sampling: Example using the agricultural farm data.
8  Stratified Sampling in Surveymeans: Estimating egg mass densities with stratified samples.
9  Single stage cluster example. Ice Cream data.
10 Ice Cream data with stratification and cluster sampling.
11 Ozone levels and systematic sampling
12 Clarification on page 31 and model/design based approach with an SRS
13 Analysis of the egg-defoliation data as a two-stage cluster sample
14 Two-stage cluster sampling; worm fragments in cans of corn.
15 Sampling with PPS.
16 Bootstrapping the duck data.
17 Two-phase ratio estimation using data from Nurses Health study.
18 Coverage and Future Topics