Using Bootstrap Replicate Weights with
SAS/STAT Survey Procedures
Overview
Providing bootstrap replicate weights with survey data for the purpose of variance estimation is a common
practice within the survey sampling community. However, the SAS/STAT survey procedures, at present,
do not provide options specifically designed to accommodate bootstrap replicate weights. Fortunately,
some survey bootstrap variance estimators share a common mathematical form with the balanced repeated
replication (BRR) variance estimator, and all of the SAS/STAT survey procedures do support BRR variance
estimation. This example shows how, and under what conditions, this commonality can be exploited so that
bootstrap replicate weights can be used with existing SAS/STAT survey procedures to perform variance
estimation.
Analysis
Suppose that $\theta$ is a population parameter of interest. Let $\hat{\theta}$ be the estimate of $\theta$ from the full sample. Let $\hat{\theta}_r$
be the estimate from the $r$th bootstrap replicate subsample, and let $R$ represent the total number of replicates.
Lehtonen and Pahkinen (2004) show that when there are an equal number, $\alpha$, of primary sampling units
(PSUs) per stratum, a consistent bootstrap estimator for the variance of $\hat{\theta}$ has the following form:
\[
\widehat{V}(\hat{\theta}) = \frac{\alpha}{\alpha - 1} \, \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^2
\]
Now compare the bootstrap variance estimator to the variance estimator that the SAS/STAT survey
procedures use when you specify Fay's BRR method:
\[
\widehat{V}(\hat{\theta}) = \frac{1}{R (1 - \epsilon)^2} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^2
\]
where $\epsilon$ is the Fay coefficient. The two variance estimators are equivalent if you set
\[
\frac{\alpha}{\alpha - 1} = \frac{1}{(1 - \epsilon)^2}
\]
or
\[
\epsilon = 1 - \sqrt{\frac{\alpha - 1}{\alpha}}
\]
The equivalence of these two estimators means that you can use bootstrap replicate weights with any of the
SAS/STAT survey procedures by specifying VARMETHOD=BRR and setting the appropriate value for the
FAY= method-option.
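The step from the equivalence condition to the Fay coefficient can be written out as a short derivation. This is a sketch in the notation used above ($\alpha$ for the number of PSUs per stratum, $\epsilon$ for the Fay coefficient); it assumes $\alpha \ge 2$ and that the Fay coefficient must satisfy $0 \le \epsilon < 1$, which is why the positive square root is taken. The value $\alpha = 3$ in the last line is an illustrative assumption, not a design used in this example.

```latex
\begin{align*}
\frac{\alpha}{\alpha - 1} &= \frac{1}{(1 - \epsilon)^2} \\
(1 - \epsilon)^2 &= \frac{\alpha - 1}{\alpha} \\
1 - \epsilon &= \sqrt{\frac{\alpha - 1}{\alpha}}
  \qquad \text{(positive root, because } 0 \le \epsilon < 1\text{)} \\
\epsilon &= 1 - \sqrt{\frac{\alpha - 1}{\alpha}} \\[6pt]
\text{For example, } \alpha = 3: \quad
\epsilon &= 1 - \sqrt{\tfrac{2}{3}} \approx 0.1835
\end{align*}
```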
Example
To demonstrate, this example uses a subset of the Mini-Finland Health Survey data set, which is available at
the Virtual Laboratory Survey Sampling (VLSS) Web site. The data set has a stratified, two-stage sampling
design with two clusters per stratum and equal probabilities of selection. The data set includes the variables
STR, CLU, CHRON, SYSBP, X, and a constructed variable WGT. STR and CLU are strata and cluster
identifiers, respectively. They are consequential to the example only because they enable you to verify that the
sampling design satisfies the requirement that there be a constant number of PSUs per stratum. In this case,
there are two clusters per stratum, so $\alpha = 2$. The variable CHRON contains sums of the observations within
each cluster of a binary response variable that indicates chronic morbidity. The variable SYSBP contains
sums of the observations within each cluster of a continuous response variable that measures systolic blood
pressure. The variable X contains the number of respondents within each cluster. The constructed variable
WGT represents the sampling weight and has a constant value of 1, representing the equal probability of
selection.
data MFH;
   input STR CLU CHRON SYSBP X;
   wgt=1;
   label
      STR="Stratum ID"
      CLU="Cluster ID"
      CHRON="Cluster sample sum of Chronic morbidity (CHRON)"
      SYSBP="Cluster sample sum for Systolic blood pressure (SYSBP)"
      X="Number of sample elements in the clusters";
   datalines;
 1 1 70 29056 204
 1 2 74 29417 210
 2 1 12  3692  26
 2 2 14  4564  30
 3 1 15  7741  59
 3 2 16  8585  63
 4 1  9  6277  45
 4 2 14  5668  43
 5 1 10  2322  17
 5 2 16  3960  30
 6 1 10  3080  21
 6 2  6  3252  22
 7 1 10  3966  27
 7 2  4  3261  24
 8 1 12  4156  28
 8 2  6  2852  20
 9 1 15  6617  46
 9 2 23  6616  48
10 1 37 10552  73
10 2 25 11032  77
11 1 11  8759  60
11 2 25  9876  72
12 1 33  9901  69
12 2 24  6828  47
13 1 31  8624  61
13 2 27  9390  66
14 1 22  6960  48
14 2 20  7130  49
15 1 18  6646  49
15 2 22  7094  49
16 1 24  9841  69
16 2 37 11786  83
17 1 19  6910  48
17 2 23  6446  45
18 1 25 10742  73
18 2 29  9026  61
19 1 36  9350  65
19 2 34  8912  62
20 1  9  3810  26
20 2 22  7098  51
21 1 18  6998  53
21 2 34  9970  69
22 1 29 11146  79
22 2 41 13215  94
23 1 22  6596  48
23 2 18  6002  41
24 1 15  3808  27
24 2  7  3148  22
;
run;
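The requirement of a constant number of PSUs per stratum can be checked directly from the STR and CLU variables. The following is a minimal sketch, assuming the MFH data set created above; the table name psu_counts and the column names n_psu, min_psu, and max_psu are hypothetical names chosen for illustration.

```sas
proc sql;
   /* Count the distinct clusters (PSUs) within each stratum */
   create table psu_counts as
   select STR, count(distinct CLU) as n_psu
   from MFH
   group by STR;

   /* If min_psu equals max_psu, every stratum has the same number of PSUs */
   select min(n_psu) as min_psu, max(n_psu) as max_psu
   from psu_counts;
quit;
```

When the two reported values are equal, their common value is the $\alpha$ used to compute the Fay coefficient.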
A data set that contains 1000 bootstrap replicate weights was produced using the SAS macro %BOOT, which
is available for download from the VLSS web site. This data set is then merged with the MFH data set.
data MFH;
merge mfh survey.weights;
run;
The population parameters that are to be estimated are the proportion of the population that exhibits chronic
morbidity and the mean systolic blood pressure. You can use the SURVEYMEANS procedure to estimate
both proportions and means. Before submitting a call to PROC SURVEYMEANS, use the knowledge that
$\alpha = 2$ to calculate the value for the Fay coefficient $\epsilon$ as follows:
\[
\epsilon = 1 - \sqrt{\frac{2 - 1}{2}} = 0.29289322
\]
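Rather than typing the coefficient by hand, you could compute it in a DATA step. The following is a sketch only: the variable names alpha and fay and the macro variable name fay are hypothetical and are not part of the original example.

```sas
data _null_;
   alpha = 2;                             /* PSUs per stratum in the MFH design */
   fay = 1 - sqrt((alpha - 1) / alpha);   /* Fay coefficient from the formula above */
   put fay= 10.8;                         /* write the value to the SAS log */
   call symputx('fay', fay);              /* store it in the macro variable &fay */
run;
```

Under this assumption, the macro variable could be passed to the procedure as VARMETHOD=BRR(FAY=&fay), which avoids rounding the coefficient manually.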
In the PROC SURVEYMEANS statement, specify VARMETHOD=BRR and specify the FAY= method-option.
Because both the proportion and the mean to be estimated are ratios, the two numerators, CHRON and
SYSBP, and the common denominator X must be specified in a VAR statement and the form of the ratios must
be specified in a RATIO statement. Use the REPWEIGHTS statement to specify the variables that contain the
bootstrap weights, and use the WEIGHT statement to specify the variable that contains the sampling weights.
proc surveymeans data=mfh varmethod=brr(fay=.29289322);
var chron sysbp x;
ratio chron sysbp/ x;
repweights w1-w1000;
weight wgt;
run;
Figure 1 displays the results of the analysis. The estimates of the population parameters of interest and their
bootstrapped standard errors are listed under the heading “Ratio Analysis.” The estimate of the proportion of
the population that exhibits chronic morbidity is 0.397, and its standard error is 0.0108. The estimate of the
population mean systolic blood pressure is 141.785, and its standard error is 0.497.
Figure 1 SURVEYMEANS Procedure Results Using Bootstrap Replicate Weights

                  The SURVEYMEANS Procedure

                        Data Summary
            Number of Observations            48
            Sum of Weights                    48

                     Variance Estimation
            Method                           BRR
            Replicate Weights                MFH
            Number of Replicates            1000
            Fay Coefficient           0.29289322

                          Statistics
Variable   Label                                                      N
------------------------------------------------------------------------
CHRON      Cluster sample sum of Chronic morbidity (CHRON)           48
SYSBP      Cluster sample sum for Systolic blood pressure (SYSBP)    48
X          Number of sample elements in the clusters                 48
------------------------------------------------------------------------

                          Statistics
                             Std Error
Variable           Mean        of Mean        95% CL for Mean
----------------------------------------------------------------
CHRON         22.354167       0.819272      20.74648     23.96186
SYSBP       7972.458333     142.064951    7693.67873   8251.23794
X             56.229167       1.002902      54.26113     58.19720
----------------------------------------------------------------

                        Ratio Analysis
Numerator   Denominator     N          Ratio       Std Err
------------------------------------------------------------
CHRON       X              48       0.397555      0.010810
SYSBP       X              48     141.785106      0.497474
------------------------------------------------------------

                        Ratio Analysis
Numerator   Denominator        95% CL for Ratio
--------------------------------------------------
CHRON       X                0.376342      0.418767
SYSBP       X              140.808894    142.761317
--------------------------------------------------
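The figures in the output can be cross-checked by hand. The sketch below uses only numbers reported in Figure 1; the symbols $\hat{R}$ and $t^{*}$ are notation introduced here for illustration. Each ratio estimate is the mean of the numerator divided by the mean of the denominator, and the half-width of each confidence interval divided by the standard error recovers the critical value that was applied. The output shown does not state the degrees of freedom, so the interpretation of $t^{*}$ is an inference rather than a documented value.

```latex
\begin{align*}
\hat{R}_{\mathrm{CHRON}} &= \frac{22.354167}{56.229167} \approx 0.397555,
&
\hat{R}_{\mathrm{SYSBP}} &= \frac{7972.458333}{56.229167} \approx 141.785 \\[6pt]
t^{*}_{\mathrm{CHRON}} &= \frac{0.418767 - 0.376342}{2 \times 0.010810} \approx 1.962,
&
t^{*}_{\mathrm{SYSBP}} &= \frac{142.761317 - 140.808894}{2 \times 0.497474} \approx 1.962
\end{align*}
```

A multiplier of about 1.962 is slightly larger than the normal value of 1.960, which is consistent with a $t$ critical value based on a large number of degrees of freedom.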
A variant of the bootstrap is the mean bootstrap. To compute mean bootstrap weights, you compute regular
bootstrap weights and average the weights in groups of size $C$. Thus, if you generate $K$ bootstrap weights and
average them in groups of size $C$, you are left with $R = K/C$ mean bootstrap replicates. The mean bootstrap
variance estimator has the following form:
\[
\widehat{V}(\hat{\theta}) = \frac{C \alpha}{\alpha - 1} \, \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^2
\]
The procedure for using mean bootstrap weights with the SAS/STAT survey procedures is the same as
demonstrated in the preceding example except that the Fay coefficient $\epsilon$ is now computed as follows:
\[
\epsilon = 1 - \sqrt{\frac{\alpha - 1}{C \alpha}}
\]
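As an illustration of the mean bootstrap case, the following sketch computes the Fay coefficient and passes it to PROC SURVEYMEANS. Everything specific to the mean bootstrap here is an assumption made for illustration: the group size C = 5, the data set name mfh_mean, the replicate weight variables mw1-mw200 (that is, R = 1000/5), and the macro variable name fay_mean. None of these appear in the original example.

```sas
data _null_;
   alpha = 2;    /* PSUs per stratum, as in the MFH design */
   C = 5;        /* assumed number of bootstrap weights averaged per group */
   fay = 1 - sqrt((alpha - 1) / (C * alpha));   /* mean bootstrap Fay coefficient */
   call symputx('fay_mean', fay);               /* hypothetical macro variable */
run;

/* mfh_mean and mw1-mw200 are hypothetical: MFH merged with mean bootstrap weights */
proc surveymeans data=mfh_mean varmethod=brr(fay=&fay_mean);
   var chron sysbp x;
   ratio chron sysbp / x;
   repweights mw1-mw200;
   weight wgt;
run;
```

With these assumed values the coefficient is $1 - \sqrt{1/10}$, which is approximately 0.6838.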
Consistency of the variance estimator for both the naive bootstrap and the mean bootstrap requires that the
number of PSUs per stratum be constant so that the squared deviations can be properly scaled. However,
Rao and Wu (1988) show that when the number of PSUs per stratum is variable, alternative methods do
exist for computing bootstrap weights. The squared deviations of the variance estimator must still be
properly scaled, but the scaling factor is no longer the single constant $\alpha / (\alpha - 1)$. Therefore, when the sampling
design includes a variable number of PSUs per stratum, the scaling factor must be applied directly to the
bootstrap weights. If the scaling factors are included in the bootstrap replicate weights, you can still use
the SAS/STAT survey procedures to estimate population parameters and their variances by using the BRR
method (VARMETHOD=BRR), but you no longer need to specify the FAY= method-option.
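For replicate weights that already include the scaling factors, the call differs from the earlier one only in that the FAY= method-option is omitted. The following is a syntax sketch, not a worked analysis: the data set name mfh_rw and the weight variables rw1-rw1000 are hypothetical, and the MFH design itself has a constant number of PSUs per stratum, so it is used here only to keep the variable names familiar.

```sas
/* mfh_rw and rw1-rw1000 are hypothetical: rescaled bootstrap weights merged with the data */
proc surveymeans data=mfh_rw varmethod=brr;
   var chron sysbp x;
   ratio chron sysbp / x;
   repweights rw1-rw1000;   /* scaling factors already applied to these weights */
   weight wgt;
run;
```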
References
Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd
Edition, Chichester, UK: John Wiley & Sons.
Rao, J. N. K. and Wu, C. F. J. (1988), “Resampling Inference with Complex Survey Data,” Journal of the
American Statistical Association, 83, 231–241.