Using Bootstrap Replicate Weights with SAS/STAT Survey Procedures

Overview

Providing bootstrap replicate weights with survey data for the purpose of variance estimation is a common practice within the survey sampling community. However, the SAS/STAT survey procedures, at present, do not provide options specifically designed to accommodate bootstrap replicate weights. Fortunately, some survey bootstrap variance estimators share a common mathematical form with the balanced repeated replication (BRR) variance estimator, and all of the SAS/STAT survey procedures support BRR variance estimation. This example shows how, and under what conditions, this commonality can be exploited so that bootstrap replicate weights can be used with existing SAS/STAT survey procedures to perform variance estimation.

The SAS source code for this example is available as an attachment in a text file.

Analysis

Suppose that $\theta$ is a population parameter of interest. Let $\hat{\theta}$ be the estimate of $\theta$ from the full sample. Let $\hat{\theta}_r$ be the estimate from the $r$th bootstrap replicate subsample, and let $R$ represent the total number of replicates. Lehtonen and Pahkinen (2004) show that when there are an equal number, $\alpha$, of primary sampling units (PSUs) per stratum, a consistent bootstrap estimator for the variance of $\hat{\theta}$ has the following form:

$$\widehat{V}(\hat{\theta}) = \frac{\alpha}{\alpha - 1} \, \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^2$$

Now compare the bootstrap variance estimator to the variance estimator that is used by the SAS/STAT survey procedures when you use Fay's BRR method:

$$\widehat{V}(\hat{\theta}) = \frac{1}{R (1 - \epsilon)^2} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^2$$

where $\epsilon$ is the Fay coefficient.
The two variance estimators are equivalent if you set

$$\frac{\alpha}{\alpha - 1} = \frac{1}{(1 - \epsilon)^2}$$

or, solving for the Fay coefficient $\epsilon$,

$$\epsilon = 1 - \sqrt{\frac{\alpha - 1}{\alpha}}$$

The equivalence of these two estimators means that you can use bootstrap replicate weights with any of the SAS/STAT survey procedures by specifying VARMETHOD=BRR and setting the appropriate value for the FAY= method-option.

Example

To demonstrate, this example uses a subset of the Mini-Finland Health Survey data set, which is available at the Virtual Laboratory Survey Sampling (VLSS) website. The data set has a stratified, two-stage sampling design with two clusters per stratum and equal probabilities of selection.

The data set includes the variables STR, CLU, CHRON, SYSBP, X, and a constructed variable WGT. STR and CLU are stratum and cluster identifiers, respectively. They matter to the example only because they enable you to verify that the sampling design satisfies the requirement that there be a constant number of PSUs per stratum. In this case, there are two clusters per stratum, so $\alpha = 2$. The variable CHRON contains the within-cluster sums of a binary response variable that indicates chronic morbidity. The variable SYSBP contains the within-cluster sums of a continuous response variable that measures systolic blood pressure. The variable X contains the number of respondents within each cluster. The constructed variable WGT represents the sampling weight and has a constant value of 1, which reflects the equal probability of selection.
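The constant-PSU requirement can be checked directly once the MFH data set defined in the DATA step that follows has been created. The following is a minimal sketch and is not part of the original example; the column alias n_psu is illustrative. It lists any stratum whose number of distinct clusters differs from two, so an empty result is consistent with two PSUs per stratum.

```sas
/* Sketch: list strata that do NOT have exactly two distinct clusters. */
/* Assumes the MFH data set with variables STR and CLU already exists. */
proc sql;
   select STR,
          count(distinct CLU) as n_psu   /* n_psu is an illustrative alias */
      from MFH
      group by STR
      having count(distinct CLU) ne 2;
quit;
```

If the query returns no rows, every stratum contains exactly two clusters and a single Fay coefficient applies to the whole design.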
   /* Cluster-level sums for 24 strata with 2 clusters each */
   data MFH;
      input STR CLU CHRON SYSBP X;
      wgt=1;   /* constant sampling weight: equal selection probabilities */
      label STR="Stratum ID"
            CLU="Cluster ID"
            CHRON="Cluster sample sum of Chronic morbidity (CHRON)"
            SYSBP="Cluster sample sum for Systolic blood pressure (SYSBP)"
            X="Number of sample elements in the clusters";
      datalines;
    1 1 70 29056 204
    1 2 74 29417 210
    2 1 12  3692  26
    2 2 14  4564  30
    3 1 15  7741  59
    3 2 16  8585  63
    4 1  9  6277  45
    4 2 14  5668  43
    5 1 10  2322  17
    5 2 16  3960  30
    6 1 10  3080  21
    6 2  6  3252  22
    7 1 10  3966  27
    7 2  4  3261  24
    8 1 12  4156  28
    8 2  6  2852  20
    9 1 15  6617  46
    9 2 23  6616  48
   10 1 37 10552  73
   10 2 25 11032  77
   11 1 11  8759  60
   11 2 25  9876  72
   12 1 33  9901  69
   12 2 24  6828  47
   13 1 31  8624  61
   13 2 27  9390  66
   14 1 22  6960  48
   14 2 20  7130  49
   15 1 18  6646  49
   15 2 22  7094  49
   16 1 24  9841  69
   16 2 37 11786  83
   17 1 19  6910  48
   17 2 23  6446  45
   18 1 25 10742  73
   18 2 29  9026  61
   19 1 36  9350  65
   19 2 34  8912  62
   20 1  9  3810  26
   20 2 22  7098  51
   21 1 18  6998  53
   21 2 34  9970  69
   22 1 29 11146  79
   22 2 41 13215  94
   23 1 22  6596  48
   23 2 18  6002  41
   24 1 15  3808  27
   24 2  7  3148  22
   ;
   run;

A data set that contains 1000 bootstrap replicate weights was produced by using the SAS macro %BOOT, which is available for download from the VLSS website. This data set is then merged with the MFH data set.

   /* One-to-one merge (no BY statement): observations are matched by position. */
   /* Assumes the SURVEY libref is already assigned.                            */
   data MFH;
      merge mfh survey.weights;
   run;

The population parameters to be estimated are the proportion of the population that exhibits chronic morbidity and the mean systolic blood pressure. You can use the SURVEYMEANS procedure to estimate both proportions and means. Before submitting a call to PROC SURVEYMEANS, use the knowledge that $\alpha = 2$ to calculate the value of the Fay coefficient $\epsilon$ as follows:

$$\epsilon = 1 - \sqrt{\frac{2 - 1}{2}} = 0.29289322$$

In the PROC SURVEYMEANS statement, specify VARMETHOD=BRR and the FAY= method-option. Because both the proportion and the mean to be estimated are ratios, the two numerators, CHRON and SYSBP, and the common denominator X must be specified in a VAR statement, and the form of the ratios must be specified in a RATIO statement.
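The Fay coefficient can also be computed in SAS rather than by hand. The following is a minimal sketch under stated assumptions: the variable names alpha, C, and eps and the macro variable FAYCOEF are illustrative and do not appear in the original example. The variable C is the group size for the mean bootstrap that is described later in this example; C=1 corresponds to the ordinary bootstrap used here.

```sas
/* Sketch: compute the Fay coefficient for a design with a constant    */
/* number of PSUs per stratum. All names here are illustrative.        */
data _null_;
   alpha = 2;   /* PSUs per stratum in the MFH design                  */
   C     = 1;   /* mean-bootstrap group size; 1 = ordinary bootstrap   */
   eps   = 1 - sqrt((alpha - 1) / (C * alpha));
   /* Store the value, rounded to 8 decimals, in a macro variable */
   call symputx('faycoef', put(eps, 10.8));
   put eps= 10.8;
run;

%put NOTE: Fay coefficient = &faycoef;
```

With alpha=2 and C=1 the result agrees with the value 0.29289322 given above. The macro variable could then be supplied as FAY=&faycoef inside the VARMETHOD=BRR option instead of typing the constant.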
Use the REPWEIGHTS statement to specify the variables that contain the bootstrap weights, and use the WEIGHT statement to specify the variable that contains the sampling weights.

   proc surveymeans data=mfh varmethod=brr(fay=.29289322);
      var chron sysbp x;
      ratio chron sysbp / x;
      repweights w1-w1000;
      weight wgt;
   run;

Figure 1 displays the results of the analysis. The estimates of the population parameters of interest and their bootstrapped standard errors are listed under the heading "Ratio Analysis." The estimate of the proportion of the population that exhibits chronic morbidity is 0.398, and its standard error is 0.0108. The estimate of the population mean systolic blood pressure is 141.785, and its standard error is 0.497.

Figure 1 SURVEYMEANS Procedure Results Using Bootstrap Replicate Weights

                    The SURVEYMEANS Procedure

                          Data Summary

               Number of Observations          48
               Sum of Weights                  48

                       Variance Estimation

               Method                         BRR
               Replicate Weights              MFH
               Number of Replicates          1000
               Fay Coefficient         0.29289322

                           Statistics

   Variable  Label                                                      N
   ------------------------------------------------------------------------
   CHRON     Cluster sample sum of Chronic morbidity (CHRON)           48
   SYSBP     Cluster sample sum for Systolic blood pressure (SYSBP)    48
   X         Number of sample elements in the clusters                 48
   ------------------------------------------------------------------------

                           Statistics

                              Std Error
   Variable          Mean       of Mean         95% CL for Mean
   ----------------------------------------------------------------
   CHRON        22.354167      0.819272     20.74648     23.96186
   SYSBP      7972.458333    142.064951   7693.67873   8251.23794
   X            56.229167      1.002902     54.26113     58.19720
   ----------------------------------------------------------------

                         Ratio Analysis

   Numerator  Denominator     N         Ratio     Std Err
   ------------------------------------------------------
   CHRON      X              48      0.397555    0.010810
   SYSBP      X              48    141.785106    0.497474
   ------------------------------------------------------

                         Ratio Analysis

   Numerator  Denominator      95% CL for Ratio
   ------------------------------------------------
   CHRON      X                0.376342    0.418767
   SYSBP      X              140.808894  142.761317
   ------------------------------------------------

A variant of the bootstrap is the mean bootstrap. To compute mean bootstrap weights, you compute regular bootstrap weights and average the weights in groups of size $C$. Thus, if you generate $K$ bootstrap weights and average them in groups of size $C$, you are left with $R = K / C$ mean bootstrap replicates. The mean bootstrap variance estimator has the following form:

$$\widehat{V}(\hat{\theta}) = \frac{C \alpha}{\alpha - 1} \, \frac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_r - \hat{\theta} \right)^2$$

The procedure for using mean bootstrap weights with the SAS/STAT survey procedures is the same as demonstrated in the preceding example, except that the Fay coefficient $\epsilon$ is now computed as follows:

$$\epsilon = 1 - \sqrt{\frac{\alpha - 1}{C \alpha}}$$

Setting $C = 1$ recovers the coefficient for the ordinary bootstrap.

Consistency of the variance estimator for both the naive bootstrap and the mean bootstrap requires that the number of PSUs per stratum be constant so that the squared deviations can be properly scaled. However, Rao and Wu (1988) show that when the number of PSUs per stratum varies, alternative methods exist for computing bootstrap weights. The squared deviations in the variance estimator must still be properly scaled, but the scaling factor is no longer the single constant $\alpha / (\alpha - 1)$. Therefore, when the sampling design includes a variable number of PSUs per stratum, the scaling factor must be applied directly to the bootstrap weights. If the scaling factors are included in the bootstrap replicate weights, you can still use the SAS/STAT survey procedures to estimate population parameters and their variances by using the BRR method (VARMETHOD=BRR), but you no longer need to specify the FAY= method-option.

References

Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition, Chichester, UK: John Wiley & Sons.

Rao, J. N. K. and Wu, C. F. J. (1988), "Resampling Inference with Complex Survey Data," Journal of the American Statistical Association, 83, 231–241.