ST640: Fall 2010 Examples

1 A few motivating examples.

• National Household Pesticide Usage Study. Population of interest: all households in the U.S. Goal: estimate the proportion of households storing pesticides and, of those, what proportion are stored in something other than the original container. (Center for Rural Environmental Health, Colorado State University.) This study is representative of many national studies where the goal is to sample all households, or all individuals, in the country. This includes all of the major national government surveys for labor statistics and health (NHANES, CSFII). How should we sample here?

• Sampling hunters in Massachusetts. Population: all licensed hunters in MA in 198?. Goal: estimate the total number of individuals "harvested" for each of a number of species (pheasants, deer, etc.). Paper copies of licenses are stored in bundles by town. (Rich Kennedy, Dept. of Forestry and Wildlife.)

• Forest regeneration studies in the Quabbin reservoir. Population: MDC land area around the Quabbin reservoir, which is broken up into five "Blocks." More recently, some studies have been interested in looking at regeneration in "openings" that occur from logging. Goal: estimate the average number of young trees per acre (e.g., less than 4.5" in diameter). These values are of interest for various reasons, including assessing the impact of the deer hunt on regeneration and the nature of regeneration after logging. (Tom Kyker-Snowman, MDC Forester.)

• The accuracy of satellite imagery in classifying land use. Population: a portion of Bourne, MA for which different regions were classified by habitat using aerial photos and infrared levels. Goal: estimate what proportion of area in each class is accurately classified. Done by getting ground truth on a sample and attaching an accuracy level to the map. (Janice Stone.)

• Movement of graduating UMass students. (Ellyse Fosse, honors project.)
Wanted to sample all seniors at UMass to estimate what proportion planned to remain and work in Massachusetts, and to compare these proportions across majors and other groupings. This was motivated by comments by then Gov. Mitt Romney who, in questioning some higher education expenditures, said that a lot of UMass students from MA leave the state to work after getting educated.

• Sprinkler safety in dorms. A fire occurred in a hotel where the sprinklers were of the same type as those used in a number of UMass dorms. In the hotel fire, the sprinklers failed. There was concern about the safety of these sprinklers in general, and in the UMass dorms in particular. The recommendation made to the University was to sample 1% of the sprinklers in a dorm; so in one dorm with 698 sprinklers, 7 were sampled. The small recommended number was in part due to the fact that the sampling is destructive and costly. This is a common occurrence in many quality control situations. What can be said about the proportion of defective sprinklers in the dorm with this sample? Is that sample size sufficient?

2 Example 1: Estimating egg mass and defoliation means with an SRS.

This data is from a study looking at gypsy moth egg mass densities and defoliation over a number of different "plots". Each plot, viewed as a population, is 60 ha (hectares) in size. On each plot a number of subplots are sampled; each subplot is .1 ha in size. So, with a plot being the population, there are N = 600 "units". For illustration here the sampled subplots are treated as an SRS without replacement. This particular data is for different plots in the State of Maryland in 1987. See Buonaccorsi (1994) in "Case Studies in Biometry" (N. Lange, ..., J. Greenhouse, eds.) for more background. The analysis below shows estimated means, standard errors and approximate confidence intervals for each of defoliation and egg mass densities.
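For an SRS of n units from a population of N units, the standard error of the sample mean is sqrt((1 - n/N) s²/n), where (1 - n/N) is the finite population correction. A minimal Python sketch of this calculation (a stand-in for what surveymeans does internally), using the plot 460 egg-mass summary values n = 15, N = 600, s² = 2526.6 from the output in these notes:

```python
import math

def srs_mean_se(s2, n, N=None):
    """Standard error of the sample mean under SRS; if the population
    size N is given, apply the finite population correction (1 - n/N)."""
    fpc = 1 - n / N if N is not None else 1.0
    return math.sqrt(fpc * s2 / n)

# Plot 460 egg masses: n = 15 subplots out of N = 600, s^2 = 2526.6.
print(round(srs_mean_se(2526.6, 15, 600), 4))  # with FPC: 12.8152
print(round(srs_mean_se(2526.6, 15), 4))       # ignoring FPC: 12.9784
```

The two values match the surveymeans and proc means standard errors for plot 460 below; with n/N = 0.025 the correction barely matters.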
This is done first using proc surveymeans in SAS, which accounts for the finite population, and then using a standard univariate analysis for each plot, which ignores the finite population correction factor. Since the sample sizes are small relative to the population size of 600, the correction factor doesn't matter much here.

data dd;
infile '87md.dat';
input plot $ subplot egg def;
proc sort; by plot;
proc surveymeans total=600;
var egg def; /* With no other options this assumes an SRS without replacement */
by plot;
/* Analyze assuming a "random sample" that does not account for the finite population */
proc means mean stderr clm;
class plot;
var egg def;
run;

*** Partial output for plot 460 and plot 4A. ***

------------------------------ plot=460 ------------------------------
The SURVEYMEANS Procedure
Number of Observations 15

                             Std Error   Lower 95%    Upper 95%
Variable  N   Mean           of Mean     CL for Mean  CL for Mean
egg       15  54.200000      12.815186   26.714159    81.685841
def       15  18.666667      1.894855    14.602606    22.730727

------------------------------ plot=4A -------------------------------
Number of Observations 15

                             Std Error   Lower 95%    Upper 95%
Variable  N   Mean           of Mean     CL for Mean  CL for Mean
egg       15  21.133333      5.327775    9.706392     32.560275
def       15  13.333333      2.293884    8.413441     18.253226

The MEANS Procedure

       N                                          Lower 95%    Upper 95%
plot   Obs  Variable  Mean        Std Error       CL for Mean  CL for Mean
460    15   egg       54.2000000  12.9784437      26.3640068   82.0359932
            def       18.6666667  1.9189944       14.5508329   22.7825004
4A     15   egg       21.1333333  5.3956479       9.5608196    32.7058470
            def       13.3333333  2.3231068       8.3507647    18.3159020

3 Simulations of estimated mean
and CI with SRS

This illustrates the performance of the sample mean and of confidence intervals through simulation, using the program simmean.sas. Here we use the Parvin Lake data as the population, with the variable "number of fish caught". It examines the behavior over different sample sizes in two settings. The first uses all 137 days of the season as the population. The second eliminates opening day, which is an outlier. Sample sizes of 10, 20 and 50 are shown here. We know exactly what the expected value and variance (or standard error) of the sample mean are, so we don't need to simulate for that purpose. The value of the simulation is to look at the performance of the confidence intervals and the distribution of the sample mean.

Using all N = 137 days, ȳU = 55.55 (population mean) and S² = 4597.54 (population "variance").

The means of COVER, COVERC and COVERCT are estimated coverage rates of the approximate normal CI, the Chebychev CI using the sample variance, and the Chebychev CI using the true variance (just to see how much the estimation of the variance matters).
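The Parvin Lake data are not reproduced here, so the sketch below uses a made-up skewed population of N = 137 values (an assumption standing in for the daily catch counts, including one large "opening day") just to show the structure of such a simulation, in the spirit of simmean.sas: repeatedly draw an SRS without replacement, form the normal and Chebychev intervals around the sample mean, and record coverage.

```python
import math
import random

random.seed(1)

# Hypothetical skewed population standing in for 137 daily catches;
# the last value plays the role of the opening-day outlier.
pop = [10 + (i % 12) ** 2 for i in range(136)] + [800]
N = len(pop)
ybar_U = sum(pop) / N  # true population mean

def simulate(n, reps=1000, z=1.96, alpha=0.05):
    """Coverage of the normal CI and the Chebychev CI (estimated SE)."""
    k = math.sqrt(1 / alpha)  # Chebychev multiplier: P(|Y-mu| > k*SE) <= alpha
    cover_norm = cover_cheb = 0
    for _ in range(reps):
        s = random.sample(pop, n)                      # SRS w/o replacement
        ybar = sum(s) / n
        s2 = sum((y - ybar) ** 2 for y in s) / (n - 1)
        se = math.sqrt((1 - n / N) * s2 / n)           # SE with FPC
        cover_norm += abs(ybar - ybar_U) <= z * se
        cover_cheb += abs(ybar - ybar_U) <= k * se
    return cover_norm / reps, cover_cheb / reps

norm10, cheb10 = simulate(10)
norm50, cheb50 = simulate(50)
```

Since sqrt(1/0.05) ≈ 4.47 exceeds 1.96, each Chebychev interval contains the corresponding normal interval, so its coverage is at least as high, exactly the pattern seen in the COVER vs. COVERC columns below.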
sample size 10, true SE of the mean = 20.644505

Variable  N     Mean         Std Dev      Minimum      Maximum
YBAR      1000  54.6009000   19.4458693   20.2000000   140.1000000
SAMPLES2  1000  4077.89      10329.66     125.9555556  47356.27
COVER     1000  0.8500000    0.3572501    0            1.0000000
COVERC    1000  0.9850000    0.1216133    0            1.0000000
COVERCT   1000  1.0000000    0            1.0000000    1.0000000
LENGTH    1000  56.6748086   50.9823842   13.3945562   259.7217930
LENGTHC   1000  129.3174014  116.3287464  30.5629474   592.6186286
LENGTHCT  1000  184.6500660  0            184.6500660  184.6500660

sample size 20, true SE of the mean = 14.011368

Variable  N     Mean         Std Dev      Minimum      Maximum
YBAR      1000  55.2352000   13.6250699   30.4000000   112.8000000
SAMPLES2  1000  4305.13      7264.58      391.0947368  24074.46
COVER     1000  0.8820000    0.3227695    0            1.0000000
COVERC    1000  0.9960000    0.0631505    0            1.0000000
COVERCT   1000  1.0000000    0            1.0000000    1.0000000
LENGTH    1000  42.8292971   31.4861507   16.0190487   125.6822775
LENGTHC   1000  97.7254893   71.8433336   36.5513674   286.7747757
LENGTHCT  1000  125.3214872  0            125.3214872  125.3214872

sample size 50, true SE of the mean = 7.6414757

Variable  N     Mean         Std Dev      Minimum      Maximum
YBAR      1000  55.8094600   7.7107916    38.5200000   74.5600000
SAMPLES2  1000  4637.26      4157.18      737.3322449  10631.07
COVER     1000  0.8630000    0.3440194    0            1.0000000
COVERC    1000  0.9960000    0.0631505    0            1.0000000
COVERCT   1000  1.0000000    0            1.0000000    1.0000000
LENGTH    1000  26.9867472   13.2999150   11.9956570   45.5492192
LENGTHC   1000  61.5768470   30.3470006   27.3710178   103.9316552
LENGTHCT  1000  68.3474366   0            68.3474366   68.3474366

With N = 136 (no opening day), ȳU = 50.76 and S² = 1464.

sample size 10, true SE of the mean = 11.649923

Variable  N     Mean         Std Dev      Minimum      Maximum
YBAR      1000  50.7963000   11.3441035   19.9000000   89.1000000
SAMPLES2  1000  1429.51      711.7743356  157.8333333  4784.04
COVER     1000  0.8960000    0.3054133    0            1.0000000
COVERC    1000  0.9880000    0.1089397    0            1.0000000
COVERCT   1000  1.0000000    0            1.0000000    1.0000000
LENGTH    1000  43.6492419   11.3983191   14.9897113   82.5261354
LENGTHC   1000  99.5963935   26.0080457   34.2026831   188.3035097
LENGTHCT  1000  104.2000826  0            104.2000826  104.2000826

sample size 20, true SE of the mean = 7.9040885

Variable  N     Mean         Std Dev      Minimum      Maximum
YBAR      1000  50.6555000   7.9726585    28.3500000   81.0500000
SAMPLES2  1000  1442.51      485.9184976  259.2078947  3192.01
COVER     1000  0.9180000    0.2745020    0            1.0000000
COVERC    1000  0.9970000    0.0547174    0            1.0000000
COVERCT   1000  1.0000000    0            1.0000000    1.0000000
LENGTH    1000  30.2929706   5.2582564    13.0330704   45.7356427
LENGTHC   1000  69.1208021   11.9979947   29.7381295   104.3570259
LENGTHCT  1000  70.6963167   0            70.6963167   70.6963167

sample size 50, true SE of the mean = 4.3042949

Variable  N     Mean         Std Dev      Minimum      Maximum
YBAR      1000  50.6693000   4.2748816    37.3000000   63.2800000
SAMPLES2  1000  1457.60      247.0899739  633.2575510  2154.05
COVER     1000  0.9410000    0.2357426    0            1.0000000
COVERC    1000  1.0000000    0            1.0000000    1.0000000
COVERCT   1000  1.0000000    0            1.0000000    1.0000000
LENGTH    1000  16.7685381   1.4415388    11.0933584   20.4597825
LENGTHC   1000  38.2615102   3.2892225    25.3122034   46.6839849
LENGTHCT  1000  38.4987841   0            38.4987841   38.4987841

4 Example 2: Estimating a proportion with an SRS.

As discussed, there are many situations involving relatively small sample sizes because of the expense, in either time or money, of obtaining an observation, or because testing is destructive. This shows some results with very small sample sizes using both approximate normal-based confidence intervals and exact confidence intervals based on the hypergeometric. Output was obtained using the program exactp.sas. x is the number of successes in the sample. Notice what happens with 0 successes in the sample to the usual approach of using the standard error and an approximate C.I.
N   n  x  phat  SE         Approx CI for p      Exact CI for p  Exact CI for total
20  5  2  0.4   0.212132   [-0.01578, 0.81577]  [0.1, 0.8]      [2, 16]
20  5  1  0.2   0.1732051  [-0.13948, 0.53948]  [0.05, 0.65]    [1, 13]
20  5  0  0.0   0.0        [0, 0]               [0.0, 0.45]     [0, 9]

5 Simulation in estimating a proportion with an SRS

Here are some results illustrating the performance of the sample proportion and confidence intervals through simulation (using the program simbin2.sas). We know exactly what the expected value and variance (or standard error) of the sample proportion are, so we don't need to simulate for that purpose. The value of the simulation here is mainly to look at the performance of the confidence intervals. In general, the behavior of the approximate confidence intervals depends on the true p as well as on both the population and sample size. In all but the last setting here the approximate CI does badly.

YBAR = mean of the sample proportion. YUBAR = true mean = proportion p.
COVER = proportion of successful CI's using the normal approximation
COVERC = proportion of successful CI's using Chebychev with estimated SE
COVERCT = proportion of successful CI's using Chebychev with true SE
COVERE = proportion of successful CI's using exact hypergeometric CI.
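The exact interval inverts the hypergeometric distribution: the CI for the population total of successes M is the set of values of M not rejected at level α/2 in either tail. A Python sketch of this (function names are mine, not from exactp.sas), checked against the exact intervals in the table above:

```python
from math import comb

def hyper_cdf(x, N, M, n):
    """P(X <= x) when X ~ Hypergeometric(N population, M successes, n drawn).
    math.comb returns 0 when k > n, which handles impossible terms."""
    denom = comb(N, n)
    return sum(comb(M, k) * comb(N - M, n - k) for k in range(x + 1)) / denom

def exact_ci_total(x, N, n, alpha=0.05):
    """Exact CI for the number of successes M in the population,
    by inverting the hypergeometric tail probabilities."""
    if x == 0:
        lower = 0
    else:
        lower = min(M for M in range(N + 1)
                    if 1 - hyper_cdf(x - 1, N, M, n) >= alpha / 2)
    upper = max(M for M in range(N + 1)
                if hyper_cdf(x, N, M, n) >= alpha / 2)
    return lower, upper

print(exact_ci_total(0, 20, 5))  # (0, 9): even with 0 successes the CI is not [0, 0]
```

Dividing the endpoints by N = 20 gives the exact CIs for p in the table ([0, 0.45] for x = 0), in contrast to the useless [0, 0] normal-approximation interval.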
population size = 20, sample size = 5
true mean = 0.05, population s2 = 0.05, true SE of the mean = 0.0866025

The MEANS Procedure
Variable  N     Mean       Std Dev    Minimum    Maximum
YBAR      1000  0.0500000  0.0866459  0          0.2000000
SAMPLES2  1000  0.0500000  0.0866459  0          0.2000000
COVER     1000  0.2500000  0.4332294  0          1.0000000
COVERC    1000  0.2500000  0.4332294  0          1.0000000
COVERCT   1000  1.0000000  0          1.0000000  1.0000000
COVERCE   1000  1.0000000  0          1.0000000  1.0000000
LENGTH    1000  0.1697379  0.2941417  0          0.6789514
LENGTHC   1000  0.3872983  0.6711561  0          1.5491933
LENGTHCT  1000  0.7745967  0          0.7745967  0.7745967
LENGTHE   1000  0.4875000  0.0649844  0.4500000  0.6000000

population size = 100, sample size = 20
true mean = 0.05, population s2 = 0.0479798, true SE of the mean = 0.0438086

Variable  N     Mean       Std Dev    Minimum    Maximum
YBAR      1000  0.0484000  0.0433511  0          0.2000000
SAMPLES2  1000  0.0465053  0.0400235  0          0.1684211
COVER     1000  0.6650000  0.4722266  0          1.0000000
COVERC    1000  0.6650000  0.4722266  0          1.0000000
COVERCT   1000  1.0000000  0          1.0000000  1.0000000
COVERCE   1000  0.9940000  0.0772656  0          1.0000000
LENGTH    1000  0.1354827  0.1011848  0          0.3217409
LENGTHC   1000  0.3091369  0.2308778  0          0.7341303
LENGTHCT  1000  0.3918359  0          0.3918359  0.3918359
LENGTHE   1000  0.2066100  0.0485733  0.1500000  0.3400000

population size = 200, sample size = 40
true mean = 0.02, population s2 = 0.0196985, true SE of the mean = 0.0198487

Variable  N     Mean       Std Dev    Minimum    Maximum
YBAR      1000  0.0196000  0.0201556  0          0.1000000
SAMPLES2  1000  0.0192923  0.0195169  0          0.0923077
COVER     1000  0.5790000  0.4939666  0          1.0000000
COVERC    1000  0.5790000  0.4939666  0          1.0000000
COVERCT   1000  1.0000000  0          1.0000000  1.0000000
COVERCE   1000  0.9970000  0.0547174  0          1.0000000
LENGTH    1000  0.0575566  0.0511733  0          0.1684271
LENGTHC   1000  0.1313294  0.1167644  0          0.3843076
LENGTHCT  1000  0.1775319  0          0.1775319  0.1775319
LENGTHE   1000  0.1041200  0.0279098  0.0750000  0.1850000

population size = 500, sample size = 40
true mean = 0.2, population s2 = 0.1603206, true SE of the mean = 0.0607238

Variable  N     Mean       Std Dev    Minimum    Maximum
YBAR      1000  0.1970500  0.0571911  0.0250000  0.4000000
SAMPLES2  1000  0.1589269  0.0352541  0.0250000  0.2461538
COVER     1000  0.9340000  0.2484063  0          1.0000000
COVERC    1000  0.9980000  0.0446990  0          1.0000000
COVERCT   1000  1.0000000  0          1.0000000  1.0000000
COVERCE   1000  0.9740000  0.1592148  0          1.0000000
LENGTH    1000  0.2354034  0.0274412  0.0939966  0.2949479
LENGTHC   1000  0.5371303  0.0626137  0.2144761  0.6729956
LENGTHCT  1000  0.5431298  0          0.5431298  0.5431298
LENGTHE   1000  0.2492600  0.0248359  0.1240000  0.3040000

6 Duck Example: Estimating population size with variable size units using ratio and regression estimators.

This example is typical of settings where there are variable size units and the size of the unit can be used as an auxiliary variable. Here there is an area of 1303 square miles divided into N = 140 transects, each 1/2 mile wide but of varying lengths. The goal is to estimate the number of black-belly ducks in the area. Define xi = length of the ith transect and yi = # of black-belly tree ducks in transect i. The main objective is to estimate ty = 140ȳU = total number of ducks in the area. The area of the ith transect is (1/2)*xi, and we know that 1303 = Σi (1/2)xi = tx/2, so we know tx = 2606 (without having to know all of the individual xi) and x̄U = 2606/140 = 18.614. The number of ducks per square mile is ty/1303 = ty/(tx/2) = 2B where B = ȳU/x̄U.

Ratio Estimation. The ratio estimator of ȳU is ȳ̂r = x̄U*(ȳ/x̄) and the ratio estimator of the total is t̂yr = 140*ȳ̂r = 2606(ȳ/x̄). A simple random sample of n = 14 transects yielded the following data and estimates. Using just the y values leads to an estimated mean of ȳ = 21.07 and an estimated total of t̂y = 2950, with an estimated standard error for the estimated total of 1072.92. Using a ratio estimator, the estimated ratio of means is B̂ = 1.22, the estimated mean is ȳ̂r = 22.65, and the estimated total is t̂yr = 3171.49 with an estimated SE of the total of 1098.16 or 1180.6 (depending on whether x̄U or x̄ is used in the variance of B̂).
The estimated SE of the ratio estimator is larger than that obtained with the use of ȳ (but these are estimates, so the data analysis does not tell us which estimator is actually better). Note that there is one outlier (obs 8) that destroys the value of using the ratio or regression estimator here.

Obs  length  ducks    Obs  length  ducks
1    6.2     3        8    18.6    114
2    12.4    3        9    22.8    24
3    2.3     30       10   18.6    48
4    22.8    3        11   16.6    0
5    20.7    15       12   18.6    8
6    24.8    16       13   16.6    0
7    22.8    4        14   18.6    27

Estimated mean and total without using x
YBAR       SEYBAR     THAT  SETHAT
21.071429  7.6636848  2950  1072.9159

Ratio Estimators using xubar in variance of Bhat
BHAT       SEBHAT1    YRHAT      SEYRHAT1   TYHATR     SETYHATR1
1.2169967  0.4213959  22.653524  7.8439846  3171.4934  1098.1578

Ratio Estimators using xbar in variance of Bhat
BHAT       SEBHAT2    YRHAT      SEYRHAT2   TYHATR     SETYHATR2
1.2169967  0.4530354  22.653524  8.4329306  3171.4934  1180.6103

In SAS surveymeans you can get B̂ and its estimated standard error using

proc surveymeans total=140;
var ducks length;
ratio ducks/length;
run;

Ratio Analysis
Numerator  Denominator  N   Ratio     Std Err   95% CL for Ratio
ducks      length       14  1.216997  0.453035  0.23827320  2.19572020

The duck example using a regression estimator

The regression fit is:

The REG Procedure
Dependent Variable: ducks
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   18.00461            24.91561        0.72     0.4838
length     1   0.17713             1.35473         0.13     0.8981

*Carrying out analysis in IML*
B estimated regression coefficients
18.004608 0.1771266

Estimated mean and total using regression estimator
YREG       SEYREG     TREG      SETREG
21.301693  7.6582319  2982.237  1072.1525

Once you have the estimated regression coefficients you can also create new variables ye = ducks + 0.1771266*((2606/140) - length) and te = 140*ye, so that running these through surveymeans gives the estimated mean and total and their standard errors.
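A Python sketch of the ratio computations, using the 14 sampled transects above (the same formulas ratreg.sas implements; variable names here are mine). The SE of B̂ uses the residuals e_i = y_i − B̂x_i with x̄U in the denominator, i.e. the "xubar" version:

```python
import math

# The n = 14 sampled transects: (length x_i, ducks y_i).
data = [(6.2, 3), (12.4, 3), (2.3, 30), (22.8, 3), (20.7, 15),
        (24.8, 16), (22.8, 4), (18.6, 114), (22.8, 24), (18.6, 48),
        (16.6, 0), (18.6, 8), (16.6, 0), (18.6, 27)]
N, n = 140, len(data)
tx = 2606.0                       # known total transect length
xbar_U = tx / N                   # 18.614...
xbar = sum(x for x, _ in data) / n
ybar = sum(y for _, y in data) / n

bhat = ybar / xbar                # estimated ratio B-hat
t_ratio = tx * bhat               # ratio estimator of the total

# Residual variance and the delta-method SE using xbar_U.
se2 = sum((y - bhat * x) ** 2 for x, y in data) / (n - 1)
se_bhat = math.sqrt((1 - n / N) * se2 / (n * xbar_U ** 2))
se_t_ratio = tx * se_bhat
```

Running this reproduces BHAT = 1.2169967, TYHATR = 3171.4934 and SEBHAT1 = 0.4213959 from the output above; replacing xbar_U by xbar in the variance gives the second (larger) SE, 0.4530354.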
The SURVEYMEANS Procedure
                            Std Error     Lower 95%    Upper 95%
Variable  N   Mean          of Mean       CL for Mean  CL for Mean
ye        14  21.301693     7.658232      4.757089     37.846297
te        14  2982.237041   1072.152467   665.992457   5298.481626

Computing variances for the ratio, estimated slope, and ratio and regression estimators using the jackknife.

The example below, which continues to use the duck data, was done using a customized IML program (ratreg.sas) that does both ratio and regression estimators for an SRS. The standard errors of the estimates are calculated using the approximate variances resulting from the delta method (see class discussion and text) and using the delete-one jackknife. The most recent version of SAS (9.2) now has a "varmethod=" option which allows for using the jackknife. This, in combination with a ratio command, will supply a jackknife estimate of either the variance or standard error of B̂ = estimator of ȳU/x̄U. Some of the analyses given earlier are repeated here for easy comparison.

Estimated mean and total without using x
YBAR       SEYBAR     THAT  SETHAT
21.071429  7.6636848  2950  1072.9159

Ratio Estimators using xubar in variance of Bhat
BHAT       SEBHAT1    YRHAT      SEYRHAT1   TYHATR     SETYHATR1
1.2169967  0.4213959  22.653524  7.8439846  3171.4934  1098.1578

Ratio Estimators using xbar in variance of Bhat
BHAT       SEBHAT2    YRHAT      SEYRHAT2   TYHATR     SETYHATR2
1.2169967  0.4530354  22.653524  8.4329306  3171.4934  1180.6103

B estimated regression coefficients
18.004608 0.1771266

Estimated mean and total using regression estimator
YREG       SEYREG     TREG      SETREG
21.301693  7.6582319  2982.237  1072.1525

Jackknife estimate of ratio 1.2069393
Jackknife SE of ratio, based on pseudo values 0.4544981
Jackknife SE of ratio, using original bhat in se calculation 0.4545058
Jackknife ratio estimate of yubar 22.466313
Jackknife SE of est. mean, based on pseudo values 8.4601578
Jackknife SE of est. mean using original bhat in se calculation 8.4603012
Jackknife regression estimate of yubar 21.09824
Jackknife SE based on pseudo values 8.0239971
Jackknife SE using original bhat in se calculation 8.0241757

pratio and preg contain the pseudo values from deleting one observation at a time. Surveymeans with N = 140 was run on the pseudo values.

                         Std Error
Variable  N   Mean       of Mean    95% CL for Mean
PRATIO    14  1.206939   0.454498   0.22505583  2.1888228
PREG      14  21.098240  8.023997   3.76344817  38.4330318

Plot of ducks*length. Legend: A = 1 obs, B = 2 obs, etc. [ASCII scatterplot omitted; the single transect with 114 ducks stands far above the rest.]

7 Allocation in Stratified Sampling: Example using the agricultural farm data.

Here the objective is to estimate the total acres of farmland in the country. The strata are the states and the counties are the sampling units within the strata. We will use the 1992 sample standard deviations as the Sk's in determining the optimal allocation and in finding the variances of the estimates based on different sampling schemes. In the program below the total sample size n is specified. Both the proportional and the optimal (Neyman) allocation often give non-integers, so we round up (unless that exceeds the size for that state), which means the actual sample sizes are often bigger than desired. This could be modified to use rounding to the nearest integer instead of always rounding up. With the optimal allocation there is no guarantee that the sample size for stratum k is less than Nk. Here, if it comes out larger than Nk, I just set the sample size for that stratum equal to Nk. This happens with larger total n, and in this case the actual sample size used under the optimal allocation is less than the specified n. This could also be modified.
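The allocation rules just described can be sketched in a few lines of Python (the stratum sizes and standard deviations below are made-up illustration values, not the farm data):

```python
import math

def allocate(n, N_h, S_h):
    """Proportional and Neyman (optimal) allocations for a stratified
    design, rounding up and truncating at the stratum size N_h,
    as described in the notes."""
    N = sum(N_h)
    denom = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h))
    prop = [min(math.ceil(n * Nh / N), Nh) for Nh in N_h]
    opt = [min(math.ceil(n * Nh * Sh / denom), Nh)
           for Nh, Sh in zip(N_h, S_h)]
    return prop, opt

# Illustration only: a small, very variable stratum next to two large,
# homogeneous ones.
prop, opt = allocate(100, [40, 300, 660], [3000.0, 100.0, 50.0])
```

With these numbers the raw Neyman share for the first stratum exceeds its size 40 and is truncated, so the realized optimal total comes out below the specified n = 100, just as happens with the farm data at n = 2000 below.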
The purpose here was to give a general idea of the improvements from using optimal rather than proportional allocation, and a comparison to the use of an SRS. Finally, I've left the sample sizes of 1 in there in some places. This is fine in terms of computing the variance of the estimator when we treat the Sk's as known and compare variances, but we wouldn't really use samples of size 1 from a stratum since we can't get an estimated variance within that stratum.

DESCRIPTION OF THE POPULATION
Analysis Variable: ACRES92
STATE  N   Mean        Std Dev    Minimum   Maximum
AK     4   59876.00    58992.47   210.00    141338.00
AL     67  126131.69   53670.31   35748.00  233422.00
...
WY     23  1429394.39  783353.88  62307.00  2720903.00

Obs  STATE  _TYPE_  _FREQ_  bign  ybar        s
1    AK     1       4       4     59876.00    58992.47
2    AL     1       67      67    126131.69   53670.31
...
50   WY     1       23      23    1429394.39  783353.88

TOTALN = 3058

total sample size desired = 100

STATE  ALLOC      PROPSS  OPTSS  N
AK     0.0003797  1       1      4
AL     0.0057868  3       1      67
AR     0.0136038  3       2      75
AZ     0.0550905  1       6      15
...
WA     0.0300641  2       4      39
WI     0.0148585  3       2      70
WV     0.0040224  2       1      55
WY     0.0289943  1       3      23

sample sizes used
TOTALSS  SUMPROP  SUMOPT
100      128      129

variances for estimating the mean acres/county
VARYBAR    VARPROP    VAROPT
1.74988E9  655566777  321906326

variances for estimating the total acres
VART       VARTPROP   VARTOPT
1.6364E16  6.1304E15  3.0103E15

standard errors for estimating mean acres/county
SEYBAR     SEPROP     SEOPT
41831.617  25604.038  17941.748

SEYBAR above uses n = 100. To make it more comparable to the designs actually used for the proportional and optimal allocations: SEYBAR with n = 128 is 36798.862 and SEYBAR with n = 129 is 36649.698.
RATIOOP (ratio of SE optimal to SE proportional) = 0.700739
RATIOOY (ratio of SE optimal to SE for SRS) = 0.428904

standard errors for estimating total acres
SET        SETPROP   SETOPT
127921085  78297149  54865866

******* n = 2000 *****

The total sample size used for the optimal allocation is only 1515. This is because the allocation for a number of the states is bigger than N_k for that state and so it was truncated to N_k.

total sample size desired = 2000

STATE  ALLOC      PROPSS  OPTSS  N
AK     0.0003797  3       1      4
AL     0.0057868  44      12     67
AZ     0.0550905  10      15     15
...
WV     0.0040224  36      9      55
WY     0.0289943  16      23     23

sample sizes used
TOTALSS  SUMPROP  SUMOPT
2000     2021     1515

standard errors for estimating mean acres/county
SEYBAR     SEPROP     SEOPT
5594.1409  3993.0167  2053.3689

RATIOOP (ratio of SE optimal to SE proportional) = 0.51424
RATIOOY (ratio of SE optimal to SE for SRS) = 0.3670571

standard errors for estimating total acres
SET       SETPROP   SETOPT
17106883  12210645  6279202.2

8 Stratified Sampling in Surveymeans: Estimating egg mass densities with stratified samples.

h is used here as in Lohr's book; in other places we've used k. The estimate of the population mean is ȳ̂str = Σ(h=1 to H) Nh ȳh / N. If the Nh are all equal then N = H*Nh and so the estimate is Σh ȳh / H, which is the mean of the strata means. But if you run surveymeans without any weighting it gives you as an estimate just the overall sample mean ȳ. The stratified example given in the SAS documentation is only correct without the use of any weighting because the sampling is proportional to size; that is, all the sampling rates are the same, in which case we know that ȳ̂str = ȳ. They fail to mention that with unequal sampling rates you need to weight. This can be done by creating a weight of Nh/nh for each observation in stratum h. This is demonstrated in the egg mass/defoliation example using the egg values. With no weighting surveymeans will just give the overall mean. The sampling rate in one of the strata is different, so we need to weight.
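A sketch of the weighted stratified estimate and its standard error in Python, using the per-plot summaries from the egg data (Nh = 600 for every plot; nh = 15 except 30 for plot C2; the s² values are those printed in the strat.sas output in these notes):

```python
import math

# Per-stratum (plot) egg-mass summaries: (n_h, ybar_h, s2_h); N_h = 600 each.
strata = {
    "460": (15, 54.2,      2526.6),
    "4A":  (15, 21.133333, 436.69524),
    "C13": (15, 41.066667, 1374.9238),
    "C2":  (30, 9.0,       144.75862),
    "D1":  (15, 29.466667, 769.55238),
    "H6":  (15, 15.8,      279.88143),
    "N22": (15, 54.986667, 10361.958),
    "S42": (15, 18.2,      225.74286),
}
N_h = 600
N = N_h * len(strata)

# ybar_str = sum_h (N_h/N) ybar_h; with equal N_h this is the mean of the
# strata means, NOT the overall sample mean (C2 is sampled at a higher rate).
ybar_str = sum((N_h / N) * yb for n, yb, s2 in strata.values())
var_str = sum((N_h / N) ** 2 * (1 - n / N_h) * s2 / n
              for n, yb, s2 in strata.values())
se_str = math.sqrt(var_str)
```

This reproduces the weighted surveymeans results below: mean 30.481667 with standard error 4.036904.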
There is also an IML program that does the calculation directly as a double check on what surveymeans is doing.

option linesize=70;
data dd;
infile '87md.dat';
input plot $ subplot egg def;
wt=600/15;
if plot = 'C2' then wt=600/30;
run;
proc means; var egg; run;
proc means nway;
class plot;
var egg;
output out=mout mean=megg;
run;
proc print data=mout;
proc means data=mout; var megg; run;
/* Here we create a file called totals that has the strata sizes in it */
data totals;
set mout;
_total_=600;
keep plot _total_;
run;
proc print data=totals; run;
proc surveymeans data=dd total=totals mean clm sum clsum;
stratum plot/list;
var egg;
weight wt;
run;

THE OVERALL MEAN
N    Mean        Std Dev     Minimum  Maximum
135  28.0948148  44.6275086  0        414.6000000

Obs  plot  _TYPE_  _FREQ_  megg
1    460   1       15      54.2000
2    4A    1       15      21.1333
3    C13   1       15      41.0667
4    C2    1       30      9.0000
5    D1    1       15      29.4667
6    H6    1       15      15.8000
7    N22   1       15      54.9867
8    S42   1       15      18.2000

THE MEAN OF THE STRATA MEANS
N  Mean        Std Dev     Minimum    Maximum
8  30.4816667  17.6933031  9.0000000  54.9866667

USING SURVEYMEANS WITH WEIGHTING

Obs  plot  _total_
1    460   600
2    4A    600
3    C13   600
4    C2    600
5    D1    600
6    H6    600
7    N22   600
8    S42   600

Stratum        Population  Sampling
Index  plot    Total       Rate    N Obs  Variable  N
1      460     600         2.50%   15     egg       15
2      4A      600         2.50%   15     egg       15
3      C13     600         2.50%   15     egg       15
4      C2      600         5.00%   30     egg       30
5      D1      600         2.50%   15     egg       15
6      H6      600         2.50%   15     egg       15
7      N22     600         2.50%   15     egg       15
8      S42     600         2.50%   15     egg       15

                      Std Error
Variable  Mean        of Mean    95% CL for Mean
egg       30.481667   4.036904   22.4933621  38.4699713

Variable  Sum     Std Dev  95% CL for Sum
egg       146312  19377    107968.138  184655.862

*** FROM AN IML PROGRAM TO DO ANALYSIS FOR STRATIFIED RANDOM SAMPLING.
THIS IS STRAT.SAS ON THE WEB SITE. **

N   MEAN       S2         SD         FPC
15  54.2       2526.6     50.265296  0.975
15  21.133333  436.69524  20.897254  0.975
15  41.066667  1374.9238  37.079965  0.975
30  9          144.75862  12.031568  0.95
15  29.466667  769.55238  27.740807  0.975
15  15.8       279.88143  16.729657  0.975
15  54.986667  10361.958  101.79371  0.975
15  18.2       225.74286  15.024742  0.975

YBHAT      SEYBHAT    THAT    SETHAT
30.481667  4.0369041  146312  19377.139

9 Single stage cluster example. Ice Cream data.

We'll illustrate the computing for one-stage cluster sampling using the example from the SAS documentation, but for now we'll use just the grade 8 data. There is a total of K = 1025 students broken up into 252 study groups. Three study groups are sampled, and each student in a sampled study group is observed, with the response being the amount spent per week on ice cream. The sample totals, numbers of ssu's and means are:

ts1 = 68, Ms1 = 4, ȳ1 = 17
ts2 = 38, Ms2 = 3, ȳ2 = 12.667
ts3 = 33, Ms3 = 2, ȳ3 = 16.5

with t̄ = 46.3333 and M̄ = 3. The estimated total is t̂ = N*t̄ = 252*46.3333 = 11676, and the estimated mean is ȳ̂unb = t̂/1025 = 11.391, which uses the knowledge of the number of ssu's. The ratio estimator of ȳU (which doesn't use the 1025) is ȳ̂ratio = 46.3333/3 = 15.4444.

If we create a file having the tsi's and Msi's in it, we can just analyze this as an SRS using surveymeans with weight 252/3 and total=252. As discussed for an SRS, the sum will give t̂, while with a weight of 252/(3*1025) the sum will give ȳ̂unb. The associated standard errors are SE(t̂) = 2737.68 and SE(ȳ̂unb) = 2.6709. One can also use the ratio option to get the ratio estimate of the mean with weight 252/3; this is done near the end.

Working with the original data directly: with no weighting, the estimate of the mean is the ratio estimator. With weight = 252/3 and total=252, the estimate of the sum is t̂, while with a weight of 252/(3*1025) the estimate of the sum is ȳ̂unb.
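A Python sketch of these one-stage cluster computations, treating the cluster totals as an SRS of size n = 3 from the N = 252 study groups:

```python
import math

# Sampled grade-8 study-group (cluster) totals t_i and sizes M_i.
t = [68, 38, 33]
M = [4, 3, 2]
N, n, K = 252, 3, 1025   # clusters in population, clusters sampled, students

tbar = sum(t) / n
t_hat = N * tbar                # unbiased estimate of the total
ybar_unb = t_hat / K            # per-student mean, using the known K
ybar_ratio = sum(t) / sum(M)    # ratio estimate of the mean (no K needed)

# SE of t_hat from the SRS-of-cluster-totals formula with FPC.
s2_t = sum((ti - tbar) ** 2 for ti in t) / (n - 1)
se_t_hat = N * math.sqrt((1 - n / N) * s2_t / n)
se_ybar_unb = se_t_hat / K
```

This reproduces the surveymeans output below: t̂ = 11676 with SE 2737.68, ȳ̂unb = 11.391 with SE 2.6709, and the ratio estimate 15.4444.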
Obs  Grade  StudyGroup  Spending  wt   wt2
1    8      59          20        8.4  0.081951
2    8      59          13        8.4  0.081951
3    8      59          17        8.4  0.081951
4    8      143         12        8.4  0.081951
5    8      143         16        8.4  0.081951
6    8      59          18        8.4  0.081951
7    8      143         10        8.4  0.081951
8    8      156         19        8.4  0.081951
9    8      156         14        8.4  0.081951

            N
StudyGroup  Obs  N  Mean        Std Dev    Minimum     Maximum
59          4    4  17.0000000  2.9439203  13.0000000  20.0000000
143         3    3  12.6666667  3.0550505  10.0000000  16.0000000
156         2    2  16.5000000  3.5355339  14.0000000  19.0000000

Obs  StudyGroup  _TYPE_  _FREQ_  ts  m
1    59          1       4       68  4
2    143         1       3       38  3
3    156         1       2       33  2

grade 8 unweighted
The SURVEYMEANS Procedure
Number of Clusters 3
Number of Observations 9

                          Std Error
Variable  N  Mean         of Mean    95% CL for Mean
Spending  9  15.444444    1.435506   9.26796013  21.6209288

grade 8 wt = N/n
                       Std Error
Variable  Mean         of Mean    95% CL for Mean          Sum
Spending  15.444444    1.435506   9.26796013  21.6209288   11676

Variable  Std Dev      95% CL for Sum
Spending  2737.681501  -103.29278  23455.2928

grade 8 wt = N/nK
                       Std Error
Variable  Mean         of Mean    95% CL for Mean          Sum
Spending  15.444444    1.435506   9.26796013  21.6209288   11.391220

Variable  Std Dev   95% CL for Sum
Spending  2.670909  -0.1007734  22.8832125

working with ts wt = 252/3
Number of Observations 3

Variable  Sum  Std Dev
95% CL for Sum
ts  11676       2737.681501  -103.29278  23455.2928
m   756.000000  144.623650   133.73466   1378.2653

Ratio Analysis
Numerator  Denominator  Ratio      Std Err
ts         m            15.444444  1.435506

working with ts wt = 252/(3*1025)
                       Std Error
Variable  Mean         of Mean     95% CL for Mean           Sum
ts        46.333333    10.863815   -0.4098920  93.0765587    11.391220

Variable  Std Dev   95% CL for Sum
ts        2.670909  -0.1007734  22.8832125

option ls=80 nodate;
data IceCreamStudy;
input Grade StudyGroup Spending @@;
wt=252/3;
wt2=252/(3*1025);
datalines;
7 34 7    7 34 7    7 412 4   9 27 14   7 34 2    9 230 15
9 27 15   7 501 2   9 230 8   9 230 7   7 501 3   8 59 20
7 403 4   7 403 11  8 59 13   8 59 17   8 143 12  8 143 16
8 59 18   9 235 9   8 143 10  9 312 8   9 235 6   9 235 11
9 312 10  7 321 6   8 156 19  8 156 14  7 321 3   7 321 12
7 489 2   7 489 9   7 78 1    7 78 10   7 489 2   7 156 1
7 78 6    7 412 6   7 156 2   9 301 8
;
data grade8;
set IceCreamStudy;
if grade = 8;
proc print; run;
proc means data=grade8 nway;
class StudyGroup;
var spending;
output out=gr8out sum=ts n=m;
proc print data=gr8out; run;

/* Using the original data */
title 'grade 8 unweighted';
proc surveymeans data=grade8 total=252; /* the sum will not be useful here */
cluster StudyGroup;
var spending;
run;

title 'grade 8 wt = N/n';
proc surveymeans data=grade8 total=252 mean sum clm clsum;
cluster StudyGroup;
var spending;
weight wt;
run;

title 'grade 8 wt = N/nK';
/* here the sum will give the unbiased estimator of the per ssu mean */
proc surveymeans data=grade8 total=252 mean sum clm clsum;
cluster StudyGroup;
var spending;
weight
wt2; run; proc means data=gr8out; var ts m; run; data tvalues; set gr8out; wt = 252/3; 23 wt2 = 252/(3*1025); run; title ’working with ts wt = 252/3’; proc surveymeans data = tvalues total=252 sum clsum; var ts m; ratio ts/m; weight wt; run; title ’working with ts wt = 252/(3*1024)’; proc surveymeans data = tvalues total=252 mean sum clm clsum; var ts; weight wt2; run; 10 Ice Cream data with stratification and cluster sampling. The number of study groups are 608, 252 and 403 for grades 7, 8 and 9 respectively. The number sampled are 8, 3 and 5. If we assign weight Nh /nh for an observation from stratum h, then the estimate of the mean is the ratio estimator. For the total it is the unbiased estimate of the total. With weight Nh /(nh K) for an observation from stratum h, then the estimate of the mean is still the ratio estimator, but for the sum it is is the unbiased estimate of ȳu . Obs 1 2 3 grade 7 8 9 _total_ 608 252 403 Stratum Population Sampling Index Grade Total Rate N Obs Variable N Clusters ------------------------------------------------------------------------------1 7 608 1.32% 20 Spending 20 8 2 8 252 1.19% 9 Spending 9 3 3 9 403 1.24% 11 Spending 11 5 ------------------------------------------------------------------------------Std Error Variable N Mean of Mean 95% CL for Mean ----------------------------------------------------------------------------24 Spending 40 8.750000 0.634549 7.37913979 10.1208602 ----------------------------------------------------------------------------weighted by N_h/n_h Std Error Variable Mean of Mean 95% CL for Mean Sum ----------------------------------------------------------------------------Spending 8.923860 0.650859 7.51776370 10.3299565 28223 ----------------------------------------------------------------------------Variable Std Dev 95% CL for Sum -----------------------------------------------Spending 3456.556840 20755.1629 35690.0371 -----------------------------------------------weighted by N_h/n_hK Std Error 
Variable Mean of Mean 95% CL for Mean Sum ----------------------------------------------------------------------------Spending 8.923860 0.650859 7.51776370 10.3299565 7.055650 ----------------------------------------------------------------------------Variable Std Dev 95% CL for Sum -----------------------------------------------Spending 0.864139 5.18879074 8.92250926 -----------------------------------------------option ls=80 nodate; /* The ice cream example with grades as strata */ data ic; input Grade StudyGroup Spending @@; if grade= 7 then wt = 608/8; if grade = 8 then wt=252/3; if grade = 9 then wt = 403/5; wtk = wt/4000; datalines; 7 34 7 7 34 7 7 412 4 9 27 14 7 34 2 9 230 15 9 27 15 7 501 2 9 230 8 9 230 7 7 501 3 8 59 20 7 403 4 7 403 11 8 59 13 8 59 17 8 143 12 8 143 16 8 59 18 9 235 9 8 143 10 9 312 8 9 235 6 9 235 11 9 312 10 7 321 6 8 156 19 8 156 14 7 321 3 7 321 12 7 489 2 7 489 9 7 78 1 7 78 10 7 489 2 7 156 1 7 78 6 7 412 6 7 156 2 9 301 8 25 ; data Stsizes; input grade _total_; datalines; 7 608 8 252 9 403 ; proc print; run; title ’unweighted’; proc surveymeans data=ic total=Stsizes; strata grade/list; cluster StudyGroup; var spending; run; title ’weighted by N_h/n_h ’; proc surveymeans data=ic total=Stsizes mean sum clm clsum; strata grade/list; cluster StudyGroup; var spending; weight wt; run; title ’weighted by N_h/n_hK’; proc surveymeans data=ic total=Stsizes mean sum clm clsum; strata grade/list; cluster StudyGroup; var spending; weight wtk; run; 26 11 Ozone levels and systematic sampling Ozone example from Problem 5-22 in Lohr. The data was stretched out so each case was an hourly reading with 24 successive hours over a date (I’m not sure what is going on with the dates in her file but have assumed they are in chronological order). With two days eliminated this gives 17472 records. There were 264 missing values. 
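Part b) below views a 1-in-24 systematic sample as one-stage cluster sampling with a single sampled cluster. That equivalence is easy to check numerically on a toy population; this is an illustrative Python sketch (not part of the original SAS analysis), using a small artificial population rather than the ozone data:

```python
import statistics

# Artificial population of N = 24 values; a 1-in-4 systematic sample
# corresponds to choosing one of k = 4 clusters of size 24/4 = 6 each.
pop = [float(v) for v in range(1, 25)]
k = 4
clusters = [pop[start::k] for start in range(k)]  # cluster = every k-th unit

# Each possible systematic sample mean is just its cluster's mean.
cluster_means = [statistics.mean(c) for c in clusters]
ybar_u = statistics.mean(pop)

# Over the k equally likely starting points, the estimator is unbiased ...
assert abs(statistics.mean(cluster_means) - ybar_u) < 1e-12

# ... and its true variance is the variance of the k cluster means.
v_sys = sum((m - ybar_u) ** 2 for m in cluster_means) / k
print(v_sys)
```

The same logic, with 24 clusters of 728 hourly readings each, underlies the ozone calculations that follow.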
Here is the information on each cluster if we sample a random hour and then observe ozone at that hour every day over the two years. hour 1 2 3 4 5 _TYPE_ 1 1 1 1 1 _FREQ_ 728 728 728 728 728 n 720 722 723 722 721 mean 24.2319 23.8407 23.7801 23.6440 23.6519 sum 17447 17213 17193 17071 17053 var 114.003 116.123 116.615 120.759 118.774 20 21 22 23 24 1 1 1 1 1 728 728 728 728 728 721 722 722 703 717 28.0888 26.7355 25.6731 25.0853 24.7294 20252 19303 18536 17635 17731 129.820 118.874 112.478 115.873 116.471 a) Population information. Mean 27.60350 Median 29.00000 Std Deviation Variance 11.43614 130.78536 b) We choose every 24th value after a random starting point from 1 to 24. This means the sample size (not accounting for missing values) is 728. This is equivalent to having 24 clusters, each of size 728 (see above ) and we pick one at random. Note that the sample will have values from the 728 dates, all at the same hour, but with some missing. The sample I chose had 20 missing values c) In analyzing as an SRS, I used 17208 since that was the number of nonmissing values. proc surveyselect method=sys sampsize=728 out=sysout; proc surveymeans total=17208 data=sysout mean clm; var ozone; Std Error Lower 95% Upper 95% Variable Mean of Mean CL for Mean CL for Mean -----------------------------------------------------------------------ozone 30.080508 0.383182 29.328198 30.832819 -----------------------------------------------------------------------The interval misses the true population mean. d) Now we have one stage cluster sampling with n = 4 sampled clusters out of 96. Surveyselect states that it chooses independent samples, so that means the sampling is actually done with 27 replacement. This means there is no need for the finite population correction factor that comes from specifying total=96. But, if you reran until you got 4 distinct clusters, this would be equivalent to sampling without replacement. Here there were four distinct clusters. 
If you did put in total=96, this leads to a minor change in the standard error, namely .965381. With the clusters of equal size you can run the analysis without any weighting if you ignore missing values. This should be rerun with weighting to account for the missing values.

proc surveyselect data=a method=sys sampsize=182 out=sysout2 rep=4; run;
proc surveymeans data=sysout2 mean clm;
cluster replicate; var ozone; run;

Number of Clusters 4
Number of Observations 728

                      Std Error   Lower 95%     Upper 95%
Variable   Mean       of Mean     CL for Mean   CL for Mean
ozone      25.303621  0.986145    22.165269     28.441974

Looking at the effect of treating a systematic sample (one cluster of 728) as if it were an SRS. To illustrate, I filled in all of the missing values with 20. The mean of varsrs below is the expected value of the estimator of V(ȳ̂u) obtained by treating the systematic sample as an SRS. The variance of vsys (the 24 possible values of ȳ̂u) is the true variance of ȳ̂u under systematic sampling.

Variable   N    Mean         Variance
varsrs     24   0.1566136    0.000193907
vsys       24   26.9098379   10.8536016

The real variance of ȳ̂u is 10.8536016. The expected value of the estimator of the variance of ȳ̂u assuming an SRS is .1566136, obviously dramatically wrong. This seems a little strange, but it can be cross-checked through a simulation: run 10000 simulations, each drawing a systematic sample and computing ybarhat = ȳ̂u and varsrs, the estimator of the variance of ȳ̂u assuming an SRS. Note that the variance of ybarhat and the mean of varsrs agree with the theoretical calculations. Any difference is due to sampling error (from 10000 rather than infinitely many simulations).

Variable   N       Mean         Variance       Std Dev
YBARHAT    10000   27.4863152   10.8206840     3.2894808
VARSRS     10000   0.1633585    0.000203398    0.0142617

12 Clarification on page 31 and model/design based approach with an SRS

Top of page 31.
The notation is potentially confusing and I inadvertently left off the key line when I rewrote these notes. Here is a more detailed explanation.

In our notation based on the model, Ȳ = Σ_{i∈S} Y_i/n. From the model-based perspective, the S designating the units in the sample is fixed. When we include the model and the design, S is random and we can think of Ȳ as random over both the model and the design. If we condition on the realized values Y = y = (Y_1 = y_1, . . . , Y_N = y_N), then

E_D(Ȳ | Y = y) = ȳ_u and V_D(Ȳ | Y = y) = (N − n)S²/Nn,

where S² = Σ_{i=1}^{N} (y_i − ȳ_u)²/(N − 1). This expected value and variance are over the design (here, a simple random sample), which is why I've written D. We can write S²_RV = Σ_{i=1}^{N} (Y_i − Ȳ_u)²/(N − 1) to describe S² as random over the model. Over the model and the design, with M indicating mean and variance over the model,

E(Ȳ) = E_M(E_D(Ȳ|Y)) = E_M(Ȳ_u) = µ

and

V(Ȳ) = E_M(V_D(Ȳ|Y)) + V_M(E_D(Ȳ|Y)) = E_M((N − n)S²_RV/Nn) + V_M(Ȳ_u) = (N − n)σ²/Nn + σ²/N = σ²/n.

Cov(Ȳ, Ȳ_u) = E_M(Cov_D(Ȳ, Ȳ_u | Y)) + Cov_M(E_D(Ȳ|Y), E_D(Ȳ_u|Y)) = 0 + V_M(Ȳ_u) = σ²/N (page 29).

The missing line is:

V(Ȳ − Ȳ_u) = V(Ȳ) + V(Ȳ_u) − 2Cov(Ȳ, Ȳ_u) = σ²/n + σ²/N − 2σ²/N = (N − n)σ²/Nn.

It is the square root of an estimate of V(Ȳ − Ȳ_u) which gets used for the standard error when predicting Ȳ_u with Ȳ. This variance is estimated by (N − n)s²/Nn, exactly as when using just the design-based approach.

13 Analysis of the egg-defoliation data as a two-stage cluster sample

We revisit the egg-defoliation data. Now suppose that the plots in the sample are an SRS of 8 plots out of a population of 50 plots (so the total population has K = 50 ∗ 600 subplots). The subplots are treated as an SRS within a plot. So, we have two-stage cluster sampling.
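As a numeric cross-check of the estimates worked out in this section, here is a short Python sketch (not part of the original notes, which use SAS) that reproduces the estimated total and its standard error from the per-plot summary statistics reported later in the section:

```python
# Per-plot sample means, SDs and subplot counts for the 8 sampled plots
# (values copied from the proc means output in this section).
means = [54.2, 21.1333333, 41.0666667, 9.0, 29.4666667, 15.8, 54.9866667, 18.2]
sds   = [50.2652962, 20.8972543, 37.0799651, 12.0315677, 27.7408071,
         16.7296572, 101.7937050, 15.0247415]
m     = [15, 15, 15, 30, 15, 15, 15, 15]
N, n, M = 50, 8, 600   # 50 plots, 8 sampled, M = 600 subplots per plot

# Estimated plot totals and the unbiased estimator of the population total.
t_i = [M * yb for yb in means]
t_hat = (N / n) * sum(t_i)

# First term: between-plot piece; second term: within-plot contribution.
tbar = sum(t_i) / n
s2_t = sum((t - tbar) ** 2 for t in t_i) / (n - 1)
v1 = N ** 2 * (1 - n / N) * s2_t / n
v2 = (N / n) * sum(M ** 2 * (1 - mi / M) * sd ** 2 / mi
                   for mi, sd in zip(m, sds))
se = (v1 + v2) ** 0.5
print(round(t_hat), round(se, 1))
```

This matches the direct calculations reported below (t̂ = 914450, SE ≈ 178690.7) up to rounding of the printed summary statistics.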
If ȳ_i, s²_i and m_i are the sample mean, variance and sample size for cluster i respectively, then the estimated total is

t̂ = (50/8) Σ_i 600 ȳ_i

and the estimated variance is

V̂(t̂) = 50² ((50 − 8)/(50 ∗ 8)) s²_t + (50/8) Σ_i 600² ((600 − m_i)/(600 m_i)) s²_i,   (1)

where s²_t is the sample variance of the estimated plot totals 600ȳ_i. As seen below with direct calculations, this leads to t̂_unb = 914450 with an estimated SE of 178690.70. Since M is constant (= 600) over all clusters, the ratio estimator of the mean per ssu is the same as the unbiased estimator, which is ȳ̂_unb = 914450/30000 = 30.48167 with SE = 178690.70/30000 = 5.957.

Computing. Using surveymeans with the weight for an observation in cluster i being N M_i/(n m_i), the estimate of the sum gives the correct estimator of the total and the ratio estimator is the unbiased estimator of the mean per ssu. (If you run surveymeans with weight wt2 = wt/30000, the analysis for either the mean or the sum is the same as given for the mean below.) As shown in class, however, the estimated standard errors from surveymeans ignore the second term, which arises from the second-stage sampling. For example, the standard error here of 171999 for the estimated total is too small; the correct SE is 178690.7.

plot  Obs  N   Mean         Std Dev      Sum
460   15   15  54.2000000   50.2652962   813.0000000
4A    15   15  21.1333333   20.8972543   317.0000000
C13   15   15  41.0666667   37.0799651   616.0000000
C2    30   30  9.0000000    12.0315677   270.0000000
D1    15   15  29.4666667   27.7408071   442.0000000
H6    15   15  15.8000000   16.7296572   237.0000000
N22   15   15  54.9866667   101.7937050  824.8000000
S42   15   15  18.2000000   15.0247415   273.0000000

Obs  plot  _TYPE_  _FREQ_  megg     sdegg   tegg   m
1    460   1       15      54.2000  50.265  813.0  15
.....
The SURVEYMEANS Procedure
Number of Clusters 8
Number of Observations 135

                      Std Error  Lower 95%    Upper 95%                          Lower 95%   Upper 95%
Variable   Mean       of Mean    CL for Mean  CL for Mean  Sum     Std Dev      CL for Sum  CL for Sum
egg        30.481667  5.733285   16.924601    44.038733    914450  171999       507738      1321162

If we run the analysis with the main plots treated as strata rather than clusters, then (see page 19 of the notes) we get an SE for the estimated total of 19377. We know from our work with stratified data that this SE, call it SE_S, is (Σ_i 600² ((600 − m_i)/(600 m_i)) s²_i)^{1/2}. If we multiply its square by (50/8) we get the second term in (1). So,

SE(t̂) = (V̂(t̂))^{1/2} = (171999² + (50/8) ∗ 19377²)^{1/2} = 178690.7.

You can also easily program up the direct calculations in either STATA or R. See the posted program for an example.

/* Analysis of the defoliation data as a two-stage cluster sample.
   Each subplot consists of samples of .1 ha. Each plot/psu is 60 ha
   in size and so consists of M = 600 ssu's. */
data dd;
infile '87md.dat';
input plot $ subplot egg def;
wt=50*600/(8*15); /* weights for running cluster analysis */
if plot='C2' then wt=50*600/(8*30);
run;
proc surveymeans data=dd total=50 mean clm sum clsum;
cluster plot; weight wt; var egg; run;

14 Two-stage cluster sampling; worm fragments in cans of corn.

This data is from Exercise 5.13. Programs for the analyses below are posted. Running the data first with clustering and then with stratification leads to V̂(t̂) = 8531.2921802² + (580/12) ∗ 95.5405672² = 73224133 (the first SE from the clustered run, the second from the stratified run scaled by N/n = 580/12, as in the previous section) and an estimated standard error of SE(t̂) = 8557.1101. This can also be obtained directly from the two-stage cluster sampling program. Since K = 24 ∗ 580, we get ȳ̂ = t̂/K = 3.611 (also given in both surveymeans runs), V̂(ȳ̂) = V̂(t̂)/K² = .3779 and SE(ȳ̂) = .6147.
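The per-can figures can be checked with simple arithmetic; a Python sketch (illustrative only, not part of the original SAS analysis), using the estimated total and its variance (VTHAT) from the SAS output in this section:

```python
# Per-can mean and SE from the estimated total: ybar-hat = t-hat / K.
t_hat = 50266.667   # estimated total worm fragments (THAT in the output)
v_t = 73224133      # estimated variance of t-hat (VTHAT in the output)
K = 24 * 580        # total number of cans in the population

ybar_hat = t_hat / K          # estimated mean fragments per can
se_ybar = v_t ** 0.5 / K      # SE(ybar-hat) = SE(t-hat) / K
print(round(ybar_hat, 3), round(se_ybar, 4))
```

This reproduces ȳ̂ = 3.611 and SE(ȳ̂) = .6147 as stated above.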
From the analysis of variance, MSB_data = 13.808 and MSW_data = 4.53, with the corresponding sums of squares as given. 4.53 is the unbiased estimate of MSW. The estimate of MSB is MSB_E = (13.808 − (1 − 3/24) ∗ 4.53) ∗ (24/3) = 78.77. This is quite different from MSB_data = 13.808. Using the expression in part d) of problem 5-11 (see also page 41 of the notes) we can calculate V̂(ȳ̂) = .3779. Our estimate of the variance of the mean per can, if we took a general m and n, is (see (5.34))

V(ȳ̂) = (1 − n/580) ∗ 78.77/(24n) + (1 − m/24) ∗ 4.5277/(nm).

This can be used to explore the effect of varying n and m over a fixed total number nm of ssu's in the sample, or over a fixed cost C = nc1 + c2nm. For the latter the (approximately) optimal m is found using the result on page 156 of the text. See the variances at the end for different combinations of m and n with mn = 36.

Source           DF  Sum of Squares  Mean Square  F Value
Model            11  151.8888889     13.8080808   3.05
Error            24  108.6666667     4.5277778
Corrected Total  35  260.5555556

surveymeans with clustering on case

                      Std Error
Variable   Mean       of Mean    95% CL for Mean
worms      3.611111   0.612880   2.26217092  4.96005130

Variable   Std Dev      95% CL for Sum
worms      8531.292180  31489.4192  69043.9142

survey means stratified on plot.
N case Obs N 1 3 3 2 3 3 Mean 4.3333333 3.3333333 32 Std Dev 3.0550505 1.1547005 Pr > F 0.0108 Sum 50267 Sum 13.0000000 10.0000000 3 4 5 6 7 8 9 10 11 12 Variable worms 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1.0000000 5.0000000 7.0000000 3.3333333 3.6666667 1.6666667 5.0000000 2.3333333 6.6666667 0 Mean 3.611111 Variable worms Std Error of Mean 0.331738 Std Dev 95.540567 THAT 50266.667 VTHAT1 VTHAT2 72782946 441186.67 Source Model Error Corrected Total 1.0000000 1.7320508 2.6457513 3.5118846 2.3094011 1.5275252 2.0000000 1.5275252 2.5166115 0 3.0000000 15.0000000 21.0000000 10.0000000 11.0000000 5.0000000 15.0000000 7.0000000 20.0000000 0 95% CL for Mean 2.92643736 4.29578486 95% CL for Sum 842.813961 1237.18604 Sum 1040.000000 VTHAT SETHAT 73224133 8557.1101 The ANOVA Procedure Sum of DF Squares Mean Square 11 151.8888889 13.8080808 24 108.6666667 4.5277778 35 260.5555556 F Value 3.05 Estimated variance of ybhar-hat over different n and m with nm = 36. n 12 9 6 3 m VYBAR 3 0.3778964 4 0.463825 6 0.6356822 12 1.1512537 33 Pr > F 0.0108 15 Sampling with PPS. Illustration of sampling with probability proportional to size for a one-stage sample using the Ag data for Alabama with the cluster being a county and the ssu’s being farms (in 92) in the county. Objective is to estimate total acres in 92, which for the purposes here we pretend we have only for the sampled counties, while the number of farms is assumed to be known beforehand for all counties. option ls=80 nodate; /* read in data from agpop.txt. Note the :$25. format to handle the county name */ data a; infile ’agpop2.txt’ dlm=’,’ firstobs=2; input COUNTY :$25. 
STATE $ ACRES92 ACRES87 ACRES82 FARMS92 FARMS87 FARMS82 LARGEF92 LARGEF87 LARGEF82 SMALLF92 SMALLF87 SMALLF82 REGION $; if state=’AL’; /* convert missing values */ array numval[*] _numeric_; do i = 1 to dim(numval); if numval[i]=-99 then numval[i]=.; end; keep county state farms92 acres92; run; proc means mean sum n; var farms92 acres92; run; proc surveyselect data=a out=cout method=pps sampsize=4 jtprobs; size farms92; run; proc print data=cout; run; proc surveymeans data=cout mean sum; var acres92; weight SamplingWeight; run; IML CODE TO DO ANALYSIS. SEE POSTED FILE. The MEANS Procedure Variable Mean Sum N ---------------------------------------------FARMS92 565.7462687 37905.00 67 ACRES92 126131.69 8450823.00 67 ---------------------------------------------- 34 Obs 1 2 3 4 Obs 1 2 3 4 The SURVEYSELECT Procedure Selection Method PPS, Without Replacement Size Measure FARMS92 Input Data Set A Random Number Seed 83578 Sample Size 4 Output Data Set COUT Selection COUNTY STATE ACRES92 FARMS92 Prob RUSSELL COUNTY AL 112620 213 0.022477 DALE COUNTY AL 134555 403 0.042527 PICKENS COUNTY AL 106206 404 0.042633 BALDWIN COUNTY AL 167832 941 0.099301 Unit JtProb_1 JtProb_2 JtProb_3 JtProb_4 1 0 .000733392 .000735212 .001712462 2 .000733392 0 .001399373 .003259432 3 .000735212 .001399373 0 .003267641 4 .001712462 .003259432 .003267641 0 Sampling Weight 44.4894 23.5143 23.4561 10.0704 The SURVEYMEANS Procedure Number of Observations 4 Sum of Weights 101.530173 Std Error Variable Mean of Mean Sum Std Dev -----------------------------------------------------------------------ACRES92 121695 8826.356920 12355673 2831163 -----------------------------------------------------------------------T PI JPROB 112620 0.0224772 0 0.0007334 0.0007352 0.0017125 134555 0.0425274 0.0007334 0 0.0013994 0.0032594 106206 0.0426329 0.0007352 0.0013994 0 0.0032676 167832 0.0993009 0.0017125 0.0032594 0.0032676 0 THT 12355673 SE1 SE VAR1 VAR2 VAR 6529858 3161750.2 4.2639E13 -3.264E13 9.9967E12 
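The Horvitz-Thompson total computed in the IML step can be reproduced from the printed selection probabilities; a Python sketch (illustrative only; the notes use SAS IML, and small discrepancies come from the rounded probabilities in the listing):

```python
# ACRES92 values and PPS selection probabilities for the 4 sampled counties
# (copied from the surveyselect output above).
acres = [112620, 134555, 106206, 167832]
probs = [0.022477, 0.042527, 0.042633, 0.099301]

# Horvitz-Thompson estimator of total 1992 acreage: sum of y_i / pi_i.
t_ht = sum(y / p for y, p in zip(acres, probs))
print(round(t_ht))
```

The result agrees with the printed THT = 12355673 up to rounding of the probabilities.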
Continue with pps sampling for the ag data. The program below simulates pps sampling by using surveyselect with reps=200. This generates 200 pps samples. These are analyzed in surveymeans for each rep and then an iml program, gets the HT estimator, the estimated standard error that would come from SAS and the estimated SE from our general approach for using the HT estimator. Note that the latter approach can lead to a 0 variance in which case the SE is set to 0. The full program is on the web. option ls=80 nodate; %macro simpps(nreps,ss); 35 READ DATA ETC. AS BEFORE proc surveyselect data=a out=cout method=pps sampsize=4 jtprobs reps=&nreps; size farms92; run; *proc print data=cout; run; proc surveymeans data=cout sum total=67; var acres92; weight SamplingWeight; by replicate; run; IML program omitted here. It creates a temporary SAS file with the estimate, variances and se’s for each rep. proc print data=new; run; proc means n mean std var; var tht se var sasvar sasse; run; data se0; set new; if se=0; proc print; run; %mend; %simpps(200,4); -------------------------- Sample Replicate Number=1 --------------------------The SURVEYMEANS Procedure Variable Sum Std Dev ---------------------------------------ACRES92 5698397 1355252 ----------------------------------------------------------------- Sample Replicate Number=2 --------------------------Variable Sum Std Dev ---------------------------------------ACRES92 7885011 2697267 ---------------------------------------Obs 1 2 Obs 1 2 THT 5698397.48 7885010.83 VAR 1249061061641 6312328042973 SE1 SE VAR1 2903953.32 1117614.00 8.4329E12 4486693.74 2512434.68 2.013E13 SASVAR SASSE 1.8367089E12 1355252.34 7.2752503E12 2697267.19 From proc means on the simulation results. 
Variable N Mean 36 Std Dev VAR2 -7.1839E12 -1.3818E13 Variance --------------------------------------------------------------THT 200 8389386.06 2233962.33 4.9905877E12 SE 200 1978946.91 1264255.14 1.5983411E12 VAR 200 5.4666143E12 5.779498E12 3.3402597E25 SASVAR 200 5.0891309E12 5.264758E12 2.7717677E25 SASSE 200 1958429.19 1122490.94 1.2599859E12 --------------------------------------------------------------There were 18 cases where the VAR = estimated variance of the HT estimator was negative leading to an SE of 0. A run with 1000 simulations was done leading to: Variable N Mean Std Dev Variance ---------------------------------------------------------------THT 1000 8455585.13 2277713.11 5.187977E12 SE 1000 1979823.45 1191698.30 1.4201448E12 VAR 1000 5.3011582E12 5.3527862E12 2.8652321E25 SASVAR 1000 4.9047734E12 5.0047122E12 2.5047145E25 SASSE 1000 1925196.50 1095258.61 1.1995914E12 ---------------------------------------------------------------72 of the 1000 cases lead to an SE of 0. With 5000 reps. Variable N Mean Std Dev Variance ---------------------------------------------------------------THT 5000 8514163.91 2370980.32 5.6215477E12 SE 5000 1995240.18 1189328.22 1.4145016E12 VAR 5000 5.3617235E12 5.3534185E12 2.8659089E25 SASVAR 5000 4.8888844E12 4.9823656E12 2.4823967E25 SASSE 5000 1923108.90 1091226.25 1.1907747E12 ---------------------------------------------------------------357 of the cases had SE=0 In the class notes it is shown that the SAS approach for variance and SE in this case is reasonable if n/N is small, the selection probabilities are small and the sampling of psu’s is like with replacement. With a sample of 4 psu’s out of 67 and the selection probabilities here, it looks like this is working fairly well. It does not suffer from the problems that the unbiased estimator of V ( t̂HT ) has of sometimes being negative. 37 16 Bootstrapping the duck data. 
See Section 6 for the analysis of the ratio of means and the ratio estimates of ȳ_u and t, where we used the delta method and the jackknife to get standard errors. Here we bootstrap, using 10,000 bootstrap samples. First we generate the samples by sampling 14 times with replacement from the original n = 14 observations. Then we create a population of 140 by replicating the sample 10 times, and each bootstrap sample consists of a simple random sample without replacement of 14 out of the 140. The b at the end of a variable name indicates a value computed from the bootstrap sample: xbarb = x̄, ybarb = ȳ, bhatb = ȳ/x̄, yubhatb = ratio estimate of ȳ_u, thatb = ratio estimate of t, thatbunb = unbiased estimator of t.

bootstrapping the duck data with replacement

Variable   N       Mean        Std Dev    Minimum      Maximum
XBARB      10000   17.3222507  1.6596200  10.1214286   21.8928571
YBARB      10000   21.0941929  7.7781307  2.6428571    60.0000000
BHATB      10000   1.2275736   0.4644919  0.1637893    3.2814852
YUBHATB    10000   22.8500553  8.6460525  3.0487738    61.0815655
THATB      10000   3199.06     1210.47    426.8348827  8551.55

bootstrapping the duck data without replacement

Variable   N       Mean        Std Dev    Minimum      Maximum
XBARB      10000   17.3140521  1.5750321  11.2857143   21.7428571
YBARB      10000   20.9071143  7.3492109  4.6428571    54.5000000
BHATB      10000   1.2171495   0.4413816  0.2357635    3.4400361
YUBHATB    10000   22.6560202  8.2158764  4.3885020    64.0328314
THATB      10000   3171.89     1150.24    614.3997098  8964.73
THATBUNB   10000   2927.00     1028.89    650.0000000  7630.00

17 Two-phase ratio estimation using data from Nurses Health study.

Here we use an initial sample of 173 nurses. This was actually a validation data set used to estimate the relationship between measures of diet intake using x = FFQ (food frequency questionnaire) and y = diet record. We will work with saturated fat intake, with variables fqsat and drsat. We'll suppose that our objective is to estimate the population mean ȳ_u, defined in terms of the diet record.
We'll treat the n1 = 173 as an SRS from 1000 and suppose that fqsat is measured on all 173 and then drsat is measured on a random subsample of size n2 = 50. We examine using a two-phase ratio estimator. Since we took an SRS of 173 and then a sub-SRS of 50, the final sample is like an SRS of size 50 from N. (Why?) If we analyze just the 50 we get ȳ̂_u = 24.6152 and SE = 0.9131477. Using the summary statistics below we have x̄1 = 21.92, x̄2 = 20.486, ȳ2 = 24.615 and s_y = 6.6247 (the sample standard deviation of y from the 50 stage-2 values).

With SRS's and the notation on page 54 of the notes, π_i = n1/N and θ_i = n2/n1, so their product is n2/N. This leads to t̂_y^(2) = N ȳ2, t̂_x^(1) = N x̄1 and t̂_x^(2) = N x̄2, so the two-phase ratio estimator of t_y is

t̂_yr^(2) = N x̄1 (ȳ2/x̄2) = 26332.96

and of the mean is

ȳ̂_ur = t̂_yr^(2)/1000 = 26.33.

We can estimate the variance of t̂_yr^(2) as described in exercise 12.2. This leads to V̂(t̂_yr^(2)) = 987213.17 with SE = 993.58602 and SE(ȳ̂_ur) = 0.993586.

Notice that you could argue for estimating the variance more directly. We can write

t̂_yr^(2) = (N/n1) n1 x̄1 (ȳ2/x̄2) = (N/n1) t̂2,

where t̂2 is the estimate of the total of the y's at the first stage obtained by applying ratio estimation at the second stage. Given stage 1, we know E(t̂_yr^(2) | stage 1) = N ȳ1 and we know how to obtain V(t̂2 | stage 1) (see page 68 of the book). The unconditional variance is then V(N ȳ1) + (N/n1)² E(V(t̂2 | stage 1)). We can estimate E(V(t̂2 | stage 1)) using V̂(t̂2 | stage 1) from a ratio analysis at the second stage, and V(N ȳ1) with N²(1 − n1/N)s²_y/n1. Combined this gives exactly the estimator in exercise 12.2.
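The point estimates above follow directly from the summary statistics; a quick Python check (illustrative only, not part of the original SAS run):

```python
# Two-phase ratio estimate of the drsat total and mean, using the summary
# statistics printed in this section; N = 1000 is the supposed population size.
N = 1000
xbar1 = 21.9156069   # mean fqsat over the n1 = 173 phase-1 sample
xbar2 = 20.4860000   # mean fqsat over the n2 = 50 subsample
ybar2 = 24.6152000   # mean drsat over the n2 = 50 subsample

t_yr = N * xbar1 * (ybar2 / xbar2)   # two-phase ratio estimator of the total
ybar_ur = t_yr / N                   # corresponding estimate of the mean
print(round(t_yr, 2), round(ybar_ur, 2))
```

This reproduces THAT = 26332.961 and YURHAT = 26.333 from the output, up to rounding.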
THAT       SETHAT     YURHAT     SEYURHAT
26332.961  993.58603  26.332961  0.993586

statistics for whole sample

Variable   N    Mean        Std Dev    Std Error
fqsat      173  21.9156069  9.2753950  0.7051952
drsat      50   24.6152000  6.6246695  0.9368697

statistics for validation sample

Variable   N    Mean        Std Dev    Std Error
fqsat      50   20.4860000  6.6459443  0.9398785
drsat      50   24.6152000  6.6246695  0.9368697

Notice that we have not really gained anything with the ratio approach here. The standard error for the ratio estimator of ȳ_u is actually bigger than from just using the SRS of 50. A better approach would be to consider a regression estimator, especially since it doesn't appear that a regression through the origin is appropriate. Even if it were, the ratio approach is not always best.

18 Coverage and Future Topics

I have tried to strike a balance between applications and doing enough theory for you to work and read on your own in the future.

• More on model-based approaches. Often a useful approach. Based on regression and related techniques.

• More applications to complex surveys. You have all the tools to handle almost any scenario.

• More on variance estimation techniques for handling non-linear functions, including estimating means using ratio estimators and (later) estimating regression coefficients and other non-linear functions of totals.

• More complicated non-response models (e.g., callbacks).

• Randomized response for sensitive questions. Easy for you to read on your own.

• Estimating regression models (linear and non-linear) and categorical data analysis using data from complex surveys. This often involves non-linear functions of estimated totals. This is one place where model-based approaches also come into play. In general the modeling issues can be a bit subtle.

• Measurement error in a quantitative variable or misclassification of a categorical variable. ("Measurement Error in Surveys", Biemer et al.)
Contents

1 A few motivating examples. 1
2 Example 1: Estimating egg mass and defoliation means with an SRS. 2
3 Simulations of estimated mean and CI with SRS 4
4 Example 2: Estimating a proportion with an SRS. 6
5 Simulation in estimating a proportion with an SRS 7
6 Duck Example: Estimating population size with variable size units using ratio and regression estimators. 11
7 Allocation in Stratified Sampling: Example using the agricultural farm data. 15
8 Stratified Sampling in Surveymeans: Estimating egg mass densities with stratified samples. 17
9 Single stage cluster example. Ice Cream data. 20
10 Ice Cream data with stratification and cluster sampling. 24
11 Ozone levels and systematic sampling 27
12 Clarification on page 31 and model/design based approach with an SRS 29
13 Analysis of the egg-defoliation data as a two-stage cluster sample 30
14 Two-stage cluster sampling; worm fragments in cans of corn. 32
15 Sampling with PPS. 34
16 Bootstrapping the duck data. 38
17 Two-phase ratio estimation using data from Nurses Health study. 39
18 Coverage and Future Topics 40