Improved Variance Estimation for Fully Synthetic Datasets
UNECE Work Session on Statistical Data Confidentiality
27 October 2011, Tarragona
Jörg Drechsler, Institute for Employment Research

Fully synthetic datasets
- Originally proposed by Rubin (1993)
- Closely related to the idea of multiple imputation for nonresponse
- All values of the original dataset are replaced by synthetic values
- Offer a very high level of data protection
- Attractive for very sensitive data such as healthcare data

Fully synthetic datasets in theory
[Figure: schematic with panels labelled X, Y not observed, Y observed, and several copies of Y synthetic]

Fully synthetic datasets in practice
- Under the original design, the synthetic populations consist of a large number of synthetic records and a small number of original records
- There is a small chance that the samples released from these populations also contain original records
- In that case the main advantage of fully synthetic datasets is lost
- In practice, the intermediate step of generating populations is omitted
- Synthetic samples are generated directly
- All records are synthetic

Combining rules for fully synthetic datasets
- Raghunathan et al. (2003) developed the combining rules necessary to obtain valid inferences from fully synthetic datasets
- Let $q_i$ be the point estimate obtained from dataset $i$, $i = 1, \dots, m$
- Let $u_i$ be the estimated variance of $q_i$
- The following quantities are needed for inference:

\[ \bar{q}_m = \sum_{i=1}^{m} q_i / m \]
\[ b_m = \sum_{i=1}^{m} (q_i - \bar{q}_m)^2 / (m - 1) \]
\[ \bar{u}_m = \sum_{i=1}^{m} u_i / m \]

Combining rules for fully synthetic datasets (continued)
- Final point estimate:

\[ \bar{q}_m = \frac{1}{m} \sum_{i=1}^{m} q_i \]

- Final variance estimate:

\[ T_f = (1 + 1/m)\, b_m - \bar{u}_m \]

- Two major disadvantages:
  - The variance estimate is strictly valid only for the original synthesis design
  - The variance estimate can be negative
- Reiter (2003) suggested an adjusted variance estimate that is always positive but conservative

Alternative variance estimate
- Closely related to the variance estimate for partially synthetic datasets
- Only need to adjust for the potentially different sample sizes of the original sample ($n_{org}$) and the synthetic sample ($n_{syn}$):

\[ T_{alt} = \frac{n_{syn}}{n_{org}}\, \mathrm{fpc}_{org}\, \bar{u}_m + b_m / m, \]

  where $\mathrm{fpc}_{org}$ is the finite population correction factor for the original sample
- Advantages:
  - Can never be negative
  - Valid even if all records are synthesized
- Disadvantages:
  - Only valid for $\sqrt{N}$-consistent estimators
  - Only valid under simple random sampling

Illustrative simulations
- Repeated simulation design
- One standard normal variable $Y$
- Population size $N = 10{,}000$
- Repeatedly draw simple random samples (SRS) of different sizes (1%, 5%, 10%, 20%)
- Generate two versions of synthetic data with $n_{syn} = 2 n_{org}$ and $m = 5, 20, 100$:
  - Based on the original synthesis design (RRR approach)
  - Synthesizing all records directly (practical approach)
- Quantity of interest: $\bar{Y}$
- Compute the variance estimates $T_f$ and $T_{alt}$ under both synthesis designs
- Replicate 5,000 times

Conclusions
- The originally proposed variance estimate can be biased if all records are synthesized and the sampling rate is larger than 1%
- The alternative variance estimate
  - shows less variability than the original variance estimate
  - can never be negative
  - is always unbiased irrespective of the synthesis design
- The alternative variance estimate is valid only for $\sqrt{N}$-consistent estimates under simple random sampling
- Future work: think about adjustments for complex sampling designs

Thank you for your attention
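
The combining rules from the slides can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function names (`combine`, `t_f`, `t_alt`, `srs_fpc`) are hypothetical, the alternative estimate is assumed to read T_alt = (n_syn / n_org) * fpc_org * u_bar + b_m / m, and the helper `srs_fpc` assumes the usual simple random sampling correction 1 - n_org / N, which the slides do not spell out.

```python
from statistics import mean, variance


def combine(q, u):
    """Return (q_bar, b_m, u_bar) from m point estimates q and their variances u."""
    if len(q) != len(u) or len(q) < 2:
        raise ValueError("need m >= 2 matching point and variance estimates")
    # statistics.variance divides by m - 1, matching the definition of b_m.
    return mean(q), variance(q), mean(u)


def t_f(q, u):
    """Original estimate T_f = (1 + 1/m) * b_m - u_bar; it can be negative."""
    m = len(q)
    _, b_m, u_bar = combine(q, u)
    return (1 + 1 / m) * b_m - u_bar


def t_alt(q, u, n_syn, n_org, fpc_org=1.0):
    """Alternative estimate (n_syn / n_org) * fpc_org * u_bar + b_m / m.

    Every term is non-negative, so the result cannot be negative.
    """
    m = len(q)
    _, b_m, u_bar = combine(q, u)
    return (n_syn / n_org) * fpc_org * u_bar + b_m / m


def srs_fpc(n_org, pop_size):
    """Assumed form of the finite population correction under SRS."""
    return 1 - n_org / pop_size


# Made-up numbers for m = 3 synthetic datasets whose point estimates are
# close together, so b_m is small relative to u_bar and T_f turns negative.
q = [2.0, 2.1, 1.9]
u = [0.5, 0.5, 0.5]
fpc = srs_fpc(n_org=100, pop_size=10_000)

print("T_f   =", t_f(q, u))
print("T_alt =", t_alt(q, u, n_syn=200, n_org=100, fpc_org=fpc))
```

With these inputs the between-dataset variance is far smaller than the average within-dataset variance, so `t_f` returns a negative value while `t_alt` stays positive, which is the behaviour the slides describe.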