Folie 1

Improved Variance Estimation for Fully
Synthetic Datasets
UNECE Work Session on Statistical
Data Confidentiality
27. October 2011, Tarragona
Jörg Drechsler
Institute for
Employment Research
Fully synthetic datasets
 Originally proposed by Rubin (1993)
 Closely related to the idea of multiple imputation for nonresponse
 All values of the original dataset are replaced by synthetic values
 Offer a very high level of data protection
 Attractive for very sensitive data such as healthcare data
2
Fully synthetic datasets in theory
X
Ynot observed
YYsynthetisch
synthetisch
Y
synthetisch
Y
Y synthetisch
synthetic
Yobserved
3
Fully synthetic datasets in practice
 Based on the original design, the synthetic populations consist of a large
number of synthetic records and a small number of original records.
 There is a small chance that the released samples from these
populations also contain original records.
 Main advantage of fully synthetic datasets is lost
 In practice, intermediate step of generating populations is omitted
 Synthetic samples are generated directly
 All records are synthetic
4
Combining rules for fully synthetic datasets
 Raghunathan et al. (2003) developed the combining rules necessary to
obtain valid inferences from fully synthetic datasets
 Let q i be the point estimate obtained from dataset i, i  1,... m
 Let u i be the estimated variance of q i
 The following quantities are needed for inference
1 m
qm   qi / m
m i 1
bm   (qi qm ) 2 /( m  1)
m
u m   ui / m
i 1
5
Combining rules for fully synthetic datasets
 Final point estimate
1
qm  i qi
m
 Final variance estimate
T f  (1  1 / m)bm  u m
 Two major disadvantages:
 Variance estimate strictly valid only for the original synthesis design
 Variance estimate can be negative
 Reiter (2003) suggested an adjusted variance estimate that is always
positive but conservative
6
Alternative variance estimate
 Closely related to the variance estimate for partially synthetic datasets
 Only need to adjust for the potentially different sample sizes between the
original sample and the synthetic sample
nsyn
Talt   org
u m  bm / m,
norg
where  org is the finite population correction factor for the original sample
 Advantages
 Can never be negative
 Valid even if all records are synthesized
 Disadvantages:
 Only valid for N - consistent estimators
 Only valid under simple random sampling
7
Illustrative simulations
 Repeated simulation design
 One standard normal variable Y
 Population size N=10,000
 Repeatedly draw SRS of different sizes (1%, 5%, 10%, 20%)
 Generate two versions of synthetic data with nsyn=2norg and m=5,20,100
 Based on original synthesis design (RRR approach)
 Synthesizing all records directly (practical approach)
 Quantity of interest Y
 Compute the variance estimates T f and Talt under both synthesis designs
 Replicate 5,000 times
8
9
10
11
Conclusions
 Originally proposed variance estimate can be biased if all records are
synthesized and the sampling rate is larger than 1%.
 Alternative variance estimate
 shows less variability than the original variance estimate
 can never be negative
 is always unbiased irrespective of the synthesis design
 Alternative variance estimate is valid only
 for N –consistent estimates
 under simple random sampling
 Future work: Think about adjustments for complex sampling designs
12
Thank you for your attention