A statistical approach to surrogate data
Li-Chun Zhang
Statistics Norway
E-mail: [email protected]

A setting of surrogate data
• Target data
  – Directly collected, such as in sample surveys
  – For a subset of the population, if available
• Surrogate data
  – To replace target data for statistical purposes, hence “surrogate”
  – For reasons such as cost, burden, scope, etc.
• Secondary data by nature:
  – Re-use of data collected for other purposes
  – Typically from multiple sources
  – Often for the entire population or a major part of it
• Two examples:
  – Register-based employment statistics (surrogate) & LFS (target)
  – Self-administered census (surrogate) & post-census survey (target)
• Issues of concern:
  – Conditions for valid substitution
  – Associated statistical accuracy

Unit-specific approach: Equality vs. equivalence
• Unit-specific approach
  – Scheme
      Link surrogate and target data at the micro level
      Estimate the relevant unit-specific misclassification rates
      Propagate the uncertainty to the statistics of interest
  – Two shortcomings
      Requires micro-level linkage
      Unit-specific consistency may be irrelevant or misleading for users
• Equality vs. equivalence
  – An example
  – Two binary data sets of the same size
  – Equal mean without the values being equal for all the units
  – Identical empirical CDF => identical statistical inference
  – Unit-level inequality may fail to reveal statistical equivalence

Other relevant situations
• Some settings
  – Indirect (proxy) interview
  – Unstable reporting
  – Mode effects
  – Public micro data
• Some observations
  – Unit-specific approach may not be applicable
  – Unit-specific equality may not even be desirable
  – To use surrogate data (Z) in place of target data (Y), together with additional data (X)
      Joint distribution of (Z,Y|X) is not of primary interest for users
      Distribution of (Z,X) instead of distribution of (Y,X) is the issue

Validity and equivalence
• Valid surrogate data
  – Denote by f(x,y) and f(x,z) the distribution functions of (X,Y) and (X,Z):
      f(X=x, Z=y) = f(X=x) f(Z=y | X=x) = f(X=x) f(Y=y | X=x) = f(X=x, Y=y)
  – Example: X = age-sex grouping, Z = register-employment status, Y = LFS-employment status according to the ILO definition
  – Example: Z = proxy interview in LFS, Y = direct interview in LFS
  – Equality of distribution can be assessed without linked / linkable data
• Empirically equivalent surrogate data
  – Denote by p(x,z; s1) and p(x,y; s2) the empirical distribution functions:
      p(X=x; s1) p(Z=y | X=x; s1) = p(X=x; s2) p(Y=y | X=x; s2)
  – Equality on the micro level is not necessary & s1 may differ from s2
  – Parametric analogy: statistical equivalence by the Sufficiency Principle

Similar ideas in disclosure control literature
• Fienberg et al. (1998)
  – Random generation of “pseudo” micro data Z conditional on {x; s}
  – Parametric f(y | x) or empirical p(y | x)
  – Conditional validity in expectation, provided unbiased estimation
• Rubin (1993)
  – Synthetic data & Bayesian multiple-imputation framework
  – Random generation of population data + sampling
  – No particular emphasis on conditioning & validity instead of equivalence
• SARs
  – Sample of Anonymised Records from census data
  – Real data, albeit anonymised
  – Valid surrogate data
• Micro simulation
  – Based on sample instead of census data
  – Random generation of “imaginary” micro data
  – Validity in expectation, provided unbiased estimation of the distribution

Some applications / implications?
• Statistics and inference based on surrogate data
  – Validity (or equivalence) vs. efficiency (or accuracy)
  – Example: Employment register (ER) vs. LFS
      Deterministic ER-status by editing rules vs. valid ER-status for specific purposes
      Bias of invalid ER-status vs. variance of valid ER-status
      Balance in the trade-off may change direction on more detailed levels
• Micro data for public use
  – Targeting full empirical equivalence, followed by disclosure control (DC)
  – Equivalent data targeted at coarsened information (embedded DC)
• Micro calibration of surrogate data
  – Secondary population (U) data (X, Z1, Z2, …, Zk; U) & target sample data (X, Y1; s1), (X, Y2; s2), …, (X, Yk; sk), with different units in general
  – Surrogate data (X*, Z1*, Z2*, …, Zk*) with marginal validity between (X; U) and (X*; U), (X*, Z1*; U) and (X, Y1; s1), …, (X*, Zk*; U) and (X, Yk; sk)
  – Conditional surrogate data (X, Z1*, Z2*, …, Zk*) with marginal validity between (X, Z1*; U) and (X, Y1; s1), …, (X, Zk*; U) and (X, Yk; sk)?
  – Alternative to statistical matching by the Conditional Independence Assumption
  – Uncertainty?
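
A small illustration of the equality-versus-equivalence example and of the empirical equivalence condition p(x,z; s1) = p(x,y; s2) may help here. The Python sketch below is not part of the presentation: all data values, the helper name `empirical_joint`, and the variable names are hypothetical and chosen only for illustration. It first compares two binary data sets of the same size at the unit level and at the distribution level, then compares the empirical joint distributions of two unlinked samples of different sizes.

```python
from collections import Counter
from fractions import Fraction


def empirical_joint(pairs):
    """Return the empirical joint distribution {(x, v): relative frequency}."""
    n = len(pairs)
    counts = Counter(pairs)
    return {cell: Fraction(c, n) for cell, c in counts.items()}


# Equality vs. equivalence: two binary data sets of the same size.
# The values are made up for illustration.
z_units = [1, 0, 1, 0, 1, 0]
y_units = [0, 1, 1, 1, 0, 0]
unit_equal = all(z == y for z, y in zip(z_units, y_units))
same_distribution = Counter(z_units) == Counter(y_units)

# Empirical equivalence across two different, unlinked samples.
# s1 holds surrogate records (x, z); s2 holds target records (x, y).
# x is a hypothetical grouping variable; the values are made up.
s1 = [("young", 1), ("young", 0), ("old", 1), ("old", 1),
      ("young", 1), ("young", 0), ("old", 1), ("old", 1)]
s2 = [("old", 1), ("young", 0), ("young", 1), ("old", 1)]

p_xz = empirical_joint(s1)  # p(X=x, Z=y; s1)
p_xy = empirical_joint(s2)  # p(X=x, Y=y; s2)
equivalent = p_xz == p_xy

print("unit-level equality:", unit_equal)
print("same empirical distribution:", same_distribution)
print("empirically equivalent:", equivalent)
```

The sketch uses exact fractions so that the two empirical distributions can be compared cell by cell without floating-point tolerance. No record linkage between s1 and s2 is involved at any point, which is the property the slides emphasise; how to judge near-equality under sampling variation is left open, in line with the "Uncertainty?" item above.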