A statistical approach to surrogate data

Li-Chun Zhang
Statistics Norway
E-mail: [email protected]
A setting of surrogate data
• Target data
– Directly collected, such as in sample surveys
– For a subset of the population, if available
• Surrogate data
– To replace target data for statistical purposes, hence “surrogate”
 For reasons such as cost, burden, scope, etc.
– Secondary in nature: re-use of data collected for other purposes
– Typically from multiple sources
– Often for the entire population or a major part of it
– Two examples:
 Register-based Employment statistics (surrogate) & LFS (target)
 Self-administered census (surrogate) & post-census survey (target)
Issues of concern:
– Conditions for valid substitution
– Associated statistical accuracy
Unit-specific approach: Equality vs. equivalence
• Unit-specific approach
– Scheme
 Link surrogate and target data at the micro level
 Estimation of relevant unit-specific misclassification rates
 Propagation of uncertainty to statistics of interest
– Two shortcomings
 Requires micro-level linkage
 Unit-specific consistency may be irrelevant or misleading for some uses
• Equality vs. equivalence – An example
– Two binary data sets of the same size
– Equal mean without the values being equal for all the units
– Identical empirical CDF => identical statistical inference
– Inequality may fail to reveal statistical equivalence
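The equality vs. equivalence example can be illustrated with a minimal Python sketch. The two lists are hypothetical values made up for illustration, not data from either of the sources named earlier. The unit-level agreement rate is low, yet the empirical distributions coincide.

```python
from collections import Counter

# Hypothetical binary values for the same five units in two data sets.
target = [1, 0, 1, 0, 1]
surrogate = [0, 1, 1, 1, 0]

# Unit-level comparison: share of units whose two values agree.
agreement = sum(y == z for y, z in zip(target, surrogate)) / len(target)

# Distribution-level comparison: do the value frequencies coincide?
same_distribution = Counter(target) == Counter(surrogate)

print(agreement)
print(same_distribution)
```

Here only one unit in five agrees, while both data sets contain three ones and two zeros, so any statistic that depends on the data only through the empirical distribution takes the same value in both.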
Other relevant situations
• Some settings
– Indirect (proxy) interview
– Unstable reporting
– Mode effects
– Public micro data
• Some observations
– Unit-specific approach may not be applicable
– Unit-specific equality may not even be desirable
– To use surrogate data (Z) in place of target data (Y), together with additional data (X)
 Joint distribution of (Z,Y|X) is not of primary interest for users
 Distribution of (Z,X) instead of distribution of (Y,X) is the issue
Validity and equivalence
• Valid surrogate data
– Denote by f(x,y) and f(x,z) the distribution functions:
f(X=x, Z=y) = f(X=x) f(Z=y | X=x) = f(X=x) f(Y=y | X=x) = f(X=x, Y=y)
– Example: X = age-sex grouping, Z = register-employment status, Y = LFS-employment status according to the ILO definition
– Example: Z = proxy-interview in LFS, Y = direct-interview in LFS
– Equality of distribution can be assessed without linked / linkable data
• Empirically equivalent surrogate data
– Denote by p(x,z; s1) and p(x,y; s2) the empirical distribution functions:
p(X=x; s1) p(Z=y | X=x; s1) = p(X=x; s2) p(Y=y | X=x; s2)
– Equality on micro level not necessary & s1 may differ from s2
– Parametric analogy: Statistical equivalence by Sufficiency Principle
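A minimal Python sketch of how empirical equivalence might be checked without linking units: the empirical joint distribution of (X, Z) in s1 is compared with that of (X, Y) in s2, cell by cell. The function names `empirical_joint` and `max_abs_difference` and the records are hypothetical. Exact equality is used only to keep the illustration simple; with real samples a tolerance or a formal test would presumably be needed, which is an assumption beyond the slides.

```python
from collections import Counter

def empirical_joint(records):
    """Relative frequency of each (x, value) cell in a list of records."""
    counts = Counter(records)
    n = len(records)
    return {cell: c / n for cell, c in counts.items()}

def max_abs_difference(p1, p2):
    """Largest absolute difference between two empirical distributions."""
    cells = set(p1) | set(p2)
    return max(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in cells)

# Hypothetical records: s1 holds (x, z), s2 holds (x, y).
# The two sets contain different units and differ in size.
s1 = [("young", 1), ("young", 0), ("old", 1), ("old", 1)]
s2 = [("old", 1), ("young", 0), ("old", 1), ("young", 1),
      ("old", 1), ("young", 0), ("old", 1), ("young", 1)]

p1 = empirical_joint(s1)
p2 = empirical_joint(s2)
print(max_abs_difference(p1, p2))
```

No unit identifier appears anywhere in the comparison, which is the point: only the cell frequencies of the two data sets are used.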
Similar ideas in disclosure control literature
• Fienberg et al. (1998)
– Random generation of “pseudo” micro data Z conditional on {x; s}
– Parametric f(y | x) or empirical p(y | x)
– Conditional validity in expectation, provided unbiased estimation
• Rubin (1993)
– Synthetic data & Bayesian multiple-imputation framework
– Random generation of population data + sampling
– No particular emphasis on conditioning & validity instead of equivalence
• SARs
– Sample of Anonymised Records from census data
– Real data, albeit anonymised
– Valid surrogate data
• Micro simulation
– Based on sample instead of census data
– Random generation of “imaginary” micro data
– Validity in expectation provided unbiased estimation of distribution
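The idea of generating pseudo micro data conditional on x from an empirical p(y | x) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reproduction of any of the cited methods: the helper names `fit_conditional` and `generate_surrogate`, the sample, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict

def fit_conditional(sample):
    """Group observed y values by x; a uniform draw from a group
    is a draw from the empirical p(y | x)."""
    by_x = defaultdict(list)
    for x, y in sample:
        by_x[x].append(y)
    return by_x

def generate_surrogate(x_values, by_x, rng):
    """Keep each x and attach a randomly generated value z given x."""
    return [(x, rng.choice(by_x[x])) for x in x_values]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [("a", 1), ("a", 0), ("a", 1), ("b", 0), ("b", 0)]  # hypothetical (x, y)
by_x = fit_conditional(sample)
pseudo = generate_surrogate(["a", "b", "a", "b"], by_x, rng)
print(pseudo)
```

The generated records keep the x values as given, and each z is drawn only from values observed for that x, so the conditional distribution of the pseudo data matches the empirical one in expectation.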
Some applications / implications?
• Statistics and inference based on surrogate data
– Validity (or equivalence) vs. efficiency (or accuracy)
– Example: Employment register (ER) vs. LFS
 Deterministic ER-status by editing rules vs. valid ER-status for specific purposes
 Bias of invalid ER-status vs. variance of valid ER-status
 Balance in trade-off may change direction on more detailed levels
• Micro data for public use
– Targeting full empirical equivalence followed by disclosure control (DC)
– Equivalent data targeted at coarsened information (embedded DC)
• Micro calibration of surrogate data
– Secondary population (U) data (X, Z1, Z2, …, Zk; U) & target sample data (X, Y1; s1), (X, Y2; s2), …, (X, Yk; sk), with different units in general
– Surrogate data (X*, Z1*, Z2*, …, Zk*) with marginal validity between (X; U) and (X*; U), (X*, Z1*; U) and (X, Y1; s1), …, (X*, Zk*; U) and (X, Yk; sk)
– Conditional surrogate data (X, Z1*, Z2*, …, Zk*) with marginal validity between (X, Z1*; U) and (X, Y1; s1), …, (X, Zk*; U) and (X, Yk; sk)?
– Alternative to statistical matching by Conditional Independence Assumption
– Uncertainty?
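The bias vs. variance trade-off in the ER example can be illustrated numerically with a rough Python sketch. Everything in it is an assumption made for illustration: the bias of the deterministic status is taken as constant across levels, the valid status is given the binomial variance p(1 − p)/n of a simple random sample, and the numbers (bias 0.02, p = 0.7, sample sizes 10000 and 100) are hypothetical.

```python
def mse_deterministic(bias):
    """Mean squared error of a biased estimator with negligible variance."""
    return bias ** 2

def mse_valid(p, n):
    """Mean squared error of an unbiased proportion estimate from n units."""
    return p * (1 - p) / n

def preferred(bias, p, n):
    """Name the option with the smaller mean squared error."""
    return "valid" if mse_valid(p, n) < mse_deterministic(bias) else "deterministic"

bias, p = 0.02, 0.7  # hypothetical values
aggregate_level = preferred(bias, p, n=10000)  # many sample units behind the estimate
detailed_level = preferred(bias, p, n=100)     # few sample units behind the estimate
print(aggregate_level, detailed_level)
```

Under these assumptions the valid status has the smaller error at the aggregate level, and the deterministic status has the smaller error at the detailed level, where the variance term dominates.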