
Perspective on User Needs for Government Data:
Where Do We Go From Here?

Natalie Shlomo
Social Statistics, School of Social Sciences
University of Manchester
[email protected]

Topics Covered
• Traditional forms of statistical outputs
• Disclosure risk and data utility
• Differential privacy / inferential disclosure
• Future dissemination strategies:
  • Table generating servers
  • Synthetic data
  • Remote access
  • Remote analysis
• Challenges and limitations

Traditional Statistical Outputs
• Survey Microdata
  • Social survey data are generally released via data archives for registered users
  • Business survey data have large sampling fractions (e.g. take-all strata) and highly skewed distributions, and are generally not released
• Tabular Data
  • Frequency tables
    • Census (whole population) counts with careful design of output variables
    • Weighted sample counts
  • Magnitude tables
    • Mainly for business statistics

Types of Disclosure Risk
• Identity disclosure
  Identification is widely referred to in confidentiality pledges and codes of practice, e.g. "…no statistics will be produced that are likely to identify an individual unless specifically agreed with them" (Principle 5 of the NS Code of Practice)
  Examples:
  Survey microdata – a respondent is identified through rare categories (population unique) and/or response knowledge
  Census tables – a small cell (1 or 2)

Types of Disclosure Risks
• Individual attribute disclosure
  Confidential information about a data subject is revealed and can be attributed to the subject
  Identity disclosure is a necessary condition for individual attribute disclosure
  Examples:
  Survey microdata – an individual is identified and survey target variables are learnt, e.g. health, income
  Census tables – a unique cell on the margin, i.e. structural zeros on the rows/columns

Types of Disclosure Risks
• Group attribute disclosure
  Confidential information is learnt about a group and may cause harm, e.g. all adults in a village claim unemployment benefit
  Examples:
  Survey microdata – difficult to find group attribute disclosure under survey conditions
  Census tables – caused by structural zeros, i.e. a row/column consists of all zeros except one cell

Types of Disclosure Risks
• Inferential disclosure
  Confidential information may be revealed exactly or to a close approximation
  Examples:
  Survey microdata – a good prediction model with high R²
  Census tables – disclosure by differencing
  This type of disclosure has been largely ignored!

Standard SDC Methods
• Survey Microdata from Social Surveys
  Identity disclosure is the main concern since it can lead to attribute disclosure
  Disclosure control methods are generally non-perturbative:
  • Deleting highly identifying variables (e.g. geography)
  • Recoding identifying variables (e.g. age, ethnicity)
• Magnitude Tables
  Attribute disclosure (since identities are likely known); the concern is dominance in a cell
  Disclosure control methods:
  • Table design
  • Cell suppression

Standard SDC Methods
• Census Tables
  Identity disclosure, attribute disclosure and disclosure by differencing
  Disclosure control methods:
  • Careful design of tables and threshold criteria
  • Fixed variables spanning tables to avoid differencing
  • In some countries, the long form is a sub-sample
  • Pre-tabular methods, e.g. record swapping
  • Post-tabular methods, e.g. forms of rounding

Inferential Disclosure (Differential Privacy)
• Differential privacy is based on disclosure about a target unit where the intruder has knowledge of the entire database except for the target unit itself
• There is no distinction between key variables and sensitive variables, between types of disclosure risk, or whether the data arise from a sample or a population
• Differential privacy is similar to the notion of disclosure by differencing, since in this case even a sum of counts or an average is disclosive

Inferential Disclosure (Differential Privacy)
Definition of differential privacy with respect to statistical databases (Dwork et al. 2006; Shlomo and Skinner 2012):

Assume a population database $X_U = \{x_1, x_2, \ldots, x_N\}$ from which a sample $X_s = \{x_1, x_2, \ldots, x_n\}$ is drawn

Assume the agency releases a set of counts $f = \{f_1, f_2, \ldots, f_K\}$, where $f_k = \sum_{i \in s} I(x_i = k)$

Assume the intruder knows the population database except for one target unit

Let $\Pr(f \mid X_U)$ denote the probability of $f$ with respect to an SDC mechanism, where $X_U$ is treated as fixed

Inferential Disclosure (Differential Privacy)
Then $\varepsilon$-differential privacy holds iff

$$\max \left| \ln \frac{\Pr(f \mid X_{U_1})}{\Pr(f \mid X_{U_2})} \right| \le \varepsilon$$

for $\varepsilon > 0$, where the maximum is taken over all possible pairs $(X_{U_1}, X_{U_2})$ which differ by only one unit, and over all possible vectors $f$

• Differential privacy is guaranteed by adding noise to all outputs
• The amount of noise depends on the number of units in the query but is independent of the data

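As a hedged aside, a minimal sketch of the standard Laplace mechanism, which is the usual way of adding data-independent noise to counts to satisfy $\varepsilon$-differential privacy; the function and parameter names below are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(2012)

def laplace_counts(counts, epsilon, sensitivity=1.0):
    """Return counts + Laplace noise with scale sensitivity/epsilon.

    One individual changes a cell count by at most 1, so adding
    Laplace(1/epsilon) noise to a count satisfies epsilon-DP.
    """
    scale = sensitivity / epsilon
    return np.asarray(counts, dtype=float) + rng.laplace(0.0, scale, size=len(counts))

f = [12, 3, 0, 45, 7]                 # a small vector of released counts
print(laplace_counts(f, epsilon=0.5))
```

Note that the noise scale depends only on $\varepsilon$ and the query's sensitivity, matching the bullet above: it is independent of the data values themselves.
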
Inferential Disclosure (Differential Privacy)
Does sampling and the release of microdata guarantee differential privacy (Shlomo and Skinner, 2012)? No!
• Let $f_k$ be a sample count and $F_k$ the corresponding population count
• It is assumed that the intruder knows everything in the population table except for one unit
• If $F_k = f_k$ and we move one of the counts of $F_k$ to another cell, then we may obtain $F_k < f_k$, which is impossible
• Sampling is not differentially private
• How likely is it to obtain $F_k = f_k$ in a sample? Usually 2-3% of cells
• Agencies will generally decide to allow the 'slippage' and issue the controlled release of microdata

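As an illustration (mine, not from the slides), a small simulation of how often a sample cell count equals its population count under simple random sampling; the population size, number of cells and skew below are all assumptions, so the resulting share need not match the 2-3% quoted above.

```python
import numpy as np

rng = np.random.default_rng(42)

N, n, K = 1_500_000, 15_000, 1_000   # population, 1% SRS, number of cells
# A skewed population table with many small cells (pure assumption)
F = rng.multinomial(N, rng.dirichlet(np.full(K, 0.1)))
cells = np.repeat(np.arange(K), F)   # cell label of every population unit

props = []
for _ in range(50):
    sample = rng.choice(cells, size=n, replace=False)   # draw an SRS
    f = np.bincount(sample, minlength=K)                # sample counts
    occupied = F > 0
    props.append(np.mean(F[occupied] == f[occupied]))   # share with F_k = f_k

print(f"Share of occupied cells with F_k = f_k: {np.mean(props):.3f}")
```
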
Inferential Disclosure (Differential Privacy)
Does perturbation guarantee differential privacy?
Assume a perturbation mechanism:

$$\Pr(\tilde{x}_i = j_1 \mid x_i = j_2) = M_{j_1 j_2}$$

Then the ratio in the definition contains the elements

$$\frac{\Pr(\tilde{x}_U \mid x_{U_1})}{\Pr(\tilde{x}_U \mid x_{U_2})} = \frac{\Pr(\tilde{x}_i \mid x_{i_1})}{\Pr(\tilde{x}_i \mid x_{i_2})} = \frac{M_{\tilde{x}_i, x_{i_1}}}{M_{\tilde{x}_i, x_{i_2}}}$$

If the perturbation mechanism does not have zero probabilities, then perturbation schemes are differentially private

Inferential Disclosure (Differential Privacy)
Examples of perturbation mechanisms:

Recoding:

$$M_{j_1 j_2} = \begin{cases} 1 & j_2 = 1, \ldots, a \text{ and } j_1 = 1, \text{ or } j_2 = a+1, \ldots, k \text{ and } j_1 = j_2 - a + 1 \\ 0 & \text{otherwise} \end{cases}$$

Random data swapping:

$$M_{j_1 j_2} = M_{j_2 j_1} = \frac{f_{j_1} f_{j_2}}{\binom{n}{2}}, \qquad M_{jj} = \frac{\binom{f_j}{2}}{\binom{n}{2}}$$

PRAM: a transition matrix $M$ of conditional perturbation probabilities

In practice we control the perturbation and add zeros to $M$ to ensure edits

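A minimal sketch of applying a PRAM-style transition matrix to a categorical variable; the 3-category matrix is invented for illustration and is not from the slides. Because every entry of this $M$ is positive, the ratio in the definition above is bounded, in line with the slide's argument.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented transition matrix for a 3-category variable:
# M[j1, j2] = Pr(released category j1 | true category j2); columns sum to 1
M = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])

def pram(x, M, rng):
    """Independently perturb each record's category according to M."""
    return np.array([rng.choice(M.shape[0], p=M[:, j]) for j in x])

x = rng.integers(0, 3, size=20)      # 'true' categories
print("original :", x)
print("perturbed:", pram(x, M, rng))
```
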
‘Safe Data’ vs ‘Safe Access’
•
•
In the last decade agencies are increasingly
concerned about breaches of confidentiality,
particularly with large number of open databases
that can be used to attack statistical data
Agencies are restricting access to data with more
stringent licensing and the use of on-site data labs
• How can we make statistical data more
available to users?
• Why aren’t agencies making more use of
‘modern’ dissemination strategies?
16
Future Dissemination Strategies
Census Tables
• On-line flexible table generation based on a web package
• Input data are frequency counts in a multi-dimensional hypercube with small geographical areas
• Disclosure risk measures and SDC methods applied 'on-the-fly'
• A set of rules embedded in the package, e.g. population thresholds, proportion of small cells, etc. (a sketch of such rules follows below)
• To avoid disclosure by differencing, noise must be added

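A toy sketch of what such embedded release rules might look like; the thresholds, rule set and function name are illustrative assumptions, not the rules of any actual table generating server.

```python
import numpy as np

def passes_release_rules(table, min_population=500,
                         max_small_cell_share=0.30, small_cell=2):
    """Return True if the requested table passes the embedded release rules."""
    counts = np.asarray(table, dtype=float).ravel()
    if counts.sum() < min_population:                    # population threshold
        return False
    small_share = np.mean(counts[counts > 0] <= small_cell)
    return small_share <= max_small_cell_share           # small-cell proportion

print(passes_release_rules([[520, 3, 0], [45, 60, 1]]))  # release decision
```
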
Example: Simulation Hypercube
(Shlomo, Antal and Elliot, 2015)
• Population N = 1,500,000
  • NUTS2 region – 2 regions
  • Gender – 2 categories
  • Banded age group – 21 categories
  • Current activity status – 5 categories
  • Occupation – 13 categories
  • Educational attainment – 9 categories
  • Country of citizenship – 5 categories

Cell Value    Number of Cells   Percentage of Cells
0             226,939           92.36%
1             4,028             1.64%
2             2,112             0.86%
3-5           2,964             1.21%
6-8           1,664             0.68%
9-10          720               0.29%
11 and over   7,273             2.96%
Total         245,700           100.00%

Flexible Table Generating Servers
• Based on the restrictions of the server, define a 3-dimensional table with one variable defining the population: banded age group, education group and occupation group, defined for NUTS2 = 1
• The table has 2,457 cells and 854,539 individuals, with an average cell size of 347.8

Cell Value   Number of Cells   Percentage of Cells
0            1,534             62.43%
1            44                1.79%
2            35                1.42%
3            27                1.10%
4            20                0.81%
5 and over   797               32.44%
Total        2,457             100.00%

Information Based Disclosure Risk and Data Utility Measures
• To assess attribute disclosure in tables, mainly caused by structural zeros, use the entropy

$$H\left(\frac{F}{N}\right) = \frac{N \log N - \sum_{i=1}^{K} F_i \log F_i}{N}$$

where $F = \{F_1, F_2, \ldots, F_K\}$ is the vector of frequency counts and $N = \sum_{i=1}^{K} F_i$

• Entropy is bounded below by 0, attained when all cells are zero except one cell, and above by $\log K$, attained when all cell values are equal, i.e. cell proportions are $1/K$

• Risk measure: $1 - H\left(\frac{F}{N}\right) / \log K$

• Combine with other measures (the proportion of zeros and the size of the population) and define a weighted average $R(F, w_1, w_2)$, in which the entropy-based term above receives weight $w_1$, a term based on the proportion of zero cells receives weight $w_2$, and a term based on the population size receives weight $(1 - w_1 - w_2)$ (the full expression is given in Shlomo, Antal and Elliot, 2015)

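A short sketch, under the definitions above, of the entropy-based component $1 - H(F/N)/\log K$ of the risk measure; the helper name is mine.

```python
import numpy as np

def entropy_risk(F):
    """Entropy-based risk 1 - H(F/N)/log(K) for a frequency table F."""
    F = np.asarray(F, dtype=float)
    K, N = F.size, F.sum()
    Fi = F[F > 0]                                 # 0*log 0 = 0 by convention
    H = (N * np.log(N) - np.sum(Fi * np.log(Fi))) / N
    return 1.0 - H / np.log(K)

print(entropy_risk([100, 0, 0, 0]))   # 1.0: all mass in one cell, maximal risk
print(entropy_risk([25, 25, 25, 25])) # 0.0: uniform cells, minimal risk
```
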
Information Based Disclosure Risk and Data Utility Measures
• The risk measure is extended to account for perturbation and sampling, based on conditional entropy
• Utility measure: Hellinger's distance

$$HD(F, F') = \sqrt{\frac{1}{2} \sum_{k=1}^{K} \left(\sqrt{F_k} - \sqrt{F'_k}\right)^2}$$

where $F = \{F_1, F_2, \ldots, F_K\}$ are the original counts and $F' = \{F'_1, F'_2, \ldots, F'_K\}$ the perturbed counts

• Hellinger's distance is bounded by 0 and $\sqrt{N}$, and $1 - \frac{HD(F, F')}{\sqrt{N}}$ can be used to compare SDC methods

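A sketch of the Hellinger-distance utility measure $1 - HD(F, F')/\sqrt{N}$ exactly as defined above; the example counts are invented.

```python
import numpy as np

def hellinger_utility(F, F_pert):
    """Utility 1 - HD(F, F')/sqrt(N): 1 means identical tables."""
    F, F_pert = np.asarray(F, float), np.asarray(F_pert, float)
    hd = np.sqrt(0.5 * np.sum((np.sqrt(F) - np.sqrt(F_pert)) ** 2))
    return 1.0 - hd / np.sqrt(F.sum())

original = np.array([12, 3, 0, 45, 7])
perturbed = np.array([12, 3, 0, 45, 6])   # e.g. one count rounded down
print(hellinger_utility(original, perturbed))
```
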
Results

Method                               Disclosure Risk R(F, w1, w2) in (3)   Data Utility in (4): 1 - HD(F, G)/sqrt(N)
Original (Table 1)                   0.318                                 -
Perturbed input:
  Record swapping                    0.282                                 0.988
  Semi-controlled random rounding    0.137                                 0.991
  Stochastic perturbation            0.239                                 0.995
Perturbed output:
  Semi-controlled random rounding    0.135                                 0.993

• Comparing rounding applied before and after tabulation shows that SDC 'on the fly' (perturbing the output) has lower disclosure risk and higher utility than rounding the input

Future Dissemination Strategies
Synthetic Datasets
• Partially-synthetic microdata
  • Preserves the record structure of the gold-standard microdata
  • Replaces data elements with synthetic values sampled from an appropriate probability model
  • Future work to assess disclosure risk
• Fully-synthetic microdata
  • Preserves some of the gold-standard microdata
  • Generates synthetic entities and data elements from appropriate probability models
  • In practice, very difficult to capture all conditional relationships between variables and within sub-populations
• CTA (controlled tabular adjustment), where suppressed cells take imputed values

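As an aside, a toy sketch of the partially-synthetic idea: keep the record structure and replace a sensitive variable with draws from a fitted model. The variables, model and sizes are invented for illustration and are not any agency's synthesis procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy confidential microdata: age (kept) and income (to be synthesised)
age = rng.integers(20, 65, size=200)
income = 1000 + 450 * age + rng.normal(0, 5000, size=200)

# Fit a simple linear model income ~ age, then draw synthetic incomes from it
beta, alpha = np.polyfit(age, income, 1)          # slope, intercept
resid_sd = np.std(income - (alpha + beta * age))
synthetic_income = alpha + beta * age + rng.normal(0, resid_sd, size=200)

# The released file keeps the record structure but carries synthetic incomes
released = np.column_stack([age, synthetic_income])
print(released[:3])
```
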
Future Dissemination Strategies
Data Enclaves
• A secure IT environment where researchers can access confidential data on-site, e.g. the Virtual Microdata Lab (VML) at the ONS
• Researchers apply to carry out a project and sign a contract and confidentiality agreement
• Minimising the risk of disclosure:
  • No removal of data, no printers, no internet connection
  • All outputs checked manually by staff
  • A training course for understanding the security rules
• Research is needed on what constitutes a disclosive output

Future Dissemination Strategies
Remote Access
• Access to data through a remote connection to a secure server, typically at universities and research institutes
• Carry out analysis as if on a personal PC and view results on screen
• Outputs are dropped into a mailbox to be manually checked and emailed back to researchers

Future Dissemination Strategies
Remote Analysis
• Some agencies (e.g. the US Census Bureau, ABS) are developing platforms for remote analysis, or allowing researchers to submit code to be run on-site
• The aim is to protect outputs without the need for manual intervention
Example (O'Keefe and Shlomo, 2012):
• Comparison of confidentialized inputs versus confidentialized outputs
• Data on 338 sugar cane farms from a 1982 survey of the sugar cane industry in Queensland, Australia: Region (4 categories) and 5 continuous variables: Area, Harvest, Receipts, Costs, Profit (= Receipts - Costs)
• Confidentialized the input by additive noise and removing outliers (a sketch follows below)

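A rough sketch of confidentializing an input variable by additive noise with outlier treatment; here outliers are clipped rather than deleted, and all parameter values are my assumptions rather than the O'Keefe and Shlomo (2012) settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def confidentialize(x, noise_frac=0.1, z_cut=3.0):
    """Clip values beyond z_cut standard deviations, then add independent
    noise with standard deviation noise_frac * sd(x)."""
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std()
    x = np.clip(x, mu - z_cut * sd, mu + z_cut * sd)   # treat outliers
    return x + rng.normal(0, noise_frac * sd, size=x.size)

receipts = rng.lognormal(mean=11, sigma=0.8, size=338)  # toy farm receipts
print(confidentialize(receipts)[:5])
```
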
Future Dissemination Strategies
Remote Analysis
• Receipts: [figures comparing the distribution of Receipts for the original data, the confidentialized input and the confidentialized output]
• Residuals: [figures comparing regression residuals for the original data, the confidentialized input and the confidentialized output]

Challenges and Discussion
• In recent years, managing disclosure risk has largely meant restricting access to data
• There are more government initiatives for 'open data'
• Agencies need to use modern dissemination strategies to accommodate the increasing demand for 'open data'
• We need stricter and tighter definitions of disclosure risk, but users will have to work with perturbative SDC methods
• Agencies should release the methods and parameters of the perturbation so that researchers can account for the measurement error
• For 'on the fly' SDC methods, agencies should release utility measures based on the original file/tables