Outcomes from Cross Portfolio Data Integration Oversight Board

October 2011
Linda Fardell
Cross Portfolio Data Integration Secretariat
What is it & why should you care?
• It’s about obligations – legal/ethical
• Aim – protect identity and release useful data
• It’s more than removing name & address
• Trust of providers is essential to get good stats
Information is power
• A banker in Maryland obtained a list of patients with cancer,
• compared it with a list of clients with outstanding loans, and
• called in the loans of clients with cancer.
Source: Data confidentiality: a review of methods for statistical
disclosure limitation and methods for assessing privacy, Statist. Surv.
Volume 5 (2011), 1-29.
Legislative obligations
• Privacy Act
• Specific legislation governing collection & use of information, e.g.
  • Social Security (Administration) Act 1999
  • Taxation Administration Act 1953
Other obligations
• Principles-based obligations, e.g. High Level Principles for Data
  Integration Involving Commonwealth Data for Statistical and Research
  Purposes
How agencies meet these obligations
• Implement procedures to address all aspects of data protection
• To ensure that identifiable information:
  • is not released publicly;
  • is available on a ‘need to know’ basis;
  • can’t be derived from disseminated data; and
  • is maintained and accessed securely.
Managing identification risk
• Understand your obligations
• Establish policies and procedures
• De-identify the data
• Assess potential identification risks
• Test and evaluate to mitigate risks
• Manage the risks of identification - confidentialise
• Provide safe access to data
Access to other information
• Keep track of all information released from the
dataset.
When should a cell be confidentialised?
• Common confidentiality rules:
• frequency (threshold) rule
• cell dominance (cell concentration) rule
• Keep specific confidentiality procedures secret
(e.g. the particular value chosen when applying
the threshold rule)
Two general methods
• Data reduction
• Data modification (perturbation)
Example: frequency rule - 5
Before
                 Income
Age       Low    Med   High   Total
15–19      20      0      0      20
20–29      14     11      8      33
30–39       8     12      7      27
40–49       6     18     24      48
50–59       4      5     14      23
60+        12      9      7      28
Total      64     55     60     179
Example: cont.
After
                 Income
Age       Low    Med   High   Total
15–19      20      0      0      20
20–29      14     11      8      33
30–39       8     12      7      27
40–49    n.p.   n.p.     24      48
50–59    n.p.   n.p.     14      23
60+        12      9      7      28
Total      64     55     60     179
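The primary suppression in this example can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the treatment of zero cells as safe, and the reading of the rule as "non-zero and strictly below 5" are assumptions inferred from the table, not a documented procedure.

```python
# Hypothetical sketch of a frequency (threshold) rule; all names are illustrative.
THRESHOLD = 5  # assumed from the slide title; real thresholds are normally kept secret

INCOME = ("Low", "Med", "High")

# Interior cells of the example table: age group -> (low, med, high) counts
table = {
    "15-19": (20, 0, 0),
    "20-29": (14, 11, 8),
    "30-39": (8, 12, 7),
    "40-49": (6, 18, 24),
    "50-59": (4, 5, 14),
    "60+": (12, 9, 7),
}


def primary_suppressions(cells, threshold):
    """Return (age, income) pairs whose count is non-zero but below the threshold."""
    return [
        (age, INCOME[i])
        for age, counts in cells.items()
        for i, count in enumerate(counts)
        if 0 < count < threshold  # assumption: zero cells stay published
    ]


unsafe = primary_suppressions(table, THRESHOLD)
print(unsafe)  # [('50-59', 'Low')]
```

Only the 50–59 / Low cell fails the rule. The other three n.p. cells in the table are presumably consequential suppressions: with the totals still published, a single hidden cell could be recovered by subtraction (23 − 5 − 14 = 4), so further cells in the same row and column are withheld as well.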
Alternative: concealing totals
                 Income
Age       Low  Medium   High   Total
15–19      20       0      0      20
20–29      14      11      8      33
30–39       8      12      7      27
40–49       6      18     24      48
50–59    n.p.       5     14     >19
60+        12       9      7      28
Total     >60      55     60    >175
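The ‘>’ entries appear to be the sums of the cells that remain published in the affected row, column and table. The sketch below checks that reading against the slide; the data layout and the helper name `lower_bound` are illustrative assumptions.

```python
# Hypothetical sketch: None marks the cell shown as n.p. on the slide.
published = {
    "15-19": {"Low": 20, "Med": 0, "High": 0},
    "20-29": {"Low": 14, "Med": 11, "High": 8},
    "30-39": {"Low": 8, "Med": 12, "High": 7},
    "40-49": {"Low": 6, "Med": 18, "High": 24},
    "50-59": {"Low": None, "Med": 5, "High": 14},
    "60+": {"Low": 12, "Med": 9, "High": 7},
}


def lower_bound(values):
    """Sum the published values, skipping suppressed (None) cells."""
    return sum(v for v in values if v is not None)


row_bound = lower_bound(published["50-59"].values())
col_bound = lower_bound(row["Low"] for row in published.values())
grand_bound = lower_bound(v for row in published.values() for v in row.values())

print(f">{row_bound}", f">{col_bound}", f">{grand_bound}")  # >19 >60 >175
```

Because the exact totals are withheld, the suppressed cell can no longer be recovered by subtraction, and only one interior cell has to be hidden.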
E.g. 2 – the cell dominance (n,k) rule
Widget brand   Profit ($m)
A                      150
B                       93
C                       21
D                       13
E                        8
F                        8
G                        6
H                        1
Total                  300
• Cell unsafe if combined contributions of the ‘n’ largest members of the
  cell represent more than ‘k’% of the total value of the cell
• n & k values are set by data custodian
• Example: (2, 75) rule
• A & B contribute 81% of total profit, so profit needs protecting
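The (n,k) test can be written as a short function. This is a minimal sketch under the definition given above; the function name `is_dominated` is illustrative, and integer arithmetic is used so the comparison is exact.

```python
# Hypothetical sketch of the cell dominance (n, k) rule; names are illustrative.
def is_dominated(contributions, n, k):
    """True if the n largest contributions exceed k% of the cell total."""
    total = sum(contributions)
    if total <= 0:
        return False  # assumption: empty or zero-valued cells are not flagged here
    top_n = sum(sorted(contributions, reverse=True)[:n])
    # Compare top_n / total > k / 100 without floating-point division
    return top_n * 100 > k * total


profits = {"A": 150, "B": 93, "C": 21, "D": 13, "E": 8, "F": 8, "G": 6, "H": 1}

print(is_dominated(profits.values(), n=2, k=75))  # True: (150 + 93) / 300 = 81%
```

With a looser rule such as (2, 85) the same cell would pass, which is one reason the chosen n and k are left to the data custodian.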
Data modification methods
Before rounding (RR3)
                 Income
Age       Low    Med   High   Total
15–19      20      0      0      20
20–29      14     11      8      33
30–39       8     12      7      27
40–49       6     18     24      48
50–59       4      5     14      23
60+        12      9      7      28
Total      64     55     60     179
Data modification methods
After rounding (RR3)
                 Income
Age       Low    Med   High   Total
15–19      21      0      0      21
20–29      15     12      9      33
30–39       9     12      6      27
40–49       6     18     24      48
50–59       3      6     15      24
60+        12      9      6      27
Total      63     54     60     180
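RR3 is read here as random rounding to base 3: every value is moved to an adjacent multiple of 3, and values that are already multiples of 3 are left alone. The sketch below uses the common unbiased variant (round up with probability remainder/3); the slides do not say which variant was used, so treat that choice and the function name as assumptions.

```python
import random


# Hypothetical sketch of random rounding to base 3; not a documented agency procedure.
def random_round(value, base=3, rng=random):
    """Round a non-negative integer to an adjacent multiple of `base` at random."""
    remainder = value % base
    if remainder == 0:
        return value  # multiples of the base are unchanged
    lower = value - remainder
    # Round up with probability remainder/base so the expected value equals `value`
    if rng.random() < remainder / base:
        return lower + base
    return lower


rng = random.Random(2011)  # seeded only to make the illustration repeatable
print([random_round(v, rng=rng) for v in (20, 14, 11, 8, 4, 5, 23, 179)])
```

Each cell, including the totals, is rounded independently, so rounded rows and columns need not add up: in the table above the 20–29 row shows 15 + 12 + 9 = 36 against a published total of 33.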
Microdata
• Valuable resource
• 2 key types of disclosure risk:
1. spontaneous recognition
2. deliberate (malicious) attempt
Microdata – managing risks
• confidentialising
• deterrents
• restricting access
• educating data users about their obligations
• safe environment for access
Microdata – methods to assess risks
• cross-tabulation of variables;
• comparing sample data with population data to see if the
unique characteristics in the sample are unique in the
population; and
• acquiring knowledge of other datasets & publicly
available information that could be used for list matching.
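The cross-tabulation check can be sketched as a count of how many records share each combination of variables; combinations that occur once are candidates for closer review. The records, variable names and function name below are made up purely for illustration.

```python
from collections import Counter

# Hypothetical, fabricated records used only to illustrate the check.
records = [
    {"age_group": "30-39", "sex": "F", "occupation": "Teacher"},
    {"age_group": "30-39", "sex": "F", "occupation": "Teacher"},
    {"age_group": "60+", "sex": "M", "occupation": "Judge"},
    {"age_group": "20-29", "sex": "M", "occupation": "Nurse"},
    {"age_group": "20-29", "sex": "M", "occupation": "Nurse"},
]


def unique_combinations(rows, keys):
    """Return the combinations of `keys` that occur exactly once in `rows`."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, count in counts.items() if count == 1]


print(unique_combinations(records, ("age_group", "sex", "occupation")))
# [('60+', 'M', 'Judge')]
```

A record that is unique in a sample is not necessarily unique in the population, which is why the second bullet compares against population data before treating it as a risk.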
Protecting microdata
• First level of protection: remove direct identifiers
• Common ways to protect microdata are:
1. confidentialising; and/or
2. restricting access to the file
Confidentialising microdata
• Same principles as protecting aggregate data:
• limit variables
• introduce small amounts of random error (e.g. data
swapping)
• combine categories (e.g. age in 5 year ranges)
• top/bottom code
• suppress particular values/records that can’t
otherwise be protected.
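Two of these techniques, combining categories and top coding, are simple enough to sketch directly. The function names, the 5-year band width and the cap values are illustrative assumptions rather than recommended settings.

```python
# Hypothetical sketch of two confidentialisation steps; names and cut-offs are illustrative.
def age_band(age, width=5, top=85):
    """Combine single years of age into ranges, with an open-ended top group."""
    if age >= top:
        return f"{top}+"  # top coding applied to the oldest ages
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def top_code(value, cap):
    """Replace any value above `cap` with the cap itself."""
    return min(value, cap)


print(age_band(37))               # 35-39
print(age_band(91))               # 85+
print(top_code(250_000, 150_000)) # 150000
```

Both steps reduce detail rather than add error, so they fall under data reduction; data swapping, by contrast, is a perturbation method.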
Restricting access to microdata
What affects the risk of identification?
• motivation
• level of detail
• presence of rare characteristics
• accuracy of the data
• age of the data
• coverage of the data (completeness)
• presence of other information
A note on terminology…
• Confusion between de-identification and
confidentialisation
More information – www.nss.gov.au