October 2011
Linda Fardell
Cross Portfolio Data Integration Secretariat

What is it & why should you care?
• It’s about obligations – legal/ethical
• Aim – protect identity and release useful data
• It’s more than removing name & address
• Trust of providers is essential to get good stats

Information is power
• Banker in Maryland obtained a list of patients with cancer
• compared with list of clients with outstanding loans
• called in the loans of clients with cancer.
Source: Data confidentiality: a review of methods for statistical disclosure limitation and methods for assessing privacy. Statist. Surv. Volume 5 (2011), 1-29.

Legislative obligations
• Privacy Act
• Specific legislation governing collection & use of information e.g.
  • Social Security (Administration) Act 1999
  • Taxation Administration Act 1953

Other obligations
• Principles based obligations e.g. High Level Principles for Data Integration Involving Commonwealth Data for Statistical and Research Purposes

How agencies meet these obligations
• Implement procedures to address all aspects of data protection
• To ensure that identifiable information:
  • is not released publicly;
  • is available on a ‘need to know’ basis;
  • can’t be derived from disseminated data; and
  • is maintained and accessed securely.

Managing identification risk
• Understand your obligations
• Establish policies and procedures
• De-identify the data
• Assess potential identification risks
• Test and evaluate to mitigate risks
• Manage the risks of identification - confidentialise
• Provide safe access to data

Access to other information
• Keep track of all information released from the dataset.

When should a cell be confidentialised?
• Common confidentiality rules:
  • frequency (threshold) rule
  • cell dominance (cell concentration) rule
• Keep specific confidentiality procedures secret (e.g.
the particular value chosen when applying the threshold rule)

Two general methods
• Data reduction
• Data modification (perturbation)

Example: frequency rule - 5

Before
                  Income
Age        Low    Med   High   Total
15–19       20      0      0      20
20–29       14     11      8      33
30–39        8     12      7      27
40–49        6     18     24      48
50–59        4      5     14      23
60+         12      9      7      28
Total       64     55     60     179

Example: cont.

After
                  Income
Age        Low    Med   High   Total
15–19       20      0      0      20
20–29       14     11      8      33
30–39        8     12      7      27
40–49     n.p.   n.p.     24      48
50–59     n.p.   n.p.     14      23
60+         12      9      7      28
Total       64     55     60     179

Alternative: concealing totals
                  Income
Age        Low  Medium   High   Total
15–19       20       0      0      20
20–29       14      11      8      33
30–39        8      12      7      27
40–49        6      18     24      48
50–59     n.p.       5     14     >19
60+         12       9      7      28
Total      >60      55     60    >175

E.g. 2 – the cell dominance (n,k) rule

Widget brand   Profit ($m)
A                      150
B                       93
C                       21
D                       13
E                        8
F                        8
G                        6
H                        1
Total                  300

• Cell unsafe if combined contributions of the ‘n’ largest members of the cell represent more than ‘k’% of the total value of the cell
• n & k values are set by data custodian
• Example: (2, 75) rule
• A & B contribute 81% of total profit, so profit needs protecting

Data modification methods

Before rounding (RR3)
                  Income
Age        Low    Med   High   Total
15–19       20      0      0      20
20–29       14     11      8      33
30–39        8     12      7      27
40–49        6     18     24      48
50–59        4      5     14      23
60+         12      9      7      28
Total       64     55     60     179

Data modification methods

After rounding (RR3)
                  Income
Age        Low    Med   High   Total
15–19       21      0      0      21
20–29       15     12      9      33
30–39        9     12      6      27
40–49        6     18     24      48
50–59        3      6     15      24
60+         12      9      6      27
Total       63     54     60     180

Microdata
• Valuable resource
• 2 key types of disclosure risk:
  1. spontaneous recognition
  2.
deliberate (malicious) attempt

Microdata – managing risks
• confidentialising
• deterrents
• restricting access
• educating data users about their obligations
• safe environment for access

Microdata – methods to assess risks
• cross-tabulation of variables;
• comparing sample data with population data to see if the unique characteristics in the sample are unique in the population; and
• acquiring knowledge of other datasets & publicly available information that could be used for list matching.

Protecting microdata
• 1st level of protection: remove direct identifiers
• Common ways to protect microdata are:
  1. confidentialising; and/or
  2. restricting access to the file

Confidentialising microdata
• Same principles as protecting aggregate data:
  • limit variables
  • introduce small amounts of random error (e.g. data swapping)
  • combine categories (e.g. age in 5 year ranges)
  • top/bottom code
  • suppress particular values/records that can’t otherwise be protected.

Restricting access to microdata

What affects the risk of identification?
• motivation
• level of detail
• presence of rare characteristics
• accuracy of the data
• age of the data
• coverage of the data (completeness)
• presence of other information

A note on terminology…
• Confusion between de-identification and confidentialisation

More information – www.nss.gov.au
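
Worked sketch: the confidentiality rules in code

The frequency rule, the (n,k) dominance rule and RR3 rounding from the earlier examples can be sketched in a few lines of Python. This is a minimal illustration only, not any agency's procedure. The function names (primary_suppress, is_dominated, random_round) are hypothetical. Two assumptions are made: zero cells are treated as safe and only non-zero counts below the threshold are flagged, which matches the frequency rule example; and RR3 is taken to be unbiased random rounding to base 3, which is one common scheme and may not be the one used to produce the rounded table.

```python
import random

def primary_suppress(table, threshold=5):
    """Frequency (threshold) rule: flag non-zero cells below the threshold.

    Assumption: zero cells are not flagged, as in the slide example.
    """
    return {
        (row, col)
        for row, cells in table.items()
        for col, count in cells.items()
        if 0 < count < threshold
    }

def is_dominated(contributions, n=2, k=75.0):
    """(n, k) rule: unsafe if the n largest contributors exceed k% of the total."""
    total = sum(contributions)
    if total <= 0:
        return False
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k

def random_round(value, base=3, rng=random):
    """Assumed scheme: unbiased random rounding to a multiple of base."""
    remainder = value % base
    if remainder == 0:
        return value  # already a multiple of base, left unchanged
    lower = value - remainder
    # Round up with probability remainder/base so the expected value is unchanged.
    return lower + base if rng.random() < remainder / base else lower

# Data copied from the example tables (ASCII hyphens used in the age labels).
income_by_age = {
    "15-19": {"Low": 20, "Med": 0, "High": 0},
    "20-29": {"Low": 14, "Med": 11, "High": 8},
    "30-39": {"Low": 8, "Med": 12, "High": 7},
    "40-49": {"Low": 6, "Med": 18, "High": 24},
    "50-59": {"Low": 4, "Med": 5, "High": 14},
    "60+": {"Low": 12, "Med": 9, "High": 7},
}
widget_profits = [150, 93, 21, 13, 8, 8, 6, 1]

print(primary_suppress(income_by_age))        # {('50-59', 'Low')}
print(is_dominated(widget_profits, 2, 75.0))  # True: A & B hold 81% of 300
print(random_round(23))                       # 21 or 24, chosen at random
```

The sketch only finds the cell that breaches the threshold. In the "After" table three further cells appear to be withheld as well, so that the suppressed 4 cannot be recovered by subtraction from the row and column totals; choosing those extra cells is not attempted here.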