Anonymised Integrated Event History Datasets for Researchers

Johan Heldal
Statistics Norway
Contents
• The Social Security Database FD-trygd
• About the Norwegian Social Science Data Services
• Event history data from FD-trygd
• Data to researchers – principles laid down
• Anonymisation
• Establishing a measure of disclosure risk
Social security event history database FD-trygd
• Contains all events related to the Norwegian social security system for every person resident in Norway since 1992.
• All benefits and associated variables
• All dates for events
• All demographic histories (birth/immigration, sex, marital status, children)
• All places of residence
• The data are kept in different Oracle “tables” that can be merged exactly by a personal identification code.
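The exact merge by personal identification code can be illustrated with a minimal sketch. The table names, the column names (`pid`, `date`, `event`) and the rows are hypothetical; the slides do not show the actual FD-trygd schema.

```python
from collections import defaultdict


def merge_event_tables(tables):
    """Merge several event tables into one chronological history per person.

    `tables` maps a table name to a list of rows. Every row carries the
    personal identification code under the (hypothetical) key "pid".
    """
    histories = defaultdict(list)
    for table_name, rows in tables.items():
        for row in rows:
            histories[row["pid"]].append((row["date"], table_name, row["event"]))
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return {pid: sorted(events) for pid, events in histories.items()}


# Invented example data, for illustration only.
tables = {
    "demography": [
        {"pid": 1, "date": "2001-03-01", "event": "married"},
        {"pid": 2, "date": "1998-07-15", "event": "moved to Oslo"},
    ],
    "benefits": [
        {"pid": 1, "date": "2000-06-17", "event": "sickness benefit starts"},
    ],
}

histories = merge_event_tables(tables)
```

Because every table shares the same key, the merge is exact: no probabilistic record linkage is needed, which is also what makes the integrated data so detailed.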
FD-trygd
• Has high data quality
• Can also be merged with
– Education histories
– Incomes from the yearly assessment
• Is extremely valuable for research purposes.
NSD
The Norwegian Social Science Data Services
• Established (1971) to simplify access to data for
researchers and students in Norway.
• Distributes anonymised survey datasets from Statistics
Norway and others.
• In 2009, NSD requested a 20 % sample of all individual histories in FD-trygd, to be anonymised at NSD’s own premises for delivery to researchers.
• Statistics Norway (SN) has approved the request, but
– SN is responsible by law for confidentiality.
– Anonymisation and dissemination must follow rules set by SN.
A hypothetical event history for a woman immigrating to Norway 22 November 1999.
On the slide, bold face in a cell indicates the variable that changed.

Event history variables

Serial   Birth  Dates     Resident  Sex  Marital  Employment  Residence  Children  Benefits
number   year   (DDMMYY)                 status
1234567  1975   220199    Yes       F    U        Empl        Oslo       0         …
1234567  1975   170601    Yes       F    U        Sick        Oslo       0         86 000 …
1234567  1975   170901    Yes       F    U        Empl        Oslo       0         …
1234567  1975   080202    Yes       F    U        Maternity   Oslo       1         …
1234567  1975   250502    Yes       F    M        Maternity   Oslo       1         252 000 …
1234567  1975   081102    Yes       F    M        Empl        Oslo       1         …
1234567  1975   310303    Yes       F    M        Empl        Ski        1         …
1234567  1975   220205    Yes       F    M        Unempl      Ski        1         …
1234567  1975   270705    Yes       F    M        Unempl      Ski        2         …
1234567  1975   090407    Yes       F    M        Empl        Ski        2         …
1234567  1975   310508    Yes       F    D        Empl        Ski        2         …
1234567  1975   …         …         …    …        …           …          …
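Since each row of the event history records the state after one change, the changed variable can be recovered by comparing a row with the one before it. The sketch below does this for the first five rows of the hypothetical history; the field names and the tuple layout are illustrative assumptions, not the FD-trygd format.

```python
# Field names are assumptions made for this sketch.
FIELDS = ("resident", "sex", "marital", "employment", "residence", "children")

# First five rows of the hypothetical history: (date DDMMYY, then FIELDS).
rows = [
    ("220199", "Yes", "F", "U", "Empl", "Oslo", 0),
    ("170601", "Yes", "F", "U", "Sick", "Oslo", 0),
    ("170901", "Yes", "F", "U", "Empl", "Oslo", 0),
    ("080202", "Yes", "F", "U", "Maternity", "Oslo", 1),
    ("250502", "Yes", "F", "M", "Maternity", "Oslo", 1),
]


def changed_fields(previous, current):
    """Return the names of the fields that differ between two consecutive rows."""
    return [
        name
        for name, old, new in zip(FIELDS, previous[1:], current[1:])
        if old != new
    ]


# One list of changed fields per transition between consecutive rows.
changes = [changed_fields(prev, curr) for prev, curr in zip(rows, rows[1:])]
```

Note that a single event can change more than one variable: the transition to maternity leave changes both employment status and the number of children.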
Classes of event-variables in FD-trygd
• Demographic variables
• Pensions
• Supports
• Rehabilitation
• Labour market
• Education
• Income from assessment
Principles laid down
• Combining tables in FD-trygd is data integration.
• Must respect the principles laid down in Principles and Guidelines on Confidentiality Aspects of Data Integration Undertaken for Statistical or Related Research Purposes.
• It should not be possible, with reasonable means, to identify someone in the dataset.
• It is important for SN to establish clear rules for NSD’s anonymisation based on this.
Rules should
• Handle realistic disclosure scenarios
• Be able to withstand scrutiny from investigative journalists
• Be transparent for the researchers
• Adapt to each researcher’s needs as well as possible (need-to-know principle)
• Creating one complete anonymised 20 % sample is out of the question.
Restrict information to researchers with respect to
• Sample size (from the 20 % sample)
• Variable scope
• Length of event histories
• Detail for each variable
• Detail for dates
If the requested data are too large:
• Different samples with different variables for different analyses
• Finding the best balance is a challenge
For strict rules:
• Need to establish a risk model for this type of data.
• Can the μ-Argus risk measure (Franconi & Polettini 2004) be extended to event history data?
• Must take into account increased risk from
– Identifying variables given at all times
– Model for memory on event history
– Precision of times for events
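To see why the precision of event times raises risk, a naive baseline is to count records that are unique on a set of identifying variables. This is not the μ-Argus risk measure, which models population frequencies; it is only a simple sample-uniqueness count, and the records and variable names below are invented for illustration.

```python
from collections import Counter


def count_sample_uniques(records, key_vars):
    """Count records whose combination of key variables occurs exactly once."""
    keys = [tuple(record[var] for var in key_vars) for record in records]
    frequency = Counter(keys)
    return sum(1 for key in keys if frequency[key] == 1)


# Invented records; "event_month" is the month of one event (YYYYMM).
records = [
    {"birth_year": 1975, "sex": "F", "residence": "Oslo", "event_month": "200202"},
    {"birth_year": 1975, "sex": "F", "residence": "Oslo", "event_month": "200205"},
    {"birth_year": 1975, "sex": "M", "residence": "Ski", "event_month": "200205"},
]

# Without the event time, only one record is unique in the sample.
coarse = count_sample_uniques(records, ["birth_year", "sex", "residence"])

# Adding a precisely dated event makes every record unique.
fine = count_sample_uniques(
    records, ["birth_year", "sex", "residence", "event_month"]
)
```

In this toy example the count rises from 1 to 3 once the dated event is included, which is the kind of effect a risk model for event history data would have to capture.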
Preliminary rules
• Limit sample sizes to 10 % of the target population as represented in the 20 % sample, i.e. about 2 % of the total target population.
• Restrict detail for the most visible identifying variables.
• Round economic benefits associated with states.
• Restrict all dates to YYYYMM.
• Only five levels for education.
• Positive incomes only in quintiles of the distribution.
• NSD has started test deliveries based on these rules.
• The tests will be evaluated next year.
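Some of the preliminary rules can be sketched in a few lines. The sketch assumes dates stored as DDMMYY (as in the hypothetical event history), a century pivot at 92 because the data start in 1992, and a rounding base of 1 000 for benefits. The rounding base and all function names are assumptions; the slides do not specify them.

```python
from fractions import Fraction


def coarsen_date(ddmmyy, pivot=92):
    """Convert a DDMMYY date to YYYYMM, dropping the day.

    Assumption: two-digit years at or above `pivot` belong to the 1900s,
    all others to the 2000s.
    """
    month = ddmmyy[2:4]
    yy = int(ddmmyy[4:6])
    century = 1900 if yy >= pivot else 2000
    return f"{century + yy}{month}"


def round_benefit(amount, base=1000):
    """Round a non-negative amount to the nearest multiple of `base` (halves up)."""
    return (amount + base // 2) // base * base


def income_quintile(income, positive_incomes):
    """Return the quintile (1 to 5) of a positive income, or None otherwise.

    The quintile is based on the rank of `income` among `positive_incomes`.
    """
    if income <= 0:
        return None
    below = sum(1 for other in positive_incomes if other < income)
    return min(5, 1 + (5 * below) // len(positive_incomes))


# 10 % of the 20 % sample is 2 % of the target population.
sampling_fraction = Fraction(10, 100) * Fraction(20, 100)
```

For example, `coarsen_date("170601")` gives `"200106"`, and `round_benefit(86400)` gives `86000` under the assumed base of 1 000.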
With a good measure of risk
• Researchers could choose a larger sample size with less variable scope or variable detail,
• or a smaller sample size with larger variable scope and detail,
• as long as the total risk stays within a limit.
• We hope the experiences from the test deliveries will be useful here.
Thank you for your
attention