Anonymisation of Linked Employer
Employee Datasets using the example of
the German Structure of Earnings Survey
Hans-Peter Hafner and Rainer Lenz
Research Data Centre of the Statistical Offices of the Länder
University of Applied Sciences Mainz
UNECE Work Session on Statistical Data Confidentiality
Manchester, 17 December 2007
Overview
• Introduction: Linked Employer Employee Datasets
• Anonymisation of the Employer Data
• Anonymisation of the Employee Data
• German Structure of Earnings Survey: Methodology and Anonymisation
• Conclusion
© Statistische Ämter der Länder, Forschungsdatenzentrum
Introduction: Linked Employer Employee Datasets
• Observed effects can be split into fractions attributable to the employer and to the employee
• Anonymisation is difficult!
Anonymisation of the Employer Data
• Additional knowledge from commercial databases
• Matching experiments for risk estimation
→ In general, 0.5 is accepted as the upper bound for re-identification risk
→ Risk minimisation:
   Categorical variables: Combining classes
   Metric variables: Microaggregation
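A minimal sketch of such a matching experiment, under stated assumptions: each record of a hypothetical external database is linked to its nearest record in the anonymised file, and the fraction of correct links serves as the estimated re-identification risk. The record layout, the squared-distance criterion, the toy data, and the name `matching_risk` are illustrative assumptions; only the 0.5 bound comes from the slide.

```python
def matching_risk(external, anonymised):
    """Fraction of external records whose nearest anonymised record is the true match.

    Both arguments map a company id to a tuple of numeric key variables.
    The ids are used only to check whether a link is correct.
    """
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    correct = 0
    for cid, ext_values in external.items():
        # The intruder links the external record to the closest anonymised record.
        guess = min(anonymised, key=lambda other: distance(ext_values, anonymised[other]))
        if guess == cid:
            correct += 1
    return correct / len(external)


# Hypothetical toy data: (number of employees, turnover in million EUR)
external = {"A": (120, 15.0), "B": (480, 60.0), "C": (510, 61.0), "D": (2300, 400.0)}
# Companies B and C were microaggregated to identical values.
anonymised = {"A": (118, 15.5), "B": (500, 61.5), "C": (500, 61.5), "D": (2250, 390.0)}

risk = matching_risk(external, anonymised)
RISK_BOUND = 0.5  # upper bound quoted on the slide
acceptable = risk <= RISK_BOUND
```

In this toy file the risk still exceeds the bound, which would call for further coarsening or microaggregation before release.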
Anonymisation of the Employee Data
• Re-identification risk for employees is negligible
• But: the information about the employees can increase the re-identification risk for the corresponding company
Anonymisation of the Employee Data
Assumptions
• S: employer characteristic; T: employee characteristic
• w.l.o.g.: S and T categorical
• Strong association between S and T
• S is anonymised (coarsening of categories)
• Data intruder knows the marginal distribution of T for every category of the original S
Anonymisation of the Employee Data
Concept
• Data intruder compares the distribution of T (of a certain company) with the marginal distributions of the possible categories of S
• Characteristic category d of T:
  - Marginal distribution of a possible category x of S: the fraction of d exceeds a value f.
  - Distribution of d over the categories of S: the fraction of d in category x exceeds some threshold.
  - The fractions of d are not nearly equal for all possible categories of S.
• Distribution differences of the characteristic categories are very informative
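The comparison described above can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: the total variation distance, the thresholds `share` and `spread`, the category sizes, and the company distribution are hypothetical (the slide names only the value f). The named occupation fractions are borrowed from Tables 2 and 3; all remaining occupations are lumped into "Other" for the sketch.

```python
def total_variation(p, q):
    """Half the L1 distance between two distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def guess_category(company_dist, marginals):
    """Return the candidate category of S whose marginal distribution of T is closest."""
    return min(marginals, key=lambda s: total_variation(company_dist, marginals[s]))


def is_characteristic(d, x, marginals, sizes, f=0.3, share=0.5, spread=0.1):
    """Check the three conditions for category d of T with respect to category x of S.

    f, share and spread are hypothetical thresholds chosen for this sketch.
    """
    frac_in_x = marginals[x].get(d, 0.0)
    # Number of employees with T = d in each candidate category of S
    counts = {s: marginals[s].get(d, 0.0) * sizes[s] for s in marginals}
    total = sum(counts.values())
    fracs = [marginals[s].get(d, 0.0) for s in marginals]
    return (
        frac_in_x > f
        and total > 0 and counts[x] / total > share
        and max(fracs) - min(fracs) > spread
    )


# Candidate original categories of S hidden behind one coarsened category.
marginals = {
    "Drapery / Clothing Trade": {"Textile Fabricators": 0.193, "Office Workers": 0.146, "Other": 0.661},
    "Leather Industry": {"Leather Producers": 0.474, "Office Workers": 0.169, "Other": 0.357},
}
sizes = {"Drapery / Clothing Trade": 1000, "Leather Industry": 400}  # hypothetical employee counts

# Hypothetical occupation distribution observed for one company
company = {"Leather Producers": 0.55, "Office Workers": 0.15, "Other": 0.30}

guess = guess_category(company, marginals)
leather_is_char = is_characteristic("Leather Producers", "Leather Industry", marginals, sizes)
office_is_char = is_characteristic("Office Workers", "Leather Industry", marginals, sizes)
```

Occupations such as office workers appear with similar fractions in both sectors and reveal little, whereas a dominant sector-specific occupation lets the intruder undo the coarsening of S.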
Anonymisation of the Employee Data
Conclusions
• Fraction of correct assignments → new risk estimation for companies
• Re-identification risk too high → combine characteristic categories
German Structure of Earnings Survey:
Methodology
• Sample survey in industry and services; every four years
• Information about earnings and activities of employees
• The survey is conducted in all EU countries to obtain comparable information across Europe
• For the year 2001 in Germany: about 22,000 companies with data on over 845,000 employees
German Structure of Earnings Survey:
Anonymisation
Employer
• Coarsening of economic sectors (around 40 groups) and regional information (5 regions)
• Microaggregation of the number of employees if the company has 500 or more employees
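A minimal sketch of the second step, assuming simple univariate microaggregation: the values are sorted, split into groups of k neighbours, and each value is replaced by its group mean. The group size k = 3, the handling of a short tail, the function names, and the toy counts are assumptions; the slide specifies only the threshold of 500 employees.

```python
def microaggregate(values, k=3):
    """Replace each value by the mean of its group of k neighbours in sorted order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = list(values)
    start = 0
    while start < len(order):
        end = start + k
        # Merge a short tail (fewer than k values) into the last full group.
        if len(order) - end < k:
            end = len(order)
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


def anonymise_employees(counts, threshold=500, k=3):
    """Microaggregate only the counts at or above the threshold; keep the rest unchanged."""
    large = [i for i, c in enumerate(counts) if c >= threshold]
    averaged = microaggregate([counts[i] for i in large], k)
    out = list(counts)
    for i, value in zip(large, averaged):
        out[i] = value
    return out


# Hypothetical numbers of employees for nine companies
counts = [40, 520, 75, 3100, 610, 980, 300, 1500, 700]
protected = anonymise_employees(counts)
```

Replacing values by group means keeps the overall total unchanged while ensuring that every large company shares its published value with at least k - 1 others.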
Table 1. Measures of association between the economic sector and the attributes of the employees

Attribute                                     Measure of association
Sex                                           0.044
Wage Tax Class                                0.045
Allowance for Children                        0.043
Position in Job                               0.119
Education                                     0.071
Type of Contract of Employment                0.022
Occupation Class (2-digit)                    0.659
Paid Working Hours Total                      0.003
Gross Earnings in Accounting Period           0.041
Extra Pay for Shift Work                      0.077
Extra Pay for Night Work                      0.113
Income Tax                                    0.035
Pension and Unemployment Insurance            0.048
Health and Care Insurance                     0.055
Gross Annual Earnings                         0.035
Supplementary Grants in the Reporting Year    0.051
Net Annual Earnings                           0.033
Holiday Entitlement                           0.003
Net Earnings in Accounting Period             0.043
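The slides do not state which measure of association underlies Table 1. As an illustration only, the sketch below computes Cramér's V, one common measure for two categorical variables that ranges from 0 (no association) to 1 (perfect association). The function name and the toy observations are assumptions, and the function expects at least two categories per variable.

```python
from collections import Counter
from math import sqrt


def cramers_v(pairs):
    """Cramér's V for a list of (s, t) observations of two categorical variables."""
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(s for s, _ in pairs)
    col = Counter(t for _, t in pairs)
    chi2 = 0.0
    for s in row:
        for t in col:
            # Expected count under independence of the two variables
            expected = row[s] * col[t] / n
            chi2 += (joint[(s, t)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(row), len(col)) - 1)))


# Hypothetical (sector, attribute) observations
perfect = [("textile", "fabricator")] * 10 + [("leather", "tanner")] * 10
independent = (
    [("textile", "female")] * 5 + [("textile", "male")] * 5
    + [("leather", "female")] * 5 + [("leather", "male")] * 5
)

v_perfect = cramers_v(perfect)
v_independent = cramers_v(independent)
```

An attribute that behaves like the first toy case would point an intruder straight to the sector, while one that behaves like the second carries no information about it.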
German Structure of Earnings Survey:
Anonymisation
Table 2. Fractions of the most frequent occupation classes: Drapery / Clothing Trade

Occupation class                                   Fraction
Textile Fabricators                                19.3 %
Office Workers                                     14.6 %
Textile Producers                                   9.3 %
Product Inspectors, Shipping Finalisers             6.3 %
Technicians                                         5.5 %
Product Traders                                     5.5 %
Storekeepers, Warehousemen, Transport Workers       4.4 %
Spinning Occupations                                4.4 %
Textile Refiners                                    3.0 %

Table 3. Fractions of the most frequent occupation classes: Leather Industry

Occupation class                                   Fraction
Leather Producers, Leather and Coat Fabricators    47.4 %
Office Workers                                     16.9 %
Product Traders                                     4.8 %
Technicians                                         3.9 %
Entrepreneurs, Organisers, Accountants              3.1 %
Plastics Fabricators                                3.1 %
Storekeepers, Warehousemen, Transport Workers       3.0 %

Characteristic Occupations
Conclusion
• Procedure looks promising
• Further testing is required
• Future perspective: Automation of the procedure
Thank you for your attention!
Contact
Research Data Centre of the Statistical Offices of the Länder
Hans-Peter Hafner
[email protected]
University of Applied Sciences Mainz, Dept. I
Rainer Lenz
[email protected]
www.forschungsdatenzentrum.de