Anonymisation of Linked Employer Employee Datasets using the example of the German Structure of Earnings Survey

Hans-Peter Hafner (Research Data Centre of the Statistical Offices of the Länder)
Rainer Lenz (University of Applied Sciences Mainz)

UNECE Work Session on Statistical Data Confidentiality
Manchester, 17 December 2007

© Statistische Ämter der Länder, Forschungsdatenzentrum

Overview
• Introduction: Linked Employer Employee Datasets
• Anonymisation of the Employer Data
• Anonymisation of the Employee Data
• German Structure of Earnings Survey: Methodology and Anonymisation
• Conclusion

Introduction: Linked Employer Employee Datasets
• Observed effects can be split into components attributable to the employer and to the employee
• Anonymisation is difficult!

Anonymisation of the Employer Data
• Additional knowledge from commercial databases
• Matching experiments for risk estimation
In general, 0.5 is accepted as the upper bound for the re-identification risk.
Risk minimisation:
• Categorical variables: combining classes
• Metric variables: microaggregation

Anonymisation of the Employee Data
• Re-identification risk for employees is negligible
• But: the information about the employees can increase the re-identification risk for the corresponding company

Anonymisation of the Employee Data: Assumptions
• S is an employer characteristic, T an employee characteristic
• w.l.o.g.: S and T are categorical
• Strong association between S and T
• S is anonymised (coarsening of categories)
• The data intruder knows the marginal distribution of T for every category of the original S

Anonymisation of the Employee Data: Concept
• The data intruder compares the distribution of T (of a certain company) with the marginal distributions of the possible categories of S
• Characteristic category d of T:
  • Marginal distribution of a possible category x of S: the fraction of d exceeds a value f.
  • Distribution of d over the categories of S: the fraction of d in category x exceeds some threshold.
  • The fractions of d are not nearly equal for all possible categories of S.
→ Distribution differences of the characteristic categories are very informative

Anonymisation of the Employee Data: Conclusions
• Fraction of correct assignments → new risk estimation for companies
• Re-identification risk too high → combine characteristic categories

German Structure of Earnings Survey: Methodology
• Sample survey in industry and services; every four years
• Information about earnings and activities of employees
• The survey is conducted in all EU countries to provide comparable information all over Europe
• For the year 2001 in Germany: about 22,000 companies with data on over 845,000 employees

German Structure of Earnings Survey: Anonymisation (Employer)
• Coarsening of economic sectors (around 40 groups) and regional information (5 regions)
• Microaggregation of the number of employees if the company has 500 or more employees

Table 1. Measures of association between the economic sector and the attributes of the employees

Attribute of the employee                      Measure of association
Sex                                            0.044
Wage Tax Class                                 0.045
Allowance for Children                         0.043
Position in Job                                0.119
Education                                      0.071
Type of Contract of Employment                 0.022
Occupation Class (2-digit)                     0.659
Paid Working Hours                             0.003
Total Gross Earnings in Accounting Period      0.041
Extra Pay for Shift Work                       0.077
Extra Pay for Night Work                       0.113
Income Tax                                     0.035
Pension and Unemployment Insurance             0.048
Health and Care Insurance                      0.055
Gross Annual Earnings                          0.035
Supplementary Grants in the Reporting Year     0.051
Net Annual Earnings                            0.033
Holiday Entitlement                            0.003
Net Earnings in Accounting Period              0.043

German Structure of Earnings Survey: Anonymisation

Table 2. Fractions of the most frequent occupation classes: Drapery / Clothing Trade

Occupation class                                   Fraction
Textile Fabricators                                19.3 %
Office Workers                                     14.6 %
Textile Producers                                   9.3 %
Product Inspectors, Shipping Finalisers             6.3 %
Technicians                                         5.5 %
Product Traders                                     5.5 %
Storekeepers, Warehousemen, Transport Workers       4.4 %
Spinning Occupations                                4.4 %
Textile Refiners                                    3.0 %

Table 3. Fractions of the most frequent occupation classes: Leather Industry

Occupation class                                   Fraction
Leather Producers, Leather and Coat Fabricators    47.4 %
Office Workers                                     16.9 %
Product Traders                                     4.8 %
Technicians                                         3.9 %
Entrepreneurs, Organisers, Accountants              3.1 %
Plastics Fabricators                                3.1 %
Storekeepers, Warehousemen, Transport Workers       3.0 %

Characteristic Occupations

Conclusion
• Procedure looks promising
• Further testing is required
• Future perspective: automation of the procedure

Thank you for your attention!

Contact
Research Data Centre of the Statistical Offices of the Länder
Hans-Peter Hafner
[email protected]

University of Applied Sciences Mainz, Dept. I
Rainer Lenz
[email protected]

www.forschungsdatenzentrum.de
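
Illustration for the "Concept" and "Conclusions" slides: a minimal Python sketch of how a data intruder might pick out characteristic categories of T and use them to assign a company to an original category of S. The marginal distributions are taken from Tables 2 and 3. Everything else is an assumption made for illustration and is not the authors' procedure: the function names, the threshold values `f` and `share`, the treatment of unlisted occupations as 0 %, and the L1 distance used for the comparison.

```python
# Marginal distributions of T (occupation class, in percent) for two original
# categories of S (economic sector), taken from Tables 2 and 3.
# Rows with smaller fractions are omitted for brevity.
# Assumption: occupations not listed for a sector are treated as 0 %.
MARGINALS = {
    "Drapery / Clothing Trade": {
        "Textile Fabricators": 19.3,
        "Office Workers": 14.6,
        "Textile Producers": 9.3,
        "Product Inspectors, Shipping Finalisers": 6.3,
        "Technicians": 5.5,
        "Product Traders": 5.5,
    },
    "Leather Industry": {
        "Leather Producers, Leather and Coat Fabricators": 47.4,
        "Office Workers": 16.9,
        "Product Traders": 4.8,
        "Technicians": 3.9,
    },
}


def characteristic_categories(marginals, f=5.0, share=0.8):
    """Return, per category x of S, the categories d of T that are characteristic.

    d is kept for x if its fraction within x exceeds f (percent) and x accounts
    for more than `share` of the summed fractions of d over all candidates.
    Both thresholds are hypothetical; the slides do not state values.
    """
    result = {}
    for x, dist in marginals.items():
        chosen = set()
        for d, frac in dist.items():
            total = sum(m.get(d, 0.0) for m in marginals.values())
            if frac > f and frac / total > share:
                chosen.add(d)
        result[x] = chosen
    return result


def guess_category(company, marginals, characteristic):
    """Assign a company to the candidate category of S whose marginal
    distribution is closest on the characteristic categories (L1 distance)."""
    relevant = set().union(*characteristic.values())

    def distance(x):
        return sum(
            abs(company.get(d, 0.0) - marginals[x].get(d, 0.0)) for d in relevant
        )

    return min(marginals, key=distance)


def fraction_correct(companies, marginals, characteristic):
    """Share of (true_category, distribution) pairs assigned correctly."""
    hits = sum(
        guess_category(dist, marginals, characteristic) == truth
        for truth, dist in companies
    )
    return hits / len(companies)
```

With these hypothetical thresholds, "Office Workers" is not characteristic for either sector because its fractions are nearly equal in both, whereas the leather and textile occupations are. The value returned by `fraction_correct` corresponds to the "fraction of correct assignments" that the Conclusions slide proposes as a new risk estimate; if it is too high, characteristic categories of T would be combined and the check repeated.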
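
Illustration for the employer anonymisation slide: a minimal Python sketch of univariate microaggregation applied only to companies with 500 or more employees. The slides do not say which microaggregation variant or group size was used, so the fixed-size grouping, the group size `k=3`, and the function name `microaggregate` are assumptions for illustration.

```python
def microaggregate(values, k=3, threshold=500):
    """Replace each value >= threshold by the mean of its group.

    Values at or above the threshold are sorted and split into consecutive
    groups of k; a short final group is merged into the previous one so that
    every group has at least k members whenever at least k values qualify.
    Values below the threshold are returned unchanged.
    """
    result = [float(v) for v in values]
    # Indices of qualifying values, ordered by value.
    idx = sorted(
        (i for i, v in enumerate(values) if v >= threshold),
        key=lambda i: values[i],
    )
    groups = [idx[s:s + k] for s in range(0, len(idx), k)]
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result


# Hypothetical employee counts for eight companies.
employees = [120, 510, 2000, 530, 560, 1500, 1000, 499]
anonymised = microaggregate(employees)
```

In this example the two companies below the threshold keep their exact counts, while the six larger companies are replaced by two group means. Group means preserve the overall total, which is one reason microaggregation is often preferred for metric variables.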