Methods for Data Integration
Ton de Waal
14 March, 2017
Overview of project
• Project “Estimation methods for the integration of administrative
sources”
• Aim of the project: identifying and presenting statistical methods
for the integration of administrative data into a statistical
production system
• April 2016 – (end of) March 2017
• Part of ESS.VIP Admin Project (Work package 2 “Statistical
methods”, Lot 1 “Methodological support”)
• Sogeti is main contractor
• Experts from ISTAT, Statistics Netherlands and University of
Southampton
Overview of project
• Task 1: Specify usages of admin data
• Task 2: Identification and description of statistical tasks where using
estimation methods can be envisaged in order to integrate
administrative sources
• Task 3: Comprehensive identification and enumeration of possible
estimation methods that could be used for cases identified in Task 2
• Task 4: Literature review presenting actual examples for types of
usage and tasks identified in Task 2 or 3
• Task 5: Methods description
• Task 6 & 7: Final presentation and report
Task 1: Usages of admin data: Direct
• Direct Tabulation: Admin data used to produce statistics without
resorting to any statistical data.
‒ Exploiting only one administrative data source
‒ Exploiting multiple administrative data sources
• Substitution and supplementation for direct collection: Admin data
directly used as input but are not sufficient for achieving all
objectives
‒ Split-population approach: the population is split into two or more
parts. Admin data are used for units where these data are of
sufficient quality, and statistical sources are used for the remainder
of the units
‒ Split data approach: administrative data are used to provide some of
the variables for all population units
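As an illustration of the split-population approach, the sketch below adds a total taken directly from admin data to a weighted survey total for the remaining units. The function name, the data layout and the numbers are hypothetical and not taken from the project; it only shows the basic arithmetic under the assumption that the two parts do not overlap.

```python
def split_population_total(admin_values, survey_sample):
    """Estimate a population total from two non-overlapping parts.

    admin_values: values for units whose admin data are of sufficient quality.
    survey_sample: (weight, value) pairs sampled from the remaining units.
    Both layouts are assumptions made for this illustration.
    """
    admin_part = sum(admin_values)                       # counted directly
    survey_part = sum(w * y for w, y in survey_sample)   # weighted estimate
    return admin_part + survey_part


# Hypothetical figures, for illustration only.
total = split_population_total(
    admin_values=[100, 200, 300],
    survey_sample=[(5, 10), (5, 20)],
)
print(total)
```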
Task 1: Usages of admin data: Indirect
• Creation and maintenance of registers and survey frames
‒ Identification of frame units and their connections to population elements
‒ Identification of classification and auxiliary variables
• Editing and imputation
‒ Construction of edit rules
‒ Construction of models to find errors in data
‒ Auxiliary data to construct imputation models
• Indirect estimation
‒ Creation of population benchmarks to be used for calibration
‒ Use administrative data in a predictive setting
‒ Estimation where administrative and statistical data are used on an equal footing
• Data validation/confrontation
‒ Validation of survey estimates and/or other administrative data sources
‒ Address quality issues
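The use of admin-based population benchmarks for calibration can be illustrated with post-stratification, one of the simplest calibration techniques: survey weights are rescaled so that the weighted counts per group reproduce the benchmark counts. This is a minimal sketch, not the method used in the project; the record layout, the group labels and all numbers are assumptions.

```python
from collections import defaultdict


def poststratify(records, benchmarks):
    """Rescale weights so weighted counts per group match the benchmarks.

    records: list of dicts with keys 'group', 'weight', 'y' (assumed layout).
    benchmarks: dict mapping group -> population count from an admin source.
    """
    weighted_counts = defaultdict(float)
    for r in records:
        weighted_counts[r["group"]] += r["weight"]

    calibrated = []
    for r in records:
        factor = benchmarks[r["group"]] / weighted_counts[r["group"]]
        calibrated.append({**r, "weight": r["weight"] * factor})
    return calibrated


def weighted_total(records, key="y"):
    """Weighted total of the variable `key`."""
    return sum(r["weight"] * r[key] for r in records)


# Hypothetical survey records and admin benchmark counts.
survey = [
    {"group": "A", "weight": 10.0, "y": 1.0},
    {"group": "A", "weight": 10.0, "y": 0.0},
    {"group": "B", "weight": 20.0, "y": 1.0},
]
admin_counts = {"A": 30.0, "B": 10.0}

calibrated = poststratify(survey, admin_counts)
print(weighted_total(survey), weighted_total(calibrated))
```

After calibration the weights in each group sum to the admin count for that group, so estimates for the survey variable inherit the benchmark population structure.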
Task 2: Possible statistical tasks
• We have matched statistical tasks to usages by means of GSBPM
• Statistical tasks for using integrated administrative data
I. Data editing and imputation
II. Creation of joint statistical micro data
a) Data linkage: Identification of the set of unique units residing in multiple datasets
b) Statistical matching: Inference of joint distribution based on marginal observations
III. Alignment of statistical data
a) Alignment of units: Harmonization of relevant units, creation of target statistical units
b) Alignment of measurements: Harmonization of variables, derivation of target variables
IV. Multisource estimation at aggregated level
a) Population size estimation: multiple lists with imperfect coverage of target population
b) Univalent estimation: numerically consistent estimation of common variables
c) Coherent estimation: aggregates that relate to each other
Task 3: Possible estimation methods
I. Data editing and imputation
• Most methods usually applied for surveys can also be applied for
Admin data
• There are editing methods developed specifically for data obtained
through an integration process (micro-integration)
II. Creation of joint statistical microdata
• Identification of the set of unique units residing in multiple
datasets (data linkage), for example by probabilistic record linkage
• Inference of joint distribution based on marginal observations
(statistical matching)
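The slides do not say which linkage model is meant. A common formulation of probabilistic record linkage is the Fellegi-Sunter approach, in which each compared field contributes a log-likelihood-ratio weight depending on whether the two records agree on it. The sketch below assumes conditional independence between fields; the field names and the m- and u-probabilities are made up for illustration.

```python
import math


def match_weight(agreements, m_probs, u_probs):
    """Fellegi-Sunter style composite weight for one record pair.

    agreements: dict field -> True/False (do the two records agree?).
    m_probs: P(agree | records refer to the same unit), per field.
    u_probs: P(agree | records refer to different units), per field.
    All probabilities here are assumed values, not estimates from data.
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


m_probs = {"surname": 0.9, "birth_year": 0.8}
u_probs = {"surname": 0.1, "birth_year": 0.2}

w_agree = match_weight({"surname": True, "birth_year": True}, m_probs, u_probs)
w_mixed = match_weight({"surname": True, "birth_year": False}, m_probs, u_probs)
print(w_agree, w_mixed)
```

Pairs with a high weight would be treated as links and pairs with a low weight as non-links; where the thresholds lie, and how the m- and u-probabilities are estimated, is outside this sketch.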
Task 3: Possible estimation methods
III. Alignment of statistical data
• a) Alignment of units
• b) Alignment of measurements: recently, latent variable models
have been proposed
IV. Multisource estimation at aggregated level
a) Population size estimation: multiple lists with imperfect coverage of
population
b) Univalent estimation: numerically consistent estimation of common
variables
• Obtaining univalent estimates at cross-sectional level
• Obtaining univalent estimates at longitudinal level
c) Coherent estimation: aggregates that relate to each other in terms
of accounting equations
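For item a), the simplest textbook case is two lists and the dual-system (Lincoln-Petersen) estimator. The sketch below is illustrative only and rests on strong assumptions that the slides do not state: the lists are independent, linkage between them is error-free, the population is closed, and neither list contains units outside the target population. The counts are hypothetical.

```python
def dual_system_estimate(n_list1, n_list2, n_both):
    """Lincoln-Petersen estimate of population size from two lists.

    n_list1, n_list2: number of units found on each list.
    n_both: number of units found on both lists (after linkage).
    """
    if n_both <= 0:
        raise ValueError("the estimator needs at least one unit on both lists")
    return n_list1 * n_list2 / n_both


# Hypothetical counts, for illustration only.
estimate = dual_system_estimate(n_list1=800, n_list2=600, n_both=480)
print(estimate)
```

With more than two lists, or when the independence assumption is doubtful, log-linear models are the usual generalisation; they are not shown here.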
Task 4: Examples at NSIs
• We have focused on those examples that offer the most interesting
information
• We have given actual examples in NSIs for
• Direct tabulation
• Split data approach
• Indirect estimation
• Data validation
Task 4: Examples at NSIs
• Direct tabulation:
Use of probabilistic record linkage for statistics on victims and
injured people of road accidents (Istat)
• Split data approach:
Creation of a social policy simulation database by means of
statistical matching (Statistics Canada)
• Indirect estimation:
Use of repeated weighting and macro-integration for the
construction of the Dutch Population and Housing Census (Statistics
Netherlands)
• Data validation:
Estimation of classification errors in admin and survey variables on
home ownership (Statistics Netherlands)
Task 5: Methods description
We will describe
• Editing and imputation methods, including micro-integration
• Methods for creation of joint microdata, including probabilistic
linkage and statistical matching
• Methods for alignment of statistical data, including latent variable
models
• Methods for multisource estimation at aggregated level, including
multiple lists with imperfect coverage of population, and methods for
obtaining univalent estimates (at cross-sectional level and at
longitudinal level)
Thank you
• Thank you for your attention
• Any questions or comments?