Integration of Surveys and Administrative Data

ESSnet Statistical Methodology – area ISAD
Executive summary of the project
WP1, WP2, WP3 reports and WP4 activities
1. Introduction
2. Statistical methodologies for the integration of surveys and administrative data
2.1 Record linkage
2.2 Statistical matching
2.3 Microintegration processing
3. Report of WP1 – State of the art of methodologies for integration of surveys and administrative data
4. Report of WP2 – Recommendations on the use of methodologies for the integration of surveys and administrative data
5. Report of WP3: Software Tools for Integration Methodologies
6. Activities of WP0 and WP4
6.1 Project website
6.2 ESSnet course on integration of surveys and administrative data
6.3 ESSnet workshop on integration of surveys and administrative data
7. Open issues
1. Introduction
The project "ESSnet Statistical Methodology -- Area ISAD" (Integration of Survey and
Administrative Data), carried out by the NSIs of Austria, the Czech Republic, Italy, the
Netherlands and Spain, aims at promoting the knowledge and practical application of
sound methodologies for the joint use of existing data sources in the production of
official statistics. The NSIs in the ESS face growing duties of data supply and
dissemination, together with budgetary constraints and increasing public concern about
data privacy and the response burden on respondent units (individuals, households or
enterprises). The integration of different data sources, especially data previously
produced for administrative purposes, seems to be a cheap and reliable alternative.
Integration of surveys and administrative data is a relatively new topic compared with
other areas of official statistics such as sampling, statistical disclosure control, editing
and imputation. As a result, there is still a lack of common methodological
background in this area. Problems related to the joint use of surveys and administrative
data risk being tackled by ad hoc procedures, which makes it difficult to assess the
optimality of the adopted procedure and the quality of the results.
This project focuses on the statistical methodologies for the integration of surveys and
administrative data. The word “statistical” is the key concept of this project. In fact,
techniques which do not consider a statistical objective, i.e. the estimation of a
statistical distribution or of some of its characteristics from one or more data sets, have
not been taken into account. From this point of view, IT technologies and
infrastructures for the integration of databases were not considered in the project.
Instead, the project focuses on the statistical problems related to the integration of
different data sources, and on the corresponding methodologies for their solution. The
statistical problems are essentially the following three:
1. Probabilistic record linkage: i.e. identification of pairs of records in different
data sources that refer to the same individual/unit, when individual/unit
identifiers may be affected by errors;
2. Statistical matching: i.e. joint analysis of variables never jointly observed but
available in different samples without units in common;
3. Microintegration processing: i.e. all those actions (such as checks, editing and
imputation procedures to get better estimates) that ensure better quality and
timeliness of the integrated data sets.
The project aims at developing actions that could enhance knowledge and
understanding of the above mentioned methodologies in the ESS. The activities are split
into different workpackages (WP):
WP1 (subcoordinator CBS): assessment of the state of the art on the application and
development of the methodologies. This WP includes the definition of the above
mentioned problems as statistical problems, in order to describe a framework for
identifying optimal solutions under the statistical point of view for probabilistic record
linkage, statistical matching and microintegration processing.
WP2 (subcoordinator INE Spain): Description of the phases/steps to perform in order to
tackle the above mentioned problems and list of recommendations on the use of the
methodologies.
WP3 (subcoordinator CZSO): List of the available software tools, description of their
contents and limits.
WP4 (subcoordinator STAT): Events for spreading knowledge and understanding of the
methodologies used to tackle the three above mentioned problems. During the project
activities, two events were organized:

- Course on integration of surveys and administrative data
- Workshop on integration of surveys and administrative data
Istat is the scientific coordinator and is responsible for all the actions needed to
coordinate the project members. Furthermore, in order to spread knowledge, Istat
developed and maintained the project web page.
All the project members jointly contributed to the successful development of each WP.
In the following paragraphs, the three statistical problems (focus of this project) are
briefly introduced. Afterwards, the contents of the WP reports and activities are
summarized.
2. Statistical methodologies for the integration of surveys and
administrative data
2.1. Probabilistic record linkage
Probabilistic record linkage consists of identifying pairs of records, coming from either
the same or different data files, which belong to the same entity, on the basis of the
agreement between common indicators.
The previous figure is taken from Fortini et al. (2006) [1] and shows record linkage of two
data sets A and B. Links aim at connecting records belonging to the same unit by
comparing some indicators (name, address, telephone number). Some agreements may
not be perfect (as for the telephone number of the first record of the left data
set and the third record of the right data set), even though the records still belong to
the same unit.
This problem is important in many applications, such as:
1. study of the relationships between variables collected on the same individuals
but coming from different sources
2. elimination of duplicates from data sets
3. development and management of registers.
Record linkage is also a pervasive technique in business contexts, for instance for the
development of information systems for customer relationship management, or for the
collection of information useful for marketing purposes. Recently, public institutions
have also shown an increasing interest in it for e-government applications.
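As an illustration of how agreement between indicators can be turned into a linkage decision, the following minimal Python sketch scores record pairs with log-likelihood-ratio weights in the spirit of the Fellegi and Sunter model. The m- and u-probabilities, the two thresholds and the example records are all hypothetical values chosen for illustration; in practice they would have to be estimated from the data.

```python
import math

# Hypothetical m- and u-probabilities for each comparison field:
# m = P(field agrees | records refer to the same unit)
# u = P(field agrees | records refer to different units)
M_U = {
    "name": (0.95, 0.01),
    "address": (0.90, 0.05),
    "phone": (0.85, 0.001),
}

def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum of log2 likelihood ratios over the comparison fields."""
    weight = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            weight += math.log2(m / u)              # agreement raises the weight
        else:
            weight += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return weight

def classify(weight, lower=0.0, upper=6.0):
    """Three-way decision: link, possible link (clerical review), non-link.
    The thresholds are illustrative assumptions, not recommended values."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

# Made-up records: b differs from a only by a typo in the phone number.
a = {"name": "Maria Rossi", "address": "Via Roma 1", "phone": "06-555-0101"}
b = {"name": "Maria Rossi", "address": "Via Roma 1", "phone": "06-555-0110"}
c = {"name": "Luca Bianchi", "address": "Via Po 7", "phone": "06-555-0199"}

for other in (b, c):
    w = match_weight(a, other)
    print(round(w, 2), classify(w))
```

With these assumed parameters, the imperfect agreement on the telephone number lowers the weight of the pair (a, b) but does not prevent it from being classified as a link.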
2.2. Statistical matching
Statistical matching (also known as data fusion or synthetical matching) refers to a group
of methods whose objective is the integration of two (or more) data sources (samples)
drawn from the same target population. The data sources share a subset of variables
(the common variables), while each source separately observes other subsets of
variables. Moreover, there is a negligible chance that data in different sources observe
the same units (disjoint sets of units).
[1] Fortini M., Scannapieco M., Tosco L., Tuoto T. (2006). “Towards an Open Source Toolkit for Building
Record Linkage Workflows”. Proceedings SIGMOD 2006 Workshop on Information Quality in
Information Systems (IQIS’06), Chicago, USA, 2006.
As a result, statistical matching differs from record linkage both in the inputs (i.e. the
data sets to be integrated) and in the outputs.
As far as the input is concerned, the data sets are usually two distinct samples without
any unit in common (no overlap between the data sources). Record linkage, by contrast,
requires at least a partial overlap between the two sources.
In the simplest case of two samples, the classical statistical matching framework can be
represented in the following manner (see also Kadane, 1978 [2]; D’Orazio et al., 2006 [3]):
Data source A:   Y   X
Data source B:       X   Z
In this situation X is the set of common variables, Y is observed only in A but not in B
and Z is observed in B but not in A (Y and Z are not jointly observed).
The second difference between record linkage and statistical matching is the output. In
record linkage, the objective is to recognize records belonging to the same unit in two
distinct but partially overlapping data sets. For this reason the focus is only on the X
variables and on how to deal with the possibility that these variables are recorded with
error.
Statistical matching methods, by contrast, aim at integrating the two sources in order
to study the relationships between the two sets of variables not jointly observed,
i.e. Y and Z or, more generally, to study how X, Y and Z are related. This objective can
be achieved by using two seemingly distinct approaches:
1. Imputation procedures; the result is a complete but synthetic data set.
2. Estimation procedures on a partially observed data set; the result is the estimate
of contingency tables, or parameters on the association of the never jointly
observed variables.
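The imputation approach can be sketched with a nearest-neighbour (distance hot deck) procedure, one of several possible imputation methods. In the minimal Python example below, the variable names (age as X, income as Y, spending as Z) and all data values are made up for illustration. Because donors are chosen only on the basis of X, the resulting synthetic file implicitly relies on the conditional independence of Y and Z given X.

```python
# Data source A observes (X, Y); data source B observes (X, Z).
# All values below are invented for illustration.
A = [
    {"age": 25, "income": 1200},
    {"age": 41, "income": 2100},
    {"age": 63, "income": 1700},
]
B = [
    {"age": 23, "spending": 900},
    {"age": 44, "spending": 1500},
    {"age": 60, "spending": 1300},
    {"age": 36, "spending": 1400},
]

def nearest_neighbour_match(recipient, donor, x="age", z="spending"):
    """Impute z into each recipient record from the donor closest on x."""
    fused = []
    for rec in recipient:
        closest = min(donor, key=lambda d: abs(d[x] - rec[x]))
        fused.append({**rec, z: closest[z]})
    return fused

# A is the recipient file, B the donor file; the result is a complete
# but synthetic data set containing X, Y and Z.
synthetic = nearest_neighbour_match(A, B)
for row in synthetic:
    print(row)
```

The roles of recipient and donor are a design choice: here every record of A receives a value of Z, so the synthetic file has the size of A.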
[2] Kadane J.B. (1978). “Some statistical problems in merging data files”. 1978 Compendium of Tax
Research, Office of Tax Analysis, Department of the Treasury, pp. 159-171. Washington, DC: U.S.
Government Printing Office. Reprinted in Journal of Official Statistics (2001), 17, 423-433.
[3] D’Orazio M., Di Zio M., Scanu M. (2006). Statistical Matching: Theory and Practice. Wiley,
Chichester.
2.3. Microintegration processing
Microintegration processing consists of putting in place all the actions needed
to ensure better quality and timeliness of the matched files. It includes defining
checks, editing and imputation procedures to get better
estimates, etc. It should be kept in mind that some data sources are more reliable than
others. Some sources have a better coverage than others, and there may even be
conflicting information between sources. So, it is important to recognize the strong and
weak points of all the data sources used.
Since there are differences between sources, a micro integration process is needed to
check data and adjust incorrect data. It is believed that integrated data will provide far
more reliable results, because they are based on an optimal amount of information. Also
the coverage of (sub) populations will be better, because when data are missing in one
source, another source can be used. Another advantage of integration is that users of
statistical information will get one figure on each social phenomenon, instead of a
confusing number of different figures depending on which source has been used.
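One simple way to picture a microintegration rule is a priority ordering of the sources: for each unit, the value is taken from the most reliable source in which it is not missing, and disagreements between sources are flagged for checking. The Python sketch below is only an illustration under that assumption; the source names, their ranking and the example record are hypothetical, and real microintegration processes involve far richer checks and edit rules.

```python
# Hypothetical ranking of sources, most reliable first.
PRIORITY = ["population_register", "tax_records", "survey"]

def integrate(unit_records, priority=PRIORITY):
    """Return (value, source) from the most reliable source that has a value."""
    for source in priority:
        value = unit_records.get(source)
        if value is not None:
            return value, source
    return None, None

def conflicts(unit_records):
    """True if the sources with a non-missing value disagree."""
    observed = {v for v in unit_records.values() if v is not None}
    return len(observed) > 1

# Made-up employment status of one person as recorded in three sources.
person = {
    "population_register": None,   # missing in the most reliable source
    "tax_records": "employed",
    "survey": "unemployed",
}
print(integrate(person), conflicts(person))
```

The rule yields one figure per unit and fills gaps in one source with another, while the conflict flag marks the records that need editing.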
3. Report of WP1 – State of the art of methodologies for integration
of surveys and administrative data
The objective of this document is to provide a complete and updated overview of the
state of the art of the methodologies regarding integration of different data sources. The
different NSIs (within the ESS) can refer to this unique document if they need to:
1. define a problem of integration of different sources according to the
characteristics of the data sets to integrate;
2. discover the different solutions available in the statistical literature;
3. understand which problems still need to be tackled, and motivate the research on
these issues;
4. look at the characteristics of many different projects that needed the integration
of different data sources.
This document consists of five chapters that can be broadly clustered in two groups.
The first three chapters are mainly methodological. They describe the state of the art
respectively for i) probabilistic record linkage, ii) statistical matching, and iii) micro
integration processing. Each chapter is indeed a collection of references. This part of the
document is intended as a tool enabling orientation through the wide amount of papers
on different integration methodologies. This aspect should not be considered as a
secondary issue in the production of official statistics. The main problem is that
methodologies for the integration of different sources are, most of the time, still in their
infancy. By contrast, the current information needs of official statistics require an
increasingly sophisticated use of multiple sources for the production of statistics.
Whoever is in charge of a project on integration of different sources must be conscious
of all the available alternatives and should be able to justify the chosen method.
The last two chapters are an overview of integration experiences in the ESS. Chapter 4
collects detailed information on many different projects that need a joint use of two or
more sources in the participating NSIs of this ESSnet. Chapter 5 illustrates the results of
a survey on the use and/or development of integration methodologies among EU MS
and other countries, carried out as a part of this project. The results showed that at least
19 countries from the ESS are engaged in one or more data integration activities. This is
mainly due to new regulations at the EU level, new demands for data from policy makers
and growing pressure to facilitate researchers' access to official data. However, NSIs
must often start using administrative datasets without any previous experience.
A general conclusion of the WP1 report is that statistical methodologies aiming at the
integration of different datasets have the following advantages:
- They provide a methodology for data integration on a probabilistic basis, which
  guarantees minimum standards in quality and comparability.
- Since the output consists of a new dataset at the micro level, the information
  available is useful for all kinds of final users, whether policy makers or researchers.
- The stock of probabilistic rules and models that are part of these integration
  technologies provides a method for processing data on an automated, regular,
  large-scale and inexpensive basis.
4. Report of WP2 – Recommendations on the use of methodologies
for the integration of surveys and administrative data
The implementation of the procedures described in theory in the WP1 report requires
attention to some further aspects that cannot be developed in scientific papers.
This document provides some recommendations and hints to consider during the
application of the probabilistic record linkage, statistical matching and microintegration
processing procedures. It cannot yet be considered as a set of practical guidelines. In
fact, research is still needed in order to assess optimal choices and hence guidelines on
the application of the procedures.
Nevertheless, the experience already gained by the ESSnet partners makes it possible to
1. highlight the practical aspects to tackle in an applied context,
2. suggest hints that successfully worked in some applications,
3. warn about possible flaws and misinterpretations.
This report essentially reviews these three issues. These issues are introduced by a flowchart of the practical steps to perform, which describes the overall process of data
integration performed by probabilistic record linkage or statistical matching
methodologies.
The document consists of 5 chapters.
The first chapter introduces a preliminary but crucial step in every integration project:
the necessity to harmonize definitions and concepts of the different data sources. The
next three chapters review the practical aspects to deal with in an integration process.
Some common aspects are the following:
1. The criteria for choosing the most appropriate model variables for the
identification of records are introduced.
2. Since these matching procedures rely on probabilistic models, and therefore
some steps of the decision-making process are not completely automated, some
hints on the most common available solutions and principles to be taken into
account are set out.
3. Once information from different data sources has been linked, the report shows
how to manage and arrange the datasets generated in the process.
The last chapter details some important applications in the ESSnet participating
countries.
The implementation of these methodologies, together with the observance of the set of
recommendations provided in WP2, can result in a marked reduction of the burden and
resource consumption associated with large statistical operations such as the next
round of Censuses.
5. Report of WP3: Software Tools for Integration Methodologies
Software tools for data integration consist of
1. record linkage tools,
2. statistical matching tools,
3. commercial software tools for data quality and record linkage in the process of
microintegration.
The document is organized in three chapters.
The first chapter is on software tools for record linkage. On the basis of the underlying
research paradigm, three major categories of record linkage tools can be identified:

- Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter
  model (Fellegi and Sunter, 1969 [4]).
- Tools for empirical record linkage, which are mainly focused on performance
  issues and hence on reducing the search space of the record linkage problem by
  means of algorithmic techniques such as sorting, tree traversal, neighbour
  comparison, and pruning.
- Tools for knowledge-based linkage, in which domain knowledge is extracted
  from the files involved and reasoning strategies are applied to make the decision
  process more effective.
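The idea of reducing the search space by sorting and neighbour comparison can be illustrated with the sorted neighbourhood technique, one well-known example of this family. The Python sketch below is a minimal illustration, not the method of any specific tool: records are sorted on a key and only records that fall within a sliding window are proposed as candidate pairs. The record layout, the sorting key and the window size are assumptions made for the example.

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Yield candidate pairs of record ids that lie within a sliding
    window of the given size after sorting on the key."""
    ordered = sorted(records, key=key)
    for i, rec in enumerate(ordered):
        # Compare each record only with the next (window - 1) records.
        for other in ordered[i + 1 : i + window]:
            yield rec["id"], other["id"]

# Made-up records; some surnames contain typing errors.
records = [
    {"id": 1, "surname": "Rossi"},
    {"id": 2, "surname": "Bianchi"},
    {"id": 3, "surname": "Rosi"},
    {"id": 4, "surname": "Verdi"},
    {"id": 5, "surname": "Bianci"},
]

pairs = list(sorted_neighbourhood_pairs(records, key=lambda r: r["surname"], window=2))
print(pairs)
```

With five records a full comparison would examine 10 pairs, while a window of size 2 proposes only 4. The price is that true matches whose keys sort far apart are never compared, which is why the choice of key and window size matters.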
In such a variety of proposals, this document restricts the attention to the record linkage
tools that have the following characteristics:

- They have been explicitly developed for record linkage;
- They are based on a probabilistic paradigm.
Two sets of comparison criteria were used for comparing several probabilistic record
linkage tools. The first one considers general characteristics of the software: cost of the
software; domain specificity (i.e. whether the tool was developed ad hoc for a specific
type of data and applications); maturity (or level of adoption, i.e. frequency of usage,
where available, and number of years the tool has been around). The second set considers
which functionalities are performed by the tool: preprocessing/standardization; profiling;
comparison functions; decision method.
[4] Fellegi I. P., Sunter A. B. (1969). “A theory for record linkage”. Journal of the American Statistical
Association, 64, 1183-1210.
Chapter 2 deals with software tools for statistical matching. Software solutions for
statistical matching are not as widespread as in the case of record linkage, because
statistical matching projects are still quite rare in practice. Almost all the applications
are conducted by means of ad hoc codes. Sometimes, when the objective is micro (a
complete synthetic data set), it is possible to use general purpose imputation software
tools. On the other hand, if the objective is macro (estimates of parameters on the
variables never jointly observed), it is possible to adopt general statistical analysis
tools which are able to deal with missing data.
In this chapter, the available tools explicitly devoted to statistical matching purposes
were reviewed. Only one of them (SAMWIN) can be used without
any programming skills, while the others are software codes that can be used only by
those with knowledge of the corresponding language (R, S-Plus, SAS) as well as a
sound knowledge of statistical methodology.
The criteria used for comparing the software tools for statistical matching were slightly
different from those for record linkage. The general criteria were restricted to costs,
domain specificity and maturity of the software tool. As far as the software functionalities
are concerned, the focus was on: i) the inclusion of pre-processing and standardization
tools; ii) the capacity to create a complete and synthetic data set by the fusion of the two
data sources to integrate; iii) the capacity to estimate parameters on the joint distribution
of variables never jointly observed; iv) the assumptions on the model of the variables of
interest under which the software tool works (the best known is the conditional
independence assumption of the variables not jointly observed given the common
variables in the two data sources); v) the presence of any quality assessment of the
results.
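For reference, the conditional independence assumption mentioned in point iv) can be written as follows, using the notation of Section 2.2 (X for the common variables, Y observed only in A, Z observed only in B) and denoting by f a generic density or probability function:

```latex
f(x, y, z) = f(y \mid x)\, f(z \mid x)\, f(x)
```

Under this assumption each factor can be estimated from the available data: f(y | x) from source A, f(z | x) from source B, and f(x) from either source or from both.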
Furthermore, the software tools are compared according to the implemented
methodologies. Strengths and weaknesses of each software tool are highlighted at the
end.
Chapter 3 focuses on commercial software tools for data quality and record linkage in
the process of microintegration. Vendors in the data quality market are often
classified according to their overall position in the IT business, where knowledge of
and experience in a specific business domain play an important role.
Quality of vendors and their products on the market are characterized by: i) product
features and relevant services; ii) vendor characteristics, domain business
understanding, business strategy, creativity, innovation; iii) sales characteristics,
licensing, prices; iv) customer experience, reference projects; v) data quality tools and
frameworks.
The software vendors of tools in the statistics oriented “data quality market” propose
solutions addressing all the tasks in the entire life cycle of the data oriented management
programs and projects: data preparation, survey data collection, improving of quality
and integrity, setting up for reports and studies, etc.
According to the software/application category, the tools that perform or support
data oriented record linkage projects in statistics should have several common
characteristics:
1. portability, i.e. the ability to function with statistical researchers' current
arrangement of computer systems and languages,
2. flexibility in handling different linkage strategies, and
3. low operational expenses, i.e. low costs in terms of TCO (Total Cost of Ownership)
parameters and of both computing time and researchers' effort.
In this chapter the evaluation focused on three commercial software packages which,
according to their data quality scoring position in Gartner reports (the so called “magic
quadrants” available on the web page http://www.gartner.com), belong to important
vendors in this area. The three vendors are: Oracle (which represents the convergence of
tools and services in the software market), SAS/DataFlux (a data quality, data integration
and BI (Business Intelligence) player on the market) and Netrics (which offers
advanced technology complementing the Oracle data quality and integration tools).
The set of comparison tables was prepared according to the following structure: linkage
methodology, data management, post-linkage function, standardization, costs and
empirical testing, case studies and examples.
6. Activities of WP0 and WP4
In order to spread knowledge and understanding on the statistical methodologies for
integration of surveys and administrative data, the ESSnet-ISAD project organized:
1. a project website (developed under WP0),
2. a course on statistical methods for the integration of surveys and administrative
data (developed under WP4),
3. a workshop on statistical methods for the integration of surveys and
administrative data (developed under WP4).
6.1. The project website
The project website is available at the internet address http://cenex-isad.istat.it.
The first page of the website introduces the project activities and its members. The
website contains the following links.
Public area - In the public area there is the agenda of the project; the list of links
related to record linkage, statistical matching and microintegration processing; the
documents already produced by the project members; the possibility to add special
announcements; the possibility to establish a forum of people interested in the topic.
ESSnet course on integration of surveys and administrative data (ISAD) – In this
area it is possible to find the documents, presentations and material used for the ESSnet
ISAD course held in Budapest (14-16 November 2007).
ESSnet workshop on integration of surveys and administrative data (ISAD) – In
this area it is possible to find the documents, presentations, slides and material used for
the ESSnet ISAD workshop held in Vienna (29-30 May 2008).
6.2. ESSnet course on integration of surveys and administrative data
This course aimed at providing the state of the art of the methodologies for record
linkage, statistical matching and micro integration processing. The course was a
combination of teaching sessions, where the theory was explained, and software
sessions, complemented with practical exercises. Each topic focused on one main
application.
The course was structured into three days, each day focused respectively on
probabilistic record linkage, statistical matching and microintegration processing.
Course material is available on the project website. Comments on the course contents,
material and features were gathered at the end of the course. The comments were
generally positive.
The course was open to all countries. The organizers aimed at having attendees from as
many different ESS countries as possible.
The course was attended by 30 people, from 18 countries (including one Eurostat
member), according to the following table:
Country        Participants
Austria        2
Bulgaria       1
Croatia        1
Cyprus         1
Czech Rep      2
Eurostat       1
Finland        1
Greece         1
Hungary        5
Latvia         1
Netherlands    1
Poland         2
Slovakia       1
Slovenia       1
Spain          2
Sweden         2
Switzerland    1
Turkey         1
UK             2
6.3. ESSnet workshop on integration of surveys and administrative
data
One of the project activities was a workshop on Integration of Surveys and Administrative
data, held in Vienna, 29-30 May 2008.
The aims of the workshop were to:
1. disseminate the ESSnet-ISAD project results in the ESS;
2. allow the different experiences in the ESS MS to be presented and discussed;
3. gather researchers from the ESS countries interested in the topic, and create a
community of experts;
4. present and compare new insights and methodological solutions from ESS NSIs
and key academic researchers.
The workshop consisted of 7 sessions:
Session 1    ESSnet ISAD results
Session 2    Record linkage
Session 3    Statistical matching and forecasting
Session 4    Conceptual aspects for integration
Session 5    Integration of registers and samples 1
Session 6    Integration of registers and samples 2
Session 7    Register based statistics
The workshop was open to all the countries. The organizers aimed at having attendees
from as many different ESS countries as possible. The workshop was attended by 59
invited attendees, from 24 countries (plus Eurostat personnel) according to the
following table.
Countries
Austria
Belgium
Bulgaria
Croatia
Czech Republic
Estonia
Eurostat
Finland
Germany
Hungary
Ireland
Italy
Latvia
Lithuania
Luxembourg
Netherlands
Poland
Romania
Slovenia
Spain
Sweden
Switzerland
Turkey
UK
Total
Participants Speakers
12
2
1
1
4
3
1
2
1
3
1
1
9
1
2
1
2
2
1
2
2
1
2
1
3
59
1
1
1
1
1
4
2
2
1
1
1
1
3
22
Workshop material is available on the project website. The workshop papers will be
published in an official Eurostat publication.
7. Conclusions and open issues
This project was the first opportunity to provide the ESS MS a common basis for the
application of statistical methodologies whose aim is the integration of surveys and
administrative data.
This objective was tackled by analyzing the available statistical literature on the subject
and the practical experiences already performed. This analysis allowed the definition of
workflows on the application of the statistical methodologies under study. Furthermore,
recommendations on the application of the methodologies have been gathered.
There are still open issues on the topic. The ESS could benefit from the joint efforts of
different NSIs in tackling these open issues, which are summarized below:
1. how to solve conflicts between the timing of events in different data sources;
differences between the observed population and the target population;
populations changing over time, and dynamic registers; coverage errors;
differences in unit definitions in different data sources;
2. how to obtain consistent estimates from integrated surveys and administrative
data;
3. how to analyse the integrated data sets taking into account possible linkage
errors;
4. whether best practices in the editing and imputation area can also be used
for the integration of data sources.
Furthermore, close collaboration between ESS MS is advisable for the definition of
common, harmonised and open source software tools.