ESSnet Statistical Methodology – area ISAD
Integration of Surveys and Administrative Data

Executive summary of the project WP1, WP2, WP3 reports and WP4 activities

1. Introduction
2. Statistical methodologies for the integration of surveys and administrative data
2.1 Record linkage
2.2 Statistical matching
2.3 Microintegration processing
3. Report of WP1 – State of the art of methodologies for integration of surveys and administrative data
4. Report of WP2 – Recommendations on the use of methodologies for the integration of surveys and administrative data
5. Report of WP3: Software Tools for Integration Methodologies
6. Activities of WP0 and WP4
6.1 Project website
6.2 ESSnet course on integration of surveys and administrative data
6.3 ESSnet workshop on integration of surveys and administrative data
7. Open issues

1. Introduction

The project "ESSnet Statistical Methodology -- Area ISAD" (Integration of Survey and Administrative Data), consisting of the NSIs of Austria, the Czech Republic, Italy, the Netherlands and Spain, aims to promote the knowledge and practical application of sound methodologies for the joint use of existing data sources in the production of official statistics.

The NSIs in the ESS face growing duties of data supply and dissemination at a time of budgetary constraints and increasing public concern about data privacy and the response burden on respondent units (individuals, households or enterprises). The integration of different data sources, especially data previously produced for administrative purposes, seems to be a cheap and reliable alternative.

Integration of surveys and administrative data is a relatively new topic compared with other areas of official statistics such as sampling, statistical disclosure control, and editing and imputation procedures. As a result, there is still a lack of common methodological background in this area.
Problems related to the joint use of surveys and administrative data therefore risk being tackled by ad hoc procedures, which makes it difficult to assess the optimality of the adopted procedure and the quality of the results.

This project focuses on the statistical methodologies for the integration of surveys and administrative data. The word “statistical” is the key concept of this project. Techniques which do not consider a statistical objective, i.e. the estimation of a statistical distribution or of some of its characteristics from one or more data sets, have not been taken into account. From this point of view, IT technologies and infrastructures for the integration of databases were not considered in the project. Instead, the project focuses on the statistical problems related to the integration of different data sources, and on the corresponding methodologies for their solution. The statistical problems are essentially the following three:
1. Probabilistic record linkage: i.e. identification of pairs of records in different data sources that refer to the same individual/unit, when individual/unit identifiers may be affected by errors;
2. Statistical matching: i.e. joint analysis of variables never jointly observed but available in different samples without units in common;
3. Microintegration processing: i.e. all those actions (such as checks, editing and imputation procedures to get better estimates) that ensure better quality of the integrated data sets, for example the quality and timeliness of the matched files.

The project aims at developing actions that could enhance knowledge and understanding of the above-mentioned methodologies in the ESS. The activities are split into different work packages (WP):
- WP1 (subcoordinator CBS): assessment of the state of the art on the application and development of the methodologies.
This WP includes the definition of the above-mentioned problems as statistical problems, in order to describe a framework for identifying solutions that are optimal from the statistical point of view for probabilistic record linkage, statistical matching and microintegration processing.
- WP2 (subcoordinator INE Spain): description of the phases/steps to perform in order to tackle the above-mentioned problems, and list of recommendations on the use of the methodologies.
- WP3 (subcoordinator CZSO): list of the available software tools, with a description of their contents and limits.
- WP4 (subcoordinator STAT): events for spreading knowledge and understanding of the methodologies used to tackle the three above-mentioned problems. During the project activities, two events were organized:
  - Course on integration of surveys and administrative data
  - Workshop on integration of surveys and administrative data

Istat is the scientific coordinator and puts in place all the actions needed for coordination between the project members. Furthermore, in order to spread knowledge, Istat developed and maintained the project web page. All the project members jointly contributed to the successful development of each WP.

In the following paragraphs, the three statistical problems that are the focus of this project are briefly introduced. Afterwards, the contents of the WP reports and activities are summarized.

2. Statistical methodologies for the integration of surveys and administrative data

2.1 Probabilistic record linkage

Probabilistic record linkage consists in identifying pairs of records, coming from either the same or different data files, which belong to the same entity, on the basis of the agreement between common indicators.

[Figure: record linkage of two data sets A and B]

The figure is taken from Fortini et al. (2006) [1] and shows record linkage of two data sets A and B. Links aim at connecting records belonging to the same unit by comparing some indicators (name, address, telephone number).
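The idea of deciding on links from the agreement between common indicators can be sketched as follows. This is a minimal illustration in the spirit of the Fellegi and Sunter (1969) model, not the procedure of any specific tool: each indicator contributes a log-likelihood-ratio weight, and the total weight is compared with two thresholds. The m- and u-probabilities, the thresholds, the records and the function names are all hypothetical, and exact equality is used as the comparison function only for simplicity.

```python
import math

# Hypothetical agreement probabilities for each indicator:
#   m = P(indicator agrees | the two records belong to the same unit)
#   u = P(indicator agrees | the two records belong to different units)
M_U = {
    "name": (0.95, 0.01),
    "address": (0.90, 0.05),
    "phone": (0.85, 0.001),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum the log2 likelihood ratios over all compared indicators."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement raises the weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


def classify(weight, lower=0.0, upper=6.0):
    """Three-way decision; the two thresholds are illustrative assumptions."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Invented records: rec_b is the same unit as rec_a with a mistyped phone number.
rec_a = {"name": "ANNA ROSSI", "address": "VIA ROMA 1", "phone": "0612345"}
rec_b = {"name": "ANNA ROSSI", "address": "VIA ROMA 1", "phone": "0612354"}
rec_c = {"name": "MARIO BIANCHI", "address": "VIA MILANO 9", "phone": "0298765"}

w_ab = match_weight(rec_a, rec_b)
w_ac = match_weight(rec_a, rec_c)
print(classify(w_ab), classify(w_ac))
```

With these invented values, the pair that disagrees only on the telephone number still receives a high weight, while the pair that disagrees on every indicator receives a strongly negative one.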
It is possible that some agreement is not perfect (as in the telephone number of the first record of the left data set and the third record of the right data set), but the records still belong to the same unit.

This problem is important in many applications, such as:
1. the study of the relationships between variables collected on the same individuals but coming from different sources;
2. the elimination of duplicates from data sets;
3. the development and management of registers.

Record linkage is also pervasive in business contexts, for instance in the development of information systems for customer relationship management, or in the collection of information useful for marketing purposes. Recently, public institutions have also shown an increasing interest in it for e-government applications.

[1] Fortini M., Scannapieco M., Tosco L., Tuoto T. (2006). “Towards an Open Source Toolkit for Building Record Linkage Workflows”. Proceedings SIGMOD 2006 Workshop on Information Quality in Information Systems (IQIS’06), Chicago, USA, 2006.

2.2. Statistical matching

Statistical matching (or data fusion or synthetical matching) refers to a group of methods whose objective is the integration of two (or more) data sources (samples) drawn from the same target population. The data sources all share a subset of variables (common variables) and, at the same time, each source separately observes other subsets of variables. Moreover, there is a negligible chance that data in different sources observe the same units (disjoint sets of units).

As a result, this set of procedures differs considerably from record linkage both in the inputs (i.e. the data sets to be integrated) and in the outputs. As far as the input is concerned, the data sets are usually two distinct samples without any unit in common (no overlap between the data sources). In contrast, record linkage requires at least a partial overlap between the two sources.
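To make this input setting concrete, the sketch below builds two small samples with no units in common that share one common variable (called x here), and fills in the variable missing from the first sample by borrowing it from the closest record of the second. Nearest-neighbour donation is only one of several imputation approaches used for statistical matching; the data, the variable names and the function names are invented for illustration. A synthetic file built this way implicitly relies on the assumption that the two non-shared variables are conditionally independent given the common one.

```python
# Sample A observes (x, y); sample B observes (x, z).
# No unit appears in both samples. All values are invented.
sample_a = [
    {"x": 25, "y": 1200.0},
    {"x": 40, "y": 2100.0},
    {"x": 58, "y": 2600.0},
]
sample_b = [
    {"x": 23, "z": 300.0},
    {"x": 41, "z": 650.0},
    {"x": 60, "z": 900.0},
    {"x": 35, "z": 500.0},
]


def nearest_donor(recipient, donors, key="x"):
    """Return the donor record closest to the recipient on the common variable."""
    return min(donors, key=lambda d: abs(d[key] - recipient[key]))


def hot_deck_match(recipients, donors, key="x", donated="z"):
    """Build a synthetic file: each recipient receives the donated variable
    of its nearest donor, so that x, y and z appear together."""
    synthetic = []
    for rec in recipients:
        donor = nearest_donor(rec, donors, key)
        synthetic.append({**rec, donated: donor[donated]})
    return synthetic


synthetic_a = hot_deck_match(sample_a, sample_b)
for row in synthetic_a:
    print(row)
```

The resulting records are synthetic: the (y, z) pairs were never observed on the same unit, so any association measured on this file reflects the matching assumption as much as the data.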
In the simplest case of two samples, the classical statistical matching framework can be represented in the following manner (see also Kadane, 1978 [2]; D’Orazio et al., 2006 [3]):

Data source A:   Y   X
Data source B:       X   Z

In this situation X is the set of common variables, Y is observed only in A but not in B, and Z is observed in B but not in A (Y and Z are not jointly observed).

The second difference between record linkage and statistical matching is the output. In record linkage, the objective is to recognize records belonging to the same unit in two distinct but partially overlapping data sets. For this reason the focus is only on the X variables and on how to deal with the possibility that these variables are recorded with error. In contrast, statistical matching methods aim at integrating the two sources in order to study the relationships among the two sets of variables not jointly observed, i.e. Y and Z, or, more generally, to study how X, Y and Z are related. This objective can be achieved by using two seemingly distinct approaches:
1. Imputation procedures; the result is a complete but synthetic data set.
2. Estimation procedures on a partially observed data set; the result is the estimate of contingency tables, or of parameters describing the association of the never jointly observed variables.

[2] Kadane J.B. (1978). “Some statistical problems in merging data files”. 1978 Compendium of Tax Research, Office of Tax Analysis, Department of the Treasury, pp. 159-171. Washington, DC: U.S. Government Printing Office. Reprinted in Journal of Official Statistics (2001), 17, 423-433.
[3] D’Orazio M., Di Zio M., Scanu M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

2.3. Microintegration processing

Microintegration processing consists of putting in place all the actions needed to ensure better quality of the matched results, such as the quality and timeliness of the matched files.
It includes defining checks, editing and imputation procedures to get better estimates, etc.

It should be kept in mind that some data sources are more reliable than others. Some sources have better coverage than others, and there may even be conflicting information between sources. It is therefore important to recognize the strong and weak points of all the data sources used.

Since there are differences between sources, a microintegration process is needed to check data and adjust incorrect data. It is believed that integrated data will provide far more reliable results, because they are based on an optimal amount of information. The coverage of (sub)populations will also be better, because when data are missing in one source, another source can be used. Another advantage of integration is that users of statistical information will get one figure on each social phenomenon, instead of a confusing number of different figures depending on which source has been used.

3. Report of WP1 – State of the art of methodologies for integration of surveys and administrative data

The objective of this document is to provide a complete and updated overview of the state of the art of the methodologies regarding integration of different data sources. The different NSIs (within the ESS) can refer to this single document if they need to:
1. define a problem of integration of different sources according to the characteristics of the data sets to integrate;
2. discover the different solutions available in the statistical literature;
3. understand which problems still need to be tackled, and motivate the research on these issues;
4. look at the characteristics of many different projects that needed the integration of different data sources.

This document consists of five chapters that can be broadly clustered into two groups. The first three chapters are mainly methodological.
They describe the state of the art respectively for i) probabilistic record linkage, ii) statistical matching, and iii) microintegration processing. Each chapter is in effect a collection of references. This part of the document is intended as a tool for orientation through the large number of papers on different integration methodologies. This aspect should not be considered a secondary issue in the production of official statistics. The main problem is that methodologies for the integration of different sources are, in most cases, still in their infancy, whereas the current information needs of official statistics require an increasingly sophisticated use of multiple sources for the production of statistics. Whoever is in charge of a project on the integration of different sources must be aware of all the available alternatives and should be able to justify the chosen method.

The last two chapters are an overview of integration experiences in the ESS. Chapter 4 collects detailed information on many different projects that need a joint use of two or more sources in the participating NSIs of this ESSnet. Chapter 5 illustrates the results of a survey on the use and/or development of integration methodologies among EU MS and other countries, carried out as a part of this project. The results showed that at least 19 countries from the ESS are engaged in one or more data integration activities. This is mainly due to new regulations at the EU level, new demands for data from policy makers and growing pressure to facilitate researchers' access to official data. However, NSIs must often approach the use of administrative datasets without previous experience.
A general output of the WP1 report is that statistical methodologies aiming at the integration of different datasets have the following advantages:
- They provide a methodology for data integration on a probabilistic basis, which guarantees minimum standards in quality and comparability.
- Since the output achieved consists of a new dataset at the micro level, the information available is useful for all kinds of final users, either policy makers or researchers.
- The stock of probabilistic rules and models that are part of these integration technologies provides a method for processing data on an automated, regular, mass and inexpensive basis.

4. Report of WP2 – Recommendations on the use of methodologies for the integration of surveys and administrative data

Implementing the procedures described in theory in the WP1 report requires attention to some other aspects that cannot be developed in scientific papers. This document provides some recommendations and hints to consider during the application of the probabilistic record linkage, statistical matching and microintegration processing procedures. It cannot yet be considered a set of practical guidelines: research is still needed in order to assess optimal choices and hence guidelines on the application of the procedures. However, the experience already gained by the ESSnet partners makes it possible to:
1. highlight the practical aspects to tackle in an applied context,
2. suggest hints that worked successfully in some applications,
3. warn about possible flaws and misinterpretations.

This report essentially reviews these three issues. They are introduced by a flowchart of the practical steps to perform, which describes the overall process of data integration performed by probabilistic record linkage or statistical matching methodologies.

The document consists of 5 chapters.
The first chapter introduces a preliminary but crucial step in every integration project: the need to harmonize the definitions and concepts of the different data sources. The next three chapters review the practical aspects to deal with in an integration process. Some common aspects are the following:
1. The criteria for choosing the most appropriate model variables for the identification of records are introduced.
2. Since these matching procedures rely on probabilistic models, and therefore some steps of the decision-making process are not completely automated, some hints on the most common available solutions and on the principles to be taken into account are set out.
3. Once information from different data sources has been linked, the report shows how to manage and arrange the datasets generated in the process.

The last chapter details some important applications in the ESSnet participating countries.

The implementation of these methodologies, together with the observance of the set of recommendations provided in WP2, can result in a marked reduction of the burden and resource consumption associated with large statistical operations such as the next round of Censuses.

5. Report of WP3: Software Tools for Integration Methodologies

Software tools for data integration consist of:
1. record linkage tools,
2. statistical matching tools,
3. commercial software tools for data quality and record linkage in the process of microintegration.

The document is organized in three chapters. The first chapter is on software tools for record linkage. On the basis of the underlying research paradigm, three major categories of record linkage tools can be identified:
- Tools for probabilistic record linkage, mostly based on the Fellegi and Sunter model (Fellegi and Sunter, 1969) [4].
- Tools for empirical record linkage, which are mainly focused on performance issues and hence on reducing the search space of the record linkage problem by means of algorithmic techniques such as sorting, tree traversal, neighbour comparison, and pruning.
- Tools for knowledge-based linkage, in which domain knowledge is extracted from the files involved and reasoning strategies are applied to make the decision process more effective.

Given such a variety of proposals, this document restricts its attention to the record linkage tools that have the following characteristics:
- they have been explicitly developed for record linkage;
- they are based on a probabilistic paradigm.

Two sets of comparison criteria were used for comparing several probabilistic record linkage tools. The first one considers general characteristics of the software:
- cost of the software;
- domain specificity (i.e. whether the tool has been developed ad hoc for a specific type of data and applications);
- maturity (or level of adoption, i.e. frequency of usage, where available, and number of years the tool has been around).

The second set considers which functionalities are performed by the tool:
- preprocessing/standardization;
- profiling;
- comparison functions;
- decision method.

[4] Fellegi I. P., Sunter A. B. (1969). “A theory for record linkage”. Journal of the American Statistical Association, 64, 1183-1210.

Chapter 2 deals with software tools for statistical matching. Software solutions for statistical matching are not as widespread as those for record linkage, because statistical matching projects are still quite rare in practice. Almost all the applications are conducted by means of ad hoc code. Sometimes, when the objective is micro, it is possible to use general-purpose imputation software tools. On the other hand, if the objective is macro, it is possible to adopt general statistical analysis tools which are able to deal with missing data.
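As a rough illustration of a macro objective, the sketch below estimates the covariance between two variables that are never jointly observed, without building any synthetic file. It assumes conditional independence of y and z given the common variable x, together with linear relationships, in which case cov(y, z) = cov(x, y) · cov(x, z) / var(x). The data and the function names are invented; this is a sketch under those stated assumptions, not the method of any particular software tool.

```python
def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Unbiased sample covariance; cov(u, u) is the sample variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def cia_covariance(x_a, y_a, x_b, z_b):
    """Estimate cov(y, z) under conditional independence given x.

    cov(x, y) comes from sample A, cov(x, z) from sample B, and
    var(x) from the two samples pooled together.
    """
    pooled_x = x_a + x_b
    return cov(x_a, y_a) * cov(x_b, z_b) / cov(pooled_x, pooled_x)


# Invented data: sample A observes (x, y), sample B observes (x, z).
x_a = [1.0, 2.0, 3.0, 4.0]
y_a = [2.0, 4.0, 6.0, 8.0]
x_b = [1.5, 2.5, 3.5, 4.5]
z_b = [4.5, 7.5, 10.5, 13.5]

estimate = cia_covariance(x_a, y_a, x_b, z_b)
print(round(estimate, 4))
```

The estimate depends entirely on the conditional independence assumption: the two samples contain no information that could confirm or contradict it.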
In this chapter, the available tools explicitly devoted to statistical matching purposes were reviewed. Only one of them (SAMWIN) is software that can be used without any programming skills, while the others are software code that can be used only by those with knowledge of the corresponding language (R, S-Plus, SAS) as well as a sound knowledge of statistical methodology.

The criteria used for comparing the software tools for statistical matching were slightly different from those for record linkage. The attention is restricted to costs, domain specificity and maturity of the software tool. As far as the software functionalities are concerned, the focus is on: i) the inclusion of pre-processing and standardization tools; ii) the capacity to create a complete and synthetic data set by the fusion of the two data sources to integrate; iii) the capacity to estimate parameters of the joint distribution of variables never jointly observed; iv) the assumptions on the model of the variables of interest under which the software tool works (the best known is the conditional independence assumption of the variables not jointly observed given the common variables in the two data sources); v) the presence of any quality assessment of the results. Furthermore, the software tools are compared according to the implemented methodologies. Strengths and weaknesses of each software tool are highlighted at the end.

Chapter 3 focuses on commercial software tools for data quality and record linkage in the process of microintegration. Vendors in the data quality market are often classified according to their overall position in the IT business, where a focus on specific business knowledge and experience in a specific business domain plays an important role.
The quality of vendors and of their products on the market is characterized by: i) product features and relevant services; ii) vendor characteristics, domain business understanding, business strategy, creativity, innovation; iii) sales characteristics, licensing, prices; iv) customer experience, reference projects; v) data quality tools and frameworks.

The software vendors of tools in the statistics-oriented “data quality market” propose solutions addressing all the tasks in the entire life cycle of data-oriented management programs and projects: data preparation, survey data collection, improvement of quality and integrity, setting up for reports and studies, etc.

According to the software/application category, the tools that perform or support data-oriented record linkage projects in statistics should have several common characteristics:
1. portability, i.e. the ability to function with statistical researchers' current arrangement of computer systems and languages,
2. flexibility in handling different linkage strategies, and
3. low operational expenses, i.e. low costs in terms of TCO (Total Cost of Ownership) and in both computing time and researchers' effort.

In this chapter the evaluation focused on three commercial software packages which, according to the data quality scoring position in Gartner reports (the so-called “magic quadrants” available on the web page http://www.gartner.com), belong to important vendors in this area. The three vendors are:
- Oracle (which represents the convergence of tools and services in the software market),
- SAS/DataFlux (a data quality, data integration and BI (Business Intelligence) player on the market),
- Netrics (which offers advanced technology complementing the Oracle data quality and integration tools).

The set of comparison tables was prepared according to the following structure: linkage methodology, data management, post-linkage function, standardization, costs and empirical testing, case studies and examples.

6. Activities of WP0 and WP4

In order to spread knowledge and understanding of the statistical methodologies for integration of surveys and administrative data, the ESSnet-ISAD project organized:
1. a project website (developed under WP0),
2. a course on statistical methods for the integration of surveys and administrative data (developed under WP4),
3. a workshop on statistical methods for the integration of surveys and administrative data (developed under WP4).

6.1. The project website

The project website is available at the internet address http://cenex-isad.istat.it. The first page of the website introduces the project activities and its members. The website contains the following links.
- Public area: this area contains the agenda of the project; the list of links related to record linkage, statistical matching and microintegration processing; the documents already produced by the project members; the possibility to add special announcements; the possibility to establish a forum of people interested in the topic.
- ESSnet course on integration of surveys and administrative data (ISAD): this area contains the documents, presentations and material used for the ESSnet ISAD course held in Budapest (14-16 November 2007).
- ESSnet workshop on integration of surveys and administrative data (ISAD): this area contains the documents, presentations, slides and material used for the ESSnet ISAD workshop held in Vienna (29-30 May 2008).

6.2. ESSnet course on integration of surveys and administrative data

This course aimed at presenting the state of the art of the methodologies for record linkage, statistical matching and microintegration processing. The course was a combination of teaching sessions, where the theory was explained, and software sessions, complemented with practical exercises. Each topic focused on one main application.
The course was structured into three days, devoted respectively to probabilistic record linkage, statistical matching and microintegration processing. Course material is available on the project website. Comments on the course contents, material and features were gathered at the end of the course. The comments were generally positive.

The course was open to all countries. The organizers aimed at having attendees from as many different ESS countries as possible. The course was attended by 30 people, from 18 countries (including one Eurostat member), according to the following table.

Country        Participants
Austria        2
Bulgaria       1
Croatia        1
Cyprus         1
Czech Rep      2
Eurostat       1
Finland        1
Greece         1
Hungary        5
Latvia         1
Netherlands    1
Poland         2
Slovakia       1
Slovenia       1
Spain          2
Sweden         2
Switzerland    1
Turkey         1
UK             2

6.3. ESSnet workshop on integration of surveys and administrative data

One of the project activities was a workshop on Integration of Surveys and Administrative Data, held in Vienna, 29-30 May 2008. The aims of the workshop were to:
1. disseminate the ESSnet-ISAD project results in the ESS;
2. allow the different experiences in the ESS MS to be presented and discussed;
3. gather researchers from the ESS countries interested in the topic, and create a community of experts;
4. present and compare new insights and methodological solutions from ESS NSIs and key academic researchers.

The workshop consisted of 7 sessions:
Session 1   ESSnet ISAD results
Session 2   Record linkage
Session 3   Statistical matching and forecasting
Session 4   Conceptual aspects for integration
Session 5   Integration of registers and samples 1
Session 6   Integration of registers and samples 2
Session 7   Register based statistics

The workshop was open to all countries. The organizers aimed at having attendees from as many different ESS countries as possible. The workshop was attended by 59 invited attendees, from 24 countries (plus Eurostat personnel), according to the following table.
Countries Austria Belgium Bulgaria Croatia Czech Republic Estonia Eurostat Finland Germany Hungary Ireland Italy Latvia Lithuania Luxembourg Netherlands Poland Romania Slovenia Spain Sweden Switzerland Turkey UK Total Participants Speakers 12 2 1 1 4 3 1 2 1 3 1 1 9 1 2 1 2 2 1 2 2 1 2 1 3 59 1 1 1 1 1 4 2 2 1 1 1 1 3 22

Workshop material is available on the project website. The workshop papers will be published in an official Eurostat publication.

7. Conclusions and open issues

This project was the first opportunity to provide the ESS MS with a common basis for the application of statistical methodologies whose aim is the integration of surveys and administrative data. This objective was tackled by analyzing the available statistical literature on the subject and the practical experience already gained. This analysis allowed the definition of workflows for the application of the statistical methodologies under study. Furthermore, recommendations on the application of the methodologies have been gathered.

There are still open issues on the topic. The ESS could benefit from the joint efforts of different NSIs in tackling these open issues, which are summarized below:
1. how to solve conflicts between the timing of events in different data sources; differences between the observed population and the target population; populations changing over time, and dynamic registers; coverage errors; differences in unit definitions in different data sources;
2. how to obtain consistent estimates from integrated surveys and administrative data;
3. how to analyse the integrated data sets taking into account possible linkage errors;
4. how far best practices from the editing and imputation area can also be used for the integration of data sources.

Furthermore, close collaboration between ESS MS is advisable for the definition of common, harmonised and open source software tools.