2.6 Guidelines on selective editing options for the Statistical Data Warehouse

Title: Selective editing options for a Statistical Data Warehouse
WP: 2
Deliverable: 2.6
Version: 2.0 - Final
Date: 31-7-2013

Authors:
Hannah Finselbach, ONS (UK)
Daniel Lewis, ONS (UK)
Orietta Luzi, ISTAT (Italy)

ESSNET ON MICRO DATA LINKING AND DATA WAREHOUSING
IN PRODUCTION OF BUSINESS STATISTICS
Index
1. Introduction
1.1 Combining data sources
1.2 Editing data sources
2. Selective editing for administrative/external data in a (statistical) data warehouse framework
2.1 Selective editing
2.2 Selective editing in a Statistical Data Warehouse framework
3. Selective editing case studies
3.1 SeleMix
3.1.1 Case study with SBS data (ISTAT)
3.2 SELEKT
3.2.1 Use of SELEKT in a Statistical Data Warehouse
4. Recommendations
5. References
1. Introduction
The objective for the ESSnet Project on Micro Data Linking and Data Warehousing
(ESSnet DWH) is to provide assistance in the development of more integrated databases
and data production systems for business statistics in ESS Member States. This report is
one output of Work Package 2, deliverable 2.6: Examine selective editing options for a
Statistical Data Warehouse and provide guidelines for best practice, including options
for weighting the importance of different outputs.
The outcome from Phase 1 (SGA-1) of the ESSnet DWH shows that the design and
implementation of a Statistical Data Warehouse is “work in progress” for most surveyed
National Statistics Institutes (NSIs), or that their system is partially integrated. Flexibility
in output, data linking, and efficiency in process are the main motivations for
implementing a Statistical Data Warehouse. A key challenge identified from the
questionnaire is the different statistical requirements of the Data Warehouse. Work
Package 2 addresses this and other methodological challenges in moving to a Statistical
Data Warehouse.
There are many benefits of implementing a Data Warehouse approach. These include
decreased cost of data access and analysis, using a common data model, as well as
common tools, and faster and more automated data management and dissemination.
Few NSIs have implemented a Data Warehouse encompassing all phases of the Generic
Statistical Business Process Model (GSBPM).
This report examines options for efficient editing in a Statistical Data Warehouse,
specifically exploring how selective editing may be used in this context. The report
begins with background on combining and editing data sources, before describing
selective editing in a traditional survey context, and how and when the method could be
used in a Statistical Data Warehouse. The final part of the report focuses on two widely
available selective editing tools, to consider if they could be used for efficient editing in a
Data Warehouse environment.
1.1 Combining data sources
Many NSIs have increased the use of administrative data sources for producing
statistical outputs. The potential advantages of using administrative sources include a
reduction in data collection and statistical production costs; the possibility of producing
estimates at a very detailed level thanks to almost complete coverage of the population;
and re-use of already existing data to reduce respondent burden.
Removing the traditional “stove-pipe” models and moving to an integrated system using
a common data model, as well as common tools, to produce business statistics should
also reduce costs and burden. Re-engineering projects to combine and standardize
systems for combining data sources to produce business statistics are also increasing
amongst NSIs, see Casciano et al (2011), Brisebois et al (2011), and Brion (2008) for
specific case studies in ISTAT, Statistics Canada and INSEE, respectively.
There are also drawbacks to using administrative data sources. The economic data
collected by different agencies are usually based on different unit types. For example, the
legal unit used by the Tax Office to collect VAT information is often different from the
statistical unit used by the NSI. These different unit types complicate the integration of
sources to produce statistics. This can lead to coverage problems and data
inconsistencies on linked data sources.
Another complication affecting the use of administrative data is timeliness. For example,
there is often too much of a lag between the reporting of economic information to the
Tax Office and the reporting period of the statistic to be produced by the NSI. The
ESSnet Admin Data (Work Package 4) has addressed some of these issues, and produced
recommendations on how they may be overcome. Reports will be available at
http://www.cros-portal.eu/content/admindata-sga-3 when finalised.
Definitions of variables can differ between sources. Work Package 3 of the ESSnet Admin
Data aims to provide methods of estimation for variables that are not directly available
from administrative sources. In addition, in many cases the administrative sources alone
do not contain all of the information that is needed to produce the detailed statistics
that NSIs are required to produce, and so a mixed source approach is usually required.
See Lewis and De Waal (2013) for more information. Burger et al (2013) provide
examples of estimating the accuracy of mixed source estimators.
For business statistics there are many logical relationships (or edit constraints) between
variables. When sources are linked, inconsistencies will arise, and the linked records do
not necessarily respect these constraints. A micro-integration step is usually necessary
to integrate the different sources to arrive at consistent integrated micro data. The
ESSnet Data Integration outlines a strategy for detecting and correcting errors in the
linkage and relationships between units of integrated data. Gåsemyr et al
(2008) advocate the use of quality measures to reflect the quality of integrated data,
which can be affected by the linkage process.
A Data Warehouse will combine survey, register and administrative data sources, which
could be collected by several modes. Register and administrative data are no longer
used only as business or population frames and as auxiliary information for sample
survey based statistics, but also as the main sources for statistics and as sources for
quality assessment.
Editing in the Data Warehouse is required for different purposes: maintaining the
register and its quality; for a specific output and its integrated sources; and to improve
the statistical system. The editing process is one part of quality control in the Statistical
Data Warehouse – finding error sources and correcting them.
1.2 Editing data sources
All data sources feeding into the Data Warehouse will be subject to different types of
errors and anomalies. The UNECE Glossary of Terms on Statistical Data Editing defines
editing as “an activity that involves assessing and understanding data, and the three
phases of detection, resolving, and treating anomalies…”
There are well established methods for editing survey data. The EDIMBUS project (Luzi et
al, 2007) describes the potential errors in business survey data, and defines a wide range
of methods to detect and correct errors. The report gives clear recommendations on the
implementation of a strategy to design, test, and assess a suitable editing process for
business surveys. The survey editing process is described graphically in the following
diagram.
Figure 1: General flow of an Editing and Imputation (E&I) process, taken from Luzi et al (2007), page 7.
The interactive editing and imputation stage may include close inspection and analysis by
subject matter experts. For business surveys especially, the interactive stage can involve
re-contacting respondents to verify the data they provided. The decision on whether an
error is influential or not is usually taken as part of a selective editing procedure. For
more details see section 2 below.
Administrative sources present different challenges for editing. The size of the datasets
means that thorough interactive editing is generally infeasible due to the time and cost
that would be required; the fact that the data are collected by a different organisation
can make it difficult for NSIs to understand the causes and nature of errors; and in most
cases it is not possible, for legal or practical reasons, to re-contact the people or
businesses that provided the data.
The literature on editing administrative data is not as advanced as that for survey data;
nevertheless, some useful guidance is available. Verschaeren et al (2013) describe
methods for editing administrative data in general and in the specific context of some of
the main sources of administrative data for economic statistics: Value Added Tax data,
Social Security data on employees, and Company Accounts data. One of the key
principles is the importance of good communication with the administrative data holder,
in order to understand possible causes of errors and methods for detecting them, and
potentially for the NSI to feed back improvements to their collection and processing
systems to reduce the occurrence of errors. Suspicious values can generally be detected
using traditional style edit rules, but the inability to re-contact respondents may mean
that automatic correction using imputation methods is the only way to resolve them.
However, Wallgren and Wallgren (2007) point out that interactive editing (by statistical
staff) can be effective in understanding error sources and the nature of the
administrative data themselves. This will improve the future use of the administrative
data for statistical purposes. The size of administrative datasets means that interactive
editing of this type will necessarily be limited.
Statistics Sweden use a register system to receive and process administrative data.
Holmberg et al (2011) display the main editing processes in this system, reproduced in
figure 2 below.
Figure 2: A simplified overview of editing activities in Statistics Sweden’s register system, taken from Holmberg et al (2011).
Other representations of the process of editing administrative data are possible. For
example see Wallgren and Wallgren (2007), page 101. The ongoing ESSnet MEMOBUST
will also contain a module on “Editing administrative data”, which proposes different
possible scenarios for the editing and imputation process in the case of using integrated
administrative sources for statistical purposes.
The integration of survey and administrative data is a key part of the Statistical Data
Warehouse. Brion (2011) describes split processes of editing for administrative data,
survey data, and data common to both sources. Editing requirements may vary,
depending on how each data source will be used.
2. Selective editing for administrative/external data
in a (statistical) data warehouse framework
Traditionally, editing is one of the most time consuming processes in the production of
official statistics. Yet the manual follow-up of many edit failures results in no change,
and many editing changes have negligible effect on the survey estimates. When manual
editing involves re-contacting respondents, it also increases respondent burden and can
introduce further errors through over-editing. In view of this, most NSIs
have developed editing strategies that improve efficiency, without impacting adversely
on data quality. A major element of this has been the development of suitable
methodology for selective (or significance) editing.
2.1 Selective editing
The selective editing approach prioritises those contributors with suspicious values that
would have substantial influence on publications if they proved to be erroneous.
Selective editing can be applied to both micro and macro data. For this report, we focus
on the selective editing procedure applied at the micro (unit) level, at the early stages of
data collection. By prioritising units for attention, selective editing leads to clear
boundaries for manual intervention and more efficient editing processes overall. Other
methods of macro/output editing (macro-selection) are designed to be used when the
data collection is complete.
When not all returned data are intensively edited, some bias is left in the estimates.
This bias needs to be balanced against having a more cost effective and efficient
process. A historical background of selective editing can be found in De Waal et al
(2011), and a more detailed description can be found in, for example, Lawrence and
McKenzie (2000).
Observations are prioritised according to the values of a score function that expresses
the impact of their potential error on the estimates of interest (Latouche and Berthelot,
1992). The EDIMBUS Project (Luzi et al, 2007) recommends that the score should consist
of a risk (suspicion) and influence (potential impact) component. Local scores (for key
variables of interest) are calculated, and then combined into a global (unit) score
(Hedlin, 2008) and thresholds are constructed for the global score to determine whether
an observation will be manually edited or not. Various quality indicators such as bias,
change rates and savings can be used throughout the process in order to choose
appropriate threshold values.
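As a purely illustrative sketch of the scoring idea above (not any particular NSI's implementation), a local score for each key variable can combine a risk component (deviation of the reported value from an expected value, e.g. a prediction or a historical value) with an influence component (the unit's weighted share of the estimate); local scores are then combined into a global unit score and compared with a threshold. All names, and the choice of the maximum as the combination rule, are assumptions of this sketch.

```python
# Illustrative selective-editing score, following the risk x influence idea
# recommended by EDIMBUS (Luzi et al, 2007). Variable names and the use of
# max() as the combination rule are assumptions, not a prescribed standard.

def local_score(reported, expected, weight, total_estimate):
    """Risk = relative deviation from an expected value; influence =
    weighted contribution of the unit to the estimate of interest."""
    risk = abs(reported - expected) / max(abs(expected), 1.0)
    influence = weight * abs(reported) / max(abs(total_estimate), 1.0)
    return risk * influence

def global_score(local_scores):
    """Combine local scores for several key variables into one unit score
    (Hedlin, 2008); taking the maximum is one common choice."""
    return max(local_scores)

def flag_for_review(units, threshold):
    """units: dict unit_id -> list of (reported, expected, weight, total)
    tuples, one per key variable. Returns ids exceeding the threshold."""
    flagged = []
    for uid, variables in units.items():
        scores = [local_score(*v) for v in variables]
        if global_score(scores) > threshold:
            flagged.append(uid)
    return flagged
```

In this sketch, a unit reporting a value far from expectation that also carries a large weighted contribution receives a high global score and is routed to manual review; the remaining units pass to automatic treatment.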
Selective editing is now part of the editing strategy for most NSIs. Some recent updates
on the implementation of selective editing methods in business surveys include:
Skentelbery et al (2011), Lindgren (2011), Cloutier (2011), and Brion (2011) from the UK,
Sweden, Canada and France, respectively.
As the use of administrative data increases, more NSIs are investigating the ways in
which administrative data can be used to improve the selective editing process. ISTAT
(Italy) and ONS (UK) have investigated the benefits of using administrative data as
auxiliary information in the selective editing score function (Luzi et al, 2010; Lewis,
2011).
Brion (2009) describes the three steps implemented in the INSEE (France) data editing
strategy for its integrated SBS system, where administrative data are used as a primary
source of data. Step one is editing the survey data, step two is editing the administrative
data, and then a third step addresses the coherence between the data sources. Gros
(2012) outlines the algorithm to determine which source should be used when variables
or characteristics are available from multiple sources. This involves calculating a score of
differences between data sources to determine the units that need to be manually
compared or edited.
Selective editing is applied to administrative data in the French Esane structural business
statistics system. The score function is the same as applied to survey data but without
the use of sampling weights, as the administrative data are exhaustive, see Gros (2012).
There are a number of available tools developed by NSIs that perform selective editing
on business surveys. SeleMix (Buglielli, M. T. et al, 2011) and SELEKT (Norberg and
Arvidson, 2008) are considered in detail in section 3 of this report. Other tools are
available to perform selective or significance editing, such as the Significance Editing
Engine (SEE) developed by the Australian Bureau of Statistics (Farewell and Shubert,
2011), and SLICE developed by Statistics Netherlands (Hoogland and Smit, 2008).
2.2 Selective editing in a Statistical Data Warehouse framework
Editing in the Data Warehouse is required for different purposes: some editing aims at
maintaining the register and its quality; some at a specific output and its integrated
micro data sources; and some at micro data used to improve the statistical system. The
editing process is one part of quality control in the Data Warehouse – finding error
sources and correcting them. Statistics Sweden has developed a data warehousing and
register-coordination strategy. Holmberg et al (2011) describe the editing processes on
administrative data and in a register system.
The survey design weight generally used when selectively editing survey data is specific
to the design-based approach of survey sampling. It is an important part of the scoring
methodology, but would not be meaningful in a Statistical Data Warehouse. One
possibility is to consider that the weight of each unit in the Data Warehouse is 1, as
combining sources to cover the whole population would mean that each unit only needs
to represent itself. Lorenc et al (2012) suggest that this weight-less approach would be
more aligned with the concept of “comprehensive systems” of data, such as a Statistical
Data Warehouse. The paper also suggests the possibility of developing several sets of
weights for the same set of original data, tailored for different uses.
Selective editing may not be an appropriate method of editing for data that are
collected without a clear purpose of their exact use. Lorenc et al (2012) ask how
statistical data editing can be done without a purpose, and suggest alternative
approaches. These include the use of methods developed for computer-intensive data
processing (as opposed to statistical processing) such as pattern matching. Further work
is needed to evaluate the options.
Another challenge to selective editing in a Statistical Data Warehouse is that the
multiple uses of the data make it difficult to identify the more serious or significant errors, and the
most cost effective ways of dealing with them. Section 3 will explore possible methods
for addressing this issue, including the use of importance weights for different outputs in
selective editing. Holmberg et al (2011) advise that early automatic correction
(imputation) of errors should be avoided if there are many uses of the data.
Congruence editing or cross checking between data sources (for cross sectional and
longitudinal coherence), should also be part of the editing process within the Statistical
Data Warehouse. Quality indicators would be required for each data source, to enable a
systematic method for making the preferred choice of data item from multiple sources.
Markers would be required to trace a data item to each output it is used for. Laitila,
Wallgren and Wallgren (2011) describe quality measures for administrative data, which
could be incorporated into a Statistical Data Warehouse.
A scoring method could be used to identify when manual intervention is needed to
compare data sources, as implemented by INSEE and described in section 2.1. Other
options may include combining selective editing local scores from several sources to
determine a global score for the unit. The ability to set meaningful thresholds for such a
score needs to be considered.
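A hedged sketch of the first option, loosely in the spirit of the INSEE approach described in section 2.1 (Gros, 2012): compute a score of differences between the values of a common variable in two linked sources, and flag for manual comparison the units whose sources disagree beyond a threshold. The relative-difference formula, names, and threshold are illustrative assumptions, not the published INSEE algorithm.

```python
# Sketch of a cross-source comparison score for a common variable held in
# two linked sources. The scaling by the mean absolute value and the cutoff
# are assumptions for illustration only.

def difference_score(value_a, value_b):
    """Relative difference between the same item in two sources,
    scaled by the mean of the absolute values (0 when both are zero)."""
    denom = (abs(value_a) + abs(value_b)) / 2.0
    if denom == 0.0:
        return 0.0
    return abs(value_a - value_b) / denom

def units_to_compare(source_a, source_b, threshold):
    """source_a, source_b: dict unit_id -> value for a common variable.
    Returns ids of linked units whose sources disagree too strongly."""
    common = source_a.keys() & source_b.keys()
    return sorted(u for u in common
                  if difference_score(source_a[u], source_b[u]) > threshold)
```

Units present in only one source fall outside the comparison and would be handled by the single-source editing step; only linked units with large discrepancies are routed to manual reconciliation.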
3. Selective editing case studies
In this section we consider two selective editing tools which may be useful for editing
data in a Statistical Data Warehouse. The first tool was developed in ISTAT and is called
SeleMix (Selective editing via Mixture models). There is a description of the method
implemented in SeleMix and of an ongoing study to apply the method to Italian mixed-source structural business statistics (SBS) data. The second tool was developed in
Statistics Sweden and is called SELEKT. There is a description of the method and
consideration of how it may be used to selectively edit mixed-source data and data in a
Statistical Data Warehouse.
3.1 SeleMix
Several methods have been proposed to define a score function and a threshold to be
used for selective editing. Most of them, however, are not able to formally relate these
elements to the level of accuracy of the publication figures. In the application delineated
here we aim to evaluate the performance of a selective editing method based on
explicitly modeling both error-free data and error mechanism (Bellisai et al, 2009;
Buglielli et al, 2011). In this approach the “score function” is defined in terms of the
expected error in the data, so that the threshold identifying the units to be manually
reviewed has a statistical interpretation and is associated with the accuracy of the
target estimates.
It is assumed that the true data, possibly in log-scale, follow a Gaussian distribution
whose mean vector depends linearly on a set of error-free covariates. We use a latent
class model (contamination model) to reflect the intermittent nature of the error
mechanism. Furthermore, conditionally on the event that the observed value of a
variable is not the true one, we assume an additive model where the error mechanism is
specified through a Gaussian random variable with zero mean vector and covariance
matrix proportional to the covariance matrix characterizing the true data distribution.
The specification of the true data distribution and error mechanism makes it possible to
explicitly derive the conditional distribution of the true data given the observed (possibly
contaminated) data. According to this distribution, expected errors are computed and
used to define the score function. Any error remaining in data after having corrected the
units with the highest score can also be estimated, so that the threshold for interactive
editing can be related to this residual error.
This method is implemented in the R package SeleMix available on the website
http://www.R-project.org. This package includes functions for the estimation of the
model parameters via the EM algorithm, computation of predictions of true values
conditional on observed values, and prioritization of units for interactive editing
according to a user-specified threshold. Missing values in the contaminated response
variables are allowed. In this case SeleMix can also be used as a tool for (robust)
imputation of incomplete data. The covariates included in the model are supposed to be
error-free and not affected by non-response. Thus, the efficiency of the SeleMix
approach also depends on the reliability of the available auxiliary information.
In the following, the model is described in more detail (see Luzi et al, 2012).
i) True data model
We assume that two sets of variables are observed: the variables of the first group, say
X-variables, are supposed to be correctly measured while the second set of variables,
say Z-variables, correspond to items possibly affected by measurement errors. This setup can be particularly useful when some variables are available from administrative
sources or are measured with high accuracy. It is quite natural to treat the variables that
are observed with error as response variables and the reliable variables as covariates.
This framework includes, as a special case, the situation where reliable covariates X are
not available, so that what is to be modelled is the joint distribution of the Z variables. In
the following we model true data through a log-normal probability distribution. This
seems a reasonable assumption in many cases where economic data are to be analysed.
According to the previous assumptions, true data corresponding to possibly
contaminated items are represented as an n × p matrix Z* of n independent realizations
from a random p-vector assumed to follow a log-normal distribution whose parameters
may depend on some set of q covariates not affected by error. Thus, if Y*= lnZ*, we
have the regression model:
Y* = XB + U    (1)

where X is an n × q matrix whose rows are the measures of the q covariates on the n
units, B is the q × p matrix of the coefficients, and U is an n × p matrix whose rows u_i
(i = 1, …, n) represent normal residuals:

u_i ~ N(0, Σ)    (2)
ii) Error model
In order to model the intermittent nature of the error mechanism we introduce a
Bernoulli random variable I with parameter π, where I =1 if an error occurs and I =0
otherwise. In the sequel, Z and Y will denote possibly contaminated variables in original
and logarithmic scales respectively.
When I = 0, it must hold that Z = Z* (Y = Y*). When I = 1, errors affect the data through
an additive mechanism,

Y = Y* + ε,

where ε is a Gaussian random variable with zero mean and covariance matrix Σ_ε
proportional to Σ, i.e. given the event {I = 1}:

ε ~ N(0, Σ_ε),   Σ_ε = (α − 1)Σ,   α > 1.

The error model can be specified through the conditional distribution of the observed
data given the true data as follows:

f_{Y|Y*}(y | y*) = (1 − π) δ(y − y*) + π N(y; y*, Σ_ε)    (3)

where π (the mixing weight) is the “a priori” probability of contamination and δ(t) is the
delta function with mass at t. In case the set of X-variables is empty, the variables
Y_i (i = 1, …, n) are normally distributed with common mean vector μ. From the above
specifications it easily follows that the distribution of the observed data is given by:

f_Y(y) = (1 − π) N(y; B′x, Σ) + π N(y; B′x, αΣ)    (4)
This distribution can be easily estimated by maximizing the likelihood based on n sample
units via a suitable ECM algorithm.
iii) Selective Editing
The relevant distribution for selective editing is the distribution of error-free data Y*
conditional on observed data (including covariates X). Straightforward application of
Bayes formula provides:
f_{Y*|X,Y}(y* | x, y) = τ₁(x, y) δ(y* − y) + τ₂(x, y) N(y*; μ̃_{x,y}, Σ̃)    (5)

where τ₁ and τ₂ are the posterior probabilities of belonging to the correct and the
erroneous data respectively:

τ₁(x_i, y_i) = Pr(y_i = y_i* | x_i, y_i)
τ₂(x_i, y_i) = Pr(y_i ≠ y_i* | x_i, y_i) = 1 − τ₁(x_i, y_i),   i = 1, …, n

and

μ̃_{x,y} = (y + (α − 1)B′x) / α;   Σ̃ = Σ(1 − 1/α).
With an obvious shift of notation the corresponding distribution in the original scale is:
f_{Z*|Z}(z* | z) = τ₁(ln z) δ(z* − z) + τ₂(ln z) LN(z*; μ̃_{x,ln z}, Σ̃)    (6)

where LN(·; μ, Σ) denotes the lognormal density with parameters (μ, Σ) and, for the
sake of simplicity, we have suppressed the X variables in the notation whenever they
appear as conditioning variables. Estimation of the distribution (6) is obtained by
replacing the corresponding parameters with the estimates of (B, Σ, π, α) resulting
from the ECM algorithm.
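To make the role of formulas (3)–(5) concrete, the following univariate sketch (with illustrative parameter values, not SeleMix itself or its output) evaluates the posterior contamination probability τ₂(y) from the two-component mixture, and the conditional mean μ̃ = (y + (α − 1)μ)/α of the true value, where the scalar μ plays the role of B′x.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a univariate normal with given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior_contamination(y, mu, var, pi, alpha):
    """tau2(y): posterior probability that observation y is contaminated,
    under the mixture f(y) = (1-pi) N(y; mu, var) + pi N(y; mu, alpha*var)."""
    clean = (1.0 - pi) * normal_pdf(y, mu, var)
    noisy = pi * normal_pdf(y, mu, alpha * var)
    return noisy / (clean + noisy)

def posterior_mean(y, mu, alpha):
    """Conditional mean of the true value given contamination:
    mu_tilde = (y + (alpha - 1) * mu) / alpha, i.e. y shrunk towards mu."""
    return (y + (alpha - 1.0) * mu) / alpha
```

An observation close to the model mean gets a posterior contamination probability near zero and is left essentially unchanged; an outlying observation gets τ₂ near one and its prediction shrunk towards the regression mean, which is exactly what drives the expected-error score.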
Once the target distribution (6) has been estimated, “predictions” of the “true” values
z_i*, conditional on the observed values z_i, can be obtained for all observations
i = 1, …, n as:

ẑ_i = E(z_i* | z_i) = ∫ z_i* f_{Z*|Z}(z_i* | z_i) dz_i*

The model predictions can be used to define the expected errors

ε_i = (ẑ_i − z_i),   i = 1, …, n

and to compute “robust estimates” of some parameters of a finite population U.
For instance, assume that the target estimate is given by the total T_z of the variable Z,
i.e. T_z = Σ_{i∈U} z_i, and an estimator T̂_z = Σ_{i∈S} w_i z_i is used, where the w_i are
sampling weights attached to each unit of a sample S of size n. A robust version of T̂_z
is given by

T̂*_z = Σ_{i∈S} w_i ẑ_i,

where the last estimator is obtained from the previous one by replacing the observed
values z_i with the predictions ẑ_i.
A suitable local score function can be defined in terms of expected errors and reference
robust estimates. This definition is particularly useful in that it makes it possible to
estimate the residual error remaining in the data after the units with the highest
expected error have been corrected.
It follows that the number of units to be interactively reviewed can be chosen such that
the residual error is below a prefixed threshold. Specifically, if for a certain variable Z,
r_i denotes the relative individual error, defined as the ratio between the (weighted)
expected error and the reference estimate T̂*_z,

r_i = w_i(ẑ_i − z_i) / T̂*_z,

then the score function is defined as SF_i = |r_i|. Moreover, let R_M be the absolute
value of the (approximated) expected residual percentage error remaining in the data
after removing the errors in the units belonging to the set M:

R_M = |Σ_{i∈M̄} r_i|,   where M̄ denotes the complement of M in S.

Once an “accuracy” threshold η is chosen, the selective editing procedure consists of:
1. sorting the observations in descending order according to the value of SF_i;
2. selecting the first k units for review, where
   k = min{k ∈ (1, …, n) | R_{M_j} ≤ η, j ≥ k} and M_m is the set composed of the
   first m units.
A suitable extension to the multivariate case is easily obtained by taking the maximum
of the different local scores as global score.
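The sort-and-cut procedure above can be sketched in a few lines: given the observed values, the model predictions ẑ_i, and the weights w_i, compute the robust total, the relative errors r_i, sort by score, and review the smallest number of units such that the residual error of the unreviewed units stays below η. This is a simplified, hedged illustration of section 3.1 (iii); all names are illustrative and it is not the SeleMix implementation.

```python
def prioritise(z_obs, z_hat, weights, eta):
    """Return indices of units to review interactively, chosen so that the
    absolute residual relative error of the unreviewed units is <= eta.
    Simplified form of the sort-and-cut rule of section 3.1 (iii)."""
    # Robust total using predicted values (the reference estimate).
    t_robust = sum(w * zh for w, zh in zip(weights, z_hat))
    # Relative individual errors r_i = w_i (z_hat_i - z_i) / robust total.
    r = [w * (zh - z) / t_robust for w, zh, z in zip(weights, z_hat, z_obs)]
    # Sort units by score SF_i = |r_i| in descending order.
    order = sorted(range(len(r)), key=lambda i: abs(r[i]), reverse=True)
    # Smallest k such that the residual error of the rest is within eta.
    for k in range(len(order) + 1):
        residual = abs(sum(r[i] for i in order[k:]))
        if residual <= eta:
            return order[:k]
    return order
```

With a loose threshold no unit needs review; tightening η pulls in the units with the largest expected errors first, which is the efficiency gain selective editing aims for.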
3.1.1 Case study with SBS data (ISTAT)
ISTAT has recently launched an innovative initiative aimed at modernizing the
production system of structural business statistics (SBS) based on the implementation of
a new firm-level integrated Data Warehouse (SBS DWH). The SBS DWH is based on the
primary use of administrative and fiscal data, integrated with direct survey data to
estimate variables which are not available from administrative sources and not covered
sub-populations: the DWH will provide a multidimensional set of estimates (frame) for a
set of key business statistics, like Production value, Turnover, Intermediate costs, Value
added, Wages, and Labor cost.
It will be possible to obtain estimates at an extremely refined level of detail from the SBS
DWH, overcoming some limitations of the current statistical production strategy, which
is essentially based on direct surveys integrated with administrative records for the
non-respondents. Improvements are expected in terms of reduction of both survey costs
and statistical burden (further reduction of the sample size and of the number of
variables requested from the enterprises), increased accuracy of cross-sectional estimates within
and across statistical domains, better coherence over time of the SBS estimates, and
better exploitation of existing statistical and administrative micro-data for the
dissemination of larger, more detailed and better focused business data.
The development of the new production system based on the SBS DWH entails a
substantial revision of the sampling and estimation strategy of the SBS surveys, and
requires solutions to a number of challenging methodological issues:
- harmonization of concepts (target units, target variables) in the different archives;
- evaluation of the quality and usability of administrative sources, considered either
individually and/or integrated with each other;
- management of the data integration process and of the possible associated linkage
and consistency errors;
- design of specific data editing approaches to deal with measurement errors, and
development of imputation strategies for total non-response (due to the sources’
incompleteness in terms of target population) and item non-response (due to the
sources’ incompleteness in terms of target variables), for each source and for the
integrated SBS DWH;
- design of innovative estimation strategies in the new, more complex information
context.
The reference frame for the DWH is represented by the Italian Business Register (BR),
which contains about 4.5 million enterprises. The statistical and administrative/fiscal
sources which are combined in the SBS DWH are briefly described below.
S1: the sample survey on Small and Medium Enterprises (SME), which is carried out on enterprises with fewer than 100 persons employed (about 105,000 firms selected) and
collects information on profit-and-loss statements and balance sheets, as well as
information regarding turnover, purchases of goods and services, changes in stocks,
employment and investments. Many variables are collected for both SBS and NA use.
S2: the Financial Statements (FS) of limited enterprises, collected each year by the
Chambers of Commerce (about 700,000 enterprises). This source is the best harmonized
with the SBS Regulation definitions.
S3: the Fiscal Survey on Sector Studies (SS), which includes each year about 3.5 million
enterprises with a turnover higher than 30,000 euros and lower than 7.5 million euros.
S4: the Tax returns forms (Unico), from the Ministry of Economy and Finance, based on
a unified model of tax declarations by legal form, containing different economic
information for different legal forms.
S5: the Social Security Register (Emens), from the National Social Security Institute (INPS), includes firm-level data and individual (employee) data on wages and labor cost (about 1.4 million units).
The four administrative sources have different degrees of coverage (with respect to the
SBS target population), completeness and accuracy (in terms of definitions and content
with respect to the statistical target variables). They are prioritized mainly based on the
level of the discrepancies between the adopted definitions and the SBS ones.
The overall production process of the SBS DWH which is under development is shown in
figure 3 (see Masselli and Luzi, 2012 for more details).
Focusing on the editing activities, the following strategy has been adopted among the possible alternatives: a preliminary, “light” validation process of each single source, followed by a comprehensive editing process after the sources’ integration.
For each single administrative source:
- verification of coverage and completeness, harmonization of definitions and preliminary checks of microdata (e.g. treatment of duplicates, correction of obvious and systematic errors such as unit-of-measurement errors).
On the integrated DWH:
- identification and treatment of linkage errors (errors in identification codes, analysis of non-linked units, e.g. mergers/splits);
- statistical data editing (based on a classical survey scheme, see Luzi et al., 2007):
  - selective editing on the subset of key SBS variables;
  - automatic consistency editing;
  - robust imputation of item non-response;
  - macro-editing;
- treatment of unit non-response (about 5% of non-covered units).
In this report we focus on the selective editing phase. The experimental applications aim at evaluating the possible advantages and drawbacks of using the multivariate, robust model-based approach to selective editing on integrated micro data to identify influential errors in key enterprise variables such as Turnover, Production value, Intermediate costs and Value added. The information from the BR on economic activity, legal form, number of persons employed and turnover is used as known auxiliary information.
The estimation domains are known in advance and are very detailed: 4-digit NACE; 3-digit NACE by number of employees (in classes¹); 2-digit NACE by region.
As the SBS DWH covers the entire SME target population, each unit has weight 1: in this context, a direct estimate of the impact of errors on the various estimation domains can be obtained.
In such a situation, the application of selective editing could be very complex and time consuming (e.g. in terms of scores and other parameter settings). A preliminary evaluation is required to determine which target estimates are key; selective editing will ensure the accuracy of the key estimates only.
¹ Classes of number of employees: [0-1] [2-5] [6-9] [10-19] [20-49] [50-99]
Different application scenarios need to be defined, depending on the amount of
information available on sub-populations of SMEs. Units covered by only one source
(either statistical or administrative) require a selective editing approach mainly based on
variable relations, historic information and auxiliary data from the BR. In the case of
overlapping sources, the extra information on units covered by more than one source
can be exploited to improve the effectiveness of selective editing.
A further issue is the definition of the level of accuracy to be ensured for the key estimates at the required level of detail: generally speaking, a trade-off is adopted between the number of influential units detected across all domains and the expected precision. In most cases, the domains/variables with the highest number of units requiring manual inspection are those where either the harmonization process needs further revision, or which correspond to sub-populations where, due for example to legal/administrative rules, enterprises may provide very different information for very similarly defined variables. In all cases, the main outcome of the interactive revision is not only the possible correction of data, but also a deeper knowledge of the administrative data contents and logic with respect to the specific SBS purposes.
Another issue is the assessment of the effectiveness of the (robust) multivariate model-based approach of SeleMix in predicting suitable values both for influential units and for item non-responses. As the SeleMix approach makes it possible to jointly exploit all the information available on each unit (longitudinal, from one or more sources, and from the BR), fully coherent micro information for the key variables is expected from the imputation phase, as well as the preservation of multivariate distributions.
[Figure 3 is a flow diagram and is not reproduced here. Its elements are: the BR (A); the sources SME (X, Y), FS, SS, Unico and Emens; variables harmonization (X) and preliminary treatment of microdata for each source; the DWH of integrated micro data (A, X(SME), X(FS), …, X(Emens)); treatment of inconsistent data and non-response imputation; the final DWH of integrated microdata (A, X); core SBS estimation (X); the DWH of integrated micro data for NA (A, X, Y); treatment of microdata NA(X); NA estimates and other outputs; dissemination of economic variables; and the frame for all business surveys under European Regulation.]
Figure 3: The production system of the SBS DWH
Legend:
X: Variables common to SBS and National Accounts (NA)
Y: Specific variables for National Accounts (NA)
A: Auxiliary variables from the Business Register (BR)
SME: Sample Survey on Small and Medium Enterprises
FS: Financial Statements
SS: Fiscal survey on Sector Studies
Unico: Tax returns forms
Emens: Social Security Register
3.2 SELEKT
Statistics Sweden have produced a SAS-based tool for selective editing, which may also
be useful in the context of editing in a Statistical Data Warehouse. The development of
the tool and methods underpinning it are described in Norberg and Arvidson (2008).
SELEKT uses a generic score function for each unit, variable and domain grouping of the
form:
Score = Suspicion × Impact × Importance
The three components of the score are described below.
(i) Suspicion
The Suspicion part of the score measures how likely it is that a returned value is in error.
In traditional editing, a value is usually considered to be either suspicious or not. In other
words, the suspicion is either 0 (passes editing) or 1 (fails editing). SELEKT has a more
flexible approach, with three different ways of calculating Suspicion: fatal edits, query
edits and test variables.
Fatal edits are used to detect invalid responses that cannot possibly be correct, for example a negative number of employees or component variables not summing to a total. For any value failing a fatal edit, the Suspicion is set to 1 and the record fails selective editing.
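As an illustration (the field names and the specific edits are hypothetical, not taken from SELEKT), a fatal edit check might be sketched as:

```python
def fatal_edit_suspicion(record):
    """Return Suspicion 1.0 if any fatal edit fails, else 0.0.

    Illustrative fatal edits: the number of employees must be
    non-negative, and component variables must sum to the total.
    """
    if record["employees"] < 0:
        return 1.0
    if sum(record["components"]) != record["total"]:
        return 1.0
    return 0.0

# Components 100 + 50 do not sum to the reported total of 180,
# so the record fails selective editing outright.
suspicious = {"employees": 12, "components": [100, 50], "total": 180}
print(fatal_edit_suspicion(suspicious))  # -> 1.0
```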
Query edits are traditional-style edits used to detect implausible responses, for example unrealistically high values for a variable. Any value failing a query edit is given a fixed Suspicion value between 0 and 1; the choice of value depends on the importance given to each query edit in identifying potential errors.
Test variables are arithmetic expressions based on (individual or combinations of) survey
variables, past survey data or auxiliary variables. SELEKT uses the variables specified to
calculate a general measure of Suspicion. SELEKT calculates a threshold for each test
variable and then calculates a measure of the relative distance of the returned value
from the threshold. This produces a value between 0 and 1 for each test variable
relevant to a particular returned value. SELEKT summarises the information for each
returned value by taking the maximum suspicion from each of the test variables that the
value fails. For more details see Norberg and Arvidson (2008).
(ii) Impact
The Impact part of the score measures the potential impact on the domain estimate for
the particular variable if the returned value is an error. It is calculated as the absolute
difference between the returned value and a prediction of that value, multiplied by the
survey weight (if the method is applied to admin data, the weight will need to be
defined in some other way). The predicted values can be calculated either from one or
more previous values for the same unit, or by taking an average of previous values over
units in the same homogeneous group.
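In code, the Impact component reduces to the following (a sketch; the numbers in the usage line are invented for illustration):

```python
def impact(returned, predicted, weight):
    """Impact = |returned value - predicted value| * weight.

    The prediction can come from the unit's own previous values or from
    an average over units in the same homogeneous group; for admin data
    the weight has to be defined some other way than a survey weight.
    """
    return abs(returned - predicted) * weight

# A unit reporting 1500 against a prediction of 1200, with survey
# weight 25, could move the domain estimate by up to 7500 if in error.
print(impact(1500.0, 1200.0, 25.0))  # -> 7500.0
```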
(iii) Importance
The Importance part of the score has two roles. Firstly it allows the possibility of giving
extra weight to particular variables or domain groupings when calculating the score.
Secondly, this part of the score formula is used to standardise the score by dividing it
either by an estimate of the domain total for the variable or an estimate of the standard
error of the domain total. The Importance is calculated as the product of the variable
and domain weights, divided by the standardising factor.
Once the score has been calculated for a particular unit, variable and domain grouping,
it is aggregated to create a unit score. The method of aggregation is based on the
Minkowski distance measure, which allows for a range of options including a simple sum
of scores, a Euclidean distance or taking the maximum of the scores. Thresholds are
then used to determine if the score fails or passes selective editing. The thresholds are
defined analytically, by testing the effect of different thresholds on the number of
failures and resulting pseudo-bias (the additional bias incurred by only following up
those units that fail selective editing).
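Putting the three components together, the item scores for a unit can be aggregated with a Minkowski distance. The functions below are an illustrative sketch of that scheme (the numeric values are invented), not SELEKT's implementation:

```python
def item_score(suspicion, impact, importance):
    """Score for one unit/variable/domain combination:
    Score = Suspicion * Impact * Importance."""
    return suspicion * impact * importance

def unit_score(item_scores, p=1.0):
    """Minkowski aggregation of item scores: p=1 gives the simple sum,
    p=2 the Euclidean distance, and large p approaches the maximum."""
    return sum(s ** p for s in item_scores) ** (1.0 / p)

scores = [item_score(0.8, 500.0, 0.001),   # 0.4
          item_score(0.5, 200.0, 0.001),   # 0.1
          item_score(1.0, 300.0, 0.001)]   # 0.3
print(round(unit_score(scores, p=1), 3))   # simple sum -> 0.8
print(round(unit_score(scores, p=2), 3))   # Euclidean distance

# The unit fails selective editing when its score exceeds a threshold,
# chosen by studying failure counts and pseudo-bias at candidate values.
print(unit_score(scores, p=1) > 0.6)       # -> True
```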
3.2.1 Use of SELEKT in a Statistical Data Warehouse
The flexible approach of SELEKT has the potential to be useful for efficient editing of
data in a Statistical Data Warehouse. The success of the method depends on the
particular use being made of the Data Warehouse. In this section we consider how
SELEKT could work firstly with mixed-source outputs and then in different Statistical
Data Warehouse scenarios.
(i) Using SELEKT with mixed-source outputs
It is useful to begin by simply considering how SELEKT can be used to selectively edit
mixed-source outputs. A statistical output may be compiled using a combination of
admin and survey data. For example, the variable turnover can be estimated using
survey and VAT data. In this setup, the use of selective editing will be different from that
where only a survey is used. The key element of selective editing is the impact a
particular unit has on published outputs. In the mixed source case, the method for
estimating that impact needs to take account of the fact that the survey source only
contributes to a part of that output. It is possible to distinguish two cases: in the first
case it is not possible to manually edit the admin data in any way, and in the second
case some kind of manual editing is carried out on both the survey and admin sources.
It is usually not possible to re-contact the units who contributed to admin data sources.
In some cases, manual inspection of suspicious data by statisticians can resolve potential
errors. However, when the data or available expertise does not make this possible, the
only option may be to automatically edit the admin data (see Verschaeren et al, 2013).
In this case, the selective editing is only applied to the survey component of the output.
Rather than using standard survey weights in the Impact part of the score, it will be necessary to define weights based on the contribution each unit has to the mixed-source output.
The method for doing this will depend on how the estimates are constructed in creating
the output. For examples of estimation approaches used in this context see Lewis and
De Waal (2013). When calculating the Importance part of the score, the standardising factor used in the denominator should be an estimate of the mixed-source domain estimate. It does not make sense to use the estimated standard error in this case unless there is a reliable method for estimating the mean square error of the mixed-source estimates (see Burger et al, 2013 for advice on estimating mean square error for mixed-source estimators). When setting the thresholds in SELEKT, the effect on the accuracy of the mixed-source domain estimate should be the determining factor in deciding whether units are followed up or not. Note that this means that far fewer units may need to be followed up in cases where the mixed-source estimates are dominated by admin data.
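A toy sketch of these ideas, under the simplifying assumption (chosen here for illustration, not prescribed by the text) that the mixed-source domain estimate adds admin values directly to survey-weighted survey values:

```python
def mixed_source_estimate(admin_values, survey_values, survey_weights):
    """Mixed-source domain estimate: admin units contribute directly,
    survey units through their survey weights (an illustrative estimator)."""
    return sum(admin_values) + sum(
        w * y for w, y in zip(survey_weights, survey_values))

def standardised_impact(returned, predicted, weight, domain_estimate):
    """Impact of a potential error on the mixed-source domain estimate,
    standardised by that estimate (the Importance denominator above)."""
    return abs(returned - predicted) * weight / domain_estimate

# Admin data dominate the domain: the same suspicious survey value has a
# small standardised impact, so fewer survey units need follow-up.
total = mixed_source_estimate([900.0] * 10, [100.0], [10.0])  # 9000 + 1000
print(standardised_impact(100.0, 80.0, 10.0, total))  # 200 / 10000 = 0.02
```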
The size of admin data sets makes it infeasible to manually inspect every suspicious unit.
However, if it is possible to resolve suspicious values by manual inspection, selective
editing can help identify a reasonably sized subset of units to inspect. Since admin
sources are not sample surveys, it is tempting to give all units a weight of 1 in the Impact
part of the score in SELEKT. However, in a mixed-source output it is important to
consider the true impact that each unit has on the final estimate. The fact that survey
data is needed in addition to admin data suggests in itself that the admin data do not
contain enough information to represent the units in the target population without
being modified in some way. As in the first case, consideration of the specific estimator being used is the only way to derive appropriate weights. The standardising factor in
the Importance part of the score, and the thresholds should again be set considering the
overall mixed-source domain estimates. In this case, the resulting selective editing
should fail more units from whichever source makes the biggest contribution to those
estimates.
(ii) Using SELEKT in a Statistical Data Warehouse
SELEKT can be used to selectively edit survey or admin data in a Statistical Data
Warehouse using similar methods to those described above. However, it is important to
consider how the data will be used. The key factor used in prioritising units for editing is
the impact they would have on outputs if they were in error. Data in a Statistical Data
Warehouse can be used for a wide range of different outputs, so in theory it is necessary
to consider the impact of editing on each of those outputs. The more outputs that are
derived from the data in the Warehouse, the more units are likely to need editing if it is
required to produce estimates with the required level of accuracy. At the limit, it may be
necessary to edit all units to ensure the accuracy of a limitless amount of outputs. In this
case, selective editing is clearly not feasible. A more appropriate approach would be to
automatically edit the data using methods similar to those described in Verschaeren et
al (2013). Alternatively, a very simple prioritisation may be implemented by only editing
the largest units.
If the Statistical Data Warehouse is going to be used for a relatively small, well-defined set of outputs, SELEKT offers functionality that could be used to implement selective
editing. The Impact and standardising part (the denominator) of the Importance score,
and the thresholds can be developed using similar techniques to those described for
selectively editing mixed-source data above. The numerator of the Importance score
gives importance weights for different variables and domain groupings. It may be
possible to use this functionality to construct weights that reflect the importance a
particular variable and domain has in each of the outputs derived from the Data
Warehouse. This would reflect, to some degree, the varying outputs making use of the
data in the Warehouse. The weights chosen could also be defined to take account of the
fact that some of the outputs may be more important than others. To derive such
weights would clearly not be completely straightforward and would require prior
analysis of the data and their role in the multiple outputs before setting up the
Importance part of the SELEKT score. To test the feasibility of this approach, it would be
necessary to carry out a dedicated study. Clearly the approach becomes more
complicated as the number of outputs derived from the Data Warehouse increases. It
would be sensible to begin testing the approach based on the simplest scenario, with
only two outputs using the data in the Warehouse.
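One hypothetical way to build such weights (purely a sketch: SELEKT provides the slot for variable and domain weights, not this formula, and both the indicator vector and the priorities below are assumptions) is to combine output priorities with indicators of whether a variable/domain cell feeds each output:

```python
def importance_weight(feeds_output, output_priorities):
    """Importance weight for one variable/domain cell: a priority-weighted
    count of the outputs the cell contributes to (illustrative only)."""
    return sum(p * f for p, f in zip(output_priorities, feeds_output))

# Two outputs, the first judged three times as important (0.75 vs 0.25).
# A cell feeding both outputs outweighs one feeding only the minor output.
print(importance_weight([1, 1], [0.75, 0.25]))  # -> 1.0
print(importance_weight([0, 1], [0.75, 0.25]))  # -> 0.25
```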
4. Recommendations
Table 1 gives recommendations on issues associated with selectively editing combined
data and Statistical Data Warehouse functions, and provides solutions adopted by NSIs
or recommendations from this or other projects.
Table 1: Recommendations for selective editing of combined data sources in a Statistical Data
Warehouse framework
Situation/Issue: Data is combined/matched and stored in the Warehouse, but not used in production of estimates
Recommendation / Best practice:
Data Integration ESSnet: the strategy to distinguish linkage errors from other errors needs to be specified explicitly and should be automated if possible. See http://www.cros-portal.eu/sites/default/files//WP2.pdf
Errors should be identified by analysing units that cannot be linked. For the case study described, turnover was ordered to identify the most influential errors first. Additional indicators should be developed to determine the type of linkage error. Manual intervention by subject matter experts is recommended to deal with the errors identified; for smaller companies it may be useful to automatically collect internet information on names and activities to support the correction process.
Gros (2012) outlines an algorithm to determine which source should be used when variables or characteristics are available from multiple sources. This involves calculating a score of differences between data sources to determine the units that need to be manually compared or edited.
Situation/Issue: Administrative (or any external) data in the Data Warehouse used as auxiliary for survey data in production of estimates
Recommendation / Best practice:
Data Integration ESSnet: error detection and correction is a three-phase approach. Phases 1 and 2 concentrate on linkage with the business register and the target population frame for the output; phase 3 focuses on editing after making population estimates, using suspicion scores. See http://www.cros-portal.eu/sites/default/files//WP2.pdf
Luzi et al (2010); Lewis (2011): it is possible to improve the quality of estimates by using administrative data in the editing and imputation process. Highly accurate external data are not needed, as they are mainly used as auxiliary information to optimize the error identification/treatment process.
Situation/Issue: Data in the Warehouse (all sources) used to update the business register
Recommendation / Best practice:
WP2.2 of the ESSnet DWH, Guidelines (incl. options) on how the BR interacts with the S-DWH: http://www.cros-portal.eu/sites/default/files//DWH-SGA2-WP2%20%202.2.1%20Guidelines%20on%20how%20the%20BR%20interacts%20with%20the%20S-DWH_v%202.0.docx
Selective editing is generally not suitable for variables being used to update the business register; such variables should be prioritised as “critical” in the editing process.
Situation/Issue: Survey and administrative data combined to produce estimates - edited separately
Recommendation / Best practice:
Gåsemyr et al (2008): in Statistics Norway, the ISEE system has separate editing processes for each data source: selective editing in surveys, and automatic editing of administrative data.
Gros (2012): in the Esane system for SBS, selective editing of an administrative source with a very large number of variables is difficult to manage and can lead to multiple calls to companies regarding their administrative and survey data. INSEE is testing the use of selective editing on key variables only instead. Data experts on both the administrative sources and the survey outputs are needed.
Situation/Issue: Survey and administrative data combined to produce estimates - edited when integrated
Recommendation / Best practice:
Section 3.2.1: it should be possible to selectively edit mixed-source estimates as long as it is possible to calculate the contribution of each of the units to those estimates.
Situation/Issue: Survey and administrative data combined to produce estimates - edited both separately and when integrated
Recommendation / Best practice:
Section 3.1.1: editing data from all sources will focus on ensuring that high-quality variables are used in linking. The trade-off between costs and accuracy needs to be evaluated, as more than one selective editing procedure has to be consistently defined.
Situation/Issue: All data sources combined and re-used extensively for several outputs
Recommendation / Best practice:
Section 3.2.1: selective editing may not be appropriate where data sources are re-used for a large number of outputs; other options would be automatic editing or a simple prioritisation of the largest units. It may be possible to use selective editing when data sources are only re-used for a small number of outputs, using importance weights such as those implemented in the SELEKT tool.
5. References
Bellisai, D., Di Zio, M., Guarnera, U., Luzi, O. (2009), A selective editing approach based on
contamination models: a comparative application to an Istat business survey. Proceedings of
UNECE Work session on Statistical Data Editing, Neuchâtel, 5-7 October 2009. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2009/wp.27.e.pdf
Brion, P. (2012), Methodological Questions Raised by the Combined Use of Administrative and
Survey Data for French Structural Business Statistics. Proceedings of UNECE Work session on
Statistical Data Editing, Oslo, Norway, September 2012. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2012/19_France.pdf
Brion, P. (2011), First elements relative to the data editing strategy used for the new system of
French Structural Business Statistics. Proceedings of UNECE Work session on Statistical Data
Editing, Ljubljana, Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.31.e.pdf
Brion, P. (2009), The Implementation of the new system of French structural business statistics.
Proceedings of UNECE Work session on Statistical Data Editing, Neuchâtel, Switzerland, October
2009. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2009/wp.20.e.pdf
Brion, P. (2008), The Future System of French Structural Business Statistics: The Role of the
Estimates. Proceedings of UNECE Work session on Statistical Data Editing, Vienna, Austria, April
2008. Web: http://www.unece.org/fileadmin/DAM/stats/documents/2008/04/sde/wp.13.e.pdf
Brisebois, F., Laroche, R., and Manriquez, R. (2011), Processing Methodology of Tax Data at
Statistics Canada. Proceedings of UNECE Work session on Statistical Data Editing, Ljubljana,
Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.7.e.pdf
Buglielli, T., Di Zio, M., Guarnera, U. (2011), Selective Editing of Business Survey Data Based on
Contamination Models: an Experimental Application. NTTS 2011 New Techniques and
Technologies for Statistics, Brussels, 22-24 February 2011.
Buglielli, M. T., Di Zio, M., Guarnera, U. and Pogelli, F. R. (2011), An R package for selective
editing based on a latent class model. Proceedings of UNECE Work session on Statistical Data
Editing, Ljubljana, Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.45.e.pdf
Burger, J., Davies, J., Lewis, D., van Delden, A., Daas, P. and Frost, J-M. (2013), Guidance on the
accuracy of mixed-source statistics. Deliverable 6.3 (SGA 3) of the ESSnet Admin Data. Web:
http://essnet.admindata.eu/Document/GetFile?objectId=5989
Casciano, C., De Giorgi, V., Luzi, O., Oropallo, F., Seri, G., and Siesto, G. (2011), Combining
administrative and survey data: potential benefits and impact on editing and imputation for a
structural business survey. Proceedings of UNECE Work session on Statistical Data Editing,
Ljubljana, Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.9.e.pdf
Cloutier, L. (2009), Selective Editing for Business Surveys at Statistics Canada. Proceedings of
UNECE Work session on Statistical Data Editing, Neuchâtel, Switzerland, October 2009. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2009/wp.46.e.pdf
De Waal, T., Pannekoek, J., and Scholtus, S. (2011), Handbook of Statistical Data Editing and
Imputation. John Wiley & Sons.
Farewell, K. and Schubert, P. (2011) A Macro Significance Editing Framework to Detect and
Prioritise Anomalous Estimates. Proceedings of UNECE Work session on Statistical Data Editing,
Ljubljana, Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.13.e.pdf
Gåsemyr, S., Nordbotten, S., and Morten, A. Q. (2008) Role of editing and imputation in
integration of sources for Structural Business Statistics. Proceedings of UNECE Work session on
Statistical Data Editing, Vienna, Austria, April 2008. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/2008/04/sde/wp.10.e.pdf
Gros, E. (2012), Assessment and Improvement of the Selective Editing Process in Esane (French
SBS). Proceedings of UNECE Work session on Statistical Data Editing, Oslo, Norway, September
2012. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2012/25_France.pdf
Gros, E. (2011) Quality improvement of individual data and statistical outputs based on
combined use of administrative and survey data. Proceedings of UNECE Work session on
Statistical Data Editing, Ljubljana, Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.6.e.pdf
Hedlin, D. (2008), Local and Global Score Functions in Selective Editing. Proceedings of UNECE
Work session on Statistical Data Editing, Vienna, Austria, April 2008. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/2008/04/sde/wp.31.e.pdf
Holmberg, A., Blomqvist, K., and Engdahl, J. (2011) A strategy to improve the register system to
store, share and access data and its connections to a generic statistical information model
(GSIM). Proceedings of UNECE Work session on Statistical Data Editing, Ljubljana, Slovenia, May
2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.37.e.pdf
Hoogland, J. and Smit, R. (2008) Selective Automatic Editing of Mixed Mode Questionnaires for
Structural Business Statistics. Proceedings of UNECE Work session on Statistical Data Editing,
Vienna, Austria, April 2008. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/2008/04/sde/wp.2.e.pdf
Jug, M. (2009) Uptake of service orientated architecture in statistical agencies – are we really so
different? Paper presented at the conference Modernisation of Statistics Production,
(Stockholm, Sweden 2-4 November 2009).
http://www.scb.se/Grupp/Produkter_Tjanster/Kurser/ModernisationWorkshop/final_papers/C_1_SOA_Jug_final.pdf
Laitila, T., Wallgreen, A., and Wallgren, B. (2011), Quality Assessment of Administrative Data
http://www.scb.se/statistik/_publikationer/OV9999_2011A01_BR_X103BR1102.pdf
Latouche, M. and Berthelot, J. M. (1992), Use of a Score Function to Prioritize and Limit
Recontacts in Editing Business Surveys. Journal of Official Statistics 8, pp. 389-400.
Lawrence, D. and McKenzie, R. (2000), The General Application of Significance Editing. Journal of
Official Statistics 16, pp. 243-253.
Lewis, D. (2011), Evaluating the benefits of using VAT data to improve the efficiency of editing in
a multivariate annual business survey. Proceedings of UNECE Work session on Statistical Data
Editing, Ljubljana, Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.8.e.pdf
Lewis, D. and De Waal, T. (2013), Guide to estimation methods. Deliverable 3.4 (SGA 3) of the
ESSnet Admin Data. Web:
http://www.cros-portal.eu/sites/default/files//SGA%202011_Deliverable_3.4.pdf
Lindgren, K. (2011), Selective Editing in the International Trade in Services. Proceedings of UNECE
Work session on Statistical Data Editing, Ljubljana, Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.19.e.pdf
Lorenc, B., Engdahl, J. and Blomqvist, K. (2012), Two Paradigms for Official Statistics Production.
Proceedings of UNECE Work session on Statistical Data Editing, Oslo, Norway, September 2012.
Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2012/12_Sweden.pdf
Luzi, O., di Zio, M., Guarnera, U., Manzari, A., de Waal, T., Pannekoek, J., Hoogland, J.,
Tempelman, C., Hulliger, B. and Kilchmann, D. (2007), Recommended practices for editing and
imputation in cross-sectional business surveys, manual, EDIMBUS project. Web:
http://epp.eurostat.ec.europa.eu/portal/page/portal/quality/documents/RPM_EDIMBUS.pdf
Luzi, O., Guarnera, U., Silvestri, F., Buglielli, T., and Nurra, A. (2010), Using administrative data in
multivariate selective editing: an application to the Italian survey on ICT usage and e-commerce.
Proceedings of Q2012 European Conference on Quality in Survey Statistics, Athens, Greece, May
2012.
http://www.q2012.gr/articlefiles/sessions/29.3_Luzi%20et%20Using%20administrative%20data
%20in%20multivariate%20selective%20editing.pdf
Luzi O., Guarnera U., Silvestri F., Buglielli T., Nurra A., Siesto G. (2012), Multivariate selective
editing via mixture models: first applications to Italian structural business surveys. UNECE Work
Session on Statistical Data Editing, Oslo, 24-26 September 2012. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2012/03_Italy.pdf
Masselli, M. and Luzi, O. (2012), Some considerations on developing a DWH for SBS estimates.
Intermediate report of the ESSnet DWH, March 2012.
Norberg, A. and Arvidson, G. (2008), New Tools for Statistical Data Editing. Proceedings of
UNECE Work session on Statistical Data Editing, Vienna, Austria, April 2008. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/2008/04/sde/wp.33.e.pdf
Skentelbery, R., Finselbach, H., and Dobbins, C. (2011), Improving the efficiency of editing for
ONS business surveys. Proceedings of UNECE Work session on Statistical Data Editing, Ljubljana,
Slovenia, May 2011. Web:
http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.44/2011/wp.20.e.pdf
Verschaeren, F., Lewis, D., Lewis, P., Benedikt, L., Vekeman, G., Putz, C. and Miltiadou, M. (2013),
Reference document Section 3: “Checking for Errors and Cleaning the Incoming Data”.
Deliverable 2.3 (SGA 3) of the ESSnet Admin Data. Web:
http://www.cros-portal.eu/sites/default/files//SGA%202011_Deliverable_2.3.pdf
Wallgren, A. and Wallgren, B. (2007), Register-based Statistics: Administrative Data for Statistical
Purposes. New York, Wiley.