DWH-SGA2-WP2 - 2.2.2 Guidelines (including options)

Title: Guidelines (including options) on how the BR interacts with the SDWH
WP: 2
Deliverable: 2.2.2
Version: 4.0 - Final
Date: 2-10-2013
Author: Pieter Vlag
NSI: Netherlands
ESS - NET
ON MICRO DATA LINKING AND DATA WAREHOUSING
IN PRODUCTION OF BUSINESS STATISTICS
Contents
1. Introduction
1.1 Definition of a statistical DataWareHouse
1.2 The position of the Business Register in a statistical-DWH
1.3 The relationship between this document and architectural and technical elements of a statistical-DWH
2.3 Linking different data sources: units and populations
2.4 The Business Register and the population frame
2.5 The Business Register and the statistical DWH
3. Statistical units and population
3.1 Statistical units and population
3.2 A statistical-DWH: the population frame
3.3 Backbone of the statistical-DWH: integrated population frame, turnover and employment
3.4 Target populations of active enterprises
4. Linking datasources to the statistical unit
4.1 Linking other datasources to the backbone of the statistical-DWH
4.2 Variation in input units
4.3 Variation in output units
4.4 The statistical unit and the process of a statistical-DWH
4.5 The concept of a statistical unit base
5 Correcting information in the population frame and feedback to SBR
5.1 The position of the Business Register in a statistical Data Warehouse
5.2 Dealing with conflicting information
5.3 Panel surveys and correcting population characteristics
5.4 Timing of correcting data in the backbone and the SBR
5.5 Timing of feedback to the SBR
6 Conclusions
1. Introduction
1.1 Definition of a statistical Datawarehouse
The main goal of the ESSnet on “micro data linking and data warehousing” is to prepare
recommendations about better use of data that already exist in the statistical system and to
create fully integrated data sets for enterprise and trade statistics at micro level: a 'data
warehouse' approach to statistics.
The broad definition of a data warehouse to be used in this ESSnet is:
‘A common conceptual model for managing all available data of interest, enabling the NSI to
(re)use this data to create new data/new outputs, to produce the necessary information and
perform reporting and analysis, regardless of the data’s source.’ (FPA, 2010)
The project describes a generic datawarehouse (DWH) for statistics or statistical
datawarehouse (statistical-DWH) as: a central statistical data store, regardless of the data’s
source, for managing all available data of interest, enabling the NSI to:
- (re)use data to create new data/new outputs,
- perform reporting,
- execute analysis,
- produce the necessary statistical output.
This corresponds with a central repository able to support several kinds of data (micro, macro
and meta) entering the S-DWH in order to support cross-domain production processes
and statistical design, fully integrated in terms of data, metadata, process and instruments.
In practice, the statistical-DWH is subdivided into two separate environments:
- The first is where all available information is collected and built up, usually defined as
  Extraction, Transformation and Loading (ETL) functions. The aim of this environment is
  to create a set of fully integrated data.
- The second is the actual data warehouse, i.e. where data analysis, or mining, and reports for
  executives are realised. The aim of this environment is to disseminate the fully integrated
  data as consistent outputs.
Figure 1 shows a schematic view of a commercial data warehouse with staging area. Data
staging is represented as a small part of the datawarehouse which is focussed on data analysis,
data mining and reporting tools. This illustrates the main difference between a commercial
and a statistical-DWH. The statistical-DWH is also focussed on the staging area, i.e. the
process of creating a set of fully integrated data (from different data sources), while the
commercial datawarehouse is – in most cases - focussed on the actual datawarehouse to
produce flexible outputs etc.
Figure 1 Architecture of a Data Warehouse with staging area. Illustration taken from the
Oracle9i Data Warehousing Guide (2002).
Workpackage 2 (WP 2) of this ESSnet covers all essential methodological elements for
designing, building and implementing the statistical-DWH. It concentrates on the
methodological aspects of creating a set of integrated data.
This document describes an essential part of the creation of an integrated dataset: the role of
the statistical business register as a frame to integrate data from different sources.
1.2 The position of the Business Register in a statistical-DWH
The purpose of this document is to describe the central role of the following elements in a
statistical-DWH:
- statistical units,
- the population frame, which includes the number of enterprises,
- total turnover derived from Value Added Tax (VAT) data,
- total employment derived from social security data.
Together they form the reference to which all flexible input data are linked to obtain an
integrated set of data.
The position of the Business Register in a statistical-DWH is relatively simple in general
terms. The Business Register provides information about statistical units, the population,
turnover derived from VAT and wages plus employment derived from tax and/or social
security data. As this information is available for almost all units, the Business Register
allows us to produce flexible output for turnover, employment and number of enterprises.
The aim of the statistical-DWH is to link all other information to the Business Register in
order to produce consistent and flexible output for other variables. In order to achieve this, an
architectural and technical structure of a statistical-DWH has been developed. This
architecture is described in paragraph 1.3.
We realise that some National Statistical Institutes (NSIs) have separate production systems to
calculate totals for turnover and employment outside the Statistical Business Register (SBR).
These systems are linked to the population frame of the SBR. The advantage of doing this is
that such a separate process acknowledges that producing admin data based turnover and
employment estimates requires specialised knowledge about tax rules and definition issues.
Nevertheless, the final result of calculating admin data based totals for turnover and
employment within or outside the SBR is the same. As this tax information is available for
almost all units and linked with the SBR, it is possible to produce flexible output for turnover,
employment and number of enterprises regardless of whether totals are calculated within or
outside the Business Register.
Therefore, we discuss the role of (flexible) population totals like number of enterprises,
turnover and employment in a statistical-DWH, but we do not discuss whether totals of turnover
and employment should be calculated within or outside the SBR. This decision is left to the
individual NSI.
The same is true for whether the SBR is part of the statistical-DWH or not. It is up to an
individual NSI whether the statistical-DWH uses extracts from the SBR for period t of
- statistical units,
- population frame,
- total turnover derived from the Value Added Tax (VAT) data,
- total employment derived from social security data,
or includes the entire SBR-system with all this information in the statistical-DWH. In
chapter 2.5, however, we will discuss some pros and cons of including the SBR in a
statistical-DWH or not.
1.3 The relationship between this document and architectural and technical elements of a statistical-DWH
Another workpackage (WP 3) covers all essential architectural and technical elements for
designing, building and implementing the statistical-DWH. Basically, workpackage 3 has
linked the GSBPM (Generic Statistical Business Process Model) sub-processes to the
statistical-DWH concept. As a result, it has provided a Business Architecture for the
statistical-DWH. Moreover, it has proposed a modular workflow for the statistical-DWH in
order to manage the information flow between data sources and the central administration of a
statistical-DWH. To do this, it uses four functional layers:
- data source layer,
- integration layer,
- interpretation and data analysis layer,
- data presentation layer.
Figure 2 shows the GSBPM model. Figure 3 shows the relationship between the phases of the
statistical process as defined by the GSBPM and the functional layers as proposed by the
workpackage 3 team.
Note that statistical (enterprise) units play an important role in the processing phase of the
GSBPM: they are needed to link independent input data sets with the population frame and, in
turn, to relate the input data to statistical estimates. This processing phase corresponds
with the integration layer of the statistical-DWH.
In the next chapters of this document, we will discuss in more detail where populations and
statistical units play a crucial role and how this interfaces with the business registers.
Figure 2 A schematic sketch of the GSBPM (Generic Statistical Business Process Model). Note that
the GSBPM divides the statistical process into 9 phases. These phases are divided into subprocesses.
Figure 3 Relationships between the layers of a statistical-DWH and the statistical processes according
to the GSBPM (Generic Statistical Business Process Model).
2.3 Linking different data sources: units and populations
The aim of a statistical-DWH is to create a set of fully integrated data pertaining to
enterprises, which enables a statistical institute to produce flexible and consistent output. The
original data come from different data sources. Collection of these data takes place in the
collect phase of the Business Architecture (fig. 2 – sub-process 4).
In practice, different data sources may cover different populations. Coverage may differ for
two reasons:
a. the definition of an enterprise differs between the sources, i.e. sources have different
units.
b. sources may include (or exclude) groups of enterprises which are excluded (or included)
in other sources.
An example of the latter is the VAT-registration versus survey data. VAT-data (and some
other tax data like corporate tax data) do not include the smallest enterprises, but include all
other commercial enterprises. Survey samples contain information about a small selected
group of enterprises, including the smallest enterprises. Hence, linking data of several sources
is not only a matter of linking enterprises between the different input data but also a matter of
relating all input data to a reference, the so-called population frame.
Different sources may have different units. For example, surveys are based on statistical units
(which generally correspond with legal units), while VAT-units may be based on enterprise
groups (as in the Netherlands). Hence, when linking VAT-data and survey-data to the target
population, it is important to agree to which units the data are linked.
Summarising, when linking several input data in a statistical-DWH, one has to agree about
- the population frame, i.e. the reference to which all data sources are linked,
- the enterprise unit to which all input data are matched.
Both challenges will be addressed in this deliverable. The technical aspects about linking of
several data-sources are described in deliverable 2.4 of the ESSnet on Datawarehousing
(DWH).
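The two agreements above can be illustrated with a minimal Python sketch. It assumes that every source has already been expressed in statistical units; the unit IDs, NACE codes, values and the helper names `link_to_frame` and `coverage` are hypothetical and serve only as illustration, not as the linking method of deliverable 2.4.

```python
# Hypothetical population frame: statistical unit ID -> NACE code.
frame = {"E1": "4711", "E2": "4711", "E3": "6201", "E4": "6201"}

# Hypothetical input sources, keyed by statistical unit ID.
vat = {"E1": 500.0, "E2": 120.0, "E3": 80.0, "X9": 40.0}  # X9 is not in the frame
survey = {"E2": {"investments": 10.0}, "E4": {"investments": 3.0}}


def link_to_frame(frame, source):
    """Split a source into records that match the frame and records that do not."""
    linked = {unit: rec for unit, rec in source.items() if unit in frame}
    unlinked = {unit: rec for unit, rec in source.items() if unit not in frame}
    return linked, unlinked


def coverage(frame, source):
    """Return the frame units for which the source holds a record."""
    return sorted(unit for unit in frame if unit in source)


vat_linked, vat_unlinked = link_to_frame(frame, vat)
print(coverage(frame, vat))     # frame units covered by the VAT data
print(coverage(frame, survey))  # frame units covered by the survey
print(sorted(vat_unlinked))     # source units that need further investigation
```

Records that cannot be linked are kept apart rather than dropped, so that coverage differences between a source and the population frame remain visible.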
2.4 The Business Register and the population frame
Member States of the European Union maintain business registers for statistical purposes as a
conduit for the preparation and coordination of surveys, as a source of information for the
statistical analysis of the business population and its demography, for the use of
administrative data, and for the identification and construction of statistical units. The
Regulation (EC) No 177/2008 of the European Parliament and the Council (EC) sets out a
common framework for the harmonisation of the national business registers for statistical
purposes and Article 7 of the Regulation asks for the publication of a business register
recommendation manual. The manual aims to explain the reasoning behind the provisions of
the Regulation. It aims to provide the extra information required for the correct and consistent
interpretation of the Regulation in all countries. The latest edition of the manual was
published in 2010. The manual has been updated in close cooperation with the Member
States.
The regulation and manual imply that the business register contains at least
- a statistical unit,
- a name and address of the statistical unit,
- an activity-code (NACE),
- a starting and a stopping date of enterprises.
The implication for the statistical-DWH is that the required information about the reference or
population frame, e.g. units and populations (see paragraph 2.3), can be obtained from the
SBR. Hence, the SBR is a crucial input for the statistical-DWH.
2.5 The Business Register and the statistical DWH
The implication of previous paragraphs is that the population frame derived from the SBR is a
crucial part of the statistical-DWH. It is the reference to which all data sources are linked.
However, this does not mean that the SBR itself is part of the statistical-DWH. A very good
practical solution is that
- the population frame is derived from the SBR for every period t,
- these snapshots of population characteristics for periods t are used in the statistical-DWH.
By choosing this option, the maintenance of the SBR is separated from the maintenance of the
statistical-DWH. Both systems are, however, linked by the same population characteristics for
period t. This option is called SBR outside the statistical DWH.
Another option is that the entire SBR-system is included in the statistical-DWH. The
advantage of this approach is that corrected information about populations in the
statistical-DWH is immediately implemented in the SBR. However, this may lead to consistency
problems if outputs are produced outside the statistical-DWH (as the ‘corrected’ information
is not automatically incorporated in these parts of the SBR). Maintenance problems may arise
as a system including both the production of a SBR as well as flexible statistical outputs may
be large and quite complex. This option is called SBR inside the statistical DWH.
In our opinion, it is up to the individual NSIs whether the SBR should be inside or outside the
statistical-DWH because the coverage of the statistical-DWH (it may include all statistical
inputs and outputs or only parts of them) may differ between countries.
Furthermore, we did not investigate the crucial maintenance factor.
In the remaining part of this report, we consider the option “SBR outside the statistical DWH”
only. This choice has been made for the sake of clarity. Apart from the last paragraph of
chapter 6, which is not relevant in the case of “SBR inside the statistical DWH”, this choice
does not affect the other conclusions of this report.
3. Statistical units and population
3.1 Statistical units and population
Taking into account the expected recommendations of the ESSnet on Consistency, it is
proposed that the statistical enterprise unit is the standard unit in business statistics. Ideally,
the statistical community should have the common goal that all Member States use a unique
identifier for enterprises based on the statistical unit. Therefore, the statistical-DWH uses the
statistical enterprise units as standard units only. As long as a unique identifier for enterprises
is not realised yet, data from sources not using the statistical unit are linked to the statistical
unit in a statistical-DWH. For further analyses, it is recommended that the statistical-DWH
only uses the statistical unit as a standard, because it is quite complicated to use several units
in treatment of data. As a consequence, (standard) enterprise populations are based on
statistical units.
In line with the SBS-regulation the following definition for an enterprise population is used in
this paper: all enterprises with a certain kind of activity being economically active during the
reference period. For annual statistics this means that the target population consists of all
enterprises active during the year, including the starters and stoppers (and the new/stopping
units due to merging and splitting companies). Such a population is called the target
population in methodological terms, i.e. the population to which the estimates refer. The
NACE-code is used to classify the kind of activity.
Note that target populations can be flexible in a statistical-DWH, because a statistical-DWH is
meant to produce flexible outputs. When processing and analysing data, it is recommended to
consider the target populations of the annual SBS and monthly or quarterly STS. These are
important obligatory statistics. More importantly, these statistics define the enterprise
population to its widest extent. According to regulations, they include all enterprises with
some economic activity during (part of) the period. Hence, by using these populations as
standard:
- all other data sources could be linked to this standard, because they cannot cover a wider
  population in the SBS/STS domain from a theoretical point of view,
- all other publications derived from the statistical-DWH are basically subgroups from the
  SBS/STS-estimates.
Furthermore, the output obligations of the annual SBS and monthly or quarterly STS are quite
detailed in terms of different kinds of activity (NACE-codes). We propose that the SBS and
STS-output obligations are also used as the standard to check, link, clean and weight the input
data in the processing phase of the statistical-DWH (see figs. 2/3).
A statistical-DWH is designed to produce flexible output. However, as the standard SBS- and
STS-populations are the widest in terms of economic activity during the period and quite
detailed in terms of kind of activity, most other populations can be considered as
subpopulations of these standards. Examples of subpopulations are:
- large or small enterprises only,
- all enterprises active at a certain date,
- even more detailed kind of activity populations (i.e. estimates at NACE 3/4-digit level).
Domain estimators or other estimation techniques can be used to determine these subtotals, if
the amount of available data is sufficient and there are no problems with statistical disclosure.
Estimation techniques and outlier detection in flexible outputs are more extensively discussed
in other deliverables of the ESSnet on Datawarehousing. Checking, cleaning, integrating and
weighting the input data in a statistical-DWH are further discussed in chapter 5 of this
deliverable, but we also refer to other deliverables of the ESSnet on Datawarehousing for
further information.
3.2 A statistical-DWH: the population frame
To determine the population in the statistical-DWH, two types of information are needed:

the population frame, i.e. a list of enterprises with a certain kind of activity during a
period,

information to determine which enterprises of the list really performed economic
activities during a period.
As previously mentioned, the population frame is derived from the SBR. This population
frame consists of all enterprises within the SBR during the year, regardless of whether they
are active or not. To derive activity status and subpopulations, it is recommended that the
population frame includes the following information:
1) the frame reference year
2) the statistical unit enterprise, including its national ID and its EGR ID (see footnote 1)
3) the name and address of the enterprise
4) the date in population (mm/yr)
5) the date out of population (mm/yr)
6) the NACE-code
7) the institutional sector code
8) a size class (see footnote 2)
Note that a population frame is crucial for a statistical-DWH. Target populations, i.e.
populations belonging to estimates, for the flexible outputs are derived from it!
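As an illustration of the eight recommended items, the following minimal Python sketch stores them in one record per enterprise and derives subpopulations from the frame. The class name `FrameRecord`, the helper `subpopulation`, the field names and all example values (IDs, names, sector and size codes) are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

YearMonth = Tuple[int, int]  # (year, month)


@dataclass(frozen=True)
class FrameRecord:
    frame_year: int                # 1) frame reference year
    national_id: str               # 2) statistical unit: national ID
    egr_id: Optional[str]          # 2) statistical unit: EGR ID, if assigned
    name: str                      # 3) name of the enterprise
    address: str                   # 3) address of the enterprise
    date_in: YearMonth             # 4) date in population
    date_out: Optional[YearMonth]  # 5) date out of population (None if still in)
    nace: str                      # 6) NACE-code
    sector: str                    # 7) institutional sector code
    size_class: str                # 8) size class


def subpopulation(frame: List[FrameRecord], nace_prefix: str = "",
                  size_classes: Optional[Set[str]] = None) -> List[FrameRecord]:
    """Select frame records by NACE prefix and, optionally, by size class."""
    return [r for r in frame
            if r.nace.startswith(nace_prefix)
            and (size_classes is None or r.size_class in size_classes)]


# Hypothetical frame for reference year 2013.
frame = [
    FrameRecord(2013, "NL001", "EGR-A", "Shop A", "Street 1", (2010, 3), None,
                "4711", "S11", "small"),
    FrameRecord(2013, "NL002", None, "Shop B", "Street 2", (2013, 4), None,
                "4719", "S11", "large"),
    FrameRecord(2013, "NL003", "EGR-C", "Soft C", "Street 3", (2011, 6), (2013, 5),
                "6201", "S11", "small"),
]

retail = [r.national_id for r in subpopulation(frame, nace_prefix="47")]
small = [r.national_id for r in subpopulation(frame, size_classes={"small"})]
small_retail = [r.national_id
                for r in subpopulation(frame, nace_prefix="47", size_classes={"small"})]
```

Because every target population is a selection on the same frame records, the flexible outputs stay consistent with each other by construction.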
To determine the activity status of an enterprise, i.e. to estimate whether enterprises really
carried out economic activities, a comparison with VAT and/or employment data is needed. If
VAT data reveal turnover above a certain threshold and/or employment data reveal paid wages
above a certain threshold, the enterprise is considered as active. This will be discussed in
the next chapter (3.3) and is one of the reasons that VAT and/or employment data are crucial
elements of the statistical-DWH. Chapter 3.4 discusses how target populations can be determined
in two specific cases:
- the statistical-DWH is limited to annual statistics,
- the statistical-DWH includes short-term statistics, too.
3.3 Backbone of the statistical-DWH: integrated population frame, turnover and employment
The results of the ESSnet on Admin Data showed that VAT and social security data can be
used for turnover and employment estimates when quasi complete. The latter is the case for
annual statistics and for quarterly statistics in most European countries on the continent. Note
however that VAT and social security data can only be used for statistical purposes if a) the
data transfer from the tax office to the statistical institute is guaranteed and b) the link with
the statistical unit is established. As already mentioned in chapter 1.2, totals for turnover
and employment can be obtained in two ways:
- by processing the VAT and employment data within the SBR,
- by having separate systems for processing VAT and social security data linked to the SBR.
Footnote 1: An arbitrary ID assigned by the EGR system to enterprises. It is advised to include
this ID in the Datawarehouse to enable comparability between the country-specific estimates.
Footnote 2: The size class could be based on employment data.
In this paper we do not discuss the pros and cons of each approach, as it is partly an
organisational decision for the NSIs. For this paper, we assume that totals are produced for
1. number of enterprises,
2. turnover,
3. employment
with administrative data covering quasi-all enterprises in the SBS/STS domain. These totals
are integrated because they are all based on the statistical unit and all classified by activity by
using the NACE-code from the population frame. Hence, these three integrated totals together
represent the basic characteristics of the enterprise population. Therefore, these three totals
can be considered as the backbone of the statistical-DWH. All other data sources are linked to
these three totals in the statistical-DWH and made consistent with them. This chapter mentions
some aspects for VAT and social security data.
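How the three totals are integrated through the statistical unit and the NACE-code can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the unit IDs, NACE codes and values are invented, the function name `backbone_totals` is hypothetical, and units without a VAT or social security record are counted with a value of zero, which a real system would treat more carefully.

```python
from collections import defaultdict

# Hypothetical micro-data, all keyed by the statistical unit ID.
frame_nace = {"E1": "47", "E2": "47", "E3": "62"}       # population frame
vat_turnover = {"E1": 500.0, "E2": 120.0, "E3": 80.0}   # turnover from VAT data
employment = {"E1": 12, "E2": 3, "E3": 5}               # from social security data


def backbone_totals(frame_nace, vat_turnover, employment):
    """Aggregate the three backbone totals per NACE code of the frame."""
    totals = defaultdict(lambda: {"enterprises": 0, "turnover": 0.0, "employment": 0})
    for unit, nace in frame_nace.items():
        cell = totals[nace]
        cell["enterprises"] += 1
        # Assumption of this sketch: a missing record counts as zero.
        cell["turnover"] += vat_turnover.get(unit, 0.0)
        cell["employment"] += employment.get(unit, 0)
    return dict(totals)


totals = backbone_totals(frame_nace, vat_turnover, employment)
for nace, cell in sorted(totals.items()):
    print(nace, cell)
```

The classification always comes from the population frame, never from the admin source, which is what keeps the three totals mutually consistent.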
VAT and social security cover almost all enterprises in the domain covered by the SBS and
STS-regulations and are available in a timely manner (i.e. earlier than most annual statistics).
They are crucial
1. to determine the activity status of the enterprises and implicitly to determine the target
   populations of active enterprises,
2. to create a fully integrated dataset suitable for flexible outputs, because these
   administrative data sources contain information about almost all enterprises (unlike
   surveys, which contain information on only a small sample of enterprises).
The latter reason is explained further in the remainder of this section. When (quasi) complete,
VAT and social security data can be used to produce good-quality estimates of turnover and
employment. Therefore, these estimates can, together with the population frame (i.e. number
of enterprises, NACE-code etc.), be used as benchmarks when incorporating results of survey
sampling in a statistical-DWH. In this case totals of turnover and employment define,
together with the number of enterprises, the basic population characteristics. These three
characteristics are assumed to be correct unless otherwise proven. Other datasets or surveys
covering more specific parts of the population should be made consistent with these three
main characteristics of the entire population. In the case of inconsistencies, the population
characteristics are considered as correct; survey data or other datasets are modified by
adapting weights or data editing. As these three main characteristics (population frame,
turnover, employment) are
- integrated,
- available at micro-level (statistical unit),
- considered as correct,
- and all other sources are linked and made consistent to them,
these characteristics are the backbone of the statistical-DWH. This backbone is considered as
the authoritative source of the statistical-DWH because its information is assumed to be
correct unless otherwise proven.
The concept of the backbone improves the quality of integrated datasets and flexible outputs
of a statistical-DWH. This is because more auxiliary information, in addition to the number of
enterprises, is used when weighting survey results (or other datasets) or when imputing
missing values. For example, VAT and social security data can be used as auxiliary
information when weighting variables derived from surveys. Many studies have shown that
estimates based on weighting techniques using auxiliary information (e.g. ratio or GREG-type
estimators) have lower sampling errors than estimates that do not use auxiliary information,
provided that the survey variables are well correlated with the auxiliary variables. Using VAT
and social security data as auxiliary information when weighting also corrects for
unrepresentativeness of the data sources. Hence, it improves the accuracy of estimates (and
reduces their bias) for variables which are derived from data sources representing a specific
part of the population. We refer to other deliverables of the
ESSnet on Data Warehousing for further details about this subject.
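As a small worked illustration of weighting with auxiliary information, the Python sketch below compares a plain expansion estimate with a ratio estimate that uses the VAT turnover total from the backbone. It assumes a simple random sample with equal weights; the sample values, the population size `N`, the total `X_TOTAL` and the function names are invented for this example and do not come from any ESSnet deliverable.

```python
def expansion_estimate(sample_y, population_size):
    """Estimate a total using only the number of enterprises in the population."""
    return population_size / len(sample_y) * sum(sample_y)


def ratio_estimate(sample_y, sample_x, population_total_x):
    """Estimate a total using an auxiliary variable with a known population total."""
    return population_total_x * sum(sample_y) / sum(sample_x)


# Hypothetical survey sample of three enterprises.
sample_x = [100.0, 200.0, 300.0]  # VAT turnover of the sampled units
sample_y = [10.0, 22.0, 27.0]     # survey variable, e.g. investments

N = 10            # number of enterprises in the population frame
X_TOTAL = 2500.0  # total VAT turnover in the backbone

expansion = expansion_estimate(sample_y, N)
ratio = ratio_estimate(sample_y, sample_x, X_TOTAL)

# Applied to the auxiliary variable itself, the ratio estimator reproduces
# the backbone total, i.e. the weighted sample is consistent with the backbone.
turnover_check = ratio_estimate(sample_x, sample_x, X_TOTAL)
```

In this example the sampled units have a lower average turnover (200) than the population (250), so the ratio estimate is higher than the expansion estimate; this is the correction for an unrepresentative sample described above.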
Summarising, using a backbone with integrated population, turnover and employment data
- improves the quality of a fully integrated dataset using several input data sets, as two key
  variables for statistical outputs (turnover and employment) can be estimated precisely,
- reduces the impact of sampling errors or biases in estimates for variables derived from
  other data sources, because turnover and/or employment can be used as auxiliary
  information when weighting.
As the first condition is the aim of a statistical-DWH and the second condition is required to
produce flexible output (especially about subgroups of the standard SBS and STS-population),
this is the main argument for considering a backbone of integrated totals of
number of enterprises (=population), employment and turnover as the heart of a statistical-DWH.
The second reason to consider a backbone with integrated data about number of enterprises
(=population), employment and turnover as the heart of the statistical-DWH is the
determination of the activity status of an enterprise. Several NSIs use VAT- and social
security data for this purpose. More precisely, enterprises are considered as active if VAT
and/or social security data are available for the reference period or the previous period (in
case of late VAT or late social security data). This method is preferred over a survey to
determine the activity status, because the latter might be biased due to high non-response rates
among the enterprises that had ceased trading. Summarising, VAT and social security data are
crucial to determine whether an enterprise has been active or not. Hence, quasi-complete
turnover and employment data are crucial to determine target populations consisting of active
enterprises. This is the second reason to use a backbone of integrated totals of number of
enterprises (=population), employment and turnover in a statistical-DWH.
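The activity rule described above can be written down as a short Python sketch. It assumes that periods are numbered consecutively (for example quarters) and that the admin data are stored per unit and period; the data, the unit IDs and the function name `is_active` are hypothetical, and the exact rule applied by an individual NSI may differ.

```python
def is_active(unit, period, vat, social_security):
    """Return True if admin data exist for the reference or the previous period.

    `vat` and `social_security` map (unit, period) to a reported value;
    periods are consecutive integers (an assumption of this sketch).
    """
    for p in (period, period - 1):
        if (unit, p) in vat or (unit, p) in social_security:
            return True
    return False


# Hypothetical admin data for reference period 8.
vat = {("E1", 8): 100.0, ("E2", 7): 50.0, ("E4", 5): 20.0}
social_security = {("E3", 8): 4}

frame_units = ["E1", "E2", "E3", "E4"]
target_population = [u for u in frame_units if is_active(u, 8, vat, social_security)]
# E1: VAT for period 8; E2: late VAT, previous period available;
# E3: social security data only; E4: last record is three periods old.
```

Falling back to the previous period keeps enterprises with late declarations in the target population, at the price of removing stopped enterprises one period later.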
A schematic sketch of the position of the backbone with integrated population, turnover and
employment data is provided in figure 4.
Figure 4 Illustration of the position of the SBR and the backbone with integrated data about number
of enterprises (=population), VAT-turnover and employment derived from social security data in a
statistical-DWH. This backbone is represented by a line within the GSBPM phase 5.1. All other data
sources are integrated to this backbone at GSBPM phase 5.1, which is at the beginning of the
processing phase. The same backbone is also used for weighting when producing outputs at the end of
the processing phase (see line in GSBPM steps 5.7 and 5.8). In this figure VAT, social security data
and population are represented as different datasources with separate processes to integrate them.
Note that this integration can also be done within the SBR (dotted lines via SBR) or outside the SBR
(dotted lines directly to turnover, employment etc.).
3.4 Target populations of active enterprises
3.4.1 Case 1: Statistical DataWareHouse is limited to annual statistics
The determination of a target population with active enterprises only is relatively easy, if the
scope of the statistical-DWH is limited to annual statistics. This case is relatively easy
because the required information about population totals, turnover and employment can be
selected afterwards, i.e. when the year has finished. This is because annual surveys are
designed after the year has ended and results of surveys and other datasources with annual
data (like accountancy data + totals of four quarters) become available after the year has
ended, too. Hence, no provisional populations are needed to link provisional data during the
calendar year. Therefore, the population frame can be determined by
- selecting all enterprises which are recorded in the SBR during the reference year,
- using the complete annual VAT and social security dataset to determine the activity
  status and totals for turnover and employment.
3.4.2 Case 2: the Statistical Data Warehouse includes short-term statistics
The determination of a target population with only active enterprises becomes more
complicated when the production of short-term statistics is incorporated in the statistical
DWH. In this case a provisional population frame for reference year t should be
constructed at the end of year t-1, i.e. November or December. This population frame is used
to design short-term surveys. It is also the starting point for the statistical-DWH. This
provisional frame is called release 1 and formally it does not cover the entire population of
year t as it does not contain the starting enterprises yet.
During the year the backbone of the statistical-DWH is regularly updated with new
information about population (new, stopped, merged and split enterprises), activity,
turnover and employment. The frequency of these updates depends on the updates of the SBR
and, related to this, on the updating information provided by the admin data holders (VAT and social
security data). At the end of year t (or at the beginning of year t+1), a regular population
frame for year t can be constructed. This regular population frame consists of all enterprises in
the year and is called release 2.
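The relation between release 1 and release 2 can be illustrated with a small sketch. The event labels and the function name build_release_2 are hypothetical, and the sketch assumes that enterprises which stop during year t remain in release 2 because they were part of the population during part of the year. Mergers and splits are left out for brevity.

```python
def build_release_2(release_1, events):
    """Derive the regular frame (release 2) from the provisional one (release 1)."""
    frame = set(release_1)
    for kind, unit_id in events:
        if kind == "new":
            frame.add(unit_id)
        # 'stopped' units are kept: they existed during part of year t
    return frame

release_1 = {"A", "B", "C"}                      # drawn at the end of year t-1
events = [("new", "D"), ("stopped", "B"), ("new", "E")]
release_2 = build_release_2(release_1, events)
starters = release_2 - release_1                 # units missing from release 1
```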
The ESSnet on Administrative Data has observed that time-lags do exist between the
registration of starting/stopping enterprises in the SBR (if based on Chamber of Commerce
data) and other admin data sources like tax information or social security data. The impact of
these time-lags differs for each country, because it depends

on the updates of both
1) the population frame in the SBR
2) VAT and social security data from the admin data holders (in the SBR),

on the quality of the underlying data sources.
Despite the different impact of the time-lags, the ESSnet on Administrative Data has shown
that these time-lags do exist in every country and lead to revisions in estimates about active
enterprises on a monthly and quarterly basis. This effect is enhanced, because the admin data
are not entirely complete on a quarterly basis. These time-lag and incompleteness issues
might be a consideration for choosing a low frequency for updating the backbone in a
statistical-DWH. For example, quarterly and/or bi-annual updates could be considered.
4. Linking datasources to the statistical unit
4.1 Linking other datasources to the backbone of the statistical-DWH
As previously mentioned, the backbone – or heart - of a statistical-DWH consists of an
integrated set of

population characteristics (the so-called population frame)

turnover

employment data.
The main characteristic of this backbone is that this integrated information is available at
micro level. Hence, we have information about activity, size, turnover and employment of
(almost) every enterprise. Population totals can be obtained by adding the information of the
individual enterprises. This information can be derived from the SBR or – depending on the
choice of an NSI – from systems linked to the SBR.
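Because the backbone is held at micro level, population totals for any domain follow from a simple aggregation over the individual enterprises. The sketch below assumes a hypothetical record layout (nace, size_class, turnover, employment) and invented values; it only illustrates the idea.

```python
from collections import defaultdict

backbone = [
    {"id": "A", "nace": "47", "size_class": "small", "turnover": 120.0, "employment": 4},
    {"id": "B", "nace": "47", "size_class": "small", "turnover": 80.0, "employment": 2},
    {"id": "C", "nace": "62", "size_class": "large", "turnover": 900.0, "employment": 60},
]

def population_totals(records, keys=("nace",)):
    """Sum micro-level backbone records to totals per domain (tuple of key values)."""
    totals = defaultdict(lambda: {"enterprises": 0, "turnover": 0.0, "employment": 0})
    for rec in records:
        domain = tuple(rec[k] for k in keys)
        totals[domain]["enterprises"] += 1
        totals[domain]["turnover"] += rec["turnover"]
        totals[domain]["employment"] += rec["employment"]
    return dict(totals)

by_nace = population_totals(backbone)
by_nace_and_size = population_totals(backbone, keys=("nace", "size_class"))
```

Changing the keys argument gives totals for another breakdown without touching the micro data.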
As previously mentioned the backbone is based on statistical units only. Hence, when other
datasources are integrated with the backbone, they should be linked to the statistical units as a
first step. Technical aspects of data linking are described in deliverable 2.4 of the ESSnet on
Datawarehousing. The next chapter of this document addresses the question of what
information is required to link the several input sources to the statistical unit.
4.2 Variation in input units
Ideally a unique identifier for enterprises based on the statistical unit exists already. Data
linkage is simple in this case. In practice, accountancy data, tax data (including VAT and
social security data) and other data may be reported for different parts within an enterprise
group. These data might be reported for the enterprise group as a whole, the underlying legal
units or tax units consisting of other parts of the enterprise group. The variation in units and the
challenge of linking them depend on the national legislation. Therefore the impact of this
issue differs for each country. The size of the enterprise also determines the variation in units
and the complexity of linking them. For small enterprises one-to-one relationships between
the different units can be assumed, but this assumption cannot be made for medium-sized and
large enterprises. Nevertheless, whatever the extent of these issues in individual countries and
whatever the determination of the statistical unit, it cannot be assumed that all input data link
automatically to the statistical unit. Hence, the relationship between these ‘input’ units and the
statistical units should be known before the data can be linked.
Data linking is of less importance when using surveys only, because surveys are generally
based on statistical units (as they are designed from SBR information). Data linking is more
important for the statistical-DWH, because it also uses other data sources.
4.3 Variation in output units
Most statistical estimates in enterprise statistics are produced on the statistical unit enterprise.
Examples are SBS, STS and most institutional statistics. However, some output is produced
on different units like local units, LKAUs, KAUs or enterprise groups. Again the complexity
of linking these units depends on the country and size of the enterprises. Nevertheless, one-to-one
relationships between these output units and the statistical enterprise unit cannot be
taken for granted. Hence, relationships between the ‘output’ units and the statistical units
should be known before flexible outputs can be generated. As producing flexible output is a
main characteristic of a statistical-DWH, the existence of several output units is an issue for a
statistical-DWH.
4.4 The statistical unit and the process of a statistical-DWH
The simplest and most transparent statistical process can be generated by

linking all input sources to the statistical enterprise unit at the beginning of the processing
phase (GSBPM-step 5.1 – see figures 2,3).

performing data cleaning, plausibility checks and data integration on statistical units only
(GSBPM steps 5.2-5.6).
producing statistical output (GSBPM-steps 5.7-5.8) by default on the statistical unit and
the target populations according to the SBS and STS regulations. Flexible outputs on
other target populations and other units are also produced in these steps by using repeated
weighting techniques and/or domain estimates. Technical aspects of these estimation
methods are described in deliverable 2.8 of the ESSnet on Datawarehousing.

Note that it is theoretically possible to perform data analysis and data cleaning on several
units simultaneously. However, the experience of Statistics Netherlands with cleaning VAT data
on statistical units and ‘implementing’ these changes on the original VAT-units too
reveals that the statistical process becomes quite complex. Therefore, it is proposed that

linking to the statistical units is carried out at the beginning of the processing phase only,

the creation of a fully integrated dataset is done for statistical units only,

statistical estimates for other units are produced at the end of the processing phase only,

relationships between the different in- and output units on the one hand and the statistical
enterprise units on the other hand should be known (or estimated) beforehand.
4.5 The concept of a statistical unit base
As mentioned in paragraph 3.1, the statistical-DWH uses only one unit when processing the
data: the statistical enterprise unit. The statistical community should aim for all
Member States to use a unique identifier for enterprises based on the statistical unit, which has the
advantage that all datasources can be easily linked to the statistical-DWH. In practice,
dataholders may use several definitions of enterprises in some countries. As a result, several
enterprise units may exist. Related to this, different definitions of units may also exist when
producing output (LKAU, KAU, etc.).
The relationship between the different in- and output units on the one hand and the statistical
enterprise units on the other hand should be known (or estimated) before the processing
phase, because it is a crucial step for datalinking and producing output. Maintaining this
relationship in a database is recommended when outputs are produced by releases; e.g. newer
more precise estimates when more data (sources) become available. This prevents redoing a
time-consuming linking process at every flexible estimate.
It is proposed that the information about the different enterprise units and their relationships at
microlevel is kept by using the concept of a so-called unit base. This base should at least
contain

the statistical enterprise, which is the only unit used in the processing phase of the
statistical-DWH.

the enterprise group, which is the unit for some output obligations. Moreover the
enterprise group may be the base for tax and legal units, because in some countries, like
the Netherlands, the enterprise unit is allowed to choose its own tax and legal units of the
underlying enterprises.
The unit base contains the link between the statistical enterprise, the enterprise group and all
other units. Of course, it should also include the relationship between the enterprise group and
the statistical enterprise. In case of x-to-y relationships between the units, i.e. one statistical
unit corresponds with several units in another data source or vice versa, the estimated share in
terms of turnover (or employment) of the ‘data source’ units to the corresponding statistical
enterprise(s) and enterprise group needs to be recorded. This share can be used to relate
levels of variables from other datasources based on enterprise unit x1 to levels of turnover
and employment in the backbone based on the (slightly different) statistical enterprise unit x2.
We refer to deliverable 2.4 of the ESSnet on Datawarehousing for further information about
data linking and estimating shares.
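The use of shares for x-to-y relationships can be made concrete with a small sketch. The table layout, the unit identifiers and the function name to_statistical_units are hypothetical; the sketch assumes that the unit base stores, for each link, the estimated share of the source unit's value that belongs to the statistical enterprise, and that the shares of one source unit sum to 1.

```python
from collections import defaultdict

# Hypothetical unit base rows: (source unit, statistical enterprise, share).
unit_base = [
    ("VAT-1", "ENT-1", 1.0),    # one-to-one relationship
    ("VAT-2", "ENT-2", 0.75),   # one VAT unit covering two enterprises
    ("VAT-2", "ENT-3", 0.25),
    ("VAT-3", "ENT-3", 1.0),    # a second VAT unit for the same enterprise
]

def to_statistical_units(source_values, unit_base):
    """Allocate values reported on source units to statistical enterprises."""
    result = defaultdict(float)
    for source_unit, enterprise, share in unit_base:
        if source_unit in source_values:
            result[enterprise] += share * source_values[source_unit]
    return dict(result)

vat_turnover = {"VAT-1": 100.0, "VAT-2": 400.0, "VAT-3": 50.0}
enterprise_turnover = to_statistical_units(vat_turnover, unit_base)
```

Because the shares of each source unit sum to 1, the total turnover is the same before and after the allocation. Storing the links and shares once means the same table can be reused for every release instead of repeating the linking.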
The unit base can be subdivided into ‘input’ units, used to link the different datasets to the
statistical enterprise unit at the beginning of the processing phase (GSBPM-step 5.1:
“integrate data”) and ‘output’ units used to produce output on units other than the statistical
enterprise at the end of the processing phase (GSBPM-step 5.7/5.8 “calculate aggregates”).
Figure 5 illustrates the concept of a unit base. It shows that the unit base can be subdivided
into


input units, used to link the datasources to the statistical enterprise unit at the beginning
of the processing phase (GSBPM-step 5.1: “integrate data”)

output units, which are used to produce output about units other than the statistical
enterprise at the end of the processing phase (GSBPM-step 5.7/5.8 “calculate
aggregates”). An example is output about enterprise groups, LKAUs etc.
The exact contents of the unit base (and, related to this, its complexity) depend on


legislation for a particular country,
output requirements and desired output of a statistical-DWH,

available input data.
It is a matter of debate

whether the concept of a unit base should be included in the SBR

or whether the concept of a unit base should result in a physically independent database.
In the case of the latter it is closely related to the SBR, because both contain the statistical
enterprise. Basically, the choice depends on the complexity of the unit base. If the unit base is
complex, the maintenance becomes more challenging and a separate unit base might be
considered. The complexity depends on

the number of enterprise units in a country

the number of (flexible) datasources an NSI uses to produce statistics.
As these factors differ by country and NSI, the decision to include or exclude the concept of a
unit base in the SBR depends on the individual NSI and will not be discussed further in this
paper.
Figure 5 Example of the concept of a unit base.
5. Correcting information in the population frame and feedback to SBR
5.1 The position of the Business Register in a statistical Data Warehouse
The position of the SBR in a statistical DWH is three-fold. More precisely

the SBR is the input source for the backbone of the statistical-DWH; integrated data
about enterprise populations, turnover and employment,

the SBR is closely related to the unit base,

the SBR is the sampling frame for the surveys, which are another important data source
of the statistical-DWH (for variables which cannot be derived from admin data).
The last point implies that errors in the backbone, which might be detected during the
statistical process, should also be corrected in the SBR. Hence, a process to feed revised
information from the backbone in the statistical-DWH back to the SBR should be established.
Otherwise, the same errors will return in survey results in subsequent periods.
The key questions are:

At which step of the process of the statistical-DWH is the backbone corrected when
errors are detected?

How is revised information from the backbone of integrated sources in the statistical-DWH
incorporated in the SBR?
The position of the SBR and its relationships with the backbone, unit base and surveys are
illustrated in figure 6.
Figure 6 This figure also shows the position of a) data-integration, b) ‘weighting/calculation of
aggregates’ in the statistical process and c) the step within the statistical process at which the backbone
of the statistical-DWH is corrected in case of influential errors: GSBPM-step 5.7. At this step also
feedback to the SBR is provided. Note that data sources for the backbone are denoted by brown
cylinders and other input data by light blue cylinders.
5.2 Dealing with conflicting information
As mentioned previously, the backbone of the statistical-DWH consists of an integrated set of

population characteristics (statistical enterprise units, size and activity, the so-called
population frame),

turnover data derived from the Value Added Tax (VAT) data,

employment data derived from social security data
at micro level. All other data sources (with information about other variables) are linked to the
backbone, which again represents the main characteristics of the enterprise population in a
statistical-DWH. The backbone is also used to check, clean and integrate all other data
sources at micro level. During these steps, conflicting information between the data
sources themselves and between these sources and the backbone might be detected (in practice: will
be detected). Conflicting information may in extremis lead to the conclusion that the
backbone contains errors. Deliverable 2.8 of the ESSnet on Datawarehousing addresses the
question of how this conclusion might be drawn, because this deliverable deals with the hierarchy
between the different data sources.
Whatever the exact methodology for detection, errors in the backbone might have several
origins. More specifically, they may be related to

errors in the data linking,

errors in the population characteristics (units, NACE-codes, size classes of enterprises),

errors in VAT and/or employment data.
Some errors may result in an erroneous estimation of the activity status and therefore the
number of active enterprises and possibly the level of the estimates. Other errors may reveal
erroneous values, which may also lead to inconsistencies in level estimates. An example of an
erroneous value is that (VAT) turnover in the backbone of the integrated sources differs
considerably from the observed turnover and other variables in a survey. It is expected that
most errors in the population frame are detected because other data sources like surveys and
administrative data indicate that the enterprise has either a different activity or a different
size than recorded in the SBR.
If the backbone is of good quality, which is essential, its number of errors should be limited.
Moreover, data cleaning plus data integration at micro level are basically independent of the
number of active enterprises, NACE-code, size class, etc. Therefore, it is proposed to use the
‘original’ population frame – which is part of the backbone - for these steps, even after errors
in it have been detected. Another reason for this proposal is that errors might be detected at
several stages of the data cleaning and integration process. Errors in the backbone might
become influential when survey data are weighted with the integrated (micro) data of the
backbone (number of enterprises possibly supplemented with the auxiliary information like
turnover, employment). This becomes visible when calculating aggregates at the end of the
processing phase. Therefore, it is recommended that all errors in the backbone be corrected
before weighting and calculating aggregates. This corresponds with (the beginning of)
GSBPM-step 5.7 (“weighting”).
Note that in the case of errors due to data-linking the information used in the unit base should
be corrected rather than the backbone in the statistical-DWH.
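Why backbone errors become influential at the weighting step can be shown with a deliberately simple sketch. It uses plain expansion weights (number of enterprises in the backbone divided by number of sampled enterprises, per NACE stratum), which is an assumption made for illustration; the estimation methods actually proposed for a statistical-DWH are described in deliverable 2.8. All identifiers and values are hypothetical.

```python
from collections import Counter

def expansion_weights(backbone_nace, sample_ids):
    """Expansion weight N_h / n_h for every NACE stratum present in the sample."""
    population = Counter(backbone_nace.values())              # N_h from the backbone
    sample = Counter(backbone_nace[i] for i in sample_ids)    # n_h from the survey
    return {h: population[h] / sample[h] for h in sample}

backbone_nace = {"A": "47", "B": "47", "C": "47", "D": "47", "E": "62", "F": "62"}
sample_ids = ["A", "B", "E"]
weights_before = expansion_weights(backbone_nace, sample_ids)

# Enterprise D turns out to belong to NACE 62: correct the backbone first.
corrected_nace = dict(backbone_nace, D="62")
weights_after = expansion_weights(corrected_nace, sample_ids)
```

Although D is not in the sample, correcting its NACE code changes the weights of both strata, and therefore every aggregate calculated with them.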
5.3 Panel surveys and correcting population characteristics
When integrating survey data with the backbone, errors in the backbone and implicitly in the
SBR may be detected. This is especially true of surveys about produced goods, performed
services and investments where information can be very useful in detecting errors in the
NACE-code. However, carte blanche correction of NACE-codes etc. should be avoided since
this could bias the backbone and the SBR. Bias arises because the parts of the SBR that are
surveyed become of better quality than the parts that are not. To prevent this drawback, one should be
very careful as to how panel surveys are used to correct information in the backbone and
SBR. Influential errors in panel surveys, i.e. errors which significantly affect the estimates,
should preferably be treated as outliers.
5.4 Timing of correcting data in the backbone and the SBR
The unit base, the SBR, VAT-data and employment data derived from registers have a crucial
role in the linking and estimation process of the statistical-DWH. These data are also
important for estimates of statistics possibly falling outside the scope of the statistical-DWH.
Therefore, it is advisable that if the backbone is updated due to errors after confrontation with
other data like surveys, the SBR and unit base are updated too. This updating of
the backbone, the SBR and the related unit base is desirable to ensure that

late information or later available input data are processed with the correct
information about the enterprise population,

new surveys are designed with the correct enterprise population.
The disadvantage of correcting the backbone (and SBR) is that previously published estimates
are revised when re-running the process with improved population information. More
precisely, the previously published estimates were produced with an uncorrected population
frame and the new estimates with a corrected population frame. This difference in population
frame leads to revisions. If the influential error – which led to the correction of the backbone
– is found when producing a specific estimate x, this revision is desired as it is an
improvement. However, as the statistical-DWH is used for several outputs, previously
published statistics which were apparently not affected by this error are also revised when
rerunning the process. To limit the disadvantage of unexpected revisions when revising the
backbone, the following recommendations are made:

developing a good metadata system, i.e. which data belong to which estimate,

using the paradigm that the information in the backbone is correct unless otherwise
proven. In other words, consider the backbone and SBR as authoritative sources which
are corrected only if the detected errors are certain and influential,

relating the timing of incorporating changes in the backbone to the revision policy of the
most important statistical outputs.
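The first recommendation, recording which data belong to which estimate, can be sketched as a simple lookup. The estimate names, the structure of the lineage table and the function name affected_estimates are hypothetical; a real metadata system would hold far more detail.

```python
# Hypothetical metadata: for each published estimate, the backbone units it used.
lineage = {
    "STS retail turnover 2012Q4": {"A", "B", "C"},
    "SBS NACE 47 2012": {"A", "B", "C", "D"},
    "SBS NACE 62 2012": {"E", "F"},
}

def affected_estimates(lineage, corrected_units):
    """List the published estimates that used at least one corrected unit."""
    corrected = set(corrected_units)
    return sorted(name for name, units in lineage.items() if units & corrected)

to_revise = affected_estimates(lineage, ["D"])
```

With such a lookup, the outputs touched by a correction are known before the process is re-run, so that revisions do not come as a surprise.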
5.5 Timing of feedback to the SBR
It has been argued in the previous chapter that proven and influential errors in the

population characteristics, turnover, employment of the backbone,

statistical unit and (concept of) unit base
should be accompanied by corrections in the SBR, too. This is because the backbone is
strongly related to the SBR and unit base. In paragraph 5.4 it was argued that the timing of
these updates in the backbone of the statistical-DWH should correspond with the timing of
revisions in the most important estimates. However, the timing of these corrections in
the SBR is even more complex. This is because the SBR primarily acts as a frame for survey
sampling, including for surveys falling outside the scope of the statistical DWH.
The importance of the timing can be best illustrated with an example. If the SBR is used as
sampling frame for an STS-survey of current year t and the SBR is ‘suddenly’ updated with
information from the statistical-DWH from last year t-1, a sudden – and misleading as far as
timing is concerned - discontinuity in the STS-survey series occurs. The question is whether
this discontinuity is desirable. The same applies for surveys falling outside the scope of the
statistical-DWH.
Therefore, it is advisable to develop a strategy for correcting information in the SBR. A
possible strategy is:

For the errors with such an impact that they cannot be neglected: correcting the backbone
and SBR at the same time (and as soon as possible). However, consultation with the
stakeholders of the most important statistics outside the scope of the statistical-DWH is
strongly recommended as these corrections may have impact on other statistics.

For less influential errors: corrections in the SBR are carried out at the end of the
calendar year when all surveys are renewed or refreshed. In this case, preliminary
estimates outside the statistical-DWH published within 12 months after the statistical
year t are still based on an SBR including known errors. This is the price for continuity of STS-surveys
and consistency with statistics falling outside the scope of the statistical-DWH.
However, the final estimates published more than 12 months after statistical year t are based on
an improved SBR, i.e. an SBR corrected for known errors.
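The strategy above amounts to a simple decision rule, sketched below. The function name and the boolean flag are hypothetical, and the classification of an error as influential is assumed to come from the methodology of the NSI; the sketch only encodes the timing.

```python
from datetime import date

def sbr_correction_date(detected_on, influential):
    """Return the date on which a proven error is corrected in the SBR."""
    if influential:
        # Corrected together with the backbone, as soon as possible
        # (after consulting stakeholders of statistics outside the DWH).
        return detected_on
    # Less influential: wait until the surveys are renewed at year end.
    return date(detected_on.year, 12, 31)

urgent = sbr_correction_date(date(2013, 5, 14), influential=True)
deferred = sbr_correction_date(date(2013, 5, 14), influential=False)
```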
6. Conclusions
Two conditions are required for a successful statistical DWH. Firstly, the population should be well
defined. Secondly, one unit should be used in the statistical DWH, because it is – in practice –
impossible to create integrated datasets for several (types of) enterprise units. The unit that
should be used is the statistical enterprise. For the sake of efficiency the links between the
statistical units and units of other data sources need to be stored. For this storage
the concept of a unit base was presented. Whether a unit base should be incorporated in the
SBR or not depends on its complexity, i.e. how many different enterprise units exist in a
country and the number of data sources used.
An integrated set of

population characteristics (activity, size etc.) or the so-called population frame

turnover derived from VAT

employment derived from social security data
is desired for the statistical-DWH as these administrative data sources are available for almost
all enterprises in the SBS/STS domain and therefore provide good information about the basic
characteristics of the enterprises. All other data sources can be linked to this integrated set of
population, turnover and employment data, which is therefore considered the backbone
of the statistical-DWH. This backbone can also be considered as the authoritative source of
the statistical-DWH as its information is considered correct unless proven otherwise in
case of conflicting information from other data sources like surveys.
The SBR is an indirect source for the backbone of the statistical-DWH because
1) the population frame is derived from it (and depending on the scope the VAT and
employment-data),
2) the unit base is strongly related to it
3) the surveys – another important data source for the statistical-DWH – are based on it.
Hence, when errors in the population are revealed after integrating different data sources, it is
desired that these errors are corrected in the SBR, too. However, the timing of incorporating
these corrections in the SBR (and the VAT and social security data) is extremely important
due to multiple use of SBR-information in data sources within or beyond the scope of the
statistical-DWH.
Finally there is an alternative approach. If the maintenance challenges are considered
acceptable and there are no consistency problems then it is feasible that the entire SBR be
incorporated within the statistical-DWH. If this alternative approach were to be followed it is
still imperative that the principles outlined within this report be adhered to.
The choice of which path to take is entirely at the discretion of each individual country/NSI. It
is more likely that this choice would be driven by cultural or legacy issues than by simple
efficiencies.