Title: Relate the 'ideal' architectural scheme into an actual development and implementation strategy
WP: 3
Deliverable: 3.5
Version: 1.2
Date: 9 August 2013
Authors: Antonio Laureti Palma, Björn Berglund, Allan Randlepp, Velerij Žavoronok
NSI: Istat, Statistics Sweden, Statistics Estonia, Statistics Lithuania
ESS - NET
ON MICRO DATA LINKING AND DATA WAREHOUSING
IN PRODUCTION OF BUSINESS STATISTICS
Summary
1 RELATE THE 'IDEAL' ARCHITECTURAL SCHEME INTO AN ACTUAL DEVELOPMENT AND IMPLEMENTATION STRATEGY
2 INTRODUCTION
2.1 Stovepipe model
2.2 The Data Warehouse approach
2.2.1 Integrated model
2.2.2 Data Warehouse model
2.3 The GSBPM
2.4 Generic Statistical Information Model (GSIM)
2.5 CORE, COmmon Reference Environment
3 STATISTICAL DATA WAREHOUSE ARCHITECTURE
3.1 Business Architecture
3.1.1 Business processes, that constitute the core business and create the primary value stream
3.1.1.1 Source Layer functionalities
3.1.1.2 Integration Layer functionalities
3.1.1.3 Interpretation and data analysis layer functionalities
3.1.1.4 Access Layer functionalities
3.1.2 Management processes, the processes that govern the operation of a system
3.1.3 Functional diagram for strategic over-arching processes
3.1.3.1 Functional diagram for operational over-arching processes
3.2 Information Systems Architecture
3.2.1 S-DWH is a metadata-driven system
3.2.2 Layered approach of a full active S-DWH
3.2.2.1 Source layer
3.2.2.2 Integration layer
3.2.2.3 Interpretation layer
3.2.2.4 Access layer
3.2.3 CORE services and reuse of components
1. Introduction
A statistical system is a complex system of data collection, data processing, statistical analysis, etc. The following figure (by Sundgren (2004)) shows a statistical system as a precisely defined, man-made system that measures external reality. It shows two main macro functions: “Planning and control system” and “Statistical production system”.
This is a general, synthesized view of the statistical system; it could represent one survey, the whole statistical office or even an international organization. How such a system is built up and organized in real life varies greatly. Some implementations of statistical systems have worked quite well so far, others less so. Local environments of statistical systems differ only slightly, but big changes in the environment are increasingly global. It no longer matters how well a system has performed so far: some global changes in the environment are so big that every system has to adapt and change (del 3.2).
Independently of any specific system, what the figure shows is a strong interaction, or hysteresis, of the system with the real world, and an overlap between the two main macro functions that accounts for the requests coming from the real world.
In the context of this ESSnet, we identify this system overlap as the effective Data Warehouse (DW), in which we are able to store statistical information from several statistical domains in order to support any analysis for strategic NSI or European decisions related to statistics. This identifies a new possible approach to statistical production based on a DW architecture; we define this specific approach as the Statistical Data Warehouse (S-DWH).
As in any data warehouse, the main purpose of a S-DWH is to integrate and store data generated as a result of an organization's activities in different production departments; in the commercial world this is typically done to optimize the supply chain or to carry out marketing strategies.
1.1 Stovepipe model
Today’s prevalent production model in statistical systems is the stovepipe model. It is the outcome of a historical process in which statistics in individual domains have developed independently. In the stovepipe model, a statistical action or survey is independent from other actions in almost every phase of the statistical production value chain.
Advantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H.
Linden (2009)):
1. The production processes are best adapted to the corresponding products.
2. It is flexible in that it can adapt quickly to relatively minor changes in the underlying phenomena
that the data describe.
3. It is under the control of the domain manager and it results in a low-risk business architecture, as
a problem in one of the production processes should normally not affect the rest of the
production.
Disadvantages of the stovepipe model (from W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H.
Linden (2009)):
1. First, it may impose an unnecessary burden on respondents when the collection of data is
conducted in an uncoordinated manner and respondents are asked for the same information more
than once.
2. Second, the stovepipe model is not well adapted to collect data on phenomena that cover multiple
dimensions, such as globalization, sustainability or climate change.
3. Last but not least, this way of production is inefficient and costly, as it does not make use of
standardization between areas and collaboration between the Member States. Redundancies and
duplication of work, be it in development, in production or in dissemination processes are
unavoidable in the stovepipe model.
The stovepipe model is the dominant model in the ESS and is reproduced and extended at Eurostat level as well, where it is referred to as the augmented stovepipe model.
1.2 The Data Warehouse approach
1.2.1 Integrated model
The integrated model is a new and innovative way of producing statistics. It is based on the combination of
various data sources. This integration can be horizontal or vertical:
- “Horizontal integration across statistical domains at the level of National Statistical Institutes
and Eurostat. Horizontal integration means that European statistics are no longer produced
domain by domain and source by source but in an integrated fashion, combining the individual
characteristics of different domains/sources in the process of compiling statistics at an early stage,
for example households or business surveys.” (W. Radermacher, A. Baigorri, D. Delcambre, W.
Kloek, H. Linden (2009))
- “Vertical integration covering both the national and EU levels. Vertical integration should be
understood as the smooth and synchronized operation of information flows at national and ESS
levels, free of obstacles from the sources (respondents or administration) to the final product (data
or metadata). Vertical integration consists of two elements: joint structures, tools and processes
and the so-called European approach to statistics.” (W. Radermacher, A. Baigorri, D. Delcambre,
W. Kloek, H. Linden (2009))
The integrated model was created to avoid the disadvantages of the stovepipe model (burden on respondents, unsuitability for surveying multi-dimensional phenomena, inefficiencies and high costs). “By integrating data
sets and combining data from different sources (including administrative sources) the various
disadvantages of the stovepipe model could be avoided. This new approach would improve efficiency by
elimination of unnecessary variation and duplication of work and create free capacities for upcoming
information needs.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
Moving from the stovepipe model to the integrated model is not an easy task at all. In his answer to
UNSC about the draft of the paper on “Guidelines on Integrated Economic Statistics” W. Radermacher
writes: “To go from a conceptually integrated system such as the SNA to a practically integrated system
is a long term project and will demand integration in the production of primary statistics. This is the
priority objective that Eurostat has given to the European Statistical System through its 2009
Communication to the European Parliament and the European Council on the production method of the
EU statistics ("a vision for the new decade").”
The Sponsorship on Standardization, a strategic task force in the European Statistical System, has
compared the traditional and the integrated approaches to statistical production. They conclude that “in the current
situation, it is clearly shown that there are high level risks and low level opportunities” and that “the full
integration situation is more balanced than the current situation, and the most interesting point is that risks
are mitigated and opportunities exploded.” (The Sponsorship on Standardisation (2013)) It seems that it is
strategically wise to move away from stovepipes and partly integrated statistical systems toward fully
integrated statistical production systems.
“Integration should address all stages of the production process, from design of the collection system to
the compilation and dissemination of data.” (W. Radermacher (2011)) Today, each statistical action designs its sample and questionnaires according to its own needs, uses variants of classifications as needed, selects data sources according to the needs of the action, and so on.
1.2.2 Data Warehouse model
The main purpose of a data warehouse (DW) is to integrate and store data generated as a result of an
organization's activities. A data warehouse system may constitute the whole production infrastructure or just one of its components and, using the data coming from different production departments, is generally used to optimize the supply chain or to carry out marketing strategies.
From a statistical production point of view, in addition to the stovepipe model, augmented stovepipe
model and integration model, W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009)
describe also the warehouse approach, defined as: “The warehouse approach provides the means to store
data once, but use it for multiple purposes. A data warehouse treats information as a reusable asset. Its
underlying data model is not specific to a particular reporting or analytic requirement. Instead of focusing
on a process-oriented design, the underlying repository design is modelled based on data interrelationships that are fundamental to the organisation across processes.”
Conceptual model of data warehousing in the ESS (European Statistical System)
(W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009))
“Based on this approach statistics for specific domains should not be produced independently from each
other, but as integrated parts of comprehensive production systems, called data warehouses. A data
warehouse can be defined as a central repository (or "storehouse") for data collected via various
channels.” (W. Radermacher, A. Baigorri, D. Delcambre, W. Kloek, H. Linden (2009)).
In the remainder of this document, a statistical production system model combining the integrated model and the warehouse approach will be referred to as a Statistical Data Warehouse (S-DWH).
In NSIs where statistical production processes on different topics follow stovepipe-like production lines, i.e. independent statistical production processes, the output system is generally used to collect final aggregate data. When several statistical productions are brought inside a common S-DWH, aggregate data on different topics should no longer be produced independently from each other but as integrated parts of a comprehensive information system. In this case statistical concepts and infrastructures are shared, and the data in a common statistical domain are stored once for multiple purposes.
In all these cases, the S-DWH is the central part of the whole IT infrastructure supporting statistical production and corresponds to a system able to manage all phases of a statistical production process.
In the following we will describe a generic S-DWH as a central statistical data store, regardless of the data's source, for managing all available data of interest, enabling the NSI to:
- (re)use data to create new data/new outputs;
- perform reporting;
- execute analysis;
- produce the necessary information.
This corresponds to a central repository able to manage several kinds of data (micro, macro and metadata) in order to support cross-domain production processes that are fully integrated in terms of data, metadata, processes and instruments, and to support the definition of new statistical strategies, designs or updates.
1.3 The GSBPM
In order to treat and manage all stages of a generic production process, it is useful to identify and locate the different phases of a generic statistical production process by using the GSBPM schema of figure X.
The original intention of the GSBPM defined by UNECE (vers.4) was: "...to provide a basis for statistical
organizations to agree on standard terminology to aid their discussions on developing statistical metadata
systems and processes. The GSBPM should therefore be seen as a flexible tool to describe and define the
set of business processes needed to produce official statistics.”
The GSBPM identifies a generic statistical business process, articulated in nine phases with related sub-processes, and nine over-arching management processes.
The nine statistical business phases are:
1. Specify Needs - This phase is related to a need for new statistics or an update of current statistics. It is a strategic activity in a S-DWH approach, since here it is possible to carry out a quick and low-cost first overall analysis of all the data and metadata available inside an institute.
2. Design phase - This phase describes the development and design activities, and any associated practical research work needed to define the statistical outputs, concepts, methodologies, collection instruments and operational processes. All its sub-processes are related to metadata definitions that coordinate the implementation process.
3. Build phase - In this phase all the sub-processes of the production system components are built and tested. For statistical outputs produced on a regular basis, this phase usually occurs for the first iteration, and following a review or a change in methodology, rather than for every iteration.
4. Collect phase - This phase collects all necessary data, using different collection modes (including extractions from administrative and statistical registers and databases), and loads them into the appropriate data environment, the source layer from a S-DWH point of view.
5. Process phase - This phase describes the cleaning of data records and their preparation for analysis.
6. Analyze phase - This phase is central for a S-DWH, since during this phase statistics are produced, validated, examined in detail and made ready for dissemination.
7. Disseminate phase - This phase manages the release of the statistical products to customers. For statistical outputs produced regularly, this phase occurs in every iteration.
8. Archive phase - This phase manages the archiving and disposal of statistical data and metadata.
9. Evaluate phase - This phase provides the basic information for the overall quality evaluation and management.
Figure X: GSBPM, generic statistical production process schema.
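As a purely illustrative aside (not part of the GSBPM specification), the phase and sub-process codes used throughout this document could be handled in a metadata-driven S-DWH as simple coded structures. The sketch below is an assumption of how such codes might be encoded; the source-layer mapping shown as an example anticipates section 2.1.1.1.

```python
# Illustrative sketch: GSBPM phase codes as they are referenced in this document.
# The phase names follow GSBPM v4; the layer-to-sub-process mapping is only an
# example taken from section 2.1.1.1, not a prescribed data model.
from enum import Enum

class GsbpmPhase(Enum):
    SPECIFY_NEEDS = 1
    DESIGN = 2
    BUILD = 3
    COLLECT = 4
    PROCESS = 5
    ANALYZE = 6
    DISSEMINATE = 7
    ARCHIVE = 8
    EVALUATE = 9

# Example: the sub-processes that the source layer supports (see 2.1.1.1).
SOURCE_LAYER_SUBPROCESSES = {
    GsbpmPhase.COLLECT: ["4.2 set up collection", "4.3 run collection", "4.4 finalize collection"],
}

if __name__ == "__main__":
    for phase, subs in SOURCE_LAYER_SUBPROCESSES.items():
        print(phase.name, "->", ", ".join(subs))
```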
The nine Management Over-Arching Processes are:
1. statistical program management – This includes systematic monitoring and reviewing of
emerging information requirements and emerging and changing data sources across all statistical
domains. It may result in the definition of new statistical business processes or the redesign of
existing ones
2. quality management – This process includes quality assessment and control mechanisms. It
recognizes the importance of evaluation and feedback throughout the statistical business process
3. metadata management – Metadata are generated and processed within each phase, there is,
therefore, a strong requirement for a metadata management system to ensure that the appropriate
metadata retain their links with data throughout the different phases
4. statistical framework management – This includes developing standards, for example
methodologies, concepts and classifications that apply across multiple processes
5. knowledge management – This ensures that statistical business processes are repeatable, mainly
through the maintenance of process documentation
6. data management – This includes process-independent considerations such as general data
security, custodianship and ownership
7. process data management – This includes the management of data and metadata generated by, and providing information on, all parts of the statistical business process. (Process management is the ensemble of activities of planning and monitoring the performance of a process; operations management is the area of management concerned with overseeing, designing and controlling the process of production, and with redesigning business operations in the production of goods or services.)
8. provider management – This includes cross-process burden management, as well as topics such
as profiling and management of contact information (and thus has particularly close links with
statistical business processes that maintain registers)
9. customer management – This includes general marketing activities, promoting statistical literacy,
and dealing with non-specific customer feedback.
The quality and metadata over-arching management processes and the statistical business process are depicted in the figure below.
1.4 Generic Statistical Information Model (GSIM)
Another model used for describing statistical processes, defined by the “High-Level Group for the Modernisation of Statistical Production and Services”, is the Generic Statistical Information Model (GSIM), which is a reference framework of internationally agreed definitions, attributes and relationships
that describe the pieces of information that are used in the production of official statistics (information
objects). This framework enables generic descriptions of the definition, management and use of data and
metadata throughout the statistical production process.
GSIM Specification provides a set of standardized, consistently described information objects, which are
the inputs and outputs in the design and production of statistics. Each information object has been defined
and its attributes and relationships have been specified. GSIM is intended to support a common
representation of information concepts at a “conceptual” level. It means that it is representative of all the
information objects which would be required to be present in a statistical system.
In the case of processes, there are objects in the model to represent them. However, GSIM is conceptual and not at the implementation level, so it does not support just one specific technical architecture: it is technically 'agnostic'.
Because GSIM is a conceptual model, it does not provide any tools or measures for IT process management. It only tries to identify the objects which would be used in statistical processes, and therefore it does not provide advice on tools etc. (which would be at the implementation level). However, in
terms of process management, GSIM should define the objects which would be required in order to
manage processes. These objects would specify what process flow should occur from one process step to
another. It might also contain the conditions to be evaluated at the time of execution, to determine which
process steps to execute next.
Figure 1: General Statistical Information Model (GSIM) [from High-Level Group for the Modernisation of
Statistical Production and Services]
We will use the GSIM as conceptual model together with the GSBPM in order to define all the basic
requirements for a Statistical Information Model, in particular:
- the Business Group (blue) is used to describe the designs and plans of Statistical Programs;
- the Production Group (red) is used to describe each step in the statistical process, with a particular focus on describing the inputs and outputs of these steps;
- the Concepts Group (green) contains sets of information objects that describe and define the terms used when talking about real-world phenomena that the statistics measure in their practical implementation (population frame and units);
- the Structures Group (orange) contains sets of information objects that describe and define the terms used in relation to data and their structure.
In the following discussion we will use these four conceptual groups to connect the nine statistical phases with the over-arching management processes of the GSBPM.
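To make the four GSIM groups slightly more concrete, the following minimal sketch shows how information objects from the Business, Production, Concepts and Structures groups might be represented in a statistical system. The class and attribute names are illustrative assumptions, not the official GSIM object definitions.

```python
# Hypothetical sketch of GSIM-style information objects grouped into the four
# conceptual groups described above. Names and attributes are assumptions made
# for illustration only; they do not reproduce the official GSIM object model.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StatisticalProgram:          # Business group (blue): designs and plans
    name: str
    cycles: List[str] = field(default_factory=list)

@dataclass
class ProcessStep:                 # Production group (red): inputs/outputs of each step
    gsbpm_code: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)

@dataclass
class Population:                  # Concepts group (green): real-world terms, units, frames
    description: str
    unit_type: str

@dataclass
class DataStructure:               # Structures group (orange): data and their structure
    name: str
    variables: List[str] = field(default_factory=list)

# Example: a design step of a business survey described through these objects.
step = ProcessStep("2.4", inputs=["business register frame"], outputs=["sampling plan"])
print(step)
```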
1.5 CORE, COmmon Reference Environment
There are many software models and approaches available to build modular flows between layers. The S-DWH's layered architecture itself provides the possibility to use different platforms and software in separate layers, or to re-use components already available. In addition, different software can be used inside the same layer to build up one particular flow. The problems arise when we try to use these different modules and different data formats together.
One of the approaches is CORE services. They are used to move data between S-DWH layers and also inside the layers between different sub-tasks; this makes it easier to use software provided by the statistical community or to re-use self-developed components to build flows for different purposes.
CORE services are based on SDMX standards and use their general conception of messages and processes. The feasibility of using CORE within a statistical system was proved in the ESSnet CORE. Note that CORE is not a kind of software but only a set of methods and approaches.
Generally CORE (COmmon Reference Environment) is an environment supporting the definition of
statistical processes and their automated execution. CORE processes are designed in a standard way,
starting from available services; specifically, process definition is provided in terms of abstract statistical
services that can be mapped to specific IT tools. CORE goes in the direction of fostering the sharing of
tools among NSIs. Indeed, a tool developed by a specific NSI can be wrapped according to CORE
principles, and thus easily integrated within a statistical process of another NSI. Moreover, having a
single environment for the execution of entire statistical processes provides a high level of automation
and complete reproducibility of process execution.
The main principles underlying CORA design are:
a) Platform Independence. NSIs use various platforms (e.g., hardware, operating systems,
database management systems, statistical software, etc.), hence an architecture is bound to fail if it endeavours to impose standards at a technical level. Moreover, platform independence makes it possible to
model statistical processes at a “conceptual level”, so that they do not need to be modified when
the implementation of a service changes.
b) Service Orientation. The vision is that the production of statistics takes place through services
calling other services. Hence services are the modular building blocks of the architecture. By
having clear communication interfaces, services implement principles of modern software
engineering like encapsulation and modularity.
c) Layered Approach. According to this principle, some services are rich and are positioned at the
top of the statistical process, so, for instance a publishing service requires the output of all sorts of
services positioned earlier in the statistical process, such as collecting data and storing
information. The ambition of this model is to bridge the whole range of layers from collection to
publication by describing all layers in terms of services delivered to a higher layer, in such a way
that each layer depends only on the layer immediately below it.
In a general sense, an integration API permits wrapping a tool in order to make it CORE-compliant, i.e. a CORE executable service. A CORE service is indeed composed of an inner part, which is the tool to be wrapped, and of input and output integration APIs. Such APIs transform data from the CORE model into the tool-specific format and back.
Basically, the integration API consists of a set of transformation components. Each transformation
component corresponds to a specific data format and the principal elements of their design are specific
mapping files, description files and transform operations.
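To make the idea of wrapping a tool between input and output transformation components more tangible, the sketch below shows one possible shape of such a wrapper. It is purely illustrative: the function names, the generic record format and the trivial "tool" are assumptions, not the actual CORE model, message formats or API.

```python
# Illustrative sketch of wrapping an existing tool as a "CORE-like" executable
# service: an input adapter maps a generic data set to the tool's own format,
# the tool runs, and an output adapter maps the result back. All names here are
# assumptions for illustration, not the real CORE integration API.
from typing import Callable, Dict, List

Record = Dict[str, str]

def generic_to_tool_format(records: List[Record]) -> List[List[str]]:
    """Input integration API: transform generic records into the tool's format."""
    return [[r["unit_id"], r["turnover"]] for r in records]

def tool_format_to_generic(rows: List[List[str]]) -> List[Record]:
    """Output integration API: transform tool output back into generic records."""
    return [{"unit_id": row[0], "turnover_validated": row[1]} for row in rows]

def wrap_as_core_service(tool: Callable[[List[List[str]]], List[List[str]]]):
    """Return a service that accepts and produces the generic representation."""
    def service(records: List[Record]) -> List[Record]:
        return tool_format_to_generic(tool(generic_to_tool_format(records)))
    return service

# A trivial stand-in "tool" that validates turnover values (replaces an NSI tool).
def validation_tool(rows: List[List[str]]) -> List[List[str]]:
    return [[unit, turnover if turnover.isdigit() else "0"] for unit, turnover in rows]

service = wrap_as_core_service(validation_tool)
print(service([{"unit_id": "A01", "turnover": "1200"}, {"unit_id": "A02", "turnover": "n/a"}]))
```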
2 Statistical Data Warehouse architecture
The basic elements that must be considered in designing a DWH architecture are the data sources, the
management instruments, the effective data warehouse data structure, in terms of micro, macro and
metadata, and the different types and numbers of possible users.
A DWH is generally composed of two main functional environments: the first is where all available information is collected and built up, usually defined as the Extraction, Transformation and Loading (ETL) environment, while the second is the actual data warehouse, i.e. where data analysis or mining, reports for executives and statistical deliverables are produced.
If data enter the S-DWH only after any statistical check or data imputation (the Process phase in the GSBPM), the S-DWH uses cleaned micro data, with their descriptive and quality metadata, as sources.
We define this approach as a “passive S-DWH system”, in which we exclude from the ETL phase any statistical action on data to correct or modify values, i.e. we exclude any sub-process of the GSBPM Process phase, although of course we do not exclude further data transformation steps for a coherent harmonization of the definitions.
Otherwise, if we include in the ETL phase all statistical checks or data imputations on the data sources, we are considering a whole statistical production system with its typical statistical elaborations. We define this approach as a “full active S-DWH system”, in which we include statistical actions on data to correct or modify values and transform them, so as to harmonize definitions from the sources to the output.
With this last approach we must identify a single entry point for the metadata definitions and the management of all statistical processes handled. This approach is the most complex one in terms of design, and it depends on the ability of each organization to overcome the methodological, organizational and IT barriers of a full active S-DWH.
Any intermediate solution, between a full active S-DWH system and a passive S-DWH system, can be accommodated by managing as external sources the data and metadata produced outside the S-DWH control system; the boundary of a S-DWH is then the operational limit of its internal users, which depends on the typology and availability of the data sources.
In a generic full active S-DWH system we identify four functional layers; starting from the bottom up to the top of the architectural pile, they are defined as:
IV - access layer, for the final presentation, dissemination and delivery of the information sought, specialized for users external to the NSI or Eurostat;
III - interpretation and data analysis layer, which enables any data analysis or data mining functional to supporting statistical design or any new strategy, as well as data re-use; functionality and data are therefore optimized for internal users, specifically statistical methodologists and statistical experts;
II - integration layer, where all operational activities needed for any statistical production process are carried out; in this layer data are mainly transformed from raw to cleaned data and these activities are carried out by internal operators;
I - source layer, the level in which we locate all the activities related to storing and managing internal or external data sources.
The ground level corresponds to the area where the process starts, while the top of the pile is where the
data warehousing process finishes. This reflects a conceptual organization in which we consider the first
two levels as operational IT infrastructures and the last two layers as the effective data warehouse.
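The four layers and the distinction between the operational infrastructure (layers I and II) and the effective data warehouse (layers III and IV) can be summarized in a small sketch; the constant names below are assumptions that simply encode the text of this section.

```python
# Illustrative sketch of the four-layer organization described above.
# The names are assumptions; they only restate the text of this section.
from enum import IntEnum

class Layer(IntEnum):
    SOURCE = 1          # I   - storing and managing internal/external sources
    INTEGRATION = 2     # II  - operational activities, raw -> cleaned data
    INTERPRETATION = 3  # III - analysis, data mining, design support
    ACCESS = 4          # IV  - presentation, dissemination, delivery

OPERATIONAL_LAYERS = {Layer.SOURCE, Layer.INTEGRATION}     # operational IT infrastructure
WAREHOUSE_LAYERS = {Layer.INTERPRETATION, Layer.ACCESS}    # the effective data warehouse

def is_effective_warehouse(layer: Layer) -> bool:
    return layer in WAREHOUSE_LAYERS

print([f"{l.value}:{l.name}" for l in Layer], is_effective_warehouse(Layer.ACCESS))
```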
This layered S-DWH vision can be described in terms of three reference architecture domains:
- Business Architecture,
- Information Systems Architecture,
- Technology Architecture.
The Business Architecture (BA) is a part of an enterprise architecture related to corporate business, and
the documents and diagrams that describe the architectural structure of that business.
The Information Systems Architecture is, in our context, the conceptual organization of the effective S-DWH, which is able to support tactical demands.
The Technology Architecture is the combined set of software, hardware and networks able to develop and support IT services. Since we are interested in a conceptual S-DWH, this domain is not relevant here and we will not discuss it in this ESSnet (interesting studies in line with our approach can be found for service-oriented IT infrastructures, such as the GSIM and CORA projects, whose frameworks seem to fit quite well with our activity).
2.1 Business Architecture
The BA is the bridge between the enterprise business model and enterprise strategy on one side, and the
business functionality of the enterprise on the other side and is used to align strategic objectives and
tactical demands. It provides a common understanding of an NSI, articulating the organization by:
- management processes, the processes that govern the operation of a system;
- business processes, the processes that constitute the core business and create the primary value stream.
2.1.1 Business processes, that constitute the core business and create the primary value stream
In the layered S-DWH vision we identify the business processes in each layer: the ground level corresponds to the area where the external sources come in and are interfaced, while the top of the pile is where aggregated, or deliverable, data are available for external users. In the intermediate layers we manage the ETL functions for loading the DWH, in which strategic analysis, data mining and design are carried out, for possible new strategies or data re-use.
This reflects a conceptual organization in which we consider the first two levels as purely statistical operational infrastructure, where the necessary information is produced, functional for acquiring, storing, coding, checking, imputing, editing and validating data, and the last two layers as the effective data warehouse, i.e. the levels in which data are accessible for executing analysis, re-using data and performing reporting.
[Figure: the four capabilities mapped onto the layers - Access layer: new outputs, perform reporting; Interpretation and Analysis layer: re-use data to create new data, execute analysis; Integration layer and Sources layer: produce the necessary information.]
The core of the S-DWH system is the interpretation and analysis layer; this is the effective data warehouse and must support all kinds of statistical analysis or data mining, on micro and macro data, in order to support statistical design, data re-use or real-time quality checks during production.
Layers II and III are reciprocally functional to each other. Layer II always prepares the elaborated information for layer III: from raw data, just uploaded into the S-DWH and not yet included in a production process, to micro/macro statistical data at any elaboration step of any production process. Conversely, in layer III it must be possible to easily access and analyze the micro/macro elaborated data of the production processes at any state of elaboration, from raw data to cleaned and validated micro data. This is because, in layer III, methodologists should correct possible operational elaboration mistakes before, during and after any statistical production line, or design new elaboration processes for new surveys. In this way a new concept or strategy can generate feedback toward layer II, which is then able to correct, or increase the quality of, the regular production lines.
A key factor of this S-DWH architecture is that layers II and III must include components of bidirectional cooperation. This means that layer II supplies elaborated data for analytical activities, while layer III supplies concepts usable for the engineering of ETL functions or of new production processes.
[Figure: bidirectional cooperation between layer III (Interpretation and Analysis Layer) and layer II (Integration Layer) - data flow upward from layer II to layer III, concepts flow downward from layer III to layer II.]
Finally, the access layer should support the functionalities related to the operation of output systems, from the dissemination web application to interoperability. From this point of view, the access layer operates inversely to the source layer: in layer IV we should realize all data transformations, in terms of data and metadata, from the S-DWH data structure toward any possible interface tool functional to dissemination.
In the following sections we will indicate explicitly the atomic activities that should be supported by each
layer using the GSBPM taxonomy.
2.1.1.1 Source Layer functionalities
The Source Layer is the level in which we locate all the activities related to storing and managing internal
or external data sources. Internal data are from direct data capturing carried out by CAWI, CAPI or
CATI; while external data are from administrative archives, for example from Customs Agencies,
Revenue Agencies, Chambers of Commerce, National Social Security Institutes.
Generally, data from direct surveys are well-structured so they can flow directly into the integration layer.
This is because NSIs have full control of their own applications. By contrast, data from other institutions' archives must come into the S-DWH with their metadata in order to be read correctly.
In the source layer we support data loading operations for the integration layer but do not include any data
transformation operations, which will be realized in the next layer.
Analyzing the GSBPM shows that the only activities that can be included in this layer are:
Phase 4 - Collect: 4.2 set up collection; 4.3 run collection; 4.4 finalize collection.
Set up collection (4.2) ensures that the processes and technology are ready to collect data, i.e. that the people, instruments and technology are ready to work for any data collection.
This sub-process includes:
-preparing web collection instruments,
-training collection staff,
-ensuring collection resources are available e.g. laptops,
-configuring collection systems to request and receive the data,
-ensuring the security of data to be collected.
Where the process is repeated regularly, some of these activities may not be explicitly required for each
iteration.
Run collection (4.3) is where the collection is implemented, with different collection instruments being
used to collect the data.
It is important to consider that in a web survey the run collection sub-process could be concurrent with the review, validate & edit sub-process.
Finalize collection (4.4) includes loading the collected data into a suitable electronic environment for
further processing of the next layers. This sub-process also aims to check the metadata descriptions of all
external archives entering the S-DWH system. In a generic data interchange, as far as metadata transmission is concerned, the mapping between the metadata concepts used by different international organizations could support the idea of open exchange and sharing of metadata based on common terminology.
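As a purely illustrative example of the point above (external archives must enter the S-DWH together with their metadata), a loading step in the source layer could check an incoming file against a minimal metadata description before accepting it. The field names, the archive name and the checking rule in the sketch below are assumptions.

```python
# Hypothetical sketch: accept an external administrative file into the source
# layer only if it matches the metadata description supplied with it. No data
# transformation happens here; that is left to the integration layer.
import csv, io

metadata = {
    "archive": "chamber_of_commerce_2012",          # assumed archive name
    "variables": ["unit_id", "nace_code", "turnover"],
}

incoming_file = io.StringIO("unit_id;nace_code;turnover\nA01;4711;1200\n")

reader = csv.DictReader(incoming_file, delimiter=";")
received_variables = reader.fieldnames or []

if received_variables != metadata["variables"]:
    raise ValueError(f"File does not match metadata description: {received_variables}")

records = list(reader)   # loaded as-is, ready for the integration layer
print(f"Loaded {len(records)} records from {metadata['archive']}")
```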
2.1.1.2 Integration Layer functionalities
The integration layer is where all operational activities needed for any statistical elaboration process are
carried out. This means operations carried out automatically or manually by operators to produce
statistical information in an IT infrastructure. With this aim, different sub-processes are pre-defined and
pre-configured by statisticians as a consequence of the statistical survey design in order to support the
operational activities.
This means that whoever is responsible for a statistical production subject defines the operational work
flow and each elaboration step, in terms of input and output parameters that must be defined in the
integration layer, to realize the statistical elaboration.
For this reason, production tools in this layer must support an adequate level of generalization for a wide
range of processes and iterative productions. They should be organized in operational work flows for
checking, cleaning, linking and harmonizing data-information in a common persistent area where
information is grouped by subject. This could be those recurring (cyclic) activities involved in the running
of the whole or any part of a statistical production process and should be able to integrate activities of
different statistical skills and of different information domains.
To sustain these operational activities, it would be advisable to have micro data organized in generalized data structures able to accommodate any kind of statistical production; otherwise data may be organized in completely free form, but with a level of metadata able to provide an automatic structured interface toward the data themselves.
Therefore, a wide family of software applications is possible in the integration layer, from data integration tools, where a user-friendly graphical interface helps to build up work flows, to generic statistical elaboration lines or parts of them.
In this layer, we should include all the sub-processes of phase 5 and some sub-processes from phases 4, 6 and 7 of the GSBPM:
Phase 5 - Process: 5.1 integrate data; 5.2 classify & code; 5.3 review, validate & edit; 5.4 impute; 5.5 derive new variables and statistical units; 5.6 calculate weights; 5.7 calculate aggregates; 5.8 finalize data files.
Phase 6 - Analyze: 6.1 prepare draft output.
Integrate data (5.1), this sub-process integrates data from one or more sources. Input data can be from
external or internal data sources and the result is a harmonized data set. Data integration typically
includes record linkage routines and prioritising, when two or more sources contain data for the same
variable (with potentially different values).
The integration sub-process includes micro data record linkage which can be realized before or after any
reviewing or editing, depending on the statistical process. At the end of each production process, data
organized by subject area should be clean and linkable.
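The following minimal sketch illustrates the kind of prioritisation mentioned above, when two sources provide the same variable for the same unit; the priority order, variable names and values are assumptions made only for illustration.

```python
# Hypothetical sketch of sub-process 5.1: combine two sources for the same
# statistical units, preferring the survey value over the administrative one
# when both report the same variable.
survey = {"A01": {"turnover": 1250}, "A02": {"turnover": None}}
admin = {"A01": {"turnover": 1200}, "A02": {"turnover": 980}, "A03": {"turnover": 450}}

sources = {"survey": survey, "admin": admin}
SOURCE_PRIORITY = ["survey", "admin"]   # assumed prioritisation rule

def integrate(unit_id: str) -> dict:
    """Return the harmonized value for one unit, following the priority order."""
    for name in SOURCE_PRIORITY:
        value = sources[name].get(unit_id, {}).get("turnover")
        if value is not None:
            return {"unit_id": unit_id, "turnover": value, "source": name}
    return {"unit_id": unit_id, "turnover": None, "source": None}

harmonized = [integrate(u) for u in sorted(set(survey) | set(admin))]
print(harmonized)
```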
Classify and code (5.2), this sub-process classifies and codes data. For example automatic coding routines
may assign numeric codes to text responses according to a pre-determined classification scheme, which
should include a residual interactive human activity.
Review, validate and edit (5.3), this sub-process applies to collected micro-data, and looks at each record
to try to identify (and where necessary correct) potential problems, errors and discrepancies such as
outliers, item non-response and miscoding. It can also be referred to as input data validation. It may be
run iteratively, validating data against predefined edit rules, usually in a set order. It may apply automatic
edits, or raise alerts for manual inspection and correction of the data. Reviewing, validating and editing
can apply to unit records both from surveys and administrative sources, before and after integration.
Impute (5.4), this sub-process refers to when data are missing or unreliable. Estimates may be imputed,
often using a rule-based approach.
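A minimal sketch of sub-processes 5.3 and 5.4 working together is given below; the edit rule, the imputation rule and the figures are invented purely for illustration, not methodological guidance.

```python
# Hypothetical sketch: validate micro-data against a predefined edit rule
# (5.3 review, validate & edit) and fill missing or failing values with a
# simple rule-based imputation (5.4 impute). Rules are illustrative only.
records = [
    {"unit_id": "A01", "employees": 12, "turnover": 1200},
    {"unit_id": "A02", "employees": 3, "turnover": None},      # item non-response
    {"unit_id": "A03", "employees": 8, "turnover": -50},       # fails the edit rule
]

def edit_rule(r):
    """Edit rule: turnover, when reported, must be non-negative."""
    return r["turnover"] is None or r["turnover"] >= 0

TURNOVER_PER_EMPLOYEE = 100   # assumed imputation parameter

for r in records:
    if not edit_rule(r):
        r["turnover"] = None            # flag the failing value for imputation
    if r["turnover"] is None:
        r["turnover"] = r["employees"] * TURNOVER_PER_EMPLOYEE
        r["imputed"] = True

print(records)
```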
Derive new variables and statistical units (5.5), this sub-process in this layer describes the simple
function of the derivation of new variables and statistical units from existing data using logical rules
defined by statistical methodologists.
Calculate weights (5.6), this sub-process creates weights for unit data records according to the defined
methodology and is automatically applied for each iteration.
Calculate aggregates (5.7), this sub-process creates predefined aggregate data from micro-data for
each iteration. Sometimes this may be an intermediate rather than a final activity, particularly for business
processes where there are strong time pressures, and a requirement to produce both preliminary and final
estimates.
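As a purely illustrative example of sub-processes 5.6 and 5.7 in an automated iteration, design weights could be applied to clean micro-data to produce a predefined aggregate; the strata, weights and figures below are invented.

```python
# Hypothetical sketch: apply design weights (5.6 calculate weights) and produce
# a predefined aggregate (5.7 calculate aggregates) from clean micro-data.
micro = [
    {"unit_id": "A01", "stratum": "small", "turnover": 1200},
    {"unit_id": "A02", "stratum": "small", "turnover": 300},
    {"unit_id": "A03", "stratum": "large", "turnover": 9800},
]

# Assumed design weights: population size / sample size per stratum.
stratum_weights = {"small": 50.0, "large": 4.0}

for r in micro:
    r["weight"] = stratum_weights[r["stratum"]]

estimated_total_turnover = sum(r["turnover"] * r["weight"] for r in micro)
print(f"Estimated total turnover: {estimated_total_turnover:,.0f}")
```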
Finalize data files (5.8), this sub-process brings together the results of the production process, usually
macro-data, which will be used as input for dissemination.
Prepare draft outputs (6.1), this sub-process is where the information produced is transformed into
statistical outputs for each iteration. Generally, it includes the production of additional measurements
such as indices, trends or seasonally adjusted series, as well as the recording of quality characteristics.
The presence of this sub-process in this layer is strictly related to regular production processes, in which the estimated measures are produced on a regular basis, as is the case in short-term statistics (STS).
2.1.1.3 Interpretation and data analysis layer functionalities
The interpretation and data analysis layer is intended specifically for internal users, statisticians, and enables any data analysis and data mining at the maximum level of granularity, micro data, for designing production processes or identifying opportunities for data re-use. Data mining is the process of applying statistical methods to data with the intention of uncovering hidden patterns. This layer must be suitable for supporting experts in free data analysis in order to design or test any possible new statistical methodology or strategy.
The results expected from the human activities in this layer should then be statistical “services” useful for other phases of the elaboration process, from sampling, to the set-up of the instruments used in the Process phase, to the generation of possible new statistical outputs. These services can also be oriented to re-use, by creating new hypotheses to test against larger data populations. In this layer experts can design the complete process of information delivery, which includes cases where the demand for new statistical information does not necessarily involve the construction of new surveys, or a complete work-flow setup for any new survey needed.
[Figure - Case “produce the necessary information”: GSBPM phases mapped onto the layers - Access layer: 7 Disseminate; Interpretation layer: 2 Design, 6 Analyse, 9 Evaluate; Integration layer: 3 Build, 5 Process; Source layer: 4 Collect.]
From this point of view, the activities on the Interpretation layer should be functional not only to
statistical experts for analysis but also to self-improve the S-DWH, by a continuous update, or new
definition, of the production processes managed by the S-DWH itself.
We should point out that a S-DWH approach can also increase efficiency in the Specify Needs and
Design phases, since statistical experts, working on these phases in layer III, share the same information elaborated in the Process phase in layer II.
The use of a data warehouse approach for statistical production has the advantage of forcing different
typologies of users to share the same information data. That is, the same stored-data are usable for
different statistical phases.
[Figure - Case “re-use data to create new data”: GSBPM phases mapped onto the layers - Access layer: 7 Disseminate; Interpretation layer: 5 Process, 2 Design, 6 Analyse, 9 Evaluate; Integration layer: 5 Process; Source layer: 4 Collect.]
In general, only a reduced number of users are allowed to operate in the interpretation layer, in order to prevent a degradation of server performance, given that deep data analyses may involve very complex activities not always pre-evaluated in terms of processing costs. Moreover, queries on the operational structures of the integration layer cannot be left to free user access; they must always be optimized and mediated by specific tools in order not to reduce the server performance of the integration layer.
Therefore, this layer supports any possible activity for new statistical production strategies aimed at recovering facts from large administrative archives. This would create more production efficiency, less statistical burden and lower production costs.
From the GSBPM then we consider:
Phase 1 - Specify Needs: 1.5 check data availability.
Phase 2 - Design: 2.1 design outputs; 2.2 design variable descriptions; 2.4 design frame and sample methodology; 2.5 design statistical processing methodology; 2.6 design production systems and workflow.
Phase 4 - Collect: 4.1 select sample.
Phase 5 - Process: 5.1 integrate data; 5.5 derive new variables and statistical units; 5.6 calculate weights; 5.7 calculate aggregates.
Phase 6 - Analyze: 6.1 prepare draft output; 6.2 validate outputs; 6.3 scrutinize and explain; 6.4 apply disclosure control; 6.5 finalize outputs.
Phase 7 - Disseminate: 7.1 update output systems.
Phase 9 - Evaluate: 9.1 gather evaluation inputs; 9.2 conduct evaluation.
Check data availability (1.5), this sub-process checks whether current data sources could meet user
requirements, and the conditions under which they would be available, including any restrictions on their
use. An assessment of possible alternatives would normally include research into potential administrative
data sources and their methodologies, to determine whether they would be suitable for use for statistical
purposes. When existing sources have been assessed, a strategy for filling any remaining gaps in the data
requirement is prepared. This sub-process also includes a more general assessment of the legal framework
in which data would be collected and used, and may therefore identify proposals for changes to existing
legislation or the introduction of a new legal framework.
Design outputs (2.1), this sub-process contains the detailed design of the statistical outputs to be
produced, including the related development work and preparation of the systems and tools used in phase
7 (Disseminate). Outputs should be designed, wherever possible, to follow existing standards, so inputs to
this process may include metadata from similar or previous collections and international standards.
Design variable descriptions (2.2), this sub-process defines the statistical variables to be collected via the
data collection instrument, as well as any other variables that will be derived from them in sub-process
5.5 (Derive new variables and statistical units), and any classifications that will be used. This sub-process
may need to run in parallel with sub-process 2.3 (Design data collection methodology), as the definition
of the variables to be collected, and the choice of data collection instrument may be inter-dependent to
some degree. Layer III can be seen as a simulation environment able to identify the variables effectively needed.
Design frame and sample methodology (2.4), this sub-process identifies and specifies the population of
interest, defines a sampling frame (and, where necessary, the register from which it is derived), and
determines the most appropriate sampling criteria and methodology (which could include complete
enumeration). Common sources are administrative and statistical registers, censuses and sample surveys.
This sub-process describes how these sources can be combined if needed. Analysis of whether the frame
covers the target population should be performed. A sampling plan should be made: the actual sample is created in sub-process 4.1 (Select sample), using the methodology specified in this sub-process.
Design statistical processing methodology (2.5), this sub-process designs the statistical processing
methodology to be applied during phase 5 (Process), and Phase 6 (Analyse). This can include
specification of routines for coding, editing, imputing, estimating, integrating, validating and finalising
data sets.
Design production systems and workflow (2.6), this sub-process determines the workflow from data
collection to archiving, taking an overview of all the processes required within the whole statistical
production process, and ensuring that they fit together efficiently with no gaps or redundancies. Various
systems and databases are needed throughout the process. A general principle is to reuse processes and
technology across many statistical business processes, so existing systems and databases should be
examined first, to determine whether they are fit for purpose for this specific process, then, if any gaps are
identified, new solutions should be designed. This sub-process also considers how staff will interact with
systems, and who will be responsible for what and when.
Select sample (4.1), this sub-process establishes the frame and selects the sample for each iteration of the
collection, in line with the design frame and sample methodology. This is an interactive activity on
statistical business registers, typically carried out by statisticians using advanced methodological tools.
It includes the coordination of samples between instances of the same statistical business process (for
example to manage overlap or rotation), and between different processes using a common frame or
register (for example to manage overlap or to spread response burden).
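The sketch below illustrates sub-process 4.1 at this level of abstraction: drawing a stratified sample from a toy business register according to allocations fixed at design time. The register content, the strata and the allocations are invented for illustration only.

```python
# Hypothetical sketch of 4.1 select sample: draw a stratified simple random
# sample from a toy business register, using allocations fixed in the design.
import random

random.seed(2013)  # reproducible illustration

register = [{"unit_id": f"U{i:03d}", "size_class": "large" if i % 10 == 0 else "small"}
            for i in range(1, 101)]

allocation = {"large": 5, "small": 10}   # assumed design allocation per stratum

sample = []
for stratum, n in allocation.items():
    units = [u for u in register if u["size_class"] == stratum]
    sample.extend(random.sample(units, min(n, len(units))))

print(f"Selected {len(sample)} units:", [u["unit_id"] for u in sample[:5]], "...")
```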
Integrate data (5.1), in this layer this sub-process makes it possible for experts to freely carry out micro data record linkage across different information data-sources when these refer to the same statistical analysis unit. In this layer this sub-process must be intended as an evaluation of the data-linking design, wherever needed.
Derive new variables and statistical units (5.5), this sub-process derives variables and statistical units
that are not explicitly provided in the collection, but are needed to deliver the required outputs. In this
layer this function would be used to set up procedures or to define the derivation rules applicable in each production iteration. In this layer this sub-process must be intended as an evaluation step for the design of new variables.
Prepare draft outputs (6.1), in this layer this sub-process refers to the free construction of non-regular outputs.
Validate outputs (6.2), this sub-process is where statisticians validate the quality of the outputs produced.
This sub-process is also intended as a regular operational activity, and the validations are carried out at the end of each iteration against an already defined quality framework.
Scrutinize and explain (6.3) this sub-process is where the in-depth understanding of the outputs is gained
by statisticians. They use that understanding to scrutinize and explain the statistics produced for this cycle
by assessing how well the statistics reflect their initial expectations, viewing the statistics from all
perspectives using different tools and media, and carrying out in-depth statistical analyses.
Apply disclosure control (6.4), this sub-process ensures that the data (and metadata) to
be disseminated do not breach the appropriate rules on confidentiality. This means the use of specific
methodological tools to check primary and secondary disclosure.
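As an illustration of a primary disclosure check, the sketch below applies a simple frequency threshold rule to aggregate cells. The threshold value and the cells are invented, and secondary suppression (protecting suppressed cells from being derived from margins) is deliberately not shown.

```python
# Hypothetical sketch of 6.4 apply disclosure control: suppress aggregate cells
# that are based on fewer than THRESHOLD contributing units (primary
# suppression). Secondary suppression is out of scope of this sketch.
THRESHOLD = 3   # assumed minimum number of contributors per published cell

cells = [
    {"nace": "47.11", "region": "North", "contributors": 12, "turnover": 54_000},
    {"nace": "47.11", "region": "South", "contributors": 2, "turnover": 8_300},
]

for cell in cells:
    if cell["contributors"] < THRESHOLD:
        cell["turnover"] = None          # primary suppression
        cell["flag"] = "confidential"

print(cells)
```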
Finalize outputs (6.5), this sub-process ensures the statistics and associated information are fit for
purpose and reach the required quality level, and are thus ready for use.
Update output systems (7.1), this sub-process manages updates to the systems where data and metadata are
stored for dissemination purposes.
Gather evaluation inputs (9.1), evaluation material can be produced in any other phase or sub-process. It
may take many forms, including feedback from users, process metadata, system metrics and staff
suggestions. Reports of progress against an action plan agreed during a previous iteration may also form
an input to evaluations of subsequent iterations. This sub-process gathers all of these inputs, and makes
them available for the person or team producing the evaluation.
Conduct evaluation (9.2), this sub-process analyzes the evaluation inputs and synthesizes them into an
evaluation report. The resulting report should note any quality issues specific to this iteration of the
statistical business process, and should make recommendations for changes if appropriate. These
recommendations can cover changes to any phase or sub-process for future iterations of the process, or
can suggest that the process is not repeated.
2.1.1.4 Access Layer functionalities
The access layer is the layer for the final presentation, dissemination and delivery of the information sought. This layer is addressed to a wide typology of external users and computer instruments. It must support automatic dissemination systems as well as free analysis tools; in both cases, the statistical information consists mainly of non-confidential macro data, with micro data only in special limited cases.
This typology of users can be supported by three broad categories of instruments:
- a specialized web server for software interfaces towards other external integrated output systems.
A typical example is the interchange of macro data information via SDMX, as well as with other
XML standards of international organizations.
- specialized Business Intelligence tools. In this category, extensive in terms of solutions on the
market, we find tools to build queries, navigational tools (OLAP viewer), and in a broad sense
web browsers, which are becoming the common interface for different applications. Among these
we should also consider graphics and publishing tools able to generate graphs and tables for
users.
- office automation tools. This is a reassuring solution for users who come to the data warehouse
context for the first time, as they are not forced to learn new complex instruments. The problem is
that this solution, while adequate with regard to productivity and efficiency, is very restrictive in
the use of the data warehouse, since these instruments have significant architectural and functional limitations.
In order to support these different typologies of instruments, this layer must allow the transformation of data-information already estimated and validated in the previous layers by automatic software.
From the GSBPM we may consider only phase 7 for the operational processes, specifically:
Phase 7 - Disseminate: 7.1 update output systems; 7.2 produce dissemination products; 7.3 manage release of dissemination products; 7.4 promote dissemination products; 7.5 manage user support.
Update output systems (7.1), this sub-process in this layer manages the output update, adapting the already defined macro data to specific output systems, including re-formatting data and metadata into specific output databases and ensuring that data are linked to the relevant metadata. This process is related to the interoperability between the access layer and other external systems, e.g. toward the SDMX standard or an Open Data infrastructure.
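To illustrate the kind of re-formatting this sub-process performs toward an output system, the sketch below serializes already validated macro data, together with minimal metadata, into a simple XML message. The element and attribute names are invented and do not follow the actual SDMX-ML schema.

```python
# Hypothetical sketch of 7.1 update output systems: re-format validated macro
# data (plus minimal metadata) into an XML message for an external output
# system. Element and attribute names are illustrative, not real SDMX-ML.
import xml.etree.ElementTree as ET

macro_data = [
    {"nace": "47", "period": "2012", "turnover_index": 103.5},
    {"nace": "47", "period": "2013", "turnover_index": 101.2},
]

root = ET.Element("DataSet", attrib={"domain": "STS", "unit": "index_2010_100"})
for obs in macro_data:
    ET.SubElement(root, "Obs", attrib={
        "NACE": obs["nace"],
        "TIME_PERIOD": obs["period"],
        "OBS_VALUE": str(obs["turnover_index"]),
    })

print(ET.tostring(root, encoding="unicode"))
```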
Produce dissemination products (7.2), this sub-process produces final, previously designed statistical
products, which can take many forms including printed publications, press releases and web sites. Typical
steps include:
-preparing the product components (explanatory text, tables, charts etc.);
-assembling the components into products;
-editing the products and checking that they meet publication standards.
The production of dissemination products is a sort of integration process between tables, text and graphs. In general this is a production chain in which standard tables and comments from the scrutiny of the produced information are included.
Manage release of dissemination products (7.3), this sub-process ensures that all elements for the release
are in place including managing the timing of the release. It includes briefings for specific groups such as
the press or ministers, as well as the arrangements for any pre-release embargoes. It also includes the
provision of products to subscribers.
Promote dissemination products (7.4), this sub-process concerns the active promotion of the statistical
products produced in a specific statistical business process, to help them reach the widest possible
audience. It includes the use of customer relationship management tools, to better target potential users of
the products, as well as the use of tools including web sites, wikis and blogs to facilitate the process of
communicating statistical information to users.
Manage user support (7.5), this sub-process ensures that customer queries are recorded, and that
responses are provided within agreed deadlines. These queries should be regularly reviewed to provide an
input to the over-arching quality management process, as they can indicate new or changing user needs.
2.1.2 Management processes, the processes that govern the operation of a system
In an S-DWH we recognize fourteen over-arching statistical processes needed to support the statistical
production processes; nine of them are the same as in the GSBPM, while the remaining five are a
consequence of a fully active S-DWH approach.
In line with the GSBPM, the first nine over-arching processes are¹:
1. statistical program management – This includes systematic monitoring and reviewing of
emerging information requirements and emerging and changing data sources across all statistical
domains. It may result in the definition of new statistical business processes or the redesign of
existing ones
2. quality management – This process includes quality assessment and control mechanisms. It
recognizes the importance of evaluation and feedback throughout the statistical business process
3. metadata management – Metadata are generated and processed within each phase, there is,
therefore, a strong requirement for a metadata management system to ensure that the appropriate
metadata retain their links with data throughout the different phases
4. statistical framework management – This includes developing standards, for example
methodologies, concepts and classifications that apply across multiple processes
5. knowledge management – This ensures that statistical business processes are repeatable, mainly
through the maintenance of process documentation
6. data management – This includes process-independent considerations such as general data
security, custodianship and ownership
7. process data management – This includes the management of data and metadata generated by, and
providing information on, all parts of the statistical business process. (Process management is the
ensemble of activities for planning and monitoring the performance of a process; operations
management is the area of management concerned with overseeing, designing and controlling
production processes and redesigning business operations in the production of goods or services.)
8. provider management – This includes cross-process burden management, as well as topics such
as profiling and management of contact information (and thus has particularly close links with
statistical business processes that maintain registers)
9. customer management – This includes general marketing activities, promoting statistical literacy,
and dealing with non-specific customer feedback.
In addition, we should include five more over-arching management processes in order to coordinate the
actions of a fully active S-DWH infrastructure; they are:
10. S-DWH management – This includes all activities supporting the coordination between
statistical framework management, provider management, process data management and data
management
¹ http://www1.unece.org/stat/platform/download/attachments/8683538/GSBPM+Final.pdf?version=1
11. data capturing management – This includes all activities related to direct statistical or IT
support (help-desk) for respondents, i.e. the provision of specialized customer care for web-questionnaire compilation, or towards external institutions for acquiring archives.
12. output management – for general marketing activities, promoting statistical literacy, and dealing
with non-specific customer feedback.
13. web communication management – This includes data capturing management, customer management
and output management; an example is the effective management of a statistical web portal able to
support all front-office activities.
14. business register management (or, for institutions, civil registers) – this is a trade register kept
by the registration authorities, related to provider management and operational activities.
By definition, an S-DWH system includes all the sub-processes needed to carry out any production
process. Web communication management handles the contact between respondents and NSIs; this
includes providing a contact point for the collection and dissemination of data over the internet. It supports
several phases of the statistical business process, from collection to dissemination, and at the same time
provides the necessary support for respondents.
Business register management is an overall process, since the statistical, or legal, state of any enterprise is
archived and updated at the beginning and end of any production process.
2.1.3 Functional diagram for strategic over-arching processes
The strategic management processes, among the over-arching processes stated in the GSBPM and in the
extension for the S-DWH management functionalities, fall outside the S-DWH system but are still vital for
its functioning. These strategic functions are Statistical Program Management, Business Register
Management and Web Communication Management. The functional diagram below illustrates the
relation between the strategic over-arching processes and the operational management.
Figure 2: High level functional diagram representation
In the functional diagram functions are represented by modules whose interactions are represented by
flows. The diagram is a collection of coherent processes, which are continuously performed. Each module
is described with a box and contains everything necessary to execute the represented functionality.
As far as possible the GSBPM and GSIM are used to describe the functional architecture of an S-DWH;
thus the colours of the arrows in the functional diagrams refer to the four conceptual categories already
used inside the GSIM conceptual reference model.
The functional diagram in Figure 2 shows that the Statistical Program is the impulse for planning
statistical product requests. This generates two main activity phases: the first, Specify Needs, is related to
the sphere of strategic decisions, while the second leads to the design phase.
The basic input for new statistical information derives from the natural evolution of civil society or the
economic system. During this phase, needs are investigated and high-level objectives are established for
the output. The S-DWH is able to support this process by giving analysts access to all available
information to check whether the new concepts and variables are already managed in the S-DWH.
The design phase can be a consequence of new input from the statistical program, or of any new statistical
product or process improvement. The design phase describes the development and design activities
associated with practical research work, which includes the identification of concepts, methodologies,
collection instruments and operational processes.
Web communication management is an external component with a strong interdependency with the
S-DWH, since it is the interface for external users, respondents and the scientific and social community.
From an operational point of view, the provision of a contact point accessible over the internet, e.g. a
web portal, is a key factor for the relationship with respondents, for services related to direct or indirect
data capturing, and for the delivery of information products.
2.1.3.1 Functional diagram for operational over-arching processes
In order to analyze the functions that support a generic statistical business process, we describe the
functional diagram of Figure 2 in more detail. Expanding the module representing the S-DWH Management,
we can identify four more management functions within it: Statistical Framework Management, Provider
Management, Process Metadata Management and Data Management. Furthermore, by expanding the
Web Communication Management module we can identify three more functions: Data Capturing
Management, Customer Management and Output Management. This is shown in the diagram in Figure 3.
Figure 3: Functional Diagram, expanded representation.
The details in Figure 3 enable us to contextualize the nine phases of the GSBPM in an S-DWH functional
diagram. We represent the nine phases using connecting arrows between modules. For the arrows we use
the same four colors used in the GSIM to contextualize the objects.
The four layers in the S-DWH are placed in the Data Management function, labelled I° (Source layer), II°
(Integration layer), III° (Interpretation layer) and IV° (Access layer).
Specify Needs phase - This phase is the request for new statistics or an update of current statistics. The
flow is blue since this phase represents the building of Business Objects from the GSIM, i.e. activities for
planning statistical programs. This phase is a strategic activity in an S-DWH approach, because a first
overall analysis of all available data and metadata is carried out.
In the diagram we identify a sequence of functions starting from the Statistical Program, passing through
the Statistical Framework and ending with the Interpretation layer of Data Management. This module
relationship supports executives in order to “consult needs”, “identify concepts”, “estimate output
objectives” and “determine needs for information”.
The connection between the Statistical Framework and the Interpretation layer indicates the flow of
activities to “check data availability”, i.e. whether the available data could meet the information needs,
and under which conditions the data would be available. This action is supported by the “interpretation and
analysis layer” functionalities, in which data are available and easy to use for any expert to
determine whether they would be suitable for the new statistical purposes.
At the end of this action, statisticians should prepare a business case to get approval from executives or
from the Statistical Program manager.
Design phase - This phase describes the development and design activities, and any associated practical
research work needed to define the statistical outputs, concepts, methodologies, collection instruments
and operational processes. All these sub-processes can create active and/or passive meta data, functional
to the implementation process. Using the GSIM reference colours we colour this flow in blue to describe
activities for planning the statistical program, realized by the interaction between the statistical
framework, process metadata and provider management modules. Meanwhile the phase of conceptual
definition is represented by the interaction between the statistical framework and the interpretation layer.
The information related to the “design data collection methodology” impacts on the provider management
in order to “design the frame” and “sample methodology”. These designs specify the population of
interest, defining a sample frame based on the business register, and determine the most appropriate
sampling criteria and methodology in order to cover all output needs. It also uses information from the
provider management in order to coordinate samples between instances of the same statistical business
process (for example to manage overlap or rotation), and between different processes using a common
frame or register (for example to manage overlap or to spread response burden).
The operational activity definitions are based on a specific design of a statistical process methodology
which includes specification of routines for coding, editing, imputing, estimating, integrating, validating
and finalizing data sets. All methodological decisions are taken using concepts and instruments defined in
the Statistical Framework. The workflow definition is managed inside the Process Metadata and supports
the production system. If a new process requires a new concept, variable or instrument, those are defined
in the Statistical Framework.
Build phase – In this phase all sub-processes are built and tested to produce the system components.
For statistical outputs produced on a regular basis, this phase usually occurs for the first iteration, or
following a review or a change in methodology, rather than for each iteration.
In an S-DWH, which represents a generalized production infrastructure, this phase is based on code reuse,
and each new output production line is basically a workflow configuration. This has a direct impact on the
active metadata managed by process metadata, in order to execute the operational production flows
properly. In analogy with the GSIM, we colour this flow orange. Therefore, in an S-DWH the build phase
can be seen as a metadata configuration able to interconnect the Statistical Framework with the DWH
data structures.
Collect phase - This phase is related to all collection activities for necessary data, and loading of data into
the source layer of the S-DWH. This represents the first step of the operational production process and
therefore, in analogy with the GSIM the flow is colored red.
The two main modules involved with the collection phase in the functional diagram are Provider
Management and Data Capturing Management.
Provider Management includes cross-process burden, profiling and contact information management.
This is done by optimizing register information using three inputs: the first from the external official
Business Register, the second from respondents' feedback, and the third from the identification of the
sample for each survey.
Data capturing management collects external data into the source layer. Typically this phase does not
include any data transformations. We identify two main types of data capture: from controlled systems
and from non controlled systems. The first is data collection directly from respondents using instruments
which should include shared variable definitions and preliminary quality checks. A typical example is a
web questionnaire. The second type is, for example, data collected from an external archive. In this case a
conceptual mapping between internal and external statistical concepts is necessary before any data can be
loaded. Data mapping involves combining data residing in different sources and providing users with a
unified view of these data. These systems are formally defined as a triple <T,S,M> where T is the target
schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between
the source and the target schemas.
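A minimal sketch of this <T,S,M> formalisation is given below; the schema and field names are illustrative assumptions, and the mapping M is reduced to a simple field-renaming table.

```python
# Target schema T: the variables as defined internally in the S-DWH.
T = {"unit_id", "turnover", "employees"}

# Source schemas S: e.g. a web questionnaire and an external archive (illustrative).
S = {
    "web_questionnaire": {"resp_id", "q_turnover", "q_staff"},
    "tax_archive": {"vat_number", "declared_turnover"},
}

# Mapping M: (source, source_field) -> target variable.
M = {
    ("web_questionnaire", "resp_id"): "unit_id",
    ("web_questionnaire", "q_turnover"): "turnover",
    ("web_questionnaire", "q_staff"): "employees",
    ("tax_archive", "vat_number"): "unit_id",
    ("tax_archive", "declared_turnover"): "turnover",
}

def translate(source_name, record):
    """Rewrite one source record into the target schema using M."""
    return {M[(source_name, k)]: v for k, v in record.items()
            if (source_name, k) in M}

print(translate("tax_archive", {"vat_number": "EE123", "declared_turnover": 950.0}))
# -> {'unit_id': 'EE123', 'turnover': 950.0}
```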
Process phase - This phase covers the effective operational activities carried out by reviewers. It is based
on specific elaboration steps and corresponds to the typical ETL phase of a DWH. In an S-DWH it
describes the cleaning of data records and their preparation for output or analysis. The operational
sequence of activities follows the design of the survey configured in the metadata management. This
phase corresponds to the operational use of modules, and for this reason we colour this flow red, in
analogy with the management of production objects in the GSIM.
All the sub-processes “classify & code”, “review”, “validate & edit”, “impute”, “derive new variables and
statistical units”, “calculate weights”, “calculate aggregates” and “finalize data files” are carried out in the
“integration layer”, following ad hoc sequences depending on the type of survey.
The “integrate data” sub-process connects different sources and uses the Provider Management in order to
update the asynchronous business register status.
Analyze phase - This phase is central for any S-DWH, since during this phase statistical concepts are
produced, validated, examined in detail and made ready for dissemination. We therefore colour the
activity flow of this phase green in accordance with the GSIM.
In the diagram the flow is bidirectional, connecting the statistical framework and the interpretation layer of
the data management. This indicates that all non-consolidated concepts must first be created and
tested directly in the interpretation and analysis layer. This includes the use or definition of
measurements such as indices, trends or seasonally adjusted series. All the consolidated draft output can
then be automated for the next iteration and included directly in the ETL elaborations for direct output.
The Analyse phase includes primary data scrutiny and interpretation to support the data output. Scrutiny
is the statisticians' in-depth understanding of the data; they use that understanding to explain the statistics
produced in each cycle by evaluating how well they fit their initial expectations.
Disseminate phase - This manages the release of the statistical products to customers. For statistical
outputs produced regularly, this phase occurs for each iteration. From the GSBPM we have five sub
processes: “updating output systems”, “produce dissemination products”, “manage release of
dissemination products”, “promote dissemination products”, “manage user support”. All of these
sub-processes can be considered directly related to operational data warehousing.
The “update output systems” sub-process is the arrow connecting Data Management with Output
Management. This flow is coloured red to indicate the operational data uploading. Output Management
produces and manages the release of dissemination products, and promotes them, using the information
stored in the “access layer”.
Finally, the “finalize output” sub-process ensures that the statistics and associated information are fit for
purpose and reach the required quality level, and are thus ready for use. This sub-process is mainly
realized in the “interpretation and analysis” layer, and its evaluations are available at the access layer.
Archive phase - This phase manages the archiving and disposal of statistical data and metadata.
Considering that an S-DWH is substantially an integrated data system, this phase must be considered an
over-arching activity; i.e. in an S-DWH it is a central, structured, generalized activity for all S-DWH
levels. In this phase we include all the operational structured steps needed for Data Management, and the
flow is marked red.
In the GSBPM four sub-processes are considered: “define archive rules”, “manage archive repository”,
“preserve data and associated metadata” and “dispose of data and associated metadata”. Among them,
“define archive rules” is a typical metadata activity, while the others are operational functions.
The archive rules sub-process defines structural metadata, i.e. the definition of the structure of data (data
marts and primary data), metadata, variables, data dimensions, constraints, etc., and it defines process
metadata for the specific statistical business process, with regard to the general archiving policy of the
NSI or standards applied across the government sector.
The other sub-processes concern the management of one or more databases, the preservation of data and
metadata, and their disposal; these functions are operational in an S-DWH and depend on its design.
Evaluate phase - This phase provides the basic information for the overall quality evaluation
management. The evaluation is applied to all the S-DWH layers through the statistical framework
management. It takes place at the end of each sub process and the gathered quality information is stored
into the corresponding metadata structures of each layer. Evaluation material may take many forms: data
from monitoring systems, log files, feedback from users, or staff suggestions.
For statistical outputs produced regularly evaluation should, at least in theory, occur once for each
iteration. The evaluation is one key factor to determine whether future iterations should take place and
whether any improvements should be implemented. In a S-DWH context the evaluation phase always
involves evaluation of business processes for an integrated production.
2.2 Information Systems Architecture
The Information Systems architecture bridges the business to the infrastructure; in our context, it is
represented by a conceptual organization of the effective S-DWH, able to support tactical demands.
In the layered architecture, in terms of data systems, we identify:
- the staging data, usually of a temporary nature, whose contents can be erased or archived after
the DW has been loaded successfully;
- the operational data, a database designed to integrate data from multiple sources for additional
operations on the data; the data are then passed back to operational systems for further operations
and to the data warehouse for reporting;
- the Data Warehouse, the central repository of data, created by integrating data from one or more
disparate sources and storing current as well as historical data;
- the data marts, which are in the access layer and are used to get data out to the users; data marts
are derived from the primary information of a data warehouse and are usually oriented to specific
business lines.
[Figure: Layered S-DWH data architecture – the Source layer holds Staging Data fed by surveys (ICT, SBS, ETrade, …); the Integration layer holds Operational Data used for editing and exchanges operational information with the warehouse; the Interpretation and Analysis layer holds the Data Warehouse, used for analysis and data mining; the Access layer holds the Data Marts, used for analysis, reports and administration.]
2.2.1 S-DWH is a metadata-driven system
The management of metadata used and produced in the different layers of the warehouse is defined in the
Metadata framework² and Micro data linking³ documents. Metadata are used for the description,
identification and retrieval of information, and link the various layers of the S-DWH through the mapping
of different metadata description schemes.
² Lundell L.G. (2012) Metadata Framework for Statistical Data Warehousing, ver. 1.0. Deliverable 1.1.
³ Ennok M. et al. (2013) On Micro data linking and data warehousing in production of business statistics, ver. 1.1. Deliverable 1.4.
Descriptions of all statistical actions, all classifications in use, input and output variables, selected data
sources, descriptions of output tables, questionnaires and so on: all these meta-objects should be collected
into one metadata repository during the design and build phases.
The over-arching Metadata Management of an S-DWH as a metadata-driven system supports Data
Management within the statistical program of an NSI, and it is therefore vital to thoroughly manage the
metadata. To address this we refer to the conclusions of WP1.1⁴, where metadata are organized in six main
categories. The six main categories are:
- active metadata, metadata stored and organized in a way that enables operational use, manual or
automated
- passive metadata, any metadata that are not active
- formalised metadata, metadata stored and organised according to standardised codes, lists and
hierarchies
- free-form metadata, metadata that contain descriptive information using formats ranging from
completely free-form to partly formalised
- reference metadata, metadata that describe the contents and quality of the data in order to help
the user understand and evaluate them (conceptually)
- structural metadata, metadata that help the user find, identify, access and utilise the data
(physically)
Metadata in each of these categories may also belong to a specific type, or subset, of metadata. In WP1.1
the five subsets that are generally best known or considered most important are described (a minimal
representation sketch is given after the list); they are:
- statistical metadata, data about statistical data e.g. variable definition, register description, code
list
- process metadata, metadata that describe the expected or actual outcome of one or more
processes using evaluable and operational metrics
- quality metadata, any kind of metadata that contribute to the description or interpretation of the
quality of data;
- technical metadata, metadata that describe or define the physical storage or location of data
- authorization metadata are administrative data that are used by programmes, systems or
subsystems to manage users’ access to data
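As a purely illustrative sketch of how such a metadata catalogue could be represented, the following Python fragment models an item that carries one of the six main categories and, optionally, one of the five subsets; the class and the example item are assumptions, not part of the WP1.1 framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Category(Enum):          # the six main categories
    ACTIVE = "active"
    PASSIVE = "passive"
    FORMALISED = "formalised"
    FREE_FORM = "free-form"
    REFERENCE = "reference"
    STRUCTURAL = "structural"

class Subset(Enum):            # the five best-known subsets
    STATISTICAL = "statistical"
    PROCESS = "process"
    QUALITY = "quality"
    TECHNICAL = "technical"
    AUTHORIZATION = "authorization"

@dataclass
class MetadataItem:
    name: str
    category: Category
    subset: Optional[Subset]
    description: str

# Illustrative item: a formalised, statistical definition of a variable.
turnover_def = MetadataItem(
    name="turnover",
    category=Category.FORMALISED,
    subset=Subset.STATISTICAL,
    description="Net turnover of the enterprise, in thousands of euro.",
)
print(turnover_def)
```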
In the S-DWH one of the key factors is the consolidation of multiple databases into a single database,
identifying redundant columns of data for consolidation or elimination. This requires coherence of
statistical metadata, in particular for the managed variables. Statistical actions should collect unique input
variables, not just rows and columns of tables in a questionnaire. Each input variable should be collected
and processed once in each period of time, so that the outcome (the input variable in the warehouse) can be
used for producing various different outputs. This variable-centric focus triggers changes in almost all
phases of the statistical production process: samples, questionnaires, processing rules, imputation methods,
data sources, etc. must be designed and built in compliance with standardized input variables, not
according to the needs of one specific statistical action.
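A minimal sketch of this consolidation idea is shown below: it detects input variables collected by more than one statistical action so that they can be designed and collected only once per period. The survey names and variable lists are illustrative assumptions.

```python
from collections import defaultdict

# Which variables each statistical action currently collects (illustrative).
collected = {
    "SBS_survey": ["turnover", "employees", "wages"],
    "ICT_survey": ["turnover", "internet_sales"],
    "ETrade_survey": ["turnover", "export_value"],
}

# Invert the relation: variable -> list of actions that collect it.
by_variable = defaultdict(list)
for survey, variables in collected.items():
    for v in variables:
        by_variable[v].append(survey)

# Variables collected more than once are candidates for consolidation.
redundant = {v: s for v, s in by_variable.items() if len(s) > 1}
print(redundant)
# {'turnover': ['SBS_survey', 'ICT_survey', 'ETrade_survey']}
# -> 'turnover' should be collected and processed once per period
#    and reused by all three outputs.
```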
A variable-based statistical production system reduces the administrative burden, lowers the cost of
data collection and processing, and makes it possible to produce richer statistical output faster. Of course,
this is true within the boundaries of a standardized design: a coherent approach can be used if statisticians
plan their actions following a logical hierarchy of variable estimation in a common frame. What the IT
must then support is an adequate environment for designing this strategy.
As an example of a common strategy, consider Surveys 1 and 2, which collect data with questionnaires,
plus one administrative data source. This time the design-phase decisions, such as questionnaire design,
sample selection and imputation method, are made "globally", in view of the interests of all three data
sources. In this way the integration of processes gives us reusable data in the warehouse: the warehouse
now contains each variable only once, making it much easier to reuse and manage our valuable data.
⁴ Lundell L.G. (2012) Metadata Framework for Statistical Data Warehousing, ver. 1.0. Deliverable 1.1.
Another way of reusing data already in the warehouse is to calculate new variables. The following figure
illustrates the scenario where a new variable E is calculated from variables C* and D, already loaded into
the warehouse.
This means that data can be moved back from the warehouse to the integration layer; warehouse data can
be used in the integration layer for multiple purposes, and calculating new variables is only one example.
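The calculation can be sketched as follows; the derivation rule E = C* + D and the warehouse records are purely illustrative, since in practice the rule would be defined in the statistical framework and executed in the integration layer.

```python
# Illustrative warehouse records already containing variables C* and D.
warehouse = [
    {"unit_id": "A001", "C_star": 120.0, "D": 30.0},
    {"unit_id": "A002", "C_star": 75.5,  "D": 12.3},
]

# Move the data back to the integration layer, derive E, and reload it.
for row in warehouse:
    row["E"] = row["C_star"] + row["D"]   # hypothetical derivation rule

print(warehouse)
# [{'unit_id': 'A001', 'C_star': 120.0, 'D': 30.0, 'E': 150.0},
#  {'unit_id': 'A002', 'C_star': 75.5, 'D': 12.3, 'E': 87.8}]
```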
An integrated, variable-based warehouse opens the way to new subsequent statistical actions that do not
have to collect and process data and can produce statistics directly from the warehouse. By skipping the
collection and processing phases, new statistics and analyses can be produced much faster and more
cheaply than with a classical survey.
Designing and building a statistical production system according to the integrated warehouse model
initially takes more time and effort than building the stovepipe model. However, the maintenance costs of
an integrated warehouse system should be lower, and new products that can be produced faster and more
cheaply, to meet changing needs, should soon compensate for the initial investment.
The challenge in data warehouse environments is to integrate, rearrange and consolidate large volumes of
data from different sources to provide a new unified information base for business intelligence. To meet
this challenge, we propose that the processes defined in GSBPM are distributed into four groups of
specialized functionality, each represented as a layer in the S-DWH.
2.2.2 Layered approach of a full active S-DWH
The layered architecture reflects a conceptual organization in which we consider the first two levels as
purely operational statistical infrastructure, functional for acquiring, storing, editing and validating data,
and the last two layers as the effective data warehouse, i.e. levels in which data are accessible for data
analysis.
These reflect two different IT environments: an operational one, where we support semi-automatic
computer interaction systems, and an analytical one, the warehouse, where we maximize free human
interaction.
2.2.2.1 Source layer
The Source layer is the gathering point for all data that is going to be stored in the Data warehouse. Input
to the Source layer is data from both internal and external sources. Internal data is mainly data from
surveys carried out by the NSI, but it can also be data from maintenance programs used for manipulating
data in the Data warehouse. External data means administrative data which is data collected by someone
else, originally for some other purpose.
The structure of data in the Source layer depends on how the data are collected and on the design of the
various direct data collection processes internal to the NSI. The specifications of the collection processes
and their output, the data stored in the Source layer, have to be thoroughly described. Vital information
includes the name, meaning, definition and description of every collected variable. The collection process
itself must also be described, for example the source of a collected item, and when and how it was collected.
When data enter the source layer from an external source, or administrative archive, the data and their
related metadata must be checked in terms of completeness and coherence.
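A minimal sketch of such a completeness and coherence check is given below; the expected field list, the types and the incoming records are illustrative assumptions.

```python
# Hypothetical metadata description of the incoming external file:
# field name -> expected type.
expected = {"vat_number": str, "declared_turnover": float, "reference_year": int}

def check_record(record):
    """Return a list of completeness/coherence problems for one record."""
    problems = []
    for field, field_type in expected.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], field_type):
            problems.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return problems

incoming = [
    {"vat_number": "EE123", "declared_turnover": 950.0, "reference_year": 2012},
    {"vat_number": "EE456", "declared_turnover": "n/a"},   # incomplete and incoherent
]

for rec in incoming:
    print(rec.get("vat_number"), check_record(rec))
```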
From a data structure point of view, external data are stored with the same structure in which they arrive.
The integration towards the integration layer should then be realized by mapping each source variable to a
target variable, i.e. a variable internal to the S-DWH.
A mapping is a graphic or conceptual representation of relationships within the data, i.e. the process of
creating data element mappings between two distinct data models.
The common and original practice of mapping is the effective interpretation of an administrative archive
in terms of S-DWH definitions and meaning.
Data mapping involves combining data residing in different sources and providing users with a unified
view of these data. These systems are formally defined as a triple <T,S,M> where T is the target schema,
S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source
and the target schemas.
Queries over the data mapping system also assert the data linking between elements in the sources and
the business register units.
Internal sources do not need mapping, since their data collection processes are defined in the S-DWH
during the design phase using internal definitions.
Figure 4, Source layer overview
2.2.2.2 Integration layer
From the Source layer, data are loaded into the Integration layer. This represents an operational system
used to process the day-to-day transactions of an organization; such systems are designed to process
transactions efficiently and with integrity. The process of extracting data from source systems and
transforming them into useful content in the data warehouse is commonly called ETL (Extract, Transform,
Load). In the Extract step, data are moved from the Source layer and made accessible in the Integration
layer for further processing.
The Transformation step involves all the operational activities usually associated with the typical
statistical production process, examples of activities carried out during the transformation are:
• Find, and if possible, correct incorrect data
• Transform data to formats matching standard formats in the data warehouse
• Classify and code
• Derive new values
• Combine data from multiple sources
• Clean data, that is for example correct misspellings, remove duplicates and handle missing values
To accomplish the different transformation tasks, data already in the data warehouse are used to support
the work, for example combining existing data with new data to derive a new value, or using old data as a
basis for imputation.
Each variable in the data warehouse may be used for several different purposes in any number of
specified outputs. As soon as a variable is processed in the Integration layer in a way that makes it useful
in the context of data warehouse output it has to be loaded into the Interpretation layer and the Access
layer.
The Integration layer is an area for processing data; this is realized by operators specialized in ETL
functionalities. Since the focus of the Integration layer is on processing rather than search and analysis,
data in the Integration layer should be stored in a generalized, normalized structure optimized for OLTP
(Online Transaction Processing, a class of information systems that facilitate and manage
transaction-oriented applications, typically for data entry and retrieval), where all data are stored in a
similar data structure independently of the topic domain and each fact is stored in only one place, which
makes it easier to maintain consistent data.
During the various ETL processes a variable will likely appear in several versions. Every time a value is
corrected or changed for some other reason, the old value should not be erased; instead a new version of
that variable should be stored. This mechanism ensures that all items in the database can be followed
over time.
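The versioning rule can be sketched as an append-only store in which every correction adds a new version instead of overwriting the old value; the record layout and field names below are illustrative assumptions.

```python
from datetime import datetime, timezone

# Append-only store: (unit_id, variable, version, value, reason, timestamp).
versions = []

def store_value(unit_id, variable, value, reason):
    """Append a new version; never overwrite or erase previous values."""
    version = 1 + sum(1 for r in versions
                      if r[0] == unit_id and r[1] == variable)
    versions.append((unit_id, variable, version, value, reason,
                     datetime.now(timezone.utc)))

def current_value(unit_id, variable):
    """Return the value of the latest version, or None if never stored."""
    matching = [r for r in versions if r[0] == unit_id and r[1] == variable]
    return max(matching, key=lambda r: r[2])[3] if matching else None

store_value("A001", "turnover", 1500.0, "collected")
store_value("A001", "turnover", 1520.0, "edited: misreporting corrected")

print(current_value("A001", "turnover"))  # 1520.0 (latest version)
print(len(versions))                      # 2 (full history preserved)
```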
Figure 5, Integration layer overview
2.2.2.3 Interpretation layer
This layer contains all collected data, processed and structured to be optimized for analysis and as a base
for the output planned by the NSI. The Interpretation layer is specially designed for statistical experts and
is built to support data manipulation and large, complex search operations. Typical activities in the
Interpretation layer are hypothesis testing, data mining and the design of new statistical strategies, as well
as the design of data cubes functional to the Access layer.
The Interpretation layer will contain micro data (elementary observed facts, aggregations and calculated
values), but it will also retain all data at the finest level of granularity in order to cover all possible queries
and joins. A fine granularity is also a condition for managing changes in required output over time.
Besides the actual data warehouse content, the Interpretation layer may contain temporary data structures
and databases created and used by the different ongoing analysis projects carried out by statistics
specialists.
The ETL process continuously creates metadata regarding the variables and the process itself that is
stored as a part of the data warehouse.
In a relational database, fact tables of the Interpretation layer should be organized in a dimensional
structure to support data analysis in an intuitive and efficient way. Dimensional models are generally
structured with fact tables and their associated dimensions. Facts are generally numeric, and dimensions
are the reference information that gives context to the facts. For example, a sales trade transaction can be
broken up into facts, such as the number of products moved and the price paid for the products, and into
dimensions, such as order date, customer name and product number.
A key advantage of a dimensional approach is that the data warehouse is easy to use and operations on
data are very quick. In general, dimensional structures are easy to understand for business users, because
the structures are divided into measurements/facts and context/dimensions related to the organization’s
business processes.
The definition of a star schema would be realized by dynamic ad hoc queries from the integration layer,
driven by the proper metadata, generally in order to realize a data transposition query. With a dynamic
approach, any expert user can define their own analysis context starting from an already existing
materialized data mart, a virtual one, or a temporary environment derived from the data structure of the
integration layer. This method allows users to automatically build permanent or temporary data marts
according to their needs, leaving them free to test any possible new strategy.
A fact table consists of the measurements, metrics or facts of a statistical topic. Fact tables in the DWH are
organized in a dimensional model, built on a star-like schema, with dimensions surrounding the fact table.
In the S-DWH, fact tables are defined at the highest level of granularity, with information organized in
columns distinguished into dimensions, classifications and measures. Dimensions are the foundation of the
fact table and give context to the data it collects. Typically dimensions are nouns like date, class of
employment, territory, NACE, etc. For example, the date dimension could contain data such as year,
month and weekday.
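A minimal sketch of such a star-like organization is given below: one fact table at the finest granularity, with foreign keys into surrounding dimension tables. The dimensions, measures and the analytical join are illustrative assumptions.

```python
# Dimension tables: reference information giving context to the facts (illustrative).
dim_date = {1: {"year": 2012, "month": 12}}
dim_activity = {1: {"nace": "C10", "label": "Manufacture of food products"}}
dim_unit = {1: {"unit_id": "A001", "size_class": "50-249"}}

# Fact table: foreign keys into the dimensions plus the numeric measures.
fact_business = [
    {"date_key": 1, "activity_key": 1, "unit_key": 1,
     "turnover": 1520.0, "employees": 87},
]

# A typical analytical join: turnover by NACE activity.
for f in fact_business:
    activity = dim_activity[f["activity_key"]]["nace"]
    print(activity, f["turnover"])   # C10 1520.0
```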
2.2.2.4 Access layer
The Access layer is the layer for the final presentation, dissemination and delivery of information. It is
used by a wide range of users and software tools. The data are optimized for effective presentation and
compilation, and may be presented in data cubes and in different formats specialized to support different
tools and software. Generally the data structures are optimized either for MOLAP (Multidimensional
Online Analytical Processing), which uses specific analytical tools on a multidimensional data model, or
for ROLAP (Relational Online Analytical Processing), which uses specific analytical tools on a relational
dimensional data model that is easy to understand and does not require pre-computation and storage of
the information.
Figure 6, Access layer overview
Multidimensional structure is defined as “a variation of the relational model that uses multidimensional
structures to organize data and express the relationships between data”. The structure is broken into cubes
and the cubes are able to store and access data within the confines of each cube. “Each cell within a
multidimensional structure contains aggregated data related to elements along each of its dimensions”.
Even when data is manipulated it remains easy to access and continues to constitute a compact database
format. The data still remain interrelated. The multidimensional structure is quite popular for analytical
databases that use online analytical processing (OLAP) applications, because of its ability to deliver
answers to complex business queries swiftly. Data can be viewed from different angles, which gives a
broader perspective of a problem than other models.
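The cube idea can be sketched as follows: each cell holds data aggregated along the cube's dimensions (here activity by reference year), and a "slice" is obtained by fixing one dimension. The micro records are illustrative.

```python
from collections import defaultdict

# Illustrative micro records to be aggregated into a two-dimensional cube.
micro = [
    {"nace": "C10", "year": 2012, "turnover": 1520.0},
    {"nace": "C10", "year": 2012, "turnover": 640.0},
    {"nace": "C11", "year": 2012, "turnover": 887.1},
]

# Each cube cell is identified by its coordinates along the dimensions.
cube = defaultdict(float)          # (nace, year) -> aggregated turnover
for r in micro:
    cube[(r["nace"], r["year"])] += r["turnover"]

# "Slicing" the cube: all activities for one reference year.
print({k: v for k, v in cube.items() if k[1] == 2012})
# {('C10', 2012): 2160.0, ('C11', 2012): 887.1}
```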
2.2.3 CORE services and reuse of components
In an S-DWH there are several types of workflows: the acceptance and inclusion of sources in the
integration layer, the updates of data in the interpretation layer, the updates of the in-house dissemination
database, the updates of the dissemination database in the access layer, and so on.
All of them are automated data flows that should be seen as independent workflows. One such flow
extracts raw data from the source layer, checks the coherence between data and metadata and processes
the data in the integration layer; other flows load the interpretation layer. In the other direction, a flow
brings cleansed data back to the source layer for pre-filling questionnaires, prepares sample data for
collection systems, etc. Let us call this the processing flow. Processing flows should be built up around
input variables, or groups of input variables, to feed the variable-based warehouse. Generate-and-publish-cube
flows are built around cubes, i.e. each flow generates or publishes a cube.
There are many software tools available to build these modular flows. S-DWH’s layered architecture
itself provides the possibility to use different platforms and software in separate layers, i.e. to re-use
components already available in-house or internationally. In addition, different software can be used
inside the same layer to build up one particular flow. The problems arise when we try to use these
different modules and different data formats together.
Let us take a deeper look at CORE services. If they are used to move data between S-DWH layers, and
also inside the layers between different sub-tasks (e.g. edit, impute, etc.), then it becomes easier to use
software provided by the statistical community, or to re-use self-developed components, to build flows for
different purposes.
Generally CORE (COmmon Reference Environment) is an environment supporting the definition of
statistical processes and their automated execution. CORE processes are designed in a standard way,
starting from available services; specifically, process definition is provided in terms of abstract statistical
services that can be mapped to specific IT tools. CORE goes in the direction of fostering the sharing of
tools among NSIs. Indeed, a tool developed by a specific NSI can be wrapped according to CORE
principles, and thus is easily integrated within a statistical process of another NSI. Moreover, having a
single environment for the execution of all statistical processes provides a high level of automation and
complete reproducibility of process execution.
NSIs produce Official Statistics with very similar goals, hence several activities related to the production
of statistics are common. Nevertheless, such activities are currently carried out independently, without
relying on shared solutions. Sharing a common architecture would reduce the costs of duplicated activities
and, at the same time, improve the quality of the produced statistics, thanks to the adoption of standardized
solutions.
The main principles underlying CORA design are:
d) Platform Independence. NSIs use various platforms (e.g. hardware, operating systems,
database management systems, statistical software), hence an architecture is bound to fail if it
endeavours to impose standards at a technical level. Moreover, platform independence allows
statistical processes to be modelled at a "conceptual level", so that they do not need to be modified
when the implementation of a service changes.
e) Service Orientation. The vision is that the production of statistics takes place through services
calling other services. Hence services are the modular building blocks of the architecture. By
having clear communication interfaces, services implement principles of modern software
engineering like encapsulation and modularity.
f) Layered Approach. According to this principle, some services are rich and are positioned at the
top of the statistical process, so, for instance a publishing service requires the output of all sorts of
services positioned earlier in the statistical process, such as collecting data and storing
information. The ambition of this model is to bridge the whole range of layers from collection to
publication by describing all layers in terms of services delivered to a higher layer, in such a way
that each layer is dependent only on the first lower layer.
For us it is very important to make some transitions and mappings between different models and
approaches. Unfortunately mapping a CORE process to a business model is not possible because the
CORE model is an information model and there is no way to map a business model onto an information
model in a direct way. The two models are about different things. They can only be connected if this
connection is in some way a part of the models.
The CORE information model was designed with such a mapping in mind. Within this model, a statistical
service is an object, and one of its attributes is a reference to its GSBPM process. Considering the
GSBPM a business model, any mapping of the CORE model onto a business model has to go through this
reference to the GSBPM.
Usually different services rely on their own tools, which expect different data formats, so conversions are
needed for service interactions. Conversions are evidently expensive; by using CORE services for
interactions, the number of conversions can be reduced noticeably.
In a general sense, an integration API allows a tool to be wrapped in order to make it CORE-compliant, i.e. a
CORE executable service. A CORE service is indeed composed of an inner part, which is the tool to be
wrapped, and of input and output integration APIs. Such APIs transform data between the CORE model
and the tool-specific format.
As anticipated, CORE mappings are designed for classes of tools and hence integration APIs should
support the admitted transformations, e.g. CSV-to-CORE and CORE-to-CSV, Relational-to-CORE and
CORE-to-Relational, etc.
Basically, the integration API consists of a set of transformation components. Each transformation
component corresponds to a specific data format and the principal elements of their design are specific
mapping files, description files and transform operations.
In order to provide input to a tool (the inner part of a CORE service), the Transform-from-CORE operation
is invoked. Conversely, the output of the tool is converted by the Transform-to-CORE operation. A
transformation must be launched for each single input or output file.
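A minimal sketch of this wrapping pattern is given below; the CORE record layout, the CSV convention and the wrapped "editing tool" are all illustrative assumptions rather than the actual CORE integration API.

```python
import csv, io

def transform_from_core(core_records):
    """CORE-to-CSV: serialise generic CORE records into CSV for the tool."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(core_records[0]))
    writer.writeheader()
    writer.writerows(core_records)
    return buf.getvalue()

def transform_to_core(csv_text):
    """CSV-to-CORE: parse the tool output back into generic CORE records."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def editing_tool(csv_in):
    """Stand-in for the wrapped inner tool (e.g. a trivial editing routine)."""
    rows = list(csv.DictReader(io.StringIO(csv_in)))
    for r in rows:
        r["turnover"] = max(0.0, float(r["turnover"]))  # illustrative 'edit'
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def core_service(core_records):
    """A CORE executable service: input API -> inner tool -> output API."""
    return transform_to_core(editing_tool(transform_from_core(core_records)))

print(core_service([{"unit_id": "A001", "turnover": "-5.0"}]))
```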
In this way the reuse of components can be achieved easily and efficiently.