Generic Statistical Information Model (GSIM):
Communication
(Version 0.4, May 2012)
DRAFT FOR REVIEW
Please note the development of GSIM is a work in progress. GSIM v0.4 is not intended for
official publication.
Instructions for reviewers and a template for providing feedback are available at
http://www1.unece.org/stat/platform/display/metis/GSIM+Version+0.4
About this document
This document is part of the Communication Layer of GSIM. It is aimed at subject matter
statisticians, methodologists, process designers, business architects, etc. It consists of one
main paper with a number of annexes. It provides more detailed information about the
information represented in GSIM (including definitions and diagrams of lower level objects),
descriptions of how the model could be used, use cases, and descriptions of relationships
to other models and standards.
This work is licensed under the Creative Commons Attribution 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/3.0/. If you re-use all or part
of this work, please attribute it to the United Nations Economic
Commission for Europe (UNECE), on behalf of the international
statistical community.
Table of Contents
Introduction
    Scope
    Design Principles
    Overview of GSIM (as produced at the end of Sprint 2)
    Relationship to GSBPM
    Relationships to other standards
    Future work on GSIM
Annex A: Further detail on the model
    Activity
    Production
    Conceptual
    Information
Annex B: Using the Object level and Use cases
    Making use of the GSIM Object level
    Use Case 1: Design statistics on environmental investments by businesses
    Use Case 2: Harvesting Data off The Web
Annex C: Mapping to other standards and models
    Relationship between GSBPM and GSIM
    Relationship between SDMX and GSIM
    Relationship between DDI and GSIM
    Relationship between SDMX and DDI and potential impacts on GSIM
    Relationship between CORE and GSIM
    Relationships with other standards and models
Annex D. GSIM Metamodel
Introduction
1.
The Generic Statistical Information Model (GSIM) is a reference framework of
information objects, which enables generic descriptions of the definition, management, and
use of data and metadata throughout the statistical production process.
2.
GSIM provides a set of standardized, consistently described information objects,
which are the inputs and outputs in the design and production of statistics. As a reference
framework, it helps readers understand significant relationships among the entities involved
in statistical production, and supports the development of consistent standards or
specifications.
3.
GSIM is one of the cornerstones for modernizing statistical production and moving
away from traditional silos. By defining and grouping objects common to all statistical
production, regardless of subject matter, GSIM enables statistical organizations to rethink
how their business could be organized to generate economies of scale.
4.
A model alone cannot transform an organization or its processes, but GSIM is
modeled to allow for innovative approaches to statistical production to the greatest extent
possible, for example, in the area of dissemination where demands for agility and innovation
are increasing. At the same time, GSIM supports more traditional approaches of producing
statistics.
Scope
5.
GSIM provides the information object framework supporting all statistical production
processes as described in the GSBPM, giving the information objects agreed names, defining
them, specifying essential properties, and indicating their relationships with other information
objects. It does not, however, make assumptions about the standards or technologies used in
implementation.
6.
The information objects defined include those to allow the specification and
introduction of new data sources for more innovative data collection, and also the generation
of new statistical products.
7.
GSIM does not include information objects related to supporting business functions
within an organization such as human resources, finance, or legal functions, except to the
extent that this information is used directly in statistical production.
Design Principles
8.
The following design principles were used in developing GSIM:
1. GSIM supports GSBPM and covers the whole statistical process
2. GSIM can also be used stand-alone in any statistical production
environment
3. GSIM has an intuitive appeal to all stakeholders
4. GSIM supports the design, documentation and maintenance of statistical
products
5. GSIM enables explicit separation of the design and production phases
6. GSIM enables both traditional and new ways of producing statistics
7. GSIM supplies links between process steps at all desired levels of
granularity
8. GSIM provides a basis for common understanding of information objects
and their definitions
9. GSIM uses a layered approach
10. GSIM contains information objects only down to the level of agreement
between key stakeholders
11. GSIM is robust, but can be easily adapted and extended to meet users’
needs
12. GSIM objects and relationships are represented as simply as possible
13. GSIM makes optimal reuse of existing terms and definitions
14. GSIM does not refer to any specific IT setting or tool
15. GSIM defines and classifies its information objects appropriately,
including specification of attributes, relations and operations.
Background to the development of GSIM
10.
The need for a Generic Statistical Information Model (GSIM) was first agreed at the
2010 Meeting on Management of Statistical Information Systems (MSIS). Since then, various
statistical organizations have been working together to develop the model. Figure 1 provides
an overview of the history of GSIM development.
Figure 1. Milestones in the development of GSIM
11.
The development of GSIM forms a key part of the strategic vision of the High Level
Group on Strategic Business Architecture for Statistics (HLG-BAS) - a group of heads of
National Statistical Institutes and International Agencies that support a common vision to
modernize statistical production.
“To enable statistical organizations to arrive at standardized generic industrialized
production of statistics, we first need to find one another at the conceptual
level ... under the umbrella of the GSBPM and the GSIM. This is a very high
ambition which will take time.”
HLG-BAS Strategic Vision (June 2011)
12.
GSIM is a cornerstone that needs to be developed in order to fulfill the HLG-BAS
vision. Given this, development work was accelerated under the sponsorship of the HLG-BAS. The agreed approach to rapidly develop GSIM was to conduct two “sprint” sessions
between February and April 2012. Each sprint went for two weeks, the first being held in
Ljubljana, Slovenia from 20 February to 2 March 2012 and the second in Daejeon, Republic
of Korea from 16 to 27 April 2012.
Overview of GSIM (as produced at the end of Sprint 2)
What is new in GSIM v0.4
13.
This version of GSIM reflects the work to further develop the model at the second
sprint event, building on GSIM v0.3 from the first sprint. A range of feedback was received
on v0.3 of GSIM and this has been reviewed and integrated into v0.4.
14.
Key changes between GSIM v0.3 and v0.4 include:
- increased number of information objects
- elaborated definitions of information objects and the relationships between them
- improved alignment of GSIM with the DDI, SDMX, Neuchâtel and ISO 11179 standards
- changes to the structure of the model to reflect deeper understanding of the role and relationships of the sets of information objects within the model
- review of terms used for object groups (now Activity, Production, Conceptual and Information) and for the information objects themselves
- revised structure and content of GSIM documentation
GSIM v0.4
15.
Similar to GSBPM, GSIM uses a layered approach. There are a limited number of
items at the highest level with more details given at the lower levels of the model. Each level
is designed to be used for a particular purpose and by different audiences.
Group Level
16.
The Group level is an overview level aimed at explaining GSIM to top managers. It
shows the different groups of information objects that GSIM consists of.
Figure 2. Group level diagram of GSIM
17.
The Group level diagram (Figure 2) is designed to be read clockwise starting with
Activity. A statistical agency starts an Activity; this initiates Production, which uses
information objects in the Conceptual group and produces Information.
- The Activity group contains sets of information objects required to manage the programs that make up statistical production.
- The Production group contains sets of information objects that describe the processes, methods and rules that are used in statistical production.
- The Conceptual group contains sets of information objects that describe the concepts used and their practical implementation, allowing users to understand what the statistics are measuring.
- The Information group contains sets of information objects that describe the results of the stages of statistical production.
Set level
18.
The Set Level is a communication level for statisticians, process designers,
methodologists, and business architects.
19.
At this level, the Groups are expanded to show sets of information objects which are at
a lower level. The Sets are not information objects in themselves. Rather they are
representative of the information objects that are at the lowest level of the model. For
example, the Population Unit set includes the information objects population and statistical
unit.
Figure 3. Set level diagram of GSIM
Object level
20.
The Object level is the most detailed level of GSIM. It is a specification level for IT
architects and metadata specialists. Figure 4 shows an indicative UML model of the
information objects at this level, grouped by Activity, Production, Conceptual and
Information.
Figure 4. UML diagram of Object level
21.
Further details about the Object level can be found in the Annexes of this document.
Annex A provides detailed models of each Group – including lower level objects, definitions,
relationships and explanatory text. Annex B provides information about how to use the
information objects and gives two use cases. Annex D provides the GSIM metamodel.
22.
Figure 5 below shows an alternative view of the Object level. It does not describe the
entire Object level; rather, it is intended for users who are not interested in the full detail
of the Object level but are still interested in some of the objects and relationships.
Figure 5. Communication diagram of Object Level
Methodology and Quality in GSIM
Methodology
23.
Methodology is embedded in GSIM in the components, rules and parameters. These
information objects implement methodology in the design of a statistical production process.
The document that specifies the design of a statistical production process lays down the
methodology of this process.
Figure 6. Methodology in GSIM
24.
Below is an excerpt from the Dutch use case on environmental investments. Further
detail on this use case can be found in Annex B.
In order to specify the methodology one could refer to a standard method that identifies
1000-errors. For example, one could compare the value of the variable to be validated
to a so-called reference value. If the ratio is larger than a parameter a, say a = 400, then
there is probably a 1000-error.
Now, in this example, the process step that validates and corrects 1000-errors should
‘know’ which variables have to be checked, the specification of the edit rules in terms
of the parameter a, and the specification of the parameter a. We note that the
specification of the parameter a may be different for different variables. The output of
the process step always has the same structure: it provides quality indicators for each
variable that has been checked and a new status for the values of these variables.
One comment is in order:
- The reference values are used as auxiliary information. The designer of the flow should consider whether these reference values stem from an external source (and should be modeled as an additional input source) or from an activity inside the process. In the latter case this activity should also be modeled. This could be a third (preparing) sub-process.
Required objects:
- Rule: there is a linear edit involved: variable x > a multiplied by variable y.
- Component:
  - There is an edit part in the component: if the edit is ‘violated’, then variable z is ‘1000-error’; otherwise variable z is ‘not 1000-error’.
  - There is an imputation part in the component: if z is ‘1000-error’, then variable x = variable x divided by 1000; otherwise, if z is ‘not 1000-error’, then variable x = variable x.
- Parameter: a
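To make the example concrete, the following is a minimal, hypothetical Python sketch of the 1000-error edit and imputation rule described above; the function names and the default threshold a = 400 are illustrative assumptions, not part of GSIM or of the Dutch use case itself.

```python
# Hypothetical sketch of the 1000-error edit and imputation rule.
def check_1000_error(value, reference_value, a=400):
    """Edit rule: flag a probable 1000-error when value / reference exceeds a."""
    return reference_value > 0 and value / reference_value > a

def impute_1000_error(value, reference_value, a=400):
    """Imputation rule: divide by 1000 when the edit is violated, else keep the value.
    Returns the (possibly corrected) value and a quality indicator for the variable."""
    if check_1000_error(value, reference_value, a):
        return value / 1000, "1000-error (corrected)"
    return value, "not 1000-error"

# Example: an investment reported in euros instead of thousands of euros.
corrected, status = impute_1000_error(value=2_500_000, reference_value=2_400)
print(corrected, status)  # 2500.0  1000-error (corrected)
```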
Quality
25.
While methodology is embedded in the design of the statistical production process,
quality is linked to the instance (i.e. to production runs) of the process.
26.
Quality appears at different levels in instances of information objects: for example, as
an attribute of an information element (e.g. a quality flag), or as an attribute of a data set
(e.g. a status of provisional, final or revised data). It also appears as process quality
information, for instance the OK-index in the use case on Dutch Environmental Investments
(see Annex B for further details), which is used both for selecting records for imputation and
as a process quality indicator. Product quality is laid down in a quality report, which is itself
also a statistical product.
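As a rough illustration of quality attaching at these different levels, the hypothetical sketch below uses invented attribute names (quality_flag, status, ok_index); GSIM does not prescribe such fields.

```python
# Hypothetical sketch: quality attached at different levels of information objects.
from dataclasses import dataclass

@dataclass
class DataPoint:
    value: float
    quality_flag: str      # quality as an attribute of an information element

@dataclass
class DataSet:
    status: str            # e.g. "provisional", "final", "revised"

@dataclass
class ProcessQuality:
    ok_index: float        # a process quality indicator, as in the Dutch use case

dataset = DataSet(status="provisional")
point = DataPoint(value=2500.0, quality_flag="imputed")
run_quality = ProcessQuality(ok_index=0.93)
```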
Relationship to GSBPM
27.
GSIM and GSBPM are complementary models for the production and management of
statistical information. GSIM focuses on information objects used and/or produced in a
statistical business process. The GSBPM, as a common reference model for the statistical
business process, is intended to apply to all activities undertaken by producers of official
statistics that result in information outputs.
28.
Just as a particular statistical business process may not require every
sub-process described within the GSBPM, not every information object in GSIM is
necessarily used and/or produced in the course of every statistical business
process.
29.
As described in the strategic vision of the High Level Group for Strategic Directions
in Business Architecture in Statistics, much greater value will be obtained from GSIM if it is
applied in conjunction with GSBPM. Likewise, greater value will be obtained from GSBPM
if it is applied in conjunction with GSIM. For example, harmonization of metadata is
necessary in order to achieve standardization of production processes.
30.
Nevertheless, just as GSBPM has been applied to date without GSIM, it is possible
(although usually less than ideal) to apply GSIM without GSBPM. For example, an agency
may currently be using a local variation on GSBPM, rather than GSBPM itself, to model its
statistical business processes. This decision about modeling statistical business processes
should not necessarily prevent it from deciding to apply GSIM as a reference framework for
statistical information.
31.
In the context of GSBPM, GSIM can be harnessed as a tool to help describe and
define the interrelated set of sub-processes within a statistical business process and the types
of information used in those processes to produce official statistics.
32.
Applying GSIM as a reference framework in this manner can:
- facilitate building efficient metadata-driven collection, processing, and dissemination systems
- help harmonize statistical computing infrastructures
33.
GSIM identifies and describes the information, both data and metadata, supporting the
GSBPM phases, sub-processes and the overarching processes of Quality Management and
Metadata Management. Good metadata management is essential for the efficient operation of
statistical business processes. Metadata are present in every phase of the GSBPM, either
created, updated or carried forward unchanged from a previous phase. In the context of the
GSBPM, the emphasis of the over-arching process of metadata management is on the
creation, updating, use and reuse of statistical metadata. Metadata management strategy and
systems are therefore vital to the operation of the GSBPM.
34.
GSIM is intended to support the overarching strategies and systems used to create
and manage metadata as well as the statistical processing systems which then harness this
well designed and managed metadata.
35.
All process steps in GSBPM should be documented, and thus provide reference
metadata. Several aspects of reference metadata are covered by GSIM objects. For example:
the conceptual part of reference metadata is represented in GSIM in the group of conceptual
objects, the methodological and procedural aspects of reference metadata are represented by
GSIM objects in the production group. Other aspects of reference metadata may be modelled
by means of object attributes in future GSIM versions. Reference metadata can be attached to
any information object and, as such, can be an input to as well as an output of a GSBPM
process step.
36.
GSIM supports a consistent approach to metadata, facilitating the primary role for
metadata envisaged in the UNECE guide related to Statistical Metadata in a Corporate
Context [2], i.e. that metadata is to uniquely and formally define the content and links between
objects and processes in the statistical information system.
Relationships to other standards
37.
One of the design principles of the GSIM is to make optimal reuse of existing terms
and definitions, wherever possible, to facilitate the use of the reference framework. In
developing GSIM, many existing models and standards have been examined, both to
determine the best approach for describing GSIM’s objects, and also to test the completeness
and usability of the resulting GSIM model. Annex C explores relationships between GSIM
and a number of other standards and models.
38.
GSIM must be implementable: In order to support the implementation of the GSIM
reference framework, many known standards and tools have also been examined, to ensure
that the reference framework is complete and useful in this respect. The relationship between
GSIM and other models and standards is two-fold: the standards and models serve as inputs
to the creation of GSIM, and also act as targets for the use of GSIM within organizations.
39.
By taking this approach, it is hoped that the GSIM model will be as similar as
possible to the information which user organizations already have within their statistical
production systems, allowing GSIM to be more understandable and easier to implement.
40.
Figure 7 illustrates how different relevant standards, models, and implementation
syntaxes and tools relate to GSIM. Standards/models that have provided significant input to
the GSIM model are presented on the left hand side of the figure. Implementation syntaxes
and tools that are currently of relevance to an implementation of GSIM are presented on the
right hand side of the figure. This list will become outdated as more and more
implementation syntaxes and tools are developed. The particular software packages listed are
widely used in statistical organizations, but are intended to be illustrative examples, and are
not a complete list.
[2] http://www1.unece.org/stat/platform/display/metis/Part+A+-Statistical+Metadata+in+a+Corporate+Context
* There are too many others to show in the diagram
Figure 7. GSIM and its relationship to other relevant standards and models.
Future work on GSIM
Roadmap for getting to GSIM v1.0
41.
A detailed roadmap, outlining the work to be done to develop public release v1.0 of
GSIM, has been prepared. It comprises three parts:
Additional modeling to complete the Specification layer of GSIM
42.
Completing the Specification Layer for inclusion in GSIM v1.0 will require additional
modeling. Mechanisms to achieve this may include the establishment of short-term task
forces for each part of GSIM.
Establish GSIM Integration Team and conduct workshop
43.
A workshop will be held to integrate work on the Specification Layer and to review the input
received from external reviewers of the Overview layer and Communication layer as
documented within GSIM v0.4. The output from this workshop would form GSIM v0.8,
complete with all levels of detail required for implementation, which would then need
another round of review.
Preparation of GSIM v1.0
44.
The external review feedback on the complete documentation (GSIM v0.8) will be analyzed
and addressed, and the proposed specification of GSIM v1.0 drafted. An accompanying
Communication Plan and User Guide would also be completed. These materials would need
to go through the relevant processes to be endorsed and adopted by the statistical community.
Beyond GSIM v1.0
45.
As GSIM is a newly developed framework, it is expected that additions and changes will be
identified as it is applied in practice. An updated version of GSIM (e.g. v1.1) may be
warranted, for example, within a year of the release of v1.0.
46.
Processes will be established to capture feedback from practical use and feed this into
further evolution of GSIM. A process for setting and developing a release schedule will be
established, together with a process for design and stakeholder review of proposed new
releases. New releases will be designed in a way which minimizes the extent of change from
previous versions (e.g. maximize backwards compatibility) while still meeting the business
needs which initiated development of a new release.
47.
Release of updated versions would be expected to become less frequent once the
initial set of additional requirements encountered through widespread practical application of
GSIM had been addressed. New requirements within the community of producers of official
statistics, however, could still initiate the need for updated versions of GSIM.
Annex A: Further detail on the model
49.
The contents of this annex are the working notes from Sprint 2. It is known that they
are not comprehensive and not fully integrated at this time. For example, only main
relationships are shown in the diagrams. The current notes will be rationalized and detailed as
part of the process of completing the Specification Layer. Readers are welcome to provide
detailed feedback now, but will have a further opportunity when the first full draft of the
Specification Layer is released for review in September 2012.
50.
The objects in the detailed models of the specification layer have the same colour as
the objects in the group and set level. Hence, in the detailed Specification Layer
diagrams, the objects describing Activity are blue, those describing Production are red,
Conceptual green and Information yellow. Information objects can appear in more than one
Specification Layer diagram; if so, the Specification Layer object will have the colour of the
group object that it belongs to.
Activity
Figure A1. Set level diagram of Activity group
51.
This part of the model outlines the information that identifies and defines the
statistical production activities (within the scope of GSIM) undertaken by agencies and the
information that is required to describe the contributing processes.
52.
An agency will receive an Information Request that identifies the information that a
person or organisation in the user community [3] requires for a particular purpose (defined in
relation to a concept and population). From this Information Request an agency will generate
a Business Case and it is this that initiates a new Statistical Program. As the Statistical
Program is initiated a set of Requirements will be developed that are informed by the
Information Request.
[3] This community may include units within the agency as well as external to it. For example, a unit responsible for compiling National Accounts may need a new statistical activity to be initiated to produce new inputs to their compilation process.
53.
Processing of the Statistical Program will support two distinct use cases. It supports
both the traditional approach of collecting data for a particular need, and also the emerging
and future approach of collecting data and producing new outputs based on an existing data
source that is maintained and added to over time.
54.
In the case of the traditional silo/stovepipe approach, an agency receives an
Information Request and a set of Requirements, and approves a Business Case. When this
happens, a new Statistical Program is initiated. This Statistical Program will identify the
Data Resource that it will need (existing or needing to be created). Once designed, the
Statistical Program will have one or more iterations of a Statistical Project, to investigate a
set of characteristics for a given Population in relation to a particular time period. If the
identified Data Resource is not sufficient for the purposes of the Statistical Project, an
Acquisition Program will be initiated, together with an Acquisition Project (for each instance
of the time period) which will add Datasets to the Data Resource. Once this is complete, the
Statistical Project will use a particular Dataset from the Data Resource to produce one or
more Products or Services.
55.
A possible future approach relates to a continuous collection process. In the age of
‘big data’, the cost of collecting and storing data (e.g. administrative data) is low. An agency
can feasibly collect data on a continuous basis without a particular Statistical Project,
Product or Service in mind. In this case the agency has an Acquisition Program which
consists of one or more Acquisition Projects that gather data relating to a particular point in
time and add that data to the Data Resource. This Data Resource may then be used by any
Statistical Program in the future.
56.
Acquisition Project represents the gathering of data from an internal or external
source. The objects concerned with the process of gathering data are detailed further below.
Figure A2. Object level diagram of Activity group
57.
Other objects needed for describing the collection of data are portrayed in Figure A5,
and cover surveys, administrative register data, data reported electronically, data obtained
from websites, data generated by devices, data collected through clinical procedures, and any
other form of data acquisition. The aim is to provide a flexible model which can be used to
represent the objects required to support existing techniques for data collection, but which
also supports new and innovative approaches.
data acquisition (surveys, administrative registers, and Internet robots) in this model (see
description of the Production group below), it is anticipated that additional specific types
could be described in future, as they become more important for statistical organizations.
58.
Note that the term “statistical organization” is used here to describe the agency acting
as the data collector; this organization could be a division within the overall statistical
agency, collecting data from within another division of the same organization. The model is
useful in describing both data acquisition from external organizations and individuals, and
also for data acquisition internal to the organization.
59.
The Instrument object describes the tool used to collect data. This could be a
traditional survey form, an administrative register, a clinical procedure, a software agent
scraping data from websites, or any other tool. Instrument is described from the perspective
of the statistical organization collecting the data. The contents of an administrative register
might be originally collected using printed forms of some type, by the administrative agency,
but this information is not recorded as an Instrument description for the purposes of GSIM. It
might instead be captured as quality information about the register’s contents if relevant. The
Administrative Register itself would be considered as the Instrument in such a case.
60.
The Provision Agreement represents a set of agreements around the provision of data
by the Data Provider to the statistical organization. This could be a service-level agreement,
a legal mandate, the terms of mutual agreement, or any other terms/conditions which affect
the provision of data. In some cases these terms will be drawn from elsewhere in the GSIM
model (for example, the agreed structure of the data to be provided, the use of specific
classifications or concepts, etc.).
Table A1. Definitions of information objects in the Activity Group
| GSIM Group | GSIM Set | GSIM Object | Definition | Source | Example |
|---|---|---|---|---|---|
| Activity | | | The group Activity contains sets of objects that are required to manage the programs that make up statistical production. This includes data acquisition, statistical and dissemination activities. | GSIM Sprint 2 | Information Request, Acquisition Program, Statistical Program, Dissemination Program |
| Activity | Information Request | | The Information Request set contains objects that describe the data required for a particular purpose. | GSIM Sprint 2 | Business Case, Requirements |
| Activity | Information Request | Business Case | A Business Case gives the reason for investing in and undertaking a particular statistical activity. | GSIM Sprint 2 | |
| Activity | Information Request | Information Request | An Information Request outlines the data required for a particular purpose. | GSIM Sprint 2 | |
| Activity | Information Request | Requirement | A Requirement is a specification of details of the concepts, population and outputs required to meet a particular information need. | GSIM Sprint 2 | |
| Activity | Acquisition Program | | The Acquisition Program set contains objects to represent a set of activities undertaken by statistical agencies to gather data. | GSIM Sprint 2 | Acquisition Project, Instrument |
| Activity | Acquisition Program | Acquisition Program | An Acquisition Program is a set of activities undertaken by statistical agencies to gather data. | GSIM Sprint 2 | Korean Population Census processing and analysis design |
| Activity | Acquisition Program | Acquisition Project | An Acquisition Project is a set of activities undertaken by statistical agencies to gather data relating to a particular reference period. | GSIM Sprint 2 | Korean Population Census 2012 processing and analysis |
| Activity | Dissemination Program | | The Dissemination Program set contains objects to represent a set of activities undertaken by statistical agencies to provide data to users. | GSIM Sprint 2 | |
| Activity | Dissemination Program | Dissemination Program | A Dissemination Program represents a set of activities undertaken by statistical agencies to provide data to users. | GSIM Sprint 2 | Korean Population Census dissemination design |
| Activity | Dissemination Program | Dissemination Project | A Dissemination Project is a set of activities undertaken by statistical agencies to provide data relating to a particular reference period to users. | GSIM Sprint 2 | Korean Population Census 2012 dissemination |
| Activity | Dissemination Program | Data Provider | A Data Provider is an individual or organization that makes data available to the statistical organization. | GSIM Sprint 2 | Respondent participating in a survey, national statistical organization providing data to an international organization |
| Activity | Dissemination Program | Provision Agreement | A Provision Agreement is a set of agreements that exist around the exchange of data between a data provider and a statistical organization. | GSIM Sprint 2 | Service-level agreement, legal mandate, the terms of mutual agreement |
| Activity | Statistical Program | | The Statistical Program set contains objects to represent a set of activities to investigate characteristics of a given population. | GSIM Sprint 2 | Methodology, Statistical Project |
| Activity | Statistical Program | Statistical Program | A Statistical Program is a set of activities to investigate characteristics of a given population. | GSIM Sprint 2 | Korean Population Census collection design |
| Activity | Statistical Program | Statistical Project | A Statistical Project is a set of activities to investigate characteristics of a given population for a particular reference period. | GSIM Sprint 2 | Korean Population Census 2012 collection |
Production
Figure A3. Set level diagram of Production group
61.
When an agency conducts a Statistical Program or an Acquisition Program, a series of
activities are undertaken in order to achieve the desired outcome. The activities are carried out
by Process Step Executions, using a set of Inputs and Outputs. The Process Design, Process
Step Definition, Process Method, and Rule define these steps, and they are performed by the
Process Agent.
62.
A process may use any GSIM object as an Input. At run time, a particular instance of an
information object is substituted for the Input information object. The Input can be identified as
one of three types. The transformable Input is used as a placeholder for any information
object that will be changed by the process (e.g. the status of a dataset changes from
provisional to final). The parameter Input represents any information object or attribute of an
information object that guides the process (e.g. accept all matches with similarity index 87 or
higher, where similarity index is the parameter and 87 is the value). The process support
Input covers any information object that is required as ancillary information without which
the process could not complete (e.g. code list to validate against) which is not transformed by
the process.
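The three kinds of Input can be illustrated with a small, hypothetical Python sketch; the class names and the matching example are assumptions made for illustration only, not part of the GSIM specification.

```python
# Hypothetical sketch of the three kinds of Input distinguished by GSIM.
from dataclasses import dataclass
from typing import Any

@dataclass
class Input:
    """An information object used by a process step."""
    name: str
    value: Any

@dataclass
class TransformableInput(Input):
    """Placeholder for an object the process will change
    (e.g. a dataset whose status moves from provisional to final)."""

@dataclass
class ParameterInput(Input):
    """An object or attribute that guides the process
    (e.g. the similarity threshold used when matching records)."""

@dataclass
class ProcessSupportInput(Input):
    """Ancillary information required but not transformed by the process
    (e.g. a code list to validate against)."""

# Example: inputs to a record-matching step.
inputs = [
    TransformableInput("business_register_extract", value="dataset-2012-Q1"),
    ParameterInput("similarity_threshold", value=87),
    ProcessSupportInput("country_code_list", value="ISO 3166-1"),
]
```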
63.
Any object within GSIM can be produced as an Output from a process. Any
information object that is transformed or created during a process is a transformed Output.
Any information object that describes a measure of the process (e.g. time taken) is a Process
Metric. Both of these Outputs are referred to by Process Control in order to initiate the next
step in the process. Reference metadata attached to any of the input information objects may
also be modified by a process and new reference metadata may be created and attached to any
output information object.
64.
An agency will have a pre-defined set of Process Step Definitions (processes as defined
in the GSBPM or another process model). A Process Step Definition describes what
is undertaken by a process but is not itself active. A Process Step Definition may have sub
Process Step Definitions (e.g. the Process Step Definition “Dissemination” will include the
Process Step Definition “publish data to website” as a sub Process Step Definition). An
agency will also have a pre-defined set of Process Methods. A Process Method describes how
a process is carried out. A Process Method may identify (via Process Step Design) the types
of inputs (e.g. datasets) that are required and the outputs produced by a process; it does not
specify the instance used. A Process Method has a series of Rules which guide the actions to
be performed. Each Process Step Definition is likely to have several Process Methods that
can be used for undertaking the activity. Together these may form a ‘process library’. For
example, the Process Step Definition for the activity ‘imputation’ may have the Process
Methods ‘donor’ and ‘heuristic’. Each Process Method may use Rules in its specification.
65.
During the design process an agency will create Process Step Designs (each may have
sub Process Step Designs). A Process Step Design identifies the Process Step Definition that
is undertaken by the process and the Process Method that describes how it is undertaken. The
Process Step Design also identifies specific instances of information objects that are required
as input or created as outputs, for example an agency would specify that a specific dataset is
an input. The Process Step Design is not active, but is a description of the resources required
for a process to be executed. In summary: a Process Step Design chooses a Process Step
Definition, a Process Method, and a Process Agent that together form the design of the process.
66.
Once created, a Process Step Design may be carried out. A Process Step Execution
identifies any further Inputs or Outputs required to execute the process at run-time and
captures a record of the process having taken place. The Process Step Execution will have
various types of Inputs and Outputs associated with it, as determined by the requirements of
the Process Step Design.
67.
Process Control is used to manage Process Step Execution. Process Step Execution
may be a planned activity defined by a Process Control or be triggered by a Process Control
that assesses results from the completion of a previous Process Step Execution. One of the
Outputs from a Process Step Execution will be Process Metrics that feed into a decision point
within the Process Control (based upon a set of Rules), which may trigger another Process
Step Execution if the Process Metric satisfies a particular Rule.
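A minimal sketch of how these objects could fit together at run time is given below; all class and function names, and the metric threshold, are assumptions made for illustration, since GSIM deliberately does not prescribe an implementation.

```python
# Hypothetical sketch: Process Step Definition, Method, Design, Execution and Control.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ProcessStepDefinition:
    name: str                                   # e.g. "imputation"

@dataclass
class ProcessMethod:
    name: str                                   # e.g. "donor imputation"
    rules: List[str] = field(default_factory=list)

@dataclass
class ProcessStepDesign:
    definition: ProcessStepDefinition
    method: ProcessMethod
    inputs: Dict[str, object]                   # specific information objects to use

@dataclass
class ProcessStepExecution:
    design: ProcessStepDesign

    def run(self) -> Dict[str, float]:
        # A real execution would apply the method to the inputs; here we only
        # return a process metric for the control decision below.
        return {"records_imputed_ratio": 0.02}

def process_control(metrics: Dict[str, float], next_step: Callable[[], None]) -> None:
    """Rule-based decision point: trigger the next step only when the metric
    satisfies the (hypothetical) rule."""
    if metrics["records_imputed_ratio"] < 0.05:
        next_step()

design = ProcessStepDesign(
    definition=ProcessStepDefinition("imputation"),
    method=ProcessMethod("donor imputation", rules=["maximum 2 imputation trials"]),
    inputs={"dataset": "business-survey-2012"},
)
metrics = ProcessStepExecution(design).run()
process_control(metrics, next_step=lambda: print("trigger: estimation step"))
```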
Figure A4. Object level diagram of Production group
68.
For the purposes of illustration, Figure A5 below extends some of the objects in the
figure above to show different instances of the Instrument object, and also shows how
the Data Resource can be extended so that a Staging Data store (receiving the data from
a Data Provider) can be applied.
within the statistical organization in the form exactly as provided, and will often not be
structured in the fashion needed for internal processing. Thus, the contents of this store are
likely to require transformation into an internal format before becoming useful for creating
statistical products. Similarly, the Input Data store represents collected data which has been
processed to the point where it is useful for the statistical organization’s production systems.
Figure A5. Instrument Detail
69.
Various sub-classes are also illustrated here, to show how the Instrument could take a
variety of guises, each containing specialized information specific to its type:
- An Administrative Register is a repository of data collected for non-statistical purposes, but made available to the statistical organization for the creation of statistical products.
- A Survey Instrument is a form used for the explicit purpose of gathering statistical data. Surveys may be administered in a variety of modes (as paper forms, in face-to-face interviews, as online interviews, as computer-assisted aids in telephone interviews, etc.). The Survey Instrument provides a generic description of all types of modes.
- The Internet Robot Instrument will hold information about software agents which are designed to visit sites on the internet and collect data by programmatically harvesting it.
- The Generic Instrument is used when the more detailed descriptions (as for Survey, Administrative Register, and Internet Robot Instruments) are not needed. It may be used to describe those types of instruments when less detail is required, or to describe instruments that are not Surveys, Administrative Registers or Internet Robots.
70.
The part of the model illustrated in Figure A6 shows the kind of extended properties
which some classes of data collection instruments may contain.
71.
It is based on objects and properties found in the DDI-Lifecycle standard and in many
of the popular computer-assisted interviewing software packages such as Blaise. It is capable
of describing surveys in a simple manner – as a sequence of questions, statements, and
instructions – but may also be used to describe detailed information about dynamic
presentations and flows.
Figure A6. Survey Instrument Detail
72.
Survey Instrument is the form used to collect data from respondents, including printed
forms, those filled out via telephone, online surveys, face-to-face interviews, etc.
73.
Response Unit is the statistical unit being interviewed.
74.
Question is the text used to interrogate the respondent, including all properties and
relationships such as its response domain and relationships to statistical concepts. There may
be multiple Questions. Question text may be static or dynamic; dynamic text is produced by
computations based on prior known values, earlier responses collected within the survey, or mode values.
75.
The Modes of a survey tell us how the survey is conducted; one or more modes may be
employed for a particular survey (for example, Paper, Computer-Assisted Phone Interview,
Computer-Assisted Self-Interview, Web form, etc.). The list of modes will potentially grow
in future, and will vary from organization to organization, and thus is not specified within
GSIM. The mode of a survey can be used in the conditional flow logic of the survey and in
computation constructs.
76.
Interviewer Instructions are the text or other information provided to the interviewer
to help in conducting the interview. In a self-completion scenario, the respondent is
considered to be the interviewer.
77.
Control Construct is an object which is used to describe the flow logic of a survey.
Control constructs may be combined with other control constructs and can include references
to external prompts such as pictures, sound files, or other media appropriate to the survey
instrument’s mode.
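As an illustration only, the hypothetical structure below sketches how a simple survey flow might be written down in terms of the Control Constructs listed in Table A2 (Sequence, If Then Else, Loop, Statement, Question Construct). The nested-dictionary form and the question texts are assumptions made for illustration, loosely in the spirit of DDI/Blaise-style instruments; GSIM does not define such a serialization.

```python
# Hypothetical sketch of survey flow logic expressed with Control Constructs.
survey_flow = {
    "construct": "Sequence",
    "children": [
        {"construct": "Question Construct",
         "question": "Q1: Did the business invest in environmental protection last year?"},
        {"construct": "If Then Else",
         "condition": "answer('Q1') == 'yes'",
         "then": {"construct": "Loop",
                  "over": "each type of environmental investment",
                  "body": {"construct": "Question Construct",
                           "question": "Q2: Amount invested, in thousands of euros"}},
         "else": {"construct": "Statement",
                  "text": "No further questions on investments."}},
    ],
}
```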
Table A2. Definitions of information objects in the Production Group
| GSIM Group | GSIM Set | GSIM Object | GSIM Object sub-type | Definition | Source | Example |
|---|---|---|---|---|---|---|
| Production | | | | The group Production contains sets of objects that describe the processes, methods and rules that are used in statistical production. | GSIM Sprint 2 | Process Control, Process Component, Rule |
| Production | Process Control | | | The Process Control set contains objects that describe the sequence and selection of processes based on an assessment of outputs according to a set of rules. | GSIM Sprint 2 | Process Step Execution, Input, Output |
| Production | Process Control | Input | | An Input is an information object which is used by a process step. | GSIM Sprint 2 | Classification to validate against, variable to be used in the derivation of another variable, number of repeats |
| Production | Process Control | Input | Parameter Input | A Parameter is an input to a process step execution which controls the tasks performed and potential outputs. The parameter may be an information object or an attribute of one. | GSIM Sprint 2 | e.g. match with 87% similarity |
| Production | Process Control | Input | Process Support Input | A Process Support input assists a process step execution in the completion of its tasks. The process support input is not transformed during process execution. | GSIM Sprint 2 | e.g. code list to validate against |
| Production | Process Control | Input | Transformable Input | A Transformable input is the information object on which the process method is applied during process execution. It may become a transformed output. | GSIM Sprint 2 | |
| Production | Process Control | Output | | An Output is the product of a process step execution. It consists of the transformed output and the process metric. | GSIM Sprint 2 | |
| Production | Process Control | Output | Transformed Output | A Transformed Output is a new information object or a modified transformable input of a process step execution. | GSIM Sprint 2 | |
| Production | Process Control | Output | Process Metric | A Process Metric is a collection of information objects and/or attributes of information objects that describes and helps assess a process step execution and the transformed information object(s) it produced. | GSIM Sprint 2 | |
| Production | Process Control | Process Control | | A Process Control is a set of rules that guides the execution and sequencing of processes. It can contain scheduling and rules for triggering processes to begin. | GSIM Sprint 2 | |
| Production | Process Control | Process Step Execution | | A Process Step Execution is the act of performing a particular process according to a schedule or trigger. A Process Step Execution identifies all other inputs required for the process to be run. | GSIM Sprint 2 | |
| Production | Process Component | | | The Process Component set contains objects that describe the process step definition and design, the methods used and the process agent. | GSIM Sprint 2 | Process Step Definition, Process Step Design, Process Method, Process Agent |
| Production | Process Component | Methodology | | A Methodology is a specification of the processes to undertake a statistical activity. | GSIM Sprint 2 | Seasonal adjustment, imputation |
| Production | Process Component | Process Agent | | A Process Agent is the actor that performs a process. | GSIM Sprint 2 | Technology system, organizational unit, survey instrument |
| Production | Process Component | Process Agent | Technology System | A Technology System is a set of automated methods for performing a process. | GSIM Sprint 2 | |
| Production | Process Component | Process Agent | Organizational Unit | An Organizational Unit is a person or a team that performs a process. | GSIM Sprint 2 | |
| Production | Process Component | Process Method | | A Process Method is one or more ordered actions to be followed to achieve a particular outcome. | Based on Statistics New Zealand model | donor imputation, heuristic imputation |
| Production | Process Component | Process Method | Survey Instrument | A Survey Instrument is a tool to collect data directly from units of a population. | GSIM Sprint 2 | Printed forms, those filled out via telephone, online surveys, face-to-face interviews, etc. |
| Production | Process Component | Process Method | Administrative Register Instrument | An Administrative Register Instrument is a repository of data collected for non-statistical purposes made available to the statistical organization for the creation of statistical products. | GSIM Sprint 2 | Tax register, business register |
| Production | Process Component | Process Method | Internet Robot Instrument | An Internet Robot Instrument is a software agent designed to visit sites on the internet and collect data by programmatically harvesting the internet sites. | GSIM Sprint 2 | Software configured to grab the number of people in the labour force by age group from web site www.xxx.xxx |
| Production | Process Component | Process Method | Generic Instrument | A Generic Instrument is any kind of tool designed to collect data. | GSIM Sprint 2 | Survey, Administrative Register, Internet Robot Instruments |
| Production | Process Component | Process Step Definition | | A Process Step Definition describes the intended purpose and identifies the types of inputs and outputs of a process. It has a set of associated methods that may be used to achieve the desired outcome. | GSIM Sprint 2 | “Dissemination”, “publish data to website”, imputation |
| Production | Process Component | Process Step Design | | A Process Step Design identifies the specific information objects that are the inputs and outputs of a process and identifies the methods it is going to use. | GSIM Sprint 2 | |
| Production | Process Component | Question | | A Question describes the text used to interrogate a respondent, the concept that is measured and the allowed responses. | GSIM Sprint 2 | |
| Production | Rule | | | The Rule set contains objects that govern processes. | GSIM Sprint 2 | Interviewer Instruction, Control Construct |
| Production | Rule | Control Construct | | A Control Construct is a description of a rule or set of rules that describe the flow logic of an instrument. | GSIM Sprint 2 | If Then Else, Loop, Repeat Until, Repeat While, Sequence |
| Production | Rule | Control Construct | Computation | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Control Construct | If Then Else | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Control Construct | Loop | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Control Construct | Question Construct | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Control Construct | Repeat Until | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Control Construct | Repeat While | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Control Construct | Sequence | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Control Construct | Statement | Specific control construct; definition to be added following Sprint 2. | | |
| Production | Rule | Interviewer Instruction | | An Interviewer Instruction is text or other information provided to an interviewer to help in conducting the interview. In a self-completion scenario, the respondent is considered to be the interviewer. | GSIM Sprint 2 | |
| Production | Rule | Rule | | A Rule is an instruction that constrains a process. | GSIM Sprint 2 | for donor imputation: maximum 2 imputation trials |
Conceptual
Figure A7. Set level diagram of Conceptual group
78.
The Conceptual group contains sets of information objects that describe the concepts
used and their practical implementation, allowing users to understand what the statistics are
measuring. The objects in this group are used as inputs to the processes described by the
Production group and are often referred to by objects within the Information group to provide
definition and structure. They are also often referred to in products to provide information
that helps users understand results (reference metadata).
79.
This group is divided into three sets of objects.
80.
The Population Unit set contains objects that define real world phenomena that are the
subject of a statistical activity. These are the subject of measurement which is described by
the objects within the Variable set. The values from these measurements are described or
delimited by objects within the Classification set.
81.
The particular set of units that a statistical agency is interested in at any point during
the statistical production process is defined by the Population. There are several types of
Population, including a Target Population, Frame Population, Survey Population and Analysis
Population. Each Population is made up of Population Units, each of which is a representation
of an entity that can be described by a particular set of characteristics. Households and
Enterprises are examples of Population Units. Depending on the role that the Population Unit
plays at various stages of the statistical lifecycle, it may be referred to as one of multiple types,
including Collection Unit, Statistical Unit and Analysis Unit.
82.
Once a Population has been defined it is usually the case that an agency is interested
in measuring something about the group (using data that may already exist or is yet to be
collected). The measurement of a particular characteristic about a Population is described by
a Variable. The characteristic that is being measured is described as a Concept; Income is an
example of a Concept. Each Variable identifies the Population Unit that is the subject of
measurement identified by a Concept. The Variable does not include any information on how
the resulting value may be represented. This is to prevent duplication of Variable information
where the essence of what is being measured remains the same but is represented in a
different manner. The Contextual Variable adds information that describes how the resulting
values may be represented, through association with a Value Domain. The valid values may
be of several types, including coded (a reference to a code list which may or may not represent
a classification) or non-coded (i.e. text, numeric or date-time). Numeric non-coded value
domains are represented through a Unit of Measurement.
83.
Each time a Contextual Variable is used within a different statistical activity
(Acquisition, Statistical or Dissemination project) it is represented as an Instance Variable.
This is used to identify and acknowledge the meaningful differences that arise when data
sources are used to populate the value of the variable. It is important to note that the Instance
Variable should not be confused with the actual content of the variable once it has been
assigned values.
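A hypothetical sketch of how these objects relate is given below; the class and field names are illustrative assumptions, not a GSIM specification.

```python
# Hypothetical sketch: Concept, Variable, Contextual Variable, Instance Variable, Value Domain.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Concept:
    name: str                                   # e.g. "Income"

@dataclass
class Variable:
    concept: Concept
    population_unit: str                        # e.g. "Person"

@dataclass
class ValueDomain:
    coded: bool
    code_list: Optional[str] = None             # for coded domains
    datatype: Optional[str] = None              # text, numeric or date-time
    unit_of_measurement: Optional[str] = None   # for numeric domains

@dataclass
class ContextualVariable:
    variable: Variable
    value_domain: ValueDomain                   # adds the representation

@dataclass
class InstanceVariable:
    contextual_variable: ContextualVariable
    data_resource: str                          # e.g. "tax office database"

income = Variable(Concept("Income"), population_unit="Person")
income_usd = ContextualVariable(
    income, ValueDomain(coded=False, datatype="numeric", unit_of_measurement="USD"))
income_in_tax_db = InstanceVariable(income_usd, data_resource="tax office database")
```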
84.
A Classification is a categorization of real world objects so that they may be grouped,
by like characteristics, for the purposes of measurement. A Classification may have one or
more Classification Versions, which represent changes over time; each version is valid only for a
particular period. Each Classification Version has one or more Classification Levels which
consist of one or more Classification Items. A Category groups together real world items
according to a common property and exists independently of its inclusion in a particular
Classification Version. The Category is referenced by a Classification Item to place it into the
level of a particular Classification Version.
85.
A Classification categorizes real world items for the purposes of measurement but
does not prescribe a representation for the Categories which it includes. Depending on the
Classification used, a Category may be represented by a different Code within a Code list. Many
Code lists may exist to represent the same Categories in a Classification and be used in different
contexts. Some Code lists may not refer to any Classifications. A Code list consists of Code
Items that can be organized into a Hierarchy of Levels. Together, a Classification and a Code
list provide the representation and meaning that can be used to enumerate a set of real world
values.
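The classification objects described in the last two paragraphs can likewise be sketched as a hypothetical object model; again, the names, fields and the example values are assumptions made for illustration only.

```python
# Hypothetical sketch: Classification, Classification Version/Level/Item, Category, Code list.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Category:
    """Groups real world items by a common property; exists independently of
    any particular Classification Version."""
    label: str                                    # e.g. "Manufacturing"

@dataclass
class ClassificationItem:
    category: Category                            # places the Category in a level

@dataclass
class ClassificationLevel:
    items: List[ClassificationItem] = field(default_factory=list)

@dataclass
class ClassificationVersion:
    valid_from: str
    valid_to: str
    levels: List[ClassificationLevel] = field(default_factory=list)

@dataclass
class Classification:
    name: str                                     # e.g. an activity classification
    versions: List[ClassificationVersion] = field(default_factory=list)

@dataclass
class Code:
    value: str                                    # e.g. "C"
    category: Category                            # the Category this Code represents

@dataclass
class CodeList:
    name: str
    codes: List[Code] = field(default_factory=list)

manufacturing = Category("Manufacturing")
activity_classification = Classification("Activity classification", versions=[
    ClassificationVersion(valid_from="2008", valid_to="", levels=[
        ClassificationLevel(items=[ClassificationItem(manufacturing)])])])
section_codes = CodeList("Section codes", codes=[Code("C", manufacturing)])
```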
Figure A8. Object level diagram of Conceptual group
Table A3. Definitions of information objects in the Conceptual Group

GSIM Group | GSIM Set | GSIM Object | GSIM Object sub-type | GSIM Object Definition | Source | Example
Conceptual | | | | The group Conceptual contains sets of objects that define the measurement and representation of data. | GSIM Sprint 2 | Population Unit, Variable, Classification
Conceptual | Population Unit | | | The Population Unit set contains objects that define real world phenomena that are the subject of a statistical activity. | GSIM Sprint 2 | Statistical Unit, Collection Unit, Analysis Unit, Target Population, Survey Population
Conceptual | Population Unit | Population | | A Population is a set of units. | GSIM Sprint 2 | Target Population, Statistical Population
Conceptual | Population Unit | Population | Target population | A Target Population describes the set of statistical units (i.e. the objects of interest) as defined during the design stage of a statistical activity. | Based on Neuchâtel | All cars in Korea.
Conceptual | Population Unit | Population | Frame population | A Frame Population describes the set of statistical units that represent the observable part of the target population and provides a reasonable approximation of it. | GSIM Sprint 2 | All cars registered in Korea.
Conceptual | Population Unit | Population | Survey population | A Survey Population is the set of statistical units from which information can be obtained in a survey. It is based on the frame population. | GSIM Sprint 2 | All owners of cars registered in Korea.
Conceptual | Population Unit | Population | Analysis population | An Analysis Population is the set of derived statistical units required for the analysis, processing, or dissemination of statistical data. | GSIM Sprint 2 | A set of real or artificial (aggregate) units applied for statistics without an obvious statistical unit, for example in price statistics or pollution statistics.
Conceptual | Population Unit | Population Unit | | A Unit is an entity that can be described by characteristics and is a member of a population. | GSIM Sprint 2 | Statistical unit, collection unit, analysis unit
Conceptual | Population Unit | Population Unit | Statistical unit | A Statistical Unit is the entity of interest in a statistical activity, i.e. for which information is sought. | SDMX MCV updated | Cars.
Conceptual | Population Unit | Population Unit | Collection unit | The Collection Unit is the entity for which information can actually be obtained during data collection. | GSIM Sprint 2 | Car owners.
Conceptual | Population Unit | Population Unit | Analysis unit | An Analysis Unit is a derived entity that is defined for the analysis, processing, or dissemination of statistical data. | GSIM Sprint 2 | A real or artificial (aggregate) unit applied for statistics without an obvious statistical unit, for example in price statistics or pollution statistics.
Conceptual | Variable | | | The Variable set contains objects that describe the measurement of real world phenomena that are the subject of a statistical activity. | GSIM Sprint 2 | Concept, Variable, Contextual Variable, Instance Variable
Conceptual | Variable | Concept | | A Concept is a characteristic common to a set of objects. | Based on ISO 1087 | Income
Conceptual | Variable | Variable | | A Variable is a characteristic of a statistical unit being observed. | Updated part of UN Glossary of Classification terms via SDMX MCV | Income of a person
Conceptual | Variable | Variable | Contextual variable | A Contextual Variable is a characteristic of a unit being observed that may assume one or more of a set of values. | Updated UN Glossary of Classification terms via SDMX MCV | Income of a person measured in USD with a numerical value domain from 0 to infinity.
Conceptual | Variable | Variable | Instance variable | An Instance Variable is a characteristic of a unit being observed that may assume one or more of a set of values as used in a particular data resource. | GSIM Sprint 2 | Income of a person measured in USD with a numerical value domain from 0 to infinity in the tax office database.
Conceptual | Variable | Value Domain | | A Value Domain is a set of permissible values. | ISO/IEC 11179 | Coded value domain, uncoded value domain.
Conceptual | Variable | Value Domain | Coded value domain | A Coded Value Domain is a set of permissible values specified in a code list. | Based on ISO/IEC 11179 | ISO country codes. M or F to represent values of gender.
Conceptual | Variable | Value Domain | Uncoded value domain | An Uncoded Value Domain is a set of rules that defines a set of permissible values where not specified by a code list. | Based on ISO/IEC 11179 | USD with a numerical value domain from 0 to infinity. A dwelling address. Date of birth.
Conceptual | Classification | | | The Classification set contains objects that describe or delimit the values that can be used to measure real world phenomena that are the subject of a statistical activity. | GSIM Sprint 2 | Value Domain, Code List, Classification Version, Classification Level, Classification Item
Conceptual | Classification | Category | | A Category is an object that groups together real world items according to a common property. | GSIM Sprint 2 | Female.
Conceptual | Classification | Classification | | A Classification is an ensemble of one or more related lists of mutually exclusive categories. | Based on Neuchâtel | The UN International Standard Industrial Classification of All Economic Activities (ISIC).
Conceptual | Classification | Classification version | | A Classification Version is a list of mutually exclusive categories representing the version-specific values of the classification variable. A classification version has a certain normative status and is valid for a given period of time. | Neuchâtel | ISIC Revision 4
Conceptual | Classification | Classification level | | A Classification Level is the set of items at the same granularity in a classification (version or variant). | Based on Neuchâtel | The levels of ISIC Rev. 4: principal, secondary and ancillary activities.
Conceptual | Classification | Classification item | | A Classification Item represents a category at a certain level within a classification version or variant. | Neuchâtel | In ISIC Rev. 4 one of the classification items is "0121 Growing of grapes".
Conceptual | Classification | Code list | | A Code List is a predefined list from which some coded statistical characteristics take their values. | SDMX MCV updated | Gender with members F "Female" and M "Male".
Conceptual | Classification | Code item | | A Code Item is a member of a code list. It can either be a code or an included code. | GSIM Sprint 2 | F "Female".
Conceptual | Classification | Code | | A Code is a language independent set of letters, numbers or symbols that represent a conceptual value whose meaning is described in a natural language. | SDMX MCV updated | F (for "Female").
Conceptual | Classification | Code | Included code | An Included Code is a member of a code list that is defined by reference to a code item that is maintained in another code list. | GSIM Sprint 2 | In a national variant of a classification the top level codes are defined via reference to the codes in the international standard classification.
Conceptual | Classification | Hierarchy | | A Hierarchy arranges code items in levels of detail from the broadest to the most detailed level. Each level of the classification is defined in terms of the categories at the next lower level of the classification. | OECD Glossary of Statistical Terms updated | In ISIC Revision 4 the hierarchy is represented by the number of digits.
Conceptual | Classification | Value Domain | Unit of Measurement | A Unit of Measurement is the quantity or increment by which a concept represented by a (typically numerical) variable is counted or described. It is part of the description of an uncoded value domain. | Based on SDMX MCV | Amount in euros, kilograms.
Information
Figure A9. Set level diagram of Information Group
86. As described in the section on Production, an Acquisition Program conducted by a
statistical agency produces or supplies a Data Resource that can be used by Statistical
Programs and Dissemination Programs. The Data Resource is based on Datasets that may be
included in Products and Services, either as Datasets (e.g. when providing access to public-use
micro data) or as represented by a Visualization (e.g. a table in a report or an interactive
chart on a website).
87. Datasets come in different guises, for example as Administrative Register, Time
Series, Panel Data, or Survival Data, to name just a few. The type of a Dataset determines the
set of specific attributes to be defined, the type of Data Structure Definition required (Unit
Data Structure Definition or Cube Data Structure Definition), and the methods applicable to
the data. For instance, an Administrative Register is characterized by a Unit Data Structure
Definition, has attributes such as its original purpose or the last update date of each record,
contains a record identifying variable, and can be used to define a Frame Population, to
replace or complement existing surveys, or as an auxiliary input to imputation. Record
matching is an example of a method specifically relevant for registers. An example of a
type of Dataset defined by a Cube Data Structure Definition is a Time Series. It has specific
attributes such as frequency and type of temporal aggregation and specific methods, e.g.
seasonal adjustment, and must contain a temporal variable.
88. A Cube Data Structure Definition4 describes the structure of an aggregate, multi-dimensional
table (macro data) by means of Dimensions and Measures. Both are Variables
with specific roles in such a table. The combination of Dimensions contained in a Cube Data
Structure Definition creates a key or identifier of the measured values. For instance, country,
indicator, measurement unit, frequency, and time Dimensions together identify the cells in a
cross-country Time Series with multiple indicators (e.g. gross domestic product, gross
domestic debt) measured in different units (e.g. various currencies, percent changes) and at
different frequencies (e.g. annual, quarterly). The cells in such a multi-dimensional table
contain the observation values. A Measure is the Variable that provides a container for these
observation values. It takes its semantics from a subset of the Dimensions of the Cube Data
Structure Definition. In the previous example, indicator and measurement unit can be
considered as those semantics-providing Dimensions, whereas frequency and time are the
temporal Dimensions and country the geographic Dimension. An example of a Measure in
addition to the plain 'observation value' could be 'pre-break observation value' in the case of a
Time Series. Dimensions typically refer to Variables with Coded Value Domains, Measures
to Variables with Uncoded Value Domains.

4 The Cube Data Structure Definition and its components are mainly based on the SDMX Data Structure
Definition but also include elements from the DDI NCube structure. In contrast to the GSIM Cube Data
Structure Definition, its SDMX equivalent also contains "attributes". An SDMX attribute is an additional
characteristic that is not required to uniquely identify a cell in the multi-dimensional structure. It can be
mandatory or optional and attached to a cell, the entire dataset, or any combination of dimensions. In GSIM,
these attributes are considered as reference metadata and thus not represented as separate information objects.
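To make the roles of Dimensions and Measures concrete, the following sketch (illustrative only; the class and field names are not part of GSIM) shows how a Cube Data Structure Definition could be realised as a small data structure in which the combination of dimension values forms the key of each cell.

```python
from dataclasses import dataclass, field

@dataclass
class CubeDataStructureDefinition:
    """Illustrative sketch: dimensions identify cells, measures hold the observed values."""
    dimensions: list[str]          # e.g. country, indicator, unit, frequency, time
    measures: list[str]            # e.g. observation value, pre-break observation value
    cells: dict = field(default_factory=dict)

    def add_observation(self, dim_values: dict, measure_values: dict) -> None:
        # The combination of all dimension values acts as the key (identifier) of the cell.
        key = tuple(dim_values[d] for d in self.dimensions)
        self.cells[key] = {m: measure_values.get(m) for m in self.measures}

# A cross-country time series with one measure, keyed by five dimensions.
dsd = CubeDataStructureDefinition(
    dimensions=["country", "indicator", "measurement_unit", "frequency", "time"],
    measures=["observation_value"],
)
dsd.add_observation(
    {"country": "KR", "indicator": "GDP", "measurement_unit": "USD",
     "frequency": "A", "time": "2011"},
    {"observation_value": 100.0},   # placeholder value, not real data
)
```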
89. A Unit Data Structure Definition5 specifies the structure of unaggregated micro data.
It distinguishes between the logical and physical structure of a Dataset. A Logical Record
describes the structure independently of physical features by referring to Variables that may
include a unit identification (e.g. household number). A Record Layout describes the physical
layout of a Logical Record by means of attributes of Variables such as storage format, start
position, and width. A Physical Structure provides the overall layout of a physical instance of
a Logical Record. It refers to Instance Variables of the Dataset. A Record Relationship
defines source-target relations between Logical Records. A Response is an example of what
can be represented by an Instance Variable.
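As a minimal sketch of the distinction between logical and physical structure (the field names and positions are hypothetical, not taken from GSIM or DDI), a Record Layout can be read as instructions for turning one physical fixed-width record into a Logical Record.

```python
# Illustrative Record Layout: (variable name, start position, width, storage format).
record_layout = [
    ("household_number", 0, 6, "string"),
    ("age",              6, 3, "integer"),
    ("income",           9, 8, "decimal"),
]

def read_logical_record(line: str) -> dict:
    """Turn one physical record (a fixed-width line) into a logical record."""
    record = {}
    for name, start, width, fmt in record_layout:
        raw = line[start:start + width].strip()
        record[name] = int(raw) if fmt == "integer" else float(raw) if fmt == "decimal" else raw
    return record

print(read_logical_record("000042 35 1234.50"))
```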
Figure A10. Object level diagram of Information group
5 The Unit Data Structure Definition and its components are based on the DDI model for record structures.
Table A4. Definitions of information objects in the Information Group

GSIM Group | GSIM Set | GSIM Object | GSIM Object sub-type | Definition | Source | Example
Information | | | | The group Information contains sets of objects that describe the results of stages of statistical production. | GSIM Sprint 2 | The result of acquisition.
Information | Dataset | | | The Dataset set contains objects that define the structure of the result of an acquisition, processing or dissemination activity. | GSIM Sprint 2 | Data resource, data structure definition
Information | Dataset | Dataset | | A Dataset is any organized collection of data. | GSIM Sprint 1 |
Information | Dataset | Dataset | Administrative Register | An Administrative Register is an organized collection of data that includes data of one or more unit types where production of statistics was not the original or main purpose. | Based on GSIM Sprint 1 |
Information | Dataset | Dataset | Logical Record | A Logical Record is a description of a data record that is independent of its physical features. It can include references to included variables, record type, case identification, and multiple record segments. | Based on DDI |
Information | Dataset | Dataset | Previous Response | A Previous Response is a value which is known prior to conducting the survey, and which is used in the instrument to influence selection of questions or other conditional aspects of the instrument. | GSIM Sprint 2 | The municipality the respondent lives in.
Information | Dataset | Dataset | Record Layout | A Record Layout is a description of the details of a physical record. Core elements are details on variables like start position, width, storage format, and decimal positions. | Based on DDI |
Information | Dataset | Dataset | Record Relationship | A Record Relationship describes a link to logical records, and optional format and default characteristics such as decimal positions, decimal or digit separators, data type, and missing data indicators. | Based on DDI |
Information | Dataset | Dataset | Status | A Status for a Dataset records the status of the data. | GSIM Sprint 2 | Provisional data, final data, revised data
Information | Dataset | Data Resource | | A Data Resource is an organized collection of data which may be sourced from multiple acquisition or statistical projects and may be used in dissemination projects. It is made up of one or more datasets. | GSIM Sprint 2 | Economic data warehouse. Census register based on multiple administrative registers and sample surveys.
Information | Dataset | Data Structure Definition | | A Data Structure Definition describes the structure of a Dataset in terms of a set of Variables and attached Value Domains. | Based on SDMX |
Information | Dataset | Data Structure Definition | Cube Data Structure Definition | A Cube Data Structure Definition is a description of the structure of an aggregate, multi-dimensional table (macro data) by means of Dimensions and Measures. | GSIM Sprint 2 |
Information | Dataset | Data Structure Definition | Dimension | A Dimension is a Variable that is required to identify each observation value in a macro dataset. The combination of all Dimensions in a Cube Data Structure Definition provides a key or identifier of the observation values. | GSIM Sprint 2 | Country, frequency, time.
Information | Dataset | Data Structure Definition | Measure | A Measure is a Variable that provides a container for the observation values in a macro dataset. It takes its semantics from a subset of the Dimensions of the Cube Data Structure Definition. | GSIM Sprint 2 |
Information | Dataset | Data Structure Definition | Unit Data Structure Definition | A Unit Data Structure Definition is a description of the structure of a micro dataset by means of a set of Variables defined by logical and physical records. | GSIM Sprint 2 |
Information | Dataset | Response | | A Response is the reaction of an individual unit to some form of stimulus or to a request for information. | OECD Glossary of Statistical Terms |
Information | Product | | | The Product set contains objects that describe static published (internal or external) results of a statistical activity. A Product that is the output of one process may be the input into another. | GSIM Sprint 2 | A statistical publication including methodological information, tables, and charts; a public-use micro dataset.
Information | Product | Product | | A Statistical Product is a package of data and data representations (tables, graphs) and accompanying documentation that results from a statistical activity. | GSIM Sprint 2 |
Information | Product | Visualization | | A Visualization is a (usually graphical) representation of a Dataset or components of a Dataset. | GSIM Sprint 2 | Histogram, table, heatmap.
Information | Service | | | The Service set contains objects that describe dynamically created results of a statistical activity. | GSIM Sprint 2 | An interactive chart on the website of the statistical agency.
Annex B: Using the Object level and Use cases
Making use of the GSIM Object level
90. The Object level of GSIM provides a set of standardized, consistently described
objects which can be used to understand and structure the information required to describe
the production of official statistics, including processes and their inputs and outputs.
91. For each object GSIM provides:
- a definition
- a preferred name
- a set of relationships to other information objects
- a set of attributes used to describe each object (not yet developed, to be included in future versions of GSIM)
92. GSIM can be used by statistical agencies in a variety of ways:
93. Understand the information required to support the statistical production process
- GSIM can be used to identify which information should be captured through the entire lifecycle, including well understood and commonly used information objects such as Variable, Population and Dataset, and new information objects that support innovative approaches to statistics (e.g. new ways of acquiring statistical data using the Acquisition Program object).
- GSIM can be used to understand how information used at different stages in the lifecycle is related. For example, a user who produces a Dataset will understand how this information may be used to produce a Product. The defined attributes and relationships of the Dataset object will guide the user into properly capturing the required metadata.
94. Validate or extend existing information models
- The information objects in GSIM can assist in evaluating the applicability of existing or new standards for data description and exchange. The multiple levels of GSIM allow this to be done at different levels of abstraction.
- The GSIM metamodel and design principles can provide a template for specifying new or local information objects to support the evolving needs of statistical agencies if they choose not to adopt the standard GSIM.
95. Facilitate communication within and between statistical agencies
- The definitions of information objects and those of their attributes will ensure that different stakeholders (business, methodology, IT) have a shared understanding of the data and metadata required when designing and implementing statistical processes.
- Common terminology and definitions facilitate communication with internal and external providers and consumers of statistics.
96. Support metadata used to describe and run processes
- The attributes belonging to each object enable precise description of the inputs and outputs of statistical processes.
- The set of defined attributes standardizes the capture of data and metadata, and enables consistent governance and tracking of information objects through their lifecycle.
- Clearly defined and well described information objects allow the reuse of information objects between different processes or components (e.g. services, systems) within and outside statistical agencies.
97. Understand information in context
- The relationships between information objects help to identify all the inputs and outputs required for a statistical process. For example, the relationships between the objects Process Step, Rule, Output and Method inform a process designer of all the objects required to design a process.
- The relationships indicate the role the objects play with regard to each other and the constraints between objects that must be implemented as rules when designing processes and components.
- Sharing and reuse of information items implies that users must be able to find and discover information. Relationships support search and discovery by linking related objects together. For example, the relationship between Variable and Statistical Unit helps in locating all available data about a unit of interest.
Use Case 1: Design statistics on environmental investments by businesses
Use case from Statistics Netherlands supplied for the GSIM sprint April 2012
Introduction
99. This is an example of a process model for the environmental expenditure statistics at
Statistics Netherlands.
100. A few comments are in order.
1. We have followed an outside-in approach to model the process of these statistics; that is, first we have modeled the output of the whole process, second we have modeled the input of the whole process, and based on the differences between the output and the input we have modeled the process itself.
2. Based on, for example, transparency requirements and the 'structure' of the data to be processed, the process may consist of more than one subprocess. Each subprocess has its output (to be modeled) and its input (to be modeled).
3. When modeling a (sub) process we have distinguished between the design of the "know" and the design of the "flow". The design of the "know" involves the choice of the statistical methodology, which may be considered as the application of a general statistical method like an imputation method or an estimation method. The design of the "flow" involves the design of the process steps in which the methods are applied, the path and order of the process steps, the decision points that guide the statistical data through the path depending on their status, the triggers to start or end a process (step), and the process indicators to manage the flow as a whole. Note that the triggers may be quality driven, cost driven, time driven or event driven.
4. With reference to GSIM, it should be possible to model the output and input of the whole process, of each subprocess and even of a process step by the 'Information' (structural) and 'Conceptual' information objects. So, there is a recurring application of these information objects.
5. With reference to GSIM it should also be possible to model the flow of a (sub) process with the 'Production' information objects.
Statistical Processes covered
101. The process description starts after data collection (Phase 4 of GSBPM) and ends with
validation (Phase 6 of the GSBPM).
Test result
102. Most objects identified in this process description are instances of objects defined in
GSIM v0.4. A few cases of doubt were highlighted in yellow. The process uses all sets of
objects at the second level of GSIM, except for the whole first group ‘Activity’. ‘Activity’ is
implicit. Attributes and relations are seldom explicit.
Process Description
Describing the output and input of the whole process
103. Statistics Netherlands produces annual statistics on environmental investments by
businesses. The 'same' picture is published twice. The first publication contains preliminary
figures and the second publication contains final figures. The difference between these two
publications concerns data quality. Both publications concern the same population
definitions, the same set of (attribute) variables and the same reference period (which is
yearly). Below, you will find our simplified model to describe a part of the output in terms of
the population definitions and the set of variables. Naturally, the NACE-delineation and the
investment variables involved should be defined properly.
Figure B1: Description of the output. NACE-delineation of enterprise units: NACE-id; Environmental investments (on behalf of water); Environmental investments (on behalf of air); Environmental investments (on behalf of the rest); Environmental investments (total).
104. Required objects:
- Statistical unit and target population for the required sub domain and time period
- Concept: environmental investments
- Variables: water, air, rest, total (for the Netherlands per year)
- Value domain of all variables: [0, large value) in euros
- Statistical output: different quality versions of the data (how is this covered in the model?)
- Contextual variables: all values in euro
- Data set: aggregate statistic with NACE as classification variable, investments (for water, air, rest, total) as counting variables and enterprise unit as underlying micro objects
105. In order to produce the final and preliminary figures that are described in Figure B1,
we need input. The environmental investments statistics use primary data collection as
one input source and (a part of) our Business Register as another input source. Below you
will find our simplified model that describes the input sources (and their relation).
Figure B2: Description of the input. Statistical unit (enterprise): Enterprise-id, Inclusion probability, Sampling Indicator, NACE-id (2008), Number of Employees. Observational unit: Observational unit-id, Enterprise-id. Investment: Enterprise-id, Observational unit-id, Investment-id, Type of investment (water, air, rest), Amount of Investment.
106. A number of remarks are in order.
- The Statistical unit part of the model (right side of Figure B2) stems from the Business Register of Statistics Netherlands. The corresponding data is observed and processed by the Business Register unit. From the point of view of the environmental expenditure business, these data already exist.
- In order to observe the investment variables, a sample is drawn from the Business Register. Some of the drawn enterprises are combined to form an observational unit, other enterprises are split into several observational units; most of the enterprises will be used directly as observational units. Due to the splitting and merging of the enterprises a new unit (type) is created, which we have called the observational unit.
- Each observational unit may have zero, one or more investments. Each investment is modeled as a unit.
107. Required objects:
- Population/unit: enterprise, observational unit, investments
- Population: Business register, sample (the sample is a subset of the business register)
- Source: Business register, survey
- Component: derivation of observational unit from statistical unit (splitting and merging). This has already been done in preparatory process steps.
- The cardinality of the relation between observational unit and investment is specified; the cardinality of the relation between statistical unit and observational unit is specified
Designing the flow
108. When the output model is compared to the input model, a number of observations can
be made. We mention the following:
- We need estimates for the population totals of the several investment variables.
- These variables are related: investment on behalf of water + investment on behalf of air + investment on behalf of the rest = investment total.
- These variables are defined on the enterprise unit.
- Per (drawn) observational unit we get a response; per response we actually observe whether or not there is an investment and, if any, the type of the investment and the expenditure involved.
109. The challenge is to model a flow to process the desired output from the input. Below
you will find two simplified process flows, as we have modeled the whole process into two
sub processes.
110. Required objects:
- Components: subprocess1, subprocess2
- Sequencing: the sub processes are triggered at another level with different frequency (explained below)
111. The flow of the first sub process involves the following typical process steps or
activities (the order is not yet modeled):
- Match the observed responses to the Business Register (the sampled part). Referring to Figure B2, this gives the first instances of the Observational unit part and the Investment part.
- Validate and correct for 1000-errors.
- Divide the data flow into two parts. The first has to be validated and corrected manually and the second part is not validated and corrected. Note that this is a simple example of selective editing. The criterion to divide the data flow into two parts is based on a so-called OK-index. So, before the data can be divided, this OK-index has to be calculated.
- Calculate the OK-index.
- Validate the data manually.
- Correct (or impute) the data manually.
- Estimate the amount of investment (on the Investment level) to the Observational unit level, taking into account the type of investment.
112. Required objects:
- (implicit) population of observational units
- Components: matching, validation1000, determine OK-index, router validation and correction, validate manually, correct manually, estimate
- Variable: OK index (= quality variable), distinguish between (1-to-1, split and merged) statistical units (= relation variable)
113. The flow of the second sub process involves the following process steps/activities (again the
order is not modeled yet):
- Split the joined observations into the related enterprise units, taking into account any over-coverage with respect to observational units.
- Join the split observations to the related enterprise unit, taking into account any partial non-response.
- Estimate the NACE population totals of the environmental investment variables.
114. Required objects:
- Components: router on (1-to-1, split and merged) statistical units; for split: transform observation unit information to enterprise information, deal with over-coverage; for merged: transform observation unit information to enterprise information, deal with partial non-response; estimate the total environmental investment by NACE and investment type
- Variable: sample design characteristics (inclusion probability, sampling indicator) are in Figure B2 at the enterprise level
115. So far, the identified process steps/activities could be mapped to GSBPM. Based on
these steps you will find below an example of both process flows (the order and the path are now
modeled).
Figure B3: Flow of the first sub process (process steps: match response; validate and correct 1000-errors; calculate OK-index; decision point "OK?"; validate (manually); correct (manually); aggregate; decision point "More units?").
Figure B4: Flow of the second sub process (decision point "Split or join"; split / join / do nothing; estimate population totals).
116. Again, a number of remarks are in order.
- The technique we have used to model these flows is non-standard.
- The first sub process is able to process each response separately (there may be 10,000 responses). So, the sub process starts, i.e. is triggered, by a response. This response is matched and edited, and there is an aggregation (from Investment to Observational unit).
- There is a decision point (OK?) that controls the response flow.
- There is a decision point (More units?) that controls the end of the sub process.
- The second sub process starts to produce the preliminary figures once 6,000 responses have been processed by the first sub process. In the meanwhile the first sub process continues. If there are 10,000 responses processed, then the second sub process starts again.
- The second sub process processes the 6,000 (or 10,000) responses as a whole. This is necessary because of the Join process step and the Estimation process step.
- The output of the first sub process and the input of the second sub process are not modeled yet. In fact this should be done.
- Each process step is based on a (statistical) method, which has its own specification (or parameterization). Furthermore, each activity has its own inputs and outputs. In fact, both the specification and the input and output should be described/modeled at this level as well.
- In the end, all descriptions of the input and output should fit together like a puzzle.
117. Required objects:
- Components: the routers have already been identified
- Path: the sequence of actions is specified
- Trigger: subprocess1 is triggered by each response (observation unit), subprocess2 is triggered by the number of observation units (6,000, 10,000); this trigger logic is sketched below
- Variable: number of responses; this variable has two roles: it is used as a process indicator to manage the flow; it is also used as a quality indicator to describe the response fraction
- Parameter: requirement on number of responses for provisional and final data
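To illustrate the two triggers (per-response for the first sub process, response counts of 6,000 and 10,000 for the second), a minimal sketch of the flow control could look as follows; the function names are invented for this illustration and do not correspond to an actual Statistics Netherlands system.

```python
# Illustrative only: subprocess1 is triggered by each incoming response,
# subprocess2 is triggered when 6,000 (provisional) and 10,000 (final) responses are processed.
TRIGGER_COUNTS = {6_000: "provisional", 10_000: "final"}

def run_flow(responses, subprocess1, subprocess2):
    processed = 0
    for response in responses:
        subprocess1(response)            # match, edit, aggregate one response
        processed += 1
        if processed in TRIGGER_COUNTS:  # process indicator doubles as quality indicator
            subprocess2(status=TRIGGER_COUNTS[processed])

run_flow(range(10_000), subprocess1=lambda r: None,
         subprocess2=lambda status: print("estimate population totals:", status))
```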
Designing the know (statistical methodology)
118. As already said, each process step is based on a (statistical) method, which has its
own specification (or parameterization). Furthermore, each activity has its own inputs and
outputs. In fact, both the specification and the inputs and outputs should be described/modeled.
As an example, we only describe the step that validates and corrects 1000-errors.
119. In order to specify the methodology one could refer to a standard method that
identifies 1000-errors. For example, one could compare the value of the variable to be
validated to a so-called reference value. If the ratio is larger than a parameter a, say a=400,
then there is probably a 1000-error.
120. Now, in this example, the process step that validates and corrects 1000-errors should
'know' which variables have to be checked, the specification of the edit rules in terms of the
parameter a, and the specification of the parameter a. We note that the specification of the
parameter a may be different for different variables. The output of the process step
always has the same structure: it provides quality indicators for each variable that has been
checked and a new status of the values of these variables.
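As an illustration of how such a specification could look when made executable, the sketch below (illustrative only; the names are taken from the example above, not from an actual Statistics Netherlands system) applies the 1000-error edit and the corresponding correction to a single value.

```python
def check_and_correct_1000_error(x: float, reference: float, a: float = 400) -> tuple[float, str]:
    """Edit: if x exceeds the reference value by more than a factor a, flag a probable 1000-error.
    Imputation: correct a flagged value by dividing it by 1000; otherwise keep it unchanged."""
    if reference > 0 and x > a * reference:
        return x / 1000, "1000-error"
    return x, "not 1000-error"

# A reported value of 2,500,000 against a reference value of 3,000 is flagged and corrected.
value, status = check_and_correct_1000_error(2_500_000, 3_000)
print(value, status)  # 2500.0 1000-error
```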
121. There is one comment in order.
- The reference values are used as auxiliary information. The designer of the flow should wonder whether these reference values stem from an external source (and should be modeled as an additional input source) or from an activity inside the process. In the latter case this activity should also be modeled. This could be a third (preparing) sub process.
122. Required objects:
- Rule: there is a linear edit involved: Variable x > a multiplied by variable y.
- Component:
  - There is an edit part in the component: if the edit is 'violated', then variable z is '1000-error', otherwise variable z is 'not 1000-error'.
  - There is an imputation part in the component: if z is '1000-error', then variable x = variable x divided by 1000; otherwise if z is 'not 1000-error' then variable x = variable x.
- Parameter: a
Information Objects
123. In Table B1, the information objects identified in the use case are listed (first column)
and mapped to the GSIM objects (third column).
Table B1: Comparison of Information objects
Identified information objects | Attributes found | GSIM information objects applicable | Definition match (Y/N) | Attributes applicable (Y/N) | Relationships applicable (Y/N)
Enterprise
Statistical unit
yes
yes
Target population
Target population
yes
yes
Environmental investments
Concept
yes
yes
Variables
yes
yes
Contextual variable
yes
yes
Value domain
yes
yes
Quality attribute
Products
yes
yes
NACE, type of
investment,
aggregate value
Data set
yes
yes
Data set (composite)
yes
yes
Observational unit
yes
yes
Sample
Sample population
yes
yes
Business register, survey
Source
yes
yes
Derivation of observational unit
Component
yes
yes
Subprocess1, Subprocess2
Component
yes
yes
Variables environmental investments for water, air, rest, total
All variables in euro
Value larger 0 and larger (high values)
Two versions (provisional, final)
Data set
Business register (sampling frame)
Observational unit
Netherlands, year
Sequencing sub processes
Control
yes
yes
Matching,
validation1000,
determine
OK-index,
router
validation
and
correction,
validate
manually,
correct
manually, estimate
OK index (=quality variable),
distinguish between (1-to-1, split
and merged) statistical units
(=relation variable)
Router on (1-to-1, split and
merged) statistical units, for split:
transform
observation
unit
information
to
enterprise
information, deal with overcoverage, for merged: transform
observation unit information to
enterprise information, deal with
partial non-response, estimate the
total environmental investment
by NACE and investment type
Sample design characteristics
Component
yes
yes
Variable
yes
yes
Components,
rule,
process control
yes
yes
Variable
yes
yes
Attribute
sample
to
Path through components
Rule
Triggers for starting the process
Rule
yes
yes
Number of responses; this variable has two roles: it is used as a process indicator to manage the flow, and as a quality indicator to describe the response fraction | Variable | yes | yes
Requirement on number of responses for provisional and final data | Parameter | yes | yes
Variable x > a times variable y | Rule | yes | yes
a | Parameter | yes | yes
Use Case 2: Harvesting Data off The Web
Use case supplied by the IMF for GSIM Sprint 2 in April 2012
Introduction
124. This use case was defined based on the IMF's experience with web scraping of data
from member countries' and international organizations' web pages (currently approx. 35
robots scraping 4000 series from 85 websites). Using robots to harvest data off the web has
the potential to ease the task of data collection for data providers as well as data collectors. Data
that are already publicly available can be automatically extracted from the web, reducing the
manual retrieval of data. Reuse of existing data and higher degrees of process automation
play an important role in the modernization of statistical business processes and are thus
considered as a use case relevant to GSIM.
Statistical processes covered
125. In principle, data scraped from the web can traverse all nine phases of the statistical
business process as described by the GSBPM, but phases without any special requirements in
the web-scraping scenario are not included in this use case (phases 6-8). With respect to the
other phases, only the relevant sub-processes are described.
Test result
126. The objects identified in this use case correspond to information objects defined in
GSIM v0.4. The process uses all level 2 objects except for Acquisition Program, Statistical
Program, and Dissemination Program from the Activity group. Activity is not explicitly
modeled, but the use case is actually an Acquisition Program. Attributes or relations are not
explicitly modeled, but a number of characteristics of objects were identified and listed in
section 4 for consideration as additions in the next development phase of the model. Not all
of the objects identified in the use case are currently explicitly modeled and implemented in
the structured and formalized way as proposed by GSIM. GSIM can help to improve
transparency concerning those objects as created and transformed in the process, facilitate
maintenance of these objects as well as of the process components, and to enhance reusability
of process components (web scraping robots in this case). Reference metadata are not
explicitly modeled in GSIM, but they may be attributes of any information object or
information objects (for example (meta-) data sets) attached to any other information object
(by means of a generic relationship that all information objects have).
Process description
1. Specify Needs
1.1. /1.2. Determine needs for information / Consult & confirm needs
127. Web scraping is used in two different scenarios: to fill gaps in existing data collection
exercises and as an instrument for new data collection exercises. In the first scenario, need
specification comprises the identification of gaps in an existing data collection, e.g. countries
that do not report using existing data submission modes, do not report in a timely manner, or do
not report at the desired frequency. To do this, the coverage of the existing dataset is evaluated.
This involves the dataset, the variables and/or statistical units used, and validation rules that
determine what is missing at certain points in time. A quality report (incl. coverage, timeliness)
will result from the coverage validation. The report may already be available from regular quality
assessments. In that case it is just an input to this process step and does not need to be defined
and generated. A description of the gap to be filled by web scraping is then the specification
of the information request.
128. Required information objects (examples in parentheses):
- Data Set (Direction of Trade production database)
- Variable (Exports, Imports, Imports Free on Board; broken down by Partner Country, Measurement Unit, Frequency, Time)
- Population Unit (Country)
- Process Component (Validate dataset coverage)
- Rule (a country's coverage vis-à-vis all partner countries below x% in the previous 8 quarters)
- Product (Coverage report for Direction of Trade production database; contains a data set and a list of countries that satisfy the rule)
- Information Request (Country X's quarterly Exports to and Imports from all individual Partner Countries in the previous three quarters)
129. We consider the quality report as reference metadata attached to a dataset. The report
is the output product of a validation process component. In case the report is reasonably
formalized, it is represented as a data set.
130. In the second scenario a new information need is expressed.
131. Required information objects:
- Information request (Member countries' latest Public Sector Debt Statistics)
1.3. Establish output objectives
132. In order to describe the purpose of the new data collection, that is either to close gaps
in existing datasets or to satisfy a new information need, the same information objects as in
the previous process step (1.1/1.2) are used.
1.4. Identify concepts
133. In the “close gaps” scenario we assume that a formal representation of the concepts
and value domains is already available and can be used to specify the constraints defining the
gap (e.g. which indicators, countries, measurement units, methods (e.g. for seasonal
adjustment), etc.). In IMF data collection, country is the statistical unit, but it is usually
treated as a variable.
134. Required information objects:
- Variable (Exports, Imports, Imports Free on Board; Partner Country, Measurement Unit, Frequency, Time)
- Value Domain (numeric, Precision=2 decimals, Scale=Millions)
- Population Unit (Country)
- Rule (Country=X and Frequency=Q and Measurement Unit=US Dollars and Adjustment=None)
135. In the "new need" scenario a less detailed description of concepts is created in this
step using fewer information objects.
136. Required information objects:
- Variable (Exports, Imports, Imports Free on Board; Partner Country, Measurement Unit, Frequency, Time)
- Population Unit (Country)
1.5. Check data availability
137. Checking data availability involves the following steps:
1. Identify web site(s) of data provider(s) (i.e. national statistical institutes,
central banks, or international organizations; but could also be commercial
providers) that contain the required data
2. Identify the available data format, e.g. HTML table, downloadable file (xls,
csv, pdf, SDMX-ML, ...), query interface and assess whether it suits the needs
(typically, a database query interface is preferred to downloadable files (with a
stable structure) which are preferred to plain HTML pages, as the structure of
a database is changed less frequently than the layout of an HTML page)
3. Check quality of available data, e.g. frequency of updates, timeliness, etc.
4. If the quality is not appropriate or the desired data format is not available,
identify further sources and repeat steps 2 and 3
138. We considered data available on a data provider’s web site as that data provider’s
product irrespective of the data format. We assumed the data format and transmission
channel to be attributes of a product. The quality indicators required to evaluate the data are
modeled as variables with value domains in a data set. A process component specifies
how to check the quality, and a process agent performs the check. The “check data
availability” process iterates until the quality of the identified data satisfies the requirements.
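A rough sketch of this iterative availability check follows; the candidate sources, attributes and quality rule values are invented for illustration only.

```python
# Illustrative: iterate over candidate products until one satisfies the quality rule.
candidates = [
    {"source": "https://stats.example.org/trade.html", "format": "HTML table", "update_frequency": "monthly"},
    {"source": "https://stats.example.org/api",        "format": "Query Interface", "update_frequency": "weekly"},
]

def satisfies_rule(product: dict) -> bool:
    # Rule: Update Frequency is at least weekly and Data Format = Query Interface
    return product["update_frequency"] in ("daily", "weekly") and product["format"] == "Query Interface"

selected = next((p for p in candidates if satisfies_rule(p)), None)
print(selected["source"] if selected else "identify further sources and repeat")
```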
139. Required information objects:
- Product/Service (International trade table on website of Country X's statistical office; The World Bank's online data catalogue and Open Data API)
- Data set (Quality Dataset created for Product Y)
- Variable (Update Frequency)
- Value Domain (daily, weekly, monthly)
- Rule (Update Frequency is at least weekly; Data Format=Query Interface)
- Process Component (Validate quality of product)
1.6 Prepare business case
140. The preparation of a business case requires information objects from process steps 1.1
to 1.5, especially the availability check and information on the available data format and
quality.
2. Design
2.1 Design outputs
141. In the scenario of enhancing an existing dataset, the data collected via web scraping is
added to existing statistical outputs (i.e. dataset, publications, quality reports, etc.). This
means that an existing output design is identified and reused, and the part of the output to
which the supplementary collection will contribute data and/or (reference) metadata (such as
a link to the source website) is specified. In terms of information objects, output design
would be done via reference to existing products and data sets (potentially a separate one for
metadata) as well as a rule, specified in terms of variables and value domains, that defines
the part of the data sets which the scraped data will fill. We would model any additional
reference metadata required as variables in another data set. In the new collection scenario,
the output data set (= target data model) including output reference metadata (separate data
set) as well as output products need to be defined (as for any other new data collection). In
both scenarios, the intermediary outputs, e.g. the input data store (before further processing),
have to be designed. In the new data collection scenario, the same objects are required but
need to be newly defined instead of referenced.
142. Required information objects:
- Data set (Direction of Trade reference metadata database and production database, Web scraping input data store)
- Data provider (National Statistical Office of country A)
- Variable (Direct Source Link, Update Frequency)
- Value Domain (IMF Country and Group code list)
- Rule (Data area=[Country=A and Variable=Imports and Frequency=Quarterly])
- Product (Direction of Trade dissemination database, Public Sector Debt Country Report)
2.2 Design variable descriptions
143. Variable design and description includes cross-domain concepts such as country,
frequency, time, measurement unit. For those general concepts references to existing
variable definitions are sufficient. New concepts and appropriate source, derived, and output
variables and related value domains may need to be defined for new data collections.
Concepts and variables at all processing levels are considered as Variables.
144. Required information objects:
- Variable (Country)
- Classification (ISO 2 country codes)
- Value Domain (date format=YYYY, Time between 2000 and 2030)
2.3 Design data collection methodology
145. To design a web scraping instrument, the following steps are required:
1. Determine the number and types of robots needed
2. Specify which data sources to use (data may be available on multiple websites) and how:
   - web address
   - format (i.e. HTML table, linked file, query tool)
   - navigation steps to get to the data
3. Define the source data model, i.e. identify relevant elements of the website
4. Define the mapping from source data model to target data model (variable and value correspondence)
146. Translated into GSIM objects, this means that a "web scraping through a scheduled
robot" process component is defined and the configuration parameters specified. Web
address and data format are considered as attributes of a product. The navigation steps are
specified as rules. A mapping process component with additional rules is used to identify
and map the information elements from the web site to the target data model. The mapping
defines the correspondence of web site elements to variables and their values (taken from a
value domain). The schedule and/or sequencing of robots are defined in a process control.
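As a purely illustrative sketch (the URL, column labels and mapping below are invented for this example, not taken from an actual IMF robot), the configuration of such a robot and its mapping rules could be expressed roughly as follows.

```python
# Hypothetical robot configuration: product attributes (web address, format),
# navigation rules, and a mapping from source columns to target variables.
robot_config = {
    "schedule": "every Sunday at 23:00",                     # defined in a process control
    "product": {"web_address": "https://stats.example.org", "format": "CSV download"},
    "navigation_rules": ["Select 'PPI Statistical Bulletin'", "Select 'Download CSV'"],
    "mapping": {                                             # source column -> target variable
        "Ctry": "Partner Country",
        "Per": "Time",
        "Val": "observation_value",
    },
}

def apply_mapping(source_row: dict, mapping: dict) -> dict:
    """Mapping process component: rename scraped elements to target-model variables."""
    return {target: source_row[source] for source, target in mapping.items()}

print(apply_mapping({"Ctry": "DE", "Per": "2011Q4", "Val": "123.4"}, robot_config["mapping"]))
```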
147. Required information objects:
- Process Component (Web scraping through a scheduled robot; Mapping)
- Rule (Select "PPI Statistical Bulletin"; Select "1. Output Prices: Summary NSA"; Select "Download CSV"; Column A: 1=Imports, 2=Exports)
- Product/Service (International trade table on website of Country X's statistical office)
- Variable (Partner Country)
- Classification (ISO 2-digit country codes)
- Value Domain (numeric, Precision=2 decimals, Scale=Millions)
2.4 Design frame & sample methodology
148. The target population unit is country. It is usually treated as a variable. No sample
methodology is applied.
149. Required information objects:
- Population Unit (Country)
- Variable (if used instead of population unit)
- Classification (ISO 2-digit country codes)
- Rule (Country=A,B,C)
150. The target population is typically defined as all member countries.
2.5 Design statistical processing methodology
151. The following process steps need to be followed:
1. Define validation rules (e.g. identify unmapped items)
2. Define rules for derivation of variables (requires also the values and value
domains of original and derived variables)
3. Define rules for integration of data from different robots (requires target data set
and dimensions along which the scraped data are combined; can be the population
unit or a variable)
4. Define rules for reference metadata (variables in a separate dataset or attributes of
objects)
152. Required information objects:
- Process Component (Validation of load process, Derive ratio of two variables, Integrate dataset into data resource)
- Rule (Variable A as Percent of GDP = Variable A in US Dollars / GDP in US Dollars * 100; Measurement Unit = % of GDP; Attribute Last Update Date of data set = time stamp of load time); a derivation of this kind is sketched after this list
- Variable (Time, Partner Country)
- Classification (SNA 2008 Sector Classification)
- Value Domain (1990 to 2030)
- Data Set (Public Sector Debt Dissemination Database, Economic Data Warehouse)
- Population Unit (Country)
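A minimal sketch of how the derivation rule above could be executed (the variable values are illustrative, not actual IMF data):

```python
def derive_percent_of_gdp(value_usd: float, gdp_usd: float) -> float:
    """Rule: Variable A as Percent of GDP = Variable A in US Dollars / GDP in US Dollars * 100."""
    return value_usd / gdp_usd * 100

# The derived variable carries Measurement Unit = "% of GDP".
print(derive_percent_of_gdp(50.0, 200.0))  # 25.0
```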
2.6 Design production systems & workflow
153. The following process steps are required:
1. Define web scraping schedule
2. Specify format and location of robot output
3. Define workflow; includes
a) integration of "collection via robot" into standard processes
b) exception handling, e.g. no new data, unmapped item detected, robot fails
c) alert system/notifications
d) activity log for robots
e) remedy log for required interventions in case of robot failure
154. Required information objects:
- Process Control (Run robot "DOT Country A" every Sunday at 11PM; If file download successful then execute save file step, else send failure notification)
- Process Component (Create process log)
- Rule (Save downloaded file as filename.xml in location X)
- Data Set (format and location should be attributes of a data set)
- Process Metrics (Robot activity log, robot remedy log)
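The exception handling and logging described above could be sketched roughly as follows (the robot, its name and the notification are placeholders for illustration, not an actual IMF workflow):

```python
import datetime

activity_log, remedy_log = [], []   # Process Metrics: robot activity log and remedy log

def run_scheduled_robot(robot, name: str) -> None:
    """Process Control sketch: run a robot step, log activity, record failures for remedy."""
    started = datetime.datetime.now().isoformat()
    try:
        robot()                                           # e.g. download and save filename.xml
        activity_log.append((name, started, "success"))
    except Exception as error:                            # exception handling: robot fails
        activity_log.append((name, started, "failure"))
        remedy_log.append((name, started, str(error)))    # required intervention to be recorded
        print(f"notification: robot {name} failed: {error}")

run_scheduled_robot(lambda: None, "DOT Country A")
print(activity_log)
```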
3. Build
3.1 Build data collection instrument
1. Implement robot(s) based on the design specified in process phase 2
2. Reuse existing robots if possible
155. Required information objects:
- Process Control (Run robot "DOT Country A" every Sunday at 11PM)
- Process Component (Create remedy log)
- Rule (Save downloaded file as filename.xml in location X)
3.2 Build or enhance process components/3.3. Configure workflows
156. Implement integration and other rules and configure workflow based on the design
specified in process phase 2.
157. Required information objects:
- Same objects as in 3.1 Build data collection instrument
3.4 Test production system
1. Test robots and other components individually
2. Modify them if necessary
3. Re-test components (loop)
4. Test integration of components
5. Modify & re-test (loop)
158. This process step creates a test report.
159. Required information objects:
- All objects specified for the scraping process are subject to test/review and thus needed in this process step. The test report can be represented as process metrics or product.
3.5 Test statistical business process
1. Test entire workflow
2. Modify if required
3. Re-test workflow (loop)
160. Required information objects:
- Same objects as in 3.4 Test production system
3.6 Finalize production system
1. Implement final schedule and obtain approval
2. Communicate to data providers
161. Required information objects:
- Same objects as in 3.4 Test production system
4. Collect
4.1 Select sample
162. Not relevant for this use case.
4.2 Set up collection
163. Put robots and other components into production.
164. Required information objects:
- Same objects as in 3.4 Test production system
4.3/4.4 Run collection/Finalize collection
165. Robots will run according to the specified schedule
- includes populating the activity log and failure remedy log (see 2.6)
166. Required information objects:
- Same objects as in 2.6 Design production systems & workflow and 3.6 Finalize production system
5. Process
5.1 Integrate data
1. Integrate data and metadata from different robots
2. Integrate scraped data and metadata with data/metadata already in the system or
data/metadata collected via different channels
167. Required information objects:
- Data Set (Direction of Trade Input Database, Economic Data Warehouse)
- Process Component (Integrate dataset into data resource)
- Variable (Partner Country, Time, Frequency)
- Classification (ISO country codes)
- Value Domain (1990-2030)
5.2. Classify & code
168. Not relevant in this scenario.
169. The coding is already done by the mapping component of the robot, typically during
collection and before integration.
170. (To further modularize the processes, it should be considered to move this step from
"Collection" to "Process".)
5.3 Review, validate & edit
171. Based on validation rules specified in phase 2. This step would typically happen
prior to integration.
172. Required information objects:
- Same as used in 2.5 Design statistical processing methodology
5.4 Impute, 5.5 Derive new variables & statistical units, 5.6 Calculate weights, 5.7 Calculate
aggregates, 5.8 Finalize data files
173. No specific rules for web-scraped data; same categories as in other types of data
collection
6. Analyze, 7. Disseminate, 8. Archive
174. No specific rules for web-scraped data.
9. Evaluate
9.1 Gather evaluation inputs
175. Gather as evaluation input:
1. Robot activity log
2. Remedy log for required interventions in case of robot failure
176. Both logs are data sets. They can be represented as process metrics.
177. Required information objects:
- Process metrics (activity log, remedy log)
9.2 Conduct evaluation
1. Review evaluation inputs
2. Identify failure-prone robots and analyse the reasons for failure and the remedies used
3. Prepare evaluation report (represented as data set or product)
178. Required information objects:
- Data Set (Robot evaluation metrics)
- Rule (Average rate of robot failures greater than 10%)
- Process Component (Calculate robot evaluation metrics)
- Product (Robot evaluation report)
9.3 Agree action plan
179. An action plan may include modification of robots, for example in terms of:
- more stable data sources
- automation of remedies
180. Required information objects:
- Process Control (If Rule 1 and Rule 2 satisfied then start Process Step Execution Identify potential data sources)
- Process Component (Validate quality of product)
- Rule (Update Frequency is at least weekly; Data Format=Query Interface)
- Product/Service (International trade table on website of Country X's ministry of finance)
181. For example, actions may include revision of the schedule, the configuration of
process components and/or rules, using a different data product (maybe of a different data
provider), or adding another product as source. Any of the other previously used categories
may be involved.
Information objects
182. In Table B2, the IMF information objects identified in the use case are listed (first
column) and mapped to the GSIM objects (third column). The second column lists attributes
to be considered in the further specification of GSIM information objects. Columns 4, 5 and 6
can be completed once those levels of detail will have been specified for GSIM.
Table B2: Comparison of information objects
Identified
information
objects
Data
set,
(reference)
metadata
set,
production data set,
dissemination data
set
Attributes
found
Data
location
GSIM
Information
objects
applicable
format, Data set
Concept,
dimension,
attribute, reference
metadata type
Variable
Dimension,
source
data
Population
unit
Web scraping tool,
robot, mapping
Process
component
GSBPM
Definition Attributes Relationships Steps of GSBPM Test
match
applicable applicable
where used
comments
(Y/N)
(Y/N)
(Y/N)
1.1, 1.2, 1.3, 1.5,
1.6
2.1, 2.5, 2.6
3.4, 3.5, 3.6
4.2, 4.3, 4.4
5.1, 5.3
9.3
1.1, 1.2, 1.3, 1.4,
1.5, 1.6
2.1, 2.2, 2.3, 2.4,
2.5
3.4, 3.5, 3.6
4.2, 4.3, 4.4
5.1, 5.3
9.2, 9.3
1.1, 1.2, 1.3, 1.4,
1.6
2.4, 2.5
3.4, 3.5, 3.6
4.2, 4.3, 4.4
5.3
9.3
1.1, 1.2, 1.3, 1.5,
1.6
2.3, 2.5, 2.6
3.1, 3.2, 3.3, 3.4,
Constraint, business
rule,
formula,
mapping rule
Data
report,
metadata
report,
quality
report,
table, publication
Data
format,
transmission
channel,
location
(e.g. web address),
last update date
(should be generic)
Requirements,
business need
Authoritative list, Data type, format,
code
list, measurement unit,
production
list, precision, scale
source dimension,
measurement unit
(a dimension)
GSIM
Information
objects
applicable
GSBPM
Definition Attributes Relationships Steps of GSBPM Test
match
applicable applicable
where used
comments
(Y/N)
(Y/N)
(Y/N)
3.5, 3.6
4.2, 4.3, 4.4
5.1, 5.3
9.2, 9.3
Rule
1.1, 1.2, 1.3, 1.4,
1.5, 1.6, 2.6
2.1, 2.3, 2.4, 2.5
3.1, 3.2, 3.3, 3.4,
3.5, 3.6
4.2, 4.3, 4.4
5.1, 5.3
9.2, 9.3
Product
1.1, 1.2, 1.3, 1.5,
1.6
2.1, 2.3
3.4, 3.5, 3.6
4.2, 4.3, 4.4
9.3
Information
1.1, 1.2, 1.3, 1.6
request
3.4, 3.5, 3.6
4.2, 4.3, 4.4
9.3
Value domain
1.4, 1.5, 1.6
Level 3
2.1, 2.2
3.4, 3.5, 3.6
4.2, 4.3, 4.4
9.3
70
Identified
Attributes
information
found
objects
Online
query Channel/format
facility, SDMX web
service
Data provider, data
source,
direct
source
Hierarchical code
list,
authoritative
list, dimension
Schedule, business
rule/logic
Robot activity log,
failure remedy log
GSIM
Information
objects
applicable
Service
GSBPM
Definition Attributes Relationships Steps of GSBPM Test
match
applicable applicable
where used
comments
(Y/N)
(Y/N)
(Y/N)
1.5, 1.6
3.4, 3.5, 3.6
4.2, 4.3, 4.4
9.3
Data provider
2.1
Level 3
3.4, 3.5, 3.6
4.2, 4.3, 4.4
9.3
Classification
2.2, 2.3, 2.4, 2.5
3.4, 3.5, 3.6
4.2, 4.3, 4.4
5.1, 5.3
9.3
Process
2.6
Control
3.1, 3.2, 3.3, 3.4,
3.5, 3.6
4.2, 4.3, 4.4
9.3
Process
2.6
Metrics
3.4, 3.5, 3.6
4.2, 4.3, 4.4
9.1, 9.3
71
Annex C: Mapping to other standards and models
183. This document begins to explore the relationships between GSIM and a number of
other standards and models. These include GSBPM, SDMX, DDI and CORE.
184. For each standard or model, the differences in terms of modeling, similar terms, gaps
or differences, and example mappings are given. This work is in its preliminary stages and
will be expanded as GSIM develops further.
Relationship between GSBPM and GSIM
Differences in terms of modeling
185. The GSBPM is a business process model comprising four levels (the statistical
production process, nine phases, sub-processes within each phase, a textual description of
each sub-process). The GSBPM is available as a combination of a text document and a
diagram. The text contains many synonyms, homonyms, broader and narrower terms.
Similar terms
186. The GSIM is designed to support the GSBPM and to cover the whole statistical
process. Therefore, there are a great many similarities between the terms used in the GSBPM
and the names of the information objects in GSIM. The strength of the GSBPM lies in its
ample use of synonyms, homonyms, broader and narrower terms for communicating to a
wide audience. However, these are not suitable for an information model which requires a
much more controlled vocabulary and a much higher degree of semantic precision.
187. Future work should focus on a more detailed mapping between names of information
objects in GSIM and GSBPM terms. This work has begun, as shown in Table C1.
Table C1. Examples of mapping between GSIM and GSBPM

GSIM Group | GSIM Set | GSIM Definition | GSBPM term (details/specialisations/synonyms/irregular plural forms)
Conceptual | Population Unit | The Population Unit set contains objects that define real world phenomena that are the subject of a statistical activity. | unit, e.g. statistical unit, collection unit, legal unit
Conceptual | Variable | The Variable set contains objects that describe the measurement of real world phenomena that are the subject of a statistical activity. | variables, incl. derived variables, statistical variables, characteristic
Conceptual | Classification | The Classification set contains objects that describe or delimit the values that can be used to measure real world phenomena that are the subject of a statistical activity. | classification, classification scheme, code
Gaps/differences
188. GSBPM gives equal focus to all phases of statistical production. GSIM in its current
state gives greater focus to the information objects required for the data collection
(acquisition) and dissemination phases; for example, the analysis phase is currently less
detailed in GSIM.
189. GSBPM does not clearly distinguish between the structure of the data file, database,
data record, data extraction or data set and the contents (i.e. the data). GSIM attempts to
distinguish clearly between these.
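As an illustration of the distinction drawn in paragraph 189, the following minimal sketch (Python, with hypothetical names; not part of GSIM) keeps the description of a data set's structure separate from its contents:

    # Minimal sketch (hypothetical names): structure described separately from contents.
    from dataclasses import dataclass, field

    @dataclass
    class DataStructure:
        """The structure: which variables a record of the data set contains."""
        name: str
        variables: list

    @dataclass
    class DataSet:
        """The contents: the records themselves, conforming to a structure."""
        structure: DataStructure
        records: list = field(default_factory=list)

    structure = DataStructure("Trade by partner country", ["partner_country", "flow", "value"])
    data_set = DataSet(structure, records=[{"partner_country": "SE", "flow": "import", "value": 1250}])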
Relationship between SDMX and GSIM
Differences in terms of modeling
190. SDMX is model-driven through the SDMX Information Model on a conceptual level.
All technology specifications in SDMX are implementations of this model. SDMX-ML is the
XML format for the exchange of SDMX-structured data and metadata. It is the detailed
technical implementation of the conceptual model. Content-oriented guidelines, including the
Metadata Common Vocabulary (MCV), are available to provide users with a uniform
understanding of standard statistical metadata concepts.
191. The models of GSIM (level 3) and SDMX are expressed in a similar way: in both cases
a conceptual model is available as a UML class diagram.
Similar constructs
192. There are strong relationships in the area of concept, variable, aggregate data, and
classification. In the area of concepts, variables, and classifications GSIM is a richer model,
being aligned more closely with ISO/IEC 11179 and the Neuchâtel concept/variable model.
GSIM code lists are essentially similar to SDMX Code lists but GSIM also includes an
explicit classification object, while this is subsumed into Code list in SDMX. In general
GSIM and SDMX are both aligned with ISO/IEC 11179. The GSIM cube data structure is
closely aligned with the SDMX data structure definition. The GSIM provider and provision
agreement objects, relating both to data acquisition and to data dissemination, are based on the
SDMX objects of the same names. GSIM, as it stands, does not include category and
category scheme objects to support categorization of model objects of all types, but these will
probably be added based on the SDMX approach.
193. Future work should focus on a detailed mapping between GSIM and SDMX objects.
This work has begun. The mapping includes detailed descriptions of the GSIM and SDMX
objects, together with notes (for example, whether the mapping is constrained in some way).
Table C2. Examples of mapping between GSIM and SDMX

GSIM Object | GSIM Definition | SDMX Object | SDMX Definition | SDMX Notes
Provision Agreement | A Provision Agreement is a set of agreements that exist around the exchange of data between a data provider and a statistical organisation. | ProvisionAgreement | Links the Data Provider to the relevant Structure Usage (e.g. Dataflow Definition or Metadataflow Definition) for which the provider supplies data or metadata. The agreement may constrain the scope of the data or metadata that can be provided, by means of a Constraint. | –
Cube data structure definition | A Cube Data Structure Definition describes the structure of a dataset where the data is aggregated. | DataStructureDefinition | Data Structure Definition (DSD) is metadata describing the structure and organisation of a data set, the statistical concepts and the code lists attached to them used within the data set. | –
Gaps/differences
194. GSIM is a much broader model than SDMX. SDMX has its emphasis on
macrodata/aggregate data. Gaps are therefore in the areas of data collection, methods and
microdata. Further missing areas are in the field of statistical processes. The GSIM Process
model is significantly different from the SDMX Process object. The SDMX object on
metadata inputs and outputs while the GSIM process objects aim to model statistical
processes in a richer fashion.
Issues/differences with terminology, definition and concepts
195. In GSIM a Category is an object that groups together real world items according to a
common property, whereas SDMX uses Category in the context of categorizing metadata
(and similar objects) as an aid to search and retrieval.
Relationship between DDI and GSIM
Differences in terms of modeling
196. DDI Lifecycle (DDI-L 3.1) is expressed in XML Schema as the detailed technical
implementation. At the same time, an implicit conceptual model is based on a particular view
of the data lifecycle. This is further described in two documents. A formal expression as a
UML class diagram is available as a working document for groups of the DDI Alliance. The
conceptual model of the next generation of DDI will be expressed as a UML class diagram.
GSIM and DDI-L are both aligned with ISO/IEC 11179. Cross-referenced field-level
documentation is available.
Similar constructs
197. DDI has its emphasis on microdata in the whole data life cycle. There are strong
relationships in the areas of data collection of surveys, concept, and variable. GSIM and DDI
are both aligned with ISO/IEC 11179. The GSIM unit data structure object and related
objects are closely aligned with DDI. The instrument, control constructs, and question
objects are also closely aligned with DDI.
198. The rich set of descriptions of data at the logical and physical level in GSIM is
borrowed from DDI. This covers, for example, the relationships between different logical
record types and the details of the storage format. The same is the case for the survey
instrument and the related control constructs for conditional processing.
199. A subset of process rules relates strongly to constructs in DDI. This comprises
coding instructions that pertain to data collection or data processing overall, such as the
handling of non-response to questions, imputation practices, suppression rules, and
instructions for recodes and derivations from multiple question or variable sources.
200. Future work should focus on a detailed mapping between GSIM and DDI objects.
This work has begun. The mapping includes detailed descriptions of the GSIM and DDI
objects, together with notes (for example, whether the mapping is constrained in some way).
Table C3. Examples of mapping between GSIM and DDI

GSIM Object | GSIM Definition | DDI 3.1 Object | DDI Definition | DDI Module | DDI Notes
Instance Variable | An Instance Variable is a characteristic of a unit being observed that may assume one or more of a set of values as used in a particular data resource. | Variable | Describes a variable contained in the variable scheme. | Logical product | Maps directly.
Classification | A Classification is an ensemble of one or more related lists of mutually exclusive categories. | Category Scheme | Contains descriptions of particular categories used as question responses and in the logical product. Their relationships and code values are described in the code scheme. | Logical product | Use as classification. CodeScheme can be used additionally if the classification has codes.
Gaps/differences
201. GSIM explicitly seeks to be more general than DDI in the data acquisition area. The
DDI focus in this area is essentially on survey questionnaires while GSIM seeks to handle
surveys, administrative data and registers, and other sources from business or the internet in
an even-handed fashion. The GSIM handling of aggregate data structures (cube data
structure) is more closely aligned to the SDMX data structure definition than to the DDI
ncube structure (although some aspects are borrowed from the ncube). GSIM modeling of
processes is richer and more general than the DDI objects in this area. GSIM uses different
terminology from DDI in modeling statistical activities and statistical collections but there are
similarities that will probably allow mapping. DDI focuses on grouping the metadata for
related activities into container objects and this is something that GSIM almost certainly
needs.
Issues/differences with terminology, definition and concepts
202. As noted above the handling of aggregate structures is more closely aligned to SDMX
than DDI and the terminology is also more closely aligned to SDMX. Also as noted above,
the terminology around higher-level objects for activities and programs is different. The
terminology in the handling of processes is also different.
DDI Codebook
203. DDI Codebook is an older version of DDI, focusing on a single study from an archival
perspective. It takes a variable-centric view and contains a subset of the objects of DDI
Lifecycle. A mapping path from DDI Codebook to DDI Lifecycle is available.
Relationship between SDMX and DDI and potential impacts on GSIM
204. There are a number of published documents6 that describe the relationship between
SDMX and DDI at a high level.
205. The detailed mapping of GSIM to DDI and SDMX depends on the detailed GSIM
model, the definitions of the GSIM objects, and clear relationships between the GSIM objects.
An approach for carrying out the mapping was developed. The mapping work has only started
and should be continued after Sprint 2. The detailed mapping could be used as a form of
review of the GSIM model, to check whether anything is missing. It could also be used as a
review of DDI and SDMX, to see where those standards have gaps.
206. Further work can be done in this area after Sprint 2.
Relationship between CORE and GSIM
Differences in terms of modeling
207. The CORE information model is organized as a structure consisting of six packages:
Data set definitions, Expressions, Parameters, Rules, Service info, and Messages. Each
package consists of classes and class interdependencies.
6 Exploring the relationship between DDI, SDMX and the Generic Statistical Business Process Model. Steven Vale, United Nations Economic Commission for Europe. http://www1.unece.org/stat/platform/download/attachments/57835554/EDDI+paper.pdf?version=1, http://dx.doi.org/10.3886/DDIOtherTopics01
DDI and SDMX: Complementary, Not Competing, Standards. Arofan Gregory, Pascal Heus. http://www.opendatafoundation.org/papers/DDI_and_SDMX.pdf
[Figure C1. Packages in the CORE information model: Data set definitions, Parameters, Expressions, Rules, Messages, Service info, Channels]
208. The dependencies between the packages are shown in Figure C1. Note that dependencies
are transitive: the Rules package also depends on the Data set definitions package, via the
Expressions package.
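The transitivity noted in paragraph 208 can be computed mechanically. The following minimal sketch (Python) encodes only the dependency chain stated in the text — Rules depends on Expressions, which depends on Data set definitions; any further edges would have to be read off Figure C1:

    # Minimal sketch: only the chain named in the text is encoded
    # (Rules -> Expressions -> Data set definitions); other edges would come from Figure C1.

    def transitive_dependencies(package, direct):
        """All packages reachable from `package` through direct dependencies."""
        seen, stack = set(), list(direct.get(package, ()))
        while stack:
            dep = stack.pop()
            if dep not in seen:
                seen.add(dep)
                stack.extend(direct.get(dep, ()))
        return seen

    direct = {"Rules": {"Expressions"}, "Expressions": {"Data set definitions"}}
    print(transitive_dependencies("Rules", direct))
    # contains both 'Expressions' and 'Data set definitions'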
Similar constructs
209. Both CORE and GSIM have Rule as an information object.
210. Both CORE and GSIM can use processes modeled by GSBPM or by another business
process model.
211. The CORE model knows of the existence of statistical information objects, but knows
nothing else about them. However, CORE is designed to be able to make use of GSIM
information objects.
212. Future work should focus on a detailed mapping between GSIM and CORE
information objects. This work has begun.
Table C4. Examples of mapping between GSIM and CORE

GSIM Set | GSIM Definition | GSIM Example | CORE Definition | CORE Example
Variable | The Variable set contains objects that describe the measurement of real world phenomena that are the subject of a statistical activity. | Concept, Variable, Contextual Variable, Instance Variable | A denotation and definition (including role and value type) of a column of data in a data set. | turnover (a variable); average turnover (a measure); NACE Rev. 2, second digit (a level); execution time (a logging indicator)
Rule | The Rule set contains objects that govern processes. | – | An expression together with a data set definition. Upon execution of a service, the expression is evaluated when it is applied to a row of data from a data set that conforms to the data set definition. | Examples of expressions: x < 100 and x > 10; if-then-else(x>5,x,5); SUM(x) / COUNT(x)
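As an illustration of the CORE Rule definition in Table C4 (an expression paired with a data set definition, evaluated against each row of a conforming data set), the following minimal sketch (Python; names are hypothetical) applies the example expression x < 100 and x > 10 row by row, and the aggregate expression SUM(x) / COUNT(x) to the data set as a whole:

    # Minimal sketch (hypothetical names): a CORE-style Rule as an expression plus a
    # data set definition, evaluated row by row against a conforming data set.

    def make_rule(dataset_definition, expression):
        def apply(row):
            # only rows that conform to the data set definition are evaluated
            assert set(dataset_definition) <= set(row), "row does not conform to the data set definition"
            return expression(row)
        return apply

    edit_rule = make_rule(["x"], lambda row: 10 < row["x"] < 100)   # "x < 100 and x > 10"

    data_set = [{"x": 55}, {"x": 7}, {"x": 120}]
    print([edit_rule(row) for row in data_set])                     # [True, False, False]

    # An aggregate expression such as SUM(x) / COUNT(x) would be applied to the data set as a whole:
    print(sum(row["x"] for row in data_set) / len(data_set))        # 60.666...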
Gaps/differences
213. In CORE, Figures, Time Series, Population, Unit, Variable and Value are used as
categories for the layers of the model. This construct has not been used in GSIM.
Issues/differences with terminology, definition and concepts
214. In CORE, classifications are modeled as rectangular data sets. In GSIM classifications
are modeled as hierarchical structures in accordance with the Neuchâtel model.
215. In CORE a variable is a kind of column, but a column is more general than a
variable: it could also be a dimension, measure, logging indicator or level (of a classification).
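To make the contrast in paragraphs 214 and 215 concrete, the following minimal sketch (Python; the codes shown are illustrative only) represents the same fragment of a classification once as a rectangular data set with one column per level, as in CORE, and once as a hierarchy of categories grouped into levels, in the spirit of the Neuchâtel model:

    # Minimal sketch (illustrative codes only): the same classification fragment as a
    # rectangular data set (CORE style) and as a hierarchy of levels and categories
    # (GSIM / Neuchâtel style).

    # CORE style: one column per level of the classification
    rectangular = [
        {"section": "C", "division": "10"},   # Manufacturing / Food products
        {"section": "C", "division": "11"},   # Manufacturing / Beverages
    ]

    # GSIM / Neuchâtel style: categories grouped into levels, with explicit parent-child links
    hierarchical = {
        "code": "C", "label": "Manufacturing", "level": "section",
        "children": [
            {"code": "10", "label": "Food products", "level": "division", "children": []},
            {"code": "11", "label": "Beverages", "level": "division", "children": []},
        ],
    }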
Relationships with other standards and models
Neuchâtel model
216. Classifications in GSIM are modeled based on the Neuchâtel model for
classifications. GSIM does not currently include the Neuchâtel correspondence table.
217. The Neuchâtel model for variables and related concepts models variables, units and
populations at a more detailed level than GSIM currently does.
Other models and standards
219. It is the intention that the following standards/models and implementation
syntaxes/tools will be examined in future GSIM work:
- ISO/IEC 11179
- ISO 19115 and related standards/models
- Google DSPL
- Open data RDF based vocabularies
- Dublin Core
- Standards-based/supporting software tools (Blaise, PC Axis, etc.)
Annex D. GSIM Metamodel
220. Each of the objects in GSIM will be described using the following metamodel (a
description of how the metadata objects are themselves structured).
221. Not all of this information has been created for the purposes of GSIM v0.4; this
information will be completed before release of GSIM v1.0 in December 2012.
Template

Object
Object ID:                          Object Version:
Object Name:
Definition:
Value Domain:                       Default Value:
Explanatory Text:
Synonyms:

Attributes (repeat as needed):
Property Name:
Description:
Cardinality:
Value Domain:

Relationships (repeat as needed):
Relationship Name:
Target Object ID:                   Target Object Name:
Description:
Relationship type:                  Relationship Cardinality:
Constraints:
Relationship Constraint Type:       Relationship Constraint Value:
Relationship Constraint Type:       Relationship Constraint Value:
Example

Object
Object ID: GSIM123                  Object Version: 1.0
Object Name: Variable
Definition: A variable is a characteristic of a unit being observed that may assume more than one of a set of values to which a numerical measure or a category from a classification can be assigned …
Value Domain: Compound Object       Default Value:
Explanatory Text: This object is used to…
Synonyms:

Properties:
Property Name: Weight Status
Description: Indicates whether a variable is or is not holding a value which is a weight
Cardinality: One and only one (1:1)
Value Domain: Logical (True/False)

Relationships:
Relationship Name: Measured concept
Target Object ID: GSIM-Con12345     Target Object Name: Concept
Description: Indicates that the Variable measures one and only one Concept …
Relationship type: Measures
Relationship Cardinality: One and only one (1:1)
Constraints:
Relationship Constraint Type: Ordered
Relationship Constraint Value: 1
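To make the template concrete in a machine-readable form, the following minimal sketch (an assumed Python rendering, not part of GSIM) captures the fields of the metamodel above and populates them with the Variable example:

    # Minimal sketch (assumed rendering): the metamodel template as data structures,
    # populated with the Variable example above.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Attribute:
        property_name: str
        description: str
        cardinality: str
        value_domain: str

    @dataclass
    class Constraint:
        constraint_type: str
        constraint_value: str

    @dataclass
    class Relationship:
        relationship_name: str
        target_object_id: str
        target_object_name: str
        description: str
        relationship_type: str
        cardinality: str
        constraints: List[Constraint] = field(default_factory=list)

    @dataclass
    class MetamodelObject:
        object_id: str
        object_name: str
        object_version: str
        definition: str
        value_domain: str
        explanatory_text: str = ""
        synonyms: str = ""
        default_value: Optional[str] = None
        attributes: List[Attribute] = field(default_factory=list)
        relationships: List[Relationship] = field(default_factory=list)

    variable = MetamodelObject(
        object_id="GSIM123", object_name="Variable", object_version="1.0",
        definition="A variable is a characteristic of a unit being observed that may assume "
                   "more than one of a set of values ...",
        value_domain="Compound Object",
        explanatory_text="This object is used to...",
        attributes=[Attribute("Weight Status",
                              "Indicates whether a variable is or is not holding a value which is a weight",
                              "One and only one (1:1)", "Logical (True/False)")],
        relationships=[Relationship("Measured concept", "GSIM-Con12345", "Concept",
                                    "Indicates that the Variable measures one and only one Concept",
                                    "Measures", "One and only one (1:1)",
                                    [Constraint("Ordered", "1")])],
    )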