Generic Statistical Information Model (GSIM): Communication (Version 0.4, May 2012)

DRAFT FOR REVIEW

Please note that the development of GSIM is a work in progress. GSIM v0.4 is not intended for official publication. Instructions for reviewers and a template for providing feedback are available at http://www1.unece.org/stat/platform/display/metis/GSIM+Version+0.4

About this document

This document is part of the Communication Layer of GSIM. It is aimed at subject matter statisticians, methodologists, process designers, business architects etc. It consists of one main paper with a number of annexes. It provides more detailed information about the information represented in GSIM (including definitions and diagrams of lower level objects), descriptions of how the model could be used, use cases, and descriptions of relationships to other models and standards.

This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/. If you re-use all or part of this work, please attribute it to the United Nations Economic Commission for Europe (UNECE), on behalf of the international statistical community.

Table of Contents

Introduction
  Scope
  Design Principles
  Overview of GSIM (as produced at the end of Sprint 2)
  Relationship to GSBPM
  Relationships to other standards
  Future work on GSIM
Annex A: Further detail on the model
  Activity
  Production
  Conceptual
  Information
Annex B: Using the Object level and Use cases
  Making use of the GSIM Object level
  Use Case 1: Design statistics on environmental investments by businesses
  Use Case 2: Harvesting Data off The Web
Annex C: Mapping to other standards and models
  Relationship between GSBPM and GSIM
  Relationship between SDMX and GSIM
  Relationship between DDI and GSIM
  Relationship between SDMX and DDI and potential impacts on GSIM
  Relationship between CORE and GSIM
  Relationships with other standards and models
Annex D. GSIM Metamodel

Introduction

1. The Generic Statistical Information Model (GSIM) is a reference framework of information objects, which enables generic descriptions of the definition, management, and use of data and metadata throughout the statistical production process.

2. GSIM provides a set of standardized, consistently described information objects, which are the inputs and outputs in the design and production of statistics. As a reference framework, it helps readers understand significant relationships among the entities involved in statistical production, and supports the development of consistent standards or specifications.

3. GSIM is one of the cornerstones for modernizing statistical production and moving away from traditional silos. By defining and grouping objects common to all statistical production, regardless of subject matter, GSIM enables statistical organizations to rethink how their business could be organized to generate economies of scale.

4. A model alone cannot transform an organization or its processes, but GSIM is modeled to allow for innovative approaches to statistical production to the greatest extent possible, for example in the area of dissemination, where demands for agility and innovation are increasing. At the same time, GSIM supports more traditional approaches to producing statistics.

Scope

5. GSIM provides the information object framework supporting all statistical production processes as described in the GSBPM, giving the information objects agreed names, defining them, specifying essential properties, and indicating their relationships with other information objects. It does not, however, make assumptions about the standards or technologies used in implementation.

6. The information objects defined include those that allow the specification and introduction of new data sources for more innovative data collection, and also the generation of new statistical products.

7. GSIM does not include information objects related to supporting business functions within an organization, such as human resources, finance, or legal functions, except to the extent that this information is used directly in statistical production.

Design Principles

8. The following design principles were used in developing GSIM:

1. GSIM supports GSBPM and covers the whole statistical process
2. GSIM can also be used stand-alone in any statistical production environment
3. GSIM has an intuitive appeal to all stakeholders
4. GSIM supports the design, documentation and maintenance of statistical products
5. GSIM enables explicit separation of the design and production phases
6. GSIM enables both traditional and new ways of producing statistics
7. GSIM supplies links between process steps at all desired levels of granularity
8. GSIM provides a basis for common understanding of information objects and their definitions
9. GSIM uses a layered approach
10. GSIM contains information objects only down to the level of agreement between key stakeholders
11. GSIM is robust, but can be easily adapted and extended to meet users' needs
12. GSIM objects and relationships are represented as simply as possible
13. GSIM makes optimal reuse of existing terms and definitions
14. GSIM does not refer to any specific IT setting or tool
15. GSIM defines and classifies its information objects appropriately, including specification of attributes, relations and operations.

9. Background to the development of GSIM

10. The need for a Generic Statistical Information Model (GSIM) was first agreed at the 2010 Meeting on Management of Statistical Information Systems (MSIS). Since then, various statistical organizations have been working together to develop the model. Figure 1 provides an overview of the history of GSIM development.

Figure 1. Milestones in the development of GSIM

11. The development of GSIM forms a key part of the strategic vision of the High Level Group on Strategic Business Architecture for Statistics (HLG-BAS) - a group of heads of National Statistical Institutes and International Agencies that support a common vision to modernize statistical production.

"To enable statistical organizations to arrive at standardized generic industrialized production of statistics, we first need to find one another at the conceptual level ... under the umbrella of the GSBPM and the GSIM. This is a very high ambition which will take time." HLG-BAS Strategic Vision (June 2011)

12. GSIM is a cornerstone that needs to be developed in order to fulfill the HLG-BAS vision. Given this, development work was accelerated under the sponsorship of the HLG-BAS. The agreed approach to rapidly develop GSIM was to conduct two "sprint" sessions between February and April 2012. Each sprint lasted two weeks, the first being held in Ljubljana, Slovenia from 20 February to 2 March 2012 and the second in Daejeon, Republic of Korea from 16 to 27 April 2012.

Overview of GSIM (as produced at the end of Sprint 2)

What is new in GSIM v0.4

13. This version of GSIM reflects the work to further develop the model at the second sprint event, building on GSIM v0.3 from the first sprint. A range of feedback was received on v0.3 of GSIM and this has been reviewed and integrated into v0.4.

14. Key changes between GSIM v0.3 and v0.4 include:
- an increased number of information objects
- elaborated definitions of information objects and the relationships between them
- improved alignment of GSIM with the DDI, SDMX, Neuchâtel and ISO/IEC 11179 standards
- changes to the structure of the model to reflect a deeper understanding of the role and relationships of the sets of information objects within the model
- a review of terms used for object groups (now Activity, Production, Conceptual and Information) and for the information objects themselves
- revised structure and content of GSIM documentation

GSIM v0.4

15. Similar to GSBPM, GSIM uses a layered approach. There are a limited number of items at the highest level, with more detail given at the lower levels of the model. Each level is designed to be used for a particular purpose and by different audiences.

Group Level

16. The Group level is an overview level aimed at explaining GSIM to top managers. It shows the different groups of information objects that GSIM consists of.
Figure 2. Group level diagram of GSIM

17. The Group level diagram (Figure 2) is designed to be read clockwise, starting with Activity. A statistical agency starts an Activity; this initiates Production, which uses information objects in the Conceptual group and produces Information. The Activity group contains sets of information objects required to manage the programs that make up statistical production. The Production group contains sets of information objects that describe the processes, methods and rules that are used in statistical production. The Conceptual group contains sets of information objects that describe the concepts used and their practical implementation, allowing users to understand what the statistics are measuring. The Information group contains sets of information objects that describe the results of the stages of statistical production.

Set level

18. The Set level is a communication level for statisticians, process designers, methodologists, and business architects.

19. At this level, the Groups are expanded to show sets of information objects which are at a lower level. The Sets are not information objects in themselves. Rather, they are representative of the information objects that are at the lowest level of the model. For example, the Population Unit set includes the information objects population and statistical unit.

Figure 3. Set level diagram of GSIM

Object level

20. The Object level is the most detailed level of GSIM. It is a specification level for IT architects and metadata specialists. Figure 4 shows an indicative UML model of the information objects at this level, grouped by Activity, Production, Conceptual and Information.

Figure 4. UML diagram of Object level

21. Further details about the Object level can be found in the Annexes of this document. Annex A provides detailed models of each Group, including lower level objects, definitions, relationships and explanatory text. Annex B provides information about how to use the information objects and gives two use cases. Annex D provides the GSIM metamodel.

22. Figure 5 below shows an alternative view of the Object level. It is not a description of the entire Object level. It can be used as a description for users who are not interested in the detailed Object level but are still interested in some objects and relationships.

Figure 5. Communication diagram of Object Level

Methodology and Quality in GSIM

Methodology

23. Methodology is embedded in GSIM in the components, rules and parameters. These information objects implement methodology in the design of a statistical production process. The document that specifies the design of a statistical production process lays down the methodology of this process.

Figure 6. Methodology in GSIM

24. Below is an excerpt from the Dutch use case on environmental investments. Further detail on this use case can be found in Annex B.

In order to specify the methodology one could refer to a standard method that identifies 1000-errors. For example, one could compare the value of the variable to be validated to a so-called reference value. If the ratio is larger than a parameter a, say a=400, then there is probably a 1000-error. Now, in this example, the process step that validates and corrects 1000-errors should 'know' which variables have to be checked, the specification of the edit rules in terms of the parameter a, and the specification of the parameter a. We note that the specification of the parameter a may be different for different variables.
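(Purely as an editorial illustration, and not part of the use case text itself, the edit rule and correction just described could be sketched as follows in Python; the function, variable and parameter names are assumptions introduced only for this sketch.)

# Minimal sketch (assumed names): flag a probable 1000-error and correct it.
def detect_and_correct_1000_error(value, reference_value, a=400):
    """Edit rule: ratio of the value to its reference value larger than a."""
    has_1000_error = reference_value != 0 and (value / reference_value) > a
    corrected = value / 1000 if has_1000_error else value   # imputation part of the component
    return has_1000_error, corrected

# Example: a value reported in euros instead of thousands of euros.
flag, corrected = detect_and_correct_1000_error(value=5_200_000, reference_value=5_000)
# flag is True (ratio 1040 > 400), corrected is 5200.0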
The output of the process step always has the same structure: it provides quality indicators for each variable that has been checked and a new status for the values of these variables. One comment is in order.

- The reference values are used as auxiliary information. The designer of the flow should consider whether these reference values stem from an external source (and should be modeled as an additional input source) or from an activity inside the process. In the latter case this activity should also be modeled. This could be a third (preparing) sub-process.

Required objects:
Rule: there is a linear edit involved: variable x > a multiplied by variable y.
Component: There is an edit part in the component: if the edit is 'violated', then variable z is '1000-error', otherwise variable z is 'not 1000-error'. There is an imputation part in the component: if z is '1000-error', then variable x = variable x divided by 1000; otherwise if z is 'not 1000-error' then variable x = variable x.
Parameter: a

Quality

25. While methodology is embedded in the design of the statistical production process, quality is linked to the instances (i.e. to production runs) of the process.

26. Quality appears at different levels in instances of information objects: for example, as an attribute of an information element (e.g. a quality flag), or as an attribute of a data set (e.g. a status of provisional, final or revised data). It also appears as process quality information, for instance the OK-index in the use case on Dutch Environmental Investments (see Annex B for further details), which is used both for selecting records for imputation and as a process quality indicator. Product quality is laid down in a quality report, which is itself also a statistical product.

Relationship to GSBPM

27. GSIM and GSBPM are complementary models for the production and management of statistical information. GSIM focuses on information objects used and/or produced in a statistical business process. The GSBPM, as a common reference model for the statistical business process, is intended to apply to all activities undertaken by producers of official statistics that result in information outputs.

28. Just as a particular statistical business process may not require every sub-process described within the GSBPM, not every information object in GSIM is necessarily required to be used and/or produced in the course of every statistical business process.

29. As described in the strategic vision of the High Level Group for Strategic Directions in Business Architecture in Statistics, much greater value will be obtained from GSIM if it is applied in conjunction with GSBPM. Likewise, greater value will be obtained from GSBPM if it is applied in conjunction with GSIM. For example, harmonization of metadata is necessary in order to achieve standardization of production processes.

30. Nevertheless, just as GSBPM has been applied to date without GSIM, it is possible (although usually less than ideal) to apply GSIM without GSBPM. For example, an agency may currently be using a local variation on GSBPM to model its statistical business processes, rather than using GSBPM itself. This decision in regard to modeling statistical business processes should not necessarily prevent it from deciding to apply GSIM as a reference framework for statistical information.
31. In the context of GSBPM, GSIM can be harnessed as a tool to help describe and define the interrelated set of sub-processes within a statistical business process and the types of information used in those processes to produce official statistics.

32. Applying GSIM as a reference framework in this manner can:
- facilitate building efficient metadata-driven collection, processing, and dissemination systems
- help harmonize statistical computing infrastructures

33. GSIM identifies and describes the information, both data and metadata, supporting the GSBPM phases, sub-processes and the overarching processes of Quality Management and Metadata Management. Good metadata management is essential for the efficient operation of statistical business processes. Metadata are present in every phase of the GSBPM, either created, updated or carried forward unchanged from a previous phase. In the context of the GSBPM, the emphasis of the over-arching process of metadata management is on the creation, updating, use and reuse of statistical metadata. Metadata management strategy and systems are therefore vital to the operation of the GSBPM.

34. GSIM is intended to support the overarching strategies and systems used to create and manage metadata, as well as the statistical processing systems which then harness this well designed and managed metadata.

35. All process steps in GSBPM should be documented, and thus provide reference metadata. Several aspects of reference metadata are covered by GSIM objects. For example, the conceptual part of reference metadata is represented in GSIM in the group of Conceptual objects; the methodological and procedural aspects of reference metadata are represented by GSIM objects in the Production group. Other aspects of reference metadata may be modelled by means of object attributes in future GSIM versions. Reference metadata can be attached to any information object and, as such, can be an input to as well as an output of a GSBPM process step.

36. GSIM supports a consistent approach to metadata, facilitating the primary role for metadata envisaged in the UNECE guide Statistical Metadata in a Corporate Context (http://www1.unece.org/stat/platform/display/metis/Part+A+-Statistical+Metadata+in+a+Corporate+Context), i.e. that metadata is to uniquely and formally define the content and links between objects and processes in the statistical information system.

Relationships to other standards

37. One of the design principles of GSIM is to make optimal reuse of existing terms and definitions, wherever possible, to facilitate the use of the reference framework. In developing GSIM, many existing models and standards have been examined, both to determine the best approach for describing GSIM's objects, and also to test the completeness and usability of the resulting GSIM model. Annex C explores relationships between GSIM and a number of other standards and models.

38. GSIM must be implementable: in order to support the implementation of the GSIM reference framework, many known standards and tools have also been examined, to ensure that the reference framework is complete and useful in this respect. The relationship between GSIM and other models and standards is two-fold: the standards and models serve as inputs to the creation of GSIM, and also act as targets for the use of GSIM within organizations.
39. By taking this approach, it is hoped that the GSIM model will be as similar as possible to the information which user organizations already have within their statistical production systems, allowing GSIM to be more understandable and easier to implement.

40. Figure 7 illustrates how different relevant standards, models, and implementation syntaxes and tools relate to GSIM. Standards/models that have provided significant input to the GSIM model are presented on the left hand side of the figure. Implementation syntaxes and tools that are currently of relevance to an implementation of GSIM are presented on the right hand side of the figure. This list will become outdated as more and more implementation syntaxes and tools are developed. The particular software packages listed are widely used in statistical organizations, but are intended to be illustrative examples, and are not a complete list.

Figure 7. GSIM and its relationship to other relevant standards and models. (* There are too many others to show in the diagram.)

Future work on GSIM

Roadmap for getting to GSIM v1.0

41. A detailed roadmap, outlining the work to be done to develop public release v1.0 of GSIM, has been prepared. It comprises three parts:

Additional modeling to complete the Specification Layer of GSIM

42. Completing the Specification Layer for inclusion in GSIM v1.0 will require additional modeling. Mechanisms to achieve this may include the establishment of short-term task forces for each part of GSIM.

Establish GSIM Integration Team and conduct workshop

43. A workshop will be held to integrate work on the Specification Layer and to review the input received from external reviewers of the Overview layer and Communication layer as documented within GSIM v0.4. The output from this workshop would form GSIM v0.8, complete with all levels of detail required for implementation, which would then need another round of review.

Preparation of GSIM v1.0

44. Analyze and address the external review feedback on the complete documentation (GSIM v0.8) and draft the proposed specification of GSIM v1.0. An accompanying Communication Plan and User Guide would also be completed. These materials would need to go through the relevant processes to be endorsed and adopted by the statistical community.

Beyond GSIM v1.0

45. As a newly developed framework, it is expected that additions and changes will be identified as GSIM is applied in practice. An updated version of GSIM (e.g. v1.1) may be warranted, for example, within a year of the release of v1.0.

46. Processes will be established to capture feedback from practical use and feed this into the further evolution of GSIM. A process for setting and developing a release schedule will be established, together with a process for design and stakeholder review of proposed new releases. New releases will be designed in a way which minimizes the extent of change from previous versions (e.g. maximizes backwards compatibility) while still meeting the business needs which initiated development of a new release.

47. Releases of updated versions would be expected to become less frequent once the initial set of additional requirements encountered through widespread practical application of GSIM had been addressed. New requirements within the community of producers of official statistics, however, could still initiate the need for updated versions of GSIM.

Annex A: Further detail on the model
49. The contents of this annex are the working notes from Sprint 2. It is known that they are not comprehensive and not fully integrated at this time. For example, only main relationships are shown in the diagrams. The current notes will be rationalized and detailed as part of the process of completing the Specification Layer. Readers are welcome to provide detailed feedback now, but will have a further opportunity when the first full draft of the Specification Layer is released for review in September 2012.

50. The objects in the detailed models of the Specification Layer have the same colour as the objects in the Group and Set levels. Hence the objects in the detailed Specification Layer diagrams describing Activity are blue, Production are red, Conceptual are green and Information are yellow. Information objects can appear in more than one Specification Layer diagram; if so, the Specification Layer object will have the colour of the group object that it belongs to.

Activity

Figure A1. Set level diagram of Activity group

51. This part of the model outlines the information that identifies and defines the statistical production activities (within the scope of GSIM) undertaken by agencies and the information that is required to describe the contributing processes.

52. An agency will receive an Information Request that identifies the information that a person or organisation in the user community requires for a particular purpose (defined in relation to a concept and population). From this Information Request an agency will generate a Business Case, and it is this that initiates a new Statistical Program. As the Statistical Program is initiated, a set of Requirements will be developed that are informed by the Information Request.

Note: this community may include units within the agency as well as external to it. For example, a unit responsible for compiling National Accounts may need a new statistical activity to be initiated to produce new inputs to its compilation process.

53. Processing of the Statistical Program will support two distinct use cases. It supports both the traditional approach of collecting data for a particular need, and also the emerging and future approach of collecting data and producing new outputs based on an existing data source that is maintained and added to over time.

54. In the case of the traditional silo/stovepipe approach, an agency receives an Information Request and a set of Requirements, and approves a Business Case. When this happens, a new Statistical Program is initiated. This Statistical Program will identify the Data Resource that it will need (existing or needing to be created). Once designed, the Statistical Program will have one or more iterations of a Statistical Project, to investigate a set of characteristics for a given Population in relation to a particular time period. If the identified Data Resource is not sufficient for the purposes of the Statistical Project, an Acquisition Program will be initiated, together with an Acquisition Project (for each instance of the time period) which will add Datasets to the Data Resource. Once this is complete, the Statistical Project will use a particular Dataset from the Data Resource to produce one or more Products or Services.

55. A possible future approach relates to a continuous collection process. In the age of 'big data', the cost of collecting and storing data (e.g. administrative data) is low. An agency can feasibly collect data on a continuous basis without a particular Statistical Project, Product or Service in mind. In this case the agency has an Acquisition Program which consists of one or more Acquisition Projects that gather data relating to a particular point in time and add that data to the Data Resource. This Data Resource may then be used by any Statistical Program in the future.
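As an editorial illustration only, the relationships described in the two use cases above could be sketched roughly as follows; the class and attribute names are assumptions introduced for this sketch, not normative GSIM definitions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Dataset:
    reference_period: str

@dataclass
class DataResource:
    datasets: List[Dataset] = field(default_factory=list)

@dataclass
class AcquisitionProject:              # gathers data for one reference period
    reference_period: str
    def run(self, resource: DataResource) -> None:
        resource.datasets.append(Dataset(self.reference_period))

@dataclass
class AcquisitionProgram:              # may run with or without a Statistical Program in mind
    projects: List[AcquisitionProject] = field(default_factory=list)

@dataclass
class StatisticalProject:              # one iteration of a Statistical Program
    reference_period: str

@dataclass
class StatisticalProgram:
    data_resource: DataResource
    projects: List[StatisticalProject] = field(default_factory=list)

# Continuous acquisition: projects keep adding datasets to a shared Data Resource,
# which any future Statistical Program can then draw on.
resource = DataResource()
acquisition = AcquisitionProgram(projects=[AcquisitionProject("2012")])
for project in acquisition.projects:
    project.run(resource)
census = StatisticalProgram(data_resource=resource, projects=[StatisticalProject("2012")])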
56. Acquisition Project represents the gathering of data from an internal or external source. The objects concerned with the process of gathering data are detailed further below.

Figure A2. Object level diagram of Activity group

57. Other objects needed for describing the collection of data are portrayed in Figure A5, and cover surveys, administrative register data, data reported electronically, data obtained from websites, data generated by devices, data collected through clinical procedures, and any other form of data acquisition. The aim is to provide a flexible model which can be used to represent the objects required to support existing techniques for data collection, but which also supports new and innovative approaches. While we provide detailed information for some types of data acquisition (surveys, administrative registers, and Internet robots) in this model (see the description of the Production group below), it is anticipated that additional specific types could be described in future, as they become more important for statistical organizations.

58. Note that the term "statistical organization" is used here to describe the agency acting as the data collector; this organization could be a division within the overall statistical agency, collecting data from within another division of the same organization. The model is useful in describing both data acquisition from external organizations and individuals, and also data acquisition internal to the organization.

59. The Instrument object describes the tool used to collect data. This could be a traditional survey form, an administrative register, a clinical procedure, a software agent scraping data from websites, or any other tool. Instrument is described from the perspective of the statistical organization collecting the data. The contents of an administrative register might be originally collected using printed forms of some type, by the administrative agency, but this information is not recorded as an Instrument description for the purposes of GSIM. It might instead be captured as quality information about the register's contents if relevant. The Administrative Register itself would be considered as the Instrument in such a case.

60. The Provision Agreement represents a set of agreements around the provision of data by the Data Provider to the statistical organization. This could be a service-level agreement, a legal mandate, the terms of mutual agreement, or any other terms/conditions which affect the provision of data. In some cases these terms will be drawn from elsewhere in the GSIM model (for example, the agreed structure of the data to be provided, the use of specific classifications or concepts, etc.).

Table A1. Definitions of information objects in the Activity Group
(Columns: GSIM Group | GSIM Set | GSIM Object | Definition | Source | Example)

Activity | - | - | The group Activity contains sets of objects that are required to manage the programs that make up statistical production. This includes data acquisition, statistical and dissemination activities. | GSIM Sprint 2 | Information Request, Acquisition Program, Statistical Program, Dissemination Program
Activity | Information Request | - | The Information Request set contains objects that describe the data required for a particular purpose. | GSIM Sprint 2 | Business Case, Requirements
Activity | Information Request | Business Case | A Business Case gives the reason for investing in and undertaking a particular statistical activity. | GSIM Sprint 2 | -
Activity | Information Request | Information Request | An Information Request outlines the data required for a particular purpose. | GSIM Sprint 2 | -
Activity | Information Request | Requirement | A Requirement is a specification of details of the concepts, population and outputs required to meet a particular information need. | GSIM Sprint 2 | -
Activity | Acquisition Program | - | The Acquisition Program set contains objects to represent a set of activities undertaken by statistical agencies to gather data. | GSIM Sprint 2 | Acquisition Project, Instrument
Activity | Acquisition Program | Acquisition Program | An Acquisition Program is a set of activities undertaken by statistical agencies to gather data. | GSIM Sprint 2 | Korean Population Census processing and analysis design
Activity | Acquisition Program | Acquisition Project | An Acquisition Project is a set of activities undertaken by statistical agencies to gather data relating to a particular reference period. | GSIM Sprint 2 | Korean Population Census 2012 processing and analysis
Activity | Dissemination Program | - | The Dissemination Program set contains objects to represent a set of activities undertaken by statistical agencies to provide data to users. | GSIM Sprint 2 | -
Activity | Dissemination Program | Dissemination Program | A Dissemination Program represents a set of activities undertaken by statistical agencies to provide data to users. | GSIM Sprint 2 | Korean Population Census dissemination design
Activity | Dissemination Program | Dissemination Project | A Dissemination Project is a set of activities undertaken by statistical agencies to provide data relating to a particular reference period to users. | GSIM Sprint 2 | Korean Population Census 2012 dissemination
Activity | Dissemination Program | Data Provider | A Data Provider is an individual or organization that makes data available to the statistical organization. | GSIM Sprint 2 | Respondent participating in a survey, national statistical organization providing data to an international organization
Activity | Dissemination Program | Provision Agreement | A Provision Agreement is a set of agreements that exist around the exchange of data between a data provider and a statistical organization. | GSIM Sprint 2 | Service-level agreement, legal mandate, the terms of mutual agreement.
Activity | Statistical Program | - | The Statistical Program set contains objects to represent a set of activities to investigate characteristics of a given population. | GSIM Sprint 2 | Methodology, Statistical Project
Activity | Statistical Program | Statistical Program | A Statistical Program is a set of activities to investigate characteristics of a given population. | GSIM Sprint 2 | Korean Population Census collection design
Activity | Statistical Program | Statistical Project | A Statistical Project is a set of activities to investigate characteristics of a given population for a particular reference period. | GSIM Sprint 2 | Korean Population Census 2012 collection

Production

Figure A3. Set level diagram of Production group

61. When an agency conducts a Statistical Program or an Acquisition Program, a series of activities is undertaken in order to achieve the desired outcome.
The activities are affected by Process Step Execution, using a set of Inputs and Outputs. The Process Design, Process Step Definition, Process Method, and Rule define these steps, and they are performed by the Process Agent.

62. A process may use any GSIM object as an Input. The Input information object is substituted for the particular instance of an information object. The Input can be identified as one of three types. The transformable Input is used as a placeholder for any information object that will be changed by the process (e.g. the status of a dataset changes from provisional to final). The parameter Input represents any information object or attribute of an information object that guides the process (e.g. accept all matches with similarity index 87 or higher, where similarity index is the parameter and 87 is the value). The process support Input covers any information object that is required as ancillary information without which the process could not complete (e.g. a code list to validate against) and which is not transformed by the process.

63. Any object within GSIM can be produced as an Output from a process. Any information object that is transformed or created during a process is a transformed Output. Any information object that describes a measure of the process (e.g. time taken) is a Process Metric. Both of these Outputs are referred to by Process Control in order to initiate the next step in the process. Reference metadata attached to any of the input information objects may also be modified by a process, and new reference metadata may be created and attached to any output information object.

64. An agency will have a pre-defined set of Process Step Definitions (processes as defined in the GSBPM or another process model). A Process Step Definition describes what is undertaken by a process but is not itself active. A Process Step Definition may have sub-Process Step Definitions (e.g. the Process Step Definition "Dissemination" will include the Process Step Definition "publish data to website" as a sub-Process Step Definition). An agency will also have a pre-defined set of Process Methods. A Process Method describes how a process is carried out. A Process Method may identify (via Process Step Design) the types of inputs (e.g. datasets) that are required and the outputs produced by a process, but it does not specify the instances used. A Process Method has a series of Rules which guide the actions to be performed. Each Process Step Definition is likely to have several Process Methods that can be used for undertaking the activity. Together these may form a 'process library'. For example, the Process Step Definition for the activity 'imputation' may have the Process Methods 'donor' and 'heuristic'. Each Process Method may use Rules in its specification.

65. During the design process an agency will create Process Step Designs (each may have sub-Process Step Designs). A Process Step Design identifies the Process Step Definition that is undertaken by the process and the Process Method that describes how it is undertaken. The Process Step Design also identifies specific instances of information objects that are required as inputs or created as outputs; for example, an agency would specify that a specific dataset is an input. The Process Step Design is not active, but is a description of the resources required for a process to be executed. In summary: a Process Step Design chooses a Process Step Definition, a Process Method, and a Process Agent that together form the design of the process.
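To make the separation between design-time and run-time objects more concrete, the following minimal sketch follows the object names used above, while the attribute names and example values are illustrative assumptions rather than normative GSIM content.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Rule:
    expression: str                          # e.g. "maximum 2 imputation trials"

@dataclass
class ProcessStepDefinition:                 # what is undertaken (not itself active)
    name: str                                # e.g. "Imputation"

@dataclass
class ProcessMethod:                         # how the process is carried out
    name: str                                # e.g. "donor imputation"
    rules: List[Rule] = field(default_factory=list)

@dataclass
class ProcessStepDesign:                     # design-time choice of definition, method and agent
    definition: ProcessStepDefinition
    method: ProcessMethod
    agent: str                               # Process Agent, e.g. a technology system

@dataclass
class ProcessStepExecution:                  # run-time record of the process having taken place
    design: ProcessStepDesign
    inputs: Dict[str, object]                # transformable, parameter and process support inputs
    outputs: Dict[str, object] = field(default_factory=dict)
    process_metrics: Dict[str, object] = field(default_factory=dict)

design = ProcessStepDesign(
    definition=ProcessStepDefinition("Imputation"),
    method=ProcessMethod("donor imputation", rules=[Rule("maximum 2 imputation trials")]),
    agent="imputation service",
)
execution = ProcessStepExecution(design=design,
                                 inputs={"transformable": "dataset X",
                                         "parameter": {"similarity_index": 87},
                                         "process_support": "code list"})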
66. Once created, a Process Step Design may be carried out. A Process Step Execution identifies any further Inputs or Outputs required to execute the process at run-time and captures a record of the process having taken place. The Process Step Execution will have various types of Inputs and Outputs associated with it, as determined by the requirements of the Process Step Design.

67. Process Control is used to manage Process Step Execution. A Process Step Execution may be a planned activity defined by a Process Control, or be triggered by a Process Control that assesses results from the completion of a previous Process Step Execution. One of the Outputs from a Process Step Execution will be Process Metrics that feed into a decision point within the Process Control (based upon a set of Rules), which may trigger another Process Step Execution if the Process Metric satisfies a particular Rule.

Figure A4. Object level diagram of Production group

68. For the purposes of illustration, Figure A5 below extends some of the objects in the figure above to show different instances of the Instrument object, and also shows how the Data Resource can be extended with a Staging Data store (receiving the data from a Data Provider). The Staging Data store represents the collected data within the statistical organization in the form exactly as provided, and will often not be structured in the fashion needed for internal processing. Thus, the contents of this store are likely to require transformation into an internal format before becoming useful for creating statistical products. Similarly, the Input Data store represents collected data which has been processed to the point where it is useful for the statistical organization's production systems.

Figure A5. Instrument Detail

69. Various sub-classes are also illustrated here, to show how the Instrument could take a variety of guises, each containing specialized information specific to its type:
- An Administrative Register is a repository of data collected for non-statistical purposes, but made available to the statistical organization for the creation of statistical products.
- A Survey Instrument is a form used for the explicit purpose of gathering statistical data. Surveys may be administered in a variety of modes (as paper forms, in face-to-face interviews, as online interviews, as computer-assisted aids in telephone interviews, etc.). The Survey Instrument provides a generic description of all types of modes.
- The Internet Robot Instrument will hold information about software agents which are designed to visit sites on the internet and collect data by programmatically harvesting it.
- The Generic Instrument is to be used when more detailed descriptions (as for Survey, Administrative Register, and Internet Robot Instruments) are not needed. It may be used to describe those types of instruments when more detail is not needed, or can be used to describe non-Survey, non-Administrative Register, and non-Internet Robot Instruments.

70. The part of the model illustrated in Figure A6 shows the kind of extended properties which some classes of data collection instruments may contain.

71. It is based on objects and properties found in the DDI-Lifecycle standard and in many of the popular computer-assisted interviewing software packages such as Blaise. It is capable of describing surveys in a simple manner – as a sequence of questions, statements, and instructions – but may also be used to describe detailed information about dynamic presentations and flows.

Figure A6. Survey Instrument Detail

72. Survey Instrument is the form used to collect data from respondents, including printed forms, those filled out via telephone, online surveys, face-to-face interviews, etc.

73. Response Unit is the statistical unit being interviewed.

74. Question is the text used to interrogate the respondent, including all properties and relationships such as its response domain and relationships to statistical concepts. A survey may contain multiple questions. Question text may be static or dynamic. Dynamic text is produced by computations based on prior known values, earlier responses collected within the survey, or mode values.

75. The Mode of a survey tells us how the survey is being conducted, and one or more modes may be employed for a particular survey (for example, Paper, Computer-Assisted Phone Interview, Computer-Assisted Self-Interview, Web form, etc.). The list of modes will potentially grow in future, and will vary from organization to organization, and thus is not specified within GSIM. The mode of a survey can be used in the conditional flow logic of the survey and in computation constructs.

76. Interviewer Instructions are the text or other information provided to the interviewer to help in conducting the interview. In a self-completion scenario, the respondent is considered to be the interviewer.

77. Control Construct is an object which is used to describe the flow logic of a survey. Control constructs may be combined with other control constructs and can include references to external prompts such as pictures, sound files, or other media appropriate to the survey instrument's mode.
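As a rough, non-normative illustration of how Questions and Control Constructs could describe survey flow logic, a simple sequence with a conditional branch might look like the sketch below; the construct helpers and question names are assumptions introduced only for this example.

# Illustrative sketch of survey flow logic built from control constructs.
def sequence(*constructs):
    """A Sequence control construct: run each child construct in order."""
    def run(answers):
        for construct in constructs:
            construct(answers)
    return run

def question(name, text):
    """A Question construct: ask the respondent and record the answer."""
    def run(answers):
        answers[name] = input(text + " ")
    return run

def if_then_else(condition, then_construct, else_construct=None):
    """An If-Then-Else control construct guarding part of the flow."""
    def run(answers):
        if condition(answers):
            then_construct(answers)
        elif else_construct is not None:
            else_construct(answers)
    return run

survey_flow = sequence(
    question("employed", "Are you currently employed? (yes/no)"),
    if_then_else(lambda a: a["employed"].strip().lower() == "yes",
                 question("hours", "How many hours did you work last week?")),
)
# answers = {}; survey_flow(answers)   # would run the instrument interactively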
Table A2. Definitions of information objects in the Production Group
(Columns: GSIM Group | GSIM Set | GSIM Object | GSIM Object sub-type | Definition | Source | Example)

Production | - | - | - | The group Production contains sets of objects that describe the processes, methods and rules that are used in statistical production. | GSIM Sprint 2 | Process Control, Process Component, Rule
Production | Process Control | - | - | The Process Control set contains objects that describe the sequence and selection of processes based on an assessment of outputs according to a set of rules. | GSIM Sprint 2 | Process Step Execution, Input, Output
Production | Process Control | Input | - | An Input is an information object which is used by a process step. | GSIM Sprint 2 | Classification to validate against, variable to be used in the derivation of another variable, number of repeats
Production | Process Control | Input | Parameter Input | A Parameter is an input to a process step execution which controls the tasks performed and potential outputs. The parameter may be an information object or an attribute of one. | GSIM Sprint 2 | e.g. match with 87% similarity
Production | Process Control | Input | Process Support Input | A Process Support Input assists a process step execution in the completion of its tasks. The process support input is not transformed during process execution. | GSIM Sprint 2 | e.g. code list to validate against
Production | Process Control | Input | Transformable Input | A Transformable Input is the information object on which the process method is applied during process execution. It may become a transformed output. | GSIM Sprint 2 | -
Production | Process Control | Output | - | An Output is the product of a process step execution. It consists of the transformed output and the process metric. | GSIM Sprint 2 | -
Production | Process Control | Output | Transformed Output | A Transformed Output is a new information object or a modified transformable input of a process step execution. | GSIM Sprint 2 | -
Production | Process Control | Output | Process Metric | A Process Metric is a collection of information objects and/or attributes of information objects that describes and helps assess a process step execution and the transformed information object(s) it produced. | GSIM Sprint 2 | -
Production | Process Control | Process Control | - | A Process Control is a set of rules that guides the execution and sequencing of processes. It can contain scheduling and rules for triggering a process to begin. | GSIM Sprint 2 | -
Production | Process Control | Process Step Execution | - | A Process Step Execution is the act of performing a particular process according to a schedule or trigger. A Process Step Execution identifies all other inputs required for the process to be run. | GSIM Sprint 2 | -
Production | Process Component | - | - | The Process Component set contains objects that describe the process step definition and design, the methods used and the process agent. | GSIM Sprint 2 | Process Step Definition, Process Step Design, Process Method, Process Agent
Production | Process Component | Methodology | - | A Methodology is a specification of the processes to undertake a statistical activity. | GSIM Sprint 2 | Seasonal adjustment, imputation
Production | Process Component | Process Agent | - | A Process Agent is the actor that performs a process. | GSIM Sprint 2 | Technology system, organizational unit, survey instrument
Production | Process Component | Process Agent | Technology System | A Technology System is a set of automated methods for performing a process. | GSIM Sprint 2 | -
Production | Process Component | Process Agent | Organizational Unit | An Organizational Unit is a person or a team that performs a process. | GSIM Sprint 2 | -
Production | Process Component | Process Method | - | A Process Method is one or more ordered actions to be followed to achieve a particular outcome. | Based on Statistics New Zealand model | donor imputation, heuristic imputation
Production | Process Component | Process Method | Survey Instrument | A Survey Instrument is a tool to collect data directly from units of a population. | GSIM Sprint 2 | Printed forms, those filled out via telephone, online surveys, face-to-face interviews, etc.
Production | Process Component | Process Method | Administrative Register Instrument | An Administrative Register Instrument is a repository of data collected for non-statistical purposes made available to the statistical organization for the creation of statistical products. | GSIM Sprint 2 | Tax register, business register.
Production | Process Component | Process Method | Internet Robot Instrument | An Internet Robot Instrument is a software agent designed to visit sites on the internet and collect data by programmatically harvesting the internet sites. | GSIM Sprint 2 | Software configured to grab the number of people in the labour force by age group from web site www.xxx.xxx.
Production | Process Component | Process Method | Generic Instrument | A Generic Instrument is any kind of tool designed to collect data. | GSIM Sprint 2 | Survey, Administrative Register, Internet Robot Instruments
Production | Process Component | Process Step Definition | - | A Process Step Definition describes the intended purpose and identifies the types of inputs and outputs of a process. It has a set of associated methods that may be used to achieve the desired outcome. | GSIM Sprint 2 | "Dissemination", "publish data to website", imputation
Production | Process Component | Process Step Design | - | A Process Step Design identifies the specific information objects that are the inputs and outputs of a process and identifies the methods it is going to use. | GSIM Sprint 2 | -
Production | Process Component | Question | - | A Question describes the text used to interrogate a respondent, the concept that is measured and the allowed responses. | GSIM Sprint 2 | -
Production | Rule | - | - | The Rule set contains objects that govern processes. | GSIM Sprint 2 | Interviewer Instruction, Control Construct
Production | Rule | Control Construct | - | A Control Construct is a description of a rule or set of rules that describe the flow logic of an instrument. | GSIM Sprint 2 | If Then Else, Loop, Repeat Until, Repeat While, Sequence
Production | Rule | Control Construct | Computation | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Control Construct | If Then Else | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Control Construct | Loop | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Control Construct | Question Construct | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Control Construct | Repeat Until | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Control Construct | Repeat While | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Control Construct | Sequence | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Control Construct | Statement | Specific control construct; definition to be added following Sprint 2. | - | -
Production | Rule | Interviewer Instruction | - | An Interviewer Instruction is text or other information provided to an interviewer to help in conducting the interview. In a self-completion scenario, the respondent is considered to be the interviewer. | GSIM Sprint 2 | -
Production | Rule | Rule | - | A Rule is an instruction that constrains a process. | GSIM Sprint 2 | for donor imputation: maximum 2 imputation trials

Conceptual

Figure A7. Set level diagram of Conceptual group

78. The Conceptual group contains sets of information objects that describe the concepts used and their practical implementation, allowing users to understand what the statistics are measuring. This group is used as an input into the processes described by the Production group and is often referred to by objects within the Information group to provide definition and structure. The objects in the group are also often referred to in products to provide information to help users understand results (reference metadata).

79. This group is divided into three sets of objects.

80. The Population Unit set contains objects that define real world phenomena that are the subject of a statistical activity. These are the subject of measurement, which is described by the objects within the Variable set. The values from these measurements are described or delimited by objects within the Classification set.
81. The particular set of units that a statistical agency is interested in at any point during the statistical production process is defined by the Population. There are several types of Population, including a Target Population, Frame Population, Survey Population and Analysis Population. Each Population is made up of Population Units, each of which is a representation of an entity that can be described by a particular set of characteristics. Households and enterprises are examples of Population Units. Depending on the role that the Population Unit plays at various stages of the statistical lifecycle, it may be referred to as one of multiple types, including: Collection Unit, Statistical Unit, Analysis Unit.

82. Once a Population has been defined, it is usually the case that an agency is interested in measuring something about the group (using data that may already exist or is yet to be collected). The measurement of a particular characteristic about a Population is described by a Variable. The characteristic that is being measured is described as a Concept. Income is an example of a Concept. Each Variable identifies the Population Unit that is the subject of measurement identified by a Concept. The Variable does not include any information on how the resulting value may be represented. This is to prevent duplication of Variable information where the essence of what is being measured remains the same but is represented in a different manner. The Contextual Variable adds information that describes how the resulting values may be represented, through association with a Value Domain. The valid values may be of several types, including coded (a reference to a code list which may or may not represent a classification) or non-coded (i.e. text, numeric or date/time). Numeric non-coded value domains are represented through a Unit of Measurement.

83. Each time a Contextual Variable is used within a different statistical activity (Acquisition, Statistical or Dissemination project) it is represented as an Instance Variable. This is used to identify and acknowledge the meaningful differences that arise when data sources are used to populate the value of the variable. It is important to note that the Instance Variable should not be confused with the actual content of the variable once it has been assigned values.

84. A Classification is a categorization of real world objects so that they may be grouped, by like characteristics, for the purposes of measurement. A Classification may have one or more Classification Versions, which represent changes over time; each version is valid only for a particular period. Each Classification Version has one or more Classification Levels, which consist of one or more Classification Items. A Category groups together real world items according to a common property and exists independently of its inclusion in a particular Classification Version. The Category is referenced by a Classification Item to place it into the level of a particular Classification Version.

85. A Classification categorizes real world items for the purposes of measurement but does not prescribe a representation for the Categories which it includes. Depending on the Classification used, a Category may be represented by a different Code within a Code list. Many Code lists may exist to represent the same Categories in a Classification and be used in different contexts. Some Code lists may not refer to any Classifications. A Code list consists of Code Items that can be organized into a Hierarchy of Levels. Together, a Classification and a Code list provide representation and meaning that can be used to enumerate a set of real world values.
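As a purely illustrative sketch of how these conceptual objects relate, the following code follows the object names used above, while the field names are assumptions and the income and gender examples follow those given in Table A3 below.

from dataclasses import dataclass
from typing import Dict

@dataclass
class Concept:
    label: str                               # e.g. "Income"

@dataclass
class Variable:
    concept: Concept
    population_unit: str                     # e.g. "Person"; no representation is attached yet

@dataclass
class Category:
    label: str                               # e.g. "Female"

@dataclass
class CodeList:
    codes: Dict[str, Category]               # code -> category, e.g. "F" -> Female

@dataclass
class CodedValueDomain:
    code_list: CodeList

@dataclass
class UncodedValueDomain:
    unit_of_measurement: str                 # e.g. "USD"

@dataclass
class ContextualVariable:                    # adds a representation to a Variable via a Value Domain
    variable: Variable
    value_domain: object                     # CodedValueDomain or UncodedValueDomain

@dataclass
class InstanceVariable:                      # the Contextual Variable as used in one data resource
    contextual_variable: ContextualVariable
    data_resource: str                       # e.g. "tax office database"

sex = ContextualVariable(
    variable=Variable(Concept("Sex"), population_unit="Person"),
    value_domain=CodedValueDomain(CodeList({"F": Category("Female"), "M": Category("Male")})),
)
income = ContextualVariable(
    variable=Variable(Concept("Income"), population_unit="Person"),
    value_domain=UncodedValueDomain(unit_of_measurement="USD"),
)
income_in_tax_register = InstanceVariable(income, data_resource="tax office database")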
Together a Classification and a Code list provide representation and meaning that can be used to enumerate a set of real world values. 33 Figure A8. Object level diagram of Conceptual group 34 Table A3. Definitions of information objects in the Conceptual Group GSIM Group GSIM Set GSIM Object GSIM Object Definition sub-type Conceptual Source Example The group Conceptual contains sets of objects that define the measurement and representation of data. GSIM Sprint 2 Population Unit, Variable, Classification The Population Unit set contains objects that define real world phenomena that are the subject of a statistical activity. GSIM Sprint 2 Statistical Unit, Collection Unit, Analysis Unit, Target Population, Survey Population A Population is a set of units. GSIM Sprint 2 Target Population, Statistical Population Conceptual Population Unit Conceptual Population Unit Population Conceptual Population Unit Population Target population A Target Population describes the set of statistical units (i.e. the objects of interest) as defined during the design stage of a statistical activity. Based on Neuchâtel All cars in Korea. Conceptual Population Unit Population Frame population A Frame Population describes the set of statistical units that represent the observable part of the target population and provides a reasonable approximation of it. GSIM Sprint 2 All cars registered in Korea. 35 Conceptual Population Unit Population Survey population A Survey Population is the set of statistical GSIM Sprint units from which information can be 2 obtained in a survey. It is based on the frame population. All owners of cars registered in Korea. Conceptual Population Unit Population Analysis population An Analysis Population is the set of derived statistical units required for the analysis, processing, or dissemination of statistical data. A set of real or artificial (aggregate) unit applied for statistics without an obvious statistical unit as for example in price statistics or pollution statistics. Conceptual Population Unit Population Unit Conceptual Population Unit Population Unit Statistical unit A Statistical Unit is the entity of interest in a statistical activity, i.e. for which information is sought. SDMX MCV Cars. updated Conceptual Population Unit Population Unit Collection unit The Collection Unit is the entity for which information can actually be obtained during data collection. GSIM Sprint 2 Car owners. Conceptual Population Unit Population Unit Analysis unit An Analysis Unit is a derived entity that is defined for the analysis, processing, or dissemination of statistical data. GSIM Sprint 2 A real or artificial (aggregate) unit applied for GSIM Sprint 2 A Unit is an entity that can be described by GSIM Sprint characteristics and is a member of a 2 population. Statistical unit, collection unit, analysis unit 36 statistics without an obvious statistical unit as for example in price statistics or pollution statistics. The Variable set contains objects that describe the measurement of real world phenomena that are the subject of a statistical activity. GSIM Sprint 2 Concept, Variable, Contextual Variable, Instance Variable Concept A Concept is a characteristic common to a set of objects. Based ISO1087 Income Variable Variable A Variable is a characteristic of a statistical unit being observed. 
− Conceptual / Variable / Contextual variable. Definition: A Contextual Variable is a characteristic of a unit being observed that may assume one or more of a set of values. Source: Updated UN Glossary of Classification terms via SDMX MCV. Example: Income of a person measured in USD with a numerical value domain from 0 to infinity.
− Conceptual / Variable / Instance Variable. Definition: An Instance Variable is a characteristic of a unit being observed that may assume one or more of a set of values as used in a particular data resource. Source: GSIM Sprint 2. Example: Income of a person measured in USD with a numerical value domain from 0 to infinity in the tax office database.
− Conceptual / Variable / Value domain. Definition: A set of permissible values. Source: ISO/IEC 11179. Example: Coded value domain, uncoded value domain.
− Conceptual / Variable / Value Domain / Coded value domain. Definition: A Coded Value Domain is a set of permissible values specified in a code list. Source: Based on ISO/IEC 11179. Example: ISO country codes; M or F to represent values of gender.
− Conceptual / Variable / Value Domain / Uncoded value domain. Definition: An Uncoded Value Domain is a set of rules that defines a set of permissible values where not specified by a code list. Source: Based on ISO/IEC 11179. Example: USD with a numerical value domain from 0 to infinity; a dwelling address; date of birth.
− Conceptual / Classification (set). Definition: The Classification set contains objects that describe or delimit the values that can be used to measure real world phenomena that are the subject of a statistical activity. Source: GSIM Sprint 2. Example: Value Domain, Code List, Classification Version, Classification Level, Classification Item.
− Conceptual / Classification / Category. Definition: A Category is an object that groups together real world items according to a common property. Source: GSIM Sprint 2. Example: Female.
− Conceptual / Classification / Classification. Definition: A Classification is an ensemble of one or more related lists of mutually exclusive categories. Source: Based on Neuchâtel. Example: The UN International Standard Industrial Classification of All Economic Activities (ISIC).
− Conceptual / Classification / Classification version. Definition: A Classification Version is a list of mutually exclusive categories representing the version-specific values of the classification variable. A classification version has a certain normative status and is valid for a given period of time. Source: Neuchâtel. Example: ISIC Revision 4.
− Conceptual / Classification / Classification level. Definition: A Classification Level is the set of items at the same granularity in a classification (version or variant). Source: Based on Neuchâtel. Example: The levels of ISIC Rev. 4: principal, secondary and ancillary activities.
− Conceptual / Classification / Classification item. Definition: A Classification Item represents a category at a certain level within a classification version or variant. Source: Neuchâtel. Example: In ISIC Rev. 4 one of the classification items is "0121 Growing of grapes".
− Conceptual / Classification / Code list. Definition: A Code List is a predefined list from which some coded statistical characteristics take their values. Source: SDMX MCV updated. Example: Gender with members F "Female" and M "Male".
− Conceptual / Classification / Code item. Definition: A Code Item is a member of a code list. It can either be a code or an included code. Source: GSIM Sprint 2. Example: F "Female".
− Conceptual / Classification / Code. Definition: A language independent set of letters, numbers or symbols that represent a conceptual value whose meaning is described in a natural language. Source: SDMX MCV updated. Example: F (for "Female").
− Conceptual / Classification / Code / Included code. Definition: An Included Code is a member of a code list that is defined by reference to a code item that is maintained in another code list. Source: GSIM Sprint 2. Example: In a national variant of a classification the top level codes are defined via reference to the codes in the international standard classification.
− Conceptual / Classification / Hierarchy. Definition: A Hierarchy arranges code items in levels of detail from the broadest to the most detailed level. Each level of the classification is defined in terms of the categories at the next lower level of the classification. Source: OECD Glossary of Statistical Terms updated. Example: In ISIC Revision 4 the hierarchy is represented by the number of digits.
− Conceptual / Classification / Value Domain / Unit of Measurement. Definition: A Unit of Measurement is the quantity or increment by which a concept represented by a (typically numerical) variable is counted or described. It is part of the description of an uncoded value domain. Source: Based on SDMX MCV. Example: Amount in euros, kilograms.

Information

Figure A9. Set level diagram of Information Group

86. As described in the section on Production, an Acquisition Program conducted by a statistical agency produces or supplies a Data Resource that can be used by Statistical Programs and Dissemination Programs. The Data Resource is based on Datasets that may be included in Products and Services, either as Datasets (e.g. when providing access to public-use micro data) or as represented by a Visualization (e.g. a table in a report or an interactive chart on a website).

87. Datasets come in different guises, for example as Administrative Register, Time Series, Panel Data, or Survival Data, just to name a few. The type of a Dataset determines the set of specific attributes to be defined, the type of Data Structure Definition required (Unit Data Structure Definition or Cube Data Structure Definition), and the methods applicable to the data. For instance, an Administrative Register is characterized by a Unit Data Structure Definition and attributes such as its original purpose or the last update date of each record, contains a record identifying variable, and can be used to define a Frame Population, to replace or complement existing surveys, or as an auxiliary input to imputation. Record matching is an example of a method specifically relevant for registers. An example of a type of Dataset defined by a Cube Data Structure Definition is a Time Series. It has specific attributes, such as frequency and type of temporal aggregation, and specific methods, e.g. seasonal adjustment, and must contain a temporal variable.

88. A Cube Data Structure Definition [4] describes the structure of an aggregate, multi-dimensional table (macro data) by means of Dimensions and Measures. Both are Variables with specific roles in such a table. The combination of Dimensions contained in a Cube Data Structure Definition creates a key or identifier of the measured values. For instance, country, indicator, measurement unit, frequency, and time Dimensions together identify the cells in a cross-country Time Series with multiple indicators (e.g. gross domestic product, gross domestic debt) measured in different units (e.g. various currencies, percent changes) and at different frequencies (e.g. annual, quarterly). The cells in such a multi-dimensional table contain the observation values. A Measure is the Variable that provides a container for these observation values. It takes its semantics from a subset of the Dimensions of the Cube Data Structure Definition. In the previous example, indicator and measurement unit can be considered as the semantics-providing Dimensions, whereas frequency and time are the temporal Dimensions and country is the geographic Dimension. An example of a Measure in addition to the plain 'observation value' could be 'pre-break observation value' in the case of a Time Series. Dimensions typically refer to Variables with Coded Value Domains, Measures to Variables with Uncoded Value Domains.

[4] The Cube Data Structure Definition and its components are mainly based on the SDMX Data Structure Definition but also include elements from the DDI NCube structure. In contrast to the GSIM Cube Data Structure Definition, its SDMX equivalent also contains "attributes". An SDMX attribute is an additional characteristic that is not required to uniquely identify a cell in the multi-dimensional structure. It can be mandatory or optional and attached to a cell, the entire dataset, or any combination of dimensions. In GSIM, these attributes are considered as reference metadata and thus not represented as a separate information object.
89. A Unit Data Structure Definition [5] specifies the structure of unaggregated micro data. It distinguishes between the logical and physical structure of a Dataset. A Logical Record describes the structure independently of physical features by referring to Variables that may include a unit identification (e.g. household number). A Record Layout describes the physical layout of a Logical Record by means of attributes of Variables such as storage format, start position, and width. A Physical Structure provides the overall layout of a physical instance of a Logical Record; it refers to Instance Variables of the Dataset. A Record Relationship defines source-target relations between Logical Records. A Response is an example of what can be represented by an Instance Variable.

[5] The Unit Data Structure Definition and its components are based on the DDI model for record structures.
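The way the Dimensions of a Cube Data Structure Definition jointly identify each cell, with the Measure providing the container for the observation values (paragraph 88), can be illustrated with the following minimal Python sketch. The class names (DimensionKey, CubeDataset) and the placeholder observation value are assumptions made for this example only, not GSIM objects or a normative serialization.

```python
from dataclasses import dataclass
from typing import Dict

# Illustrative sketch only: the dimension set follows the cross-country Time Series example above.

@dataclass(frozen=True)
class DimensionKey:
    country: str
    indicator: str
    measurement_unit: str
    frequency: str
    time: str

class CubeDataset:
    """A macro dataset: each cell is identified by the combination of its Dimension values."""

    def __init__(self) -> None:
        self.cells: Dict[DimensionKey, float] = {}   # key -> observation value (the Measure)

    def set_observation(self, key: DimensionKey, value: float) -> None:
        self.cells[key] = value

    def get_observation(self, key: DimensionKey) -> float:
        return self.cells[key]

cube = CubeDataset()
key = DimensionKey(country="X", indicator="GDP", measurement_unit="USD", frequency="A", time="2011")
cube.set_observation(key, 123.4)                     # placeholder value, for illustration only
print(cube.get_observation(key))                     # -> 123.4
```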
Figure A10. Object level diagram of Information group

Table A4. Definitions of information objects in the Information Group (columns: GSIM Group; GSIM Set; GSIM Object; GSIM Object sub-type; Definition; Source; Example)

− Information (group). Definition: The group Information contains sets of objects that describe the results of stages of statistical production. Source: GSIM Sprint 2. Example: The result of acquisition.
− Information / Dataset (set). Definition: The Dataset set contains objects that define the structure of the result of an acquisition, processing or dissemination activity. Source: GSIM Sprint 2. Example: Data resource, data structure definition.
− Information / Dataset / Dataset. Definition: A Dataset is any organized collection of data. Source: GSIM Sprint 1.
− Information / Dataset / Dataset / Administrative Register. Definition: An Administrative Register is an organized collection of data that includes data of one or more unit types where production of statistics was not the original or main purpose. Source: Based on GSIM Sprint 1.
− Information / Dataset / Dataset / Logical Record. Definition: A Logical Record is a description of a data record that is independent of its physical features. It can include references to included variables, record type, case identification, and multiple record segments. Source: Based on DDI.
− Information / Dataset / Dataset / Previous Response. Definition: A Previous Response is a value which is known prior to conducting the survey, and which is used in the instrument to influence selection of questions or other conditional aspects of the instrument. Source: GSIM Sprint 2. Example: The municipality the respondent lives in.
− Information / Dataset / Dataset / Record Layout. Definition: A Record Layout is a description of the details of a physical record. Core elements are details on variables like start position, width, storage format, and decimal positions. Source: Based on DDI.
− Information / Dataset / Dataset / Record Relationship. Definition: A Record Relationship describes a link between logical records, and optional format and default characteristics such as decimal positions, decimal or digit separators, data type, and missing data indicators. Source: Based on DDI.
− Information / Dataset / Dataset / Status. Definition: A Status for a Dataset records the status of the data. Source: GSIM Sprint 2. Example: Provisional data, final data, revised data.
− Information / Dataset / Data Resource. Definition: A Data Resource is an organized collection of data which may be sourced from multiple acquisition or statistical projects and may be used in dissemination projects. It is made up of one or more datasets. Source: GSIM Sprint 2. Example: Economic data warehouse; census register based on multiple administrative registers and sample surveys.
− Information / Dataset / Data Structure Definition. Definition: A Data Structure Definition describes the structure of a Dataset in terms of a set of Variables and attached Value Domains. Source: Based on SDMX.
− Information / Dataset / Data Structure Definition / Cube Data Structure Definition. Definition: A Cube Data Structure Definition is a description of the structure of an aggregate, multi-dimensional table (macro data) by means of Dimensions and Measures. Source: GSIM Sprint 2.
− Information / Dataset / Data Structure Definition / Dimension. Definition: A Dimension is a Variable that is required to identify each observation value in a macro dataset. The combination of all Dimensions in a Cube Data Structure Definition provides a key or identifier of the observation values. Source: GSIM Sprint 2. Example: Country, frequency, time.
− Information / Dataset / Data Structure Definition / Measure. Definition: A Measure is a Variable that provides a container for the observation values in a macro dataset. It takes its semantics from a subset of the Dimensions of the Cube Data Structure Definition. Source: GSIM Sprint 2.
− Information / Dataset / Data Structure Definition / Unit Data Structure Definition. Definition: A Unit Data Structure Definition is a description of the structure of a micro dataset by means of a set of Variables defined by logical and physical records. Source: GSIM Sprint 2.
− Information / Dataset / Response. Definition: The reaction of an individual unit to some form of stimulus or to a request for information. Source: OECD Glossary of Statistical Terms.
− Information / Product (set). Definition: The Product set contains objects that describe static published (internal or external) results of a statistical activity. A Product that is the output of one process may be the input into another. Source: GSIM Sprint 2. Example: A statistical publication including methodological information, tables, and charts; a public-use micro dataset.
− Information / Product / Product. Definition: A Statistical Product is a package of data and data representations (tables, graphs) and accompanying documentation that results from a statistical activity. Source: GSIM Sprint 2.
− Information / Product / Visualization. Definition: A Visualization is a (usually graphical) representation of a Dataset or components of a Dataset. Source: GSIM Sprint 2. Example: Histogram, table, heatmap.
− Information / Service (set). Definition: The Service set contains objects that describe dynamically created results of a statistical activity. Source: GSIM Sprint 2. Example: An interactive chart on the website of the statistical agency.

Annex B: Using the Object level and Use cases

Making use of the GSIM Object level

90. The Object level of GSIM provides a set of standardized, consistently described objects which can be used to understand and structure the information required to describe the production of official statistics, including processes and their inputs and outputs.
91. For each object GSIM provides:
− a definition
− a preferred name
− a set of relationships to other information objects
− a set of attributes used to describe each object (not yet developed, to be included in future versions of GSIM)

92. GSIM can be used by statistical agencies in a variety of ways:

93. Understand the information required to support the statistical production process. GSIM can be used to identify which information should be captured through the entire lifecycle, including well understood and commonly used information objects such as Variable, Population and Dataset, and new information objects that support innovative approaches to statistics (e.g. new ways of acquiring statistical data using the Acquisition Program object). GSIM can also be used to understand how information used at different stages in the lifecycle is related. For example, a user who produces a Dataset will understand how this information may be used to produce a Product. The defined attributes and relationships of the Dataset object will guide the user in properly capturing the required metadata.

94. Validate or extend existing information models. The information objects in GSIM can assist in evaluating the applicability of existing or new standards for data description and exchange. The multiple levels of GSIM allow this to be done at different levels of abstraction. The GSIM metamodel and design principles can provide a template for specifying new or local information objects to support the evolving needs of statistical agencies if they choose not to adopt the standard GSIM.

95. Facilitate communication within and between statistical agencies. The definitions of information objects and of their attributes will ensure that different stakeholders (business, methodology, IT) have a shared understanding of the data and metadata required when designing and implementing statistical processes. Common terminology and definitions facilitate communication with internal and external providers and consumers of statistics.

96. Support metadata used to describe and run processes. The attributes belonging to each object enable precise description of the inputs and outputs of statistical processes. The set of defined attributes standardizes the capture of data and metadata and enables consistent governance and tracking of information objects through their lifecycle. Clearly defined and well described information objects allow the reuse of information objects between different processes or components (e.g. services, systems) within and outside statistical agencies.

97. Understand information in context. The relationships between information objects help to identify all the inputs and outputs required for a statistical process. For example, the relationships between the objects Process Step, Rule, Output and Method inform a process designer of all the objects required to design a process. The relationships indicate the role the objects play with regard to each other and the constraints between objects that must be implemented as rules when designing processes and components. Sharing and reuse of information items implies that users must be able to find and discover information. Relationships support search and discovery by linking related objects together. For example, the relationship between Variable and Statistical Unit helps in locating all available data about a unit of interest.
Use Case 1: Design statistics on environmental investments by businesses

Use case from Statistics Netherlands supplied for the GSIM sprint, April 2012

Introduction

99. This is an example of a process model of the environmental expenditure statistics at Statistics Netherlands.

100. A few comments are in order.
1. We have followed an outside-in approach to model the process of these statistics: first we have modeled the output of the whole process, second we have modeled the input of the whole process, and based on the differences between the output and the input we have modeled the process.
2. Based on, for example, transparency requirements and the 'structure' of the data to be processed, the process may consist of more than one subprocess. Each subprocess has its output (to be modeled) and its input (to be modeled).
3. When modeling a (sub)process we have distinguished between the design of the "know" and the design of the "flow". The design of the "know" involves the choice of the statistical methodology, which may be considered as the application of a general statistical method such as an imputation method or an estimation method. The design of the "flow" involves the design of the process steps in which the methods are applied, the path and order of the process steps, the decision points that guide the statistical data through the path depending on their status, the triggers to start or end a process (step), and the process indicators to manage the flow as a whole. Note that the triggers may be quality driven, cost driven, time driven or event driven.
4. With reference to GSIM, it should be possible to model the output and input of the whole process, of each subprocess and even of a process step by the 'Information' (structural) and 'Conceptual' information objects. So, there is a recurring application of these information objects.
5. With reference to GSIM it should also be possible to model the flow of a (sub)process with the 'Production' information objects.

Statistical Processes covered

101. The process description starts after data collection (Phase 4 of the GSBPM) and ends with validation (Phase 6 of the GSBPM).

Test result

102. Most objects identified in this process description are instances of objects defined in GSIM v0.4. A few cases of doubt were highlighted in yellow. The process uses all sets of objects at the second level of GSIM, except for the whole first group, 'Activity'. 'Activity' is implicit. Attributes and relations are seldom explicit.

Process Description

Describing the output and input of the whole process

103. Statistics Netherlands produces annual statistics on environmental investments by businesses. The 'same' picture is published twice: the first publication contains preliminary figures and the second publication contains final figures. The difference between these two publications concerns data quality. Both publications concern the same population definitions, the same set of (attribute) variables and the same reference period (which is yearly). Below you will find our simplified model describing a part of the output in terms of the population definitions and the set of variables. Naturally, the NACE delineation and the investment variables involved should be defined properly.

Figure B1: Description of the output (NACE delineation of enterprise units, with the variables NACE-id, Environmental investments on behalf of water, Environmental investments on behalf of air, Environmental investments on behalf of the rest, and Environmental investments total)
104. Required objects:
− Statistical unit and target population for the required sub-domain and time period
− Concept: environmental investments
− Variables: water, air, rest, total (for the Netherlands per year)
− Value domain of all variables: [0, large value) in euros
− Statistical output: different quality versions of the data (how is this covered in the model?)
− Contextual variables: all values in euros
− Data set: aggregate statistic with NACE as classification variable, investments (for water, air, rest, total) as counting variables and enterprise unit as underlying micro objects

105. In order to produce the final and preliminary figures that are described in Figure B1, we need input. The environmental investments by businesses statistics uses primary data collection as one input source and (a part of) our Business Register as another input source. Below you will find our simplified model that describes the input sources (and their relation).

Figure B2: Description of the input (Statistical unit (enterprise): Enterprise-id, Inclusion probability, Sampling indicator, NACE-id (2008), Number of employees. Observational unit: Observational unit-id, Enterprise-id. Investment: Enterprise-id, Observational unit-id, Investment-id, Type of investment (water, air, rest), Amount of investment.)

106. A number of remarks are in order.
− The Statistical unit part of the model (right side of Figure B2) stems from the Business Register of Statistics Netherlands. The corresponding data are observed and processed by the Business Register unit. From the point of view of the environmental expenditure statistics, these data already exist.
− In order to observe the investment variables, a sample is drawn from the Business Register. Some of the drawn enterprises are combined to form an observational unit, other enterprises are split into several observational units; most of the enterprises will be used directly as observational units. Due to the splitting and merging of the enterprises a new unit (type) is created, which we have called the observational unit.
− Each observational unit may have zero, one or more investments. Each investment is modeled as a unit. (A sketch of this input model follows below.)

107. Required objects:
− Population/unit: enterprise, observational unit; investments
− Population: Business Register, sample (the sample is a subset of the Business Register)
− Source: Business Register, survey
− Component: derivation of observational unit from statistical unit (splitting and merging); this has already been done in preparatory process steps
− The cardinality of the relation between observational unit and investment is specified; the cardinality of the relation between statistical unit and observational unit is specified
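The input model of Figure B2 can be rendered as a small data-structure sketch. The Python classes below are illustrative only: the attribute names follow the figure, while the class layout and types are assumptions rather than a Statistics Netherlands data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Enterprise:                         # statistical unit from the Business Register
    enterprise_id: str
    nace_id: str                          # NACE 2008 code
    number_of_employees: int
    inclusion_probability: Optional[float] = None   # sample design characteristic
    sampling_indicator: bool = False                 # sample design characteristic

@dataclass
class Investment:
    enterprise_id: str
    observational_unit_id: str
    investment_id: str
    type_of_investment: str               # "water", "air" or "rest"
    amount_of_investment: float

@dataclass
class ObservationalUnit:                  # created by splitting/merging sampled enterprises
    observational_unit_id: str
    enterprise_ids: List[str]             # one or more related enterprises
    investments: List[Investment] = field(default_factory=list)   # zero, one or more investments
```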
Designing the flow

108. When the output model is compared to the input model, a number of observations can be made. We mention the following:
− We need estimates for the population totals of the several investment variables.
− These variables are related: investment on behalf of water + investment on behalf of air + investment on behalf of the rest = investment total.
− These variables are defined on the enterprise unit.
− Per (drawn) observational unit we get a response; per response we actually observe whether or not there is an investment and, if any, the type of the investment and the expenditure involved.

109. The challenge is to model a flow to process the desired output from the input. Below you will find two simplified process flows, as we have modeled the whole process into two sub processes.

110. Required objects:
− Components: subprocess1, subprocess2
− Sequencing: the sub processes are triggered at another level with different frequency (explained below)

111. The flow of the first sub process involves the following typical process steps or activities (the order is not yet modeled):
− Match the observed responses to the Business Register (the sampled part). Referring to Figure B2, this gives the first instances of the Observational unit part and the Investment part.
− Validate and correct for 1000-errors.
− Divide the data flow into two parts. The first has to be validated and corrected manually and the second part is not validated and corrected. Note that this is a simple example of selective editing. The criterion to divide the data flow into two parts is based on a so-called OK-index. So, before the data can be divided, this OK-index has to be calculated.
− Calculate the OK-index.
− Validate the data manually.
− Correct (or impute) the data manually.
− Estimate the amount of investment from the Investment level to the Observational unit level, taking into account the type of investment.

112. Required objects:
− (implicit) population of observational units
− Components: matching, validation1000, determine OK-index, router validation and correction, validate manually, correct manually, estimate
− Variable: OK-index (= quality variable), distinguish between (1-to-1, split and merged) statistical units (= relation variable)

113. The flow of the second sub process involves the following process steps/activities (again, the order is not modeled yet):
− Split the joined observations into the related enterprise units, taking into account any overcoverage with respect to observational units.
− Join the split observations to the related enterprise unit, taking into account any partial non-response.
− Estimate the NACE population totals of the environmental investment variables.

114. Required objects:
− Components: router on (1-to-1, split and merged) statistical units; for split: transform observational unit information to enterprise information, deal with overcoverage; for merged: transform observational unit information to enterprise information, deal with partial non-response; estimate the total environmental investment by NACE and investment type
− Variable: sample design characteristics (inclusion probability, sampling indicator) are in Figure B2 at the enterprise level

115. So far, the identified process steps/activities could be mapped to GSBPM. Based on these steps you will find an example of both process flows below (the order and the path are modeled now).

Figure B3: Flow of the first sub process (match response; validate and correct 1000-errors; calculate OK-index; decision point "OK?"; if not OK: validate and correct manually; aggregate; decision point "More units?")

Figure B4: Flow of the second sub process (decision point "Split or join": split, join, or do nothing; then estimate population totals)
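The per-response flow of Figure B3 can be sketched schematically as a simple loop. The Python below is illustrative only: the step functions stand in for the components listed above (matching, validation1000, determine OK-index, manual validation and correction, estimate), and the OK-index formula and threshold are invented for the example rather than taken from the actual Statistics Netherlands design.

```python
def calculate_ok_index(record: dict) -> float:
    # Placeholder plausibility score in [0, 1]; the real OK-index is a designed quality variable.
    return 1.0 if record.get("amount_of_investment", 0.0) < 1_000_000 else 0.0

def process_response(record: dict, ok_threshold: float = 0.5) -> dict:
    record = dict(record)                     # matching and automatic 1000-error editing would happen here
    if calculate_ok_index(record) < ok_threshold:
        record["manually_edited"] = True      # decision point "OK?": route to manual validation/correction
    return record                             # followed by aggregation from Investment to Observational unit

def run_first_subprocess(responses: list) -> list:
    processed = []
    for record in responses:                  # decision point "More units?" ends the loop
        processed.append(process_response(record))
    return processed

print(run_first_subprocess([{"amount_of_investment": 250_000.0},
                            {"amount_of_investment": 5_000_000_000.0}]))
```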
116. Again, a number of remarks are in order:
− The technique we have used to model these flows is non-standard.
− The first sub process is able to process each response separately (there may be 10,000 responses). So, the sub process starts, i.e. is triggered, by a response. This response is matched and edited, and there is an aggregation (from Investment to Observational unit).
− There is a decision point (OK?) that controls the response flow.
− There is a decision point (More units?) that controls the end of the sub process.
− The second sub process starts to produce the preliminary figures once 6,000 responses have been processed by the first sub process. In the meanwhile the first sub process continues. If 10,000 responses have been processed, then the second sub process starts again.
− The second sub process processes the 6,000 (or 10,000) responses as a whole. This is necessary because of the Join process step and the Estimation process step.
− The output of the first sub process and the input of the second sub process are not modeled yet. In fact this should be done.
− Each process step is based on a (statistical) method, which has its own specification (or parameterization). Furthermore, each activity has its own inputs and outputs. In fact, both the specification and the input and output should be described/modeled at this level as well.
− In the end, all descriptions of the input and output should fit together like a puzzle.

117. Required objects:
− Components: the routers have already been identified
− Path: the sequence of actions is specified
− Trigger: subprocess1 is triggered by each response (observational unit); subprocess2 is triggered by the number of observational units (6,000, 10,000)
− Variable: number of responses; this variable has two roles: it is used as a process indicator to manage the flow, and it is also used as a quality indicator to describe the response fraction
− Parameter: requirement on the number of responses for provisional and final data

Designing the know (statistical methodology)

118. As already said, each process step is based on a (statistical) method, which has its own specification (or parameterization). Furthermore, each activity has its own inputs and outputs. In fact, both the specification and the input and output should be described/modeled. As an example, we only describe the step that validates and corrects 1000-errors.

119. In order to specify the methodology one could refer to a standard method that identifies 1000-errors. For example, one could compare the value of the variable to be validated to a so-called reference value. If the ratio is larger than a parameter a, say a = 400, then there is probably a 1000-error.

120. Now, in this example, the process step that validates and corrects 1000-errors should 'know' which variables have to be checked, the specification of the edit rules in terms of the parameter a, and the specification of the parameter a. We note that the specification of the parameter a may differ between variables. The output of the process step always has the same structure: it provides quality indicators for each variable that has been checked and a new status of the values of these variables.

121. There is one comment in order. The reference values are used as auxiliary information. The designer of the flow should consider whether these reference values stem from an external source (and should be modeled as an additional input source) or from an activity inside the process. In the latter case this activity should also be modeled; this could be a third (preparatory) sub process.

122. Required objects:
− Rule: there is a linear edit involved: variable x > a multiplied by variable y.
− Component: there is an edit part in the component: if the edit is 'violated', then variable z is '1000-error', otherwise variable z is 'not 1000-error'. There is an imputation part in the component: if z is '1000-error', then variable x = variable x divided by 1000; otherwise, if z is 'not 1000-error', then variable x = variable x.
− Parameter: a
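As a concrete reading of the edit and imputation rules just listed, the following is a minimal Python sketch. The function names and example values are assumptions; the returned status string corresponds to the quality indicator and new value status mentioned in paragraph 120, and the reference value is the auxiliary information discussed in paragraph 121.

```python
def is_1000_error(value: float, reference_value: float, a: float = 400.0) -> bool:
    """Edit rule: variable x > a multiplied by variable y signals a probable 1000-error."""
    return value > a * reference_value

def correct_1000_error(value: float, reference_value: float, a: float = 400.0) -> tuple:
    """Imputation rule: divide by 1000 if the edit is violated, otherwise keep the value."""
    if is_1000_error(value, reference_value, a):
        return value / 1000.0, "1000-error"
    return value, "not 1000-error"

# Example: an amount reported in euros where thousands of euros were expected.
print(correct_1000_error(value=2_500_000.0, reference_value=3_000.0))   # -> (2500.0, '1000-error')
```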
Information Objects

123. In Table B1, the information objects identified in the use case are listed (first column) and mapped to the GSIM objects (third column).

Table B1: Comparison of information objects (columns: Identified information objects; Attributes found; GSIM information objects applicable; Definition match (Y/N); Attributes applicable (Y/N); Relationships applicable (Y/N))

− Enterprise; GSIM object: Statistical unit; yes; yes.
− Target population; attributes found: Netherlands, year; GSIM object: Target population; yes; yes.
− Environmental investments; GSIM object: Concept; yes; yes.
− Variables environmental investments for water, air, rest, total; GSIM object: Variables; yes; yes.
− All variables in euros; GSIM object: Contextual variable; yes; yes.
− Values larger than 0 and smaller than (high values); GSIM object: Value domain; yes; yes.
− Two versions (provisional, final); attributes found: quality attribute; GSIM object: Products; yes; yes.
− Data set; attributes found: NACE, type of investment, aggregate value; GSIM object: Data set; yes; yes.
− Business register (sampling frame); GSIM object: Data set (composite); yes; yes.
− Observational unit; GSIM object: Observational unit; yes; yes.
− Sample; GSIM object: Sample population; yes; yes.
− Business register, survey; GSIM object: Source; yes; yes.
− Derivation of observational unit; GSIM object: Component; yes; yes.
− Subprocess1, Subprocess2; GSIM object: Component; yes; yes.
− Sequencing of sub processes; GSIM object: Control; yes; yes.
− Matching, validation1000, determine OK-index, router validation and correction, validate manually, correct manually, estimate; GSIM object: Component; yes; yes.
− OK-index (= quality variable), distinguish between (1-to-1, split and merged) statistical units (= relation variable); GSIM object: Variable; yes; yes.
− Router on (1-to-1, split and merged) statistical units; for split: transform observational unit information to enterprise information, deal with overcoverage; for merged: transform observational unit information to enterprise information, deal with partial non-response; estimate the total environmental investment by NACE and investment type; GSIM objects: Components, rule, process control; yes; yes.
− Sample design characteristics; GSIM object: Variable; yes; yes.
− Path through components; GSIM object: Rule; yes; yes.
− Triggers for starting the process; GSIM object: Rule; yes; yes.
− Number of responses (this variable has two roles: it is used as a process indicator to manage the flow and as a quality indicator to describe the response fraction); GSIM object: Variable; yes; yes.
− Requirement on number of responses for provisional and final data; GSIM object: Parameter; yes; yes.
− Variable x > a times variable y; GSIM object: Rule; yes; yes.
− a; GSIM object: Parameter; yes; yes.

Use Case 2: Harvesting Data off The Web

Use case supplied by the IMF for GSIM Sprint 2 in April 2012

Introduction

124. This use case was defined based on the IMF's experience with web scraping of data from member countries' and international organizations' web pages (currently approximately 35 robots scraping 4000 series from 85 websites). Using robots to harvest data off the web has the potential to ease the task of data collection for data providers as well as data collectors. Data that are already publicly available can be automatically extracted from the web, reducing the manual retrieval of data. Reuse of existing data and higher degrees of process automation play an important role in the modernization of statistical business processes and are thus considered a use case relevant to GSIM.

Statistical processes covered
125. In principle, data scraped from the web can traverse all nine phases of the statistical business process as described by the GSBPM, but phases without any special requirements in the web-scraping scenario (phases 6-8) are not included in this use case. With respect to the other phases, only the relevant sub-processes are described.

Test result

126. The objects identified in this use case correspond to information objects defined in GSIM v0.4. The process uses all level 2 objects except for Acquisition Program, Statistical Program, and Dissemination Program from the Activity group. Activity is not explicitly modeled, but the use case is actually an Acquisition Program. Attributes and relations are not explicitly modeled, but a number of characteristics of objects were identified and listed in section 4 for consideration as additions in the next development phase of the model. Not all of the objects identified in the use case are currently explicitly modeled and implemented in the structured and formalized way proposed by GSIM. GSIM can help to improve transparency concerning those objects as created and transformed in the process, facilitate maintenance of these objects as well as of the process components, and enhance the reusability of process components (web scraping robots in this case). Reference metadata are not explicitly modeled in GSIM, but they may be attributes of any information object, or information objects (for example (meta)data sets) attached to any other information object (by means of a generic relationship that all information objects have).

Process description

1. Specify Needs

1.1/1.2 Determine needs for information / Consult & confirm needs

127. Web scraping is used in two different scenarios: to fill gaps in existing data collection exercises and as an instrument for new data collection exercises. In the first scenario, need specification comprises the identification of gaps in an existing data collection, e.g. countries that do not report using existing data submission modes, do not report timely or do not report at the desired frequency. To do this, the coverage of the existing dataset is evaluated. This involves the dataset, the variables and/or statistical units used, and validation rules that determine what is missing at certain points in time. A quality report (incl. coverage, timeliness) will result from the coverage validation. The report may already be available from regular quality assessments; in that case it is just an input to this process step and does not need to be defined and generated. A description of the gap to be filled by web scraping is then the specification of the information request.

128. Required information objects (examples in parentheses):
− Data Set (Direction of Trade production database)
− Variable (Exports, Imports, Imports Free on Board; broken down by Partner Country, Measurement Unit, Frequency, Time)
− Population Unit (Country)
− Process Component (Validate dataset coverage)
− Rule (a country's coverage vis-à-vis all partner countries below x% in the previous 8 quarters)
− Product (Coverage report for Direction of Trade production database; contains a data set and a list of countries that satisfy the rule)
− Information Request (Country X's quarterly Exports to and Imports from all individual Partner Countries in the previous three quarters)

129. We consider the quality report as reference metadata attached to a dataset. The report is the output product of a validation process component. In case the report is reasonably formalized, it is represented as a data set.
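The coverage Rule and the "Validate dataset coverage" component of paragraph 128 can be read as a simple check over recent quarters. The Python sketch below is illustrative only: the data layout is assumed, and the unspecified x% threshold is set to an assumed 50% for the example.

```python
from typing import Dict, List

def countries_with_coverage_gap(coverage_by_country: Dict[str, List[float]],
                                threshold: float = 0.5) -> List[str]:
    """Return countries whose coverage vis-a-vis all partner countries stayed below the
    threshold in each of the previous 8 quarters; these feed the Information Request."""
    flagged = []
    for country, quarterly_coverage in coverage_by_country.items():
        last_eight = quarterly_coverage[-8:]
        if len(last_eight) == 8 and all(share < threshold for share in last_eight):
            flagged.append(country)
    return flagged

coverage = {"Country X": [0.4] * 8, "Country Y": [0.9] * 8}     # placeholder coverage shares
print(countries_with_coverage_gap(coverage))                    # -> ['Country X']
```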
130. In the second scenario a new information need is expressed.

131. Required information objects:
− Information Request (Member countries' latest Public Sector Debt Statistics)

1.3 Establish output objectives

132. In order to describe the purpose of the new data collection, that is, either to close gaps in existing datasets or to satisfy a new information need, the same information objects as in the previous process step (1.1/1.2) are used.

1.4 Identify concepts

133. In the "close gaps" scenario we assume that a formal representation of the concepts and value domains is already available and can be used to specify the constraints defining the gap (e.g. which indicators, countries, measurement units, methods (e.g. for seasonal adjustment), etc.). In IMF data collection, country is the statistical unit, but it is usually treated as a variable.

134. Required information objects:
− Variable (Exports, Imports, Imports Free on Board; Partner Country, Measurement Unit, Frequency, Time)
− Value Domain (numeric, Precision=2 decimals, Scale=Millions)
− Population Unit (Country)
− Rule (Country=X and Frequency=Q and Measurement Unit=US Dollars and Adjustment=None)

135. In the "new need" scenario a less detailed description of concepts is created in this step, using fewer information objects.

136. Required information objects:
− Variable (Exports, Imports, Imports Free on Board; Partner Country, Measurement Unit, Frequency, Time)
− Population Unit (Country)

1.5 Check data availability

137. Checking data availability involves the following steps:
1. Identify web site(s) of data provider(s) (i.e. national statistical institutes, central banks, or international organizations; but could also be commercial providers) that contain the required data
2. Identify the available data format, e.g. HTML table, downloadable file (xls, csv, pdf, SDMX-ML, ...), or query interface, and assess whether it suits the needs (typically, a database query interface is preferred to downloadable files (with a stable structure), which are preferred to plain HTML pages, as the structure of a database is changed less frequently than the layout of an HTML page)
3. Check the quality of the available data, e.g. frequency of updates, timeliness, etc.
4. If the quality is not appropriate or the desired data format is not available, identify further sources and repeat steps 2 and 3

138. We considered data available on a data provider's web site as that data provider's product, irrespective of the data format. We assumed the data format and transmission channel to be attributes of a product. The quality indicators required to evaluate the data are modeled as variables with value domains in a data set. A process component specifies how to check the quality, and a process agent performs the check. The "check data availability" process iterates until the quality of the identified data satisfies the requirements (see the sketch below).

139. Required information objects:
− Product/Service (International trade table on website of Country X's statistical office; The World Bank's online data catalogue and Open Data API)
− Data Set (Quality Dataset created for Product Y)
− Variable (Update Frequency)
− Value Domain (daily, weekly, monthly)
− Rule (Update Frequency is at least weekly; Data Format=Query Interface)
− Process Component (Validate quality of product)
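The iterative availability check in 1.5 can be sketched as a simple selection loop over candidate products. The Python below is an assumption-laden illustration: the candidate products, format values and frequency ranking are invented, while the quality Rule mirrors the example in paragraph 139.

```python
from typing import Dict, List, Optional

FREQUENCY_RANK = {"daily": 3, "weekly": 2, "monthly": 1}

def satisfies_quality_rule(product: Dict[str, str]) -> bool:
    # Rule: Update Frequency is at least weekly and Data Format = Query Interface
    frequent_enough = FREQUENCY_RANK.get(product["update_frequency"], 0) >= FREQUENCY_RANK["weekly"]
    return frequent_enough and product["data_format"] == "query interface"

def select_source(candidate_products: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    for product in candidate_products:        # repeat steps 2 and 3 until the quality is appropriate
        if satisfies_quality_rule(product):
            return product
    return None                               # no suitable source identified

candidates = [
    {"name": "HTML summary table", "data_format": "html", "update_frequency": "monthly"},
    {"name": "Statistical office data API", "data_format": "query interface", "update_frequency": "daily"},
]
print(select_source(candidates))              # -> the data API product
```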
1.6 Prepare business case

140. The preparation of a business case requires information objects from process steps 1.1 to 1.5, especially the availability check and information on the available data format and quality.

2. Design

2.1 Design outputs

141. In the scenario of enhancing an existing dataset, the data collected via web scraping are added to existing statistical outputs (i.e. datasets, publications, quality reports, etc.). This means that an existing output design is identified and reused, and the part of the output to which the supplementary collection will contribute data and/or (reference) metadata (such as a link to the source website) is specified. In terms of information objects, output design would be done via reference to existing products and data sets (potentially a separate one for metadata), as well as a rule specified in terms of variables and value domains that defines the part of the data sets which the scraped data will fill. We would model any additional reference metadata required as variables in another data set. In the new collection scenario, the output data set (= target data model), including output reference metadata (a separate data set), as well as the output products need to be defined (as for any other new data collection). In both scenarios, the intermediary outputs, e.g. the input data store (before further processing), have to be designed. In the new data collection scenario, the same objects are required but need to be newly defined instead of referenced.

142. Required information objects:
− Data Set (Direction of Trade reference metadata database and production database, Web scraping input data store)
− Data Provider (National Statistical Office of country A)
− Variable (Direct Source Link, Update Frequency)
− Value Domain (IMF Country and Group code list)
− Rule (Data area=[Country=A and Variable=Imports and Frequency=Quarterly])
− Product (Direction of Trade dissemination database, Public Sector Debt Country Report)

2.2 Design variable descriptions

143. Variable design and description includes cross-domain concepts such as country, frequency, time and measurement unit. For those general concepts, references to existing variable definitions are sufficient. New concepts and appropriate source, derived, and output variables and related value domains may need to be defined for new data collections. Concepts and variables at all processing levels are considered as Variables.

144. Required information objects:
− Variable (Country)
− Classification (ISO 2 country codes)
− Value Domain (date format=YYYY, Time between 2000 and 2030)

2.3 Design data collection methodology

145. To design a web scraping instrument, the following steps are required:
1. Determine the number and types of robots needed
2. Specify which data sources to use (data may be available on multiple websites) and how, including:
   a) web address
   b) format (i.e. HTML table, linked file, query tool)
   c) navigation steps to get to the data
3. Define the source data model, i.e. identify relevant elements of the website
4. Define the mapping from the source data model to the target data model (variable and value correspondence)

146. Translated into GSIM objects, this means that a "web scraping through a scheduled robot" process component is defined and its configuration parameters specified. Web address and data format are considered as attributes of a product. The navigation steps are specified as rules. A mapping process component with additional rules is used to identify and map the information elements from the web site to the target data model. The mapping defines the correspondence of web site elements to variables and their values (taken from a value domain). The schedule and/or sequencing of robots are defined in a process control. (An illustrative robot configuration and mapping sketch follows the list below.)

147. Required information objects:
− Process Component (Web scraping through a scheduled robot; Mapping)
− Rule (Select "PPI Statistical Bulletin"; Select "1. Output Prices: Summary NSA"; Select "Download CSV"; Column A: 1=Imports, 2=Exports)
− Product/Service (International trade table on website of Country X's statistical office)
− Variable (Partner Country)
− Classification (ISO 2-digit country codes)
− Value Domain (numeric, Precision=2 decimals, Scale=Millions)
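To make the "web scraping through a scheduled robot" component and its mapping rules more tangible, the following is an illustrative configuration sketch in Python. The URL, column letters and code values are fictitious placeholders patterned on the rule examples above; this is not an actual IMF robot or configuration format.

```python
ROBOT_CONFIG = {
    "product_url": "https://example.org/statistics/ppi",      # web address (attribute of the Product)
    "data_format": "csv",                                      # HTML table, linked file or query tool
    "navigation_rules": [                                      # navigation steps specified as Rules
        'Select "PPI Statistical Bulletin"',
        'Select "1. Output Prices: Summary NSA"',
        'Select "Download CSV"',
    ],
    # mapping Rules: source data model (columns) -> target Variables and coded values
    "column_mapping": {"A": "flow", "B": "value"},
    "value_mapping": {"flow": {"1": "Imports", "2": "Exports"}},
}

def apply_mapping(raw_row: dict, config: dict) -> dict:
    """Map one scraped row onto the target data model using the configured Rules."""
    mapped = {config["column_mapping"][col]: val for col, val in raw_row.items()}
    for variable, codes in config["value_mapping"].items():
        if variable in mapped:
            mapped[variable] = codes.get(mapped[variable], mapped[variable])
    return mapped

print(apply_mapping({"A": "1", "B": "135.2"}, ROBOT_CONFIG))   # -> {'flow': 'Imports', 'value': '135.2'}
```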
2.4 Design frame & sample methodology

148. The target population unit is country. It is usually treated as a variable. No sample methodology is applied.

149. Required information objects:
− Population Unit (Country)
− Variable (if used instead of population unit)
− Classification (ISO 2-digit country codes)
− Rule (Country=A,B,C)

150. The target population is typically defined as all member countries.

2.5 Design statistical processing methodology

151. The following process steps need to be followed:
1. Define validation rules (e.g. identify unmapped items)
2. Define rules for the derivation of variables (this also requires the values and value domains of the original and derived variables)
3. Define rules for the integration of data from different robots (this requires the target data set and the dimensions along which the scraped data are combined; these can be the population unit or a variable)
4. Define rules for reference metadata (variables in a separate dataset or attributes of objects)

152. Required information objects:
− Process Component (Validation of load process, Derive ratio of two variables, Integrate dataset into data resource)
− Rule (Variable A as Percent of GDP = Variable A in US Dollars / GDP in US Dollars * 100; Measurement Unit = % of GDP; Attribute Last Update Date of data set = time stamp of load time)
− Variable (Time, Partner Country)
− Classification (SNA 2008 Sector Classification)
− Value Domain (1990 to 2030)
− Data Set (Public Sector Debt Dissemination Database, Economic Data Warehouse)
− Population Unit (Country)

2.6 Design production systems & workflow

153. The following process steps are required:
1. Define the web scraping schedule
2. Specify the format and location of the robot output
3. Define the workflow; this includes
   a) integration of "collection via robot" into standard processes
   b) exception handling, e.g. no new data, unmapped item detected, robot fails
   c) alert system/notifications
   d) activity log for robots
   e) remedy log for required interventions in case of robot failure

154. Required information objects:
− Process Control (Run robot "DOT Country A" every Sunday at 11 PM; if file download successful then execute save file step, else send failure notification)
− Process Component (Create process log)
− Rule (Save downloaded file as filename.xml in location X)
− Data Set (format and location should be attributes of a data set)
− Process Metrics (Robot activity log, robot remedy log)

3. Build

3.1 Build data collection instrument
1. Implement robot(s) based on the design specified in process phase 2
2. Reuse existing robots if possible

155. Required information objects (a sketch of such a scheduled robot run with simple exception handling follows below):
− Process Control (Run robot "DOT Country A" every Sunday at 11 PM)
− Process Component (Create remedy log)
− Rule (Save downloaded file as filename.xml in location X)
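The following is a minimal sketch of the Process Control described in 2.6 and 3.1: a scheduled robot run with simple exception handling and an activity log. The robot name, schedule and file name follow the examples given above, while the stubbed robot function and everything else in the code are assumptions made for illustration.

```python
import datetime
from typing import Dict, List

activity_log: List[Dict[str, str]] = []                  # Process Metrics: robot activity log

def run_robot_dot_country_a() -> bool:
    """Stub for the scraping step; returns True when the file download succeeded."""
    return True

def scheduled_run(now: datetime.datetime) -> None:
    if not (now.weekday() == 6 and now.hour == 23):      # run every Sunday at 11 PM
        return
    if run_robot_dot_country_a():
        # Rule: save the downloaded file as filename.xml in location X
        activity_log.append({"robot": "DOT Country A", "time": now.isoformat(),
                             "status": "file saved as filename.xml"})
    else:
        # exception handling: send a failure notification and write to the remedy log
        activity_log.append({"robot": "DOT Country A", "time": now.isoformat(),
                             "status": "failed, notification sent"})

scheduled_run(datetime.datetime(2012, 5, 6, 23, 0))      # a Sunday at 11 PM
print(activity_log)
```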
3.2 Build or enhance process components / 3.3 Configure workflows

156. Implement integration and other rules and configure the workflow based on the design specified in process phase 2.

157. Required information objects: same objects as in 3.1 Build data collection instrument.

3.4 Test production system
1. Test robots and other components individually
2. Modify them if necessary
3. Re-test components (loop)
4. Test integration of components
5. Modify & re-test (loop)

158. This process step creates a test report.

159. Required information objects: all objects specified for the scraping process are subject to test/review and thus needed in this process step. The test report can be represented as process metrics or as a product.

3.5 Test statistical business process
1. Test the entire workflow
2. Modify if required
3. Re-test the workflow (loop)

160. Required information objects: same objects as in 3.4 Test production system.

3.6 Finalize production system
1. Implement the final schedule and obtain approval
2. Communicate to data providers

161. Required information objects: same objects as in 3.4 Test production system.

4. Collect

4.1 Select sample

162. Not relevant for this use case.

4.2 Set up collection

163. Put robots and other components into production.

164. Required information objects: same objects as in 3.4 Test production system.

4.3/4.4 Run collection / Finalize collection

165. Robots will run according to the specified schedule; this includes populating the activity log and the failure remedy log (see 2.6).

166. Required information objects: same objects as in 2.6 Design production systems & workflow and 3.6 Finalize production system.

5. Process

5.1 Integrate data
1. Integrate data and metadata from different robots
2. Integrate scraped data and metadata with data/metadata already in the system or data/metadata collected via different channels

167. Required information objects:
− Data Set (Direction of Trade Input Database, Economic Data Warehouse)
− Process Component (Integrate dataset into data resource)
− Variable (Partner Country, Time, Frequency)
− Classification (ISO country codes)
− Value Domain (1990-2030)

5.2 Classify & code

168. Not relevant in this scenario.

169. The coding is already done by the mapping component of the robot, typically during collection and before integration.

170. (To further modularize the processes, it should be considered whether to move this step from "Collection" to "Process".)

5.3 Review, validate & edit

171. Based on the validation rules specified in phase 2. This step would typically happen prior to integration.

172. Required information objects: same as used in 2.5 Design statistical processing methodology.

5.4 Impute, 5.5 Derive new variables & statistical units, 5.6 Calculate weights, 5.7 Calculate aggregates, 5.8 Finalize data files

173. No specific rules for web-scraped data; same categories as in other types of data collection.

6. Analyze, 7. Disseminate, 8. Archive

174. No specific rules for web-scraped data.

9. Evaluate

9.1 Gather evaluation inputs

175. Gather as evaluation input:
1. Robot activity log
2. Remedy log for required interventions in case of robot failure

176. Both logs are data sets. They can be represented as process metrics.

177. Required information objects:
− Process Metrics (activity log, remedy log)

9.2 Conduct evaluation
1. Review evaluation inputs
2. Identify failure-prone robots and analyse reasons for failure and the remedies used
3. Prepare an evaluation report (represented as a data set or product)
178. Required information objects:
− Data Set (Robot evaluation metrics)
− Rule (Average rate of robot failures greater than 10%)
− Process Component (Calculate robot evaluation metrics)
− Product (Robot evaluation report)

9.3 Agree action plan

179. An action plan may include modification of the robots, for example in terms of more stable data sources or automation of remedies.

180. Required information objects:
− Process Control (If Rule 1 and Rule 2 are satisfied then start Process Step Execution "Identify potential data sources")
− Process Component (Validate quality of product)
− Rule (Update Frequency is at least weekly; Data Format=Query Interface)
− Product/Service (International trade table on website of Country X's ministry of finance)

181. For example, actions may include revision of the schedule, of the configuration of process components and/or rules, using a different data product (maybe of a different data provider), or adding another product as a source. Any of the other previously used categories may be involved.

Information objects

182. In Table B2, the IMF information objects identified in the use case are listed (first column) and mapped to the GSIM objects (third column). The second column lists attributes to be considered in the further specification of GSIM information objects. Columns 4, 5 and 6 can be completed once those levels of detail have been specified for GSIM.
Table B2: Comparison of information objects (columns: Identified information objects; Attributes found; GSIM information objects applicable; GSBPM steps where used; Definition match (Y/N); Attributes applicable (Y/N); Relationships applicable (Y/N); Test comments)

− Data set, (reference) metadata set, production data set, dissemination data set; attributes found: data location, format; GSIM object: Data set; GSBPM steps: 1.1, 1.2, 1.3, 1.5, 1.6, 2.1, 2.5, 2.6, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 5.1, 5.3, 9.3.
− Concept, dimension, attribute, reference metadata; attributes found: type; GSIM object: Variable; GSBPM steps: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.1, 2.2, 2.3, 2.4, 2.5, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 5.1, 5.3, 9.2, 9.3.
− Dimension, source data; GSIM object: Population unit; GSBPM steps: 1.1, 1.2, 1.3, 1.4, 1.6, 2.4, 2.5, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 5.3, 9.3.
− Web scraping tool, robot, mapping; GSIM object: Process component; GSBPM steps: 1.1, 1.2, 1.3, 1.5, 1.6, 2.3, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 5.1, 5.3, 9.2, 9.3.
− Constraint, business rule, formula, mapping rule; GSIM object: Rule; GSBPM steps: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.1, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 5.1, 5.3, 9.2, 9.3.
− Data report, metadata report, quality report, table, publication; attributes found: data format, transmission channel, location (e.g. web address), last update date (should be generic); GSIM object: Product; GSBPM steps: 1.1, 1.2, 1.3, 1.5, 1.6, 2.1, 2.3, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 9.3.
− Requirements, business need; GSIM object: Information request; GSBPM steps: 1.1, 1.2, 1.3, 1.6, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 9.3.
− Authoritative list, production list, source dimension, measurement unit (a dimension); attributes found: data type, format, code list, measurement unit, precision, scale; GSIM object: Value domain; GSBPM steps: 1.4, 1.5, 1.6, 2.1, 2.2, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 9.3; test comment: Level 3.
− Online query facility, SDMX web service; attributes found: channel/format; GSIM object: Service; GSBPM steps: 1.5, 1.6, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 9.3.
− Data provider, data source, direct source; GSIM object: Data provider; GSBPM steps: 2.1, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 9.3; test comment: Level 3.
− Hierarchical code list, authoritative list, dimension; GSIM object: Classification; GSBPM steps: 2.2, 2.3, 2.4, 2.5, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 5.1, 5.3, 9.3.
− Schedule, business rule/logic; GSIM object: Process Control; GSBPM steps: 2.6, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 9.3.
− Robot activity log, failure remedy log; GSIM object: Process Metrics; GSBPM steps: 2.6, 3.4, 3.5, 3.6, 4.2, 4.3, 4.4, 9.1, 9.3.

Annex C: Mapping to other standards and models

183. This document begins to explore the relationships between GSIM and a number of other standards and models. These include GSBPM, SDMX, DDI and CORE.

184. For each standard or model, the differences in terms of modeling, similar terms, gaps or differences, and example mappings are given. This work is in its preliminary stages and will continue to be expanded upon as GSIM develops further.

Relationship between GSBPM and GSIM

Differences in terms of modeling

185. The GSBPM is a business process model comprising four levels (the statistical production process, nine phases, sub-processes within each phase, and a textual description of each sub-process). The GSBPM is available as a combination of a text document and a diagram. The text contains many synonyms, homonyms, and broader and narrower terms.

Similar terms

186. GSIM is designed to support the GSBPM and to cover the whole statistical process. Therefore, there are a great many similarities between the terms used in the GSBPM and the names of the information objects in GSIM. The strength of the GSBPM lies in its ample use of synonyms, homonyms, and broader and narrower terms for communicating to a wide audience. However, these are not suitable for an information model, which requires a much more controlled vocabulary and a much higher degree of semantic precision.

187. Future work should focus on a more detailed mapping between names of information objects in GSIM and GSBPM terms. This work has begun, as shown in Table C1.
Gaps/differences

188. GSBPM gives equal focus to all phases of statistical production. GSIM in its current state gives greater focus to the information objects required for the data collection (acquisition) and dissemination phases; for example, the analysis phase is currently less detailed in GSIM.

189. GSBPM does not clearly distinguish between the structure of the data file, database, data record, data extraction or data set and the contents, i.e. the data. GSIM attempts to distinguish clearly between these.

Relationship between SDMX and GSIM

Differences in terms of modeling

190. SDMX is model-driven through the SDMX Information Model at a conceptual level. All technology specifications in SDMX are implementations of this model. SDMX-ML is the XML format for the exchange of SDMX-structured data and metadata; it is the detailed technical implementation of the conceptual model. Content-oriented guidelines, including the Metadata Common Vocabulary (MCV), are available to provide users with a uniform understanding of standard statistical metadata concepts.

191. The models of GSIM (level 3) and SDMX are expressed in a similar way: in each case a conceptual model is available as a UML class diagram.

Similar constructs

192. There are strong relationships in the areas of concepts, variables, aggregate data and classifications. In the area of concepts, variables and classifications GSIM is a richer model, being aligned more closely with ISO/IEC 11179 and the Neuchâtel concept/variable model. GSIM code lists are essentially similar to SDMX code lists, but GSIM also includes an explicit classification object, while this is subsumed into the Code list in SDMX. In general, GSIM and SDMX are both aligned with ISO/IEC 11179. The GSIM cube data structure is closely aligned with the SDMX data structure definition. The GSIM provider and provision agreement objects, relating both to data acquisition and to data dissemination, are based on the SDMX objects of the same names. GSIM, as it stands, does not include category and category scheme objects to support categorization of model objects of all types, but these will probably be added based on the SDMX approach.

193. Future work should focus on a detailed mapping between GSIM and SDMX objects. This work has begun. The mapping includes detailed descriptions of the GSIM and SDMX objects, together with notes (for example, where the mapping is constrained in some sense).

Table C2. Examples of mapping between GSIM and SDMX

GSIM Object: Provision Agreement
GSIM Definition: A Provision Agreement is a set of agreements that exist around the exchange of data between a data provider and a statistical organisation.
SDMX Object: ProvisionAgreement
SDMX Definition: Links the Data Provider to the relevant Structure Usage (e.g. Dataflow Definition or Metadataflow Definition) for which the provider supplies data or metadata. The agreement may constrain the scope of the data or metadata that can be provided, by means of a Constraint.

GSIM Object: Cube data structure definition
GSIM Definition: A Cube Data Structure Definition describes the structure of a data set where the data is aggregated.
SDMX Object: DataStructureDefinition
SDMX Definition: A Data Structure Definition (DSD) is metadata describing the structure and organisation of a data set, the statistical concepts and the code lists attached to them, as used within the data set.
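As an informal illustration of the cube data structure definition discussed in paragraph 192 and Table C2, the following sketch shows the kind of structural metadata such a definition carries: dimensions with attached code lists, measures and attributes for an aggregated data set. The class and identifier names are invented for illustration; this is neither SDMX-ML nor a GSIM specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Simplified, hypothetical stand-ins for the structural metadata objects discussed in paragraph 192.
@dataclass
class CodeList:
    id: str
    codes: Dict[str, str]  # code -> label

@dataclass
class Dimension:
    concept: str
    code_list: CodeList

@dataclass
class CubeDataStructureDefinition:
    id: str
    dimensions: List[Dimension]
    measures: List[str]
    attributes: List[str] = field(default_factory=list)

    def validate_key(self, key: Dict[str, str]) -> bool:
        """Check that an observation key uses only valid codes for each dimension."""
        return all(key.get(d.concept) in d.code_list.codes for d in self.dimensions)

# Example: a small trade cube with reference area and time period dimensions.
ref_area = CodeList("CL_AREA", {"DE": "Germany", "FR": "France"})
time_period = CodeList("CL_TIME", {"2011": "2011", "2012": "2012"})

dsd = CubeDataStructureDefinition(
    id="DSD_TRADE",
    dimensions=[Dimension("REF_AREA", ref_area), Dimension("TIME_PERIOD", time_period)],
    measures=["OBS_VALUE"],
    attributes=["UNIT_MULT"],
)

print(dsd.validate_key({"REF_AREA": "DE", "TIME_PERIOD": "2012"}))  # True
```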
Gaps/differences

194. GSIM is a much broader model than SDMX. SDMX has its emphasis on macrodata/aggregate data; the gaps are therefore in the areas of data collection, methods and microdata. A further missing area is the field of statistical processes. The GSIM process model is significantly different from the SDMX Process object: the SDMX object focuses on metadata inputs and outputs, while the GSIM process objects aim to model statistical processes in a richer fashion.

Issues/differences with terminology, definitions and concepts

195. In GSIM, a Category is an object that groups together real world items according to a common property, whereas SDMX uses Category in the context of categorizing metadata (and similar objects) as an aid to search and retrieval.

Relationship between DDI and GSIM

Differences in terms of modeling

196. DDI Lifecycle (DDI-L 3.1) is expressed in XML Schema as the detailed technical implementation. At the same time, an implicit conceptual model is based on a particular view of the data lifecycle; this is further described in two documents. A formal expression as a UML class diagram is available as a working document for groups of the DDI Alliance. The conceptual model of the next generation of DDI will be expressed as a UML class diagram. GSIM and DDI-L are both aligned with ISO/IEC 11179. Cross-referenced field-level documentation is available.

Similar constructs

197. DDI has its emphasis on microdata across the whole data life cycle. There are strong relationships in the areas of data collection for surveys, concepts and variables. GSIM and DDI are both aligned with ISO/IEC 11179. The GSIM unit data structure object and related objects are closely aligned with DDI. The instrument, control construct and question objects are also closely aligned with DDI.

198. The rich set of descriptions of data at the logical and physical level in GSIM is borrowed from DDI. This covers, for example, relationships between different logical record types and details of the storage format. The same is the case for the survey instrument and the related control constructs for conditional processing.

199. A subset of process rules relates strongly to constructs in DDI. This comprises coding instructions that pertain to data collection or data processing overall, such as handling of non-response to questions, imputation practices, suppression rules, and instructions for recodes and derivations from multiple question or variable sources (an illustrative sketch of such a rule is given after Table C3).

200. Future work should focus on a detailed mapping between GSIM and DDI objects. This work has begun. The mapping includes detailed descriptions of the GSIM and DDI objects, together with notes (for example, where the mapping is constrained in some sense).

Table C3. Examples of mapping between GSIM and DDI

GSIM Object: Instance Variable
GSIM Definition: An Instance Variable is a characteristic of a unit being observed that may assume one or more of a set of values as used in a particular data resource.
DDI 3.1 Object: Variable
DDI Definition: Describes a variable contained in the variable scheme.
DDI Module: Logical product
DDI Notes: Maps directly.

GSIM Object: Classification
GSIM Definition: A Classification is an ensemble of one or more related lists of mutually exclusive categories.
DDI 3.1 Object: Category Scheme
DDI Definition: Contains descriptions of particular categories used as question responses and in the logical product. Their relationships and code values are described in the code scheme.
DDI Module: Logical product
DDI Notes: Use as classification. CodeScheme can be used additionally if the classification has codes.
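The recodes and derivations mentioned in paragraph 199 can be illustrated with a minimal sketch. The code below is a hypothetical example, not DDI syntax: it derives a coded age-group variable from a numeric age variable, with a reserved code for non-response.

```python
from typing import Optional

# Hypothetical recode rule of the kind described in paragraph 199: it derives a coded
# "age group" instance variable from a numeric "age" instance variable, with a
# reserved code for non-response.
AGE_GROUP_CODES = {"1": "0-14", "2": "15-64", "3": "65+", "9": "not stated"}

def derive_age_group(age: Optional[int]) -> str:
    if age is None:      # non-response handling
        return "9"
    if age < 15:
        return "1"
    if age < 65:
        return "2"
    return "3"

# Apply the derivation to a few collected records.
records = [{"age": 7}, {"age": 42}, {"age": 70}, {"age": None}]
for record in records:
    record["age_group"] = derive_age_group(record["age"])

print([AGE_GROUP_CODES[r["age_group"]] for r in records])
# ['0-14', '15-64', '65+', 'not stated']
```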
Gaps/differences

201. GSIM explicitly seeks to be more general than DDI in the data acquisition area. The DDI focus in this area is essentially on survey questionnaires, while GSIM seeks to handle surveys, administrative data and registers, and other sources from business or the internet in an even-handed fashion. The GSIM handling of aggregate data structures (the cube data structure) is more closely aligned with the SDMX data structure definition than with the DDI NCube structure (although some aspects are borrowed from the NCube). GSIM modeling of processes is richer and more general than the DDI objects in this area. GSIM uses different terminology from DDI in modeling statistical activities and statistical collections, but there are similarities that will probably allow mapping. DDI focuses on grouping the metadata for related activities into container objects, and this is something that GSIM almost certainly needs.

Issues/differences with terminology, definitions and concepts

202. As noted above, the handling of aggregate structures is more closely aligned with SDMX than with DDI, and the terminology is also more closely aligned with SDMX. Also as noted above, the terminology around higher-level objects for activities and programs is different. The terminology in the handling of processes is also different.

DDI Codebook

203. DDI Codebook is an older version of DDI, focusing on a single study from an archival perspective. It takes a variable-centric view and contains a subset of the objects of DDI Lifecycle. A mapping path from DDI Codebook to DDI Lifecycle is available.

Relationship between SDMX and DDI and potential impacts on GSIM

204. There are a number of published documents that describe the relationship between SDMX and DDI at a high level (see footnote 6).

Footnote 6: Exploring the relationship between DDI, SDMX and the Generic Statistical Business Process Model. Steven Vale, United Nations Economic Commission for Europe. http://www1.unece.org/stat/platform/download/attachments/57835554/EDDI+paper.pdf?version=1, http://dx.doi.org/10.3886/DDIOtherTopics01; DDI and SDMX: Complementary, Not Competing, Standards. Arofan Gregory, Pascal Heus. http://www.opendatafoundation.org/papers/DDI_and_SDMX.pdf

205. The detailed mapping of GSIM to DDI and SDMX is dependent on the detailed GSIM model, the definitions of the GSIM objects, and clear relationships between the GSIM objects. An approach to the mapping has been developed. The mapping work has only started and should be continued after Sprint 2. The detailed mapping could be used as a type of review of the GSIM model, to check whether anything is missing. It could also be used as a review of DDI and SDMX, to see where those standards have gaps.

206. Further work can be done in this area after Sprint 2.

Relationship between CORE and GSIM

Differences in terms of modeling

207. The CORE information model is organized as a structure consisting of six packages: Data set definitions, Expressions, Parameters, Rules, Service info, and Messages. Each package consists of classes and class interdependencies.

[Figure C1. Packages in the CORE information model: Data set definitions, Parameters, Expressions, Rules, Messages, Service info, Channels]

208. The dependencies between the packages are shown above.
Note that the dependencies are transitive: the Rules package also depends on the Data set definitions package, through the Expressions package.

Similar constructs

209. Both CORE and GSIM have Rule as an information object.

210. Both CORE and GSIM can use processes modeled by GSBPM or by another business process model.

211. The CORE model knows of the existence of statistical information objects, but knows nothing else about them. However, CORE is designed to be able to make use of GSIM information objects.

212. Future work should focus on a detailed mapping between GSIM and CORE information objects. This work has begun.

Table C4. Examples of mapping between GSIM and CORE

GSIM Set: Variable
GSIM Definition: The Variable set contains objects that describe the measurement of real world phenomena that are the subject of a statistical activity.
GSIM Example: Concept, Variable, Contextual Variable, Instance Variable
CORE Definition: A denotation and definition (including role and value type) of a column of data in a data set.
CORE Example: turnover (a variable); average turnover (a measure); NACE Rev. 2, second digit (a level); execution time (a logging indicator)

GSIM Set: Rule
GSIM Definition: The Rule set contains objects that govern processes.
CORE Definition: An expression together with a data set definition. Upon execution of a service, the expression is evaluated when it is applied to a row of data from a data set that conforms to the data set definition.
CORE Example: Examples of expressions: x < 100 and x > 10; if-then-else(x>5,x,5); SUM(x) / COUNT(x)

Gaps/differences

213. In CORE, Figures, Time Series, Population, Unit, Variable and Value are used as categories for the layers of the model. This construct has not been used in GSIM.

Issues/differences with terminology, definitions and concepts

214. In CORE, classifications are modeled as rectangular data sets. In GSIM, classifications are modeled as hierarchical structures in accordance with the Neuchâtel model.

215. In CORE, a variable is a kind of column, but a column is also more general than a variable, i.e. it could be a dimension, a measure, a logging indicator or a level (of a classification).

Relationships with other standards and models

Neuchâtel model

216. Classifications in GSIM are modeled based on the Neuchâtel model for classifications. GSIM does not currently include the Neuchâtel correspondence table.

217. The Neuchâtel model for variables and related concepts models variables, units and populations at a more detailed level than GSIM currently does.

218. Other models and standards

219. It is the intention that the following standards/models and implementation syntaxes/tools are examined in future GSIM work:
- ISO/IEC 11179
- ISO 19115 and related standards/models
- Google DSPL
- Open data RDF-based vocabularies
- Dublin Core
- Standards-based/supporting software tools (Blaise, PC Axis, etc.)

Annex D. GSIM Metamodel

220. Each of the objects in GSIM will be described using the following metamodel (that is, the description of how the metadata objects are themselves structured).

221. Not all of this information has been created for the purposes of GSIM v0.4; this information will be completed before the release of GSIM v1.0 in December 2012.
Template

Object ID:
Object Name:
Object Version:
Definition:
Value Domain:
Default Value:
Explanatory Text:
Synonyms:

Attributes (repeat as needed):
  Property Name:
  Description:
  Cardinality:
  Value Domain:

Relationships (repeat as needed):
  Relationship Name:
  Target Object ID:
  Target Object Name:
  Description:
  Relationship type:
  Relationship Cardinality:
  Constraints:
    Relationship Constraint Type:
    Relationship Constraint Value:

Example

Object ID: GSIM123
Object Name: Variable
Object Version: 1.0
Definition: A variable is a characteristic of a unit being observed that may assume more than one of a set of values to which a numerical measure or a category from a classification can be assigned ...
Value Domain: Compound Object
Default Value:
Explanatory Text: This object is used to ...
Synonyms:

Properties:
  Property Name: Weight Status
  Description: Indicates whether a variable is or is not holding a value which is a weight
  Cardinality: One and only one (1:1)
  Value Domain: Logical (True/False)

Relationships:
  Relationship Name: Measured concept
  Target Object ID: GSIM-Con12345
  Target Object Name: Concept
  Description: Indicates that the Variable measures one and only one Concept ...
  Relationship type: Measures
  Relationship Cardinality: One and only one (1:1)
  Constraints:
    Relationship Constraint Type: Ordered
    Relationship Constraint Value: 1
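As an indication of how the metamodel template and example above could be captured in machine-readable form, the following sketch encodes the Variable example using simple data structures. The class and field names are assumptions for illustration only; they are not part of GSIM v0.4.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical, simplified encoding of the Annex D metamodel template.
@dataclass
class Attribute:
    property_name: str
    description: str
    cardinality: str
    value_domain: str

@dataclass
class Relationship:
    relationship_name: str
    target_object_id: str
    target_object_name: str
    description: str
    relationship_type: str
    relationship_cardinality: str
    constraints: Dict[str, object] = field(default_factory=dict)

@dataclass
class GsimObjectDescription:
    object_id: str
    object_name: str
    object_version: str
    definition: str
    value_domain: str
    explanatory_text: str = ""
    synonyms: List[str] = field(default_factory=list)
    default_value: Optional[str] = None
    attributes: List[Attribute] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

# The Variable example from Annex D, encoded with the structures above.
variable = GsimObjectDescription(
    object_id="GSIM123",
    object_name="Variable",
    object_version="1.0",
    definition=("A variable is a characteristic of a unit being observed that may assume "
                "more than one of a set of values to which a numerical measure or a "
                "category from a classification can be assigned"),
    value_domain="Compound Object",
    explanatory_text="This object is used to ...",
    attributes=[Attribute("Weight Status",
                          "Indicates whether a variable is or is not holding a value which is a weight",
                          "One and only one (1:1)",
                          "Logical (True/False)")],
    relationships=[Relationship("Measured concept", "GSIM-Con12345", "Concept",
                                "Indicates that the Variable measures one and only one Concept",
                                "Measures", "One and only one (1:1)",
                                constraints={"Ordered": 1})],
)

print(variable.object_name, "measures", variable.relationships[0].target_object_name)
```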