Validation

WGM April 05/EN
WG METHODOLOGY
5 APRIL 2017
Item 1.3 of the agenda
Validation – Break-out sessions
Methodology, Business Architecture, Tools and services, VTL
1. Background
The European Statistical System (ESS) agreed to improve the data validation process by pursuing the
following two medium-term objectives:
 Ensure the transparency of the validation procedures applied to the data sent to Eurostat by the
ESS Member States through a common validation policy focusing on the attribution of
validation responsibilities among the different actors in the production process of European
statistics.
 Improve the interoperability between Eurostat and Member States through the sharing and reuse of validation services across the ESS on a voluntary basis.
Achieving the objectives mentioned above requires a combination of investment along five different
dimensions: methodology, processes/governance, information standards, IT and human resources.
In this context, instruments are also being developed in the ESS. The 4 main instruments are:
1. A Methodological handbook,
2. A Business and IT Architecture,
3. Tools and services,
4. VTL (Validation and Transformation Language).
2. Objectives of the break-out sessions
The main objective of the 4 break-out sessions related to the 4 main instruments listed above is to
exchange views on the practical implications for NSIs of the deployment of an ESS validation framework.
Outcomes of the discussions will feed the work in the ESS (at Eurostat, in the Task Force on Validation,
in the ESSnet Validat integration) and also in countries. They could also potentially trigger
cooperation between countries.
3. Organisation of the break-out sessions
Four steps will be followed for the organisation of the work:
1. Plenary (10 minutes): Eurostat introduces the 4 sessions, then participants join the session of
their choice.
2. Sessions (40 minutes): participants work in the session of their choice. A spokesperson is
chosen to report to the plenary.
3. Plenary (4*5 minutes): each spokesperson reports on the conclusions of his/her session and
answers are provided to possible questions.
4. Plenary (10-20 minutes): overall conclusions are derived.
In Annexes 1 to 4, you can find an introduction to each break-out session with the context, objectives
and suggested discussion points.
Annex-1 (Session 1): “Methodology”
Context:
A precondition for a transparent validation process and for the identification and design of
interoperable IT services is the availability of a common framework for validation in the ESS. At the
outset of our modernisation efforts, it became clear that the terminology used to identify concepts
related to data validation varied widely across Member States and across statistical domains.
Moreover, while a large literature describing specific data validation or data editing methods is
available, there are few internationally recognized references providing a comprehensive
conceptualization of the data validation process itself.
The ESS therefore created a methodological handbook on data validation. By providing common
definitions, common classifications and common metrics for data validation, the handbook gives a
reference framework on which all further developments can be based.
The handbook on data validation can be accessed here.
Objectives of the session:
The main objective of this break-out session is to exchange views on several key methodological
issues related to data validation. Some of the issues listed below have been analysed in depth in the
methodological handbook (in particular the metrics and types of validation rules, …). Other issues
of concern to countries can also be discussed.
The outcome of the discussions will feed the work at Eurostat, in the Task Force on validation and in
the ESSnet Validat integration (in particular Work Packages 1 and 2 of the ESSnet, which refer
respectively to the improvement of the methodological handbook and the creation of a standard
validation report).
Suggested discussion points:
1. Description of validation rules: Eurostat is working on a standard template for expressing validation
rules, to be agreed at domain-specific working group meetings for data sent to Eurostat. The main
information considered for describing the rules is the following:
 Reference to the data structure (including code-lists used),
 Description in plain English in positive wording (what should be provided),
 Description in VTL (non-ambiguous standard language),
 Severity level,
 Examples of good and bad data.
(see also the example of a description of a validation rule for National Accounts further below, as
well as the simplified sketch that follows)
What do you think about the information used to describe validation rules?
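For illustration only, the sketch below fills in this template for a deliberately simple, hypothetical rule;
the dataset name ds_input and all wording are invented, and the VTL line follows the check(...) pattern
used in the Annex 1 example below.
/* Hypothetical filled-in template (illustration only) */
/* Data structure:  any dataset whose measure is OBS_VALUE */
/* Plain English:   all observation values should be greater than or equal to zero */
/* Severity level:  Error */
/* VTL: */
ds_neg_values := check(ds_input.OBS_VALUE >= 0,
    errorcode("Observation values should be greater than or equal to zero"),
    errorlevel("Error"));
/* Example of good data: OBS_VALUE = 1810926; bad data: OBS_VALUE = -5 */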
2. Types of validation rules: Eurostat is working on a classification of rules by type. This should
facilitate the description of the rules for statisticians and limit the risk of validation gaps (cases where
essential validation rules are not carried out by any of the actors).
In your NSI, have you categorized rules by type?
In your NSI, how many types of rules can be identified, and approximately what share of your
validation rules is covered by the standard types?
3. Metrics: do you apply metrics to measure the performance of your data validation process?
4. Validation reports: Eurostat is working together with the ESSnet Validat integration on the
standardisation of validation reports. This should help statisticians understand the reports and
the link between the defects and the underlying data. It should also allow automatic processing
of the reports in order to produce metrics on the results of the validation processes.
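As a purely illustrative sketch (the standard is still being defined with the ESSnet), one natural shape
for such a report is the dataset returned by a VTL check, as in the Annex 1 example below: each record
carries the identifiers of the failing datapoint together with an errorcode and an errorlevel, for instance:
REF_AREA = DK; STO = B1GQ; PRICE = Y; TIME_PERIOD = 2012; errorcode = "Inconsistency between
observation values ..."; errorlevel = "Error"
A record of this kind is readable by a statistician and, because its fields are fixed, can also be counted
and aggregated automatically to produce metrics.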
Have you developed standard validation reports in your NSI?
Can they be used both by statisticians and computers to produce metrics?
5. Warnings and justifications: According to the Business and IT Architecture, when a validation
procedure raises a warning (corresponding to suspicious data), statisticians may have to either
correct the data or provide explanations. Explanations may refer to the whole file being
transmitted or to specific records or groups of records. Explanations may then be used in the
Eurostat production process and may potentially lead to the dissemination of flags or
explanatory notes together with the data (for instance, to justify outliers).
How do you think justifications could be provided to Eurostat?
Example of a description of a validation rule for National Accounts statistics:
The extract below refers to a relatively complex rule described in a questionnaire whose purpose was
to assess the usability of VTL by statisticians in the ESS.
For this rule, the replies of 4 statisticians (3 in countries and one at Eurostat) to the question "How far
do you agree or disagree with the statement: 'I understand the validation rule expressed in VTL'?"
averaged 2.75 (between 2, "Disagree", and 3, "Neither agree nor disagree").
Rule 4: ESA _4 (Price checks)
Link with attached document (ref: Eurostat C2/TFVALID/Apr16/005a): Checks 5 to 7
Description of the rule:
Includes two rules that check the relationship between "chain linked volumes" ("L" code in the PRICE dimension), "previous year's prices" ("Y"
code in the PRICE dimension) and "current prices" ("V" code in the PRICE dimension).
The first rule only concerns a given "reference year", which is found in the REF_YEAR_PRICE dimension (at each data point, this dimension
contains only the specific reference year for the dataset or empty codes). The second rule should be applied to all years.
The two rules:
a. In the "reference year" (as indicated in the TIME_PERIOD dimension, i.e. where TIME_PERIOD = "reference year"), the observation value
(in OBS_VALUE) should be the same for "current prices" ("V" in the PRICE dimension) as for "chain linked volumes" ("L" in the PRICE
dimension).
b. Regarding all periods (years in the TIME_PERIOD), the observation value (in the OBS_VALUE dimension) of the "previous year price" in a
given year (in the TIME_PERIOD dimension) should equal the following observation values: "chain linked volumes" ("L" code in the PRICE
dimension) multiplied by the "current price" ("V" code in the PRICE dimension) in the preceding period (previous year in the TIME_PERIOD),
and divided by the "chain linked volumes" ("L" code in the PRICE dimension) in the preceding period (previous year in the TIME_PERIOD).
Expressed as an equation: Y(t) = L(t) * V(t-1) / L(t-1), where L, V and Y represent the observation values with the respective PRICE
dimension codes, and "t" represents any given period (any year in the TIME_PERIOD).
Refers to:
Datasets: all NAMAIN datasets
Data structure: FREQ; ADJUSTMENT; REF_AREA; COUNTERPART_AREA; REF_SECTOR; COUNTERPART_SECTOR;
ACCOUNTING_ENTRY; STO; INSTR_ASSET; ACTIVITY; EXPENDITURE; UNIT_MEASURE; PRICE;
TRANSFORMATION; TIME_PERIOD; REF_YEAR_PRICE; OBS_VALUE; TABLE_IDENTIFIER
Severity: Error
VTL:
Parameters:
ds_esa: the NAMAIN dataset to be validated
Approach:
(a) Extract reference year from the TIME_PERIOD dimension
(b) Define a hierarchical ruleset for the L = V (in PRICE) rule
(c) Check OBS_VALUE in the ds_esa using the hierarchical ruleset
(d) Extract three separate datasets from the ds_esa: for V, L, and Y
(e) Calculate the V(t-1) / L(t-1) part of the equation (division), and add one year to the result dataset's datapoints (so that they have correctly
matching identifiers with the rest of the variables in the subsequent calculations for the equation)
(f) Finish the calculation for the L(t) * (V(t-1) / L(t-1)) part (multiplication)
(g) Check whether the resulting equation Y(t) = L(t) * V(t-1) / L(t-1) holds (by comparing the previous result with Y(t)).
Returns:
ds_result_4a: dataset containing datapoints where the hierarchical rule is not satisfied.
ds_result_4b: dataset containing datapoints which are involved in an incorrect relation as specified in the Y(t)= L(t) * V(t-1) / L(t-1) equation.
(Each datapoint in these datasets includes the corresponding errorcode and errorlevel.)
VTL code:
/* (a) Extract the reference year from the TIME_PERIOD dimension */
reference_year := max(ds_esa[calc TIME_PERIOD as "TIME_VAL" role MEASURE].TIME_VAL);
/* (b) Hierarchical ruleset: in the reference year, chain linked volumes (L) must equal current prices (V) */
define hierarchical ruleset hr_current_vs_chainlinked (antecedentvariable = TIME_PERIOD, variable = PRICE) is
    when TIME_PERIOD = reference_year then L = V,
        errorcode("When TIME_PERIOD = Reference year, Observation values should be the same for current prices as for chain linked volumes"),
        errorlevel("Error");
end hierarchical ruleset;
/* (c) Check OBS_VALUE in ds_esa using the hierarchical ruleset */
ds_result_4a := check(ds_esa, hr_current_vs_chainlinked);
/* (d) Extract three separate datasets for V, L and Y */
ds_esa_v := ds_esa[PRICE = "V"];
ds_esa_l := ds_esa[PRICE = "L"];
ds_esa_y := ds_esa[PRICE = "Y"];
/* (e) Calculate V(t-1) / L(t-1) and shift the result one year forward so its identifiers match the other operands */
ds_ratio := { OBS_VALUE = ds_esa_v.OBS_VALUE / ds_esa_l.OBS_VALUE, TIME_PERIOD = ds_esa_v.TIME_PERIOD + 1 };
/* (f) Finish the L(t) * (V(t-1) / L(t-1)) part */
ds_end_calc := ds_esa_l * ds_ratio;
/* (g) Check that Y(t) = L(t) * V(t-1) / L(t-1) holds */
ds_result_4b := check(ds_esa_y.OBS_VALUE = ds_end_calc.OBS_VALUE,
    errorcode("Inconsistency between observation values in chain link volumes, at current Price and at previous year Price: the observation values do not comply with the Y(t) = L(t) * V(t-1) / L(t-1) relation"),
    errorlevel("Error"));
Example:
For a dataset "NAMAIN_T0101_A_DK_2012_0000_Vnnnn.xxx" (annual data, table T0101, for DK 2010-2012, reference year 2010)
Good records:
All nine records share the following dimension values: FREQ = A; ADJUSTMENT = N; REF_AREA = DK;
COUNTERPART_AREA = W2; REF_SECTOR = S1; COUNTERPART_SECTOR = S1; ACCOUNTING_ENTRY = B;
STO = B1GQ; INSTR_ASSET = _Z; ACTIVITY = _T; EXPENDITURE = _Z; TRANSFORMATION = N;
UNIT_MEASURE = XDC; REF_YEAR_PRICE = 2010; TABLE_IDENTIFIER = T0101. The varying dimensions are:

PRICE   TIME_PERIOD   OBS_VALUE
V       2010          1810926
L       2010          1810926
Y       2010          1754364
V       2011          1846854
L       2011          1835134
Y       2011          1835134
V       2012          1895002
L       2012          1839290
Y       2012          1851037
=> In the reference year (2010), "V" PRICE has OBS_VALUE 1810926, which equals the OBS_VALUE for "L" PRICE
=> For year 2012 (t = 2012), the formula Y(t) = L(t) * V(t-1) / L(t-1) gives: 1851037 = 1839290 * 1846854 / 1835134, which is correct
Bad records:
The nine records share the same common dimension values as the good records above; the varying
dimensions are:

PRICE   TIME_PERIOD   OBS_VALUE
V       2010          1810926
L       2010          1885600
Y       2010          1754364
V       2011          1846854
L       2011          1846155
Y       2011          1835134
V       2012          1895002
L       2012          1839290
Y       2012          1851037
=> "V" PRICE has 1810926 for OBS_VALUE in the reference year (2010), while "L" PRICE has 1885600 for OBS_VALUE in the same year,
therefore they are not equal; the data is erroneous
=> When taking, for example, year 2012 for the Y(t)= L(t) * V(t-1) / L(t-1) calculation, it will be: 1851037 = 1839290 * 1846854 / 1846155;
which is not a correct equation (the right side equals 1839986, and not 1851037)
Annex-2 (Session 2): “Business Architecture for ESS validation”
Context:
The purpose of the Business Architecture for ESS validation is to provide a common understanding of
how ESS validation should be conducted in the future – i.e. provide a comprehensive description of
the target state for ESS validation.
It aims to contribute to the two main objectives of the ESS Vision 2020 for validation:
 ensure the transparency of the validation procedures applied to the data sent to Eurostat by the
ESS Member States,
 share and reuse validation services across the ESS on a voluntary basis.
In particular, it:
 Describes the “To be state” of the validation business process in the ESS,
 Presents 3 possible scenarios for Member States to apply the agreed validation rules before
transmitting the data to Eurostat,
 Gives details about the acceptance of data by Eurostat linked to the severity level of each rule
(error, warning and information),
 Introduces a list of general validation principles. These principles provide the theoretical
underpinning for all the major design decisions made in the course of the elaboration of the
target to-be state.
The Business Architecture for ESS validation can be accessed here.
Objectives of the session:
The main objective of this break-out session is to exchange views on the understanding, the approach
and the support needed for implementing the Business Architecture. Answers and suggestions could
feed the work at Eurostat and in countries. For instance, answers to the first question may feed the
ESSnet Validat integration, which will carry out a cost-benefit analysis of the different scenarios.
Suggested discussion points:
1. What kind of documentation/information would be useful for Eurostat data providers to be able to
choose the most appropriate scenario?
2. Do you need support, and of which kind (training/communication)?
3. How to implement the Business Architecture in each statistical domain?
 Which ESS body should be responsible for monitoring / following up on the implementation?
 Any specific criteria to take into account when choosing which domains to go for first?
4. Is the data acceptance process of the Business and IT architecture clear?
Annex-3 (Session 3): "Tools and Services"
Context:
The target validation business process will need to be supported by relevant IT services. In order to
maximize the potential reuse by ESS Member States, these services are being developed in line with
the guidelines provided by the Common Statistical Production Architecture (CSPA).
[Figure: Overview of the validation services being developed]
The figure above provides an overview of the currently envisaged ESS validation services. The three
main services (represented by the three rectangles) are the following:
 A structural validation service, which checks that the data comply with the appropriate data
structure (e.g. correct format, correct codes, etc.). A data structure registry would supply the
structural validation service with the required information on the expected data structure. A
first version of the structural validation service was released in 2016.
 A content validation service, which goes beyond data structures and checks the consistency and
plausibility of the data themselves (e.g. aggregation checks, detection of suspicious values,
etc.). A validation rule registry would supply the content validation service with the required
validation rules, expressed in VTL, to be applied to specific datasets. A prototype of the content
validation service is being tested in a limited number of statistical domains. A first production
version is expected to be released in early 2018.
 A Validation Rule Manager, which would allow authorised users to view, create, modify and
manage VTL validation rules stored in the validation rule registry. A first version of the
Validation Rule Manager is expected to be released in 2018.
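As an illustration of the kind of rule the content validation service could execute, the sketch below
flags suspicious year-on-year growth. It is a minimal sketch: the dataset name ds_input and the 50%
threshold are invented, and only VTL constructs already used in the Annex 1 example appear.
/* Hypothetical plausibility check: flag observation values growing by more than
   50% year on year; severity "Warning" means suspicious data, not rejected data */
ds_prev := { OBS_VALUE = ds_input.OBS_VALUE, TIME_PERIOD = ds_input.TIME_PERIOD + 1 };
ds_suspicious := check(ds_input.OBS_VALUE <= 1.5 * ds_prev.OBS_VALUE,
    errorcode("Year-on-year growth above 50%: please confirm or correct the value"),
    errorlevel("Warning"));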
Objectives of the session:
The aim of this break-out session is to discuss which tools and services are used for data validation
by ESS members. The objectives are the following:
 Increase understanding of the use and re-use of validation tools and services in the ESS,
 Learn from others concerning the use of validation tools and services,
 Identify needs in the area of validation tools and services,
 Discuss interest in, and potential maturity for, sharing tools and services in the ESS.
Suggested discussion points:
1. What commercial tools and services are used for data validation?
2. What custom-developed tools and services are used for data validation?
3. What tools/services – developed by other statistical organisations, such as Eurostat or NSIs from
other countries – are used for data validation?
4. Which validation service would add the greatest value for you in the production of European
Statistics?
5. Are you familiar with, or do you participate in, initiatives like CSPA and ESS Shared Services?
The discussion will revolve around issues with the current architecture, areas for improvement and
priorities for the future.
Annex-4 (Session 4): "VTL"
Context:
In the target validation business process, Member States will exchange several kinds of information
objects: data, validation rules and validation reports. A standardization of these objects is essential in
order to guarantee the transparency of the validation process and to create interoperable IT services
capable of using and producing them.
Eurostat therefore collaborated with several other organisations at national and international level to
create a standard language to express validation rules in a non-ambiguous way. The result of these
efforts is the Validation and Transformation Language (VTL). VTL allows validation and
transformation rules to be expressed at a logical level, while specific programming languages can be
used for execution, such as R, SAS, Java or SQL. VTL builds on the SDMX information model but it
can be used with any kind of structured data and data typology (micro data, aggregated data, registers,
qualitative, quantitative, etc.). Version 1.1 of VTL is scheduled to be released in the spring of 2017.
In the ESS, it is planned to use VTL:
 as a complement to plain English to describe validation rules to be agreed by domain specific
working groups,
 as input to the ESS content validation service which will be available for use as a shared or
duplicated service by ESS partners,
 as input to specific validation services or tools developed and used by ESS partners.
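As a minimal illustration of the first point, consider a hypothetical rule agreed in plain English
("observation values for the GDP aggregate should be positive") together with its unambiguous VTL
counterpart; ds_input is an invented dataset name, and the filtering and check(...) constructs are those
used in the Annex 1 example.
/* Plain English: observation values for GDP (STO = "B1GQ") should be positive */
ds_gdp := ds_input[STO = "B1GQ"];
ds_errors := check(ds_gdp.OBS_VALUE > 0,
    errorcode("GDP observation values should be positive"),
    errorlevel("Error"));
The same VTL expression could then be executed by different engines, for instance after translation to
SQL or R, as noted in the context above.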
Objectives of the session:
The main objectives of this break-out session are to exchange views on:
 the current situation in countries regarding a corporate approach to validation rules, language and
infrastructure,
 the level of understanding of VTL and the scope envisaged for its use.
Suggested discussion points:
1. With regard to data validation in your NSI, have you developed:
a. a common validation language,
b. a repository of validation rules,
c. a common metadata repository (for data structures etc.),
d. typologies of validation rules,
e. standard methods?
2. How much do you (your NSI) know about VTL?
3. What could be the potential for VTL in your NSI?
a. Use VTL to pre-validate data sent to Eurostat/ECB/OECD/IMF
b. Use VTL internally, for your own data collections
c. Use VTL for communicating validation rules to your partners (e.g. raw data
providers).