A handbook on validation methodology
Marco Di Zio
Istat
Workshop ValiDat Foundation –
Wiesbaden, 10-11 November 2015
Underlying idea of the HB
 Why a handbook on methodology for data validation?
 To standardise the language and the elements of data validation (DV)
and to provide common measures for evaluation
 To establish a common reference framework and develop
metrics for evaluating DV
 The HB is composed of two main parts:
 A generic framework for data validation
 A discussion of metrics for evaluating a validation procedure
(tuning, evaluating the procedure, ...)
ValiDat foundation workshop - Wiesbaden 10-11 November 2015
Generic framework for data validation
The objective of this first section is to clarify
 What
 Why
 How
and …
Generic framework for data validation
 Clearly establish the relation with other phases of the
statistical production process and with international standards
such as
 GSBPM
 GSDEMs
 GSIM
 Describe the data validation life cycle – useful for managing
the data validation process
What is data validation… Definition
 Data Validation is an activity verifying whether or not a
combination of values is a member of a set of acceptable
combinations.
 not far from the UNECE definition:
An activity aimed at verifying whether the value of a data item
comes from the given (finite or infinite) set of acceptable
values
 but essentially different…
What…
 It is a decision procedure that ends with the data being
accepted or rejected.
 The decision procedure is generally based on rules
expressing the acceptable combinations of values.
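A minimal sketch of such a decision procedure, in Python: each rule is a predicate over a record (a combination of values), and the record is accepted only if every rule holds. The variable names (`age`, `marital_status`) and the two rules are hypothetical illustrations, not taken from the HB.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical rules: each returns True when the combination of values is acceptable.
RULES: Dict[str, Rule] = {
    "age_non_negative": lambda r: r["age"] >= 0,
    "minor_not_married": lambda r: not (r["age"] < 16 and r["marital_status"] == "married"),
}

def failed_rules(record: Record, rules: Dict[str, Rule] = RULES) -> List[str]:
    """Return the names of the rules that the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]

def validate(record: Record, rules: Dict[str, Rule] = RULES) -> bool:
    """Accept the record (True) only if it satisfies every rule."""
    return not failed_rules(record, rules)
```

Keeping the list of failed rules alongside the accept/reject outcome makes it possible to see why a record was rejected.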
Why do we perform data validation…
 The purpose of data validation is to ensure a certain level of
quality of the final data
 but quality has several aspects. We clarified which aspects
are related to DV
 Essentially those related to the ‘structure of the data’,
namely accuracy, comparability and coherence.
 Others are connected as well, e.g., timeliness can be seen as a
constraining factor
How to perform DV…
Two main elements
 Validation levels
 to what extent a data set has been validated
 Validation rules
 Rules are applied to data; the failure of a rule implies that
the corresponding validation level is not attained by the
data at hand (decision process: accept/not accept)
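The link between rules and levels can be sketched in Python as follows. This is only an illustration under stated assumptions: the levels are taken to be ordered and cumulative (a level counts as attained only if all lower levels are attained too), and the three levels, the rules and the field names (`id`, `turnover`) are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Dataset = List[Dict[str, Any]]
Rule = Callable[[Dataset], bool]

# Hypothetical levels, ordered from lowest to highest; each has its own rules.
LEVELS: List[Tuple[int, List[Rule]]] = [
    # Level 0: structural check, every row carries the expected fields.
    (0, [lambda d: all({"id", "turnover"} <= set(row) for row in d)]),
    # Level 1: in-record check, turnover is non-negative.
    (1, [lambda d: all(row["turnover"] >= 0 for row in d)]),
    # Level 2: cross-record check, identifiers are unique.
    (2, [lambda d: len({row["id"] for row in d}) == len(d)]),
]

def attained_level(data: Dataset, levels: List[Tuple[int, List[Rule]]] = LEVELS) -> int:
    """Return the highest level attained by the data, or -1 if even the lowest fails."""
    attained = -1
    for level, rules in levels:
        if not all(rule(data) for rule in rules):
            break  # a failed rule means this level (and any higher one) is not attained
        attained = level
    return attained
```

Stopping at the first failed level also avoids running higher-level rules on data that do not yet have the structure those rules expect.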
Validation levels
They are related to the perspective of the ‘validator’ …
In the HB:
 Business perspective
 Starting from the elements that usually characterise the DV
process (increasing information)
 A formal approach
 Looking at the elements that characterise a data point in a
statistical setting
Validation levels: business perspective
Validation levels: formal approach
Metadata aspects that are necessary to identify a data point:
 The universe U from which a statistical object originates
(e.g., household, company).
 The time τ of selecting an element u from the current
population p(τ).
 The selected element u. This determines the values of the
variables X that may be observed over time.
 The variable X selected for measurement.
| Class (UτuX) | Description of input | Example function | Description of example |
|---|---|---|---|
| ssss | Single data point | $x > 0$ | Univariate comparison with constant |
| sssm | Multivariate (in-record) | $x + y = z$ | Linear restriction |
| ssms | Multi-element (single variable) | $\sum_{u \in s} x_u > 0$ | Condition on aggregate of single variable |
| ssmm | Multi-element multivariate | $\frac{\sum_{u \in s} x_u}{\sum_{u \in s} y_u} < \epsilon$ | Condition on ratio of aggregates of two variables |
| smss | Multi-measurement | $x_\tau - x_\nu < \epsilon$ | Condition on difference between current and previous observation |
| smsm | Multi-measurement multivariate | $\frac{x_\tau + y_\tau}{x_\nu + y_\nu} < \epsilon$ | Condition on ratio of sums of two currently and previously observed observations |
| smms | Multi-measurement multi-element | $\frac{\sum_{u \in s} x_{u\tau}}{\sum_{u \in s} x_{u\nu}} < \epsilon$ | Condition on ratio of current and previously observed aggregate |
| smmm | Multi-measurement multi-element multivariate | $\frac{\sum_{u \in s} x_{u\tau}}{\sum_{u \in s} x_{u\nu}} - \frac{\sum_{u \in s} y_{u\tau}}{\sum_{u \in s} y_{u\nu}} < \epsilon$ | Condition on difference between ratios of previous and currently observed aggregates |
| msmm | Multi-universe multi-element multivariate | $\frac{\sum_{u \in s} x_u}{\sum_{u' \in s'} y_{u'}} < \epsilon$ | Condition on ratio of aggregates over different variables of different object types |
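The class labels appear to follow a simple convention: for each of the four dimensions U, τ, u, X, write s when the data points entering a rule share a single value on that dimension and m when they take multiple values. A minimal Python sketch of that reading, assuming each data point is identified by a hypothetical (universe, time, element, variable) tuple:

```python
from typing import Iterable, Tuple

# A data point key: (universe, time, element, variable). Hypothetical encoding.
Key = Tuple[str, str, str, str]

def classify(keys: Iterable[Key]) -> str:
    """Return the four-letter class: 's' if a dimension takes one value, 'm' if several."""
    keys = list(keys)
    if not keys:
        raise ValueError("a validation function needs at least one data point")
    return "".join(
        "s" if len({k[i] for k in keys}) == 1 else "m"
        for i in range(4)
    )

# x > 0 on a single data point
single_point = classify([("company", "2015", "u1", "x")])
# x + y = z within one record
in_record = classify([("company", "2015", "u1", v) for v in ("x", "y", "z")])
# sum of x over several elements > 0
aggregate = classify([("company", "2015", u, "x") for u in ("u1", "u2")])
# current versus previous observation of x for one element
over_time = classify([("company", t, "u1", "x") for t in ("2014", "2015")])
```

The sketch only counts distinct values per dimension; it does not check which combinations of s and m are meaningful.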
Data validation - GSDEMs
 Generic Statistical Data Editing Models
 Statistical data editing is composed of three different
function types: Review, Selection and Amendment
 The review functions are defined as:
Functions that examine the data to identify potential problems.
This may be by evaluating formally specified quality measures
or edit rules or by assessing the plausibility of the data in a less
formal sense, for instance by using graphical displays
Data validation - GSDEMs
 Among the GSDEMs function categories there is
‘Review of data validity’, defined as:
Functions that check the validity of data values against a
specified range or a set of values and also the validity of
specified combinations of values. Each check leads to a binary
value (TRUE, FALSE)
Data Validation - GSBPM
[Figure: data validation within the GSBPM. Raw data → Validation, sub-phases 4.3, 5.3 (Review-Validate) → Selection → Amendment, sub-phase 5.4 (Edit-Impute) → Edited data → Validation, sub-phase 6.2 (Validate Outputs) → Final data]
Data validation life cycle
Introduction
1 What is data validation
2 Why data validation. Relationship between validation and quality
3 How to perform data validation: validation levels and validation rules
4 Validation levels from a business perspective
4.1 Validation rules
5 Validation levels based on decomposition of metadata
5.1 A formal typology of data validation functions
5.2 Validation levels
6 Relation between validation levels from a business and a formal perspective
6.1 Applications and examples
7 Data validation as a process
7.1 Data validation in a statistical production process (GSBPM)
7.2 The informative objects of data validation (GSIM)
8 The data validation process life cycle
8.1 Design phase
8.2 Implementation phase
8.3 Execution phase
8.4 Review phase
References
Appendix A: List of validation rules
Second part of the document: Metrics
 Evaluating validation procedure
 …next presentation…
Thanks for your attention