A handbook on validation methodology
Marco Di Zio, Istat
Workshop ValiDat Foundation – Wiesbaden, 10-11 November 2015

Underlying idea of the HB
Why a handbook (HB) on methodology for data validation (DV)?
- Standardize the language and the elements, and provide common measures for evaluation
- Establish a common reference framework and develop metrics for evaluating DV
The HB is composed of two main parts:
- A generic framework for data validation
- A discussion of metrics to evaluate a validation procedure (tuning, evaluating the procedure, ...)

Generic framework for data validation
The objective of this first section is to clarify:
- What
- Why
- How
and …

Generic framework for data validation
- Clearly establish the relation with other phases of the statistical production process and with international standards such as GSBPM, GSDEMs and GSIM
- Describe the data validation life cycle (useful for managing the data validation process)

What is data validation…
Definition: Data validation is an activity verifying whether or not a combination of values is a member of a set of acceptable combinations.
This is not far from the UNECE definition: an activity aimed at verifying whether the value of a data item comes from the given (finite or infinite) set of acceptable values. But it is essentially different…

What…
- It is a decisional procedure ending with the acceptance or refusal of the data.
- The decisional procedure is generally based on rules expressing the acceptable combinations of values.

Why do we perform data validation…
The purpose of data validation is to ensure a certain level of quality of the final data, but quality has several aspects.
We clarified which aspects are related to DV:
- Essentially the ones related to the 'structure of the data', namely accuracy, comparability and coherence.
- Others are connected as well; for example, timeliness can be seen as a constraining factor.

How to perform DV…
Two main elements:
- Validation levels: to what extent a data set has been validated
- Validation rules: rules are applied to data; a failure of a rule implies that the corresponding validation level is not attained by the data at hand (decisional process: accept/not accept)

Validation levels
They are related to the perspective of the 'validator'. In the HB:
- Business perspective: starting from the elements that usually characterise the DV process (increasing information)
- A formal approach: looking at the elements characterising a data point in a statistical setting

Validation levels: business perspective
[Figure: validation levels from a business perspective]

Validation levels: formal approach
Metadata aspects that are necessary to identify a data point:
- The universe U from which a statistical object originates (e.g., household, company)
- The time t of selecting an element u from the current population p(t) (written τ in the class labels and formulas below)
- The selected element u. This determines the values of the variables X over time that may be observed.
- The variable X selected for measurement
In a class label, each of the four positions (U, τ, u, X) is marked s (single) or m (multiple).

Class (UτuX) | Description of input | Example function | Description of example
ssss | Single data point | $x > 0$ | Univariate comparison with constant
sssm | Multivariate (in-record) | $x + y = z$ | Linear restriction
ssms | Multi-element (single variable) | $\sum_{u \in s} x_u > 0$ | Condition on aggregate of single variable
ssmm | Multi-element multivariate | $\sum_{u \in s} x_u / \sum_{u \in s} y_u < \epsilon$ | Condition on ratio of aggregates of two variables
smss | Multi-measurement | $x_\tau - x_\nu < \epsilon$ | Condition on difference between current and previous observation
smsm | Multi-measurement multivariate | $(x_\tau + y_\tau) / (x_\nu + y_\nu) < \epsilon$ | Condition on ratio of sums of two currently and previously observed observations
smms | Multi-measurement multi-element | $\sum_{u \in s} x_{u\tau} / \sum_{u \in s} x_{u\nu} < \epsilon$ | Condition on ratio of current and previously observed aggregate
smmm | Multi-measurement multi-element multivariate | $\sum_{u \in s} x_{u\tau} / \sum_{u \in s} y_{u\tau} - \sum_{u \in s} x_{u\nu} / \sum_{u \in s} y_{u\nu} < \epsilon$ | Condition on difference between ratios of previous and currently observed aggregates
msmm | Multi-universe multi-element multivariate | $\sum_{u \in s} x_u / \sum_{u' \in s'} y_{u'} < \epsilon$ | Condition on ratio of aggregates over different variables of different object types

Data validation - GSDEMs
Generic Statistical Data Editing Models (GSDEMs): statistical data editing is composed of three different function types: Review, Selection and Amendment.
The review functions are defined as: Functions that examine the data to identify potential problems. This may be by evaluating formally specified quality measures or edit rules or by assessing the plausibility of the data in a less formal sense, for instance by using graphical displays.

Data validation - GSDEMs
Among the different GSDEMs function categories there is 'Review of data validity', that is: Functions that check the validity of data values against a specified range or a set of values and also the validity of specified combinations of values.
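As an illustration only, a few of the rule classes from the typology, together with range and set checks of the kind described for 'Review of data validity', might be sketched in Python as follows. The function names, the example record, the variable names (turnover, costs, profit, sector) and the numerical tolerance are all assumptions made for this sketch; none of them come from the handbook.

```python
from typing import Callable, Dict, Mapping, Sequence


def rule_ssss(x: float) -> bool:
    """Single data point: univariate comparison with a constant, x > 0."""
    return x > 0


def rule_sssm(x: float, y: float, z: float, tol: float = 1e-9) -> bool:
    """In-record rule: linear restriction x + y = z.

    The tolerance is an assumption of this sketch, added to cope with
    floating-point rounding.
    """
    return abs(x + y - z) <= tol


def rule_ssms(xs: Sequence[float]) -> bool:
    """Multi-element rule: condition on an aggregate, sum of x_u over s > 0."""
    return sum(xs) > 0


def rule_smss(x_current: float, x_previous: float, eps: float) -> bool:
    """Multi-measurement rule: x_tau - x_nu < eps."""
    return x_current - x_previous < eps


def in_range(value: float, low: float, high: float) -> bool:
    """Validity of a value against a specified range."""
    return low <= value <= high


def apply_rules(
    record: Mapping[str, object],
    rules: Mapping[str, Callable[[Mapping[str, object]], bool]],
) -> Dict[str, bool]:
    """Apply every rule to the record; each check yields TRUE or FALSE."""
    return {name: rule(record) for name, rule in rules.items()}


# Hypothetical record and rule set, for illustration only.
record = {"turnover": 120.0, "costs": 80.0, "profit": 40.0, "sector": "C"}
rules = {
    "turnover_positive": lambda r: rule_ssss(r["turnover"]),
    "balance": lambda r: rule_sssm(r["costs"], r["profit"], r["turnover"]),
    "sector_known": lambda r: r["sector"] in {"A", "B", "C"},
}

results = apply_rules(record, rules)
# Decisional procedure: the record is accepted only if no rule fails.
accepted = all(results.values())
```

Here `accepted` is True because all three checks pass; changing `sector` to a code outside the allowed set would make `sector_known` fail and the record would be refused.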
Each check leads to a binary value (TRUE, FALSE).

Data Validation - GSBPM
Raw data -> Validation: sub-phases 4.3, 5.3 (Review-Validate) -> Selection and Amendment: sub-phase 5.4 (Edit-Impute) -> Edited data -> Validation: sub-phase 6.2 (Validate Outputs) -> Final data

Data validation life cycle
[Figure: data validation life cycle]

Introduction
1 What is data validation
2 Why data validation. Relationship between validation and quality
3 How to perform data validation: validation levels and validation rules
4 Validation levels from a business perspective
  4.1 Validation rules
5 Validation levels based on decomposition of metadata
  5.1 A formal typology of data validation functions
  5.2 Validation levels
6 Relation between validation levels from a business and a formal perspective
  6.1 Applications and examples
7 Data validation as a process
  7.1 Data validation in a statistical production process (GSBPM)
  7.2 The informative objects of data validation (GSIM)
8 The data validation process life cycle
  8.1 Design phase
  8.2 Implementation phase
  8.3 Execution phase
  8.4 Review phase
References
Appendix A: List of validation rules

Second part of the document: Metrics
Evaluating the validation procedure
…next presentation…

Thanks for your attention