Evaluation of VTL from different perspectives (Tjalling Gelsema)

ESSnet on Validation:
A study of VTL
Wiesbaden, November 10-11 2015
Contents
–
–
–
–
A study of VTL: rough outline of scope
Correctness of VTL – examples
Completeness of VTL – examples
Some general conclusions
2
Aspects & disciplines
Rough outline of scope
Does the VTL specification define
an unambiguous language without internal contradictions?
correctness
Does the VTL specification proposal
sufficiently (formally) address
syntax, typing and semantic
aspects?
completeness
methodology
WP2
T15
language design
T15
T15
inf. modeling
T15
(re-)usability
feasibility
safety
T15
inf. technology
T15
T15
stat. practice
Is the VTL inf. model
‘immune to’ the
operators of VTL?
Can VTL be sufficiently
(informally) understood
from the specification
proposal?
T15
Is VTL usable from a
statistical practice point
of view?
Is it feasible to build
parsers, compilers, type
checkers, executors etc.
for VTL?
3
Correctness: VTL’s operations
VTL’s operations act on both data sets and data structures.
The ideal situation is:
The reality however is that some output data sets are not
specified by their output data structures (cf. GSIM).
Correctness: VTL’s operations in reality
An example is given by the `get’ operator with `union’
mode outputs the following:
K1
K2
K3
M1
1
A
X
5
2
B
Y
7
yields:
union
with
K1
K2
K3
M1
1
A
X
6
2
B
Y
7
3
C
Z
9
K1
K2
K3
M1
1
A
X
5
2
B
Y
7
2
B
Y
7
1
A
X
6
3
C
Z
9
Semantics of the ‘=’ operator
By allowing data sets that are not properly described by
their data structure (cf. GSIM) VTL gives unreliable results
for the ‘=’ operator. The semantics of ‘dsl=dsr’ is:
K1
K2
M1
K1
K2
M1
K1
K2
M1_COND.
1
A
5
1
A
5
1
A
true
1
A
7
1
A
7
1
A
false
2
B
7
2
B
7
1
A
false
1
A
true
2
B
true
=
yields:
Semantics of the ‘=’ operator
but
K1
K2
M1_COND.
1
A
true
K1
K2
M1
K1
K2
M1
1
A
true
1
A
5
1
A
5
1
A
true
1
A
5
1
A
5
1
A
true
2
B
7
2
B
7
2
B
true
K1
K2
M1
K1
K2
M1
K1
K2
M1_COND.
1
A
5
1
A
5
1
A
true
1
A
5
2
B
7
1
A
true
2
B
7
2
B
true
=
yields:
and
=
yields:
Completeness: outline of study
• Input: Handbook on Validation & survey;
• Handbook: validation rule is a Boolean-valued function
acting on data sets. (This suggests a logic for describing
such a function as a validation rule);
• Survey: many nontrivial examples of validation rules –
mostly expressed in a(n informal) style;
• Study: can a selected number (covering a wide range of
validation rules) of them be expressed as VTLstatements?
• Also: what is the effect of translation?
Example: quantification
From survey: “If person X is daughter in law of the
household head, then there should exist another person Y
who is the husband of the person X and son of the
household head.” Consider the data set
person-id
hh-id
rel-to-head
spouse-id
1
1
1
4
2
1
4
3
3
1
3
2
4
1
2
1
1
2
1
2
...
...
...
...
Example: quantification
The validation rule on the previous slide can be expressed
compactly using predicate logic (from relational calculus):
forall x: IF x.relation_to_head = 4
THEN exists y:
x.household-id = y.household-id AND
x.spouse-id = y.person-id AND
y.relation_to_head = 3
Example: quantification, VTL translation
1)DSfilter := DS[filter relation_to_head = 4]
2)DSmerge := merge(DS "DSx", DS "DSy",
on
(DSy#household_id = DSx#household_id
and DSy#spouse_id = DSx#person_id
and DSy#relation_to_head = 3
and DSx#relation_to_head = 4),
return
(DSx#household_id as "household_id",
DSx#person_id as "person_id"))
3)DSnot exists := DSfilter not_exists_in DSmerge
4)DScount := DSnot exists[calc 1 as "id" role "identifier"][keep (id, person_id)]
[aggregate count (person_id)] = 0
A Study of VTL: some conclusions
• VTL does not comply with GSIM;
• This leads to undesirable results: poor semantics of
the equality operator;
• Wide range of rules is expressible using VTL;
• The translation to VTL of a validation rule is almost
never trivial; essence of a rule is lost in translation;
• VTL is essentially a SQL-like language for the
specification of the steps needed to do validation,
instead of the specification of validation rules;
• This raises the question of the business case for
VTL.