Data Repairing
Giorgos Flouris, FORTH
December 11-12, 2012, Luxembourg

PART I:
Problem Statement and Proposed Solution
(D2.2)
Validity as a Quality Indicator

• Validity is an important quality indicator
  ◦ Encodes context- or application-specific requirements
  ◦ Applications may be useless over invalid data
  ◦ Binary concept (valid/invalid)
• Two steps to guarantee validity (the repair process):
  1. Identify invalid ontologies (diagnosis)
     – Detecting invalidities in an automated manner
     – Subtask of Quality Assessment
  2. Remove invalidities (repair)
     – Repairing invalidities in an automated manner
     – Subtask of Quality Enhancement
Diagnosis

• Expressing validity using validity rules over an adequate relational schema
• Examples:
  ◦ Properties must have a unique domain
    – ∀p Prop(p) → ∃a Dom(p,a)
    – ∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a=b)
  ◦ Correct classification in property instances
    – ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
    – ∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
• Diagnosis reduced to relational queries
Example

• Ontology O0
  ◦ Schema: Class(Sensor), Class(SpatialThing), Class(Observation), Prop(geo:location), Dom(geo:location,Sensor), Rng(geo:location,SpatialThing)
  ◦ Data: Inst(Item1), Inst(ST1), P_Inst(Item1,ST1,geo:location), C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
• Violated rule: correct classification in property instances
  ◦ ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
• The violation:
  ◦ Item1 geo:location ST1: P_Inst(Item1,ST1,geo:location) ∈ O0
  ◦ Sensor is the domain of geo:location: Dom(geo:location,Sensor) ∈ O0
  ◦ Item1 is not a Sensor: C_Inst(Item1,Sensor) ∉ O0
• Possible resolutions:
  ◦ Remove P_Inst(Item1,ST1,geo:location)
  ◦ Remove Dom(geo:location,Sensor)
  ◦ Add C_Inst(Item1,Sensor)
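To make the diagnosis step concrete, here is a minimal Python sketch that checks the classification rule as a query over in-memory fact tables mirroring the example above; the data representation is invented for illustration and is not the actual implementation.

```python
# Minimal sketch: diagnosing violations of
#   forall x,y,p,a: P_Inst(x,y,p) AND Dom(p,a) -> C_Inst(x,a)
# over simple in-memory fact tables (illustrative only).

P_Inst = {("Item1", "ST1", "geo:location")}
Dom    = {("geo:location", "Sensor")}
C_Inst = {("Item1", "Observation"), ("ST1", "SpatialThing")}

def diagnose_domain_violations(p_inst, dom, c_inst):
    """Return the (x, y, p, a) tuples that violate the classification rule."""
    violations = []
    for (x, y, p) in p_inst:
        for (q, a) in dom:
            if p == q and (x, a) not in c_inst:
                violations.append((x, y, p, a))
    return violations

print(diagnose_domain_violations(P_Inst, Dom, C_Inst))
# -> [('Item1', 'ST1', 'geo:location', 'Sensor')]
```

Each reported violation then admits exactly the three resolutions listed above: remove the P_Inst fact, remove the Dom fact, or add the missing C_Inst fact.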
Preferences for Repair

• Which repairing option is best?
  ◦ The ontology engineer determines that via preferences
• Specified by the ontology engineer beforehand
  – High-level “specifications” for the ideal repair
  – Serve as “instructions” to determine the preferred (optimal) solution
Preferences (On Ontologies)

[Figure: the candidate repairs of O0 are scored as whole ontologies: O1 (score 3), O2 (score 4), O3 (score 6).]
Preferences (On Deltas)

[Figure: the candidate repairs O1, O2, O3 of O0 are scored by the delta that produces them: -P_Inst(Item1,ST1,geo:location), -Dom(geo:location,Sensor), +C_Inst(Item1,Sensor), with scores 2, 4, and 5 assigned to the deltas.]
Preferences

• Preferences on ontologies are result-oriented
  ◦ Consider the quality of the repair result
  ◦ Ignore the impact of the repair
  ◦ Popular options: prefer newest/trustable information, prefer a specific ontological structure
• Preferences on deltas are impact-oriented
  ◦ Consider the impact of the repair
  ◦ Ignore the quality of the repair result
  ◦ Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size (sketched below)
• Properties of preferences
  ◦ Preferences on ontologies/deltas are equivalent
  ◦ Quality metrics can be used for stating preferences
  ◦ Metadata on the data can be used (e.g., provenance)
  ◦ Can be qualitative or quantitative
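As a concrete illustration of a quantitative, delta-oriented preference, the sketch below scores a delta by a weighted size, penalising schema changes more than data changes; the weights and the delta encoding are assumptions made for this example.

```python
# Illustrative delta-oriented preference: smaller deltas are better, and
# schema-level changes cost more than data-level ones.  The predicate
# split and the weights below are made up for the example.

SCHEMA_PREDICATES = {"Class", "Prop", "Dom", "Rng", "C_Sub"}

def delta_cost(delta):
    """delta: list of (op, predicate, args) triples, op in {'add', 'remove'}."""
    return sum(5 if predicate in SCHEMA_PREDICATES else 1
               for op, predicate, args in delta)

def preferred(deltas):
    """Return the delta(s) with minimal cost."""
    best = min(delta_cost(d) for d in deltas)
    return [d for d in deltas if delta_cost(d) == best]

candidate_deltas = [
    [("remove", "P_Inst", ("Item1", "ST1", "geo:location"))],
    [("remove", "Dom", ("geo:location", "Sensor"))],
    [("add", "C_Inst", ("Item1", "Sensor"))],
]
print(preferred(candidate_deltas))  # the two data-level deltas are preferred
```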
Generalizing the Approach

• For one violated constraint:
  1. Diagnose the invalidity
  2. Determine the minimal ways to resolve it
  3. Determine and return the preferred (optimal) resolution
• For many violated constraints:
  ◦ The problem becomes more complicated
  ◦ More than one resolution step is required
• Issues:
  1. Resolution order
  2. When and how to filter non-optimal solutions?
  3. Constraint (and resolution) interdependencies
Constraint Interdependencies

• A given resolution may:
  ◦ Cause other violations (bad)
  ◦ Resolve other violations (good)
• The optimal resolution is unknown a priori
  ◦ A resolution’s ramifications cannot be predicted
  ◦ An exhaustive, recursive search is required (resolution tree)
• Two ways to create the resolution tree
  ◦ Globally-optimal (GO) / locally-optimal (LO)
  ◦ When and how to filter non-optimal solutions?
Resolution Tree Creation (GO)

• Globally-optimal (GO): find all minimal resolutions for all the violated constraints, then find the optimal ones
  ◦ Find all minimal resolutions for one violation
  ◦ Explore them all
  ◦ Repeat recursively until valid
  ◦ Return the optimal leaves

[Figure: GO resolution tree; the optimal repairs at the leaves are returned.]
Resolution Tree Creation (LO)

• Locally-optimal (LO): find the minimal and optimal resolutions for one violated constraint, then repeat for the next
  ◦ Find all minimal resolutions for one violation
  ◦ Explore the optimal one(s)
  ◦ Repeat recursively until valid
  ◦ Return all remaining leaves

[Figure: LO resolution tree; the optimal repair is returned.]
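Before comparing the two strategies, here is a self-contained toy sketch of both in Python; the fact model, the helpers (violations, minimal_resolutions, apply_resolution) and the preference score are all invented for illustration and do not reflect the deliverable's implementation.

```python
# Toy sketch of GO vs LO resolution-tree construction. An "ontology" is a
# frozenset of facts; any fact starting with "bad" violates a constraint,
# and each violation can be resolved by removing the fact or by adding a
# compensating "fix:" fact. Everything here is invented for illustration.

def violations(ont):
    return sorted(f for f in ont if f.startswith("bad") and ("fix:" + f) not in ont)

def minimal_resolutions(violation, ont):
    return [("remove", violation), ("add", "fix:" + violation)]

def apply_resolution(res, ont):
    op, fact = res
    return ont - {fact} if op == "remove" else ont | {fact}

def score(ont):
    # Preference: keep as many of the original (non-"fix:") facts as possible.
    return sum(1 for f in ont if not f.startswith("fix:"))

def all_leaves(ont):
    vs = violations(ont)
    if not vs:
        return [ont]
    return [leaf for r in minimal_resolutions(vs[0], ont)
                 for leaf in all_leaves(apply_resolution(r, ont))]

def repair_GO(ont):
    """GO: build the full resolution tree, then return only the optimal leaves."""
    leaves = all_leaves(ont)
    best = max(score(leaf) for leaf in leaves)
    return [leaf for leaf in leaves if score(leaf) == best]

def repair_LO(ont):
    """LO: at each step expand only the locally optimal resolution(s)."""
    vs = violations(ont)
    if not vs:
        return [ont]
    options = minimal_resolutions(vs[0], ont)
    best = max(score(apply_resolution(r, ont)) for r in options)
    return [leaf for r in options if score(apply_resolution(r, ont)) == best
                 for leaf in repair_LO(apply_resolution(r, ont))]

O0 = frozenset({"good1", "good2", "bad1", "bad2"})
print(repair_GO(O0))
print(repair_LO(O0))
```

In this toy the greedy LO run happens to reach the same repair as GO; in general, as the comparison below notes, LO may miss the optimal repair.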
Comparison (GO versus LO)

• Characteristics of GO
  ◦ Exhaustive
  ◦ Less efficient: large resolution trees
  ◦ Always returns optimal repairs
  ◦ Insensitive to constraint syntax
  ◦ Does not depend on resolution order
• Characteristics of LO
  ◦ Greedy
  ◦ More efficient: small resolution trees
  ◦ Does not always return optimal repairs
  ◦ Sensitive to constraint syntax
  ◦ Depends on resolution order
PART II:
Complexity Analysis and
Performance Evaluation
(D2.2)
Algorithm and Complexity

• Detailed complexity analysis for GO/LO and various different types of constraints and preferences
• Inherently difficult problem
  ◦ Exponential complexity (in general)
  ◦ Exception: LO is polynomial (in special cases)
• Theoretical complexity is misleading as to the actual performance of the algorithms
Performance in Practice

• Performance in practice
  ◦ Linear with respect to ontology size
  ◦ Linear with respect to tree size, which is determined by:
    – Types of violated constraints (tree width)
    – Number of violations (tree height) – causes the exponential blowup
    – Constraint interdependencies (tree height)
    – Preference (for LO): affects pruning (tree width)
• Further performance improvement
  ◦ Use optimizations
  ◦ Use LO with a restrictive preference
Evaluation Parameters

• Evaluation:
  1. Effect of ontology size (for GO/LO)
  2. Effect of tree size (for GO/LO)
  3. Effect of violations (for GO/LO)
  4. Effect of preference (relevant for LO only)
  5. Quality of LO repairs
• Evaluation results support our claims:
  ◦ Linear with respect to ontology size
  ◦ Linear with respect to tree size
Effect of Ontology Size

[Chart (both axes in log scale): execution time in msec versus number of triples (×1000: 500 to 20000), for Diagnosis and for GO/LO Repair with 16 and 26 violations.]
Effect of Tree Size (1/2)

[Chart: GO execution time in sec versus number of resolution-tree nodes (×10^6, 0 to 150), for ontologies of 1M to 20M triples.]
Effect of Tree Size (2/2)

[Chart: LO execution time in sec versus number of resolution-tree nodes (0 to 10,000), for ontologies of 1M to 20M triples.]
Effect of Violations (1/2)

[Chart: GO execution time in sec versus number of violations (0 to 28), for ontologies of 1M to 20M triples.]
Effect of Violations (2/2)

[Chart: LO execution time in sec versus number of violations (0 to 28), for ontologies of 1M to 20M triples.]
Effect of Preference (LO)
Quality of LO Repairs (1/2)

[Chart: CCD KB; maximum number of preferred repair deltas versus number of violations (0 to 21), broken down into GO∩LO, GO\LO, and LO\GO.]
Quality of LO Repairs (2/2)

[Chart: CCD KB; minimum number of preferred repair deltas versus number of violations (0 to 21), broken down into GO∩LO, GO\LO, and LO\GO.]
PART III:
Application of Repairing
in a Real Setting
(D4.4)
Objectives and Main Idea

• Evaluate the repairing method in a real LOD setting
  ◦ Using resources from WP4
  ◦ Using provenance-related preferences
• Validate the utility of WP4 resources for a data quality benchmark
• Evaluate the usefulness of provenance, recency, etc. as metrics/preferences for quality assessment and repair
Setting

• User seeks information on Brazilian cities
  ◦ Fuses Wikipedia dumps from different languages
• Guarantees maximal coverage, but may lead to conflicts
  ◦ E.g., a city with two different population counts

[Figure: fusing dumps from the ES, FR, EN, GE, and PT Wikipedias.]
Main Tasks

• Assess the quality of the resulting dataset
  ◦ Quality assessment framework
• Repair the resulting dataset
  ◦ Using the aforementioned repairing method
  ◦ Evaluate the use of provenance-related preferences
    – Prefer the most recent information
    – Prefer the most trustworthy information
Contributions

• Contributions:
  ◦ Define 5 different metrics based on provenance
  ◦ Each metric is used as:
    – Quality assessment metric (to assess quality)
    – Repairing preference (to “guide” the repair)
  ◦ Evaluate them in a real setting
Experiments (Setting)

• Setting
  ◦ Fused 5 Wikipedias: EN, PT, SP, GE, FR
  ◦ Distilled information about Brazilian cities
• Properties considered:
  ◦ populationTotal
  ◦ areaTotal
  ◦ foundingDate
• Validity rules: properties must be functional
  ◦ Repaired invalidities (using our metrics)
  ◦ Checked the quality of the result
  ◦ Dimensions: consistency, validity, conciseness, completeness and accuracy
Metrics for Experiments (1/2)
PREFER_PT: select conflicting
information based on its source
(PT>EN>SP>GE>FR)
2. PREFER_RECENT: select conflicting
information based on its recency (most
recent is preferred)
3. PLAUSIBLE_PT: ignore “irrational” data
(population<500, area<300km2,
founding date<1500AD) otherwise use
PREFER_PT
1.
Slide 33 of x
Metrics for Experiments (2/2)

4. WEIGHTED_RECENT: select based on recency, but when the records are almost equally recent, use source reputation (if less than 3 months apart, use PREFER_PT; otherwise use PREFER_RECENT)
5. CONDITIONAL_PT: define source trustworthiness depending on the data values (prefer PT for small cities with population < 500,000; prefer EN for the rest)
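As an illustration of how such a metric can drive the repair, the sketch below resolves a conflicting functional property in the spirit of WEIGHTED_RECENT: recency decides, with the PREFER_PT source ranking as a tiebreaker when the records are less than three months apart. The record structure, dates, and sources are assumptions made for this example.

```python
# Illustrative conflict resolution in the spirit of WEIGHTED_RECENT:
# prefer the most recent record, but if the candidates are less than
# 3 months (~90 days) apart, fall back to the PREFER_PT source ranking.
# The record structure and the sample data are assumptions for this sketch.

from datetime import date

SOURCE_RANK = {"PT": 0, "EN": 1, "SP": 2, "GE": 3, "FR": 4}  # PREFER_PT order

def weighted_recent(records):
    """records: list of dicts with 'value', 'source', 'last_modified'."""
    newest = max(records, key=lambda r: r["last_modified"])
    close = [r for r in records
             if (newest["last_modified"] - r["last_modified"]).days < 90]
    if len(close) > 1:                      # almost equally recent
        return min(close, key=lambda r: SOURCE_RANK[r["source"]])
    return newest                           # PREFER_RECENT applies

conflict = [
    {"value": 69159, "source": "PT", "last_modified": date(2012, 9, 1)},
    {"value": 69616, "source": "EN", "last_modified": date(2012, 10, 15)},
]
print(weighted_recent(conflict)["value"])   # 69159: within 3 months, PT wins
```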
Consistency, Validity

• Consistency
  ◦ Lack of conflicting triples
  ◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
• Validity
  ◦ Lack of rule violations
  ◦ Coincides with consistency for this example
  ◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Conciseness, Completeness

• Conciseness
  ◦ No duplicates in the final result
  ◦ Guaranteed to be perfect (by the fuse process), regardless of preference
• Completeness
  ◦ Coverage of information
  ◦ Improved by fusion
  ◦ Unaffected by our algorithm
  ◦ Input completeness = output completeness, regardless of preference
  ◦ Measured at 77.02%
Accuracy

• Most important metric for this experiment
• Accuracy
  ◦ Closeness to the “actual state of affairs”
  ◦ Affected by the repairing choices
• Compared the repair with the Gold Standard
  ◦ Taken from an official and independent data source (IBGE)
Accuracy Evaluation

[Diagram: dumps from en.dbpedia, pt.dbpedia, fr.dbpedia, etc. are fused and repaired over dbpedia:areaTotal, dbpedia:populationTotal, and dbpedia:foundingDate; the integrated data are compared on the same properties against a Gold Standard from the Instituto Brasileiro de Geografia e Estatística (IBGE) to measure accuracy.]
Accuracy Examples

• City of Aracati
  ◦ Population: 69159 / 69616 (conflicting)
  ◦ Record in the Gold Standard: 69159
  ◦ Good choice: 69159
  ◦ Bad choice: 69616
• City of Oiapoque
  ◦ Population: 20226 / 20426 (conflicting)
  ◦ Record in the Gold Standard: 20509
  ◦ Optimal approximation choice: 20426
  ◦ Sub-optimal approximation choice: 20226
Accuracy Results
Accuracy of Input and Output
Publications

• Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011.
• Giorgos Flouris, Yannis Roussakis, Maria Poveda-Villalon, Pablo N. Mendes, Irini Fundulaki. Using Provenance for Quality Assessment and Repair in Linked Open Data. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn-12), 2012.
• Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF(S) DBs. Under review in the TODS Journal.
BACKUP SLIDES
Repair

• Removing invalidities by changing the ontology in an adequate manner
• General concerns:
  1. Return a valid ontology
     – Strict requirement
  2. Minimize the impact of the repair upon the data
     – Make minor, targeted modifications that repair the ontology without changing it too much
  3. Return a “good” repair
     – Emulate the changes that the ontology engineer would make to repair the ontology
Inference

• Inference is expressed using validity rules
• Example:
  ◦ Transitivity of class subsumption
  ◦ ∀a,b,c C_Sub(a,b) ∧ C_Sub(b,c) → C_Sub(a,c)
• In practice, we use labeling algorithms
  ◦ Avoid explicitly storing the inferred knowledge
  ◦ Improve the efficiency of reasoning
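For illustration, the transitivity rule can be materialised naively as below; the class names are made up, and, as noted above, the labeling algorithms used in practice avoid storing such inferred facts explicitly.

```python
# Naive materialisation of the transitivity rule
#   forall a,b,c: C_Sub(a,b) AND C_Sub(b,c) -> C_Sub(a,c)
# (illustration only; class names are invented for the example).

def transitive_closure(c_sub):
    closure = set(c_sub)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (b2, c) in list(closure):
                if b == b2 and (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure

C_Sub = {("Sensor", "Device"), ("Device", "SpatialThing")}
print(transitive_closure(C_Sub))
# adds the inferred fact ('Sensor', 'SpatialThing')
```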
Quality Assessment

• Quality = “fitness for use”
  ◦ Multi-dimensional, multi-faceted, context-dependent
• Methodology for quality assessment
  ◦ Dimensions
    – Aspects of quality
    – Accuracy, completeness, timeliness, …
  ◦ Indicators
    – Metadata values for measuring dimensions
    – E.g., last modification date (related to timeliness)
  ◦ Scoring Functions
    – Functions to quantify quality indicators
    – E.g., days since the last modification date
  ◦ Metrics
    – Measures of dimensions (result of a scoring function)
    – Can be combined
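A tiny sketch of the indicator, scoring function, and metric chain for timeliness: the indicator is the last modification date and the scoring function counts the days since then; the normalisation window is an assumption made for this example.

```python
# Illustrative scoring function for the timeliness dimension:
# indicator = last modification date, score = days since that date,
# normalised into [0, 1] (1 = modified today, 0 = a year old or older).
# The one-year normalisation window is an assumption for this sketch.

from datetime import date

def days_since(last_modified, today=None):
    today = today or date.today()
    return (today - last_modified).days

def timeliness_metric(last_modified, today=None, window=365):
    return max(0.0, 1.0 - days_since(last_modified, today) / window)

print(timeliness_metric(date(2012, 9, 1), today=date(2012, 12, 11)))
# ~0.72: the record was modified about 100 days earlier
```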