D4.4_EvoDyn

Using Provenance for Quality
Assessment and Repair in
Linked Open Data
Giorgos Flouris, Yannis Roussakis,
Maria Poveda-Villalon,
Pablo N. Mendes, Irini Fundulaki
Publication at EvoDyn-12
Slide 1 of 21
Setting and General Idea
Linked Open Data cloud
Uncontrolled
Vast
Unstructured
Dynamic
Datasets are interrelated, fused, etc.
Quality problems emerge
Assessment (measure quality)
Repair (improve quality)
Slide 2 of 21
Motivating Example
User seeks information on Brazilian cities
Fuses Wikipedia dumps from different
languages
Guarantees maximal coverage, but may
lead to conflicts
E.g., city with two different population counts
ES
FR
EN
GE
PT
Slide 3 of 21
Main Tasks
Assess the quality of the resulting dataset
Framework for associating data with its quality
Repair the resulting dataset
By removing one of the conflicting values (i.e.,
one of the conflicting population counts)
How to determine which value to keep?
Solution: use heuristics
Here, we evaluate the use of provenance-related heuristics
Prefer most recent information
Prefer most trustworthy information
Slide 4 of 21
Contributions
Emphasis on provenance
Assessment metrics (done)
Heuristics for repair (done, but does not
support metadata information)
Contributions:
Extend repair algorithm to support heuristics on
metadata
Define 5 different metrics based on provenance
Used for both assessment and repair
Evaluate them in a real setting
Slide 5 of 21
Quality Assessment
Quality = “fitness for use”
Multi-dimensional, multi-faceted, context-dependent
Methodology for quality assessment
Dimensions
Aspects of quality
Accuracy, completeness, timeliness, …
Indicators
Metadata values for measuring dimensions
Last modification date (related to timeliness)
Scoring Functions
Functions to quantify quality indicators
Days since last modification date
Metrics
Measures of dimensions (result of scoring function)
Can be combined
We use this framework to define our metrics
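A minimal sketch of the indicator / scoring function / metric chain, using the timeliness example from this slide. The function names, the 365-day horizon and the linear normalization are illustrative assumptions, not taken from the paper.

```python
from datetime import date


def days_since_last_modification(last_modified: date, today: date) -> int:
    """Scoring function: quantifies the 'last modification date' indicator."""
    return (today - last_modified).days


def timeliness_metric(last_modified: date, today: date, horizon_days: int = 365) -> float:
    """Hypothetical metric: maps the score to [0, 1]; 1.0 means modified today.

    The linear decay and the default horizon are assumptions for illustration.
    """
    age = days_since_last_modification(last_modified, today)
    return max(0.0, min(1.0, 1.0 - age / horizon_days))
```

Metrics built this way can be combined, e.g. by a weighted sum, when several dimensions matter.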
Slide 6 of 21
Quality Repair (Setting)
Focus on validity (quality dimension)
Encodes context- or application-specific
requirements
Applications may be useless over invalid data
Binary concept (valid/invalid)
Generic
Slide 7 of 21
Quality Repair (Rules)
Rules determine validity
Expressive
Disjunctive Embedded Dependencies (DEDs)
Cause interdependencies
Resolving one violation may cause or resolve other violations
Difficult to foresee ramifications of repairing
choices
User cannot make the selection alone
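As an illustration of a rule in this family, a functionality constraint on a property can be written as an equality-generating dependency, which is a special case of a DED. The predicate name below is borrowed from the experiment's properties; the exact formalization used in the paper may differ.

```latex
\forall c, v_1, v_2 :\;
  \mathit{populationTotal}(c, v_1) \wedge \mathit{populationTotal}(c, v_2)
  \rightarrow v_1 = v_2
```

A city with two different population counts violates this rule, and removing either value resolves the violation.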
Slide 8 of 21
Quality Repair (Preferences)
Selection is done automatically, according
to a set of user-defined specifications
Which repairing option is best?
Ontology engineer determines that via
preferences
Specified by ontology engineer beforehand
High-level “specifications” for the ideal repair
Serve as “instructions” to determine the
preferred solution for repair
Highly expressive
Slide 9 of 21
Quality Repair (Extensions)
Existing work on repair is limited
Provenance cannot be considered for
preferences
Assessment metrics based on provenance
cannot be exploited
Extensions are needed (and provided)
Metadata (including provenance) can be used in
preferences
Preferences can apply on both repairs and
repairing options
Formal details omitted (see paper)
Slide 10 of 21
Experiments (Setting)
Setting taken from the motivating example
Fused 5 Wikipedias: EN, PT, SP, GE, FR
Distilled information about Brazilian cities
Properties considered:
populationTotal
areaTotal
foundingDate
Validity rules: properties must be functional
Repaired invalidities (using our metrics)
Checked quality of result
Dimensions: consistency, validity, conciseness,
completeness and accuracy
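The validity rule above (properties must be functional) can be checked with a sketch like the following. It assumes triples are plain `(subject, property, value)` tuples; the function name `find_violations` is hypothetical and this is not the repair algorithm from the paper, only the detection step.

```python
from collections import defaultdict

# Properties required to be functional in the experiment.
FUNCTIONAL = {"populationTotal", "areaTotal", "foundingDate"}


def find_violations(triples):
    """Return {(subject, property): set of values} for every functional
    property that carries more than one distinct value after fusion."""
    values = defaultdict(set)
    for subject, prop, value in triples:
        if prop in FUNCTIONAL:
            values[(subject, prop)].add(value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}
```

Each returned entry is one invalidity; repair keeps exactly one value per entry.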
Slide 11 of 21
Metrics for Experiments (1/2)
1. PREFER_PT: select conflicting
information based on its source
(PT>EN>SP>GE>FR)
2. PREFER_RECENT: select conflicting
information based on its recency (most
recent is preferred)
3. PLAUSIBLE_PT: ignore “irrational” data
(population < 500, area < 300 km²,
founding date before 1500 AD); otherwise use
PREFER_PT
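The three preferences above can be sketched as selection functions over conflicting candidates. The `Candidate` record, its fields, the representation of founding dates as years, and the fallback when every candidate is implausible are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Source order from the slide: PT > EN > SP > GE > FR (lower rank = preferred).
SOURCE_RANK = {"PT": 0, "EN": 1, "SP": 2, "GE": 3, "FR": 4}

# Plausibility lower bounds from the slide (founding date as a year: assumption).
THRESHOLDS = {"populationTotal": 500, "areaTotal": 300, "foundingDate": 1500}


@dataclass(frozen=True)
class Candidate:
    value: float         # the conflicting value
    source: str          # Wikipedia language edition it came from
    last_modified: date  # provenance timestamp (assumed to be available)


def prefer_pt(candidates):
    """PREFER_PT: keep the value from the highest-ranked source."""
    return min(candidates, key=lambda c: SOURCE_RANK[c.source])


def prefer_recent(candidates):
    """PREFER_RECENT: keep the most recently modified value."""
    return max(candidates, key=lambda c: c.last_modified)


def plausible_pt(prop, candidates):
    """PLAUSIBLE_PT: drop implausible values, then apply PREFER_PT.

    Assumption: if no candidate is plausible, fall back to all candidates.
    """
    plausible = [c for c in candidates if c.value >= THRESHOLDS[prop]]
    return prefer_pt(plausible or candidates)
```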
Slide 12 of 21
Metrics for Experiments (2/2)
4. WEIGHTED_RECENT: select based on
recency, but in cases where the records
are almost equally recent, use source
reputation (if less than 3 months apart,
use PREFER_PT, else use
PREFER_RECENT)
5. CONDITIONAL_PT: define source
trustworthiness depending on data
values (prefer PT for small cities with
population < 500,000, prefer EN for the
rest)
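A sketch of the two combined preferences, under stated assumptions: "3 months" is approximated as 90 days, more than two candidates are handled by comparing each against the newest one, and the city population used by CONDITIONAL_PT is supplied by the caller. None of these details are given on the slide.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Source order from the slides: PT > EN > SP > GE > FR (lower rank = preferred).
SOURCE_RANK = {"PT": 0, "EN": 1, "SP": 2, "GE": 3, "FR": 4}


@dataclass(frozen=True)
class Candidate:
    value: float
    source: str
    last_modified: date  # provenance timestamp (assumed to be available)


def weighted_recent(candidates, window=timedelta(days=90)):
    """WEIGHTED_RECENT: among candidates modified within `window` of the
    newest one, choose by source rank; older candidates are ignored.

    For two candidates this reduces to the slide's rule: PREFER_PT if they
    are less than ~3 months apart, PREFER_RECENT otherwise.
    """
    newest = max(c.last_modified for c in candidates)
    close = [c for c in candidates if newest - c.last_modified < window]
    return min(close, key=lambda c: SOURCE_RANK[c.source])


def conditional_pt(candidates, city_population, limit=500_000):
    """CONDITIONAL_PT: prefer PT for cities below `limit`, EN otherwise.

    Assumption: if the preferred source is absent, fall back to source rank.
    """
    preferred = "PT" if city_population < limit else "EN"
    return min(candidates, key=lambda c: (c.source != preferred, SOURCE_RANK[c.source]))
```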
Slide 13 of 21
Consistency, Validity
Consistency
Lack of conflicting triples
Guaranteed to be perfect (by the repairing
algorithm), regardless of preference
Validity
Lack of rule violations
Coincides with consistency for this example
Guaranteed to be perfect (by the repairing
algorithm), regardless of preference
Slide 14 of 21
Conciseness, Completeness
Conciseness
No duplicates in the final result
Guaranteed to be perfect (by the fusion process),
regardless of preference
Completeness
Coverage of information
Improved by fusion
Unaffected by our algorithm
Input completeness = output completeness,
regardless of preference
Measured to be at 77.02%
Slide 15 of 21
Accuracy
Most important metric for this experiment
Accuracy
Closeness to the “actual state of affairs”
Affected by the repairing choices
Compared repair with the Gold Standard
Taken from an official and independent data
source (IBGE)
Slide 16 of 21
Accuracy Evaluation
en.dbpedia
pt.dbpedia
…
Instituto Brasileiro de
Geografia e Estatística
(IBGE)
fr.dbpedia
Fuse/Repair
dbpedia:areaTotal
dbpedia:populationTotal
dbpedia:foundingDate
dbpedia:areaTotal
dbpedia:populationTotal
dbpedia:foundingDate
Gold
Standard
integrated
data
Compare
Accuracy
Slide 17 of 21
Accuracy Examples
City of Aracati
Population: 69159/69616 (conflicting)
Record in Gold Standard: 69159
Good choice: 69159
Bad choice: 69616
City of Oiapoque
Population: 20226/20426 (conflicting)
Record in Gold Standard: 20509
Optimal approximation choice: 20426
Sub-optimal approximation choice: 20226
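The two examples above can be reproduced with a small sketch that picks the candidate closest to the Gold Standard and checks whether a repair made that choice. The helper names are hypothetical, and absolute difference is an assumed distance; the slides do not state how closeness is computed.

```python
def closest_to_gold(candidates, gold):
    """Return the candidate value nearest to the Gold Standard value."""
    return min(candidates, key=lambda v: abs(v - gold))


def is_optimal_choice(chosen, candidates, gold):
    """True if no other candidate is strictly closer to the Gold Standard."""
    return abs(chosen - gold) == min(abs(v - gold) for v in candidates)


# Aracati: one candidate matches the Gold Standard exactly.
aracati_best = closest_to_gold([69159, 69616], gold=69159)

# Oiapoque: neither candidate matches; 20426 is off by 83, 20226 by 283.
oiapoque_best = closest_to_gold([20226, 20426], gold=20509)
```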
Slide 18 of 21
Accuracy Results
Slide 19 of 21
Accuracy of Input and Output
Slide 20 of 21
Conclusion
Quality assessment and repair of LOD
Evaluated a set of sophisticated,
provenance-inspired metrics for:
Assessing quality
Repairing conflicts
Used in a specific experimental setting
Results are necessarily application-specific
THANK YOU!
Slide 21 of 21