Using Provenance for Quality Assessment and Repair in Linked Open Data
Giorgos Flouris, Yannis Roussakis, Maria Poveda-Villalon, Pablo N. Mendes, Irini Fundulaki
Publication at EvoDyn-12
Slide 1 of 21

Setting and General Idea
- Linked Open Data cloud
  - Uncontrolled
  - Vast
  - Unstructured
  - Dynamic
- Datasets interrelated, fused, etc.
- Quality problems emerge
  - Assessment (measure quality)
  - Repair (improve quality)
Slide 2 of 21

Motivating Example
- User seeks information on Brazilian cities
- Fuses Wikipedia dumps from different languages
- Guarantees maximal coverage, but may lead to conflicts
  - E.g., city with two different population counts
[Figure: Wikipedia language editions ES, FR, EN, GE, PT]
Slide 3 of 21

Main Tasks
- Assess the quality of the resulting dataset
  - Framework for associating data with its quality
- Repair the resulting dataset
  - By removing one of the conflicting values (i.e., one of the conflicting population counts)
  - How to determine which value to keep?
  - Solution: use heuristics
- Here, we evaluate the use of provenance-related heuristics
  - Prefer most recent information
  - Prefer most trustworthy information
Slide 4 of 21

Contributions
- Emphasis on provenance
  - Assessment metrics (done)
  - Heuristics for repair (done, but does not support metadata information)
- Contributions:
  - Extend repair algorithm to support heuristics on metadata
  - Define 5 different metrics based on provenance
    - Used for both assessment and repair
  - Evaluate them in a real setting
Slide 5 of 21

Quality Assessment
- Quality = “fitness for use”
  - Multi-dimensional, multi-faceted, context-dependent
- Methodology for quality assessment
  - Dimensions
    - Aspects of quality
    - Accuracy, completeness, timeliness, …
  - Indicators
    - Metadata values for measuring dimensions
    - Last modification date (related to timeliness)
  - Scoring Functions
    - Functions to quantify quality indicators
    - Days since last modification date
  - Metrics
    - Measures of dimensions (result of scoring function)
    - Can be combined
- We use this framework to define our metrics
Slide 6 of 21

Quality Repair (Setting)
- Focus on validity (quality dimension)
  - Encodes context- or application-specific requirements
  - Applications may be useless over invalid data
  - Binary concept (valid/invalid)
  - Generic
Slide 7 of 21

Quality Repair (Rules)
- Rules determine validity
  - Expressive
  - Disjunctive Embedded Dependencies (DEDs)
- Cause interdependencies
  - Resolution causes/resolves other violations
  - Difficult to foresee ramifications of repairing choices
  - User cannot make the selection alone
Slide 8 of 21

Quality Repair (Preferences)
- Selection is done automatically, according to a set of user-defined specifications
- Which repairing option is best?
  - Ontology engineer determines that via preferences
- Specified by ontology engineer beforehand
- High-level “specifications” for the ideal repair
- Serve as “instructions” to determine the preferred solution for repair
- Highly expressive
Slide 9 of 21

Quality Repair (Extensions)
- Existing work on repair is limited
  - Provenance cannot be considered for preferences
  - Assessment metrics based on provenance cannot be exploited
- Extensions are needed (and provided)
  - Metadata (including provenance) can be used in preferences
  - Preferences can apply on both repairs and repairing options
  - Formal details omitted (see paper)
Slide 10 of 21

Experiments (Setting)
- Setting taken from the motivating example
  - Fused 5 Wikipedias: EN, PT, SP, GE, FR
  - Distilled information about Brazilian cities
- Properties considered: populationTotal, areaTotal, foundingDate
- Validity rules: properties must be functional
- Repaired invalidities (using our metrics)
- Checked quality of result
  - Dimensions: consistency, validity, conciseness, completeness and accuracy
Slide 11 of 21

Metrics for Experiments (1/2)
1. PREFER_PT: select conflicting information based on its source (PT>EN>SP>GE>FR)
2. PREFER_RECENT: select conflicting information based on its recency (most recent is preferred)
3. PLAUSIBLE_PT: ignore “irrational” data (population < 500, area < 300 km², founding date < 1500 AD), otherwise use PREFER_PT
Slide 12 of 21

Metrics for Experiments (2/2)
4. WEIGHTED_RECENT: select based on recency, but in cases where the records are almost equally recent, use source reputation (if less than 3 months apart, use PREFER_PT, else use PREFER_RECENT)
5. CONDITIONAL_PT: define source trustworthiness depending on data values (prefer PT for small cities with population < 500,000, prefer EN for the rest)
Slide 13 of 21

Consistency, Validity
- Consistency
  - Lack of conflicting triples
  - Guaranteed to be perfect (by the repairing algorithm), regardless of preference
- Validity
  - Lack of rule violations
  - Coincides with consistency for this example
  - Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Slide 14 of 21

Conciseness, Completeness
- Conciseness
  - No duplicates in the final result
  - Guaranteed to be perfect (by the fuse process), regardless of preference
- Completeness
  - Coverage of information
  - Improved by fusion
  - Unaffected by our algorithm
  - Input completeness = output completeness, regardless of preference
  - Measured to be at 77.02%
Slide 15 of 21

Accuracy
- Most important metric for this experiment
- Accuracy
  - Closeness to the “actual state of affairs”
  - Affected by the repairing choices
- Compared repair with the Gold Standard
  - Taken from an official and independent data source (IBGE)
Slide 16 of 21

Accuracy Evaluation
[Figure: en.dbpedia, pt.dbpedia, …, fr.dbpedia go through Fuse/Repair to give the integrated data; the Instituto Brasileiro de Geografia e Estatística (IBGE) provides the Gold Standard; both cover dbpedia:areaTotal, dbpedia:populationTotal and dbpedia:foundingDate, and are compared to obtain the accuracy]
Slide 17 of 21

Accuracy Examples
- City of Aracati
  - Population: 69159/69616 (conflicting)
  - Record in Gold Standard: 69159
  - Good choice: 69159
  - Bad choice: 69616
- City of Oiapoque
  - Population: 20226/20426 (conflicting)
  - Record in Gold Standard: 20509
  - Optimal approximation choice: 20426
  - Sub-optimal approximation choice: 20226
Slide 18 of 21

Accuracy Results
Slide 19 of 21

Accuracy of Input and Output
Slide 20 of 21

Conclusion
- Quality assessment and repair of LOD
- Evaluated a set of sophisticated, provenance-inspired metrics for:
  - Assessing quality
  - Repairing conflicts
- Used in a specific experimental setting
  - Results are necessarily application-specific
THANK YOU!
Slide 21 of 21
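
Illustrative sketch (not from the slides or the paper)
The five preference metrics listed under "Metrics for Experiments" can be read as small selection functions over a set of conflicting values. The Python below is only a rough sketch of that reading, not the authors' repair algorithm, which works over DED rules and formal preferences. The Candidate record, its field names, the 90-day approximation of "3 months", and the fallback behaviours are all assumptions introduced here. The Aracati population counts are the ones quoted in the slides, but the sources and dates attached to them are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Source ranking stated for PREFER_PT, most preferred first.
SOURCE_RANK = ["PT", "EN", "SP", "GE", "FR"]


@dataclass(frozen=True)
class Candidate:
    """One of several conflicting values, with hypothetical provenance fields."""
    value: float
    source: str      # Wikipedia edition the value was taken from
    modified: date   # last-modification date of the record


def prefer_pt(cands):
    """PREFER_PT: keep the value from the highest-ranked source."""
    return min(cands, key=lambda c: SOURCE_RANK.index(c.source))


def prefer_recent(cands):
    """PREFER_RECENT: keep the most recently modified value."""
    return max(cands, key=lambda c: c.modified)


def plausible_pt(cands, minimum):
    """PLAUSIBLE_PT: drop values below `minimum`, then apply PREFER_PT.

    Assumption: if every value is implausible, fall back to all candidates.
    """
    plausible = [c for c in cands if c.value >= minimum]
    return prefer_pt(plausible or cands)


def weighted_recent(cands, window=timedelta(days=90)):
    """WEIGHTED_RECENT: apply PREFER_PT among records within `window` of the newest.

    Assumption: "3 months" is approximated as 90 days.
    """
    newest = prefer_recent(cands).modified
    close = [c for c in cands if newest - c.modified < window]
    return prefer_pt(close)


def conditional_pt(cands, population, threshold=500_000):
    """CONDITIONAL_PT: trust PT for small cities, EN for the rest.

    Assumption: fall back to PREFER_PT when the trusted source is absent.
    """
    trusted = "PT" if population < threshold else "EN"
    matches = [c for c in cands if c.source == trusted]
    return matches[0] if matches else prefer_pt(cands)


# Aracati's conflicting population counts from the slides; the sources and
# dates attached to them here are invented purely for illustration.
aracati = [
    Candidate(69159, "PT", date(2011, 1, 10)),
    Candidate(69616, "EN", date(2011, 8, 1)),
]
```

With these made-up provenance values, prefer_pt(aracati) keeps 69159 while prefer_recent(aracati) keeps 69616, which shows how the choice of metric alone can decide whether the repair matches the Gold Standard. For two candidates, weighted_recent reduces to the rule on the slide: PREFER_PT if the records are less than the window apart, PREFER_RECENT otherwise.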