Data Repairing
Giorgos Flouris, FORTH
December 11-12, 2012, Luxembourg

PART I: Problem Statement and Proposed Solution (D2.2)

Validity as a Quality Indicator
Validity is an important quality indicator
◦ Encodes context- or application-specific requirements
◦ Applications may be useless over invalid data
◦ Binary concept (valid/invalid)
Two steps to guarantee validity (repair process):
1. Identify invalid ontologies (diagnosis)
   – Detect invalidities in an automated manner
   – Subtask of Quality Assessment
2. Remove invalidities (repair)
   – Repair invalidities in an automated manner
   – Subtask of Quality Enhancement

Diagnosis
Validity is expressed using validity rules over an adequate relational schema
Examples:
◦ Properties must have a unique domain
  ∀p: Prop(p) → ∃a Dom(p,a)
  ∀p,a,b: Dom(p,a) ∧ Dom(p,b) → (a=b)
◦ Correct classification in property instances
  ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
  ∀x,y,p,a: P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
Diagnosis is reduced to relational queries

Example
Ontology O0:
◦ Schema: Class(Sensor), Class(SpatialThing), Class(Observation), Prop(geo:location), Dom(geo:location,Sensor), Rng(geo:location,SpatialThing)
◦ Data: Inst(Item1), Inst(ST1), P_Inst(Item1,ST1,geo:location), C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
[Figure: schema and data graph of O0, with geo:location connecting Item1 to ST1]
Violated rule: correct classification in property instances, ∀x,y,p,a: P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
◦ P_Inst(Item1,ST1,geo:location) ∈ O0
◦ Sensor is the domain of geo:location: Dom(geo:location,Sensor) ∈ O0
◦ Item1 is not a Sensor: C_Inst(Item1,Sensor) ∉ O0
Possible resolutions (see the sketch below):
◦ Remove P_Inst(Item1,ST1,geo:location)
◦ Remove Dom(geo:location,Sensor)
◦ Add C_Inst(Item1,Sensor)

Preferences for Repair
Which repairing option is best?
◦ The ontology engineer determines that via preferences
Preferences:
◦ Specified by the ontology engineer beforehand
◦ High-level “specifications” for the ideal repair
◦ Serve as “instructions” to determine the preferred (optimal) solution

Preferences (On Ontologies)
[Figure: the candidate repair results O1, O2 and O3 of O0, scored 3, 4 and 6 respectively]

Preferences (On Deltas)
[Figure: the candidate deltas against O0 (-P_Inst(Item1,ST1,geo:location), -Dom(geo:location,Sensor), +C_Inst(Item1,Sensor)) leading to O1, O2 and O3, with scores 2, 4 and 5]

Preferences
Preferences on ontologies are result-oriented
◦ Consider the quality of the repair result
◦ Ignore the impact of repair
◦ Popular options: prefer newest/trustable information, prefer a specific ontological structure
Preferences on deltas are impact-oriented
◦ Consider the impact of repair
◦ Ignore the quality of the repair result
◦ Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size
Properties of preferences
◦ Preferences on ontologies/deltas are equivalent
◦ Quality metrics can be used for stating preferences
◦ Metadata on the data can be used (e.g., provenance)
◦ Can be qualitative or quantitative

Generalizing the Approach
For one violated constraint:
1. Diagnose the invalidity
2. Determine the minimal ways to resolve it
3. Determine and return the preferred (optimal) resolution
For many violated constraints:
◦ The problem becomes more complicated
◦ More than one resolution step is required
Issues:
1. Resolution order
2. When and how to filter non-optimal solutions?
3. Constraint (and resolution) interdependencies
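To make the diagnosis-and-resolution step concrete on the example ontology O0 above, here is a minimal Python sketch; the tuple-based encoding of the facts and the helper names are illustrative assumptions, not the relational implementation described in D2.2.

# Facts of the example ontology O0, encoded as plain tuples (illustrative encoding).
O0 = {
    ("Prop", "geo:location"),
    ("Dom", "geo:location", "Sensor"),
    ("Rng", "geo:location", "SpatialThing"),
    ("Inst", "Item1"), ("Inst", "ST1"),
    ("C_Inst", "Item1", "Observation"),
    ("C_Inst", "ST1", "SpatialThing"),
    ("P_Inst", "Item1", "ST1", "geo:location"),
}

def diagnose_domain_rule(onto):
    """Violations of P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a), found by a join-like scan."""
    p_inst = [f for f in onto if f[0] == "P_Inst"]
    doms = [f for f in onto if f[0] == "Dom"]
    return [(x, y, p, a)
            for (_, x, y, p) in p_inst
            for (_, q, a) in doms if q == p
            if ("C_Inst", x, a) not in onto]

def minimal_resolutions(violation):
    """The three minimal ways to resolve one violation: drop either premise, or add the conclusion."""
    x, y, p, a = violation
    return [("remove", ("P_Inst", x, y, p)),
            ("remove", ("Dom", p, a)),
            ("add", ("C_Inst", x, a))]

for v in diagnose_domain_rule(O0):
    print(v, "->", minimal_resolutions(v))
    # ('Item1', 'ST1', 'geo:location', 'Sensor') -> remove the property instance,
    # remove the domain axiom, or add C_Inst(Item1,Sensor)

Applying a preference then amounts to scoring these resolutions (or the ontologies they produce) and keeping the optimal one, as in the delta and ontology scores of the Preferences slides.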
Constraint Interdependencies
A given resolution may:
◦ Cause other violations (bad)
◦ Resolve other violations (good)
The optimal resolution is unknown ‘a priori’:
◦ Cannot predict a resolution’s ramifications
◦ An exhaustive, recursive search is required (resolution tree)
Two ways to create the resolution tree:
◦ Globally-optimal (GO) / locally-optimal (LO)
◦ They differ in when and how non-optimal solutions are filtered

Resolution Tree Creation (GO)
Globally-optimal (GO): find all minimal resolutions for all the violated constraints, then find the optimal ones
◦ Find all minimal resolutions for one violation
◦ Explore them all
◦ Repeat recursively until valid
◦ Return the optimal leaves
[Figure: GO resolution tree; the optimal repairs (returned) are among the leaves]

Resolution Tree Creation (LO)
Locally-optimal (LO): find the minimal and optimal resolutions for one violated constraint, then repeat for the next
◦ Find all minimal resolutions for one violation
◦ Explore the optimal one(s)
◦ Repeat recursively until valid
◦ Return all remaining leaves
[Figure: LO resolution tree; the optimal repair (returned) is a remaining leaf]

Comparison (GO versus LO)
Characteristics of GO
◦ Exhaustive
◦ Less efficient: large resolution trees
◦ Always returns optimal repairs
◦ Insensitive to constraint syntax
◦ Does not depend on resolution order
Characteristics of LO
◦ Greedy
◦ More efficient: small resolution trees
◦ Does not always return optimal repairs
◦ Sensitive to constraint syntax
◦ Depends on resolution order
(a code sketch contrasting the two strategies appears below)

PART II: Complexity Analysis and Performance Evaluation (D2.2)

Algorithm and Complexity
Detailed complexity analysis for GO/LO and various types of constraints and preferences
Inherently difficult problem
◦ Exponential complexity (in general)
◦ Exception: LO is polynomial (in special cases)
Theoretical complexity is misleading as to the actual performance of the algorithms

Performance in Practice
Performance in practice
◦ Linear with respect to ontology size
◦ Linear with respect to tree size
Tree size is determined by:
◦ Types of violated constraints (tree width)
◦ Number of violations (tree height) – causes the exponential blowup
◦ Constraint interdependencies (tree height)
◦ Preference (for LO): affects pruning (tree width)
Further performance improvement
◦ Use optimizations
◦ Use LO with a restrictive preference
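A minimal sketch of the two tree-creation strategies, assuming hypothetical helpers violations(), minimal_resolutions(), apply() and a quantitative preference score() (lower is better); this illustrates the GO/LO idea, not the optimized algorithms evaluated in D2.2.

def repair_GO(onto):
    """Globally-optimal: expand every minimal resolution, filter only at the leaves."""
    viols = violations(onto)
    if not viols:
        return [onto]                                   # valid leaf of the resolution tree
    leaves = []
    for res in minimal_resolutions(onto, viols[0]):     # branch on one violation
        leaves += repair_GO(apply(onto, res))           # explore all children recursively
    best = min(score(o) for o in leaves)
    return [o for o in leaves if score(o) == best]      # return the optimal leaves

def repair_LO(onto):
    """Locally-optimal: keep only the preferred resolution(s) at each step (greedy)."""
    viols = violations(onto)
    if not viols:
        return [onto]
    children = [apply(onto, r) for r in minimal_resolutions(onto, viols[0])]
    best = min(score(o) for o in children)              # prune non-optimal branches immediately
    leaves = []
    for child in (o for o in children if score(o) == best):
        leaves += repair_LO(child)
    return leaves                                       # all remaining leaves

GO pays with a larger tree but is guaranteed to return optimal repairs; LO prunes early, so its result depends on the resolution order and the constraint syntax, as the comparison slide notes.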
Evaluation Parameters
Evaluation:
1. Effect of ontology size (for GO/LO)
2. Effect of tree size (for GO/LO)
3. Effect of violations (for GO/LO)
4. Effect of preference (relevant for LO only)
5. Quality of LO repairs
Evaluation results support our claims:
◦ Linear with respect to ontology size
◦ Linear with respect to tree size

Effect of Ontology Size
[Chart: execution time (msec, logscale) versus triples (x1000, logscale, 500 to 20000) for Diagnosis, GO Repair (16 and 26 violations) and LO Repair (16 and 26 violations)]

Effect of Tree Size (1/2)
[Chart: GO execution time (sec) versus number of tree nodes (x10^6, up to 150), for ontologies of 1M to 20M triples]

Effect of Tree Size (2/2)
[Chart: LO execution time (sec) versus number of tree nodes (up to 10000), for ontologies of 1M to 20M triples]

Effect of Violations (1/2)
[Chart: GO execution time (sec) versus number of violations (0 to 28), for ontologies of 1M to 20M triples]

Effect of Violations (2/2)
[Chart: LO execution time (sec) versus number of violations (0 to 28), for ontologies of 1M to 20M triples]

Effect of Preference (LO)

Quality of LO Repairs (1/2)
[Chart: CCD KB, number of preferred repairing deltas versus number of violations (0 to 21) for GO∩LO, GO\LO and LO\GO, under the Max( ) preference]

Quality of LO Repairs (2/2)
[Chart: CCD KB, number of preferred repairing deltas versus number of violations (0 to 21) for GO∩LO, GO\LO and LO\GO, under the Min( ) preference]

PART III: Application of Repairing in a Real Setting (D4.4)

Objectives and Main Idea
Evaluate the repairing method in a real LOD setting
◦ Using resources from WP4
◦ Using provenance-related preferences
Validate the utility of WP4 resources for a data quality benchmark
Evaluate the usefulness of provenance, recency etc. as metrics/preferences for quality assessment and repair

Setting
User seeks information on Brazilian cities
◦ Fuses Wikipedia dumps from different languages (ES, FR, EN, GE, PT)
◦ Guarantees maximal coverage, but may lead to conflicts
◦ E.g., a city with two different population counts

Main Tasks
Assess the quality of the resulting dataset
◦ Quality assessment framework
Repair the resulting dataset
◦ Using the aforementioned repairing method
◦ Evaluate the use of provenance-related preferences
  – Prefer the most recent information
  – Prefer the most trustworthy information

Contributions
◦ Define 5 different metrics based on provenance
◦ Each metric is used as:
  – Quality assessment metric (to assess quality)
  – Repairing preference (to “guide” the repair)
◦ Evaluate them in a real setting

Experiments (Setting)
Setting
◦ Fused 5 Wikipedias: EN, PT, SP, GE, FR
◦ Distilled information about Brazilian cities
Properties considered:
◦ populationTotal
◦ areaTotal
◦ foundingDate
Validity rules: properties must be functional
◦ Repaired invalidities (using our metrics)
◦ Checked the quality of the result
◦ Dimensions: consistency, validity, conciseness, completeness and accuracy
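The functional-property validity rule of this setting can be diagnosed with a simple grouping over the fused triples. A minimal sketch, assuming the fused data is available as (city, property, value, source) records; the record list is illustrative, reusing the conflicting populations from the accuracy examples later on, and the source assignments are placeholders.

from collections import defaultdict

# Illustrative fused records: (city, property, value, source Wikipedia)
fused = [
    ("Aracati", "dbpedia:populationTotal", 69159, "pt"),
    ("Aracati", "dbpedia:populationTotal", 69616, "en"),
    ("Oiapoque", "dbpedia:populationTotal", 20226, "pt"),
    ("Oiapoque", "dbpedia:populationTotal", 20426, "en"),
]

def functional_violations(records):
    """A (city, property) pair with more than one distinct value violates the
    'properties must be functional' validity rule."""
    values = defaultdict(set)
    for city, prop, value, _source in records:
        values[(city, prop)].add(value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}

print(functional_violations(fused))
# e.g. {('Aracati', 'dbpedia:populationTotal'): {69159, 69616}, ('Oiapoque', ...): {...}}

Each such violation is then resolved by keeping exactly one of the conflicting values, which is where the provenance-based preferences defined next come in.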
Metrics for Experiments (1/2)
1. PREFER_PT: select conflicting information based on its source (PT > EN > SP > GE > FR)
2. PREFER_RECENT: select conflicting information based on its recency (the most recent is preferred)
3. PLAUSIBLE_PT: ignore “irrational” data (population < 500, area < 300 km2, founding date < 1500 AD), otherwise use PREFER_PT

Metrics for Experiments (2/2)
4. WEIGHTED_RECENT: select based on recency, but in cases where the records are almost equally recent, use source reputation (if less than 3 months apart, use PREFER_PT, else use PREFER_RECENT)
5. CONDITIONAL_PT: define source trustworthiness depending on data values (prefer PT for small cities with population < 500,000, prefer EN for the rest)
(see the sketch after the Publications slide)

Consistency, Validity
Consistency
◦ Lack of conflicting triples
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Validity
◦ Lack of rule violations
◦ Coincides with consistency for this example
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference

Conciseness, Completeness
Conciseness
◦ No duplicates in the final result
◦ Guaranteed to be perfect (by the fuse process), regardless of preference
Completeness
◦ Coverage of information
◦ Improved by fusion
◦ Unaffected by our algorithm
◦ Input completeness = output completeness, regardless of preference
◦ Measured to be 77.02%

Accuracy
Most important metric for this experiment
Accuracy
◦ Closeness to the “actual state of affairs”
◦ Affected by the repairing choices
Compared the repair with the Gold Standard
◦ Taken from an official and independent data source (IBGE)

Accuracy Evaluation
[Figure: en.dbpedia, pt.dbpedia, fr.dbpedia, … are fused/repaired into integrated data (dbpedia:areaTotal, dbpedia:populationTotal, dbpedia:foundingDate), which is compared for accuracy against a Gold Standard derived from the Instituto Brasileiro de Geografia e Estatística (IBGE)]

Accuracy Examples
City of Aracati
◦ Population: 69159/69616 (conflicting)
◦ Record in Gold Standard: 69159
◦ Good choice: 69159
◦ Bad choice: 69616
City of Oiapoque
◦ Population: 20226/20426 (conflicting)
◦ Record in Gold Standard: 20509
◦ Optimal approximation choice: 20426
◦ Sub-optimal approximation choice: 20226

Accuracy Results

Accuracy of Input and Output

Publications
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011.
Giorgos Flouris, Yannis Roussakis, Maria Poveda-Villalon, Pablo N. Mendes, Irini Fundulaki. Using Provenance for Quality Assessment and Repair in Linked Open Data. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn-12), 2012.
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF(S) DBs. Under review for the TODS Journal.
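A minimal sketch of how the provenance-based metrics above can act as repair preferences, i.e., as functions that pick one value out of a set of conflicting records. The record structure (value, source, modification date), the example dates and the 90-day reading of "3 months" are illustrative assumptions, not the D4.4 implementation.

from datetime import date, timedelta

SOURCE_RANK = {"pt": 0, "en": 1, "sp": 2, "ge": 3, "fr": 4}   # PT > EN > SP > GE > FR

def prefer_pt(records):
    """PREFER_PT: keep the value coming from the most reputable source."""
    return min(records, key=lambda r: SOURCE_RANK[r["source"]])

def prefer_recent(records):
    """PREFER_RECENT: keep the most recently modified value."""
    return max(records, key=lambda r: r["modified"])

def weighted_recent(records):
    """WEIGHTED_RECENT: recency first, but fall back to source reputation
    when the records are less than 3 months (here, ~90 days) apart."""
    dates = [r["modified"] for r in records]
    if max(dates) - min(dates) < timedelta(days=90):
        return prefer_pt(records)
    return prefer_recent(records)

# Illustrative conflict (sources and dates are placeholders):
conflict = [
    {"value": 69159, "source": "pt", "modified": date(2012, 10, 1)},
    {"value": 69616, "source": "en", "modified": date(2012, 11, 15)},
]
print(weighted_recent(conflict)["value"])   # dates differ by < 3 months, so PREFER_PT wins: 69159

PLAUSIBLE_PT and CONDITIONAL_PT would be thin wrappers around these, filtering out implausible values or switching the source ranking depending on the data value.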
BACKUP SLIDES

Repair
Removing invalidities by changing the ontology in an adequate manner
General concerns:
1. Return a valid ontology
   – Strict requirement
2. Minimize the impact of repair upon the data
   – Make minor, targeted modifications that repair the ontology without changing it too much
3. Return a “good” repair
   – Emulate the changes that the ontology engineer would make to repair the ontology

Inference
Inference is expressed using validity rules
Example:
◦ Transitivity of class subsumption
◦ ∀a,b,c: C_Sub(a,b) ∧ C_Sub(b,c) → C_Sub(a,c)
In practice we use labeling algorithms
◦ Avoid explicitly storing the inferred knowledge
◦ Improve the efficiency of reasoning
(a naive closure sketch appears at the end)

Quality Assessment
Quality = “fitness for use”
◦ Multi-dimensional, multi-faceted, context-dependent
Methodology for quality assessment
◦ Dimensions
  – Aspects of quality
  – Accuracy, completeness, timeliness, …
◦ Indicators
  – Metadata values for measuring dimensions
  – Last modification date (related to timeliness)
◦ Scoring Functions
  – Functions to quantify quality indicators
  – Days since last modification date
◦ Metrics
  – Measures of dimensions (result of scoring function)
  – Can be combined
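As a complement to the Inference slide, a naive bottom-up evaluation of the transitivity rule; this is the straightforward fixpoint computation, not the labeling algorithms actually used, and the pair encoding and class names in the example are illustrative assumptions.

def subsumption_closure(c_sub):
    """Apply C_Sub(a,b) ∧ C_Sub(b,c) → C_Sub(a,c) until no new facts are inferred."""
    closure = set(c_sub)
    changed = True
    while changed:
        changed = False
        inferred = {(a, d)
                    for (a, b) in closure
                    for (c, d) in closure if b == c}
        new = inferred - closure
        if new:
            closure |= new
            changed = True
    return closure

# Example with hypothetical classes: Sensor ⊑ Device ⊑ PhysicalObject
# yields the inferred fact Sensor ⊑ PhysicalObject.
print(subsumption_closure({("Sensor", "Device"), ("Device", "PhysicalObject")}))

The labeling algorithms mentioned on the slide avoid materialising this closure explicitly, which is what keeps reasoning efficient at scale.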