Last update: 19 October 2016
Knowledge and the Web – Data quality issues
Bettina Berendt
KU Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2016-17-1stsemester/kaw
Berendt: Knowledge and the Web, 2016, http://www.cs.kuleuven.be/~berendt/teaching

About your project proposals
• General remarks (now)
• Specific remarks (tomorrow in exercise session)

Agenda
• Data quality, esp. in (L)OD: hopes, concerns, tests
• What is data quality?
• Dimensions of LOD quality
• Provenance and inconsistencies
• Task: be a data-quality detective!

Recall: Different types of “Open”

But why open? (from http://opendefinition.org/)
• Do you see a statement in this definition that does not appear substantiated?
• Can you give 3 reasons why it may be true?
• Can you give 3 reasons why it may be false?

PS: “Modify” …
• … does not necessarily mean that everybody should be able to modify the original data.
• It does mean that you can take it and modify your copy.
• Specifically, opendefinition.org’s content is licensed under a CC Attribution 4.0 International License.

More about the Wikipedia case
https://www.scientificamerican.com/article/wikipedia-editorswoo-scientists-to-improve-content-quality/

Our brainstorming (Task: check whether each is contained in Zaveri et al.’s list!)
• readability (format, ...)
• missing data
• different units
• inaccurate data, false data (intention?)
• duplicate data
• format/type/value (e.g. string instead of numbers)
• context description missing
• language
• verifiability?
• outdated, non-constant, lack of data-creation timestamp
• how much info can you get out of the data (how connected is it)?
• easy availability?

From questions students asked last year: (1) Quality dimensions
• Is there any gain or information that can be won by companies that are not providing services on the web, given that services provided by e.g. DBpedia are not stable? To what extent should they be able to trust the quality and accessibility of these services? Or do you think that the semantic web should be seen as just a source of information but not as a possible asset to other commercial institutions/industry/...?
• How is the credibility of entries usually evaluated in real-life applications, especially for user-generated content?
• If some data is modified, what will happen to all the links linked to the data? Will they be updated or not? What is the difficulty in updating all related links? What will happen if only new links are added, but old links are not deleted or updated?

From questions: (2) “Inconsistencies between datasets”
• If I want to reuse an already developed ontology, is there any general methodology to choose one? How much impact do the incoming and outgoing links have? Are there other parameters that should be considered?
• In an application using Linked Data, how do you handle disagreement and contradictory information about an entity?
• In the text “Linked Data: Evolving the Web into a Global Data Space” by Heath and Bizer, it is stated that one of the properties of the “Web of Data” is that it is able to represent disagreement and contradictory information.
How is contradictory information a good thing, and how can the user know which information is trustworthy?
• What if people maliciously add wrong data? How do we know which data is correct/incorrect? Who checks this?
• Who or what determines if a link is valid? What can you do if there is a wrong link in the current LOD set?

From your questions: (3) Issues affected by these questions
• If we are going to make a user-friendly search engine to navigate through linked data, what is the biggest problem that we need to solve? What will affect the search speed and the accuracy of search results? How can that be improved? How can we prevent users from getting lost while searching?

What is data quality?
http://www.dqglossary.com/data%20quality_.html (a collection of definitions)
• Data quality - The totality of features and characteristics of data that bears on their ability to satisfy a given purpose [...] (Glossary of Quality Assurance Terms, Hanford.gov, 26 August 2009 11:49:56)
• Data quality - The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. (Government of British Columbia, 26 August 2009 11:48:58)

What is data quality in LOD?
(*: newly introduced for LD) Zaveri et al., 2012

A wonderful example of good research!

Contextual

Intrinsic (1)

Intrinsic (2)

Accessibility

Representational (1)

Representational (2)

Dataset dynamicity
Trust (1)

Trust (2)

PROV – the W3C Provenance Specifications
• W3C Working Group Note, 30 April 2013: http://www.w3.org/TR/prov-overview/
• Tutorial at ESWC 2013 by Paul Groth, Jun Zhao, and Olaf Hartig: http://www.w3.org/2001/sw/wiki/ESWC2013ProvTutorial
• Shown in class: introduction
• Also interesting: PROV-O (PPTs can be found on the tutorial page)
• Book: http://www.provbook.org/ (accessible from within KU Leuven)

ProvStore (https://provenance.ecs.soton.ac.uk/store/)

A case of provenance

Formalizing provenance: a high-level view

More about provenance …
… by our invited speaker Tom De Nies on 23 Nov.

Inconsistencies between different data sources

What to do when you can retrieve/derive A and ¬A
• The anarchistic solution: Ex falso quodlibet (from a contradiction, anything follows)
• The careful solution: “don’t know” (retract both)
• The pragmatic solution: choose one
(Note: The classification of solutions is NOT standard!
;-) )

Inconsistency Resolution Strategies
Slide adapted from Bizer (2008)
• Pass it on: Pass conflicting values to the user and let him/her decide.
• Take the information: If a value is missing in dataset 1, use the value from dataset 2.
• Trust your friends: Prefer information from certain sources.
• Cry with the wolves: Choose the most common value.
• Meet in the middle: Take the average of all values.
• Keep up to date: Use the newest value.
See also: Bleiholder and Naumann: Conflict Handling Strategies in an Integrated Information System. WWW 2006.

“Prefer information from certain sources”
1. How do I know what the source is?
2. How do I know which sources are better than others?
3. What do I do with this information about sources?

Q1. How do I know what the source is?
Provenance

Q2. How do I know which sources are better than others?
How do I know this
• In real life?
• In reading scientific papers?
• On the Web?
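The strategies on the “Inconsistency Resolution Strategies” slide can be sketched in a few lines of Python. This is an illustrative sketch only: the `Claim` record, the function names, the source names, the trust scores and the numeric values are all made up for the example and are not taken from Bizer (2008) or from Bleiholder and Naumann.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Claim:
    """One value reported for the same property of the same entity (hypothetical record)."""
    value: Optional[float]  # None means the source has no value
    source: str
    year: int               # when the value was published


def known(claims):
    """Claims that actually carry a value."""
    return [c for c in claims if c.value is not None]


def pass_it_on(claims):
    """Pass it on: return all distinct values and let the user decide."""
    return sorted({c.value for c in known(claims)})


def take_the_information(primary, fallback):
    """Take the information: use dataset 2 only if dataset 1 has no value."""
    return primary if primary is not None else fallback


def trust_your_friends(claims, trust):
    """Trust your friends: prefer the value from the most trusted source."""
    return max(known(claims), key=lambda c: trust.get(c.source, 0.0)).value


def cry_with_the_wolves(claims):
    """Cry with the wolves: choose the most common value."""
    return Counter(c.value for c in known(claims)).most_common(1)[0][0]


def meet_in_the_middle(claims):
    """Meet in the middle: take the average of all values."""
    return mean(c.value for c in known(claims))


def keep_up_to_date(claims):
    """Keep up to date: use the newest value."""
    return max(known(claims), key=lambda c: c.year).value


# Made-up conflicting claims about one numeric property.
claims = [
    Claim(100.0, "source-a", 2012),
    Claim(100.0, "source-b", 2014),
    Claim(130.0, "source-c", 2016),
]
trust = {"source-a": 0.9, "source-b": 0.5, "source-c": 0.2}  # assumed scores
```

The strategies disagree on this toy input: majority voting and the trust-based choice return 100.0, the newest value is 130.0, and the average is 110.0. The trust scores are simply given here; where such scores could come from is the subject of Q2.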
We will look at two approaches:
• “democratic” – example: voting
• “meritocratic” – example: PageRank

PageRank, “the basis of Google”
Slide based on Karimzadehgan (2007)
• Notion of “Hub” and “Authority”
• Authority: highly referenced pages
• Hub: pages containing good reference lists
• Intuition: a high-quality site is one that has many high-quality sites linking to it
• Two algorithms developed at the same time: Kleinberg’s HITS (hubs and authorities) and Brin & Page’s PageRank (authorities)
• “Rediscovery” of a work from bibliometrics: Pinski & Narin (1976)
• Many re-uses of this idea in different domains (information retrieval, text summarization, social Web, …)

Exploiting Inter-Document Links
Slide based on Karimzadehgan (2007)
What does a link tell us?
• Description (“anchor text”)
• Links indicate the utility of a doc
[Figure labels: Hub, Authority]

PageRank: The “random surfer” model
Slide based on Karimzadehgan (2007)
$$\mathrm{PageRank}(A) = q + (1 - q) \cdot \sum_{i=1}^{k} \frac{\mathrm{PageRank}(A_i)}{C(A_i)}$$
• q: probability of randomly jumping to that page
• A_1, …, A_k: pages pointing to page A
• C(A_i): number of links going out of A_i

PageRank and relevance
Slide based on Karimzadehgan (2007)
• “Random surfer” selects a page, keeps clicking links until “bored”, then randomly selects another page.
• PageRank(A) is the probability that such a user visits A
• q is the probability of getting bored at a page
• PageRank matrix can be computed offline
• Google takes into account both the relevance of the page and PageRank (and many other things of course)
• Relevance is computed from the text and other features (of course, a proprietary and evolving scheme).
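The random-surfer formula on the “PageRank: The ‘random surfer’ model” slide can be tried out on a small graph. The following is a minimal Python sketch, not Google's implementation: the function name `pagerank`, the four-page `toy_web` graph, the number of iterations and the choice q = 0.15 are assumptions made for illustration. It iterates the formula in the form PageRank(A) = q + (1 - q) * sum of PageRank(A_i) / C(A_i), so the scores sum to the number of pages rather than to 1 when every page has at least one outgoing link. Dividing each score by the number of pages then gives a probability-like value.

```python
def pagerank(links, q=0.15, iterations=100):
    """Iterate PageRank(A) = q + (1 - q) * sum(PageRank(A_i) / C(A_i)).

    links maps each page to the list of pages it links to.
    q is the probability of getting "bored" and jumping to a random page.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}

    # incoming[p] lists the pages A_i that point to p.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    # C(A_i): number of distinct outgoing links of each page.
    out_degree = {p: len(set(links.get(p, []))) for p in pages}

    ranks = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        # Every page is updated from the previous round's ranks.
        ranks = {
            p: q + (1 - q) * sum(ranks[s] / out_degree[s] for s in incoming[p])
            for p in pages
        }
    return ranks


# Hypothetical toy web: C is linked to by A, B and D; nobody links to D.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(toy_web)
best_first = sorted(ranks, key=ranks.get, reverse=True)
```

In this toy graph, page C ends up with the highest score because three pages link to it, and page D, which has no incoming links, keeps only the random-jump share q. Pages without outgoing links are not treated specially in this sketch.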
Q3. What do I do with this information about sources?
Example: Bonatti et al. 2011
1. Compute the PageRank of sources (domains or documents). Recall the LOD cloud.
2. Rank of a triple = sum over the ranks of sources containing this triple
3. Rank of an inference = minimum over the ranks of the triples needed to make this inference
4. Identify the minimum set of triples that cause the inconsistency
5. Remove the minimum-ranked triples to restore consistency

Provenance plus other metrics for quality “repair”
Flouris et al.: Using Provenance for Quality Assessment and Repair in Linked Open Data. Proc. of EvoDyn-2012 – Knowledge Evolution and Ontology Dynamics 2012. ceur-ws.org/Vol-890

Task
What are potential data quality issues with the following sources?

Recall: Dimensions of data quality in LOD

“Where can I find social-network datasets for my research on …?”
• Ex.: https://snap.stanford.edu/data/#socnets
• In what sense(s) could these data be of limited/no use for the research Q?

Drug users in Oakland heatmap and PredPol predictions
"predictive policing" program.
Police car laptops will display maps showing locations where crime is likely to occur, based on data-crunching algorithms […] [The algorithms] replace more basic trendspotting and gut feelings about where crimes will happen and who will commit them with […] objective analysis.

Demographics in policed areas, and a simulation model
Left:
• Top: Number of days with targeted policing for drug crimes in areas flagged by PredPol analysis of Oakland police data.
• Middle: Targeted policing for drug crimes, by “race”.
• Bottom: Estimated drug use by “race”.
Right:
• Top: Number of drug arrests made by Oakland police department, 2010. (1) West Oakland, (2) International Boulevard.
• Bottom: Estimated number of drug users, based on 2011 National Survey on Drug Use and Health

Questions
"predictive policing" program. Police car laptops will display maps showing locations where crime is likely to occur, based on data-crunching algorithms […] [The algorithms] replace more basic trendspotting and gut feelings about where crimes will happen and who will commit them with […] objective analysis.
• What does “objectivity” mean here? Where’s this “objectivity”, and where is it not?
• What concept are the police officers using PredPol trying to capture, and with what measures?
• What concept(s) are they actually capturing?

To be continued in the exercise session!

Outlook
• Data quality, esp. in (L)OD: hopes, concerns, tests
• What is data quality?
• Dimensions of LOD quality
• Provenance and inconsistencies
• Task: be a data-quality detective!
• Semantic technologies in practice incl.
industry

Next lecture
Invited talk: Aad Versteden, Tenforce: Semantic Technologies in Practice
He proposed the following exercise as a preparation:
Every day, try this once. Find some information, any information. This can be the nutritional value of the milk you drink in the morning, or the content of a newspaper. Think about who probably owns the data, who published the data, and who could make use of the data. Are these always three parties? Are there sometimes fewer, sometimes more?

Main references
• Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., & Auer, S. (2012). Quality assessment methodologies for Linked Open Data. A systematic literature review and conceptual framework. (Figures in this slide set are from a prior version of the publication in the Semantic Web Journal, taken from www.semantic-web-journal.net/system/files/swj414.pdf)
• http://www.provbook.org
• Flouris et al. (2012). Using Provenance for Quality Assessment and Repair in Linked Open Data. Proc. of EvoDyn-2012 – Knowledge Evolution and Ontology Dynamics 2012. http://ceur-ws.org/Vol-890
• Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro: Robust and scalable Linked Data reasoning incorporating provenance and trust annotations. J. Web Sem. 9(2): 165-201 (2011). http://sw.deri.org/~aidanh/docs/saor_ann_final.pdf
• Lum, K. and Isaac, W. (2016), To predict and serve? Significance, 13: 14–19. http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2016.00960.x/full