PPT - KU Leuven

Last update: 19 October 2016
Knowledge and the Web –
Data quality issues
Bettina Berendt
KU Leuven,
Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2016-17-1stsemester/kaw
Berendt: Knowledge and the Web, 2016, http://www.cs.kuleuven.be/~berendt/teaching
About your project proposals
General remarks (now)
Specific remarks (tomorrow in exercise session)
Agenda
Data quality, esp. in (L)OD: hopes, concerns, tests
What is data quality?
Dimensions of LOD quality
Provenance and inconsistencies
Task: be a data-quality detective!
Recall: Different types of “Open”
But why open? (from http://opendefinition.org/)
Do you see a statement in this definition that does not appear substantiated?
Can you give 3 reasons why it may be true?
Can you give 3 reasons why it may be false?
PS: “Modify”
… does not necessarily mean that everybody should be able to modify the original data.
It does mean that you can take it and modify your copy.
Specifically, opendefinition.org’s content is licensed under a CC Attribution 4.0 International License.
More about the Wikipedia case
https://www.scientificamerican.com/article/wikipedia-editors-woo-scientists-to-improve-content-quality/
Our brainstorming
(Task: check whether each is contained in Zaveri et al.’s list!)
readability (format, ...)
missing data
different units
inaccurate data, false data (intention?)
duplicate data
format/type/value (ex. string instead of numbers)
context description missing
language
verifiability?
outdated, non-constant, lack of data-creation timestamp
how much info can you get out of the data (how connected is it)?
easy availability?
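A few of the brainstormed issues (missing data, wrong format/type, duplicate data) can be checked mechanically. The following is only a minimal sketch under stated assumptions: the function name `check_records`, the field names and all record values are invented for illustration and do not come from any dataset discussed in the lecture.

```python
def check_records(records, expected_types):
    """Report missing values, type mismatches and exact duplicates.

    records: list of dicts; expected_types: field name -> Python type.
    Returns a list of (record index, field or None, description).
    """
    issues = []
    for i, rec in enumerate(records):
        for field, expected in expected_types.items():
            value = rec.get(field)
            if value is None or value == "":
                issues.append((i, field, "missing"))
            elif not isinstance(value, expected):
                issues.append((i, field, "wrong type: " + type(value).__name__))
    # Exact duplicates: records with identical field/value pairs.
    seen = {}
    for i, rec in enumerate(records):
        key = tuple(sorted(rec.items()))
        if key in seen:
            issues.append((i, None, "duplicate of record %d" % seen[key]))
        else:
            seen[key] = i
    return issues


# Invented example data: a string instead of a number, a missing value,
# and a verbatim duplicate of the first record.
records = [
    {"city": "CityA", "population": 1000},
    {"city": "CityB", "population": "2000"},
    {"city": "CityC", "population": None},
    {"city": "CityA", "population": 1000},
]
issues = check_records(records, {"city": str, "population": int})
for issue in issues:
    print(issue)
```

Dimensions such as verifiability, language or missing context are not covered here; they need information about the source, not only about the values.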
From questions students asked last year:
(1) Quality dimensions
Is there any gain or information that can be won by companies
that are not providing services on the web, given that services
provided by e.g. DBpedia are not stable? To what extent should
they be able to trust the quality and accessibility of these
services? Or do you think that the semantic web should be seen
as just a source of information but not as a possible asset to
other commercial institutions/industry/...?
How is the credibility of entries usually evaluated in real-life
applications, especially for user-generated content?
If some data is modified, what will happen to all the links linked
to the data? Will they be updated or not? What is the difficulty of
updating all related links? What will happen if only new links are
added but old links are not deleted or updated?
From questions: (2) “Inconsistencies between datasets”
If I want to reuse an already developed ontology, is there any general
methodology to choose one? How much impact do the incoming and
outgoing links have? Are there other parameters that should be
considered?
In an application using Linked Data, how do you handle disagreement
and contradictory information about an entity?
In the text “Linked Data: Evolving the Web into a Global Data Space” by
Heath and Bizer, it is stated that one of the properties of the “Web of
Data” is that it is able to represent disagreement and contradictory
information. How is contradictory information a good thing and how can
the user know which information is trustworthy? What if people
maliciously add wrong data?
How do we know which data is correct/incorrect? Who checks this?
Who or what determines if a link is valid? What can you do if there is a
wrong link in the current LOD set?
From your questions: (3) Issues affected by these questions
If we are going to make a user-friendly search engine to navigate
through linked data, what is the biggest problem that we need to
solve? What will affect the search speed and the accuracy of
search results? How can that be improved? How can users be
prevented from getting lost while searching?
Agenda
Data quality, esp. in (L)OD: hopes, concerns, tests
What is data quality?
Dimensions of LOD quality
Provenance and inconsistencies
Task: be a data-quality detective!
What is data quality?
http://www.dqglossary.com/data%20quality_.html
(a collection of definitions)
Data quality - The totality of features and characteristics of data
that bears on their ability to satisfy a given purpose [...]
Glossary of Quality Assurance Terms, Hanford.gov, 26 August
2009 11:49:56
Data quality - The state of completeness, validity, consistency,
timeliness and accuracy that makes data appropriate for a
specific use.
Government of British Columbia, 26 August 2009 11:48:58
Agenda
Data quality, esp. in (L)OD: hopes, concerns, tests
What is data quality?
Dimensions of LOD quality
Provenance and inconsistencies
Task: be a data-quality detective!
What is data quality in LOD?
(*: newly introduced for LD) Zaveri et al., 2012
A wonderful example of good research!
(Figure slides from Zaveri et al., 2012, one group of quality dimensions each:)
Contextual
Intrinsic (1), (2)
Accessibility
Representational (1), (2)
Dataset dynamicity
Trust (1), (2)
Agenda
Data quality, esp. in (L)OD: hopes, concerns, tests
What is data quality?
Dimensions of LOD quality
Provenance and inconsistencies
Task: be a data-quality detective!
PROV – the W3C Provenance Specifications

W3C Working Group Note 30 April 2013
http://www.w3.org/TR/prov-overview/

Tutorial at ESWC 2013 by Paul Groth, Jun Zhao, and Olaf
Hartig:
http://www.w3.org/2001/sw/wiki/ESWC2013ProvTutorial


Shown in class: introduction

Also interesting: PROV-O

(PPTs can be found on the tutorial page)
Book: http://www.provbook.org/ (accessible from within KU
Leuven)
ProvStore (https://provenance.ecs.soton.ac.uk/store/)
A case of provenance
Formalizing provenance: a high-level view
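As a rough illustration of such a high-level view, PROV describes provenance in terms of entities, activities and agents, connected by relations such as `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, `prov:wasAttributedTo` and `prov:wasAssociatedWith`. The sketch below stores a few such statements as plain tuples and follows the derivation chain of one entity. All `ex:` identifiers and the helper names `objects` and `lineage` are invented for this example; a real application would more likely use an RDF or PROV library.

```python
# PROV-style statements as (subject, relation, object) tuples.
# Every ex: identifier is hypothetical.
statements = [
    ("ex:chart", "prov:wasGeneratedBy", "ex:plotting"),
    ("ex:plotting", "prov:used", "ex:cleaned_data"),
    ("ex:plotting", "prov:wasAssociatedWith", "ex:alice"),
    ("ex:chart", "prov:wasDerivedFrom", "ex:cleaned_data"),
    ("ex:cleaned_data", "prov:wasDerivedFrom", "ex:raw_data"),
    ("ex:raw_data", "prov:wasAttributedTo", "ex:statistics_office"),
]


def objects(subject, relation):
    """All objects o such that (subject, relation, o) was stated."""
    return [o for s, r, o in statements if s == subject and r == relation]


def lineage(entity):
    """Entities that `entity` was directly or indirectly derived from."""
    result, todo = [], [entity]
    while todo:
        current = todo.pop()
        for source in objects(current, "prov:wasDerivedFrom"):
            if source not in result:
                result.append(source)
                todo.append(source)
    return result


sources = lineage("ex:chart")
# Agents to whom any of the underlying entities is attributed.
attributed_to = [a for e in sources for a in objects(e, "prov:wasAttributedTo")]
print(sources, attributed_to)
```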
More about provenance …
… by our invited speaker Tom De Nies on 23 Nov.
Inconsistencies between different data sources
What to do when you can retrieve/derive A and ¬A?

The anarchistic solution: Ex falso quodlibet (from a contradiction, anything follows)

The careful solution: “don’t know” (retract both)

The pragmatic solution: choose one
(Note: The classification of solutions is NOT standard! ;-) )
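Why the first option is called "anarchistic" may become clearer from the standard classical-logic argument, sketched here for an arbitrary statement B (the symbol B is introduced only for this illustration): once both A and ¬A are available, any B can be derived.

```latex
\begin{align*}
1.\quad & A        && \text{(retrieved or derived)} \\
2.\quad & \neg A   && \text{(retrieved or derived)} \\
3.\quad & A \lor B && \text{(from 1, disjunction introduction)} \\
4.\quad & B        && \text{(from 2 and 3, disjunctive syllogism)}
\end{align*}
```

The careful and the pragmatic solutions both avoid this by removing at least one of the premises 1 and 2 before reasoning continues.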
Inconsistency Resolution Strategies
Slide adapted from Bizer (2008)
Pass it on
→ Pass conflicting values to the user and let him/her decide.
Take the information
→ If value is missing in dataset 1, use value from dataset 2.
Trust your friends
→ Prefer information from certain sources.
Cry with the wolves
→ Choose most common value.
Meet in the middle
→ Take the average of all values.
Keep up to date
→ Use the newest value.
See also: Bleiholder and Naumann: Conflict Handling Strategies in an Integrated Information System. WWW2006.
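The six strategies are simple enough to sketch as small functions. This is an illustration only, not code from Bizer or from Bleiholder and Naumann: the representation of a claim as a `(value, source, timestamp)` tuple, the function names, and all values, source names and dates are assumptions made for the example.

```python
from collections import Counter
from statistics import mean

# Conflicting claims about one property of one entity (invented data).
claims = [
    (100, "source_a", "2015-01-01"),
    (130, "source_b", "2016-06-01"),
    (100, "source_c", "2014-03-01"),
]


def pass_it_on(claims):
    """Return all distinct values and let the user decide."""
    return sorted({value for value, _, _ in claims})


def take_the_information(value_1, value_2):
    """Use the value from dataset 2 only if dataset 1 has none."""
    return value_1 if value_1 is not None else value_2


def trust_your_friends(claims, preferred):
    """Take the value of the first preferred source that has one."""
    for source in preferred:
        for value, src, _ in claims:
            if src == source:
                return value
    return None


def cry_with_the_wolves(claims):
    """Choose the most common value."""
    return Counter(value for value, _, _ in claims).most_common(1)[0][0]


def meet_in_the_middle(claims):
    """Take the average of all values (numeric values only)."""
    return mean(value for value, _, _ in claims)


def keep_up_to_date(claims):
    """Use the newest value; ISO dates sort correctly as strings."""
    return max(claims, key=lambda claim: claim[2])[0]


print(pass_it_on(claims), cry_with_the_wolves(claims),
      meet_in_the_middle(claims), keep_up_to_date(claims))
```

Note that the strategies can disagree on the same input: here the most common value differs from both the average and the newest value, which is why the choice of strategy is itself a data-quality decision.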
“Prefer information from certain sources”
1. How do I know what the source is?
2. How do I know which sources are better than others?
3. What do I do with this information about sources?
Q1. How do I know what the source is?
Provenance
Q2. How do I know which sources are better than others?
How do I know this

In real life?

In reading scientific papers?

On the Web?

Will look at two approaches:

“democratic” (example: voting)

“meritocratic” (example: PageRank)
PageRank, “the basis of Google”
Slide based on
Karimzadehgan (2007)
Notion of “Hub” and “Authority”

Authority: Highly referenced pages

Hub: Pages containing good reference lists

Intuition: a high-quality site is one that has many high-quality
sites linking to it
Two algorithms developed at the same time:

Kleinberg HITS: hubs and authorities

Brin & Page PageRank: authorities

“Rediscovery” of a work from bibliometrics: Pinski & Narin (1976)

Many re-uses of this idea in different domains (information
retrieval, text summarization, social Web, …)
Introduction to Information Retrieval
Exploiting Inter-Document Links
Slide based on
Karimzadehgan (2007)
What does a link tell us?
Description (“anchor text”)
Links indicate the utility of a doc
(Figure labels: Hub, Authority)
PageRank: The “random surfer” model
Slide based on
Karimzadehgan (2007)
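Probability q of randomly jumping to that page:
$$\mathrm{PageRank}(A) = q + (1 - q) \sum_{i=1}^{k} \frac{\mathrm{PageRank}(A_i)}{C(A_i)}$$
where A_1, ..., A_k are the pages pointing to page A and C(A_i) is the number of outgoing links of A_i.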
PageRank and relevance
Slide based on
Karimzadehgan (2007)
“Random surfer” selects a page, keeps clicking links until “bored”,
then randomly selects another page.

PageRank(A) is the probability that such a user visits A

q is the probability of getting bored at a page
PageRank matrix can be computed offline
Google takes into account both the relevance of the page and
PageRank (and many other things of course)
Relevance is computed from the text and other features (of course, a
proprietary and evolving scheme).
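The random-surfer idea can be sketched as a simple iteration: start with equal scores, then repeatedly let every page pass a (1 - q) share of its score along its outgoing links while a share q is spread by random jumps. This is a minimal sketch, not Google's implementation. Two assumptions are made loudly here: the jump term is q/N (N = number of pages) rather than q, so that the scores sum to 1 and can be read as visit probabilities (this only rescales all scores by the constant factor N), and every page is assumed to have at least one outgoing link. The function name, the tiny link graph and q = 0.15 are invented for the example.

```python
def pagerank(links, q=0.15, iterations=100):
    """Iteratively compute PageRank scores.

    links: dict mapping each page to the list of pages it links to.
    q: probability of getting bored and jumping to a random page.
    Assumes every page has at least one outgoing link.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Random jump: every page receives q / n.
        new_rank = {page: q / n for page in pages}
        # Following links: each page splits (1 - q) of its score
        # evenly over its outgoing links.
        for page, targets in links.items():
            if not targets:
                continue  # pages without links pass nothing on in this sketch
            share = (1 - q) * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Invented three-page web: A links to B and C, B to C, C to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph C ends up ranked highest because it is linked from both other pages, and A benefits from being the only page that the highly ranked C links to.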
Q3. What do I do with this information about sources?
Example Bonatti et al. 2011
(Recall the LOD cloud.)
1. Compute the PageRank of sources (domains or documents)
2. Rank of a triple = sum over the ranks of sources containing this triple
3. Rank of an inference = minimum over the ranks of the triples needed to make this inference
4. Identify the minimum set of triples that cause the inconsistency
5. Remove the minimum-ranked triples to restore consistency
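Steps 2, 3 and 5 can be sketched in a few lines once source ranks (step 1) and the set of triples causing an inconsistency (step 4) are given. This is a simplified illustration of the idea as summarised on this slide, not Bonatti et al.'s actual implementation; the source names, rank values, triples and function names are all invented.

```python
# Step 1 is assumed to be done: one PageRank-like score per source.
source_rank = {"src1": 0.5, "src2": 0.3, "src3": 0.2}

# Which sources contain which triple (hypothetical data).
person = ("ex:x", "rdf:type", "ex:Person")
company = ("ex:x", "rdf:type", "ex:Company")
disjoint = ("ex:Person", "owl:disjointWith", "ex:Company")
triple_sources = {
    person: ["src1", "src2"],
    company: ["src3"],
    disjoint: ["src1"],
}


def triple_rank(triple):
    """Step 2: sum over the ranks of the sources containing the triple."""
    return sum(source_rank[s] for s in triple_sources[triple])


def inference_rank(premises):
    """Step 3: an inference is only as strong as its weakest premise."""
    return min(triple_rank(t) for t in premises)


def triples_to_remove(inconsistent_set):
    """Step 5: pick the minimum-ranked triples of the inconsistent set."""
    lowest = min(triple_rank(t) for t in inconsistent_set)
    return [t for t in inconsistent_set if triple_rank(t) == lowest]


# Step 4 is assumed to have identified these three triples as the
# minimal set that makes ex:x both a Person and a Company.
conflict = [person, company, disjoint]
removed = triples_to_remove(conflict)
print(removed)
```

Here the claim that ex:x is a Company is dropped, because it is supported only by the lowest-ranked source.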
Provenance plus other metrics for quality “repair”
Flouris et al.
Using Provenance for Quality Assessment and Repair in Linked Open Data
Proc. of EvoDyn-2012 – Knowledge Evolution and Ontology Dynamics 2012
http://ceur-ws.org/Vol-890
Agenda
Data quality, esp. in (L)OD: hopes, concerns, tests
What is data quality?
Dimensions of LOD quality
Provenance and inconsistencies
Task: be a data-quality detective!
Task
What are potential data quality issues with the following
sources?
Recall: Dimensions of data quality in LOD
‘Where can I find social network datasets for my research on …?’
Ex.: https://snap.stanford.edu/data/#socnets
→ In what sense(s) could these data be of limited/no use for the research Q?
Drug users in Oakland heatmap and PredPol predictions
"predictive policing" program. Police car laptops will display maps showing
locations where crime is likely to occur, based on data-crunching algorithms […]
[The algorithms] replace more basic trendspotting and gut feelings about where
crimes will happen and who will commit them with […] objective analysis.
Demographics in policed areas, and a simulation model
Left:
Top: Number of days with targeted policing for drug crimes in areas flagged by PredPol analysis of Oakland police data.
Middle: Targeted policing for drug crimes, by “race”.
Bottom: Estimated drug use by “race”.
Right:
Top: Number of drug arrests made by Oakland police department, 2010. (1) West Oakland, (2) International Boulevard.
Bottom: Estimated number of drug users, based on 2011 National Survey on Drug Use and Health
→ Questions
"predictive policing" program. Police car laptops will display
maps showing locations where crime is likely to occur, based on
data-crunching algorithms […] [The algorithms] replace more
basic trendspotting and gut feelings about where crimes will
happen and who will commit them with […] objective analysis.

What does “objectivity” mean here?

Where’s this “objectivity”, and where is it not?

What concept are the police officers using PredPol trying to
capture, and with what measures?

What concept(s) are they actually capturing?
To be continued in the exercise session!
Outlook
Data quality, esp. in (L)OD: hopes, concerns, tests
What is data quality?
Dimensions of LOD quality
Provenance and inconsistencies
Task: be a data-quality detective!
Semantic technologies in practice incl. industry
Next lecture
Invited talk
Aad Versteden, Tenforce
Semantic Technologies in Practice
He proposed the following exercise as a preparation:
Every day, try this once. Find some information, any
information. This can be the nutritional value of the milk you
drink in the morning, or the content of a newspaper. Think about
who probably owns the data, who published the data, and who
could make use of the data. Are these always three parties? Are
there sometimes fewer, sometimes more?
Main references
• Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., & Auer, S. (2012). Quality assessment methodologies for Linked Open Data. A systematic literature review and conceptual framework. (Figures in this slideset are from a prior version of the publication in the Semantic Web Journal, taken from www.semantic-web-journal.net/system/files/swj414.pdf)
• http://www.provbook.org
• Flouris et al. (2012). Using Provenance for Quality Assessment and Repair in Linked Open Data. Proc. of EvoDyn-2012 – Knowledge Evolution and Ontology Dynamics 2012. http://ceur-ws.org/Vol-890
• Bonatti, P. A., Hogan, A., Polleres, A., & Sauro, L. (2011). Robust and scalable Linked Data reasoning incorporating provenance and trust annotations. J. Web Sem. 9(2): 165-201. http://sw.deri.org/~aidanh/docs/saor_ann_final.pdf
• Lum, K. and Isaac, W. (2016). To predict and serve? Significance, 13: 14–19. http://onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2016.00960.x/full