Extracting core knowledge from Linked Data
Valentina Presutti, Lora Aroyo, Alessandro Adamou,
Balthasar Schopman, Aldo Gangemi, Guus Schreiber
Second International Workshop on Consuming Linked Data (COLD2011)
Bonn, Germany, 2011-10-23
Outline
• Objectives, motivation, and results
• The Knowledge Architecture of a dataset
• Empirical analysis on three datasets
• Building prototypical queries through the knowledge architecture
• Conclusion and ongoing work
Objectives
• Our initial goal was to design recommendation strategies based on linked data
– E.g. for TV programs
• Before we could focus on defining such strategies, we ran into some issues…
Mmmmhhh…
• What knowledge is in there?
• What vocabularies do they use?
• How to query them?
• Is there any way to inspect them automatically?
[Figure: datasets: MusicBrainz, Jamendo, LinkedMDB, John Peel Sessions, Last.fm, BBC Programs, BBC Music.]
Additional objectives emerged
• Studying how to facilitate linked data consumption
– By supporting selection of datasets (e.g. for reuse)
– By enabling automatic query building
• Studying how to perform empirical analysis of a dataset's conceptual design
– How are vocabularies actually used?
– What are the central types/properties?
– What is the core knowledge of a dataset?
Results
• A method for extracting the main knowledge components of an LD dataset
[Figure: examples of the three kinds of component, using Music Ontology terms: a Path, a Knowledge Pattern, and a Cluster of paths (five paths of length 3 through mo:available_as, each ending in dc:format and rdfs:Literal).]
• An ontology for representing and storing them, i.e. the Knowledge Architecture
[Figure: the Knowledge Architecture ontology. Classes: Dataset, PropertyUsageInDataset, PathOccurrencesInDataset, PathElement, Path, Property, Type, KnowledgePattern, CentralType, CentralProperty. Properties: inDataset, isPropertyUsageOf, hasPathElement, hasPEObjectType, hasPath, hasProperty, mapsToKP.]
• A procedure for building prototypical queries without prior knowledge of a dataset's vocabulary
CONSTRUCT {
  ?t a mo:Track . ?t dc:title ?t1 . ?t mo:available_as ?t2 .
  ?t2 dc:format ?f . ?t mo:license ?t3 . ?t mo:track_number ?t4 .
  ?s a mo:Signal . ?s mo:published_as ?t . ?r a mo:Record .
  ?r mo:track ?t .
}
WHERE {
  ?t a mo:Track . ?t dc:title ?t1 . {
    {OPTIONAL { ?t mo:available_as ?t2 . ?t2 dc:format ?f }}
    UNION {OPTIONAL { ?t mo:license ?t3 }}
    UNION {OPTIONAL { ?t mo:track_number ?t4 }}
    UNION {OPTIONAL { ?s a mo:Signal . ?s mo:published_as ?t }}
    UNION {OPTIONAL { ?r a mo:Record . ?r mo:track ?t }}
  }
}
Principles and practices
• Extraction of paths and usage stats
• Centrality of types and properties
• Emerging Knowledge Patterns
• Path clustering
• Construction of a dataset knowledge architecture
Path identification (Jamendo)
[Figure: the Jamendo type graph, with the types mo:MusicArtist, mo:Signal, mo:Record, mo:Track, mo:Playlist, foaf:Document, rdfs:Resource, and rdfs:Literal.]
PREFIX mo: <http://purl.org/ontology/mo/>
Path identification (length 1) (Jamendo)
[Figure: the Jamendo type graph with a path of length 1 highlighted, labelled foaf:maker.]
Path identification (length 2) (Jamendo)
[Figure: the Jamendo type graph with a path of length 2 highlighted, using mo:published_as and mo:track_number.]
Path identification (length 3) (Jamendo)
[Figure: the Jamendo type graph with a path of length 3 highlighted, using mo:recorded_as, mo:published_as, and dc:title; mo:published_as is annotated as the path element at position 2.]
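The path identification steps above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a path element is a (subject type, property, object type) triple abstracted from instance data, and that a longer path chains elements whose types line up. The toy triples and the names `path_elements` and `extend_paths` are hypothetical.

```python
from collections import Counter

# Toy instance data (hypothetical): (subject, property, object) triples.
TRIPLES = [
    ("signal1", "mo:published_as", "track1"),
    ("record1", "mo:track", "track1"),
    ("track1", "dc:title", "A title"),
    ("track1", "mo:track_number", "1"),
]
# rdf:type of each resource; any untyped object is treated as rdfs:Literal.
TYPES = {"signal1": "mo:Signal", "record1": "mo:Record", "track1": "mo:Track"}


def path_elements(triples, types):
    """Abstract each triple to (subject type, property, object type) and count usage."""
    counts = Counter()
    for s, p, o in triples:
        counts[(types[s], p, types.get(o, "rdfs:Literal"))] += 1
    return counts


def extend_paths(paths, elements):
    """Append one path element to every path whose last object type matches its subject type."""
    return [path + [e] for path in paths for e in elements if path[-1][2] == e[0]]


elements = sorted(path_elements(TRIPLES, TYPES))
length1 = [[e] for e in elements]          # paths of length 1
length2 = extend_paths(length1, elements)  # paths of length 2
```

Calling `extend_paths(length2, elements)` would give paths of length 3 in the same way. The sketch enumerates type-level paths only; counting how often each path occurs in the instance data is left out.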
Centrality (types) (Jamendo)
[Figure: the Jamendo type graph showing the properties mo:published_as, mo:track, dc:title, mo:track_number, mo:available_as, and mo:license.]
Centrality (properties) (Jamendo)
[Figure: the Jamendo type graph showing the properties mo:recorded_as, mo:published_as, dc:title, mo:track_number, mo:available_as, and mo:license.]
Centrality (properties) (Jamendo)
[Figure: the Jamendo type graph next to a length 3 path cluster, with the properties mo:recorded_as, mo:published_as, mo:available_as, mo:license, dc:title, and mo:track_number.]
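Path clustering can be illustrated with the five Jamendo paths of length 3 that appear in the cluster-of-paths figures of this deck. The sketch below assumes that a cluster groups paths sharing the same property at a given position; the function name `cluster_by_position` and the tuple layout are hypothetical.

```python
from collections import defaultdict

# Each path is a list of (subject type, property, object type) elements.
PATHS = [
    [("mo:Record", "mo:track", "mo:Track"),
     ("mo:Track", "mo:available_as", "mo:Playlist"),
     ("mo:Playlist", "dc:format", "rdfs:Literal")],
    [("mo:Signal", "mo:published_as", "mo:Track"),
     ("mo:Track", "mo:available_as", "mo:Playlist"),
     ("mo:Playlist", "dc:format", "rdfs:Literal")],
    [("mo:MusicArtist", "foaf:made", "mo:Record"),
     ("mo:Record", "mo:available_as", "mo:Torrent"),
     ("mo:Torrent", "dc:format", "rdfs:Literal")],
    [("mo:MusicArtist", "foaf:made", "mo:Record"),
     ("mo:Record", "mo:available_as", "mo:ED2K"),
     ("mo:ED2K", "dc:format", "rdfs:Literal")],
    [("mo:MusicArtist", "foaf:made", "mo:Record"),
     ("mo:Record", "mo:available_as", "mo:Playlist"),
     ("mo:Playlist", "dc:format", "rdfs:Literal")],
]


def cluster_by_position(paths, position):
    """Group paths by the property found at a 1-based position."""
    clusters = defaultdict(list)
    for path in paths:
        clusters[path[position - 1][1]].append(path)
    return dict(clusters)


clusters = cluster_by_position(PATHS, 2)
```

Here all five paths fall into one cluster keyed by mo:available_as, because that property sits at position 2 in each of them.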
Emerging Knowledge Pattern
[Figure: a Knowledge Pattern emerging from the data. Types: mo:Track, mo:Playlist, mo:Record, mo:MusicArtist, mo:ED2K, mo:Torrent, tags:Tag, rdfs:Literal. Properties: mo:track, mo:available_as, foaf:maker, tags:taggedWithTag, mo:image, dc:date, dc:title, dc:description.]
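One simple reading of how a Knowledge Pattern emerges is sketched below: collect the path elements whose subject is the chosen type and keep those used often enough. This is an assumption for illustration only, since the slides do not spell out the extraction rule. The counts, the `ex:rarely_used` property, the `min_count` threshold, and the function name are all hypothetical.

```python
from collections import defaultdict

# Hypothetical usage counts per path element (subject type, property, object type).
ELEMENT_COUNTS = {
    ("mo:Track", "dc:title", "rdfs:Literal"): 120,
    ("mo:Track", "mo:license", "foaf:Document"): 118,
    ("mo:Track", "mo:available_as", "mo:Playlist"): 90,
    ("mo:Record", "mo:track", "mo:Track"): 120,
    ("mo:Track", "ex:rarely_used", "rdfs:Literal"): 1,
}


def knowledge_pattern(element_counts, central_type, min_count=1):
    """Map each sufficiently used property of the central type to its object types."""
    pattern = defaultdict(set)
    for (subject_type, prop, object_type), count in element_counts.items():
        if subject_type == central_type and count >= min_count:
            pattern[prop].add(object_type)
    return dict(pattern)


kp_track = knowledge_pattern(ELEMENT_COUNTS, "mo:Track", min_count=10)
```

The sketch only follows outgoing properties; incoming ones such as mo:track from mo:Record would need a second pass keyed on the object type.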
Dataset knowledge architecture
[Figure: the Knowledge Architecture ontology. Classes: Dataset, PropertyUsageInDataset, PathOccurrencesInDataset, PathElement, Path, Property, Type, KnowledgePattern, CentralType, CentralProperty. Properties: inDataset, isPropertyUsageOf, hasPathElement, hasPath, hasPEObjectType, hasProperty, mapsToKP.]
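As a rough aid to reading the ontology, its main classes can be mirrored as plain Python records. This is only a sketch of one possible in-memory shape, not the OWL ontology itself: the field names are hypothetical, the comments show which ontology term each field is assumed to correspond to, and the occurrence count in the example is a placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PathElement:
    prop: str          # hasProperty
    object_type: str   # hasPEObjectType
    position: int      # place of the element within its path (assumed field)


@dataclass
class Path:
    elements: List[PathElement]   # hasPathElement

    @property
    def length(self) -> int:
        return len(self.elements)


@dataclass
class PathOccurrencesInDataset:
    path: Path          # hasPath
    dataset: str        # inDataset
    occurrences: int    # hasNumberOfOccurrences


@dataclass
class PropertyUsageInDataset:
    prop: str       # isPropertyUsageOf
    dataset: str    # inDataset
    triples: int    # hasNumberOfTriples


# Example: one length-3 path recorded for Jamendo (placeholder count).
path = Path([
    PathElement("mo:published_as", "mo:Track", 1),
    PathElement("mo:available_as", "mo:Playlist", 2),
    PathElement("dc:format", "rdfs:Literal", 3),
])
record = PathOccurrencesInDataset(path, "Jamendo", occurrences=1)
```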
Knowledge Architecture Indicators
Measure                  | Computation
#Triples                 | Sum all PropertyUsageInDataset triples.
#Properties              | Σ (u | a(u, PropertyUsageInDataset))
#Types                   | |(union set of types related to PathElement by either hasPathElementSubjectType or hasPathElementObjectType)|
#Paths                   | #paths of length l for l = 2...4
#Path Occurrences        | Σ values of hasNumberOfOccurrences of paths of length n, n = 2...4.
Property usage in paths  | (# property occurrences in paths of length l) ÷ (# properties in dataset), l = 2…4
Type betweenness         | # paths l=2 with t as an object at position 1
Property betweenness     | # paths l=3 with p at position 2
#Triples by property     | v | hasNumberOfTriples(u,v) . a(u, PropertyUsageInDataset) . hasProperty(u,p)
Note: we use taxonomical information for eliminating redundancies.
Note: paths with l > 3 are concatenations of paths with l <= 3.
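A few of the indicators above can be computed directly once property usage and paths are available in memory. The sketch below uses hypothetical toy data and function names. It reads "property occurrences in paths" as the number of distinct properties that occur in paths of a given length, which is an assumption about the intended meaning.

```python
# Hypothetical property usage records: property -> number of triples.
PROPERTY_USAGE = {"mo:track": 3, "mo:published_as": 2, "dc:title": 4}

# Hypothetical paths, each a list of (subject type, property, object type).
PATHS = [
    [("mo:Record", "mo:track", "mo:Track"), ("mo:Track", "dc:title", "rdfs:Literal")],
    [("mo:Signal", "mo:published_as", "mo:Track"), ("mo:Track", "dc:title", "rdfs:Literal")],
]


def n_triples(usage):
    """#Triples: sum of the triple counts of all property usages."""
    return sum(usage.values())


def n_properties(usage):
    """#Properties: number of property usage records."""
    return len(usage)


def property_usage(paths, usage, length):
    """Share of the dataset's properties that occur in paths of the given length."""
    used = {element[1] for path in paths if len(path) == length for element in path}
    return len(used) / len(usage)


def type_betweenness(paths, t):
    """Number of length-2 paths with type t as the object at position 1."""
    return sum(1 for path in paths if len(path) == 2 and path[0][2] == t)


def property_betweenness(paths, prop):
    """Number of length-3 paths with property prop at position 2."""
    return sum(1 for path in paths if len(path) == 3 and path[1][1] == prop)
```

On this toy data mo:Track has a type betweenness of 2, since both length-2 paths pass through it.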
Use cases
Dataset                     | nTriples  | nProps | nTypes
Jamendo                     | 1,047,950 | 24     | 11
John Peel Sessions (JPeel)  | 271,369   | 24     | 9
LinkedMDB (LMDB)            | 6,147,978 | 221    | 53
• Different in size
• Same or related domains
• Custom/standard/popular vocabularies
Measure | Dataset | L=2 | L=3   | L=4
nPath   | Jamendo | 33  | 31    | 26
nPath   | Jpeel   | 56  | 65    | 73
nPath   | LMDB    | 546 | 1,665 | 3,757
Measure  | Dataset | L=2        | L=3         | L=4
nPathOcc | Jamendo | 999,052    | 1,452,645   | 2,259,097
nPathOcc | Jpeel   | 1,948,999  | 14,447,400  | 1,240,815,607
nPathOcc | LMDB    | 25,765,513 | 184,950,315 | 1,402,705,472
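Measure                 | Dataset | L=2   | L=3   | L=4
Property usage in paths | Jamendo | 1.000 | 0.917 | 0.834
Property usage in paths | Jpeel   | 0.917 | 0.834 | 0.792
Property usage in paths | LMDB    | 0.847 | 0.747 | 0.747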
Analysis on Jamendo
[Figure: four bar charts. (a) Instances per type and (b) Type betweenness, over the types mo:Lyrics, mo:MusicArtist, tags2:Tag, mo:Record, mo:Torrent, mo:ED2K, time:Interval, mo:Signal, mo:Playlist, and mo:Track. (c) Triples per property and (d) Property betweenness, over the properties foaf:maker, foaf:made, mo:track, mo:time, mo:published_as, tags2:taggedWithTag, and mo:available_as.]
Building mo:Track query
• The query triple patterns are derived from a selection of path elements
PE = {} % set of path elements
Building mo:Track query
• Select a central type
PE = {} % set of path elements
C = mo:Track
Building mo:Track query
• Get its Knowledge Pattern
PE = {} % set of path elements
C = mo:Track
KP(C) = mo:Track with properties mo:track_number, mo:license, dc:title, mo:available_as
Building mo:Track query
• Feed the set of path elements
PE = {mo:track_number, mo:license, dc:title, mo:available_as}
C = mo:Track
KP(C) = mo:Track with properties mo:track_number, mo:license, dc:title, mo:available_as
Building mo:Track query
• Identify central properties in KP(mo:Track)
PE = {mo:track_number, mo:license, dc:title, mo:available_as}
C = mo:Track
KP(C) = mo:Track with properties mo:track_number, mo:license, dc:title, mo:available_as
Building mo:Track query
• Get its cluster of paths of length 3
PE = {mo:track_number, mo:license, dc:title, mo:available_as}
C = mo:Track
KP(C) = mo:Track with properties mo:track_number, mo:license, dc:title, mo:available_as
Cluster of paths of length 3:
mo:Record -mo:track-> mo:Track -mo:available_as-> mo:Playlist -dc:format-> rdfs:Literal
mo:Signal -mo:published_as-> mo:Track -mo:available_as-> mo:Playlist -dc:format-> rdfs:Literal
mo:MusicArtist -foaf:made-> mo:Record -mo:available_as-> mo:Torrent -dc:format-> rdfs:Literal
mo:MusicArtist -foaf:made-> mo:Record -mo:available_as-> mo:ED2K -dc:format-> rdfs:Literal
mo:MusicArtist -foaf:made-> mo:Record -mo:available_as-> mo:Playlist -dc:format-> rdfs:Literal
Building mo:Track query
• and feed PE: we collect 3 additional path elements
PE = {mo:track_number, mo:license, dc:title, mo:available_as, mo:track*, mo:published_as*, dc:format**}
C = mo:Track
KP(C) = mo:Track with properties mo:track_number, mo:license, dc:title, mo:available_as
[Figure: KP(C) extended with the collected path elements: mo:Record -mo:track-> mo:Track, mo:Signal -mo:published_as-> mo:Track, and dc:format following mo:available_as; shown next to the cluster of five length-3 paths through mo:available_as.]
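The assembly of the collected path elements into a query can be sketched as string building. This is a hedged illustration, not the authors' procedure: the helper names are hypothetical, the variable names follow the query shown in this deck, and the choice of dc:title as the only mandatory pattern is copied from that query rather than derived from centrality.

```python
def outgoing(prop, var):
    """Pattern for a property leaving the central resource ?t."""
    return [f"?t {prop} ?{var} ."]


def incoming(subject_type, prop, var):
    """Patterns for a typed resource that points at ?t."""
    return [f"?{var} a {subject_type} .", f"?{var} {prop} ?t ."]


def build_query(central_type, required, optional_groups):
    """Build a CONSTRUCT query; each optional group becomes one OPTIONAL branch of a UNION."""
    type_pattern = f"?t a {central_type} ."
    construct = [type_pattern] + required + [p for g in optional_groups for p in g]
    branches = " UNION ".join(
        "{OPTIONAL { " + " ".join(g) + " }}" for g in optional_groups
    )
    where = " ".join([type_pattern] + required) + " { " + branches + " }"
    return "CONSTRUCT { " + " ".join(construct) + " } WHERE { " + where + " }"


required = outgoing("dc:title", "t1")
optional_groups = [
    outgoing("mo:available_as", "t2") + ["?t2 dc:format ?f ."],
    outgoing("mo:license", "t3"),
    outgoing("mo:track_number", "t4"),
    incoming("mo:Signal", "mo:published_as", "s"),
    incoming("mo:Record", "mo:track", "r"),
]
query = build_query("mo:Track", required, optional_groups)
```

The generated string omits PREFIX declarations, so they would have to be prepended before sending it to an endpoint.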
Resulting query
CONSTRUCT {
  ?t a mo:Track . ?t dc:title ?t1 . ?t mo:available_as ?t2 .
  ?t2 dc:format ?f . ?t mo:license ?t3 . ?t mo:track_number ?t4 .
  ?s a mo:Signal . ?s mo:published_as ?t . ?r a mo:Record .
  ?r mo:track ?t .
}
WHERE {
  ?t a mo:Track . ?t dc:title ?t1 . {
    {OPTIONAL { ?t mo:available_as ?t2 . ?t2 dc:format ?f }}
    UNION {OPTIONAL { ?t mo:license ?t3 }}
    UNION {OPTIONAL { ?t mo:track_number ?t4 }}
    UNION {OPTIONAL { ?s a mo:Signal . ?s mo:published_as ?t }}
    UNION {OPTIONAL { ?r a mo:Record . ?r mo:track ?t }}
  }
}
Conclusion, ongoing and future work
Conclusion
• Linked Data sets as connected knowledge components (KA)
http://www.ontologydesignpatterns.org/ont/lod-analysis-properties.owl
• Empirical analysis on three LD datasets from the LD cloud
http://stlab.istc.cnr.it/stlab/LOD-Analysis-Intro
• A procedure for building prototypical queries when the ontologies are unknown
Ongoing and future work
• KA has been used for extracting KPs from Wikipedia wikilinks
– “Encyclopedic Knowledge Patterns from Wikipedia links” on Wednesday
– Aemoo: exploratory search demo based on KPs @SWC
• Improving KA expressivity and effectiveness
• Refining the procedure for extracting KPs by exploiting centrality
• Refining and evaluating automatic query building based on cognitive-soundness
• Comparing KA of a dataset with the “top-down” ontologies used in it
• Aligning emerging KPs to general KPs: supporting alignment between datasets and error discovery