2017-03-24-Sparql optimisation and Triplestore use cases Ganesh

SPARQL Optimisation and
Semantic store use cases
Ganesh Selvaraj
[email protected]
1
Semantic web
The Semantic Web is an extension of the Web
through standards by the World Wide Web
Consortium (W3C). The standards promote common
data formats and exchange protocols on the Web,
most fundamentally the Resource Description
Framework (RDF).
2
Semantic web
AAA: “Anyone can say Anything about Any topic”.
This means the data is schema-less or semi-structured.
4
Triplestore
A semantic database to store semantic web data in triples.
Subject: Ganesh
Predicate: LivesIn
Object: Auckland
5
Schema free - challenges
The biggest issue with the Semantic Web (or any schema-free data store) is also
its main advantage: AAA. For example, the same fact can be asserted with two
different predicates:
Ganesh LivesIn Auckland
Ganesh ResidesIn Auckland
6
Adding meaning about schema
Ontologies are used to add meaning to semantic data. An ontology also needs to
be interpreted beforehand.
Reasoners are used to infer more facts from the asserted ones. For example:
ResidesIn SameAs LivesIn
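As a rough illustration of what a reasoner does with such an equivalence, the sketch below derives the LivesIn fact from an asserted ResidesIn fact. The function name `infer` and the tuple layout are assumptions made for this example, and it only handles direct predicate equivalences (no chains of equivalences), so it is far simpler than a real reasoner.

```python
# Hypothetical mini-reasoner: predicate names follow the slides,
# the function and data layout are assumptions for illustration.

def infer(triples, same_as):
    """Return the input triples plus those implied by predicate equivalences."""
    # Treat each equivalence as symmetric: ResidesIn SameAs LivesIn works both ways.
    equivalent = {}
    for a, b in same_as:
        equivalent.setdefault(a, set()).add(b)
        equivalent.setdefault(b, set()).add(a)

    inferred = set(triples)
    for s, p, o in triples:
        # Re-state the fact under every predicate equivalent to p.
        for q in equivalent.get(p, ()):
            inferred.add((s, q, o))
    return inferred


asserted = {("Ganesh", "ResidesIn", "Auckland")}
facts = infer(asserted, same_as=[("ResidesIn", "LivesIn")])
print(sorted(facts))
```

A query for LivesIn now also finds the fact that was only asserted with ResidesIn.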
7
Sample RDF Data
Ganesh livesIn Auckland.
Ganesh likes cars.
John likes cars.
John likes bikes.
John likes surfing.
Ganesh friendof John.
8
SPARQL
SPARQL (a recursive acronym for SPARQL
Protocol and RDF Query Language) is an RDF
query language, used to retrieve data from
triplestores.
1. SQL-like syntax and capabilities (join, filter, aggregate, etc.).
2. Has pattern matching capabilities.
3. Has 4 query forms: SELECT, ASK, CONSTRUCT, DESCRIBE.
10
SPARQL Example
SELECT ?x ?y WHERE {
  ?x LivesIn "Auckland" .
  ?x friendof ?y .
}
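To make the pattern matching concrete, here is a minimal sketch of how such a query could be evaluated against the sample data from the earlier slide. It is a naive in-memory matcher written for illustration, not how a triplestore works internally. The helper names (`match`, `evaluate`) are assumptions, and the predicate is spelled `livesIn` as in the sample data.

```python
# Sample data from the slides; "Auckland" is treated as a plain string term.
TRIPLES = [
    ("Ganesh", "livesIn", "Auckland"),
    ("Ganesh", "likes", "cars"),
    ("John", "likes", "cars"),
    ("John", "likes", "bikes"),
    ("John", "likes", "surfing"),
    ("Ganesh", "friendof", "John"),
]


def is_var(term):
    """Query variables start with '?', as in SPARQL."""
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against its existing value.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def evaluate(patterns, triples):
    """Evaluate the triple patterns one after another, joining on shared variables."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


query = [("?x", "livesIn", "Auckland"), ("?x", "friendof", "?y")]
rows = evaluate(query, TRIPLES)
print(rows)  # [{'?x': 'Ganesh', '?y': 'John'}]
```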
11
Query Optimisation
Query Optimization is defined as the process of
reducing the response time (the time elapsed
from the moment a query started its execution
until the time it returns the result) for a query.
12
Query optimisation Challenges
One of the hardest problems in query optimization is to
accurately estimate the costs of alternative query plans.
Optimizers cost query plans using a mathematical model
of query execution costs that relies heavily on estimates
of the cardinality, or number of tuples, flowing through
each edge in a query plan.
13
Join order optimisation
The order in which the joins are executed in a query is called
a join execution order. Suitable reordering of joins (join-order
optimization) in a query can reduce the query response time
by several orders of magnitude.
14
Join order optimisation
Query version 1:
SELECT ?x WHERE {
  ?x Gender Male .               # pattern A
  ?x hasEmail [email protected] .    # pattern B
}
Query version 2 (the same query with the joins re-ordered):
SELECT ?x WHERE {
  ?x hasEmail [email protected] .    # pattern B
  ?x Gender Male .               # pattern A
}
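The effect of the two orderings can be sketched with a small experiment. The data below is synthetic and purely hypothetical (1000 people, half of them Male, each with a made-up example.org address), and `run` is an assumed helper that performs a simple nested loop join on ?x and reports how many intermediate bindings the first pattern produced.

```python
# Synthetic, hypothetical data: 1000 people, half of them Male, one email each.
triples = []
for i in range(1000):
    triples.append((f"p{i}", "Gender", "Male" if i % 2 == 0 else "Female"))
    triples.append((f"p{i}", "hasEmail", f"user{i}@example.org"))
triple_set = set(triples)

PATTERN_A = ("Gender", "Male")                   # ?x Gender Male
PATTERN_B = ("hasEmail", "user42@example.org")   # ?x hasEmail <one address>


def subjects(pred, obj):
    """All subjects ?x such that (?x pred obj) is in the data."""
    return [s for s, p, o in triples if p == pred and o == obj]


def run(first, second):
    """Nested loop join on ?x; returns (results, number of intermediate bindings)."""
    outer = subjects(*first)
    # One probe of the second pattern per binding produced by the first.
    results = [x for x in outer if (x, *second) in triple_set]
    return results, len(outer)


print(run(PATTERN_A, PATTERN_B))  # (['p42'], 500)
print(run(PATTERN_B, PATTERN_A))  # (['p42'], 1)
```

Both orders return the same answer, but version 2 probes the second pattern once instead of 500 times.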
15
Cost estimation
Put simply, the cost of a pattern can be taken as the number of results that
pattern is likely to return.
For example:
costOf(?x Gender Male) > costOf(?x hasEmail
[email protected]).
17
PDStore query evaluation
At the moment, PDStore uses index nested loop joins to evaluate
queries.
Following join order optimisation, in a nested loop a low-cost
pattern has to be executed before a high-cost pattern.
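The ordering rule can be sketched as follows. This is not PDStore's code: the data, `cost_of` and `order_patterns` are hypothetical, and the cost is obtained here by scanning and counting matches, whereas a real optimiser would work from estimates or stored statistics.

```python
# Hypothetical data and helper names, for illustration only.
triples = [
    ("p1", "Gender", "Male"),
    ("p2", "Gender", "Male"),
    ("p3", "Gender", "Female"),
    ("p1", "hasEmail", "p1@example.org"),
    ("p2", "hasEmail", "p2@example.org"),
]


def cost_of(pattern):
    """Cost = number of triples the pattern matches (None marks an unbound term)."""
    return sum(
        all(want is None or want == got for want, got in zip(pattern, triple))
        for triple in triples
    )


def order_patterns(patterns):
    """Put low-cost patterns first so the outer loop of the join stays small."""
    return sorted(patterns, key=cost_of)


pattern_a = (None, "Gender", "Male")
pattern_b = (None, "hasEmail", "p1@example.org")
plan = order_patterns([pattern_a, pattern_b])
print([cost_of(p) for p in plan])  # [1, 2]
```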
18
PDStore - LSO
LSO (Learning Statistics Optimiser) is a hybrid cost-based and
heuristics-based optimiser used in PDStore.
19
Cost Model
In a query execution plan (QEP), the cost model describes how the cost of a
query, or of part of a query, is generated and stored for later use.
20
SPARQL Cost model Challenges
In relational databases, the cost model is usually developed
against the schemata, but the schema-relaxed nature of RDF
and unpredictable join paths (due to the absence of key
constraints) in SPARQL complicate the cost model for RDF.
A cost model comprising the combination of all RDF terms in
a triple pattern would be sufficient but may result in huge
statistics. Furthermore, creating and maintaining such
exhaustive statistical data in an often changing web-scale
scenario would be very costly in terms of time, hardware and
other resources.
21
LSO Cost model
A predicate-driven cost model, since about 77% or more of
queries have known predicates.
22
Cost information per predicate
Example
Ganesh likes 1000
John likes 5000
Dave likes 2000
Kate likes 10000
23
Abstract Triple Pattern
An abstract triple pattern is one of the possible combinations
of bound and unbound values in a triple pattern, considered
as a concept that is not associated with any particular data.
A bound subject, predicate or object is denoted as S, P or O,
and an unbound value (an unknown query variable) is
denoted as X.
24
Abstract Triple pattern
The possible abstract triple pattern combinations are:
● XXX: All three RDF terms are unbound.
● XPX: Only the Predicate is bound.
● SXX: Only the Subject is bound.
● XXO: Only the Object is bound.
● SXO: Only the Predicate is unbound.
● SPX: Only the Object is unbound.
● XPO: Only the Subject is unbound.
● SPO: All three RDF terms are bound.
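The classification above can be sketched as a small function. The name `abstract_pattern` and the convention that variables start with `?` are assumptions made for this example. The last lines enumerate the eight combinations to confirm the list is complete.

```python
from itertools import product


def abstract_pattern(subject, predicate, obj):
    """Map a triple pattern to its abstract type, e.g. (?x, hasEmail, a) -> 'XPO'."""
    def code(term, letter):
        # A term starting with '?' is an unbound query variable.
        return "X" if term.startswith("?") else letter

    return code(subject, "S") + code(predicate, "P") + code(obj, "O")


print(abstract_pattern("?x", "hasEmail", "someone@example.org"))  # XPO
print(abstract_pattern("Ganesh", "likes", "?what"))               # SPX
print(abstract_pattern("?s", "?p", "?o"))                         # XXX

# Every position is either bound or unbound, giving 2 * 2 * 2 = 8 types.
ALL_TYPES = ["".join(t) for t in product("SX", "PX", "OX")]
print(len(ALL_TYPES))  # 8
```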
25
LSO Cost model
Subject     Predicate                  Object
Predicate   AbstractTriplePatternType  Cost
hasEmail    XPO                        1
hasEmail    SPX                        3
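One way to picture this table is as a lookup keyed by predicate and abstract triple pattern type. The sketch below copies the two rows from the slide. The dictionary layout and the name `learned_cost` are assumptions for illustration and say nothing about how PDStore stores these statistics.

```python
# Rows copied from the slide; the dict layout and function name are assumptions.
COSTS = {
    ("hasEmail", "XPO"): 1,
    ("hasEmail", "SPX"): 3,
}


def learned_cost(predicate, abstract_type):
    """Return the stored cost, or None if this combination has not been seen."""
    return COSTS.get((predicate, abstract_type))


print(learned_cost("hasEmail", "XPO"))  # 1
print(learned_cost("likes", "XPO"))     # None
```

A combination with no stored cost returns None, which is where a heuristic estimate would have to take over.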
26
Learning query cost from query Execution
27
Heuristics to estimate cost
Sel(XXX) > Sel (XXO, SXX, XPX) > Sel (SPX, XPO) >
Sel(SXO) > Sel(SPO)
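The ordering above can be turned into a simple ranking. In this sketch the numeric ranks are arbitrary assumptions that only preserve the order on the slide (a higher rank means more expected results), and `order_by_heuristic` is a hypothetical helper name.

```python
# Higher rank = more expected results, following the ordering on the slide.
# The numbers themselves are arbitrary; only their relative order matters.
HEURISTIC_RANK = {
    "XXX": 4,
    "XXO": 3, "SXX": 3, "XPX": 3,
    "SPX": 2, "XPO": 2,
    "SXO": 1,
    "SPO": 0,
}


def order_by_heuristic(abstract_types):
    """Sort abstract pattern types so the ones expected to return least come first."""
    return sorted(abstract_types, key=HEURISTIC_RANK.__getitem__)


print(order_by_heuristic(["XPX", "SPO", "XPO", "XXX"]))
# ['SPO', 'XPO', 'XPX', 'XXX']
```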
28
Benchmarking Results - LUBM
29
LSO vs Jena
32
Use case 1 - Recommendation engine
33
Use Case - Paper
Smart insights - ESWC 2014
https://pdfs.semanticscholar.org/107a/6e4a63884bbcc4b730d2d3190ff32290fdd0.pdf
34
Thank You
35