The Odyssey Approach for Optimizing Federated
SPARQL Queries
Gabriela Montoya¹, Hala Skaf-Molli², and Katja Hose¹
¹ Aalborg University, Denmark ({gmontoya,khose}@cs.aau.dk)
² Nantes University, France ([email protected])
arXiv:1705.06135v2 [cs.DB] 18 May 2017
Abstract. Answering queries over a federation of SPARQL endpoints requires
combining data from more than one data source. Optimizing queries in such scenarios is particularly challenging not only because of (i) the large variety of possible query execution plans that correctly answer the query but also because (ii)
there is only limited access to statistics about schema and instance data of remote sources. To overcome these challenges, most federated query engines rely
on heuristics to reduce the space of possible query execution plans or on dynamic
programming strategies to produce optimal plans. Nevertheless, these plans may
still exhibit a high number of intermediate results or high execution times because
of heuristics and inaccurate cost estimations. In this paper, we present Odyssey,
an approach that uses statistics that allow for a more accurate cost estimation for
federated queries. Our experimental results show that Odyssey produces query execution plans that are better in terms of data transfer and execution time than state-of-the-art optimizers. Our experiments using the FedBench benchmark show execution time gains of at least one order of magnitude on average.
1 Introduction
Federated SPARQL query engines [1, 4, 7, 13, 17] answer SPARQL queries over a federation of SPARQL endpoints. Query optimization is a particularly complex and challenging task [6] in a federated setting. The query optimizer minimizes processing and
communication costs by selecting only relevant sources for a query. It decomposes the
query into subqueries, and produces a query execution plan with good join ordering and
physical operators. With limited access to statistics, however, most federated query engines rely on heuristics [1,17] to reduce the huge space of possible plans or on dynamic
programming [5, 7] to produce optimal plans. However, these plans may still exhibit
a high number of intermediate results or high execution times because of inadequate
heuristics or inaccurate estimations of cost functions [8].
In this paper, we propose Odyssey, a cost-based query optimization approach for
federations of SPARQL endpoints. Odyssey defines statistics for representing entities
inspired by [11] and statistics for representing links among datasets while guaranteeing
result completeness. In a federated setting, computing statistics naturally requires access to more than one dataset. To reduce the overhead, Odyssey uses entity summaries
to identify links among datasets. This comes at the risk of losing some accuracy in the
link identification but still guarantees that no links will be missed during query optimization, i.e., there is a small risk that more sources are queried than strictly necessary
but the query result will be complete.
Odyssey uses the computed statistics to estimate the sizes of intermediate results and
dynamic programming to produce an efficient query execution plan with a low number
of intermediate results. In summary, this paper makes the following contributions:
• Concise statistics of adequate granularity representing entities and describing links
among datasets while guaranteeing result completeness.
• A lightweight technique to compute federated statistics in a federated setup that
relies on entity summaries to concisely represent entities and find links among
datasets.
• A query optimization algorithm based on dynamic programming using our statistics
to ensure finding the best plan.
• Extensive evaluation using a well-accepted standard benchmark for federated query processing [16] and comparison against a broad range of state-of-the-art related work [5, 7, 14, 17]. The results show Odyssey's superiority, with a speed-up of up to 33 times and a reduction of transferred data of up to 182 times on average.
This paper is organized as follows. Section 2 presents related work. Section 3 describes
the Odyssey approach and algorithms. Section 4 discusses our experimental results.
Finally, conclusions and future work are outlined in Section 5.
2 Related Work
Query optimization in state-of-the-art federated query engines, such as FedX [17] and
ANAPSID [1], relies on heuristics. For instance, FedX [17] integrates the variable
counting heuristic [18], where the relative selectivity of triple patterns is computed according to the presence of constants and variables in the triple patterns. Subjects are considered more selective than objects, and objects more selective than predicates. Rare joins (predicate-object or subject-predicate) are the most selective, and subject-subject joins are more selective than subject-object and object-object joins. These heuristics are
lightweight but might not lead to the best query execution plan [18]. To find an optimal plan, DARQ [13], SPLENDID [7], LHD [19], and SemaGrow [5] rely on dynamic programming. However, given the number of alternative query plans for SPARQL queries
with many triple patterns, dynamic programming is very expensive [8]. Another important factor of query optimization is source selection. Several approaches [1,7,14,17,19]
therefore try to determine the relevance of a source by sending ASK queries, which
increases the costs for a single query but might amortize in large federations for an
overlapping query load. Another technique is to estimate whether combining the data
of multiple sources can lead to any join results, e.g., by computing the intersection of
the sources' URI authorities [14] or by using detailed statistics [9,12]. In Section 4, we show that even if Odyssey's optimization time can be slightly higher than that of the above-mentioned approaches, Odyssey obtains plans that are executed faster.
Statistics also enable using cardinality estimations for query optimization, e.g., by
estimating the selectivity and cardinality of subqueries so that the sizes of intermediate
results transferred between sources can be minimized. Most available statistics [3] use
the Vocabulary of Interlinked Datasets VOID [2], which describes statistics at dataset
level (e.g., the number of triples, different objects and subjects), at the property level
(e.g., for each predicate the number of triples with that predicate and the number of
different subjects and objects in those triples), and at the class level (e.g., the number
of instances of each class). However, approaches based on VOID and other statistics,
such as Q-Trees [9] and PARTrees [12], share the drawback of missing the best query
execution plans because of errors in estimating cardinalities caused by relying on assumptions that often do not hold for arbitrary RDF datasets [11], e.g., a uniform data
distribution and that the results of triple patterns are independent.
Characteristic sets (CS) [11] aim at solving this problem in centralized systems by
capturing statistics about sets of entities having the same set of properties (a star-shaped
query will typically involve more than one triple pattern and hence one join for each of
these properties). This information can then be used to accurately estimate the cardinality and join ordering of star-shaped queries. Typically, any set of joined triple patterns in a query can be divided into connected star-shaped subqueries. Two subqueries in combination with the predicate that links them define a characteristic pair (CP) [8, 10]. Statistics about such CPs can then be used to estimate the selectivity of joins between two star-shaped subqueries. Such cardinality estimations can be combined with dynamic programming on
a reduced space of alternative query plans. Whereas existing work on characteristic sets
and pairs was developed for centralized environments, this paper proposes a solution
generalizing these principles for federated environments.
3 The Odyssey Approach
Inspired by the latest advances in statistics for centralized systems [8, 10, 11], Odyssey
uses statistics about individual datasets to derive detailed statistics for federations that
can be used to optimize SPARQL queries over federations of SPARQL endpoints.
Hence, in the following, we first describe the foundations of our statistics on individual
datasets (Section 3.1) and then propose a novel method for computing such statistics in
a federated environment (Section 3.2). As the detailed statistics cause too much overhead in a federated setup, we propose a method for reducing the sizes of the statistics
(Section 3.3). Finally, we present the Odyssey approach for query optimization and its
main steps (Section 3.4): source selection, join ordering, and query decomposition.
3.1 Dataset Statistics and Optimization
Star-Shaped Subqueries To estimate the cardinality and costs of star-shaped subqueries (i.e., sets of triple patterns in a BGP that share the same subject or object), we exploit the principle that entities sharing the same set of properties are similar. In this context, we refer to the set of an entity's properties as its characteristic set (CS). For instance, in DBpedia 3.5.1 the entity dbr:Gary Goetzman has the characteristic set C1 = {dbo:birthDate, foaf:name, rdf:type, dbo:activeYearsStartYear, rdfs:label, skos:subject}. In total, 260 entities share this set of properties and therefore CS C1.
For each CS C, we compute statistics, i.e., the number of entities sharing C (count(C)) and, for each predicate p in C, the number of triples with predicate p (occurrences(p, C)). Listing 1.1 shows the statistics for the above-mentioned example CS C1. Entities of C1 occur on average in 1 triple with property dbo:birthDate and 3.94 triples with property rdf:type.
Listing 1.1: Statistics for CS C1
{ count: 260,
  elems:
  {{ pred: dbo:birthDate,            occurrences: 260 },
   { pred: foaf:name,                occurrences: 326 },
   { pred: rdf:type,                 occurrences: 1023 },
   { pred: dbo:activeYearsStartYear, occurrences: 260 },
   { pred: rdfs:label,               occurrences: 260 },
   { pred: skos:subject,             occurrences: 1336 }}}
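To make these statistics concrete, the following minimal Python sketch (ours, not the paper's Java implementation; the toy triples are illustrative) derives the CSs of a dataset together with count(C) and occurrences(p, C):

from collections import defaultdict

def compute_cs_statistics(triples):
    """Derive characteristic sets with count(C) and occurrences(p, C)
    from an iterable of (subject, predicate, object) triples."""
    # Group each subject's predicates and count its triples per predicate.
    preds_per_subject = defaultdict(lambda: defaultdict(int))
    for s, p, o in triples:
        preds_per_subject[s][p] += 1
    # Aggregate subjects by their characteristic set (frozenset of predicates).
    stats = defaultdict(lambda: {"count": 0, "occurrences": defaultdict(int)})
    for pred_counts in preds_per_subject.values():
        cs = frozenset(pred_counts)
        stats[cs]["count"] += 1
        for p, n in pred_counts.items():
            stats[cs]["occurrences"][p] += n
    return stats

# Toy data (illustrative, not the real DBpedia triples):
triples = [("dbr:Gary_Goetzman", "dbo:birthDate", '"1952-11-06"'),
           ("dbr:Gary_Goetzman", "foaf:name", '"Gary Goetzman"'),
           ("dbr:Gary_Goetzman", "rdf:type", "dbo:Person"),
           ("dbr:Gary_Goetzman", "rdf:type", "foaf:Person")]
for cs, s in compute_cs_statistics(triples).items():
    print(sorted(cs), s["count"], dict(s["occurrences"]))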
For a star-shaped query, only CSs including all of the query’s predicates are relevant as
entities that only satisfy a subset of these predicates cannot contribute to the answer.
Listing 1.2: Find directors' birth date and the year they started directing
SELECT DISTINCT ?director WHERE {
  ?director dbo:birthDate ?date .           (tp1)
  ?director dbo:activeYearsStartYear ?sy .  (tp2)
  ?director foaf:name ?name                 (tp3)
}
For star-shaped queries asking for the set of unique entities described by some properties (query with DISTINCT modifier), the exact number of answers can be determined precisely (no estimation). For example, the cardinality of the query given in
Listing 1.2 can be obtained by adding up the count(C) of all CSs containing the predicates dbo:birthDate, dbo:activeYearsStartYear, and foaf:name. In DBpedia 3.5.1, there
are 7,059 CSs that include these three predicates, and the total number of entities with
these CSs is 83,438. Formally, the number of entities cardinality(P) for a given set of
properties P is computed as:
\[ cardinality(P) = \sum_{P \subseteq R \,\wedge\, R = cs(e)} count(R) \tag{1} \]
For queries without the DISTINCT modifier, we need to account for duplicates by considering the number of triples with predicate p_i ∈ P that an entity is associated with on average:

\[ estimatedCardinality(P) = cardinality(P) \cdot \prod_{p_i \in P} \frac{occurrences(p_i, P)}{cardinality(P)} \tag{2} \]
In DBpedia 3.5.1, there are 7,059 CSs relevant for the query in Listing 1.2 with 83,438 entities as answers. These 83,438 entities have 109,830 triples with predicate foaf:name, 83,448 with predicate dbo:birthDate, and 110,460 with predicate dbo:activeYearsStartYear. If the same query is considered without the DISTINCT modifier, we estimate 83,438 · (109,830/83,438) · (83,448/83,438) · (110,460/83,438) = 145,417 matching entities in the result, which is very close to the real number (149,440).
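A sketch of formulas (1) and (2) over such statistics; cs_stats maps each CS (a frozenset of predicates) to its count and occurrences, as produced by the sketch above (the function names are ours):

from math import prod

def cardinality(P, cs_stats):
    """Formula (1): exact number of distinct entities whose CS contains P."""
    return sum(s["count"] for cs, s in cs_stats.items() if P <= cs)

def estimated_cardinality(P, cs_stats):
    """Formula (2): duplicate-aware estimate for queries without DISTINCT."""
    card = cardinality(P, cs_stats)
    if card == 0:
        return 0.0
    # Total occurrences of each query predicate over all relevant CSs.
    occurrences = {p: sum(s["occurrences"].get(p, 0)
                          for cs, s in cs_stats.items() if P <= cs)
                   for p in P}
    return card * prod(occurrences[p] / card for p in P)

# For the query in Listing 1.2; on the full DBpedia 3.5.1 statistics this
# would reproduce 83,438 distinct answers and about 145,417 with duplicates.
P = frozenset({"dbo:birthDate", "dbo:activeYearsStartYear", "foaf:name"})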
Once the relevant CSs for a query have been identified, they can be used to find the
join order minimizing the sizes of intermediate results. For the query in Listing 1.2, we
start by estimating the cardinalities for each subquery with two out of the three triple
patterns using formula 1: { tp1, tp2 }: 98,281, { tp1, tp3 }: 209,731, and { tp2, tp3
}: 127,712. The triple pattern not included in the cheapest subquery ({ tp1, tp2 }) is
executed last (tp3). We proceed recursively with the cheapest subquery and determine
the cardinalities for its subsets: { tp1 }: 232,608 and { tp2 }: 143,004. Again, the triple
pattern not included in the cheapest subquery, i.e., tp1, will be executed last of the
currently considered set of triple patterns. As a result, we will execute the join between
tp2 and tp1 first and afterwards compute the join with tp3. We also get the order in
which the triple patterns should be evaluated for the first join: first tp2 and then tp1.
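This join-ordering procedure can be sketched as the following recursion (our reading of the text, reusing cardinality from the previous sketch): the pattern whose removal leaves the cheapest subquery is scheduled last, and the process repeats on the remaining patterns.

def star_join_order(patterns, preds, cs_stats):
    """Order the triple patterns of a star-shaped subquery so that the
    cheapest joins are executed first. `patterns` are labels (e.g., 'tp1');
    `preds` maps each label to its predicate."""
    remaining = list(patterns)
    reverse_order = []  # collected from last-executed to first-executed
    while len(remaining) > 1:
        def cost_without(tp):
            rest = frozenset(preds[t] for t in remaining if t != tp)
            return cardinality(rest, cs_stats)
        last = min(remaining, key=cost_without)  # not in cheapest subquery
        reverse_order.append(last)
        remaining.remove(last)
    reverse_order.extend(remaining)
    return list(reversed(reverse_order))

# With the cardinalities reported above for Listing 1.2, this returns
# ['tp2', 'tp1', 'tp3']: tp2 is joined with tp1 first, tp3 last.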
Arbitrary Queries To estimate the cardinality for queries with more complex shapes,
we need to consider the connections (links) between entities with different CSs. Entity
dbr:Evan Almighty, for example, is linked to dbr:Gary Goetzman and dbr:Tom Hanks
via predicate dbo:producer by triples, such as (dbr:Evan Almighty,
dbo:producer, dbr:Tom Hanks).
The links between CSs via predicates can formally be described by Characteristic
Pairs (CPs) whose statistics capture the number of links between a pair of CSs (Ci and
Cj ) using a particular predicate p (count(Ci , Cj , p)). For example, given the CSs of
dbr:Gary Goetzman, dbr:Tom Hanks, and dbr:Evan Almighty as C1, C2, and C3, the number of links via predicate dbo:producer is described by the CPs (C3, C1, dbo:producer) and (C3, C2, dbo:producer).
Listing 1.3: Find films with director and budget
SELECT DISTINCT ?film ?director WHERE {
  ?film dbo:runtime ?runtime .              (tp1)
  ?film dbo:director ?director .            (tp2)
  ?film dbo:budget ?budget .                (tp3)
  ?director dbo:birthDate ?date .           (tp4)
  ?director dbo:activeYearsStartYear ?sy .  (tp5)
  ?director foaf:name ?name                 (tp6)
}
The number of unique results (pairs of entities, query with DISTINCT modifier) can be
exactly computed (not estimated) using the formula:
\[ cardinality(S_1, S_2, p) = \sum_{S_1 \subseteq T_1 \,\wedge\, S_2 \subseteq T_2} count(T_1, T_2, p) \tag{3} \]
For the query in Listing 1.3, this is \(\sum count(T_1, T_2, dbo{:}director)\) over all pairs with c1 = {dbo:runtime, dbo:director, dbo:budget} ⊆ T1 and c2 = {dbo:birthDate, dbo:activeYearsStartYear, foaf:name} ⊆ T2. For this query, DBpedia 3.5.1 contains 1,509 characteristic pairs linking entities from the 2,144 CSs containing c1 to the 7,059 CSs containing c2 via predicate dbo:director.
If a query does not involve the DISTINCT modifier, the result cardinality is estimated based on the occurrences of the query predicates in the relevant CSs:

\[ estimatedCardinality(S_1, S_2, p) = \sum_{S_1 \subseteq T_1 \,\wedge\, S_2 \subseteq T_2} \left( count(T_1, T_2, p) \cdot \prod_{p_i \in S_1 \setminus \{p\}} \frac{occurrences(p_i, T_1)}{count(T_1)} \cdot \prod_{p_i \in S_2} \frac{occurrences(p_i, T_2)}{count(T_2)} \right) \tag{4} \]
Cardinality estimation thus considers the average number of triples per query predicate in the relevant CPs. The selectivity of predicate p is not part of the product because it is already captured by count(T_1, T_2, p).
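A sketch of formulas (3) and (4); cp_stats maps each (T1, T2, p) to count(T1, T2, p), and cs_stats is as in the earlier sketches (this data layout is our assumption):

def cp_cardinality(S1, S2, p, cp_stats):
    """Formula (3): exact number of distinct linked entity pairs."""
    return sum(c for (T1, T2, q), c in cp_stats.items()
               if q == p and S1 <= T1 and S2 <= T2)

def cp_estimated_cardinality(S1, S2, p, cp_stats, cs_stats):
    """Formula (4): duplicate-aware estimate for queries without DISTINCT."""
    total = 0.0
    for (T1, T2, q), c in cp_stats.items():
        if q != p or not (S1 <= T1 and S2 <= T2):
            continue
        estimate = float(c)
        # p itself is excluded: its selectivity is captured by count(T1, T2, p).
        for pi in S1 - {p}:
            estimate *= cs_stats[T1]["occurrences"].get(pi, 0) / cs_stats[T1]["count"]
        for pi in S2:
            estimate *= cs_stats[T2]["occurrences"].get(pi, 0) / cs_stats[T2]["count"]
        total += estimate
    return total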
Assuming that the order of joins within star-shaped subqueries has already been determined based on the CSs as described above, we treat each star-shaped subquery as a
single meta-node to reduce complexity. We compute the cardinalities of the meta-nodes
using the statistics on CPs and use dynamic programming to determine the optimal join
order that minimizes the sizes of intermediate results.
Although the presentation in this section focuses on subject-subject joins, the same
principle can be applied to other types of joins, e.g., object-object.
[Fig. 1: Federated Computation of Statistics. (a) CSs and CPs at the sources: LMDB (source 1) has cs1,1 = {language, ..., sameAs} and cs1,2 = {link source, link target}, with cp1,1 = (cs1,2, cs1,1, link source); DBpedia (source 2) has cs2,1 = {producer, ..., director} and cs2,2 = {birthDate, ..., subject}, with cp2,1 = (cs2,1, cs2,2, producer). (b) Entity descriptions: subjects(1) = {(cs1,1, {film:28350, ...}), ...} and objects(1) = {(cs1,1, {(language, {linguo:en, ...}), (sameAs, {dbr:Evan Almighty, ...}), ...}), ...}; subjects(2) = {(cs2,1, {dbr:Evan Almighty, ...}), ...} and objects(2) = {(cs2,1, {(producer, {dbr:Gary Goetzman, ...}), (director, {dbr:Tom Shadyac, ...}), ...}), ...}. (c) Federated CPs (and CSs): (cs1,1, cs2,1, sameAs).]
3.2 Federated Statistics and Optimization
In general, triples describing an entity are typically part of the same dataset so that most CSs can be computed independently from each other over each dataset (federated CSs describing entities across multiple datasets are very rare; in FedBench, for instance, they affect less than 0.5% of all CSs). Federated CPs
involving entities from multiple datasets, however, are key elements in the evaluation
of federated queries as they describe the links between entities that cross-dataset joins
rely on. It is therefore very important that the computation of federated statistics never
misses a possible link between datasets because failing to detect a link could lead to incomplete answers during query processing. Other requirements are that the information
obtained from the sources should be light, i.e., considerably smaller than the complete
datasets, and that it is possible to compute the statistics efficiently.
Whereas single dataset statistics can be computed once and provided by the sources
in the same way they currently provide VOID statistics, federated CSs and CPs require
more effort and centralized knowledge about all entities in the considered datasets. A
naive way to compute these federated statistics is evaluating expensive SPARQL queries
with FILTER expressions with NOT EXISTS, but it can take weeks for a dataset with
thousands of CSs. Hence, the sources need to share information about subjects and
objects in their local datasets with the federated query engine. Such information can,
for instance, be obtained efficiently while computing the standard dataset statistics and
then shared with a central component (federated query engine).
The federated query engine can then use this information to compute federated CPs.
Consider, for instance, the two datasets DBpedia and LMDB in Fig. 1; the sources compute statistics about the CS entities (subjects) and their links to other entities (objects) (Fig. 1a). Entity film:28350 has properties {language, ..., sameAs}; hence, it is part of the entities with CS cs1,1 in subjects. Entity dbr:Evan Almighty is the value of predicate sameAs for entity film:28350, so it is part of the set of entities linked through predicate sameAs to CS cs1,1 in objects (Fig. 1b). The overlap between the sets in objects for LMDB and the sets in subjects for DBpedia represents links from dataset LMDB to dataset DBpedia. In our example, entities film:28350 and dbr:Evan Almighty in datasets LMDB and DBpedia are connected through predicate sameAs, constituting the federated CP (cs1,1, cs2,1, sameAs) (Fig. 1c).
Algorithm 1 Compute Federated CPs Algorithm
Input: Pre-computed statistics about the entities of two datasets
Output: A set of federated CPs (CPs) and the number of pairs of entities within each CP (count)

 1: function ComputeFedCPs(objects, subjects)
 2:   CPs ← { }
 3:   count ← { }
 4:   for cs1 ∈ domain(objects) do
 5:     for p ∈ domain(objects(cs1)) do
 6:       entities ← objects(cs1)(p)
 7:       for cs2 ∈ domain(subjects) do
 8:         common ← entities ∩ subjects(cs2)
 9:         CPs ← CPs ∪ { (cs1, cs2, p) }
10:         count((cs1, cs2, p)) ← count((cs1, cs2, p)) + cardinality(common)
11:       end for
12:     end for
13:   end for
14:   return CPs, count
15: end function
Algorithm 1 describes in more detail how to compute federated CPs based only on the pre-computed statistics. First, all common entities in the objects and subjects statistics are identified in line 8. These common entities represent links between CSs cs1 and cs2 with predicate p and are captured by a federated CP (line 9). The count of this federated CP is increased by the number of common entities (line 10).
The algorithm for computing federated CSs follows a similar principle by considering the subjects shared by different datasets and is omitted due to space limitations.
Additionally, links between more than two datasets can be incrementally obtained by
combining two federated CPs/CSs.
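Algorithm 1 translates directly into the following runnable sketch; here, objects maps cs1 → predicate → set of linked object entities and subjects maps cs2 → set of subject entities, mirroring the structures in Fig. 1 (we additionally skip empty intersections so that only actual links are recorded):

from collections import defaultdict

def compute_fed_cps(objects, subjects):
    """Compute federated CPs and their counts from the pre-computed
    per-source statistics, as in Algorithm 1."""
    cps = set()
    count = defaultdict(int)
    for cs1, entities_by_pred in objects.items():
        for p, entities in entities_by_pred.items():
            for cs2, subject_entities in subjects.items():
                common = entities & subject_entities  # line 8: linked entities
                if common:
                    cps.add((cs1, cs2, p))               # line 9
                    count[(cs1, cs2, p)] += len(common)  # line 10
    return cps, count

# Toy example following Fig. 1 (entity sets are illustrative):
objects = {"cs1,1": {"sameAs": {"dbr:Evan_Almighty"}}}
subjects = {"cs2,1": {"dbr:Evan_Almighty", "dbr:Philadelphia"}}
print(compute_fed_cps(objects, subjects))
# one federated CP: ('cs1,1', 'cs2,1', 'sameAs') with count 1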
Listing 1.4: Find movies with sequel, director, and budget
SELECT ?film ?movie WHERE {
  ?film dbo:budget ?budget .
  ?film dbo:director ?director .
  ?movie owl:sameAs ?film .
  ?movie lmdb:sequel ?seq
}
Federated CPs can be used for cardinality estimation and join ordering using the same principles as described in Section 3.1. Consider a federation composed of DBpedia (160,061 CSs) and LMDB (8,466 CSs) with 22,592 federated CPs. Table 1 shows the number of relevant CSs and federated CPs for DBpedia, LMDB, and the example query in Listing 1.4 (asking for movies with information about sequels, directors, and budget). We can use formula 4 to estimate the result cardinality: 237 · (5,627/5,573) · (6,011/5,573) · (774/708) = 282. This is very close to the real cardinality (293). These federated CPs can be computed by Algorithm 1; computing them with SPARQL queries instead would require more than a billion expensive queries, i.e., 40 years if each query were answered in a second. Notice that arbitrary queries can also be handled by decomposing them into connected star-shaped subqueries (as sketched below) and estimating the cardinality of each pair of subqueries, while queries with variables as predicates can be detected early and handled by existing optimizers.
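The decomposition into connected star-shaped subqueries can be sketched by grouping triple patterns on their subject (a simplification of ours that ignores object-rooted stars and variable predicates):

from collections import defaultdict

def star_decomposition(bgp):
    """Group the triple patterns (s, p, o) of a BGP by subject; joins
    between the resulting stars arise from their shared variables."""
    stars = defaultdict(list)
    for s, p, o in bgp:
        stars[s].append((s, p, o))
    return dict(stars)

# The query in Listing 1.4 decomposes into the ?film and ?movie stars:
bgp = [("?film", "dbo:budget", "?budget"),
       ("?film", "dbo:director", "?director"),
       ("?movie", "owl:sameAs", "?film"),
       ("?movie", "lmdb:sequel", "?seq")]
print(star_decomposition(bgp))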
Table 1: DBpedia and LMDB CSs and Federated CPs (FCP) that are relevant for the query in Listing 1.4

DBpedia         # CS: 2,472   # entities: 5,573    # triples (predicate=budget): 5,527   # triples (predicate=director): 6,011
LMDB            # CS: 501     # entities: 708      # triples (predicate=sequel): 774     # triples (predicate=sameAs): 1,416
LMDB → DBpedia  # FCP: 233    # entity pairs: 237  # triples (predicate=sameAs): 237
3.3 Reducing the Sizes of the Statistics
Accurate and complete statistics might be too expensive to store and use efficiently during query processing. In addition, transferring large volumes of statistics to the federated query engine imposes a heavy burden on both the sources and the engine. Therefore, we need to reduce the sizes of the statistics described in the
previous sections. We start by reducing the sizes of the entity descriptions provided
by the sources (subjects and objects in the previous section). To reduce the sizes, we
build upon approaches developed for summarizing sets of triples [9, 12]. One key consideration is to take the “type” of literals and IRIs into consideration. For instance, for
http://dbpedia.org/resource/Denmark, the “type” is http://dbpedia.org/resource.
In this paper, we build upon PARTrees [12], which use Radix Trees to summarize the different types of IRIs, and Q-Trees [9] to summarize triples with a given type of subject, predicate, and object. We extend this approach so that it summarizes entities instead of triples. Hence, instead of using the Radix Tree to encode the different types, we use this structure as a basis to summarize all the entities described in a dataset. At the leaves of the tree, a Q-Tree summarizes hash code representations of the IRI suffixes. As in [12], we include sets of least significant bytes at the Q-Tree leaves to increase their accuracy. Inspired by [14], we use the authority of the IRIs instead of applying heuristics to derive the type from a given IRI.
To ease the computation of federated statistics, the set of least significant bytes is
partitioned according to the CSs of the entities they correspond to. Computation costs
are greatly reduced by pruning large portions of the tree and comparing only a few pairs
of least significant byte sets. An important feature of these summaries is that entities
present in more than one dataset are always detected.
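This no-false-negatives guarantee follows from hashing: equal IRIs always produce equal hash bytes, so intersecting two summaries can overestimate but never miss an overlap. A minimal sketch of ours, with a flat dict standing in for the Radix Tree/Q-Tree structure:

import hashlib

def summarize(entities):
    """Summarize IRIs as least-significant hash bytes grouped by authority."""
    summary = {}
    for iri in entities:
        authority = iri.split("/")[2] if "://" in iri else iri
        lsb = hashlib.sha1(iri.encode()).digest()[-1]  # least significant byte
        summary.setdefault(authority, set()).add(lsb)
    return summary

def may_overlap(summary_a, summary_b):
    """True whenever the underlying entity sets could intersect: a shared
    entity hashes identically on both sides, so no link is ever missed
    (false positives are possible, false negatives are not)."""
    return any(summary_a[auth] & summary_b[auth]
               for auth in summary_a.keys() & summary_b.keys())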
The information encoded in both structures subjects and objects used by Algorithm 1 can be represented by one Radix Tree, storing different components for each
structure (subjects and objects) in one Q-Tree. The subject component includes the set
of least significant bytes for each CS within the same “type”, while the object component also considers the predicate that connects this object with the corresponding CS.
These summaries are typically considerably lighter than the entity summaries discussed in Section 3.2; on the other hand, this might come at the expense of accuracy. For FedBench's DBpedia 3.5.1 subset, a dataset with 43,126,772 triples that occupies 6.1GB, the subjects and objects summaries described in Section 3.2 occupy 650MB and 750MB, respectively (sizes measured on an implementation based on Java's HashSet and HashMap), i.e., 22.46% of the dataset storage space, whereas the PARTree-inspired summary occupies only 68MB, i.e., 1% of the dataset storage space. Regarding quality, this summary still allows for computing all the federated CSs and CPs. Furthermore, to bound the resources used by these statistics, we have reduced the number of CSs to 10,000 as suggested in [8,11]. Only the CSs shared by the greatest numbers of entities are kept; the others are merged into existing CSs where possible, for instance by combining their statistics with those of the smallest superset or by splitting them into subsets that can be combined with other CSs. This may reduce the accuracy of the query cardinality estimation, but it bounds the resources used to store and access these statistics.
Some SPARQL query optimization approaches, such as [15], use lighter summaries (MIPs). In our context, however, MIPs could miss some federated CPs: MIPs allow computing only 13% of the federated CPs for DBpedia, whereas our chosen summaries allow computing 100%. Therefore, we favor summaries that always detect links between datasets when they exist.
We can cope with dataset updates in two ways. For datasets that are rarely updated, the Q-Tree representing the entities of the "type" affected by the updates can be re-computed. For datasets that are updated often, Q-Trees should support the removal of entities; this can easily be done by storing the multiplicity of each least significant byte, so that a byte is removed only if all the entities with that least significant byte have been removed from the dataset.
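The multiplicity-based update handling amounts to a counting multiset at each Q-Tree leaf; a sketch under that assumption:

from collections import Counter

class LeafSummary:
    """Stand-in for a Q-Tree leaf: least significant bytes with multiplicities,
    so a byte disappears only when its last entity has been removed."""
    def __init__(self):
        self.multiplicity = Counter()

    def add(self, lsb):
        self.multiplicity[lsb] += 1

    def remove(self, lsb):
        self.multiplicity[lsb] -= 1
        if self.multiplicity[lsb] <= 0:
            del self.multiplicity[lsb]  # last entity with this byte is gone

    def bytes(self):
        return set(self.multiplicity)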
SELECT DISTINCT * WHERE {
  ?f dbo:budget ?b .                (tp1)
  ?f dbo:director ?d .              (tp2)
  ?m owl:sameAs ?f .                (tp3)
  ?m lmdb:sequel ?s .               (tp4)
  ?d dbo:birthDate ?bd .            (tp5)
  ?d dbo:activeYearsStartYear ?sy   (tp6)
}
(a) Query QF

[(b) Optimized Plan: a join tree over the three star-shaped subqueries, with the subqueries over tp1, tp2 and over tp5, tp6 evaluated at DBpedia and the subquery over tp3, tp4 evaluated at LMDB.]

Fig. 2: Query QF and its Optimized Plan
3.4 Optimizing Federated Queries
Query optimization in Odyssey can logically be divided into the following steps: i) preprocessing and source selection, ii) join order optimization, iii) subquery optimization,
and iv) query completion.
Preprocessing and source selection Identifying sources that can contribute to the query result, or pruning sources that cannot, is a very important task that subsequent steps build upon. Hence, we first parse the query and identify the star-shaped subqueries that it consists of. Then, we use the CSs to identify sources providing data for these subqueries. Afterwards, we consider the links between the star-shaped subqueries and use CPs to identify which sources can still contribute to the result. Note that Odyssey is designed in a way that it will not miss any relevant sources (no false negatives).
Join order optimization Once we have identified the set of relevant sources, we can
estimate cardinalities of subqueries and find the best join order – in our implementation, we compute the cardinalities already in the previous step. We first optimize the
order of joins and triple patterns within each star-shaped subquery as explained in Section 3.1. Afterwards, we optimize the order of joins between these subqueries as explained in Section 3.2 using dynamic programming and a cost function. As the number
of subqueries is usually considerably lower than the number of triple patterns, applying dynamic programming becomes affordable. In our current implementation, the cost
function is solely defined on the cardinalities of intermediate results and how many
results need to be transferred between endpoints during execution. This favors query
plans with selective subqueries. This cost function assumes that all endpoints have the
same characteristics. We can easily extend this cost function by additional parameters
that can be fine-tuned to represent the characteristics of each endpoint individually, e.g.,
communication delays, response times, etc.
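A compact sketch of the dynamic programming step over meta-nodes (restricted to left-deep plans for brevity). The cost definition, result cardinality plus the cost of the best sub-plan, is our simplification, but it is consistent with, e.g., the {?star1, ?star2} entry of Fig. 3 (417 + 1,548 = 1,965); est_card stands for a callable backed by the CS/CP statistics.

from itertools import combinations

def dp_join_order(stars, est_card):
    """Dynamic programming over star-shaped meta-nodes.
    est_card(frozenset_of_stars) -> estimated cardinality of their join.
    Returns (cost, plan) for joining all stars; plans are nested tuples."""
    best = {frozenset([s]): (est_card(frozenset([s])), s) for s in stars}
    for size in range(2, len(stars) + 1):
        for subset in map(frozenset, combinations(stars, size)):
            card = est_card(subset)
            # Extend every sub-plan that is one star smaller.
            candidates = [(card + best[subset - {s}][0],
                           (best[subset - {s}][1], s)) for s in subset]
            best[subset] = min(candidates, key=lambda c: c[0])
    return best[frozenset(stars)]

# With the cardinalities of Fig. 3, the cheapest plan joins ?star1 and
# ?star2 first and adds ?star3 afterwards:
cards = {frozenset({"?star1"}): 1548, frozenset({"?star2"}): 6057,
         frozenset({"?star3"}): 125003,
         frozenset({"?star1", "?star2"}): 417,
         frozenset({"?star2", "?star3"}): 979,
         frozenset({"?star1", "?star3"}): int(1.9e8),
         frozenset({"?star1", "?star2", "?star3"}): 68}
print(dp_join_order(["?star1", "?star2", "?star3"], cards.__getitem__))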
Subquery optimization Finally, we optimize the SPARQL queries that are actually
sent to the endpoints and try to minimize their number. For instance, we combine all
triple patterns and logical subqueries from the previous step into a single SPARQL
query to a particular endpoint whenever possible.
Query completion In this step, Odyssey will take care of combining the partial results
received from the remote endpoints and compute the final result to the original query.
[Fig. 3: Query optimization. The query graph of QF, connecting variables ?f, ?m, and ?d via the predicates budget, director, sameAs, sequel, birthDate, and activeYearsStartYear, is collapsed into three meta-nodes ?star1, ?star2, and ?star3; the table below lists the estimated cardinalities and costs explored by dynamic programming.]

nodes                      cardinality  cost
{?star1}                   1,548        1,548
{?star2}                   6,057        6,057
{?star3}                   125,003      125,003
{?star1, ?star2}           417          1,965
{?star2, ?star3}           979          979
{?star1, ?star3}           1.9e8        1.9e8
{?star1, ?star2, ?star3}   68           1,047

Fig. 3: Query optimization
Example Consider query QF as illustrated in Fig. 2a and the FedBench federation described in Table 2. A state-of-the-art federated query engine, such as FedX, uses ASK
queries to determine the set of relevant sources: 8 sources are relevant for triple pattern
tp3 and one source for each of the other triple patterns. In contrast, Odyssey can benefit from our advanced statistics and in particular from the information about links between datasets. Hence, it determines that instead of 8 sources actually only one can provide data that will contribute to a result, because only source LMDB provides values for the variable ?m in tp3 that can be joined with the values obtained for ?m in tp4. As illustrated in Fig. 2b, Odyssey decomposes QF into 3 star-shaped subqueries, determines
the join order within these subqueries as well as the optimal join order between them.
It then finds a query plan that combines two logical subqueries into a single SPARQL
query that is sent to DBpedia.
Fig. 3 further illustrates the optimization of query QF (Fig. 2a). Its three star-shaped subqueries are identified and collapsed into meta-nodes (?star1 to ?star3). The cardinalities for the nodes ?star1 to ?star3 are computed using CS statistics and formula (1). The cardinalities for pairs of nodes are computed using CP statistics. Then, the dynamic programming algorithm is run, and the cheapest plan according to the cost function is identified. Fig. 2b presents the plan obtained by the DP algorithm for the query in Fig. 2a.
Table 2: FedBench [16] dataset statistics: number of distinct triples (DT), predicates (P), VOID computation time (CT, in s) and sizes (S, in Kb), entity summaries (ES) computation time (CT, in s) and sizes (S, in Mb), characteristic sets (CSs), characteristic pairs (CPs), CS and CP computation time (CT, in s), and federated characteristic pairs (FCPs) and sets (FCSs)

Dataset        | #DT         | #P    | VOID CT (S)  | ES CT (S)     | #CS    | #CP       | CS,CP CT  | #FCP    | #FCS
ChEBI          | 4,772,706   | 28    | 74 (4.7)     | 60.95 (17)    | 978    | 9,958     | 22.15     | 19,360  | 4,042
KEGG           | 1,090,830   | 21    | 13 (3.8)     | 19.36 (3.1)   | 67     | 239       | 9.6       | 13,822  | 4,042
Drugbank       | 517,023     | 119   | 7 (23)       | 4.8 (3.4)     | 3,419  | 12,589    | 5.89      | 103,070 | 0
DBpedia subset | 42,855,253  | 1,063 | 1,456 (182)  | 286.82 (68)   | 10,000 | 1,069,431 | 328       | 6,583   | 1,107
Geonames       | 107,949,927 | 26    | 39,807 (4.6) | 3,520.77 (40) | 673    | 7,707     | 11,397.93 | 322,672 | 2,566
Jamendo        | 1,049,647   | 26    | 16 (5.2)     | 17.51 (4.5)   | 42     | 190       | 9.57      | 1,259   | 0
SWDF           | 103,595     | 118   | 2 (29)       | 2.53 (1.4)    | 547    | 6,713     | 2.25      | 17,557  | 1,107
LMDB           | 6,147,916   | 222   | 378 (45)     | 53.28 (31)    | 8,466  | 94,188    | 34.83     | 359,340 | 0
NYTimes        | 335,119     | 36    | 4 (6.1)      | 4.9 (3.8)     | 47     | 158       | 3.96      | 58,724  | 2,566

4 Evaluation
In this section, we present the results of our experimental study that compares our
approach, Odyssey, with state-of-the-art federated query engines: HiBISCuS (FedX-HiBISCuS, cold and warm cache) [14], SemaGrow [5], FedX (cold and warm cache) [17], and SPLENDID [7]. We also compare Odyssey with an alternative implementation using dynamic programming in combination with VOID statistics (DP-VOID).
Datasets and queries: We use the real datasets and queries proposed in the FedBench
benchmark [16]. Queries are divided into three groups: Linked Data (LD1-LD11), Cross Domain (CD1-CD7), and Life Science (LS1-LS7). They have 2-7 triple patterns, star and hybrid shapes, and between 1 and 9,054 answers. Basic statistics about the datasets are listed in Table 2. We ran each query 10 times and report the averages over the last 9 runs. Standard deviation is included as error bars on the plots.
Implementation: Odyssey is implemented in Java using the Jena library to parse and
transform queries into queries with SPARQL 1.1 service clauses. Our implementation,
available at https://github.com/gmontoya/federatedOptimizer, uses
the FedX 3.1 framework with its native optimization deactivated to execute the generated
query plans.
Hardware configuration: All experiments were run on a virtual machine server with 8 processors, 64GB of RAM, and a 2,294.25 MHz CPU. The server hosted nine Virtuoso
endpoints, one for each dataset in Table 2.
Statistics computation: As DBpedia has a very high number of CSs (160,061), we reduced them to 10,000 by merging (as explained in Section 3.3) without significant changes in the quality of the estimations. Details on the creation times and sizes of the statistics are listed in Table 2; for space reasons, the federated CS and CP computation times and sizes are omitted, but FCPs were computed in at most 2.87m (Geonames-DBpedia, 4.6MB) and FCSs were computed in at most 34s (DBpedia-SWDF, 54KB).
Statistical analysis: The Wilcoxon signed rank test [20] for paired non-uniform data is used to study the significance of the performance improvements obtained with Odyssey's query optimization (the test was computed using the R project, http://www.r-project.org/). It computes a probability value (p-value) that can be used to decide whether an approach significantly outperforms another one; if the p-value is lower than 0.05, we can conclude that Odyssey significantly outperforms the other approaches.
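The same paired test can be reproduced with SciPy (the paper used R); the values below are placeholders, not measured times:

from scipy.stats import wilcoxon

# Paired per-query execution times (ms) for two engines; placeholder values.
odyssey_times = [120, 85, 430, 210, 95, 310, 150]
other_times = [900, 640, 1200, 880, 400, 950, 700]

statistic, p_value = wilcoxon(odyssey_times, other_times)
print(f"p-value = {p_value:.4f}")  # < 0.05 => significant difference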
Evaluation metrics: i) optimization time (OT): the time elapsed from when the query is issued until the optimized query plan is produced; ii) number of selected sources (NSS): the number of sources selected to answer a query; iii) number of subqueries (NSQ): the number of subqueries included in the query plan; iv) execution time (ET): the time elapsed from the start of query evaluation until the complete answer is produced (we used a timeout of 1,800 seconds); v) number of transferred tuples (NTT): the number of tuples transferred from all the endpoints to the query engine during query evaluation.
Result completeness: All approaches produce the complete result set for all non-timed
out queries, except SPLENDID for query LS7.
[Fig. 4: Optimization Time in ms (log scale) for Odyssey, HiBISCuS-Warm/Cold, SemaGrow, FedX-Warm/Cold, SPLENDID, and DP-VOID over queries LD1-LD11, CD1-CD7, and LS1-LS7. CD1 and LS2 have variable predicates, and Odyssey relies on FedX to find their plans.]
4.1 Experimental Results
Optimization time Fig. 4 shows the optimization time (OT) for the studied approaches. Because of the detailed statistics and dynamic programming, one might expect Odyssey to suffer from a considerable overhead in optimization time. As our experimental results show, however, Odyssey's query planner is competitive with most other approaches, with a slight advantage for FedX-Warm and DP-VOID as these systems have very simple optimizers and therefore little overhead; for instance, Odyssey is up to 36 times slower (DP-VOID) and up to 11 times faster (SemaGrow) than other approaches on average. P-values lower than 0.05 confirm that Odyssey's optimization is significantly faster than SemaGrow's and SPLENDID's.
[Fig. 5: Number of Selected Sources (log scale) for the same approaches and queries.]
Number of selected sources As Fig. 5 shows, Odyssey selects only a small number of relevant sources; for instance, at least two times fewer and up to 16 times fewer (DP-VOID) on average. P-values lower than 0.05 confirm that Odyssey selects significantly fewer sources than all the other approaches. For a small number of queries, Odyssey selects more sources than the optimum because our approach sometimes overestimates the set of relevant sources; on the other hand, it never misses any relevant sources.
[Fig. 6: Number of Subqueries (log scale) for the same approaches and queries.]
Number of subqueries As Fig. 6 shows, Odyssey is considerably better than the other approaches regarding the number of subqueries: at least three times fewer and up to 32 times fewer (DP-VOID) on average. P-values lower than 0.05 confirm that Odyssey decomposes the queries into significantly fewer subqueries than all the other approaches. This confirms that Odyssey correctly identifies and exploits cases for which it is advantageous to combine subqueries. The only query for which other approaches use fewer subqueries is LD7; this is caused by the higher number of sources selected by Odyssey in comparison to FedX and SPLENDID, which can benefit from ASK queries and identify exclusive joins that can be executed directly at the remote endpoint.
[Fig. 7: Execution Time in ms (log scale); queries that timed out or produced incomplete answers are marked.]
Execution time Several approaches failed to answer all queries before the timeout: SPLENDID (1 query), SemaGrow (3 queries), and DP-VOID (10 queries). Even when considering only those queries that completed before the timeout, Odyssey is on average 126 times faster than SPLENDID, 33 times faster than SemaGrow, and 24 times faster than DP-VOID. Fig. 7 shows the execution times (ET) for the studied approaches. Odyssey is on average at least 14 times faster than HiBISCuS-Cold/Warm and FedX-Cold/Warm. P-values lower than 0.05 confirm that Odyssey is significantly faster than HiBISCuS-Cold/Warm and FedX-Cold.
[Fig. 8: Number of Transferred Tuples during Query Execution (log scale); queries that timed out or aborted are marked.]
Number of transferred tuples Fig. 8 shows the number of transferred tuples (NTT) for the studied approaches. Odyssey transfers comparable or fewer tuples than the other approaches: at least 1.15 times fewer (SemaGrow) on average. P-values lower than 0.05 confirm that Odyssey transfers significantly fewer tuples than DP-VOID and FedX-Cold/Warm. In contrast to other approaches, Odyssey not only reduces the number of requests sent to the endpoints but also avoids non-selective queries, which significantly reduces network traffic and the local load at the endpoints.
[Fig. 9: Execution Time in ms (log scale) for Odyssey, FedX-Cold-Odyssey, Odyssey-FedX-Cold, FedX-Warm, and FedX-Cold.]
4.2 Combining Odyssey with Existing Optimizers
We have also integrated Odyssey techniques into an existing optimizer (FedX) and obtained:
– Odyssey-FedX-Cold/Warm, which relies on characteristic sets and characteristic pairs to select sources and decompose the query but uses FedX join ordering, and
– FedX-Cold/Warm-Odyssey, which relies on the FedX optimizer for source selection but uses Odyssey for query decomposition and join ordering.
Fig. 9 compares the execution times of these two implementations with Odyssey, FedX-Cold, and FedX-Warm.
In most cases the combined approaches are considerably faster than native FedX. In a few cases, however, their execution time can increase considerably. In these cases, FedX uses the heuristic of executing subqueries with more than one triple pattern first, which does not work well if the query also contains subqueries with a single highly selective triple pattern. On average, the combined approaches are two and three times faster than FedX-Cold, respectively. P-values lower than 0.05 confirm that FedX-Cold-Odyssey is significantly faster than FedX-Cold and FedX-Warm.
For query LD7, Odyssey is slightly slower than FedX whereas FedX-Cold-Odyssey
is considerably faster. For this query it happens that the advantages of both Odyssey
and FedX coincide, i.e., we can take advantage of the good join ordering by Odyssey
but also of the additional pruning based on ASK queries by FedX.
Even if Odyssey's optimization time can be higher than that of existing approaches, Odyssey produces better plans composed of fewer subqueries and fewer selected sources per triple pattern without compromising answer completeness. The benefits of these features are evidenced by significantly faster execution times and less data transferred from the endpoints to the federated query engine.
5 Conclusion
In this paper, we have proposed Odyssey, an approach for optimizing federated SPARQL queries based on statistics that capture detailed information about the data provided by remote endpoints as well as the links between them. This enables more accurate cost estimations, query optimization, and selection of relevant sources. Our extensive experimental evaluation shows that Odyssey produces query execution plans that are better in terms of data transfer and execution time than state-of-the-art optimizers. In our future work, we plan to further improve Odyssey by studying in which situations exactly it is worthwhile to use additional aspects of other optimizers, such as ASK queries and associated statistics.
References
1. M. Acosta, M. Vidal, T. Lampo, J. Castillo, and E. Ruckhaus. ANAPSID: An Adaptive
Query Processing Engine for SPARQL Endpoints. In ISWC’11, pages 18–34, 2011.
2. K. Alexander, R. Cyganiak, M. Hausenblas, and J. Zhao. Describing Linked Datasets. In
LDOW’09, 2009.
3. C. B. Aranda, A. Hogan, J. Umbrich, and P. Vandenbussche. SPARQL Web-Querying Infrastructure: Ready for Action? In ISWC’13, pages 277–293, 2013.
4. C. Basca and A. Bernstein. Querying a messy web of data with avalanche. J. Web Sem.,
26:1–28, 2014.
5. A. Charalambidis, A. Troumpoukis, and S. Konstantopoulos. SemaGrow: optimizing federated SPARQL queries. In SEMANTICS’15, pages 121–128, 2015.
6. O. Görlitz and S. Staab. Federated Data Management and Query Optimization for Linked
Open Data. In New Directions in Web Data Management, pages 109–137. 2011.
7. O. Görlitz and S. Staab. SPLENDID: SPARQL Endpoint Federation Exploiting VOID Descriptions. In COLD’11, 2011.
8. A. Gubichev and T. Neumann. Exploiting the query structure for efficient join ordering in
SPARQL queries. In EDBT’14, pages 439–450, 2014.
9. A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data summaries
for on-demand queries over linked data. In WWW’10, pages 411–420, 2010.
10. M. Meimaris, G. Papastefanatos, N. Mamoulis, and I. Anagnostopoulos. Extended characteristic sets: Graph indexing for SPARQL query optimization. In ICDE'17.
11. T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF
queries with multiple joins. In ICDE’11, pages 984–994, 2011.
12. F. Prasser, A. Kemper, and K. A. Kuhn. Efficient distributed query processing for autonomous RDF databases. In EDBT’12, pages 372–383, 2012.
13. B. Quilitz and U. Leser. Querying distributed RDF data sources with SPARQL. In ESWC,
pages 524–538, 2008.
14. M. Saleem and A. N. Ngomo. HiBISCuS: Hypergraph-Based Source Selection for SPARQL
Endpoint Federation. In ESWC, pages 176–191, 2014.
15. M. Saleem, A. N. Ngomo, J. X. Parreira, H. F. Deus, and M. Hauswirth. DAW: Duplicate-Aware Federated Query Processing over the Web of Data. In ISWC'13, pages 574–590, 2013.
16. M. Schmidt, O. Görlitz, P. Haase, G. Ladwig, A. Schwarte, and T. Tran. FedBench: A
Benchmark Suite for Federated Semantic Data Query Processing. In ISWC’11, pages 585–
600, 2011.
17. A. Schwarte, P. Haase, K. Hose, R. Schenkel, and M. Schmidt. FedX: Optimization Techniques for Federated Query Processing on Linked Data. In ISWC’11, pages 601–616, 2011.
18. M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds. SPARQL basic graph
pattern optimization using selectivity estimation. In WWW’08, pages 595–604, 2008.
19. X. Wang, T. Tiropanis, and H. C. Davis. LHD: optimising linked data query processing using
parallelisation. In WWW’13, 2013.
20. F. Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in Statistics,
pages 196–202. Springer, 1992.