Advances in Applied Information Science
Rules Engineering: a Potential Way to
Enhance the Reasoning Process in Linked data
M. El Koutbi1, A. Salah2, I. Khriss3
1 ENSIAS, Mohamed V – Souissi University, Rabat, Morocco
2 Computer science department, Quebec University at Montreal, UQAM
3 Computer science department, Quebec University at Rimouski, UQAR
[email protected], [email protected], [email protected]
Abstract: Linked data are very similar to deductive databases, where reasoning is supported via different entailment regimes. This process of reasoning and inferring new data from existing data may have a great impact on system performance. This paper tries to enhance this process by exploring optimal ways of applying the entailment regime rules on RDF data stores. Three strategies are proposed and compared for different entailment regimes using different data stores. The results of this study show that exploiting rules dependency can have a positive impact when the applied rules of the regime are not strongly linked. In case of strong dependency, the order of applying rules has no impact on the execution time, since several cycles may exist between the rules.
Keywords: Linked data, reasoning process, materialization, entailment regimes, rules dependency, etc.
1. Introduction
The Semantic Web is not just about putting data on the web, but about making links, so that a person or a machine can explore the web of data. The idea of Linked Data, based on RDF triples, offers some solutions to this issue. Linked Data do not represent any particular standard; rather, the term denotes a set of recommended techniques, rules and languages [3] that lead to the publication of data in a way that can be manipulated automatically by machines. Each real-world entity should be uniquely described by a URL identifier; these identifiers can be dereferenced over HTTP to obtain information about the entities. These entity representations should be interlinked to form a global open data cloud. In the case of RDF stores, data is modeled as triples conforming to the subject-predicate-object pattern. Graphs are an alternative way of viewing these triples: vertices stand for subjects and objects, while edges represent the triples themselves and are labeled by predicates. At the implementation level, we can publish RDF triples in RDF/XML syntax and also publish RDFS schemata and OWL ontologies constraining the allowed content of such RDF data. Recently, a significant effort has been made not only in theoretical research, but also in the number of Linked Data stores available. Since RDF triples are modeled as graph data, we cannot directly adopt the existing solutions from relational databases or XML technologies. Thus, several research questions remain open in different directions: Storage & Querying, Extraction & Fusing, Enrichment & Repair, Data Quality Assessment, Browsing & Exploration, Trust & Privacy, etc.
In this paper, we study the impact of applying entailment regimes on RDF stores, focusing on enhancing the performance of the reasoning process. We aim at finding, for each regime, an optimal order of applying rules so that data is inferred rapidly within the RDF store. In the following section, we present some related work. Section 3 gives a brief overview of our approach, and Section 4 discusses in more detail our proposed strategies for rules engineering, which are tested on the RDFS entailment regime. Some recommendations are given in Section 5, which concludes this paper.
ISBN: 978-1-61804-113-5
2. Related Work
In this section, we summarize some research papers related to the reasoning process that try to optimize system performance by exploring rules dependencies.
Kaoudi et al. [1] present the Atlas architecture, a peer-to-peer system for storing, updating and querying RDF(S) data. Research in Atlas has resulted in novel distributed RDFS reasoning algorithms and efficient strategies for RDF query processing on top of DHTs (Distributed Hash Tables), covering one-time and continuous queries. The paper focuses only on one-time queries and describes the Atlas architecture, which includes a reasoning component implementing forward and backward chaining algorithms for RDFS reasoning. Forward chaining means that all inferred triples are pre-computed and stored a priori in the network. Backward chaining is a goal-driven algorithm that starts from a given request and derives all possible answers in a top-down way.
In [2], the authors present a method for extracting an RDF(S) sub-ontology. To decrease the time needed to generate the closure, a parallel reasoning algorithm is presented. The algorithm consists of a simple loop that iterates over the set of entailment rules and terminates when no new statements have been derived in the last iteration. The authors divide the RDFS ontology into two parts: the main body (user-defined concepts) and the restricted body. The main body consists of the classes and the relations between the classes; the restricted body consists of the attributes of the relations. Hence, the ontology extraction is executed on the main body of the ontology.
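The loop described above (iterate over the entailment rules until an iteration derives nothing new) can be sketched as follows. This is only a minimal sequential illustration, not the parallel algorithm of [2]: triples are assumed to be plain Python tuples, the two rule functions correspond to the RDFS subclass rules, and all names (closure, rule_r9, rule_r11, the ex: resources) are hypothetical.

```python
# Triples are (subject, predicate, object) tuples; all names are illustrative.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def rule_r11(triples):
    """Transitivity of rdfs:subClassOf."""
    return {(u, SUBCLASS, x)
            for (u, p1, v) in triples if p1 == SUBCLASS
            for (v2, p2, x) in triples if p2 == SUBCLASS and v2 == v}


def rule_r9(triples):
    """An instance of a class is also an instance of its superclasses."""
    return {(v, TYPE, x)
            for (u, p1, x) in triples if p1 == SUBCLASS
            for (v, p2, u2) in triples if p2 == TYPE and u2 == u}


def closure(triples, rules):
    """Apply every rule in turn until one full iteration derives nothing new."""
    store = set(triples)
    while True:
        derived = set()
        for rule in rules:
            derived |= rule(store) - store
        if not derived:
            return store
        store |= derived


store = {
    ("ex:Novel", SUBCLASS, "ex:Book"),
    ("ex:Book", SUBCLASS, "ex:Document"),
    ("ex:b1", TYPE, "ex:Novel"),
}
closed = closure(store, [rule_r9, rule_r11])
```

On this toy store the loop stops after the third iteration, once ex:b1 has been typed with every superclass of ex:Novel.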
In [5], the authors present MapReduce, a programming model introduced by Google for large-scale data processing. The execution of a MapReduce program applies two user-specified functions, map and reduce, to the input data. The map function processes the input and outputs intermediate key/value pairs. These pairs are partitioned according to the key, and each partition is processed by a reduce function.
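The map/partition/reduce flow can be illustrated in a few lines of single-process Python. This is a sketch of the programming model only, assuming in-memory data; it is not the distributed implementation used in [5], and the helper run_mapreduce and the sample triples are hypothetical.

```python
from collections import defaultdict


def map_fn(triple):
    """Map: emit one intermediate (key, value) pair per triple, keyed by predicate."""
    _subject, predicate, _obj = triple
    yield predicate, 1


def reduce_fn(key, values):
    """Reduce: combine all the values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map on every record, partition the pairs by key, reduce each partition."""
    partitions = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            partitions[key].append(value)
    return dict(reduce_fn(key, values) for key, values in partitions.items())


triples = [
    ("ex:b1", "rdf:type", "ex:Novel"),
    ("ex:b2", "rdf:type", "ex:Book"),
    ("ex:Novel", "rdfs:subClassOf", "ex:Book"),
]
counts = run_mapreduce(triples, map_fn, reduce_fn)
```

Here the job simply counts triples per predicate; a reasoning job would instead key the pairs on the term that two rule antecedents have to share.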
3. Overview of the approach
In general, the reasoning process within linked data takes as its only input a set of rules R to apply to a given store S: S’ = entailment(S, R). The way these rules are applied matters and can have a great impact on machine resources such as CPU and RAM. We propose to add a new input to the process, namely the strategy for applying rules, so that S’ = entailment(S, R, Strategy). This strategy indicates the order in which the rules should be applied.
In this paper, we present some of these strategies based on the dependency graph of rules. This graph can be computed manually or automatically for a given set or subset of rules. Rules are the nodes of the dependency graph, and a directed edge Ri → Rj means that the results of rule Ri can trigger rule Rj. Figure 3 shows an example of a rules dependency graph that contains six rules.
Figure 3: Example of rules dependency graph (nodes R1 to R6).
Based on this dependency graph, some heuristics have been proposed in order to enhance the performance of the reasoning process:
- Heuristic 1: Execute first the rules that have the largest number of outgoing edges; this gives the rule order (r1, r2, r5, r3, r4, r6);
- Heuristic 2: Execute first the rules that have the smallest number of incoming edges; this gives the rule order (r1, r2, r5, r6, r4, r3);
- Heuristic 3: Execute the rules in an order that respects the precedence in the dependency graph; this gives the rule order (r1, r2, r5, r3, r6, r4).
In the following section, we apply and discuss these heuristics on the RDFS entailment regime to study their impact on the reasoning process.
4. Rules engineering
To illustrate our approach, we focus in this paper on the RDFS entailment regime. Different strategies for rules are proposed and applied on different public RDF stores in order to assess their impact on the reasoning process. Based on the last W3C recommendation [3], the list of defined RDFS rules is summarized in Table 1.
The result of a given rule Ri can be used as an antecedent of another rule Rj. This implies a dependency between Ri and Rj. After a detailed study of rules dependency in the RDFS regime, we summarize the dependency results in Table 2. The number 1 at the intersection of a row Ri and a column Rj of the matrix means that rule Ri triggers rule Rj. The * character indicates that, after triggering, only duplicate (not new) information is generated in the RDF store. For example, the intersection between the row R2 and the column R4a is marked with a star because the result of R2, uuu rdf:type xxx, generates uuu rdf:type rdfs:Resource. This information is not new; it was already generated by the antecedent of rule R2, which has the form uuu aaa yyy.
The OUT column for a rule Ri (a row of Table 2) counts the rules triggered by Ri. A high value means that Ri impacts a large number of rules and should be launched before them. The OUT metric, corresponding to Heuristic 1, represents the number of outgoing edges of a given rule in the dependency graph. The best way to apply RDFS rules based on this metric is in descending order, so rules with a high OUT value are invoked first.
The last row, IN, of Table 2 counts the rules that trigger a given rule Ri. A high value for this metric means that Ri should be executed last. The IN metric, corresponding to Heuristic 2, represents the number of incoming arcs of rule Ri in the dependency graph. Rules are therefore applied in ascending order of IN to enhance rules application performance.
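As a minimal sketch of how the OUT and IN orderings can be derived automatically, the following example computes both metrics from an adjacency list. The five-rule graph EDGES is hypothetical (it is neither Figure 3 nor Table 2), and ties are assumed to be broken by the original rule order, a choice the paper does not specify.

```python
# Hypothetical dependency graph: an edge Ri -> Rj means Ri can trigger Rj.
EDGES = {
    "R1": ["R2", "R3"],
    "R2": ["R3", "R4", "R5"],
    "R3": ["R4"],
    "R4": [],
    "R5": ["R4"],
}


def out_degree(edges):
    """OUT metric: number of rules triggered by each rule."""
    return {rule: len(targets) for rule, targets in edges.items()}


def in_degree(edges):
    """IN metric: number of rules that trigger each rule."""
    degree = {rule: 0 for rule in edges}
    for targets in edges.values():
        for target in targets:
            degree[target] += 1
    return degree


def out_order(edges):
    """OUT strategy: descending OUT, ties kept in the original rule order."""
    degree = out_degree(edges)
    return sorted(edges, key=lambda rule: -degree[rule])


def in_order(edges):
    """IN strategy: ascending IN, ties kept in the original rule order."""
    degree = in_degree(edges)
    return sorted(edges, key=lambda rule: degree[rule])


order_by_out = out_order(EDGES)
order_by_in = in_order(EDGES)
```

Because Python's sort is stable, rules with equal metric values keep their sequential position, which makes the two orderings deterministic.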
The PRE strategy follows the precedence order of the dependency graph: if there exists an arc from Ri to Rj, then Ri should be invoked before Rj. Since the graph is cyclic and highly connected, an optimal solution is difficult to find, and several optimal solutions may even exist. To alleviate this strong dependency for the RDFS regime, we derive a simplified dependency matrix (Table 3) in which we preserve only the links that produce new information in the data store. Irrelevant information, such as duplicates or information regarding the RDFS meta-model, is therefore deleted; this information was marked by an * in Table 2.
From this lightened matrix, we notice that the rules R1 and R4ab are not called by any other rule and can therefore be invoked first. Taking rule R1 as root, we can easily compute the minimal spanning tree of the rules dependency graph. This spanning tree gives us an optimal execution order. We can also take R4a or R4b as root and hence derive other optimal solutions. We have opted for the scenario that has rule R1 as root, which produces the following optimal order: R1, R4ab, R2, R3, R7, R13, R9, R12, R10, R8, R6, R5, R11.
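One way to turn a rooted traversal of the dependency graph into an execution order is sketched below. It uses a breadth-first spanning tree, which is an assumption on our side: the paper does not state how its spanning tree is built or how ties are resolved, and the graph EDGES (which contains a cycle between R2 and R4) is hypothetical.

```python
from collections import deque

# Hypothetical dependency graph with a cycle R2 -> R4 -> R2.
EDGES = {
    "R1": ["R2", "R3"],
    "R2": ["R4"],
    "R3": ["R4", "R5"],
    "R4": ["R2"],
    "R5": [],
}


def spanning_tree_order(edges, root):
    """Order rules by a breadth-first spanning tree rooted at `root`.

    Each rule is visited once, so cycles in the graph do not cause loops.
    Rules unreachable from the root are appended in their original order.
    """
    order, seen, queue = [], {root}, deque([root])
    while queue:
        rule = queue.popleft()
        order.append(rule)
        for triggered in edges.get(rule, []):
            if triggered not in seen:
                seen.add(triggered)
                queue.append(triggered)
    order += [rule for rule in edges if rule not in seen]
    return order


pre_order = spanning_tree_order(EDGES, "R1")
```

Choosing another root that no rule triggers yields a different, equally valid order, which mirrors the remark above that several optimal solutions may exist.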
We have performed some experiments with the four strategies, namely SEQ (the sequential order), IN, OUT and PRE, to compare them and to find the best solution for rules application. The rule orders for these strategies are summarized in Table 4.
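The signature S’ = entailment(S, R, Strategy) can be sketched as a function that applies the rules in the order given by the strategy and records the time spent in each rule. This is an illustration under stated assumptions, not the implementation used in the experiments: the store is an in-memory set of tuples, only two toy rules (named after R9 and R11) are provided, and the rules are re-applied until a fixpoint is reached, which the paper does not specify.

```python
import time

TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"


def r9(triples):
    """Instances of a class are instances of its superclasses."""
    return {(v, TYPE, x)
            for (u, p1, x) in triples if p1 == SUBCLASS
            for (v, p2, u2) in triples if p2 == TYPE and u2 == u}


def r11(triples):
    """Transitivity of rdfs:subClassOf."""
    return {(u, SUBCLASS, x)
            for (u, p1, v) in triples if p1 == SUBCLASS
            for (v2, p2, x) in triples if p2 == SUBCLASS and v2 == v}


def entailment(store, rules, strategy):
    """Apply `rules` in `strategy` order until nothing new is inferred.

    Returns the closed store and the cumulative time per rule in ms.
    """
    closed = set(store)
    elapsed_ms = {name: 0.0 for name in strategy}
    changed = True
    while changed:
        changed = False
        for name in strategy:
            start = time.perf_counter()
            new = rules[name](closed) - closed
            elapsed_ms[name] += (time.perf_counter() - start) * 1000.0
            if new:
                closed |= new  # later rules in this pass already see the new triples
                changed = True
    return closed, elapsed_ms


RULES = {"R9": r9, "R11": r11}
store = {
    ("ex:Novel", SUBCLASS, "ex:Book"),
    ("ex:Book", SUBCLASS, "ex:Document"),
    ("ex:b1", TYPE, "ex:Novel"),
}
closed_a, times_a = entailment(store, RULES, ["R9", "R11"])
closed_b, times_b = entailment(store, RULES, ["R11", "R9"])
```

Both orders reach the same closure; only the number of passes and the time per rule can differ, which is the quantity the strategies try to reduce.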
Our tests were run using Sesame 2.6.4 with persistence support provided by the native store. Sesame is an open source RDF database with support for querying and reasoning. It is one of the most compliant tools [Valle, 2011] with the latest version of SPARQL, 1.1. Sesame supports RDFS inference and other entailment regimes, such as OWL-Horst, by coupling external reasoners. Sesame also provides an infrastructure for defining custom inference rules.
RDFS rules were implemented as SPARQL CONSTRUCT queries, as follows:
R1: CONSTRUCT { ?y ?p _:nnn } WHERE { ?x ?p ?y FILTER isLiteral(?y) }
R2: CONSTRUCT { ?u rdf:type ?x } WHERE { { ?a rdfs:domain ?x } . { ?u ?a ?v } . }
R3: CONSTRUCT { ?v rdf:type ?x } WHERE { { ?a rdfs:range ?x } . { ?u ?a ?v } . }
R4: CONSTRUCT { ?u rdf:type rdfs:Resource . ?v rdf:type rdfs:Resource } WHERE { { ?u ?a ?v } . }
R5: CONSTRUCT { ?u rdf:type rdfs:Resource . ?v rdf:type rdfs:Resource } WHERE { { ?u ?a ?v } . }
R6: CONSTRUCT { ?u rdfs:subPropertyOf ?u } WHERE { ?u rdf:type rdf:Property }
R7: CONSTRUCT { ?u ?b ?y } WHERE { { ?a rdfs:subPropertyOf ?b } . { ?u ?a ?y } . }
R8: CONSTRUCT { ?u rdfs:subClassOf rdfs:Resource } WHERE { ?u rdf:type rdfs:Class }
R9: CONSTRUCT { ?v rdf:type ?x } WHERE { { ?u rdfs:subClassOf ?x } . { ?v rdf:type ?u } . }
R10: CONSTRUCT { ?u rdfs:subClassOf ?u } WHERE { ?u rdf:type rdfs:Class }
R11: CONSTRUCT { ?u rdfs:subClassOf ?x } WHERE { { ?u rdfs:subClassOf ?v } . { ?v rdfs:subClassOf ?x } . }
R12: CONSTRUCT { ?u rdfs:subPropertyOf rdfs:member } WHERE { ?u rdf:type rdfs:ContainerMembershipProperty }
R13: CONSTRUCT { ?u rdfs:subClassOf rdfs:Literal } WHERE { ?u rdf:type rdfs:Datatype }
Experiments were run on a laptop with a 1.3 GHz processor and 2 GB of RAM. Measurements were made using data stores of different sizes: 1X = 314 KB, 2X, 4X and 8X. These data sets correspond to XML data published by the national library of France [data.bnf.fr].
For space reasons, we only present the results of the SEQ and PRE strategies. The experimental results of the two strategies are summarized in Table 5, where values represent the execution time in milliseconds (ms), and in Figures 3 and 4.
              SEQ Strategy                    PRE Strategy
Rules      1X     2X     4X     8X        1X     2X     4X     8X
R1       1252   2310   3922   8071      1336   2364   3965   7904
R2         46     72    130    260        39     58    129    352
R3         36     60    120    235        41     65    122    236
R4       1249   2464   4687  10491        42     62   4669   9950
R5         40     64    131    290      1279   2525    125    244
R6         35     59    130    290        44     62    122    304
R7         52     57    125    246        38     58    121    243
R8         37     62    122    278        40     59    120    274
R9         38     63    126    320        37     57    122    303
R10        34     56    120    252        38     62    121    245
R11        37     62    121    244        38     61    123    242
R12        35     57    123    283        46     59    127    236
R13        42     58    120    317        35     59    120    278
Total    2933   5444   9977  21577      3053   5551   9986  20811
Table 5: Execution time in ms for SEQ and PRE strategies.
Figure 3: Execution time for RDFS rules in SEQ strategy.
Figure 4: Execution time for RDFS rules in PRE strategy.
We note that, for both the SEQ and PRE strategies (and also for the other two strategies, IN and OUT), rules R1, R4 and R5 are the most time consuming. This can be explained by the fact that they are applied to all triples of the store; indeed, the data stores contain many objects of Literal type. We notice that all the strategies have the same behavior. This can be explained by the strong dependency between the RDFS rules. Rule engineering will not enhance materialization performance when the set of rules presents a strong dependency.
For the bookstore RDF data, we note that about 25% of the data were inferred as new data. The following table shows the number of triples before and after materialization for all the cases studied.
We have also run some experiments on the Geo-coordinate store of DBpedia [http://www4.wiwiss.fu-berlin.de/benchmarks200801/#dataset]; the same behavior was observed. The growth of the data was about 34%.
For a real RDF data store, the results can be incomplete, since the rules that fire depend on the data store. In some scenarios, some of the RDFS rules will never be triggered. A data store generator is therefore useful to generate data stores in which all the rules have the same chance of being triggered and of inferring new data. We have tested our approach on the OWL entailment regime using a generator. The PRE strategy gave better results than the other strategies, and a data growth of 205% was observed because of the nature of the RDF data store, which is balanced for the OWL rules.
The proposed approach has the merit of being exhaustive and modular: exhaustive, because it takes as input all the rules of a given entailment regime; and modular, because it lets the user specify a subset of rules of interest that she or he prefers to apply to her or his RDF stores. The best order for the optimal strategy is then derived automatically.
5. Conclusion
In this paper, we proposed a set of strategies for rules application for different regimes, namely RDFS and OWL. The first results show that these strategies are of interest when the set of rules is not strongly dependent. In the case of RDFS rules, the four strategies have shown similar behavior due to the strong dependence between the RDFS rules. In the case of the OWL entailment regime, using a subset of weakly linked rules, the PRE strategy has given better results.
We plan to propose and test other heuristics in order to further enhance the reasoning process. We believe that the order of application of rules will also depend on the nature of the data in the store. We plan to define a new metric that combines the dependency of rules with data store statistics and to study its impact on the reasoning process. We also plan to answer the following question: under which conditions will a given strategy perform better than the others?
References:
[1] Zoi Kaoudi, Manolis Koubarakis, Kostis Kyzirakos, Iris Miliaraki, Matoula Magiridou, Antonios Papadakis-Pesaresi. Atlas: Storing, updating and querying RDF(S) data on top of DHTs. Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010) 271–277.
[2] Dave Kolas, Ian Emmons, and Mike Dean. Efficient Linked-List RDF Indexing in Parliament. The 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), pp. 21-32, Washington DC, USA, October 25-29, 2009.
[3] http://www.w3.org/TR/sparql11-query/
[4] Florian Stegmaier, Udo Gröbner, Mario Döller, Harald Kosch and Gero Baese. Evaluation of Current RDF Database Solutions. In: Proceedings of the 10th International Workshop on Semantic Multimedia Database Technologies (SeMuDaTe 2009), Graz, Austria, Vol. 539, pp. 39-55, December 2009.
[5] Urbani J., Kotoulas S., Maassen J., van Harmelen F. & Bal H. (2010), OWL reasoning with WebPIE: calculating the closure of 100 billion triples, In Proceedings of the ESWC '10.
[6] Antoine Zimmermann, Nuno Lopes, Axel Polleres, Umberto Straccia. A General Framework for Representing, Reasoning and Querying with Annotated Semantic Web Data. Computer and Information Science. Pages: 1437-1442.
Appendix
Rule   Antecedent                                              Consequent
R1     uuu aaa lll . (where lll is a plain literal)            _:nnn rdf:type rdfs:Literal .
R2     aaa rdfs:domain xxx . uuu aaa yyy .                     uuu rdf:type xxx .
R3     aaa rdfs:range xxx . uuu aaa vvv .                      vvv rdf:type xxx .
R4a    uuu aaa xxx .                                           uuu rdf:type rdfs:Resource .
R4b    uuu aaa vvv .                                           vvv rdf:type rdfs:Resource .
R5     uuu rdfs:subPropertyOf vvv . vvv rdfs:subPropertyOf xxx .   uuu rdfs:subPropertyOf xxx .
R6     uuu rdf:type rdf:Property .                             uuu rdfs:subPropertyOf uuu .
R7     aaa rdfs:subPropertyOf bbb . uuu aaa yyy .              uuu bbb yyy .
R8     uuu rdf:type rdfs:Class .                               uuu rdfs:subClassOf rdfs:Resource .
R9     uuu rdfs:subClassOf xxx . vvv rdf:type uuu .            vvv rdf:type xxx .
R10    uuu rdf:type rdfs:Class .                               uuu rdfs:subClassOf uuu .
R11    uuu rdfs:subClassOf vvv . vvv rdfs:subClassOf xxx .     uuu rdfs:subClassOf xxx .
R12    uuu rdf:type rdfs:ContainerMembershipProperty .         uuu rdfs:subPropertyOf rdfs:member .
R13    uuu rdf:type rdfs:Datatype .                            uuu rdfs:subClassOf rdfs:Literal .
Table 1: RDFS rules.
R1
R1
R2
R3
R4a
R4b
R5
R6
R7
R8
R9
R10
R11
R12
R13
IN
1
*
1
R2
1*
1*
1*
1*
1
1*
1*
1
1
1
1
11
R3
R4a
R4b
1*
1*
1*
1*
1*
1*
1*
1*
1*
1
1*
1*
1
1
1
1
11
1*
1*
1*
1*
1*
1*
1*
1*
1*
1*
12
R5
R6
R7
1
1
1
1
1*
R8
1
1
1
R9
1
1
1
1*
R10
1
1
1
R11
1
1*
1
1
1
R12
R13
1
1
1
1
1
1
1
1
4
4
1
1*
1*
1*
1*
1*
1*
1*
1*
1*
12
1
1*
1*
1
1
1*
1
1
1
1
1
11
1
1
2
4
1
1
1*
1*
1
1
8
1*
5
6
5
Table 2: Dependency matrix between RDFS rules.
Strategies
SEQ    IN     OUT    PRE
R1     R1     R7     R1
R2     R5     R9     R4
R3     R13    R2     R2
R4     R12    R3     R3
R5     R6     R8     R7
R6     R11    R10    R13
R7     R10    R13    R9
R8     R8     R6     R12
R9     R9     R11    R10
R10    R7     R12    R8
R11    R3     R5     R6
R12    R2     R4     R5
R13    R4     R1     R11
Table 4: Rules’ order for SEQ, IN, OUT and PRE strategies.
R1 R2 R3 R4a
R1
R2
R3
R4a
R4b
R5
R6
R7
R8
R9
R10
R11
R12
R13
R4b R5 R6 R7 R8 R9 R10
R11
R12
R13
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
Table 3: Updated dependency matrix after deletion of duplicate generated data.
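Out column of Table 2 (number of rules triggered by each rule):
Rule   R1   R2   R3   R4a  R4b  R5   R6   R7   R8   R9   R10  R11  R12  R13
Out    3    9    9    5    4    5    6    12   7    10   7    6    6    7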