Advances in Applied Information Science
ISBN: 978-1-61804-113-5

Rules Engineering: a Potential Way to Enhance the Reasoning Process in Linked data

M. El Koutbi1, A. Salah2, I. Khriss3
1 ENSIAS, Mohamed V – Souissi University, Rabat, Morocco
2 Computer science department, Quebec University at Montreal, UQAM
3 Computer science department, Quebec University at Rimouski, UQAR

Abstract: Linked data are very similar to deductive databases, where reasoning is supported via different entailment regimes. This process of reasoning and inferring new data from existing data may have a great impact on system performance. This paper tries to enhance this process by exploring optimal ways of applying the entailment regime rules on RDF data stores. Three strategies were proposed and compared for different entailment regimes using different data stores. The results of this study show that exploiting rule dependencies can have a positive impact when the applied rules of the regime are not strongly linked. In case of strong dependency, the order of applying rules has no impact on the execution time, since several cycles may exist between the rules.

Keywords: Linked data, reasoning process, materialization, entailment regimes, rules dependency, etc.

1. Introduction

The Semantic Web isn't just about putting data on the web, but about making links, so that a person or a machine can explore the web of data. The idea of Linked Data based on RDF data triples gives some solutions to this issue. Linked Data do not represent any particular standard; rather, they are a set of recommended techniques, rules and languages [3] which lead to the publication of data in a way that can be manipulated automatically by machines. Each real-world entity should be uniquely described by a URL identifier; these identifiers can be dereferenced over HTTP to obtain information about the entities. These entity representations should be interlinked to form a global open data cloud.

In the case of RDF stores, data is modeled as triples conforming to a subject-predicate-object pattern. Graphs are an alternative way of viewing these triples: vertices stand for subjects and objects, while edges represent the triples themselves and are labeled by predicates. At the implementation level, we can publish RDF triples in RDF/XML syntax and also publish RDFS schemata and OWL ontologies constraining the allowed content of such RDF data.

Recently, a significant effort was deployed not only in theoretical research, but also in the amount of Linked Data stores available. Since RDF triples are modeled as graph data, we cannot directly adopt the existing solutions from relational databases or XML technologies. Thus, several research questions remain open in different directions: Storage & Querying, Extraction & Fusing, Enrichment & Repair, Data quality Assessment, Browsing & Exploration, Trust & Privacy, etc.

In this paper, we study the impact of applying entailment regimes on RDF stores, focusing on enhancing the performance of the reasoning process. We aim at finding, for each regime, an optimal order of applying rules so that data is inferred rapidly within the RDF store. In the following section, we present some related work. Section 3 presents a brief overview of our approach, and section 4 discusses in more detail our proposed strategies for rules engineering, which are tested on the RDFS entailment regime. Some recommendations are given in section 5, which concludes this paper.

2. Related Work

In this section, we summarize some research papers related to the reasoning process that try to optimize system performance by exploring rules dependencies.

Kaoudi et al. [1] present the Atlas architecture, a peer-to-peer system for storing, updating and querying RDF(S) data.
Research in Atlas has resulted in novel distributed RDFS reasoning algorithms and in efficient strategies for RDF query processing on top of DHTs (Distributed Hash Tables), covering both one-time and continuous queries. The paper focuses only on one-time queries and describes the Atlas architecture, which includes a reasoning component implementing forward and backward chaining algorithms for RDFS reasoning. Forward chaining means that all inferred triples are pre-computed and stored in the network a priori. Backward chaining is a goal-driven algorithm that starts from a given request and derives all possible answers in a top-down way.

In [2], the authors present a method for extracting an RDF(S) sub-ontology. To decrease the time needed to generate the closure, a parallel reasoning algorithm is presented. The algorithm consists of a simple loop that iterates over the set of entailment rules and terminates when no new statements have been derived in the last iteration. The authors divide the RDFS ontology into two parts: the main body (user-defined concepts) and the restricted body. The main body consists of the classes and the relations between the classes; the restricted body consists of the attributes about relations. Hence the ontology extraction is executed on the main body of the ontology.

In [5], the authors present MapReduce, a programming model introduced by Google for large data processing. The execution of a MapReduce program applies two user-specified functions, map and reduce, to the input data. The map function processes the input and outputs some intermediate key/value pairs. These pairs are partitioned according to the key, and each partition is processed by a reduce function.

3. Overview of the approach

In general, the reasoning process within linked data takes as its only input a set of rules R to apply to a given store S:

S’ = entailment(S, R)

The way these rules are applied is important and can have a great impact on machine resources such as CPU and RAM. We propose to add a new input to the process, namely the strategy for applying the rules:

S’ = entailment(S, R, Strategy)

This strategy indicates the best order for applying the rules. In this paper, we present some of these strategies, based on the dependency graph of the rules. This graph can be computed manually or automatically for a given set or subset of rules. Rules are the nodes of the dependency graph, and a directed edge from Ri to Rj means that the results of rule Ri can trigger rule Rj. Figure 3 shows an example of a rules dependency graph that contains six rules.

Figure 3: Example of rules dependency graph (nodes R1 to R6).

Based on this dependency graph, some heuristics have been proposed in order to enhance the performance of the reasoning process:
- Heuristic 1: Execute first the rules that have a large number of outgoing edges; this gives the rule order (r1, r2, r5, r3, r4, r6);
- Heuristic 2: Execute first the rules that have a small number of incoming edges; this gives the rule order (r1, r2, r5, r6, r4, r3);
- Heuristic 3: Execute first the rules that respect the precedence in the dependency graph; this gives the rule order (r1, r2, r5, r3, r6, r4).

In the following section, we apply and discuss these heuristics on the RDFS entailment regime to study their impact on the reasoning process.

4. Rules engineering

To illustrate our approach, we focus in this paper on the RDFS entailment regime. Different strategies for rules are proposed and applied on different public RDF stores in order to assess their impact on the reasoning process. Based on the last W3C recommendation [3], the list of defined RDFS rules is summarized in table 1.

The result of a given rule Ri can be used as an antecedent of another rule Rj. This implies a dependency between Ri and Rj. After a detailed study of rules dependency in the RDFS regime, we summarize the dependency results in table 2. The number 1 at the intersection of a row Ri and a column Rj of the matrix means that the rule Ri triggers the rule Rj. The * character indicates that the triggering generates duplicate information rather than new information in the RDF store. For example, the intersection between the row R2 and the column R4a is marked with a star because the result of R2, uuu rdf:type xxx, will generate uuu rdf:type rdfs:Resource. This information is not new; it was normally already generated by the antecedent of the rule R2, which has the form uuu aaa yyy.

The OUT column for a rule Ri (a row of table 2) gives the number of rules triggered by Ri. A high value means that the rule Ri impacts a large number of rules and must be launched before them. The OUT metric, corresponding to heuristic 1, represents the number of outgoing edges for a given rule. The best way to apply RDFS rules based on this metric is to use a descending order, so rules having a high value of OUT are invoked first.

The last row, IN, of table 2 gives the number of rules that trigger a given rule Ri. A high value for this metric means that the rule Ri should be executed last. The IN metric, corresponding to heuristic 2, represents the number of incoming arcs to the rule Ri in the dependency graph. An ascending order must be chosen to enhance the performance of rule application.

The optimal strategy PRE follows the order given by the dependency graph: if there exists an arc between Ri and Rj, then Ri should be invoked before Rj.
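The three heuristics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-rule graph TRIGGERS is hypothetical (it is not the graph of Figure 3 nor the RDFS matrix of table 2), ties are broken by declaration order, and PRE is approximated by a topological sort from the standard graphlib module, which returns no order at all when the graph contains a cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency graph: an edge Ri -> Rj means Ri can trigger Rj.
# These edges are illustrative only; they are not taken from the paper.
TRIGGERS = {
    "R1": ["R4"],
    "R2": ["R3", "R4"],
    "R3": ["R4"],
    "R4": [],
}


def out_order(triggers):
    """Heuristic 1 (OUT): rules with the most outgoing edges first."""
    return sorted(triggers, key=lambda rule: -len(triggers[rule]))


def in_order(triggers):
    """Heuristic 2 (IN): rules with the fewest incoming edges first."""
    incoming = {rule: 0 for rule in triggers}
    for targets in triggers.values():
        for target in targets:
            incoming[target] += 1
    return sorted(triggers, key=lambda rule: incoming[rule])


def pre_order(triggers):
    """Heuristic 3 (PRE): respect precedence; None if the graph is cyclic."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors = {rule: set() for rule in triggers}
    for rule, targets in triggers.items():
        for target in targets:
            predecessors[target].add(rule)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None


print("OUT:", out_order(TRIGGERS))
print("IN: ", in_order(TRIGGERS))
print("PRE:", pre_order(TRIGGERS))
```

On an acyclic graph, pre_order returns an order in which every rule precedes the rules it triggers. Adding a back edge such as R4 -> R1 makes it return None, which is the situation a strongly connected rule set produces.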
Since the graph is cyclic and highly connected, the optimal solution is difficult to find, and several optimal solutions may even exist. To alleviate this strong dependency for the RDFS regime, we derive a simplified dependency matrix (Table 3) in which we preserve only the links that produce new information in the data store. Irrelevant information, such as duplicates or information regarding the RDFS meta-model, is therefore deleted; this information was marked by an * in Table 2.

From this lightened matrix, we notice that the rules R1 and R4ab are not called by any other rule and can therefore be invoked first. Taking the rule R1 as root, we can easily compute the minimal spanning tree of the rules dependency graph. This spanning tree gives us an optimal execution order. We can also take R4a or R4b as root and hence derive other optimal solutions. We have opted for the scenario that has the rule R1 as root and which produces the following optimal order: R1, R4ab, R2, R3, R7, R13, R9, R12, R10, R8, R6, R5, R11.

We have performed some experiments for the four strategies, namely SEQ (representing the sequential order), IN, OUT and PRE, to compare them and to find the best solution for rules application. The rule orders for these different strategies are summarized in table 4.

Our tests have been done using Sesame 2.6.4 with persistence support given by the native store. Sesame is an open source RDF database with support for querying and reasoning. It is one of the tools most compliant [Valle, 2011] with the latest version, SPARQL 1.1. Sesame supports RDFS inference and other entailment regimes such as OWL-Horst by coupling external reasoners. Sesame also provides an infrastructure for defining custom inference rules. RDFS rules were implemented using SPARQL CONSTRUCT queries, as given in the following:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

R1: CONSTRUCT { ?y ?p _:nnn } WHERE { ?x ?p ?y FILTER isLiteral(?y) }
R2: CONSTRUCT { ?u rdf:type ?x } WHERE { { ?a rdfs:domain ?x } . { ?u ?a ?v } . }
R3: CONSTRUCT { ?v rdf:type ?x } WHERE { { ?a rdfs:range ?x } . { ?u ?a ?v } . }
R4: CONSTRUCT { ?u rdf:type rdfs:Resource . ?v rdf:type rdfs:Resource } WHERE { { ?u ?a ?v } . }
R5: CONSTRUCT { ?u rdf:type rdfs:Resource . ?v rdf:type rdfs:Resource } WHERE { { ?u ?a ?v } . }
R6: CONSTRUCT { ?u rdfs:subPropertyOf ?u } WHERE { ?u rdf:type rdf:Property }
R7: CONSTRUCT { ?u ?b ?y } WHERE { { ?a rdfs:subPropertyOf ?b } . { ?u ?a ?y } . }
R8: CONSTRUCT { ?u rdfs:subClassOf rdfs:Resource } WHERE { ?u rdf:type rdfs:Class }
R9: CONSTRUCT { ?v rdf:type ?x } WHERE { { ?u rdfs:subClassOf ?x } . { ?v rdf:type ?u } . }
R10: CONSTRUCT { ?u rdfs:subClassOf ?u } WHERE { ?u rdf:type rdfs:Class }
R11: CONSTRUCT { ?u rdfs:subClassOf ?x } WHERE { { ?u rdfs:subClassOf ?v } . { ?v rdfs:subClassOf ?x } . }
R12: CONSTRUCT { ?u rdfs:subPropertyOf rdfs:member } WHERE { ?u rdf:type rdfs:ContainerMembershipProperty }
R13: CONSTRUCT { ?u rdfs:subClassOf rdfs:Literal } WHERE { ?u rdf:type rdfs:Datatype }

Experiments were made on a laptop with a 1.3 GHz processor and 2 GB of RAM. Measurements have been done using data stores of different sizes: 1X = 314 KB, 2X, 4X and 8X. These data sets correspond to XML data published by the national library of France [data.bnf.fr]. For space reasons, we only present the results of the SEQ and PRE strategies. The experiment results of the two strategies are summarized in table 5, where values represent the execution time in milliseconds (ms), and in Figures 3 and 4. We note that when applying the rules for the two strategies SEQ and PRE (and also for the other two strategies, IN and OUT), the rules R1, R4 and R5 are the most time-consuming.
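To make the role of the strategy in S’ = entailment(S, R, Strategy) more concrete, the following sketch applies two of the rules of table 1 (R9 and R11) to a toy store until no new triple is produced. It is not the Sesame-based implementation used in the experiments: the rules are plain Python functions over a set of triples, the store contents and the ex: names are invented for the illustration, and the number of passes over the rule list is used as a rough stand-in for cost.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rule_r9(store):
    """R9: (u subClassOf x), (v type u) => (v type x)."""
    sub = {(s, o) for s, p, o in store if p == SUBCLASS}
    return {(v, RDF_TYPE, x)
            for v, p, u in store if p == RDF_TYPE
            for u2, x in sub if u2 == u}


def rule_r11(store):
    """R11: (u subClassOf v), (v subClassOf x) => (u subClassOf x)."""
    sub = {(s, o) for s, p, o in store if p == SUBCLASS}
    return {(u, SUBCLASS, x) for u, v in sub for v2, x in sub if v2 == v}


def entailment(store, rules, strategy):
    """Apply `rules` in the order given by `strategy` until a fixpoint.

    Returns the materialized store and the number of passes over the rules.
    """
    closure = set(store)
    passes = 0
    changed = True
    while changed:
        changed = False
        passes += 1
        for name in strategy:
            new_triples = rules[name](closure) - closure
            if new_triples:
                closure |= new_triples
                changed = True
    return closure, passes


RULES = {"R9": rule_r9, "R11": rule_r11}

# Toy store (hypothetical data, not taken from the data.bnf.fr data sets).
STORE = {
    ("ex:Novel", SUBCLASS, "ex:Book"),
    ("ex:Book", SUBCLASS, "ex:Work"),
    ("ex:b1", RDF_TYPE, "ex:Novel"),
}

for strategy in (["R11", "R9"], ["R9", "R11"]):
    closure, passes = entailment(STORE, RULES, strategy)
    print(strategy, "triples:", len(closure), "passes:", passes)
```

Both orders reach the same closure, but on this toy store running R11 before R9 needs one pass fewer, because the subclass triples produced by R11 are already available when R9 runs.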
Table 5: Execution time in ms for SEQ and PRE strategies.

                 SEQ strategy                      PRE strategy
Rule       1X     2X     4X      8X          1X     2X     4X      8X
R1       1252   2310   3922    8071        1336   2364   3965    7904
R2         46     72    130     260          39     58    129     352
R3         36     60    120     235          41     65    122     236
R4       1249   2464   4687   10491          42     62   4669    9950
R5         40     64    131     290        1279   2525    125     244
R6         35     59    130     290          44     62    122     304
R7         52     57    125     246          38     58    121     243
R8         37     62    122     278          40     59    120     274
R9         38     63    126     320          37     57    122     303
R10        34     56    120     252          38     62    121     245
R11        37     62    121     244          38     61    123     242
R12        35     57    123     283          46     59    127     236
R13        42     58    120     317          35     59    120     278
Total    2933   5444   9977   21577        3053   5551   9986   20811

Figure 3: Execution Time for RDFS rules in SEQ strategy.

Figure 4: Execution Time for RDFS rules in PRE strategy.

The high cost of the rules R1, R4 and R5 can be explained by the fact that they are applied to all triples of the store. Indeed, the data stores contain a lot of objects of Literal type. We notice that all the strategies have the same behavior. This can be explained by the strong dependency between the RDFS rules. Rule engineering will not enhance materialization performance when the set of rules presents a strong dependency.

For the bookstore RDF data, we note that about 25% of the data were inferred as new data. The following table shows the number of triples before and after materialization for all the cases studied. We have also done some experiments on the Geo-coordinate store of DBpedia [http://www4.wiwiss.fu-berlin.de/benchmarks200801/#dataset], and the same behavior has been noticed. The growth of the data was about 34%.

For real RDF data stores, the results can be incomplete, since the rules that fire depend on the data store. In some scenarios, some of the RDFS rules will never be triggered. So, a data store generator will be useful to generate data stores where all the rules have the same chance to be triggered and to infer new data. We have tested our approach on the OWL entailment regime using a generator. The PRE strategy has given better results than the other strategies, and a growth of 205% of the data has been noticed because of the nature of the RDF data store, which is balanced for the OWL rules.

The proposed approach, presented in this paper, has the merit of being exhaustive and modular: exhaustive, because it takes as input all the rules of a given entailment regime; and modular, because it lets the user specify an interesting subset of rules that she or he prefers to apply on her or his RDF stores. An automatic calculation is then done to derive the best order for the optimal strategy.

5. Conclusion

In this paper, we proposed a set of strategies for rule application for the RDFS and OWL regimes. The first results show that these strategies may be of interest in the case where the set of rules is not strongly dependent. In the case of RDFS rules, the four strategies have shown similar behavior due to the strong dependence between the RDFS rules. In the case of the OWL entailment regime, using a subset of weakly linked rules, the PRE strategy has given better results.

We plan to propose and test other heuristics in order to further enhance the reasoning process. We believe that the order of application of rules will also depend on the nature of the data in the store. We plan to define a new metric that combines the dependency of rules with data store statistics and to study its impact on the reasoning process. We also plan to answer the following question: under which conditions will a given strategy perform better than the others?

References:

[1] Zoi Kaoudi, Manolis Koubarakis, Kostis Kyzirakos, Iris Miliaraki, Matoula Magiridou, Antonios Papadakis-Pesaresi. Atlas: Storing, updating and querying RDF(S) data on top of DHTs.
Web Semantics: Science, Services and Agents on the World Wide Web 8 (2010) 271–277.
[2] Dave Kolas, Ian Emmons, and Mike Dean. Efficient Linked-List RDF Indexing in Parliament. The 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), pp. 21-32, Washington DC, USA, October 25-29, 2009.
[3] http://www.w3.org/TR/sparql11-query/
[4] Florian Stegmaier, Udo Gröbner, Mario Döller, Harald Kosch and Gero Baese. Evaluation of Current RDF Database Solutions. In: Proceedings of the 10th International Workshop on Semantic Multimedia Database Technologies (SeMuDaTe 2009), Graz, Austria, Vol. 539, pp. 39-55, December 2009.
[5] Urbani J., Kotoulas S., Maassen J., van Harmelen F. & Bal H. (2010). OWL reasoning with WebPIE: calculating the closure of 100 billion triples. In Proceedings of the ESWC '10.
[6] Antoine Zimmermann, Nuno Lopes, Axel Polleres, Umberto Straccia. A General Framework for Representing, Reasoning and Querying with Annotated Semantic Web Data. Computer and Information Science, pp. 1437-1442.

Appendix

Table 1: RDFS rules.

Rule   Antecedent                                                     Consequent
R1     uuu aaa lll . (where lll is a plain literal)                   _:nnn rdf:type rdfs:Literal .
R2     aaa rdfs:domain xxx . uuu aaa yyy .                            uuu rdf:type xxx .
R3     aaa rdfs:range xxx . uuu aaa vvv .                             vvv rdf:type xxx .
R4a    uuu aaa xxx .                                                  uuu rdf:type rdfs:Resource .
R4b    uuu aaa vvv .                                                  vvv rdf:type rdfs:Resource .
R5     uuu rdfs:subPropertyOf vvv . vvv rdfs:subPropertyOf xxx .      uuu rdfs:subPropertyOf xxx .
R6     uuu rdf:type rdf:Property .                                    uuu rdfs:subPropertyOf uuu .
R7     aaa rdfs:subPropertyOf bbb . uuu aaa yyy .                     uuu bbb yyy .
R8     uuu rdf:type rdfs:Class .                                      uuu rdfs:subClassOf rdfs:Resource .
R9     uuu rdfs:subClassOf xxx . vvv rdf:type uuu .                   vvv rdf:type xxx .
R10    uuu rdf:type rdfs:Class .                                      uuu rdfs:subClassOf uuu .
R11    uuu rdfs:subClassOf vvv . vvv rdfs:subClassOf xxx .            uuu rdfs:subClassOf xxx .
R12    uuu rdf:type rdfs:ContainerMembershipProperty .                uuu rdfs:subPropertyOf rdfs:member .
R13    uuu rdf:type rdfs:Datatype .                                   uuu rdfs:subClassOf rdfs:Literal .

Table 2: Dependency matrix between RDFS rules.

R1 R1 R2 R3 R4a R4b R5 R6 R7 R8 R9 R10 R11 R12 R13 IN 1 * 1 R2 1* 1* 1* 1* 1 1* 1* 1 1 1 1 11 R3 R4a R4b 1* 1* 1* 1* 1* 1* 1* 1* 1* 1 1* 1* 1 1 1 1 11 1* 1* 1* 1* 1* 1* 1* 1* 1* 1* 12 R5 R6 R7 1 1 1 1 1* R8 1 1 1 R9 1 1 1 1* R10 1 1 1 R11 1 1* 1 1 1 R12 R13 1 1 1 1 1 1 1 1 4 4 1 1* 1* 1* 1* 1* 1* 1* 1* 1* 12 1 1* 1* 1 1 1* 1 1 1 1 1 11 1 1 2 4 1 1 1* 1* 1 1 8 1* 5 6 5

Rule   R1   R2   R3   R4a   R4b   R5   R6   R7   R8   R9   R10   R11   R12   R13
Out     3    9    9     5     4    5    6   12    7   10     7     6     6     7

Table 3: Updated dependency matrix after deletion of duplicate generated data.

R1 R2 R3 R4a R1 R2 R3 R4a R4b R5 R6 R7 R8 R9 R10 R11 R12 R13 R4b R5 R6 R7 R8 R9 R10 R11 R12 R13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Table 4: Rules’ order for SEQ, IN, OUT and PRE strategies.

Position   SEQ    IN     OUT    PRE
1          R1     R1     R7     R1
2          R2     R5     R9     R4
3          R3     R13    R2     R2
4          R4     R12    R3     R3
5          R5     R6     R8     R7
6          R6     R11    R10    R13
7          R7     R10    R13    R9
8          R8     R8     R6     R12
9          R9     R9     R11    R10
10         R10    R7     R12    R8
11         R11    R3     R5     R6
12         R12    R2     R4     R5
13         R13    R4     R1     R11
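As a quick consistency check, the OUT order of Table 4 can be re-derived from the Out values listed with Table 2. The sketch below sorts the rules by descending Out value. Two choices in it are assumptions rather than statements from the paper: ties are broken by rule number, and the single R4 entry of Table 4 is given the smaller of the R4a and R4b values.

```python
# Out values as listed with Table 2 (number of rules each rule triggers).
OUT = {
    "R1": 3, "R2": 9, "R3": 9, "R4a": 5, "R4b": 4, "R5": 5, "R6": 6,
    "R7": 12, "R8": 7, "R9": 10, "R10": 7, "R11": 6, "R12": 6, "R13": 7,
}


def merged_out(out):
    """Collapse R4a and R4b into one R4 entry.

    Assumption: R4 takes the smaller of the two values.
    """
    merged = {rule: v for rule, v in out.items() if rule not in ("R4a", "R4b")}
    merged["R4"] = min(out["R4a"], out["R4b"])
    return merged


def rule_number(rule):
    """Numeric part of a rule name such as 'R12'."""
    return int(rule[1:])


def out_strategy(out):
    """Descending Out value; ties broken by rule number (an assumption)."""
    return sorted(out, key=lambda rule: (-out[rule], rule_number(rule)))


print(out_strategy(merged_out(OUT)))
```

Under these two assumptions, the resulting list is R7, R9, R2, R3, R8, R10, R13, R6, R11, R12, R5, R4, R1, which is the OUT column of Table 4.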