A Fuzzy Extension for the XPath Query Language Alessandro Campi, Sam Guinea, and Paola Spoletini Dipartimento di Elettronica e Informazione - Politecnico di Milano Piazza L. da Vinci 32, I-20133 Milano, Italy campi|guinea|[email protected]

Abstract XML has become a widespread format for data exchange over the Internet. The current state of the art in querying XML data is represented by XPath and XQuery, both of which define binary predicates. In this paper, we advocate that binary selection can at times be restrictive due to the very nature of XML, and to the uses that are made of it. We therefore suggest a querying framework, called FXPath, based on fuzzy logic. In particular, we propose the use of fuzzy predicates for the definition of more “vague” and softer queries. We also introduce a function called “deep-similar”, which aims at substituting XPath’s typical “deep-equal” function. Its goal is to provide a degree of similarity between two XML trees, assessing whether they are similar both structure-wise and content-wise. The approach is exemplified in the field of e-learning metadata.

1 Introduction

In the last few years XML has become one of the most important data formats for information exchange over the Internet. Ever since the advent of XML as a widespread data format, query languages have become of paramount importance. The principal proposals for querying XML documents have been XPath and XQuery. The former is a language that allows for the selection of XML nodes through the definition of “tree traversal” expressions. Although not a fully-fledged query language, it remains sufficiently expressive, and has been widely adopted within other XML query languages for expressing selection conditions. Its main advantage is its rich set of built-in functions. XQuery, on the other hand, is W3C’s current candidate for a fully-fledged query language for XML documents.
It is capable of working on multiple XML documents, of joining results, and of transforming and creating XML structures. It builds upon XPath, which it uses as its selection language. Both XPath and XQuery divide data into those which fully satisfy the selection conditions and those which do not. However, in some scenarios binary conditions are a limited approach to effective querying of XML data. A few considerations justify this claim. First, even when XML schemas exist, data producers do not always follow them precisely. Second, users often end up defining blind queries, either because they do not know the XML schema in detail, or because they do not know exactly what they are looking for; they might, for example, be querying for some vague interest. Third, the same data can sometimes be described using different schemas. As is often the case with semi-structured data, it is difficult to distinguish between the data itself and the structure containing it.

Figure 1. A simplified representation of the structure of the LOM standard (categories such as General, LifeCycle, Technical, Educational, and Rights, with attributes including Title, Description, Language, Format, Requirements, Learning Resource Type, Intended End User, and Semantic Density).

There is an intrinsic overlap between data and structure, and it is very common for “physically” near data to also be semantically related. It is easy to see how basing a binary query on such unsafe grounds can often lead to unnecessary silence. All of these considerations hold in the field of e-learning metadata, which will constitute our explanatory context throughout this paper. In this field of research it is common to describe learning objects (LOs) using the LOM standard1 and its XML representation. A simplified representation of the structure of these documents is shown in Figure 1.
Within this context, we tackle the case in which a user is searching for a certain learning object across multiple and distributed repositories, in which the same content may be stored with slightly different metadata. In this paper, we propose a framework for querying XML data that goes beyond binary selection, and that allows the user to define more “vague” and “softer” selection criteria in order to obtain more results. To do so, we use concepts coming from the area of fuzzy logic. The main idea is that the selection should produce fuzzy sets, which differ from binary sets in the sense that it is possible to belong to them to different degrees. This is achieved through membership functions that consider the semantics behind the selection predicates (through domain-specific ontologies or by calling WordNet), of the form Q : X → [0, 1], where Q(x) indicates the degree to which the data x satisfies the concept Q. Concretely, we propose to extend the XPath query language to accommodate fuzzy selection. The choice has fallen on XPath since it presents a simpler starting point than XQuery. Based on the considerations already stated and analyzed in [1], we define the extensions that constitute FXPath, which fall into the following categories:

– Fuzzy Predicates: The user can express vague queries by exploiting fuzzy predicates, whose semantics can be based on structural relaxations or clarified through the use of domain-specific ontologies.
– Fuzzy Tree Matching: Standard XPath provides a deep-equal function that can be used to assess whether two sequences contain items that are atomic values and are equal, or that are nodes of the same kind, with the same name, whose children are deep-equal. This can be restrictive, so we propose an extension named deep-similar to assess whether the sequences are similar both content-wise and structure-wise.
1 This standard specifies the syntax and semantics of Learning Object Metadata, defined as the attributes required to describe a Learning Object. The Learning Object Metadata standard focuses on the minimal set of attributes needed to allow these Learning Objects to be managed, located, and evaluated. Relevant attributes of Learning Objects include object type, author, owner, terms of distribution, format, and pedagogical attributes such as teaching or interaction style, grade level, mastery level, and prerequisites.

When query results are returned, they are accompanied by a ranking indicating “how much” each data item satisfies the selection condition. For example, while searching for LOs published in a year near 2000, we might retrieve LOs published in 2000, 2001, 2002, etc. Returned items are wrapped into annotations, so as to recall XML tagging:

<!-- RankingDirective RankingValue="1.0" -->
<LO year="2000">
  <title>t1</title>
</LO>
<!-- /RankingDirective -->
<!-- RankingDirective RankingValue="0.8" -->
<LO year="2001">
  <title>t2</title>
</LO>
<!-- /RankingDirective -->
<!-- RankingDirective RankingValue="0.65" -->
<LO year="2002">
  <title>t3</title>
</LO>
<!-- /RankingDirective -->

In this example it is possible to notice the presence of a ranking directive containing the ranking value, a value in the interval [0, 1], for each retrieved item. The closer it is to 1, the better the item satisfies the condition (i.e., being published in a year near 2000). The rest of this paper is structured as follows. Section 2 presents relevant and related work. Section 3 presents our approach to fuzzy predicates. Section 4 presents the concept of fuzzy tree matching and our implementation of the “deep-similar” function. Section 5 brings the two approaches together, and Section 6 concludes this paper.

2 Related Work

Fuzzy sets have been shown to be a convenient way to model flexible queries in [2]. Many attempts to extend SQL with fuzzy capabilities have been undertaken in recent years.
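The ranking mechanism sketched above can be illustrated with a minimal Python example. This is an assumption-laden sketch, not part of the FXPath specification: the triangular membership on the publication year, the six-year spread, and the helper names `year_membership` and `annotate` are all illustrative.

```python
def year_membership(year, target=2000, spread=6):
    """Triangular membership: 1.0 at the target year, falling to 0 at +/- spread.
    The spread of six years is an assumed calibration."""
    return max(0.0, 1.0 - abs(year - target) / spread)

def annotate(items, target=2000):
    """Wrap each item in RankingDirective comments, best matches first."""
    ranked = sorted(items, key=lambda it: -year_membership(it["year"], target))
    out = []
    for it in ranked:
        rank = year_membership(it["year"], target)
        out.append(f'<!-- RankingDirective RankingValue="{rank:.2f}" -->')
        out.append(f'<LO year="{it["year"]}"><title>{it["title"]}</title></LO>')
        out.append('<!-- /RankingDirective -->')
    return "\n".join(out)

print(annotate([{"year": 2002, "title": "t3"}, {"year": 2000, "title": "t1"}]))
```

Running the sketch on LOs from 2002 and 2000 emits the 2000 item first, with ranking 1.0, mirroring the annotated output shown above.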
[3] describes SQLf, a language that extends SQL by introducing fuzzy predicates that are processed on crisp information. Fuzzy quantifiers that allow aggregated concepts to be defined have been proposed in [4] and [5]. The FSQL system [6], developed on top of Oracle, represents imprecise information as possibility distributions stored in standard tables. Users write queries in FSQL, which are then translated into ordinary SQL queries that call functions provided by FSQL to compute the degrees of matching. Different approaches have been defined to compare fuzzy values. [7] proposes measures to evaluate (by considering similarity relations) how close two fuzzy representations are. [8], based on possibility distributions and a semantic measure of fuzzy data, introduces an extended object-oriented database model to handle imperfect as well as complex real-world objects. Some major notions in object-oriented databases, such as objects, classes, object-class relationships, subclass/superclass, and multiple inheritance, are extended to the fuzzy information environment. [9] shows the use of fuzzy querying on the Internet. This paper is an example of how fuzzy querying can be used over widely distributed data sources: it shows how elements of fuzzy logic and linguistic quantifiers can be employed to attain human-consistent and useful solutions. [10] applies fuzzy set methods to multimedia databases, which have a complex structure, and from which documents have to be retrieved and selected depending not only on their contents, but also on the idea the user has of their appearance, through queries specified in terms of user criteria. A stream of research on fuzzy pattern matching (FPM) started in the eighties, and was successfully used in flexible querying of fuzzy databases and in classification.
Given a pattern representing a request expressed in terms of fuzzy sets, and a database containing imprecise or fuzzy attribute values, FPM returns two matching degrees. An example of an advanced technique based on FPM can be found in [11]. [12] proposes a counterpart of FPM, called “Qualitative Pattern Matching” (QPM), for estimating levels of matching between a request and data expressed with words. Given a request, QPM rank-orders the items which possibly, or which certainly, match the requirements, according to the preferences of the user. The problem of fuzzy similarity between graphs is studied in [13]. [14] presents FlexPath, an attempt to integrate database-style query languages such as XPath and XQuery with full-text search on textual content. FlexPath considers queries on structure as a template, and looks for answers that best match this template and the full-text search. To achieve this, FlexPath provides an elegant definition of relaxation on structure and defines primitive operators to span the space of relaxations. Query answering is then based on ranking potential answers on structural and full-text search conditions.

3 Fuzzy Predicates

Differently from classical binary logic, fuzzy logic makes it possible to describe reality using sets to which objects can belong to a certain degree. Fuzzy sets, as introduced in [15], are described through a membership function Q : X → [0, 1] that assigns each object a membership degree for the considered set. The main aspect of querying with vagueness is to analyze and choose the predicates that constitute the basic building blocks for creating queries in the presence of uncertain data. Since the environment in which we apply fuzzy logic is XML, vagueness can occur at different levels of a query: in PCDATA, in attributes, in tag names, and in the information structure. The predicates we will analyze can be applied differently in all these contexts.
Definition 1 The predicate NEAR defines the closeness among different elements and, depending on the type of element being treated, it can assume different meanings:

1. When the predicate NEAR is applied to a PCDATA value, the query selects nodes in which the PCDATA has a value close to the value expressed in the query. In this case the syntax is:

   "[{" selection_node ("NOT")? "NEAR" compare_value "}]"

2. When the predicate NEAR is applied to an attribute value, the query selects nodes in which the attribute has a value close to the value expressed in the query. In this case the syntax is:

   "[{" attribute_name ("NOT")? "NEAR" compare_value "}]"

3. When the predicate NEAR is applied to a tag or to an attribute name, the query selects nodes with a name similar to the name expressed in the query, with the following syntax:

   "[{" ("NOT")? "NEAR" node_name "}]"

4. When the predicate NEAR is inserted into the axis of a path expression, the selection tries to extract elements, attributes or text that are successors of the current node, giving a penalty which is proportional to the result’s distance from the current node. The following syntax is used:

   "/{" ("NOT")? "NEAR::" node_name "}"

Figure 2. An example of a membership function and the concept of NEAR

Let us now analyze how the predicate NEAR can, in practice, be applied to different fields and different data types. When the considered type is numeric, the query process is quite natural. We have defined a set of vocabularies tailored to different application domains. In these vocabularies we define, for each particular type and field, a membership function and an α-cut that induces the concept of closeness. Consider for example the following query:

/LOM[{//duration NEAR PT1H}]

In this case, in terms of LOM duration, closeness can be naturally represented through a triangular membership function, and what is NEAR is induced by an α-cut, with α = 0.6 (see Figure 2).
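A triangular membership with an α-cut, as in Figure 2, can be sketched in a few lines of Python. The 45-minute spread is an assumed calibration (it is not given in the text); with it, a 45-minute LO receives degree 2/3 ≈ 0.67 against the one-hour target, the value used in the worked example of Section 5.

```python
def near_membership(minutes, target=60.0, spread=45.0):
    """Triangular membership centered on the target duration (in minutes).
    The 45-minute spread is an assumption, chosen to reproduce the paper's
    worked value of ~0.67 for a 45-minute LO vs. PT1H."""
    return max(0.0, 1.0 - abs(minutes - target) / spread)

def is_near(minutes, alpha=0.6):
    """Alpha-cut: a duration counts as NEAR when its membership reaches alpha."""
    return near_membership(minutes) >= alpha

print(near_membership(45), is_near(45), is_near(20))
```

A 45-minute duration passes the α = 0.6 cut, while a 20-minute one does not.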
The evaluation of textual information or strings is more complicated. In these cases we consider two different approaches: the first analyzes the lexical similarity between two strings, while the other performs a semantic analysis between words. For the first problem we use the Levenshtein algorithm [16]. Given two strings S and T, it performs all the possible matchings among their characters, obtaining a |S| × |T| matrix from which the distance between the two words is obtained. This approach can be useful when the same word appears in one tag spelled correctly and in another misspelled, but in general it does not help us identify two words with the same meaning but different lexical roots. For analyzing similarity from a semantic point of view it is necessary to integrate the system with a vocabulary that contains all possible synonyms. The vocabulary we use is JWordNet2. In our querying process, we want to find both misspelled versions of a word x and words with the same meaning. That is why we use both evaluation methods, and consider the maximum value obtained.

2 JWordNet is available at http://wordnet.princeton.edu/.

The previous analysis of the use of the predicate NEAR for numeric and textual fields covers the first three meanings given in Definition 1. Let us now analyze the fourth meaning, in which NEAR is applied within an axis.

/LOM[{/LOM/NEAR::duration}]

In this case, we search for a duration element placed near a LOM element. The degree of satisfaction of this predicate is a function of the number of steps needed to reach the duration element starting from the LOM one. Besides the predicate NEAR we also introduce two other predicates:

– APPROXIMATELY allows the selection, from a document, of the elements with a given name that have a number of direct descendants close to the one indicated in the query.
It is a derived predicate that can be substituted by a NEAR predicate applied to the result of the COUNT operator on the children of a given node.

– BESIDE, applicable only to element names, is used to find the nodes that are close to the given node but not directly connected to it. The idea is to perform a horizontal search in the XML structure to find a given element’s neighbors.

4 Deep-Similar

The deep-similar function is a fuzzy calculation of the distance between two XML trees based on the concept of Tree Edit Distance, a well-known approach for calculating how much it costs to transform one tree (the source tree) into another (the destination tree). Our novel contribution is that we consider structure, content, and the intrinsic overlap that can exist between the two. In particular, semantics are considered using WordNet’s system of hypernyms, as described in Section 3. Another novel aspect of our approach is the introduction of a new Tree Edit Operation, called Permute. This new operation is added to the classic set of operations, which includes Insert, Delete, and Modify. It is specifically introduced to tackle situations in which two nodes are present in both the source and the destination tree, but in a different order. This is an important aspect that cannot be ignored, since order often represents an important contribution to a tree’s content. Moreover, we advocate that the costs of these operations cannot be given once and for all, but must depend on the nodes being treated.

Definition 2 (Deep-similar) Given two XML trees T1 and T2, deep-similar(T1, T2) is the function that returns their degree of similarity as a value in the interval [0, 1]. This degree of similarity is given as 1 − (the cost of transforming T1 into T2 using Tree Edit Operations). Therefore, if two trees are completely different, their degree of similarity is 0; if they are exactly the same, both structure-wise and content-wise, their degree of similarity is 1.
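The two string-evaluation methods of Section 3, which deep-similar also reuses for comparing node content, can be sketched as follows. The Levenshtein part follows the classic dynamic-programming formulation; the semantic part is stubbed with a tiny synonym table standing in for the JWordNet lookup, reusing the learner/student degree of 0.769 from the running example.

```python
def levenshtein(s, t):
    """Edit distance via the classic (|s|+1) x (|t|+1) dynamic-programming matrix,
    kept one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def lexical_similarity(s, t):
    """Normalize the distance into a similarity in [0, 1]."""
    longest = max(len(s), len(t)) or 1
    return 1.0 - levenshtein(s, t) / longest

# Illustrative stand-in for the JWordNet-based semantic lookup.
SYNONYMS = {frozenset({"learner", "student"}): 0.769}

def semantic_similarity(s, t):
    if s == t:
        return 1.0
    return SYNONYMS.get(frozenset({s, t}), 0.0)

def term_similarity(s, t):
    """As in the paper: evaluate both methods and keep the maximum."""
    return max(lexical_similarity(s, t), semantic_similarity(s, t))
```

With this combination, a misspelled tag scores well lexically, while a synonym such as student for learner scores well semantically, and either route can carry the match.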
In order to transform the XML tree T1 into T2, the deep-similar function can use the following Tree Edit Operations:

1.  distribute-weight (Tree T, Weight w)
2.  {
3.    w_root = f^x / m^y
4.    annotate the root node with (w_root * w)
5.    for each first-level sub-tree s
6.    {
7.      w_subTree = b * Ls/LT + (1-b) * Is/IT
8.      distribute-weight(s, (1-w_root) * w * w_subTree)
9.    }
10. }

Figure 3. The distribute-weight algorithm

Definition 3 (Insert) Given an XML tree T, an XML node n, a location loc (defined through a path expression that selects a single node p in T), and an integer i, Insert(T, n, loc, i) transforms T into a new tree T′ in which node n is added to the first-level children nodes of p in position i.

Definition 4 (Delete) Given an XML tree T, and a location loc (defined through a path expression that selects a single node n in T), Delete(T, loc) transforms T into a new tree T′ in which node n is removed.

Definition 5 (Modify) Given an XML tree T, a location loc (defined through a path expression that selects a single node n in T), and a new value v, Modify(T, loc, v) transforms T into a new tree T′ in which the content of node n is replaced by v.

Definition 6 (Permute) Given an XML tree T, a location loc1 (defined through a path expression that selects a single node n1 in T), and a location loc2 (defined through a path expression that selects a single node n2 in T), Permute(T, loc1, loc2) transforms T into a new tree T′ in which the locations of nodes n1 and n2 are exchanged.

Before these operations can be applied to the source tree, two fundamental steps must be performed. The first consists in weighing the nodes within the two XML trees, in order to discover the importance they have within the trees. This will prove fundamental in calculating how much the edit operations cost. The second consists in matching the nodes in the source tree with those in the destination tree, and vice versa.
This step is paramount in discovering onto which nodes the different Tree Edit Operations must be performed. The costs of these Tree Edit Operations will be presented as soon as we complete a more in-depth explanation of the two aforementioned steps.

4.1 Tree Node Weighing

Tree node weighing consists in associating a weight, a value in the interval [0, 1], to each and every node in an XML tree, by taking into consideration the structural properties the node possesses due to its position within the tree. The weighing algorithm we propose is devised to maintain a fundamental property: the sum of the weights associated with all the nodes in a tree must be equal to 1.

Figure 4. The example: two LOM trees whose Educational sub-trees are compared. Each node is annotated with its computed weight (e.g., Educational 0.165; first-level sub-trees such as Intended End User and Learning Resource Type 0.278; deeper nodes 0.116, 0.081, 0.048, and 0.033).

In our framework, a function called distribute-weight is provided for this very reason. As seen in Figure 3, it is a recursive function that traverses the XML tree, annotating each node with an appropriate weight. It is recursive in order to mimic the intrinsic structural recursiveness of XML trees, which are always built of one root node and any number of sub-trees. The algorithm is initially called passing it the entire tree to be weighed, and its total weight of 1. The first step is to decide how much of the total weight should be associated with the root node (code line 3).
Intuitively, the weight, and therefore the importance, of any given node is directly proportional to the number of first-level children nodes it possesses (variable f in code line 3) and inversely proportional to the total number of nodes in the tree (variable m in code line 3). The relationship is calibrated through the use of two constant exponents, x and y. After a great number of experiments in various contexts, the values 0.2246 and 0.7369 (respectively) have proven to give good results. Regarding the example shown in Figure 4, we are interested in finding LOMs that possess similar educational metadata. The FXPath selection

/LOM{[deep-similar(Educational, /LOM[1]/Educational)]}

asks the system to provide just that. The Educational sub-tree from the left LOM is compared for similarity to the Educational sub-tree of the right LOM. Therefore, the algorithm is initially called passing it the sub-tree starting at node Educational and a weight of 1 to be distributed. The variable w_root is calculated as 3^0.2246 / 16^0.7369, which is equal to 0.165. The second step consists in deciding how to distribute the remaining weight onto the sub-trees. This cannot be done by simply looking at the number of nodes each of these sub-trees possesses. The importance, and therefore the weight, of a sub-tree depends both on the amount of “data” it contains (typically in its leaf nodes) and on the amount of “structure” it contains (typically in its intermediate nodes). These considerations are taken into account in code line 7, in which the variable Ls indicates the number of leaf nodes in sub-tree s, while the variable LT indicates the number of leaf nodes in the tree T. Likewise, the variable Is indicates the number of intermediate nodes in sub-tree s, while the variable IT indicates the number of intermediate nodes in the tree T. The balance between leaf and intermediate nodes is set through the use of the constant b.
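A runnable Python sketch of the weighing procedure of Figure 3 follows, under stated assumptions: trees are modeled as (name, children) tuples, leaves keep their full share, and, deviating slightly from the paper, the line-7 sub-tree shares are normalized so that the sum-to-1 property provably holds. X, Y, and B are the calibration constants reported in the text.

```python
X, Y, B = 0.2246, 0.7369, 0.6  # calibration constants from the text

def count_nodes(tree):
    _, children = tree
    return 1 + sum(count_nodes(c) for c in children)

def count_leaves(tree):
    _, children = tree
    return 1 if not children else sum(count_leaves(c) for c in children)

def distribute_weight(tree, w, out=None):
    """Annotate every node with its share of weight w; shares sum to w."""
    if out is None:
        out = []
    name, children = tree
    f, m = len(children), count_nodes(tree)
    # Root share: proportional to fan-out, inversely to tree size (line 3).
    w_root = (f ** X) / (m ** Y) if children else 1.0
    out.append((name, w_root * w))
    if children:
        LT = count_leaves(tree)
        IT = max(m - LT, 1)  # intermediate nodes, root included
        shares = []
        for s in children:
            Ls = count_leaves(s)
            Is = count_nodes(s) - Ls
            shares.append(B * Ls / LT + (1 - B) * Is / IT)  # line 7
        total = sum(shares)
        # Normalization is our addition, so the weights provably sum to w.
        for s, share in zip(children, shares):
            distribute_weight(s, (1 - w_root) * w * share / total, out)
    return out
```

Calling `distribute_weight(tree, 1.0)` returns a list of (node name, weight) pairs whose weights sum to 1, the fundamental property stated above.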
Our experiments have demonstrated that the algorithm is more effective if we balance the formula slightly in favor of the leaf nodes by setting b to 0.6. Code line 7, however, only gives the percentage of the remaining weight that should be associated with each sub-tree. The actual weight is calculated in code line 8 through an appropriate multiplication, and passed recursively to distribute-weight together with the respective sub-tree. Reprising our running example, the remaining weight (1 − 0.165), equal to 0.835, is distributed onto the three sub-trees. The percentage of the weight that is associated with sub-tree Intended End User is 0.6 ∗ 2/6 + 0.4 ∗ 2/6, which is equal to 0.333 (see code line 7). Therefore, the algorithm is called recursively passing sub-tree Intended End User and a weight of 0.835 ∗ 0.333, which is equal to 0.278. For reasons of space, the weights associated with the remaining nodes are shown directly in Figure 4.

4.2 Tree Node Matching

Tree Node Matching is the last step in determining which Tree Edit Operations must be performed to achieve the transformation. Its goal is to establish a matching between the nodes in the source and the destination trees. Whenever a match for a given node n cannot be found in the other tree, it is matched to the null value. In this step, we use an algorithm that takes into account more complex structural properties the nodes might have, and, for the first time, semantic similarity. It scores the candidate nodes by analyzing how “well” they match the reference node. The candidate node with the highest score is considered the reference node’s match. Regarding structural properties, the algorithm considers the following characteristics: number of direct children, number of nodes in the sub-tree, depth of the sub-tree, distance from the expected position, position of the node with respect to its parent, positions of the nodes in the sub-tree, same value for the first child node, and same value for the last node.
Each of these characteristics can give from a minimum of 1 point to a maximum of 5 points. Regarding semantic similarity, the algorithm looks at the nodes’ tag names. These are compared using, once again, WordNet’s system of hypernyms (see Section 3). If the two terms are exactly the same, 1 point is given; if not, their degree of similarity, a value in the interval [0, 1], is considered.

4.3 Costs

The considerations presented in Sections 4.1 and 4.2 determine the costs the Tree Edit Operations have in our framework:

– The cost of the Insert edit operation corresponds to the weight the node being inserted has in the destination tree.
– The cost of the Delete edit operation corresponds to the weight the node being deleted had in the source tree.
– The cost of the Modify edit operation can be seen as the deletion of a node from the source tree, and its subsequent substitution by means of an insertion of a new node containing the new value. This operation does not modify the tree’s structure; it only modifies its content. This is why it is necessary to consider the degree of similarity existing between the node’s old term and its new one. The cost is therefore k ∗ w(n) ∗ (1 − Sim(n, destinationNode)), where w(n) is the weight the node being modified has in the source tree, the function Sim gives the degree of similarity between node n and the destination value, and k is a constant (0.9).
– The Permute edit operation does not modify the tree’s structure. It only modifies the semantics that are intrinsically held in the order in which the nodes are placed. Therefore, its cost is h ∗ [w(a) + w(b)], where w(a) is the weight of node a, w(b) is the weight of node b, and h is a constant (0.36).

All the above cost formulas are directly proportional to the weights of the nodes being treated. All of the operations, except Permute, can also be used directly on sub-trees, as long as their total weights are used.
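The cost formulas above translate directly into code. The following sketch takes the node weights and similarity degrees as given inputs and recomputes the running example; note that exact arithmetic gives approximately 0.237, which matches the 0.238 reported below up to the rounding of the per-operation costs.

```python
K, H = 0.9, 0.36  # constants for Modify and Permute, as given in the text

def insert_cost(w_dest):
    """Cost of Insert: the node's weight in the destination tree."""
    return w_dest

def delete_cost(w_source):
    """Cost of Delete: the node's weight in the source tree."""
    return w_source

def modify_cost(w_source, similarity):
    """Cost of Modify: k * w(n) * (1 - Sim(n, destinationNode))."""
    return K * w_source * (1.0 - similarity)

def permute_cost(w_a, w_b):
    """Cost of Permute: h * [w(a) + w(b)]."""
    return H * (w_a + w_b)

def deep_similar(costs):
    """Degree of similarity: 1 minus the total transformation cost."""
    return 1.0 - sum(costs)

# The running example of Sections 4 and 5:
costs = [insert_cost(0.278),           # insert Difficulty
         delete_cost(0.278),           # delete Semantic Density
         modify_cost(0.033, 0.769),    # learner -> student
         permute_cost(0.278, 0.278)]   # swap the two sub-trees
print(deep_similar(costs))
```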
A simple analysis shows that no cost can be higher than the weight of the nodes being treated, which means that the total cost cannot be higher than one, and that two trees’ degree of similarity (1 − the cost of transforming T1 into T2) must be a value in the interval [0, 1]. To conclude the running example (see Figure 4), the matchings, the needed tree edit operations, and their costs are presented in the following table. The cost of the Modify operation in the third row is calculated as 0.9 ∗ 0.033 ∗ (1 − 0.769), while the cost of the Permute operation in row four is calculated as 0.36 ∗ [0.278 + 0.278].

Source                                       Destination   Operation   Cost
null                                         Difficulty    Insert      0.278
Semantic Density                             null          Delete      0.278
learner                                      student       Modify      0.006
Intended End User / Learning Resource Type   NA            Permute     0.2

The degree of similarity between the two Educational sub-trees can therefore be calculated as 1 − (0.278 + 0.278 + 0.006 + 0.2), which evaluates to 0.238. This means that the two LOMs have quite different Educational sub-trees. If we had applied the same algorithm to the entire LOM trees, and had they differed only in their Educational sub-trees, a much higher degree of similarity would have been obtained.

5 Complex queries evaluation

Fuzzy predicates and fuzzy tree matching can be composed within a single query through logical operators that, in this context, are redefined to take into account the different semantics of predicates and sets. The following meanings for logical operators are used in this work:

– NOT The negation preserves the meaning of complement: given a fuzzy set A with membership function Q, if an element a belongs to A with degree Q(a), then a belongs to the complement of A with degree 1 − Q(a).
– AND The conjunction of conditions can be considered in the following manner. Given n fuzzy sets A1, A2, . . . , An with membership functions Q1, Q2, . . . , Qn, and given the elements a1, a2, . . . , an, if I(a1, a2, . . .
, an) is the intersection of the conditions (a1 ∈ A1), (a2 ∈ A2), . . . , (an ∈ An), then (a1 ∈ A1) AND (a2 ∈ A2) AND . . . AND (an ∈ An) implies that I(a1, a2, . . . , an) holds with degree Q(a1, a2, . . . , an) = min[Q1(a1), Q2(a2), . . . , Qn(an)]. The chosen approach preserves the meaning of conjunction in a crisp sense. In fact, once a threshold for the condition to hold has been fixed, it is necessary that all the conditions in the conjunction respect it. As in classical logic, for an AND condition to be true, all the conditions composing it have to be true.
– OR The interpretation chosen for the union of conditions is symmetric with respect to conjunction. Given n fuzzy sets A1, A2, . . . , An with membership functions Q1, Q2, . . . , Qn, and given the elements a1, a2, . . . , an, if U(a1, a2, . . . , an) is the union of the conditions (a1 ∈ A1), (a2 ∈ A2), . . . , (an ∈ An), then (a1 ∈ A1) OR (a2 ∈ A2) OR . . . OR (an ∈ An) implies that U(a1, a2, . . . , an) holds with degree Q(a1, a2, . . . , an) = max[Q1(a1), Q2(a2), . . . , Qn(an)].

Consider now a query combining the described fuzzy features:

/LOM{[deep-similar(Educational, /LOM[1]/Educational)] AND //duration NEAR PT1H}

The evaluation of such a query is the result of a four-step process:

1. The query is transformed into a crisp one, capable of extracting data guaranteed to be a superset of the desired result. In the example we obtain /LOM, which extracts all the LOMs, clearly a superset of the desired result.
2. Every fuzzy predicate pi is evaluated with respect to each of the extracted data items, and a degree of satisfaction is assigned through a variable vi. In this example, we evaluate the deep-similar between the Educational sub-items, and the degree to which the LOM’s duration is NEAR to one hour (thanks to a dedicated dictionary).
For the first item of the result set (the right-hand side LOM in Figure 4), the former evaluates to 0.238, and the latter to 0.67 (45 minutes against one hour).

3. An overall degree of satisfaction is obtained for each item in the result. This is done by considering all the different predicates of the path expression in conjunction, in accordance with the crisp XPath semantics. In our example, we take the smaller of the two degrees of satisfaction (0.238).
4. The items in the result are ordered according to the obtained degree of satisfaction.

The idea of considering all the different fuzzy predicates in conjunction can be too rigid in some contexts. An alternative approach is to allow advanced users to explicitly bind each degree to a variable and to define a function that calculates the final degree of satisfaction. We define a WITH RANKING clause in order to combine the bound values into the final ranking. The following example shows an FXPath expression with two fuzzy conditions:

/LOM{[deep-similar(Educational, /LOM[1]/Educational)] | v1
     AND //duration NEAR PT1H | v2
     WITH RANKING v1 * v2}

The ranking of the result set is obtained as the product of the values bound to v1 and v2. More complex WITH RANKING clauses could involve:

– a linear combination of the values bound to the ranking variables: WITH RANKING 0.4*v1 + 0.6*v2
– a normalized weighted average

6 Conclusion

We have presented a framework for querying semi-structured XML data based on key aspects of fuzzy logic. Its main advantage is the minimization of the silent queries that can be caused by (1) data not following an appropriate schema faithfully, (2) the user providing a blind query because they do not know the schema or exactly what they are looking for, and (3) data being presented with slightly diverse schemas. This is achieved through the use of fuzzy predicates and fuzzy tree matching. Both rely on domain semantics for achieving their goal, and we currently propose WordNet for calculating semantic similarity.
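The connective semantics and the WITH RANKING combinations described above amount to a few one-line operators, sketched here with the degrees from the running example (0.238 for deep-similar, 0.67 for the duration).

```python
def fuzzy_and(*degrees):
    """Conjunction: the minimum of the membership degrees."""
    return min(degrees)

def fuzzy_or(*degrees):
    """Union: the maximum of the membership degrees."""
    return max(degrees)

def fuzzy_not(degree):
    """Complement: 1 minus the membership degree."""
    return 1.0 - degree

# Degrees from the worked example of Section 5.
v1, v2 = 0.238, 0.67

default_rank = fuzzy_and(v1, v2)   # crisp-style conjunction (step 3)
product_rank = v1 * v2             # WITH RANKING v1 * v2
weighted_rank = 0.4 * v1 + 0.6 * v2  # WITH RANKING 0.4*v1 + 0.6*v2

print(default_rank, product_rank, weighted_rank)
```

The default conjunction keeps the item's rank at the weakest condition, while the explicit WITH RANKING formulas let an advanced user trade the two conditions off differently.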
However, more precise domain-specific ontologies could be used to obtain better results. Future work will, in fact, concentrate on further validating our approach using domain-specific ontologies.

References

1. Braga, D., Campi, A., Damiani, E., Pasi, G., Lanzi, P.L.: FXPath: Flexible querying of XML documents. In: Proc. of EuroFuse. (2002)
2. Bosc, P., Lietard, L., Pivert, O.: Soft querying, a new feature for database management systems. In: DEXA. (1994) 631–640
3. Bosc, P., Lietard, L., Pivert, O.: Quantified statements in a flexible relational query language. In: SAC ’95: Proceedings of the 1995 ACM Symposium on Applied Computing, New York, NY, USA, ACM Press (1995) 488–492
4. Kacprzyk, J., Ziolkowski, A.: Database queries with fuzzy linguistic quantifiers. IEEE Trans. Syst. Man Cybern. 16(3) (1986) 474–479
5. Bosc, P., Pivert, O.: Fuzzy querying in conventional databases. (1992) 645–671
6. Galindo, J., Medina, J., Pons, O., Cubero, J.: A server for fuzzy SQL queries. In: Proceedings of the Flexible Query Answering Systems. (1998)
7. Bosc, P., Pivert, O.: On representation-based querying of databases containing ill-known values. In: ISMIS ’97: Proceedings of the 10th International Symposium on Foundations of Intelligent Systems, London, UK, Springer-Verlag (1997) 477–486
8. Ma, Z.M., Zhang, W.J., Ma, W.Y.: Extending object-oriented databases for fuzzy information modeling. Inf. Syst. 29(5) (2004) 421–435
9. Kacprzyk, J., Zadrozny, S.: Internet as a challenge to fuzzy querying. (2003) 74–95
10. Dubois, D., Prade, H., Sèdes, F.: Fuzzy logic techniques in multimedia database querying: A preliminary investigation of the potentials. IEEE Transactions on Knowledge and Data Engineering 13(3) (2001) 383–392
11. Mouchaweh, M.S.: Diagnosis in real time for evolutionary processes using pattern recognition and possibility theory (invited paper). International Journal of Computational Cognition 2(1) (2004) 79–112 ISSN 1542-5908
12.
Loiseau, Y., Prade, H., Boughanem, M.: Qualitative pattern matching with linguistic terms. AI Commun. 17(1) (2004) 25–34
13. Blondel, V.D., Gajardo, A., Heymans, M., Senellart, P., Dooren, P.V.: A measure of similarity between graph vertices: applications to synonym extraction and Web searching. SIAM Review 46(4) (2004) 647–666
14. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FlexPath: flexible structure and full-text querying for XML. In: SIGMOD ’04: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, ACM Press (2004) 83–94
15. Zadeh, L.: Fuzzy sets. Information and Control 8(4) (1965) 338–353
16. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady 10 (1966) 707–710