A Fuzzy Extension for the XPath Query Language

Alessandro Campi, Sam Guinea, and Paola Spoletini
Dipartimento di Elettronica e Informazione - Politecnico di Milano
Piazza L. da Vinci 32, I-20133 Milano, Italy
campi|guinea|[email protected]
Abstract. XML has become a widespread format for data exchange over the Internet. The current state of the art in querying XML data is represented by XPath and XQuery, both of which define binary predicates. In this paper, we advocate that binary selection can at times be restrictive due to the very nature of XML, and to the uses that are made of it. We therefore suggest a querying framework, called FXPath, based on fuzzy logics. In particular, we propose the use of fuzzy predicates for the definition of more "vague" and softer queries. We also introduce a function called "deep-similar", which aims at substituting XPath's typical "deep-equal" function. Its goal is to provide a degree of similarity between two XML trees, assessing whether they are similar both structure-wise and content-wise. The approach is exemplified in the field of e-learning metadata.
1 Introduction
In the last few years XML has become one of the most important data formats for information exchange over the Internet. Ever since the advent of XML as a widespread
data format, query languages have become of paramount importance. The principal
proposals for querying XML documents have been XPath and XQuery. The first is a
language that allows for the selection of XML nodes through the definition of "tree traversal" expressions. Although not a fully-fledged query language, it remains sufficiently expressive, and has become widely adopted within other XML query languages
for expressing selection conditions. Its main advantage is the presence of a rich set of
available built-in functions. XQuery, on the other hand, is W3C’s current candidate for a
fully-fledged query language for XML documents. It is capable of working on multiple
XML documents, of joining results, and of transforming and creating XML structures.
It builds upon XPath, which is used as the selection language, to obtain its goals. Both
XPath and XQuery divide data into those which fully satisfy the selection conditions,
and those which do not. However, binary conditions can be —in some scenarios— a
limited approach to effective querying of XML data.
A few considerations can be made to justify this claim. First of all, even when XML
schemas do exist, data producers do not always follow them precisely. Second, users
often end up defining blind queries, either because they do not know the XML schema
in detail, or because they do not know exactly what they are looking for. For example,
they might be querying for some vague interest. Third, the same data can sometimes be
described using different schemas. As it is often the case with semi-structured data, it
is difficult to distinguish between the data itself and the structure containing it. There is
Figure 1. A simplified representation of the structure of the LOM standard: a LOM root with categories such as General (Title, Description, Language, Rights), Educational (Intended End User, Learning Resource Type, Semantic Density), LifeCycle, and Technical (Requirements, Format)
an intrinsic overlap between data and structure and it is very common for “physically”
near data to also be semantically related. It is easy to see how basing a binary query on
such unsafe grounds can often lead to unnecessary silence.
All these considerations hold, for example, in the field of e-learning metadata, which will constitute our explanatory context throughout this paper. In this field of research it
is common to describe learning objects (LOs) using the LOM standard1 and its XML
representation. A simplified representation of the structure of these documents is shown
in Figure 1. Within this context, we tackle the case in which a user is searching for a
certain learning object across multiple and distributed repositories in which the same
content may be stored with slightly different metadata.
In this paper, we propose a framework for querying XML data that goes beyond binary selection, and that allows the user to define more "vague" and "softer" selection criteria in order to obtain more results. To do so, we use concepts coming from the area of fuzzy logics. The main idea is that the selection should produce fuzzy sets, which differ from binary sets in the sense that it is possible to belong to them to different degrees. This is achieved through membership functions that consider the semantics behind the selection predicates (through domain specific ontologies or by calling WordNet), such as Q : X → [0, 1], where Q(x) indicates the degree to which the data x satisfies the concept Q. Concretely, we propose to extend the XPath query language to accommodate fuzzy selection. We chose XPath since it presents a simpler starting point with respect to XQuery. Due to the considerations already stated and analyzed in [1], we define some extensions that constitute FXPath and fall into the following categories:
– Fuzzy Predicates: The user can express vague queries by exploiting fuzzy predicates, whose semantics can be based on structural relaxations or clarified through
the use of domain specific ontologies.
– Fuzzy Tree Matching: Standard XPath provides a deep-equal function that can be
used to assess whether two sequences contain items that are atomic values and are
equal, or that are nodes of the same kind, with the same name, whose children are
deep-equal. This can be restrictive, so we propose an extension named deep-similar
to assess whether the sequences are similar both content-wise and structure-wise.
1 This standard specifies the syntax and semantics of Learning Object Metadata, defined as the attributes required to describe a Learning Object. The Learning Object Metadata standard focuses on the minimal set of attributes needed to allow these Learning Objects to be managed, located, and evaluated. Relevant attributes of Learning Objects include object type, author, owner, terms of distribution, format, and pedagogical attributes such as teaching or interaction style, grade level, mastery level, and prerequisites.
When query results are returned, they are accompanied by a ranking indicating "how much" each data item satisfies the selection condition. For example, while searching for LOs published in a year near 2000, we might retrieve LOs published in 2000, in 2001, 2002, etc. Returned items are wrapped into annotations, so as to recall XML tagging:
<!-- RankingDirective RankingValue="1.0" -->
<LO year="2000">
  <title>t1</title>
</LO>
<!-- /RankingDirective -->
<!-- RankingDirective RankingValue=".8" -->
<LO year="2001">
  <title>t2</title>
</LO>
<!-- /RankingDirective -->
<!-- RankingDirective RankingValue=".65" -->
<LO year="2002">
  <title>t3</title>
</LO>
<!-- /RankingDirective -->
In this example, note the presence of a ranking directive containing the ranking value, in the interval [0, 1], of each retrieved item. The closer it is to 1, the better the item satisfies the condition (i.e., being published in a year near 2000).
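The ranking above can be reproduced with a membership function over publication years and a sort by decreasing degree. This is an illustrative sketch only: the triangular shape and the ten-year half-width are our assumptions, not part of FXPath's defined semantics.

```python
# Sketch: ranking LOs by how well their year satisfies "year NEAR 2000".
# The linear (triangular) membership and the 10-year half-width are
# assumptions made for this example.
def near_year(year, target=2000, half_width=10):
    """Degree in [0, 1] to which `year` is near `target`."""
    return max(0.0, 1.0 - abs(year - target) / half_width)

los = [{"title": "t3", "year": 2002},
       {"title": "t1", "year": 2000},
       {"title": "t2", "year": 2001}]

# Order the result set by decreasing degree of satisfaction.
ranked = sorted(los, key=lambda lo: near_year(lo["year"]), reverse=True)
for lo in ranked:
    print(lo["title"], round(near_year(lo["year"]), 2))
```

With these assumed parameters the three items come back in the order t1, t2, t3, mirroring the annotated result set above (the exact ranking values depend on the chosen membership function).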
The rest of this paper is structured as follows. Section 2 presents relevant and related
work. Section 3 presents our approach for fuzzy predicates. Section 4 presents the concept of fuzzy tree matching and our implementation of the “deep-similar” function.
Section 5 brings the two approaches together, and Section 6 concludes this paper.
2 Related Work
Fuzzy sets have been shown to be a convenient way to model flexible queries in [2].
Many attempts to extend SQL with fuzzy capabilities were undertaken in recent years.
[3] describes SQLf, a language that extends SQL by introducing fuzzy predicates that
are processed on crisp information. Fuzzy quantifiers that allow the definition of aggregated concepts have been proposed in [4] and [5].
The FSQL system [6], developed upon Oracle, represents imprecise information as possibility distributions stored in standard tables. Users write queries using FSQL, which
are then translated into ordinary SQL queries that call functions provided by FSQL to
compute the degrees of matching. Different approaches have been defined to compare
fuzzy values. [7] proposes measures to evaluate (by considering similarity relations)
how close two fuzzy representations are.
[8], based on possibility distribution and the semantic measure of fuzzy data, introduces
an extended object-oriented database model to handle imperfect, as well as complex objects, in the real world. Some major notions in object-oriented databases such as objects,
classes, objects-classes relationships, subclass/superclass, and multiple inheritances are
extended in the fuzzy information environment.
[9] shows the use of fuzzy querying, in particular over the Internet. The paper is an example of how fuzzy querying can be used over widely distributed data sources: it shows how elements of fuzzy logic and linguistic quantifiers can be employed to attain human-consistent and useful solutions.
[10] applies fuzzy set methods to multimedia databases which have a complex structure,
and from which documents have to be retrieved and selected depending not only on their
contents, but also on the idea the user has of their appearance, through queries specified
in terms of user criteria.
A stream of research on fuzzy pattern matching (FPM) started in the eighties, and was
successfully used in flexible querying of fuzzy databases and in classification. Given
a pattern representing a request expressed in terms of fuzzy sets, and a database containing imprecise or fuzzy attribute values, the FPM returns two matching degrees. An
example of an advanced technique based on FPM can be found in [11]. [12] proposes a counterpart of FPM, called "Qualitative Pattern Matching" (QPM), for estimating
levels of matching between a request and data expressed with words. Given a request,
QPM rank-orders the items which possibly, or which certainly match the requirements,
according to the preferences of the user.
The problem of fuzzy similarity between graphs is studied in [13].
[14] presents FlexPath, an attempt to integrate database-style query languages such as
XPath and XQuery and full-text search on textual content. FlexPath considers queries
on structure as a template, and looks for answers that best match this template and the
full-text search. To achieve this, FlexPath provides an elegant definition of relaxation
on structure and defines primitive operators to span the space of relaxations. Query
answering is now based on ranking potential answers on structural and full-text search
conditions.
3 Fuzzy Predicates
Differently from classical binary logic semantics, fuzzy logics allow us to describe reality using sets to which objects can belong with a certain degree. Fuzzy sets, as introduced in [15], are described through a membership function Q : X → [0, 1] that assigns each object a membership degree for the considered set.
The main aspect of querying with vagueness is to analyze and choose the predicates that
constitute the basic blocks for creating interrogations in the presence of uncertain data.
Since the environment in which we apply fuzzy logics is XML, we can have vagueness
at different levels in the interrogation: in PCDATA, in attributes, in tag-names, and in
the information structure. The predicates we will analyze can be applied differently in
all these contexts.
Definition 1 The predicate NEAR defines the closeness among different elements and,
depending on the type of element being treated, it can assume different meanings:
1. When the predicate NEAR is applied to a PCDATA value, the query selects nodes
in which the PCDATA has a value close to the value expressed in the query. In this
case the syntax is:
"[{" selection_node ("NOT")? "NEAR" compare_value "}]"
2. When the predicate NEAR is applied to an attribute value, the query selects nodes
in which the attribute has a value close to the value expressed in the query. In this
case the syntax is:
"[{" attribute_name ("NOT")? "NEAR" compare_value "}]"
Figure 2. An example of membership function and the concept of NEAR
3. When the predicate NEAR is applied to a tag or to an attribute name, the query
selects nodes with a name similar to the name expressed in the query, with the
following syntax:
"[{" ("NOT")? "NEAR" node_name "}]"
4. When the predicate NEAR is inserted into the axis of a path expression, the selection
tries to extract elements, attributes or text that are successors of the current node,
giving a penalty which is proportional to the result’s distance from the current node.
The following syntax is used:
"/{" ("NOT")? "NEAR::" node_name "}"
Let us now analyze how the predicate NEAR can be, in practice, applied to different
fields and different data types. When the considered type is numeric, the query process
is quite natural. We have defined a set of vocabularies tailored for different application
domains. In these vocabularies we define a membership function and an α-cut that induce the concept of closeness for each particular type and field. Consider for example
the following query:
/LOM[{//duration NEAR PT1H}]
In this case, in terms of LOM duration, closeness can be naturally represented through a triangular membership function, and what is NEAR is induced by an α-cut with α = 0.6 (see Figure 2).
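Such a triangular membership with an α-cut can be sketched as follows. The 45-minute half-width is our assumption, chosen so that a 45-minute LO scores roughly 0.67, in line with the running example later in the paper; a real deployment would take these parameters from the domain vocabulary.

```python
# Sketch of the NEAR predicate on LOM durations (values in minutes).
# The triangular shape centred on one hour and the 45-minute half-width
# are illustrative assumptions; the paper's vocabularies would fix them.
ALPHA = 0.6  # the alpha-cut of Figure 2

def mu_near_one_hour(minutes, centre=60.0, half_width=45.0):
    """Triangular membership function for 'duration NEAR PT1H'."""
    return max(0.0, 1.0 - abs(minutes - centre) / half_width)

def is_near(minutes):
    """A duration is NEAR one hour iff its membership passes the alpha-cut."""
    return mu_near_one_hour(minutes) >= ALPHA
```

Under these assumptions a 50-minute LO is NEAR one hour, while a 30-minute LO is not; the membership degree itself (rather than the crisp cut) is what feeds the final ranking.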
The evaluation of textual information or strings is more complicated. In these cases we
consider two different approaches: the first analyzes the linguistic similarity between
two strings, while the other performs a semantic analysis between words. For the first
problem we use the Levenshtein algorithm [16]. Given two strings S and T, it performs all the possible matchings among their characters, obtaining a |S| × |T| matrix from which the distance between the two words is obtained. This approach can be useful when the same word appears correctly spelled in one tag and misspelled in another, but in general it does not help us identify two words with the same meaning but different lexical roots. For analyzing the similarity from a semantic point of view it is
necessary to integrate the system with a vocabulary that contains all possible synonyms.
The vocabulary we use is called JWordNet². In our querying process, we want to find
2 JWordNet is available at http://wordnet.princeton.edu/.
both misspelled versions of a word x and words with the same meaning. That is why
we use both evaluation methods, and consider the maximum value obtained.
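Both evaluation methods, and the rule of keeping the maximum, can be sketched as follows. The Levenshtein routine is the standard dynamic program; the tiny SYNONYMS table (holding the learner/student score 0.769 used later in the running example) is a stand-in for a real JWordNet lookup.

```python
# Sketch of the two-pronged string comparison of Section 3: lexical
# similarity via Levenshtein distance, semantic similarity via a synonym
# lookup (stand-in for JWordNet), keeping the maximum of the two.
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def lexical_sim(s, t):
    """Normalised lexical similarity in [0, 1]."""
    return 1.0 - levenshtein(s, t) / max(len(s), len(t), 1)

# Assumed WordNet-style similarity scores; a real system would query JWordNet.
SYNONYMS = {frozenset({"learner", "student"}): 0.769}

def semantic_sim(s, t):
    return 1.0 if s == t else SYNONYMS.get(frozenset({s, t}), 0.0)

def string_sim(s, t):
    # Keep the best of the two evaluation methods, as described above.
    return max(lexical_sim(s, t), semantic_sim(s, t))
```

For a misspelling such as "duratoin" the lexical score dominates, while for "learner" versus "student" the semantic score does, which is exactly why the maximum of the two is taken.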
The previous analysis of the use of the predicate NEAR for numeric and textual fields
covers the first three meanings given in definition 1. Let us now analyze the fourth
meaning, in which NEAR is applied within an axis.
/LOM[{/LOM/NEAR::duration}]
In this case, we search for a duration element placed near a LOM element. The degree
of satisfaction of this predicate is a function of the number of steps needed to reach the
duration element starting from the LOM one.
Besides the predicate NEAR we also introduce two other predicates:
– APPROXIMATELY allows the selection, from a document, of the elements with a given name that have a number of direct descendants close to the one indicated in the query. It is a derived predicate that can be substituted by a NEAR predicate applied to the result of the COUNT operator on the children of a given node.
– BESIDE, applicable only to element names, is used to find the nodes that are close
to the given node but not directly connected to it. The idea is to perform a horizontal
search in the XML structure to find a given element’s neighbors.
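Since APPROXIMATELY reduces to NEAR applied to a child count, it can be sketched directly. The linear membership and the tolerance of 3 below are illustrative assumptions, not values fixed by FXPath.

```python
# Sketch: APPROXIMATELY as NEAR applied to the COUNT of direct children.
# The linear membership and the tolerance of 3 are illustrative assumptions.
import xml.etree.ElementTree as ET

def approximately(element, expected_children, tolerance=3):
    """Degree in [0, 1] to which `element` has about `expected_children` children."""
    count = len(list(element))
    return max(0.0, 1.0 - abs(count - expected_children) / tolerance)

doc = ET.fromstring(
    "<Educational>"
    "<IntendedEndUser/><LearningResourceType/><SemanticDensity/>"
    "</Educational>")
print(approximately(doc, 3))  # exactly 3 children -> 1.0
```

As with NEAR, the returned degree (rather than a crisp yes/no) is what contributes to the final ranking of the selected nodes.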
4 Deep-Similar
The deep-similar function is a fuzzy calculation of the distance between two XML trees
based on the concept of Tree Edit Distance, a well-known approach for calculating how
much it costs to transform one tree (the source tree) into another (the destination tree).
Our novel contributions are that we consider structure, content, and the intrinsic overlap
that can exist between the two. In particular, semantics are considered using WordNet's system of hypernyms, as described in Section 3.
Another novel aspect of our approach is the introduction of a new Tree Edit Operation,
called Permute. This new operation is added to the classic set of operations including
Insert, Delete, and Modify. It is specifically introduced to tackle situations in which
two nodes are present in both the source and the destination tree, but in a different
order. This is an important aspect that cannot be ignored, since order often represents an important contribution to a tree's content. Moreover, we advocate that the costs of these
operations cannot be given once and for all, but must depend on the nodes being treated.
Definition 2 (Deep-similar) Given two XML trees T1 and T2, deep-similar(T1, T2) is the function that returns their degree of similarity as a value in the interval [0, 1].
This degree of similarity is given as 1 - (the cost of transforming T1 into T2 using
Tree Edit Operations). Therefore, if two trees are completely different, their degree of
similarity is 0; if they are exactly the same —both structure-wise and content-wise—
their degree of similarity is 1.
In order to transform the XML tree T1 into T2 , the deep-similar function can use the
following Tree Edit Operations:
1.  distribute-weight (Tree T, Weight w)
2.  {
3.      w_root = f^x / m^y
4.      annotate the root node with (w_root * w)
5.      for each first-level sub-tree s
6.      {
7.          w_subTree = b * (Ls/LT) + (1-b) * (Is/IT)
8.          distribute-weight(s, w * (1 - w_root) * w_subTree)
9.      }
10. }
Figure 3. The distribute-weight algorithm
Definition 3 (Insert) Given an XML tree T, an XML node n, a location loc (defined through a path expression that selects a single node p in T), and an integer i, Insert(T, n, loc, i) transforms T into a new tree T′ in which node n is added to the first-level children nodes of p in position i.
Definition 4 (Delete) Given an XML tree T, and a location loc (defined through a path expression that selects a single node n in T), Delete(T, loc) transforms T into a new tree T′ in which node n is removed.
Definition 5 (Modify) Given an XML tree T, a location loc (defined through a path expression that selects a single node n in T), and a new value v, Modify(T, loc, v) transforms T into a new tree T′ in which the content of node n is replaced by v.
Definition 6 (Permute) Given an XML tree T, a location loc1 (defined through a path expression that selects a single node n1 in T), and a location loc2 (defined through a path expression that selects a single node n2 in T), Permute(T, loc1, loc2) transforms T into a new tree T′ in which the locations of nodes n1 and n2 are exchanged.
Before these operations can be applied to the source tree, two fundamental steps must be
performed. The first consists in weighing the nodes within the two XML trees, in order
to discover the importance they have within the trees. This will prove fundamental in
calculating how much the edit operations cost. The second consists in matching the
nodes in the source tree with those in the destination tree, and vice-versa. This step is
paramount in discovering onto which nodes the different Tree Edit Operations must be
performed. The costs of these Tree Edit Operations will be presented as soon as we
complete a more in depth explanation of the two aforementioned steps.
4.1 Tree Node Weighing
Tree node weighing consists in associating a weight value, in the interval [0, 1], to each and every node in an XML tree, by taking into consideration the structural properties it possesses due to its position within that very tree. The weighing algorithm we propose is devised to maintain a fundamental property: the sum of the weights associated to all the nodes in a tree must be equal to 1.
Figure 4. The example: two LOM trees whose Educational sub-trees are compared, annotated with the node weights computed by distribute-weight
In our framework, a function called distribute-weight is provided for this very reason.
As seen in Figure 3, it is a recursive function that traverses the XML tree, annotating
each node with an appropriate weight. It is recursive in order to mimic the intrinsic
structural recursiveness present in XML trees, which are always built of one root-node
and any number of sub-trees.
The algorithm is initially called passing it the entire tree to be weighed, and its total
weight of 1. The first step is to decide how much of the total weight should be associated with the root-node (code line 3). Intuitively, the weight —and therefore the
importance— of any given node is directly proportional to the number of first-level
children nodes it possesses (variable f in code line 3) and inversely proportional to
the total number of nodes in the tree (variable m in code line 3). The relationship is
calibrated through the use of two constant exponents, x and y. After a great number
of experiments in various contexts, the values 0.2246 and 0.7369 (respectively) have
proven to give good results.
Regarding the example shown in Figure 4, we are interested in finding LOMs that possess similar educational metadata. The FXPath selection
/LOM{[deep-similar(Educational, /LOM[1]/educational)]}
asks the system to provide just that. The Educational sub-tree from the left LOM is compared for similarity to the Educational sub-tree of the right LOM. Therefore, the algorithm is initially called passing it the sub-tree starting at node Educational and a weight of 1 to be distributed. The variable w_root is calculated to be 3^0.2246 / 16^0.7369, which is equal to 0.165.
The second step consists in deciding how to distribute the remaining weight onto the
sub-trees. This cannot be done by simply looking at the number of nodes each of these
sub-trees possesses. The importance —and therefore the weight— of a sub-tree depends
both on the amount of “data” it contains (typically in its leaf nodes) and on the amount
of “structure” it contains (typically in its intermediate nodes). These considerations are
taken into account in code line 7, in which the variable Ls indicates the number of leaf
nodes in sub-tree s, while the variable LT indicates the number of leaf nodes in the
tree T. On the other hand, the variable Is indicates the number of intermediate nodes in
sub-tree s, while the variable IT indicates the number of intermediate nodes in the tree
T. The balance between leaf and intermediate nodes is set through the use of the constant b. Our experiments have demonstrated that the algorithm is more effective if we
balance the formula slightly in favor of the leaf nodes by setting b to 0.6. Code line 7,
however, only gives the percentage of the remaining weight that should be associated to each sub-tree. The actual weight is calculated in code line 8 through an appropriate multiplication, and passed recursively to distribute-weight together with the respective sub-tree.
Reprising our running example, the remaining weight (1 − 0.165), equal to 0.835, is
distributed onto the three sub-trees. The percentage of the weight that is associated to
sub-tree Intended End User is 0.6 ∗ 2/6 + 0.4 ∗ 2/6, which is equal to 0.333 (see code
line 7). Therefore, the algorithm is called recursively passing sub-tree Intended End
User and a weight of 0.835 ∗ 0.333, which is equal to 0.278. For lack of space, the
weights associated to the remaining nodes are shown directly in Figure 4.
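The weighing procedure can be sketched in Python as follows. Trees are nested (label, children) tuples. Two points are our assumptions where the paper leaves details open: Is and IT count non-root, non-leaf nodes, and the sub-tree fractions are renormalised so that the remaining weight is fully distributed (which preserves the sum-to-1 property).

```python
# Sketch of the distribute-weight algorithm of Figure 3. Trees are nested
# (label, children) tuples; node labels are assumed distinct for clarity.
X, Y, B = 0.2246, 0.7369, 0.6  # the calibration constants from Section 4.1

def n_nodes(t):
    return 1 + sum(n_nodes(c) for c in t[1])

def n_leaves(t):
    return 1 if not t[1] else sum(n_leaves(c) for c in t[1])

def n_inter(t):
    # intermediate (non-root, non-leaf) nodes strictly below the root of t
    return sum((1 if c[1] else 0) + n_inter(c) for c in t[1])

def distribute_weight(t, w, weights=None):
    weights = {} if weights is None else weights
    label, children = t
    if not children:                  # a leaf keeps the weight it receives
        weights[label] = w
        return weights
    w_root = len(children) ** X / n_nodes(t) ** Y        # code line 3
    weights[label] = w_root * w                          # code line 4
    LT = sum(n_leaves(s) for s in children)
    IT = sum(n_inter(s) for s in children)
    fracs = [B * n_leaves(s) / LT +
             ((1 - B) * n_inter(s) / IT if IT else 0.0)
             for s in children]                          # code line 7
    for s, frac in zip(children, fracs):                 # code line 8
        distribute_weight(s, w * (1 - w_root) * frac / sum(fracs), weights)
    return weights

# The Educational sub-tree of the running example (16 nodes, 6 leaves).
edu = ("Educational", [
    ("Intended End User", [("Source1", [("LOMv1.0", [])]),
                           ("Value1", [("learner", [])])]),
    ("Learning Resource Type", [("Source2", [("LOM-b", [])]),
                                ("Value2", [("text", [])])]),
    ("Semantic Density", [("Source3", [("LOM-c", [])]),
                          ("Value3", [("medium", [])])]),
])
weights = distribute_weight(edu, 1.0)
```

On this input the sketch reproduces the values derived in the text: roughly 0.165 for the Educational root and roughly 0.278 for each first-level sub-tree, with all weights summing to 1.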
4.2 Tree Node Matching
Tree Node Matching is the last step in determining which Tree Edit Operations must be
performed to achieve the transformation. Its goal is to establish a matching between the
nodes in the source and the destination trees. Whenever a match for a given node n cannot be found in the other tree, it is matched to the null value.
In this step, we use an algorithm that takes into account more complex structural properties the nodes might have, and —for the first time— semantic similarity. It scores the
candidate nodes by analyzing how “well” they match the reference node. The candidate
node with the highest score is considered the reference node’s match.
Regarding structural properties, the algorithm considers the following characteristics:
number of direct children, number of nodes in the sub-tree, depth of the sub-tree, distance from the expected position, position of the node with respect to its father, positions
of the nodes in the sub-tree, same value for the first child node, and same value for the
last node. Each of these characteristics can give from a minimum of 1 point to a maximum of 5 points.
Regarding semantic similarity, the algorithm looks at the nodes' tag names. These are compared using, once again, WordNet's system of hypernyms (see Section 3). If the two terms are exactly the same, 1 point is given; if not, their degree of similarity, a value in the interval [0, 1], is considered.
4.3 Costs
Considerations presented in Sections 4.1 and 4.2 determine the costs the Tree Edit Operations have in our framework:
– The cost of the Insert edit operation corresponds to the weight the node being inserted has in the destination tree.
– The cost of the Delete edit operation corresponds to the weight the node being
deleted from the source tree had.
– The cost of the Modify edit operation can be seen as the deletion of a node from the
source tree, and its subsequent substitution by means of an insertion of a new node
containing the new value. This operation does not modify the tree’s structure, it only
modifies its content. This is why it is necessary to consider the degree of similarity
existing between the node’s old term and its new one. The cost is therefore k ∗
w(n) ∗ (1 − Sim(n, destinationN ode)), where w(n) is the weight the node being
modified has in the source tree, the function Sim gives the degree of similarity
between node n and the destination value, and k is a constant (0.9).
– The Permute edit operation does not modify the tree’s structure. It only modifies the
semantics that are intrinsically held in the order the nodes are placed in. Therefore,
its cost is h ∗ [w(a) + w(b)], where w(a) is the weight of node a, w(b) is the weight
of node b, and h is a constant value (0.36).
All the above cost-formulas are directly proportional to the weights of the nodes being
treated. All of the operations, except Permute, can also be used directly on sub-trees as
long as their total weights are used. A simple analysis shows that no cost can be higher than the weight of the nodes being treated, which means that the total cost cannot be higher than one, and that two trees' degree of similarity (1 minus the cost of transforming T1 into T2) must be a value in the interval [0, 1].
To conclude the running example (see Figure 4), the matchings, the needed tree edit operations, and their costs are presented in the following table. The cost of the Modify operation in the third row is calculated as 0.9 ∗ 0.033 ∗ (1 − 0.769), while the cost of the Permute operation in row four is calculated as 0.36 ∗ [0.278 + 0.278].
Source                                       Destination   Operation   Cost
null                                         Difficulty    Insert      0.278
Semantic Density                             null          Delete      0.278
learner                                      student       Modify      0.006
Intended End User / Learning Resource Type   NA            Permute     0.2
The degree of similarity between the two Educational sub-trees can therefore be calculated as 1 − (0.278 + 0.278 + 0.006 + 0.2), which evaluates to 0.238. This means that
the two LOMs have quite different Educational sub-trees.
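The four cost formulas, and the example's total, can be written down directly. The weights and the similarity value 0.769 are the ones derived above; the constants K and H are those of Section 4.3.

```python
# The Tree Edit Operation costs of Section 4.3, applied to the running example.
K, H = 0.9, 0.36  # the constants used by Modify and Permute

def cost_insert(w_dest):     return w_dest
def cost_delete(w_src):      return w_src
def cost_modify(w_src, sim): return K * w_src * (1 - sim)
def cost_permute(w_a, w_b):  return H * (w_a + w_b)

total = (cost_insert(0.278)              # Insert Difficulty
         + cost_delete(0.278)            # Delete Semantic Density
         + cost_modify(0.033, 0.769)     # Modify learner -> student
         + cost_permute(0.278, 0.278))   # Permute the two sub-trees
deep_similar = 1 - total                 # roughly 0.238, as in the text
```

Evaluating this reproduces (up to rounding) the degree of similarity 0.238 derived above.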
If we had applied the same algorithm to the entire LOM trees, and had they differed
only in their Educational sub-trees, a much higher degree of similarity would have been
obtained.
5 Evaluation of Complex Queries
Fuzzy predicates and fuzzy tree matching can be composed within a single query through logical operators, which, in this context, are redefined to take into account the different semantics of predicates and sets. The following meanings for the logical operators are used in this work:
– NOT. The negation preserves the meaning of complement: given a fuzzy set A with membership function Q, for each element a the membership degree of a in the complement of A is 1 − Q(a).
– AND. The conjunction of conditions is treated as follows. Given n fuzzy sets A1, A2, . . . , An with membership functions Q1, Q2, . . . , Qn, and given the elements a1, a2, . . . , an, the degree of satisfaction of the conjunction (a1 ∈ A1) AND (a2 ∈ A2) AND . . . AND (an ∈ An) is Q(a1, a2, . . . , an) = min[Q1(a1), Q2(a2), . . . , Qn(an)]. The chosen approach preserves the meaning of conjunction in a crisp sense. In fact, once a threshold for the condition to hold has been fixed, it is necessary that all the conditions in the conjunction respect it. As in classical logic, for an AND condition to be true, all the conditions composing it have to be true.
– OR. The interpretation chosen for the union of conditions is symmetric with respect to conjunction. Given n fuzzy sets A1, A2, . . . , An with membership functions Q1, Q2, . . . , Qn, and given the elements a1, a2, . . . , an, the degree of satisfaction of the disjunction (a1 ∈ A1) OR (a2 ∈ A2) OR . . . OR (an ∈ An) is Q(a1, a2, . . . , an) = max[Q1(a1), Q2(a2), . . . , Qn(an)].
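Under these semantics the connectives reduce to complement, minimum, and maximum over membership degrees; a minimal sketch:

```python
# The fuzzy connectives as defined above: complement, min, and max.
def f_not(q):   return 1.0 - q
def f_and(*qs): return min(qs)
def f_or(*qs):  return max(qs)

# Degrees from the running example: deep-similar = 0.238, NEAR PT1H = 0.67.
print(f_and(0.238, 0.67))  # conjunction keeps the weakest degree
print(f_or(0.238, 0.67))   # disjunction keeps the strongest degree
```

Applied to the running example's two degrees, the conjunction yields 0.238 and the disjunction 0.67, matching the crisp intuition that AND is only as strong as its weakest condition.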
Consider now a query combining the described fuzzy features:
/LOM{[deep-similar(educational, /LOM[1]/educational)] AND
//duration NEAR PT1H}
The evaluation of such a query is the result of a four-step process:
1. The query is transformed into a crisp one, capable of extracting data guaranteed to be a superset of the desired result. In the example we obtain /LOM, which extracts all the LOMs: clearly a superset of the desired result.
2. Every fuzzy predicate pi is evaluated against each of the extracted data items, and a degree of satisfaction is assigned through a variable vi. In this example, we evaluate the deep-similar between the Educational sub-items, and the degree to which the LOM's duration is NEAR to one hour (thanks to a dedicated dictionary). For the first item of the result-set (the right-hand side LOM in Figure 4), the former evaluates to 0.238, and the latter to 0.67 (45 minutes against one hour).
3. An overall degree of satisfaction is obtained for each item in the result. This is done by considering all the different predicates of the path expression in conjunction, in accordance with the crisp XPath semantics. In our example, we take the smaller of the two degrees of satisfaction (0.238).
4. The items in the result are ordered according to the obtained degree of satisfaction.
The idea of considering all the different fuzzy predicates in conjunction can be too rigid in some contexts. An alternative approach is to allow advanced users to explicitly bind each degree to a variable and to define a function that calculates the final degree of satisfaction. We define a WITH RANKING clause in order to combine the bound values in the final ranking. The following example shows an FXPath expression with two fuzzy conditions:
/LOM{[deep-similar(educational, /LOM[1]/educational)] | v1 AND
//duration NEAR PT1H | v2 WITH RANKING v1 * v2}
The ranking of the result set is obtained as the product of the values bound to v1 and
v2. More complex WITH RANKING clauses could involve:
– linear combination of the values bound to the ranking variables:
WITH RANKING 0.4*v1 + 0.6*v2
– normalized weighted average of the values bound to the ranking variables.
6 Conclusion
We have presented a framework for querying semi-structured XML data based on key aspects of fuzzy logics. Its main advantage is the minimization of the silent queries that can be caused by (1) data not following an appropriate schema faithfully, (2) the user providing a blind query because they do not know the schema or exactly what they are looking for, and (3) data being presented with slightly diverse schemas. This is achieved
through the use of fuzzy predicates, and fuzzy tree matching. Both rely on domain
semantics for achieving their goal, and we currently propose WordNet for calculating
semantic similarity. However, more precise domain specific ontologies could be used
to obtain better results. Future work will, in fact, concentrate on further validating our
approach using domain specific ontologies.
References
1. Braga, D., Campi, A., Damiani, E., Pasi, G., Lanzi, P.L.: FXPath: Flexible querying of XML documents. In: Proc. of EuroFuse. (2002)
2. Bosc, P., Lietard, L., Pivert, O.: Soft querying, a new feature for database management
systems. In: DEXA. (1994) 631–640
3. Bosc, P., Lietard, L., Pivert, O.: Quantified statements in a flexible relational query language.
In: SAC ’95: Proceedings of the 1995 ACM symposium on Applied computing, New York,
NY, USA, ACM Press (1995) 488–492
4. Kacprzyk, J., Ziolkowski, A.: Database queries with fuzzy linguistic quantifiers. IEEE Trans.
Syst. Man Cybern. 16(3) (1986) 474–479
5. Bosc, P., Pivert, O.: Fuzzy querying in conventional databases. (1992) 645–671
6. Galindo, J., Medina, J., Pons, O., Cubero, J.: A server for fuzzy SQL queries. In: Proceedings
of the Flexible Query Answering Systems. (1998)
7. Bosc, P., Pivert, O.: On representation-based querying of databases containing ill-known
values. In: ISMIS ’97: Proceedings of the 10th International Symposium on Foundations of
Intelligent Systems, London, UK, Springer-Verlag (1997) 477–486
8. Ma, Z.M., Zhang, W.J., Ma, W.Y.: Extending object-oriented databases for fuzzy information
modeling. Inf. Syst. 29(5) (2004) 421–435
9. Kacprzyk, J., Zadrozny, S.: Internet as a challenge to fuzzy querying. (2003) 74–95
10. Dubois, D., Prade, H., Sèdes, F.: Fuzzy logic techniques in multimedia database querying: A preliminary investigation of the potentials. IEEE Transactions on Knowledge and Data Engineering 13(3) (2001) 383–392
11. Mouchaweh, M.S.: Diagnosis in real time for evolutionary processes in using pattern recognition and possibility theory (invited paper). International Journal of Computational Cognition 2(1) (2004) 79–112 ISSN 1542-5908.
12. Loiseau, Y., Prade, H., Boughanem, M.: Qualitative pattern matching with linguistic terms.
AI Commun. 17(1) (2004) 25–34
13. Blondel, V.D., Gajardo, A., Heymans, M., Senellart, P., Dooren, P.V.: A measure of similarity between graph vertices: applications to synonym extraction and Web searching. SIAM
Review 46(4) (2004) 647–666
14. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FlexPath: flexible structure and full-text querying for XML. In: SIGMOD '04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, New York, NY, USA, ACM Press (2004) 83–94
15. Zadeh, L.: Fuzzy sets. Information and Control. 8(4) (1965) 338–353
16. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics-Doklady 10 (1966) 707–710