Applying of matching parameters to XML document nodes

Larisa Globa, Mykola Alieksieiev, Iurii Molchanov, Liudmyla Kobzar
National Technical University of Ukraine „Kyiv Polytechnic Institute”,
Institute for Telecommunication Systems
e-mail: [email protected], [email protected], [email protected]
XML Documents Change Detection System
Abstract
XML is a widely used standard for data representation in different applications. Characteristics
of XML such as its tree structure and self-description make it difficult to detect changes at
the document level. Currently there are no effective methods for detecting changes in XML
documents. This paper presents a new algorithm for detecting changes between the current and
the previous versions of a monitored XML document. The technique can be used to discover
changes between text parts of the initial and new documents.
Rather than looking for a minimum-cost edit distance between two trees that represent two
versions of an XML document, we propose an algorithm based on linear programming, which
detects changes by calculating the difference between nodes in the old and new versions of
the XML document. The algorithm compares nodes not only by their content but also by their
attributes and their position in the tree structure. The proposed technique considers only
those parts of XML documents that are meaningful for end users.
Keywords: XML document, change detection, publish/subscribe system.
Introduction
To detect changes in a certain part of a web page effectively, the corresponding part must
first be located in the new document. Since changes may be made in any part of the new
document version, the location of the text of interest to the subscriber may change. Thus
the first step is to find the part of the new document that matches the text in the old
version well.
In previous works [1,2,3] this problem was solved using minimum-cost edit distance.
This approach is not optimal for page monitoring, because the result of the algorithm requires
additional processing to find the changed parts. The approach also relies on the availability
of a “good matching” between nodes of the considered documents, and searching for such a
“good matching” is a difficult, resource-intensive problem. Another way to find matching
parts in the old and new documents is to compute a similarity measure between parts of an
HTML document. In [4] a set of parameters for determining such a similarity measure was
proposed. Based on the approach proposed in [4], this work applies matching parameters to
parts of XML documents. These parameters are then used to solve the “good matching” search
problem.
The main contributions of this work are as follows: first, we introduce matching
criteria to compare XML nodes by their content, attributes and position; second, we propose
a new method to map XML nodes to each other based on the matching criteria.
Main part
We treat an XML document as an ordered tree, in which the left-to-right order among
siblings is important. We assume that not only text nodes can change but also other node
properties, such as node attributes or node position. Our goal is first to introduce a method
for comparing XML nodes, and then to match each element value in the old version with its
corresponding value in the new version in order to map nodes of the old tree to nodes of the
new tree.
According to the formulation of the change detection problem between XML parts, the
following conditions were formulated:
1. An XML document is considered as an ordered tree, in which nodes are ordered from left
to right.
2. Matching is searched only for the leaves of the XML document, whose values contain the
text information most important for the subscriber.
3. When XML tags are compared, attributes are considered matched only if both their names
and their values match.
4. The matching search takes into account the order of child nodes of any parent node and
the position of nodes in the XML tree hierarchy, based on the XML tag indexing introduced
in this work.
First we describe the complete change detection algorithm between XML document
versions, which is presented in Fig. 1.
Fig. 1. Algorithm flowchart
XML document indexing is necessary to consider the nodes of the resulting XML tree in
left-to-right order and to track node positions in the XML tree hierarchy. It is described
later.
The step “Calculation of matching parameters” identifies the tags important for the
subscriber, which will be used for comparison. Usually these are XML tags whose text
content is of interest to the subscriber. They usually appear as leaf nodes in the XML
tree.
The parameters found at the step “Calculation of matching parameters” are
calculated for all pairs of chosen nodes in the old and new versions of the XML tree. They
are then used to create the matrix of integral matching criteria. The integer solution
obtained at the step “Resolving of formulated linear programming task” gives the best match
between the given documents under the accepted conditions.
XML document indexing
Since in this approach an XML document is considered as an ordered tree in which the
left-to-right order of nodes is taken into account, a change in the order of child nodes
within a parent node is important for the subscriber and should be detected by the developed
publish/subscribe system.
To implement this function, XML document indexing based on numerical values was
chosen. This makes it possible to take the left-to-right order of nodes into account and to
track node positions in the XML tree hierarchy.
This means that an attribute ‘index’ is added to each tag in the analyzed document.
The index value is chosen according to the following rules:
1. index=1 for the root tag.
2. index=1.i for a child node of the root tag, where i is the number of the child node in
left-to-right order.
3. index=1.i.j, accordingly, for the j-th child node of the i-th parent node.
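The indexing rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library `xml.etree.ElementTree` parser, the helper name `add_index` is hypothetical, and the sample document is a shortened version of the one in Fig. 2.

```python
import xml.etree.ElementTree as ET


def add_index(element, index="1"):
    """Set a hierarchical 'index' attribute on element and all its descendants."""
    element.set("index", index)
    # Iterating an Element yields its children in document (left-to-right) order.
    for position, child in enumerate(element, start=1):
        add_index(child, f"{index}.{position}")


# Shortened sample document (assumption: same shape as the one in Fig. 2).
source = (
    "<books><book><title>Title1</title>"
    "<author><name>Name1</name><surname>Surname1</surname></author>"
    "<edition>Edition1</edition></book>"
    "<book><title>Title2</title></book></books>"
)
root = ET.fromstring(source)
add_index(root)

# Collect (tag, index) pairs in document order for inspection.
indexed = [(el.tag, el.get("index")) for el in root.iter()]
```

Because the index encodes the whole path from the root, both the sibling order and the depth of a node can be read from the attribute alone.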
The initial and indexed versions of the XML document are shown in Fig. 2.
Initial version:
  <books>
    <book>
      <title>Title1</title>
      <author>
        <name>Name1</name>
        <surname>Surname1</surname>
      </author>
      <edition>Edition1</edition>
    </book>
    <book>
      <title>Title2</title>
      ...
Indexed version:
  <books index="1">
    <book index="1.1">
      <title index="1.1.1">Title1</title>
      <author index="1.1.2">
        <name index="1.1.2.1">Name1</name>
        <surname index="1.1.2.2">Surname1</surname>
      </author>
      <edition index="1.1.3">Edition1</edition>
    </book>
    <book index="1.2">
      <title index="1.2.1">Title2</title>
      ...
Fig. 2. Process of XML document indexing
Thus, the XML document is correctly transformed into an XML tree based on the proposed
rules, taking into account the XML hierarchy and the order of child nodes in the initial XML
document.
Applying of matching parameters to XML document nodes
Let an XML document version be represented by a tree $T_1$. Then the tree $T_1$ is
characterized by the following parameters:
$N$ is the number of nodes in tree $T_1$, which corresponds to the number of tags in the XML
document;
$R = \{ r_i \mid i = 1 \ldots m \}$ is the set of parent nodes in tree $T_1$, where $m$ is the
number of parent nodes and $r_i$ is the parent node of node $i$;
$A = \{ a_i \mid i = 1 \ldots N \}$ is the set of node attributes in tree $T_1$, where $a_i$ is
the attribute of node $i$;
$con(x_i)$ is the content of node $i$, where $x_i$ is the $i$-th node of tree $T_1$.
The XML document shown in Fig. 2 can be represented as the tree shown in Fig. 3.
[Figure: tree of the indexed XML document with root books, two book nodes, and title, author
(name, surname) and edition nodes below each book. Every node is labeled with its content and
its index attribute, for example con(books)=0, a(index(books))=1; con(title)=Title1,
a(index(title))=1.1.1; con(surname)=Surname2, a(index(surname))=1.2.2.2.]
Fig. 3. Representation of the XML tree
Thus the document tree is an ordered tree whose elements are characterized by their
positions and related sets of attributes. The parts of text displayed on the web page are
the leaves of the document tree.
Let $T(n)$ be the subtree of tree $T$ rooted at node $n$, for a given node $n$ of the document
tree $T$.
We introduce the content matching parameter of nodes $x_1$ and $x_2$ as follows:
$$P_{con}(x_1, x_2) = \frac{|con(x_1) \cap con(x_2)|}{|con(x_1) \cup con(x_2)|}$$
The parameter $P_{con}(x_1, x_2)$ returns the percentage of words that appear in both nodes
$x_1$ and $x_2$.
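As an illustration of the content matching parameter, the following minimal sketch treats the content of a node as a set of whitespace-separated words and computes the share of words common to both nodes. The function name `p_con`, the whitespace tokenization and the value returned for two empty contents are assumptions made for this example only.

```python
def p_con(text1, text2):
    """Share of distinct words common to both texts (intersection over union)."""
    words1 = set(text1.split())
    words2 = set(text2.split())
    union = words1 | words2
    if not union:
        # Assumption: two empty contents are treated as a perfect match.
        return 1.0
    return len(words1 & words2) / len(union)


# Two of the four distinct words ("b" and "c") appear in both texts.
example = p_con("a b c", "b c d")
```

With this definition the parameter is 1 for identical word sets and 0 when the nodes share no words.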
The attribute matching parameter between nodes $x_1$ and $x_2$ is obtained as follows:
$$P_{att}(x_1, x_2) = \frac{\sum_{a_i \in \{a(r_1) \cap a(r_2)\}} a_i}{\sum_{a_i \in \{a(r_1) \cup a(r_2)\}} a_i}$$
The parameter $P_{att}(x_1, x_2)$ measures the relative weight of the attributes that
have the same value in $x_1$ and $x_2$. In XML, an attribute may have different values in
different XML documents, since the syntax of the language does not define default attribute
values. Within a given document, the attributes in use have unique values. Therefore the
weight functions proposed in [4] cannot be applied to calculate the attribute matching
parameter in an XML tree. Thus, all attributes are treated as equivalent in the proposed
formula, and only identical attributes of two nodes are taken into consideration.
Since the old and new versions of the same XML document are compared, we assume that
the attribute names match in both documents. Accordingly, identical attributes are those
that have identical names and values.
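A possible reading of the attribute matching parameter, with all attributes weighted equally, is the number of identical (name, value) pairs divided by the number of distinct pairs in either node. The sketch below follows that reading. The function name `p_att` and the dictionary representation of attributes are assumptions, and the caller is assumed to leave out the ‘index’ attribute, which is handled by the position parameter.

```python
def p_att(attrs1, attrs2):
    """Share of (name, value) attribute pairs that are identical in both nodes."""
    pairs1 = set(attrs1.items())
    pairs2 = set(attrs2.items())
    union = pairs1 | pairs2
    if not union:
        # Assumption: two nodes without attributes match perfectly.
        return 1.0
    return len(pairs1 & pairs2) / len(union)


# 'lang' matches by name and value; 'id' has the same name but different values,
# so it contributes two distinct pairs to the union and none to the intersection.
example = p_att({"lang": "en", "id": "7"}, {"lang": "en", "id": "8"})
```

Counting pairs rather than names reflects the rule that attributes match only when both name and value are identical.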
The position matching parameter is obtained by the following expression:
$$P_{dist}(x_1, x_2) = \frac{suf(index(x_1), index(x_2))}{max(index(x_1), index(x_2))}$$
In this expression, the function $suf$ gives the length of the common suffix of
$index(x_1)$ and $index(x_2)$, the attributes that define the positions of nodes $x_1$ and
$x_2$ in the XML tree hierarchy. The function $max$ gives the maximum of the lengths of
$index(x_1)$ and $index(x_2)$.
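The position matching parameter can be sketched as follows. The text does not say whether lengths are counted in characters or in index components, so this example assumes dot-separated components; the function name `p_dist` is also hypothetical.

```python
def p_dist(index1, index2):
    """Length of the common suffix of two indexes divided by the longer index length."""
    parts1 = index1.split(".")
    parts2 = index2.split(".")
    common = 0
    # Walk both indexes from the last component backwards until they differ.
    for a, b in zip(reversed(parts1), reversed(parts2)):
        if a != b:
            break
        common += 1
    return common / max(len(parts1), len(parts2))


# The last two components ("2" and "1") coincide, the longer index has four.
example = p_dist("1.1.2.1", "1.2.2.1")
```

Under this assumption a node that keeps its position within its parent but moves to another branch still gets a partial position match.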
An expression for the integral matching criterion must be obtained from the content,
attribute and position matching parameters of two nodes. These parameters should be
weighted differently by weight factors in the expression for the integral matching
criterion, because some parameters may be considered more relevant than others.
Let $\alpha$, $\beta$, $\gamma$ be the weight factors for $P_{con}(x_1, x_2)$, $P_{att}(x_1, x_2)$,
$P_{dist}(x_1, x_2)$, respectively. Then $\alpha + \beta + \gamma = 1$, and the integral
matching criterion is obtained as follows:
$$CS(x_1, x_2) = -1 + 2 \cdot (\alpha \cdot P_{con}(x_1, x_2) + \beta \cdot P_{att}(x_1, x_2) + \gamma \cdot P_{dist}(x_1, x_2))$$
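The weighted combination can be illustrated with a short sketch. It assumes the criterion has the form CS = -1 + 2(αP_con + βP_att + γP_dist) with weights summing to 1, so that CS ranges from -1 to 1. The default weight values below are arbitrary placeholders, not values from this work.

```python
def integral_criterion(p_con, p_att, p_dist, alpha=0.5, beta=0.2, gamma=0.3):
    """Combine the three matching parameters into one score in [-1, 1]."""
    # The weight factors are required to sum to 1 (placeholder defaults do).
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weight factors must sum to 1")
    return -1 + 2 * (alpha * p_con + beta * p_att + gamma * p_dist)


# Weighted sum 0.5*0.5 + 0.2*1.0 + 0.3*0.5 = 0.6, rescaled to about 0.2.
example = integral_criterion(0.5, 1.0, 0.5)
```

Rescaling to the range from -1 to 1 means that poorly matching pairs get negative scores, which lets a maximizing search prefer leaving a node unmatched over a bad match.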
Matrix of integral matching criteria
Let us consider a simplified case of creating the matrix of integral matching criteria.
Let $x_1, x_2, x_3, x_4$ be the text parts of the old XML document version (tree $T_1$) and
$y_1, y_2$ the text parts of the new XML document version (tree $T_2$). Accordingly, we can
assume that two text parts of the old document version were deleted and the two other text
parts were changed. It is necessary to find which parts were deleted, which were changed,
and what changes were made. For this, a matching between nodes $x_1, x_2, x_3, x_4$ and
nodes $y_1, y_2$ must be found.
First of all, we need to find the integral matching criterion for each pair of nodes. For
nodes $x_1$ and $y_1$:
$$CS(x_1, y_1) = -1 + 2 \cdot (\alpha \cdot P_{con}(x_1, y_1) + \beta \cdot P_{att}(x_1, y_1) + \gamma \cdot P_{dist}(x_1, y_1))$$
Similarly, we can obtain the integral matching criteria for all pairs of nodes. In this
case the matrix of integral matching criteria is the following (Table 1).
Table 1. Matrix of integral matching criteria
        x1            x2            x3            x4
y1      CS(x1, y1)    CS(x2, y1)    CS(x3, y1)    CS(x4, y1)
y2      CS(x1, y2)    CS(x2, y2)    CS(x3, y2)    CS(x4, y2)
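Building such a matrix amounts to evaluating the criterion for every pair of old and new nodes. The sketch below shows the layout only: `criteria_matrix` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for CS that uses word overlap alone on non-empty texts.

```python
def criteria_matrix(old_nodes, new_nodes, score):
    """Rows correspond to new nodes y, columns to old nodes x, as in Table 1."""
    return [[score(x, y) for x in old_nodes] for y in new_nodes]


def toy_score(x, y):
    # Hypothetical stand-in for CS: word overlap rescaled to [-1, 1].
    words_x, words_y = set(x.split()), set(y.split())
    return -1 + 2 * len(words_x & words_y) / len(words_x | words_y)


old_parts = ["red apple", "green pear", "blue plum", "ripe fig"]
new_parts = ["red apple pie", "blue plum"]
matrix = criteria_matrix(old_parts, new_parts, toy_score)
```

The result has one row per new text part and one column per old text part, so `matrix[i][j]` plays the role of CS(x_{j+1}, y_{i+1}).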
Mathematical modeling of the good matching search for XML document versions
When comparing the old and new document versions, it is assumed that one node in the
old version can match at most one node in the new version and vice versa. Thus the good
matching search problem turns into an optimal path search problem, namely a transportation
problem, which is a linear programming task. The optimal matching is the solution for which
the sum of integral matching criteria is maximal.
To formalize this task, a connectedness matrix between the nodes of tree $T_1$ and tree
$T_2$ is introduced, where $T_1$ and $T_2$ correspond to the old and new document versions,
respectively.
Table 2. The connectedness matrix of new and old XML document versions
        x1     x2     x3     x4
y1      a11    a12    a13    a14
y2      a21    a22    a23    a24
Thus the linear programming task can be formalized as follows:
$$
\begin{aligned}
& CS(x_1, y_1)\,a_{11} + CS(x_2, y_1)\,a_{12} + CS(x_3, y_1)\,a_{13} + CS(x_4, y_1)\,a_{14} \\
& \quad + CS(x_1, y_2)\,a_{21} + CS(x_2, y_2)\,a_{22} + CS(x_3, y_2)\,a_{23} + CS(x_4, y_2)\,a_{24} \to \max
\end{aligned}
$$
subject to
$$
\begin{aligned}
a_{11} + a_{12} + a_{13} + a_{14} &\le 1 && (1) \\
a_{21} + a_{22} + a_{23} + a_{24} &\le 1 && (2) \\
a_{11} + a_{21} &\le 1 && (3) \\
a_{12} + a_{22} &\le 1 && (4) \\
a_{13} + a_{23} &\le 1 && (5) \\
a_{14} + a_{24} &\le 1 && (6) \\
a_{11}, a_{12}, a_{13}, a_{14}, a_{21}, a_{22}, a_{23}, a_{24} &\ge 0 && (7)
\end{aligned}
$$
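For a matrix as small as the 2 x 4 example, the integer optimum of this task can be found by simply enumerating all admissible matchings. The sketch below does that with the standard library only. It is an exhaustive search, not a linear programming solver, and is meant only to show what the optimal solution looks like. The function name `best_matching` and the numeric scores are made up for the example.

```python
from itertools import permutations


def best_matching(cs):
    """cs[i][j] is the score of pairing new node y_i with old node x_j.

    Returns (total, assignment), where assignment maps a row index to a column
    index. Every row and every column is used at most once; rows may stay unmatched.
    """
    rows, cols = len(cs), len(cs[0])
    # Extra None entries let any number of rows remain unmatched.
    options = list(range(cols)) + [None] * rows
    best_total, best_assignment = float("-inf"), {}
    for choice in permutations(options, rows):
        total = sum(cs[i][j] for i, j in enumerate(choice) if j is not None)
        if total > best_total:
            best_total = total
            best_assignment = {i: j for i, j in enumerate(choice) if j is not None}
    return best_total, best_assignment


# Made-up scores: rows are y1, y2 and columns are x1..x4.
scores = [
    [0.9, -0.2, 0.1, -0.8],
    [0.3, -0.5, 0.8, -0.1],
]
total, assignment = best_matching(scores)
# Old nodes that received no partner are candidates for "deleted".
unmatched_old = [j for j in range(4) if j not in assignment.values()]
```

Here y1 is paired with x1 and y2 with x3, while x2 and x4 stay unmatched, which corresponds to the deleted text parts in the simplified case. For larger documents the enumeration grows factorially, so a linear programming or assignment solver would be needed in practice.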
Conclusions
In this paper a new algorithm was proposed that allows differences between XML
documents to be detected efficiently in a quantitative way. The algorithm introduces a new
approach, which includes determining the similarity between nodes and solving the good
matching search problem as a linear programming task.
References
[1] S. Chawathe, H. Garcia-Molina. Meaningful change detection in structured data.
Proceedings of the ACM SIGMOD International Conference on Management of Data,
Tucson, Arizona, May 1997, pp. 26–37.
[2] S. Chawathe, S. Abiteboul, J. Widom. Representing and querying changes in
semistructured data. Proceedings of the International Conference on Data Engineering,
Orlando, Florida, February 1998, pp. 4–13.
[3] M.O. Alieksieiev, Y.M. Molchanov, O.M. Alekseyev. Publish/Subscribe System for R&D
Information Resources. “Visnyk SumDU”, #2, 2009, Sumy. ISSN 1817-9215.
[4] S. Flesca, E. Masciari. Efficient and effective Web change detection. Data & Knowledge
Engineering 46, 2003, pp. 203–224.
[5] G. Cobena, S. Abiteboul, A. Marian. Detecting Changes in XML Documents.
SIGMOD, 25(2):493–504, 2002.