Larisa Globa, Mykola Alieksieiev, Iurii Molchanov, Liudmyla Kobzar
National Technical University of Ukraine „Kyiv Polytechnic Institute”, Institute for Telecommunication Systems

XML Documents Change Detection System

Abstract

XML is a widely used standard for data representation in different applications. Characteristics of XML such as its tree structure and self-description make it difficult to detect changes in an XML document at the document level. Currently there are no effective methods for detecting changes in XML documents. This paper presents a new algorithm for detecting changes between the current and the previous versions of a monitored XML document. The technique can be used effectively to discover changes between text parts of the initial and new documents. Rather than looking for a minimum-cost edit distance between the two trees that represent the two versions of the XML document, we propose an algorithm based on linear programming that detects changes by calculating the difference between nodes in the old and new versions. The algorithm detects these changes by comparing nodes not only by their content but also by their attributes and their position in the tree structure. The proposed technique considers only those parts of XML documents that are meaningful for end users.

Keywords: XML document, change detection, publish/subscribe system.

Introduction

To detect changes effectively in a certain part of a web page, the corresponding part must first be located in the new document. Since changes may be made anywhere in the new document version, the location of the text that interests the subscriber may change. Thus the part of the new document that matches the text in the old version well must be found first. In previous works [1,2,3] this problem was solved using minimum-cost edit distance.
But this approach is not optimal for page monitoring, because the result of the algorithm requires additional processing to find the changed parts. The approach also relies on the availability of a “good matching” between the nodes of the considered documents, and finding such a “good matching” is a difficult, resource-intensive problem. Another way to find matching parts in the old and new documents is to compute a similarity measure between parts of an HTML document. A set of parameters for determining such a similarity measure was proposed in [4]. Building on the approach of [4], this work applies matching parameters to parts of XML documents. These parameters are then used to solve the “good matching” search problem.

The main contributions of this work are as follows: first, we introduce matching criteria to compare XML nodes by their content, attributes and position; second, we propose a new method to map XML nodes to each other based on the matching criteria.

Main part

We treat an XML document as an ordered tree, in which the left-to-right order among siblings is significant. We assume that not only text nodes can change but also other properties of nodes, such as node attributes or node position. Our goal is first to introduce a method for comparing XML nodes, and then to match each element value in the old version with its corresponding value in the new version in order to map the nodes of the old tree to the nodes of the new tree.

According to the formulation of the change detection problem between XML parts, the following conditions were set:

1. The XML document is considered as an ordered tree, in which nodes are ordered from left to right.
2. Matching is searched only for the leaves of the XML document, whose values contain the text information most important for the subscriber.
3. When XML tags are compared, attributes are considered matched only if both their names and their values match.
4. During the matching search, the order of the child nodes of any parent node in the XML tree is taken into account, as is the position of the nodes in the XML tree hierarchy, based on the XML tag indexing introduced in this work.

First we describe the complete change detection algorithm between XML document versions, which is presented in Fig. 1.

Fig. 1. Algorithm flowchart

XML document indexing is necessary in order to consider the nodes of the resulting XML tree in left-to-right order and to track node positions in the XML tree hierarchy. It is described later. The step “Calculation of matching parameters” identifies the tags important for the subscriber, which are used for comparison. Usually these are XML tags with text content that interests the subscriber, and they are usually leaf nodes of the XML tree. The parameters found at the step “Calculation of matching parameters” are calculated for all pairs of chosen nodes in the old and new versions of the XML tree. They are then used to build the matrix of integral matching criteria. The integer solution obtained at the step “Resolving of formulated linear programming task” gives the best match between the given documents under the accepted conditions.

XML document indexing

Since the XML document is considered as an ordered tree in which the left-to-right order of nodes is taken into account, a change in the order of child nodes within a parent node is important for the subscriber and should be detected by the developed publish/subscribe system. To implement this function, it was decided to index the XML document with numerical values. This makes it possible to take the left-to-right order of nodes into account and to track node positions in the XML tree hierarchy. For each tag in the analyzed document an attribute ‘index’ is added. The index value is chosen according to the following rules:

1. index=1 for the root tag.
2. index=1.i for a child node of the root tag, where i is the number of the child node in left-to-right order.
3. index=1.i.j, accordingly, for the j-th child node of the i-th parent node.

The initial and indexed versions of an XML document are shown in Fig. 2.

Initial version:

    <books>
      <book>
        <title>Title1</title>
        <author>
          <name>Name1</name>
          <surname>Surname1</surname>
        </author>
        <edition>Edition1</edition>
      </book>
      <book>
        <title>Title2</title>
        ...

Indexed version:

    <books index="1">
      <book index="1.1">
        <title index="1.1.1">Title1</title>
        <author index="1.1.2">
          <name index="1.1.2.1">Name1</name>
          <surname index="1.1.2.2">Surname1</surname>
        </author>
        <edition index="1.1.3">Edition1</edition>
      </book>
      <book index="1.2">
        <title index="1.2.1">Title2</title>
        ...

Fig. 2. Process of XML document indexing

Thus, by applying the proposed rules, the XML document is correctly transformed into an XML tree, preserving both the hierarchy and the child node order of the initial XML document.

Applying of matching parameters to XML document nodes

Let an XML document version be represented by a tree $T_1$. The tree $T_1$ is then characterized by the following parameters:

$N$ is the number of nodes in tree $T_1$, which corresponds to the number of tags in the XML document;
$R = \{r_i \mid i = 1 \dots m\}$ is the set of parent nodes in tree $T_1$, where $m$ is the number of parent nodes and $r_i$ is the parent node of node $i$;
$A = \{a_i \mid i = 1 \dots N\}$ is the set of node attributes in tree $T_1$, where $a_i$ is the attribute of the $i$-th node;
$con(x_i)$ is the content of the $i$-th node, where $x_i$ is the $i$-th node of tree $T_1$.

The XML document shown in Fig. 2 can be presented as the tree shown in Fig. 3.
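The indexing rules of the section "XML document indexing" can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the standard library module `xml.etree.ElementTree`, and the function name `add_indexes` and the sample document `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_indexes(element, index="1"):
    """Annotate `element` and all its descendants with a dotted 'index' attribute."""
    element.set("index", index)
    # Iterating over an Element yields its children in document (left-to-right) order.
    for position, child in enumerate(element, start=1):
        add_indexes(child, f"{index}.{position}")
    return element


# Hypothetical sample modelled on the document in Fig. 2.
SAMPLE = """<books>
  <book>
    <title>Title1</title>
    <author><name>Name1</name><surname>Surname1</surname></author>
    <edition>Edition1</edition>
  </book>
  <book>
    <title>Title2</title>
  </book>
</books>"""

root = add_indexes(ET.fromstring(SAMPLE))

# Map the text of every leaf to the index assigned to its tag.
index_of_text = {
    el.text.strip(): el.get("index")
    for el in root.iter()
    if el.text and el.text.strip()
}
```

Because the index is built from the parent's index plus the child's position, moving a child within its parent changes the index of that child and of its following siblings, which is what allows a change of order to be noticed later.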
    books: con(books)=0, a(index(books))=1
      book: con(book)=0, a(index(book))=1.1
        title: con(title)=Title1, a(index(title))=1.1.1
        author: con(author)=0, a(index(author))=1.1.2
          name: con(name)=Name1, a(index(name))=1.1.2.1
          surname: con(surname)=Surname1, a(index(surname))=1.1.2.2
        edition: con(edition)=Edition1, a(index(edition))=1.1.3
      book: con(book)=0, a(index(book))=1.2
        title: con(title)=Title2, a(index(title))=1.2.1
        author: con(author)=0, a(index(author))=1.2.2
          name: con(name)=Name2, a(index(name))=1.2.2.1
          surname: con(surname)=Surname2, a(index(surname))=1.2.2.2
        edition: con(edition)=Edition2, a(index(edition))=1.2.3

Fig. 3. Representation of the XML tree

Thus the document tree is an ordered tree whose elements are characterized by their positions and their related sets of attributes. The parts of text displayed on the web page are the leaves of the document tree. For a given node $n$ of the document tree $T$, let $T(n)$ denote the subtree of $T$ rooted at $n$.

We introduce the content matching parameter of nodes $x_1$ and $x_2$ as follows:

$$P_{con}(x_1, x_2) = \frac{|con(x_1) \cap con(x_2)|}{|con(x_1) \cup con(x_2)|}$$

The parameter $P_{con}(x_1, x_2)$ returns the fraction of words that appear in both nodes $x_1$ and $x_2$.

The attribute matching parameter between nodes $x_1$ and $x_2$ is obtained as follows:

$$P_{att}(x_1, x_2) = \frac{\sum_{a_i \in \{a(r_1) \cap a(r_2)\}} a_i}{\sum_{a_i \in \{a(r_1) \cup a(r_2)\}} a_i}$$

Here $a(r_1)$ and $a(r_2)$ denote the attribute sets of the two compared nodes. The parameter $P_{att}(x_1, x_2)$ measures the relative weight of the attributes that have the same value in $x_1$ and $x_2$. In XML an attribute may take different values in different documents, since the syntax of the language does not define default attribute values. Within a given document the attributes used have unique values. Therefore the weight functions proposed in [4] cannot be applied when calculating the attribute matching parameter in an XML tree. Thus all attributes are treated as equivalent in the proposed formula, and only identical attributes of the two nodes are taken into consideration.
Since the old and new versions of the same XML document are compared, we assume that the names of attributes coincide in both documents. Accordingly, identical attributes are those that have identical names and values.

The position matching parameter is obtained from the following expression:

$$P_{dist}(x_1, x_2) = \frac{suf(index(x_1), index(x_2))}{max(index(x_1), index(x_2))}$$

In this expression the function $suf$ gives the length of the common suffix of $index(x_1)$ and $index(x_2)$, the attributes that define the positions of nodes $x_1$ and $x_2$ in the XML tree hierarchy. The function $max$ gives the length of the longer of the two attributes $index(x_1)$ and $index(x_2)$.

An expression for the integral matching criterion is needed that combines the content, attribute and position matching parameters of two nodes. Because some parameters may be considered more relevant than others, they are weighted by weight factors in this expression. Let $\alpha$, $\beta$, $\gamma$ be the weight factors for $P_{con}(x_1, x_2)$, $P_{att}(x_1, x_2)$, $P_{dist}(x_1, x_2)$ respectively. Then $\alpha + \beta + \gamma = 1$, and the integral matching criterion is obtained as follows:

$$CS(x_1, x_2) = -1 + 2^{(\alpha P_{con}(x_1, x_2) + \beta P_{att}(x_1, x_2) + \gamma P_{dist}(x_1, x_2))}$$

Matrix of integral matching criteria

Let us consider a simplified case of building the matrix of integral matching criteria. Let $x_1, x_2, x_3, x_4$ be the text parts of the old XML document version (tree $T_1$) and $y_1, y_2$ the text parts of the new XML document version (tree $T_2$). Accordingly, we can assume that two text parts of the old document version were deleted and the two other text parts were changed. It is necessary to find which parts were deleted, which were changed, and what changes were made. This requires finding the matching between nodes $x_1, x_2, x_3, x_4$ and nodes $y_1, y_2$. First of all, we need to find the integral matching criterion for each pair of nodes.
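Computing the three matching parameters and the integral criterion for one pair of nodes can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: content is compared as sets of whitespace-separated words, attributes as sets of (name, value) pairs with equal weight, and indexes as dot-separated strings. The form `-1 + 2 ** (weighted sum)` used for the criterion, the default weights, and the convention that two empty sets match fully are all assumptions of this sketch, as are the function names.

```python
def p_con(text1, text2):
    """Content matching: distinct words shared by both contents over all distinct words."""
    words1, words2 = set(text1.split()), set(text2.split())
    union = words1 | words2
    # Assumption: two empty contents are treated as a full match.
    return len(words1 & words2) / len(union) if union else 1.0


def p_att(attrs1, attrs2):
    """Attribute matching: attributes with identical name and value over all attributes."""
    pairs1, pairs2 = set(attrs1.items()), set(attrs2.items())
    union = pairs1 | pairs2
    # Assumption: two nodes without attributes are treated as a full match.
    return len(pairs1 & pairs2) / len(union) if union else 1.0


def p_dist(index1, index2):
    """Position matching: common suffix length of two dotted indexes over the longer length."""
    parts1, parts2 = index1.split("."), index2.split(".")
    common = 0
    for a, b in zip(reversed(parts1), reversed(parts2)):
        if a != b:
            break
        common += 1
    return common / max(len(parts1), len(parts2))


def cs(p_content, p_attribute, p_position, alpha=0.5, beta=0.25, gamma=0.25):
    """Integral criterion in the assumed form -1 + 2 ** (weighted sum of parameters)."""
    # The weight factors must sum to 1; the default values are hypothetical.
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("weight factors must sum to 1")
    return -1 + 2 ** (alpha * p_content + beta * p_attribute + gamma * p_position)


# Hypothetical pair of leaf nodes.
content = p_con("XML change detection", "XML change tracking")          # 2 shared of 4 words
attributes = p_att({"lang": "en", "id": "7"}, {"lang": "en", "id": "8"})  # 1 shared of 3 pairs
position = p_dist("1.1.2.1", "1.2.2.1")                                   # suffix "2.1" of 4 parts
score = cs(content, attributes, position)
```

With this form of the criterion a pair whose three parameters are all 1 scores 1, and a pair whose parameters are all 0 scores 0, so the values stay in the interval from 0 to 1.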
For nodes $x_1$ and $y_1$:

$$CS(x_1, y_1) = -1 + 2^{(\alpha P_{con}(x_1, y_1) + \beta P_{att}(x_1, y_1) + \gamma P_{dist}(x_1, y_1))}$$

The integral matching criteria for all other pairs of nodes are obtained similarly. In this case the matrix of integral matching criteria is as shown in Table 1.

Table 1. Matrix of integral matching criteria

          x1             x2             x3             x4
    y1    CS(x1, y1)     CS(x2, y1)     CS(x3, y1)     CS(x4, y1)
    y2    CS(x1, y2)     CS(x2, y2)     CS(x3, y2)     CS(x4, y2)

Mathematical modeling of the good matching search for XML document versions

When comparing the old and new document versions, it was agreed that one node in the old version can match no more than one node in the new version, and vice versa. Thus the good matching search problem reduces to an optimal path search problem of the transportation type, which is a linear programming task. To find the optimal matching, a solution must be found for which the sum of the integral matching criteria is maximal. To formalize this task, a connectedness matrix between the nodes of tree $T_1$ and tree $T_2$ is introduced, where $T_1$ and $T_2$ correspond to the old and new document versions respectively.

Table 2. The connectedness matrix of the new and old XML document versions

          x1     x2     x3     x4
    y1    a11    a12    a13    a14
    y2    a21    a22    a23    a24

Thus the linear programming task can be formalized as follows:

$$CS(x_1, y_1) a_{11} + CS(x_2, y_1) a_{12} + CS(x_3, y_1) a_{13} + CS(x_4, y_1) a_{14} + CS(x_1, y_2) a_{21} + CS(x_2, y_2) a_{22} + CS(x_3, y_2) a_{23} + CS(x_4, y_2) a_{24} \to \max$$

$$a_{11} + a_{12} + a_{13} + a_{14} \le 1 \qquad (1)$$
$$a_{21} + a_{22} + a_{23} + a_{24} \le 1 \qquad (2)$$
$$a_{11} + a_{21} \le 1 \qquad (3)$$
$$a_{12} + a_{22} \le 1 \qquad (4)$$
$$a_{13} + a_{23} \le 1 \qquad (5)$$
$$a_{14} + a_{24} \le 1 \qquad (6)$$
$$a_{11}, a_{12}, a_{13}, a_{14}, a_{21}, a_{22}, a_{23}, a_{24} \ge 0 \qquad (7)$$

Conclusions

This paper proposed a new algorithm that allows differences between XML document versions to be detected efficiently and in a quantitative way. The algorithm introduces a new approach that determines the similarity between nodes and solves the good matching search problem as a linear programming task.
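The matching task formalized in (1)-(7) can be illustrated on the 2 x 4 example with a short Python sketch. To stay self-contained it does not call a linear programming solver. Instead it enumerates every assignment of new nodes to distinct old nodes and keeps the one with the largest sum of criteria, which gives the same integer optimum for a small instance such as this one. The criterion values in `cs_matrix`, the function name `best_matching`, and the restriction that there are no more new nodes than old nodes are assumptions of this sketch. A practical system would hand the same objective and constraints to an LP or assignment solver.

```python
from itertools import permutations


def best_matching(cs_matrix):
    """Return (total, {new_index: old_index}) maximizing the sum of matched criteria.

    cs_matrix[i][j] holds CS(x_j, y_i): rows are new nodes, columns are old nodes.
    """
    n_new, n_old = len(cs_matrix), len(cs_matrix[0])
    if n_new > n_old:
        raise ValueError("this sketch assumes no more new nodes than old nodes")
    best_total, best_columns = -1.0, None
    # Each permutation assigns every new node y_i to a distinct old node x_j, so no
    # node is matched twice. Since the criteria are non-negative, leaving a new node
    # unmatched can never increase the sum.
    for columns in permutations(range(n_old), n_new):
        total = sum(cs_matrix[i][j] for i, j in enumerate(columns))
        if total > best_total:
            best_total, best_columns = total, columns
    return best_total, dict(enumerate(best_columns))


# Hypothetical criterion values: rows y1, y2; columns x1, x2, x3, x4.
cs_matrix = [
    [0.9, 0.2, 0.1, 0.3],
    [0.8, 0.1, 0.7, 0.2],
]

total, matching = best_matching(cs_matrix)
# Old nodes left without a partner are the candidates for "deleted" text parts.
deleted = sorted(set(range(4)) - set(matching.values()))
```

In this hypothetical instance y1 is matched with x1 and y2 with x3, so x2 and x4 are reported as deleted. A matched pair whose criterion is below 1 marks a text part that was changed rather than kept intact.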
References

[1] S. Chawathe, H. Garcia-Molina. Meaningful change detection in structured data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997, pp. 26–37.
[2] S. Chawathe, S. Abiteboul, J. Widom. Representing and querying changes in semistructured data. Proceedings of the International Conference on Data Engineering, Orlando, Florida, February 1998, pp. 4–13.
[3] M.O. Alieksieiev, Y.M. Molchanov, O.M. Alekseyev. Publish/Subscribe System for R&D Information Resources. “Visnyk SumDU”, #2, 2009, Sumy. ISSN 1817-9215.
[4] S. Flesca, E. Masciari. Efficient and effective Web change detection. Data & Knowledge Engineering 46, 2003, pp. 203–224.
[5] Gregory Cobena, Serge Abiteboul, Amelie Marian. Detecting Changes in XML Documents. SIGMOD, 25(2):493–504, 2002.