A New Sequential Mining Approach to XML Document Similarity Computation^1
Ho-pong Leung, Fu-lai Chung^2 and Stephen Chi-fai Chan
Department of Computing
Hong Kong Polytechnic University
Hunghom, Kowloon, Hong Kong.
{csleung, cskchung, csschan}@comp.polyu.edu.hk
Abstract - Measuring the structural similarity among XML documents is the task of finding their semantic
correspondence and is fundamental to many web-based applications. While several methods exist to address
the problem, the data mining approach is a novel and promising one. It extracts paths from XML documents,
encodes them as sequences and finds the maximal frequent sequences using sequential pattern mining
algorithms. Because ignoring the hierarchical information when encoding the paths for mining leads to
deficiencies, a new sequential pattern mining scheme for XML document similarity computation is proposed in
this paper. It makes use of a preorder tree representation (PTR) to encode the XML tree’s paths so that both the
semantics of the elements and the hierarchical structure of the document are taken into account when
computing the structural similarity among documents. In addition, it includes a post-processing step that reuses the
mined patterns to estimate the similarity of unmatched elements, so that another metric to qualify the similarity
between XML documents can be introduced. Encouraging experimental results were obtained and are reported.
1. Introduction
Extensible Markup Language (XML) [1] is a markup language derived from Standard Generalized Markup
Language (SGML) [2] that uses a hierarchical format for encoding structured data within a document. It
allows the web publishers to declare their own elements and attributes using arbitrary words so that customized
document formats can be set up. Each XML element embeds the meaning of its corresponding data. Hence, the
elements and their arrangement in the hierarchy not only describe the XML document structure, but also implicitly
provide its semantic meanings that will be very useful for the manipulation of XML documents in many
applications.
Measuring the structural similarity among XML documents has been an active area of research in the past few
years [3,4] and it is fundamental to many applications, such as integrating XML data sources [5,6], XML
dissemination [7,8], XML routing [9,10] and XML repositories [11]. This problem is by no means trivial due to the
irregular and incomplete properties of XML documents. For example, XML documents may contain the same data,
but they may have different structures. Even when they use the same Document Type Definition (DTD), their tree
structures may not be identical. Various methods have been proposed to solve the problem. In [4,12,13], the (tree)
edit distance is adopted to measure the similarity among XML documents. These methods find the minimum-cost
sequence of edit operations that transforms one XML tree into another, and the sequence obtained is then used to
compute the similarity. However, it is well known that any edit distance measure critically depends on the costs of
the underlying edit operations, and the problem of how to obtain these edit costs is still unsolved [14].
^1 Manuscript submitted to Postgraduate Research Day
^2 Corresponding author
The concept of frequent tree patterns for determining XML similarity was introduced in [3,15,16]. This approach
uses data mining techniques to find repetitive (document) structures for determining the similarity
between documents. References [15,16] proposed to represent a semi-structured document by a tag tree pattern and
to mine maximal frequent tag tree patterns in semi-structured documents. However, if the XML document
set is large, the tag tree will consume a huge amount of storage space. In [3], Lee et al. define the structural
similarity as the number of paths that are common and similar between the hierarchical structures of the XML
documents. They proposed to compute a “minimal” hierarchical structure of an XML document using automata and
to determine the frequent paths of a tree using an adapted sequential mining approach. This method provides an
accurate quantitative computation of similarity between XML documents, and encouraging experimental results have
been reported. However, the determination of similar paths does not take the hierarchical information into account,
i.e., the level of the hierarchy at which an element is located, and as a result several weaknesses of the method can
be observed.
Firstly, XML documents with dissimilar hierarchical structures but many common paths (located at different
levels of the hierarchies) will have very high similarity. Consider the two XML trees T and Q in Fig.1, where tree P
matches trees T and Q with a common root element labeled "A" and the sub-elements labeled "C" and "D", i.e.,
the sub-tree R of Fig.1 is their common tree. Here, the common elements C and D are not located at the same level of
the two hierarchies (of T and Q), but the high similarity value computed by Lee et al.’s method will indicate that
they are exactly the same. Secondly, the proposed similarity computation method restricts the measurement of element
similarity to the pre-processing stage and prevents the use of the level information of common elements to discover
synonym elements in the mining stage. Given trees P and Q of Fig.1, Lee et al.’s method will return sub-tree R
as the common tree. It will not produce an unknown node between the common root and the common leaf nodes, as
exemplified by tree S in Fig.1, where the level information remains unchanged. If the method could return tree S as the
common tree, we could reuse this common tree to determine the location of synonym elements, and thus the effort of
identifying their possible locations could be saved. Further, we can compute the similarity between
each pair of unknown elements, i.e., B' and B", by their surrounding elements. In this paper, we propose to take this
element similarity into account because it can be used to determine the degree of semantic heterogeneity between
the XML documents. In some cases, it can further qualify the similarity between the XML documents. The third
weakness of Lee et al.’s method is that ignoring the hierarchical information makes it too restrictive to
measure the overall hierarchical structure of XML documents. Consider trees U and V of Fig.1. Lee et al.’s method
will return the sub-tree R as the common tree and will not compare the overall hierarchical structure of the
unmatched nodes between them.
[Figure: tree diagrams of Tree P, Tree Q, Sub-Tree R, Tree S, Tree T, Tree U and Tree V]
Fig.1 Examples of XML document tree
In this paper, we further pursue the sequential pattern mining approach to compute the structural similarity of
XML documents. Here, the structural similarity means the number of paths and elements that are common and
similar among the hierarchical structures of the XML documents. A simple method composed of a pre-processing
step and a post-processing step (for the mining engine) is proposed, through which the XML semantics can be
determined and hence the similarity between the XML documents can be computed appropriately. To overcome the
aforementioned weaknesses of Lee et al.’s method, the proposed method takes into account the change of level
in the tree hierarchy as well as the elements along the path. This benefits the similarity computation and also
facilitates the determination of synonyms (common unmatched elements) between the paths. The remainder of this
paper is structured as follows. In Section 2, the problem of sequential mining of frequent XML document tree
patterns and the proposed pre-processing method are presented. The similarity computation and the way to identify
synonyms between XML documents are described in Section 3. Section 4 reports the experimental results. The
final section concludes the paper and outlines future work.
2. Mining Frequent XML Document Tree Patterns
In this section, we first introduce a pre-processing step for the incorporation of hierarchical information in
encoding the XML tree’s paths. It is based on the preorder tree representation (PTR) [17] and will be introduced
after a brief review of how to generate an XML tree from an XML document. We then describe the sequential
pattern mining approach to compute the similarity between two sets of encoded paths, i.e., two XML documents.
2.1 Generating encoded paths from XML document
The XML’s hierarchical structure can be represented by a labeled rooted tree [18]. Fig.2 shows the
correspondence of an XML document and its XML tree. An XML tree is a rooted tree, where each node represents an
element in the XML document and the children of each node are the sub-elements of that node. In the rest of the
paper, each XML document is represented as a labeled tree and the values of the elements in the tree are not
considered, i.e., only the structure of the XML document is considered.
<SigmodRecord>
  <issue>
    <volume>19</volume>
    <number>7</number>
    <articles>
      <article>
        <title>Example XML</title>
        <initPage>2</initPage>
        <endPage>3</endPage>
        <authors>
          <author>Peter</author>
        </authors>
      </article>
    </articles>
  </issue>
</SigmodRecord>
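[Figure: XML tree of the document above. SigmodRecord is the root with child issue; issue has children volume, number and articles; articles has child article; article has children title, initPage, endPage and authors; authors has child author.]
Fig.2 An XML tree generation example
( ( ( ) ( ) ( ( ( ) ( ) ( ) ( ( ) ) ) ) ) )
Representation of XML tree in Figure 2 using parenthesis system
[Figure: the structural relationship of the XML tree in Figure 2, with the parentheses rewritten level by level under their parent nodes]
Fig.3 The parenthesis system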
There are many ways to represent tree structures and one of them is the parenthesis system [17]. In this system,
a tree is defined by a sequence of parentheses and it consists of a root and a sequence of sub-trees. Each tree is
enclosed in parentheses of its parent node. By rewriting the parentheses at different levels and their correspondence
to the parent node of the tree, the structural relationship can be easily seen. As the parenthesis system is a space-efficient representation of the tree structure, it is extremely useful in applications where very large trees need to be
stored. Fig.3 shows the representation of the XML tree in Fig.2 using the parenthesis system.
In this paper, we have adopted another representation to encode the XML tree, i.e., the preorder tree
representation (PTR) [17]. It is an extension of the parenthesis system where the information of each node is listed
before its sub-tree and a tree is simply represented by a list of nodes and their parentheses. The key difference
between the parenthesis system and the preorder tree representation is the inclusion of the node labels in the
representation. Fig.4 shows an example of the XML tree in PTR.
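The relation between the two representations can be sketched in Python. This is a minimal illustration under stated assumptions, not code from the paper: the tree is stored as nested (label, children) tuples, the PTR string is produced by writing each node label right after its opening parenthesis (following the leading "( 1 ( 2" of Fig.4), and the function names are hypothetical.

```python
# Hypothetical helpers; the (label, children) tuple layout is an assumption.
def parenthesis_system(node):
    """Structure only: one pair of parentheses per node."""
    _label, children = node
    return "(" + "".join(parenthesis_system(c) for c in children) + ")"


def ptr(node):
    """Preorder tree representation: the label is listed before the sub-trees."""
    label, children = node
    return "(" + label + "".join(ptr(c) for c in children) + ")"


def leaf(label):
    return (label, [])


# The XML tree of Fig.2, element values ignored.
tree = ("SigmodRecord", [
    ("issue", [
        leaf("volume"),
        leaf("number"),
        ("articles", [
            ("article", [
                leaf("title"), leaf("initPage"), leaf("endPage"),
                ("authors", [leaf("author")]),
            ]),
        ]),
    ]),
])

print(parenthesis_system(tree))
print(ptr(tree))
```

Dropping the labels from the PTR string gives back the parenthesis string, which is the sense in which PTR extends the parenthesis system.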
As the XML’s hierarchical structure is represented as a labeled, rooted tree, a path can be represented by the
elements from the root to a leaf. According to the preorder representation, each element followed by a left
parenthesis denotes a change of level in the tree hierarchy, i.e., going down the tree. The left parenthesis is thus the
start symbol for traversing the sub-tree below a non-leaf node. The right parentheses carry no additional information
within a single path and hence can be removed from it. More importantly, removing them reduces the mining time
significantly. Under this representation, the level and the structure of the tree can be represented by a set of paths.
Each path contains not only its element information, but also the level of the hierarchy of every element in the path.
The structure of an XML document can be partitioned into multiple units or sub-trees by the level of the tree
structure. Each unit is associated with some document contents and indicates the amount of information included.
The higher level of the tree structure contains more information at a coarser resolution while the lower level of the
tree consists of less information at a finer resolution. Besides, the level of the tree also facilitates the determination
of common unmatched elements between the paths and it will be described in the next section. Therefore, if any
such paths are found frequent by the sequential mining algorithm, the structural similarity between XML documents
can be computed by determining the number of paths and their levels of hierarchy that are common and similar. To
do so, we have to first go through the following four preprocessing steps for XML document representation. They
are illustrated in Fig.5.
Step 1: Conversion
Convert the XML document to tree format. The values of the elements in the tree are not considered here and only
the structural information like that in Fig.2 will be passed to the subsequent steps.
Step 2: Path Extraction
Traverse the elements from the root to each leaf node of the tree. Record the sequence and hierarchical information
for each path.
Step 3: Duplicated Path Removal
Remove any duplicated path. The duplicated paths in the tree are not considered here and only the unique ones will
be passed to the next step.
Step 4: Path Encoding
Encode each path by the preorder tree representation. The root of the path is listed before the lower level elements
and all the traversed elements are sorted by their levels in the tree. The path is then represented by the elements from
the root to the leaf and separated by the left parentheses.
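The four steps above can be sketched as follows. This is a minimal sketch assuming the document fits in memory and is parsed with Python's standard `xml.etree.ElementTree`; the function names `extract_paths` and `encode_paths` are illustrative and do not come from the paper.

```python
import xml.etree.ElementTree as ET


def extract_paths(elem, prefix=()):
    """Step 2: yield the tag sequence of every root-to-leaf path."""
    path = prefix + (elem.tag,)
    children = list(elem)
    if not children:
        yield path
    for child in children:
        yield from extract_paths(child, path)


def encode_paths(xml_text):
    """Steps 1-4: parse, extract paths, drop duplicates, insert '(' per level."""
    root = ET.fromstring(xml_text)       # Step 1: element values are ignored
    seen, encoded = set(), []
    for path in extract_paths(root):
        if path in seen:                 # Step 3: keep unique paths only
            continue
        seen.add(path)
        tokens = []
        for depth, tag in enumerate(path):
            if depth:                    # Step 4: '(' marks one level down
                tokens.append("(")
            tokens.append(tag)
        encoded.append(tokens)
    return encoded


for tokens in encode_paths("<A><B><C/><D/><C/></B></A>"):
    print(", ".join(tokens))
```

The duplicated path A-B-C in the sample input is emitted once, so the sketch prints the two encoded paths "A, (, B, (, C" and "A, (, B, (, D".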
As mentioned in the introduction, Lee et al.’s method [3] only considers the common elements in the
paths and ignores the level of hierarchy of the elements. With Step 4 above, the level of hierarchy is
introduced via the left parentheses in the encoded path. Fig.6 illustrates the differences between the proposed PTR
encoding scheme and Lee et al.’s method. The goal of the new scheme is to achieve a more accurate similarity
computation from the mined preorder path patterns and to identify the possible locations of synonym terms. Consider
the two XML trees Y and Z in Fig.6, where tree X matches trees Y and Z with a common root element “A” and the
sub-elements “C”, “D” and “E”. For the PTR method, the common elements “C”, “D” and “E” are not located at the
same level of the two hierarchies and their similarity values are different. Besides, “B” and “F” can be identified as
possible synonym terms.
[Figure: the XML tree of Fig.2 with its nodes numbered in preorder]
Node numbering: 1: SigmodRecord, 2: issue, 3: volume, 4: number, 5: articles, 6: article, 7: title, 8: initPage, 9: endPage, 10: authors, 11: author
( 1 ( 2 ( 3 ) ( 4 ) ( 5 ( 6 ( 7 ) ( 8 ) ( 9 ) ( 10 ( 11 ) ) ) ) )
Fig.4 An example of preorder tree representation (PTR)

[Figure: the numbered XML tree, its paths encoded with the preorder tree representation, and the resulting encoded paths]
Encoded with Preorder Tree Representation    Encoded Path
1 ( 2 ( 3 ) ) )                              1 ( 2 ( 3
1 ( 2 ( 4 ) ) )                              1 ( 2 ( 4
1 ( 2 ( 5 ( 6 ( 7 ) ) ) )                    1 ( 2 ( 5 ( 6 ( 7
1 ( 2 ( 5 ( 6 ( 8 ) ) ) )                    1 ( 2 ( 5 ( 6 ( 8
1 ( 2 ( 5 ( 6 ( 9 ) ) ) )                    1 ( 2 ( 5 ( 6 ( 9
1 ( 2 ( 5 ( 6 ( 10 ( 11 ) ) ) ) )            1 ( 2 ( 5 ( 6 ( 10 ( 11
Fig.5 Preprocessing steps for XML document

[Figure: Tree X (root A, child B, leaves C, D, E), Tree Y (root A, child F, leaves C, D, E), Tree Z (root A, leaves C, D, E) and the common trees found by each method]
                                         Lee et al.'s method    PTR method
Paths of Tree X                          A, B, C                A, (, B, (, C
                                         A, B, D                A, (, B, (, D
                                         A, B, E                A, (, B, (, E
Paths of Tree Y                          A, F, C                A, (, F, (, C
                                         A, F, D                A, (, F, (, D
                                         A, F, E                A, (, F, (, E
Paths of Tree Z                          A, C                   A, (, C
                                         A, D                   A, (, D
                                         A, E                   A, (, E
Maximal common path between Trees        A, C                   A, (, (, C
X and Y                                  A, D                   A, (, (, D
                                         A, E                   A, (, (, E
Maximal common path between Trees        A, C                   A, (, C
X and Z                                  A, D                   A, (, D
                                         A, E                   A, (, E
Fig.6 Determination of common paths by Lee et al.’s method and PTR method
2.2 Mining frequent tree patterns
The problem being considered here is based on the idea that each path traversing from the root to a leaf of an
XML document can be viewed as a sequence, to which sequential pattern mining [19] can be
applied to find the frequent tree patterns. Each sequence corresponds to a set of elements ordered by decreasing level
of hierarchy and is labeled by a sequence-id. We denote it as <x1 x2 … xn> and call such a sequence an “XML-sequence”. An XML document contains a number of XML-sequences. Let the set of XML-sequences of an XML
document be {s1, s2, …, sn}. Using the terminology of sequential pattern mining [19], a sequence is contained by
another if it is a subsequence of that sequence. In a set of sequences, a sequence sj is maximal if it is not contained
by any other sequence.
Here, we are given a database D of XML sequences, each of which consists of the following fields: document-id,
sequence-id, element (tag) and the corresponding level of the hierarchy. As no document is assumed to have
more than one element with the same sequence-id and level of hierarchy, we combine the document-id,
sequence-id and level of the sequence as the identifier of that sequence. We also do not consider how many times the
same sequence occurs in an XML document. Thus, the problem of mining XML sequence patterns is to find the
maximal frequent sequences among all sequences satisfying the user-specified minimum support. Each such
maximal frequent sequence represents a common structure or pattern of the XML documents. Unlike in other data
mining applications, the minimum support for finding the maximal sequences between two XML documents (for
similarity computation) must be 100%. In order to mine the frequent tree patterns or structural sequences, all
XML-sequences have to be extracted and then encoded in PTR format, so that the maximal frequent sequences can be
found via the sequential mining algorithm depicted in Fig.7. An example is illustrated in Fig.8, where the records or
transactions are converted to the document’s paths as in Fig.8(c). (The details of the adapted sequential mining
algorithm are omitted here.)
L1 := {frequent 1-sequences};
For (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck := new candidates of size k generated from Lk-1;
    Foreach XML document d in the database do
        Increment the count of all candidates in Ck that are
        contained in any XPath expression of d;
    Lk := Candidates in Ck with minimum support;
End
Answer := Maximal Sequences in ∪k Lk;
Fig.7 Sequential pattern mining algorithm
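The level-wise loop of Fig.7 can be sketched in Python as follows. This is a simplified illustration and not the adapted algorithm, whose details are omitted in the paper: candidates are generated by appending every frequent token to every frequent (k-1)-sequence rather than by a dedicated join step, and support is counted as the number of documents having at least one path that contains the candidate. All function names are hypothetical.

```python
def is_subsequence(candidate, sequence):
    """True if candidate's tokens occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(token in it for token in candidate)


def support(candidate, documents):
    """Number of documents with at least one path containing candidate."""
    return sum(
        any(is_subsequence(candidate, path) for path in paths)
        for paths in documents
    )


def frequent_sequences(documents, min_support):
    """Return all frequent sequences, level by level (L1, L2, ...)."""
    items = sorted({tok for paths in documents for path in paths for tok in path})
    level = [(tok,) for tok in items if support((tok,), documents) >= min_support]
    frequent_items = [seq[0] for seq in level]
    frequent = []
    while level:                          # stop when L(k-1) is empty
        frequent.extend(level)
        candidates = [seq + (tok,) for seq in level for tok in frequent_items]
        level = [c for c in candidates if support(c, documents) >= min_support]
    return frequent


# The two documents of Fig.8; 100% support means both documents must match.
documents = [
    [("A", "(", "B", "(", "C"), ("A", "(", "B", "(", "D")],
    [("A", "(", 'B"', "(", "C"), ("A", "(", 'B"', "(", "D")],
]
result = frequent_sequences(documents, min_support=len(documents))
print(sorted(seq for seq in result if len(seq) == 4))
```

For this input the longest frequent sequences are A, (, (, C and A, (, (, D; the two left parentheses with nothing between them mark the position where B and B" differ.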
[Figure: Document 1 has root A with child B and leaves C (Path 1) and D (Path 2); Document 2 has root A with child B" and leaves C (Path 1) and D (Path 2)]
a) Examples of XML document tree

Docu. ID   Path ID   Level   Item/Tag
1          1         1       A
1          1         2       (
1          1         3       B
1          1         4       (
1          1         5       C
1          2         1       A
1          2         2       (
1          2         3       B
1          2         4       (
1          2         5       D
2          1         1       A
2          1         2       (
2          1         3       B"
2          1         4       (
2          1         5       C
2          2         1       A
2          2         2       (
2          2         3       B"
2          2         4       (
2          2         5       D
b) Database version of XML document 1 & 2

Document ID   Path ID   Document Path
1             1         A, (, B, (, C
1             2         A, (, B, (, D
2             1         A, (, B", (, C
2             2         A, (, B", (, D
c) Document-sequence version of the database
Fig.8 An example of mining maximal frequent sequences
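The conversion from the database version in Fig.8(b) to the document-sequence version in Fig.8(c) amounts to grouping the records by document-id and path-id and ordering them by level. A minimal sketch, assuming each record is a (doc_id, path_id, level, tag) tuple; the helper names are hypothetical:

```python
from collections import defaultdict


def rows_to_sequences(rows):
    """Group (doc_id, path_id, level, tag) rows into one token list per path."""
    grouped = defaultdict(list)
    for doc_id, path_id, level, tag in rows:
        grouped[(doc_id, path_id)].append((level, tag))
    return {
        key: [tag for _level, tag in sorted(items, key=lambda item: item[0])]
        for key, items in grouped.items()
    }


def rows_for(doc_id, path_id, tokens):
    """Build the Fig.8(b)-style rows of one path; levels start at 1."""
    return [(doc_id, path_id, level, tag)
            for level, tag in enumerate(tokens, start=1)]


rows = (
    rows_for(1, 1, ["A", "(", "B", "(", "C"])
    + rows_for(1, 2, ["A", "(", "B", "(", "D"])
    + rows_for(2, 1, ["A", "(", 'B"', "(", "C"])
    + rows_for(2, 2, ["A", "(", 'B"', "(", "D"])
)
sequences = rows_to_sequences(rows)
for (doc_id, path_id), tokens in sorted(sequences.items()):
    print(doc_id, path_id, ", ".join(tokens))
```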
For (k = n; k > 1; k--)
    Foreach k-sequence sk
        Delete from S all subsequences of sk;
Fig.9 Algorithm of Maximal Phase
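The maximal phase of Fig.9 can be sketched as a filter over the set S of frequent sequences. The sketch below is illustrative only; it assumes the frequent sequences are given as tuples of tokens and visits them from longest to shortest, deleting every sequence contained in a longer surviving one. The function names are hypothetical.

```python
def is_subsequence(candidate, sequence):
    """True if candidate's tokens occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(token in it for token in candidate)


def maximal_sequences(frequent):
    """Keep only the sequences that are not contained in another sequence."""
    remaining = set(frequent)
    for seq in sorted(set(frequent), key=len, reverse=True):
        if seq not in remaining:          # already deleted by a longer sequence
            continue
        remaining -= {other for other in remaining
                      if other != seq and is_subsequence(other, seq)}
    return sorted(remaining)


frequent = [
    ("A",), ("D",), ("A", "("),
    ("A", "(", "C"), ("A", "(", "D"),
]
print(maximal_sequences(frequent))
```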
3. Measuring Similarity between XML documents
As described in the previous section, the sequential pattern mining algorithm [19] is adopted to find the maximal
common paths (i.e., maximal frequent sequences) from the encoded XML paths of the two documents to be compared.
Based upon the mining results, the similarity between them is the “ratio” of the maximal common paths to the
extracted paths of the larger document, i.e., the one with more elements. Here, we assume that a more similar pair of
XML documents should have more maximal common paths and more left parentheses in those paths. The
similarity between two XML documents D1 and D2 is defined as follows:
$$\mathrm{Sim}(D_1, D_2) = \frac{1}{N+1}\left(\sum_{t=1}^{N}\frac{1}{N_t}\sum_{p=1}^{N_t}\frac{m_{t,p}}{M_{t,p}} + MR\right) \tag{1}$$
where N is the total number of level 1 sub-trees in the larger document; Nt is the total number of paths in the t-th
sub-tree; Mt,p is the number of elements in the (t,p)-th path; mt,p is the number of common elements (obtained from
the maximal frequent sequences) in the (t,p)-th path; and MR is either 0 or 1 to denote whether the two documents
have the same root element.
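A direct transcription of eq.(1) into Python is sketched below. It assumes MR sits inside the 1/(N+1) normalization, so that the value stays in [0, 1], and it takes the per-path counts m and M as inputs because the text does not spell out exactly which tokens are counted. The counts m = 2 and M = 3 in the usage line are assumptions chosen only because they reproduce the 83.33% reported for trees X and Y in Fig.10.

```python
def structural_similarity(subtrees, same_root):
    """Eq.(1): subtrees holds, for each level-1 sub-tree of the larger
    document, a non-empty list of (m, M) pairs, one pair per path."""
    total = sum(
        sum(m / M for m, M in paths) / len(paths)   # average ratio in sub-tree t
        for paths in subtrees
    )
    mr = 1 if same_root else 0                      # MR: same root element?
    return (total + mr) / (len(subtrees) + 1)


# One level-1 sub-tree with three paths; m = 2 and M = 3 are assumed counts.
print(round(structural_similarity([[(2, 3), (2, 3), (2, 3)]], True), 4))  # 0.8333
```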
3.1 Element similarity
Based upon the maximal frequent sequences being mined, we further propose to determine the possible
positions of synonyms, i.e., common unmatched elements, and use them to calculate the similarity of these elements.
We call this similarity the element similarity (ES); it can be used as an additional measure for document
similarity computation. Here, a common unmatched element refers to an element located at a relatively similar
position in the common path but denoted by a different term with the same meaning.
The ES is determined as follows. First, the maximal common path (maximal frequent sequence) obtained from
the mining stage is used to determine the most similar paths from each document. The most similar path means the
longest path that contains the maximal frequent sequence. For notational convenience, let S1 and S2 be two sets of
PTR encoded paths from two XML documents D1 and D2, between which we aim at determining the common paths
and identifying the synonyms. Let pi∈S1 and pj∈S2 be two most similar paths and pc be their maximal common
path, i.e., pc is contained in both pi and pj.
Second, pc is aligned with pi and pj and the candidates of common unmatched elements from each path are
determined. Let ec,1…ec,k be the ordered element set of the path pc, where k is the number of elements in pc and ec,1
(ec,k) is the first (last) element of path pc. Consider the paths p1 = “A, (, B, (, D, (, E”, p2 = “A, (, G, (, E” and pc = “A,
(, (, E”. For p1 and pc, B and D are the “difference” elements in the alignment of p1 and pc and thus they
are the candidates of common unmatched elements from p1. Similarly, G is the “difference” element in
the alignment of p2 and pc and thus G is the candidate of common unmatched element from p2. Algorithmically,
when elements of the common path with ec,r = “(” and ec,r+1 = “(”, where 1≤r<k, are located, a common unmatched
element is assumed between positions r and r+1 of the common path. With respect to pi or pj, the element(s) in
between the corresponding positions (i.e., ec,r = “(” and ec,r+1 = “(”) is (are) the candidate(s) of common unmatched
elements or possible synonyms.
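The alignment step can be illustrated with a short Python sketch. It uses a greedy left-to-right alignment of the common path against one document path and collects the labels that are skipped; this greedy strategy and the function name are assumptions made for illustration, not the paper's exact procedure.

```python
def unmatched_candidates(path, common):
    """Align the common path with one document path and return the skipped
    labels, i.e., the candidates of common unmatched elements."""
    skipped, i = [], 0
    for token in path:
        if i < len(common) and token == common[i]:
            i += 1                       # token matches the next common token
        elif token != "(":
            skipped.append(token)        # a "difference" element
    if i != len(common):
        raise ValueError("common path is not contained in the given path")
    return skipped


pc = ["A", "(", "(", "E"]
p1 = ["A", "(", "B", "(", "D", "(", "E"]
p2 = ["A", "(", "G", "(", "E"]
print(unmatched_candidates(p1, pc))  # ['B', 'D']
print(unmatched_candidates(p2, pc))  # ['G']
```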
Third, a reference table is constructed for the candidates identified in the previous step and the element
similarity is computed. The reference table stores information about the leaf nodes (elements) of the sub-tree of each
candidate common unmatched element. Since each element is assumed unique in each XML document and the
XML structure is acyclic, the leaf nodes indicate the semantics or concept of their ancestors. We make use of this
property to measure the semantic association between the candidate common unmatched elements from pi and pj and
thus to compute the element similarity. The element similarity of candidate common unmatched elements ea and eb
from pi and pj respectively is defined as:
$$\mathrm{Similarity}(e_a, e_b) = \frac{|N_a \cap N_b|}{|N_a \cup N_b|} \tag{2}$$
where Na and Nb are the sets of leaf nodes from the reference table for ea and eb respectively. Obviously,
Similarity(ea , eb ) falls in the real interval [0, 1].
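Eq.(2) is the ratio of shared to total leaf labels of the two candidates. A minimal sketch, assuming each candidate's sub-tree is available as nested (label, children) tuples and that the reference table is simply a mapping from candidate to its set of leaf labels; these data layouts and names are illustrative assumptions:

```python
def leaf_labels(node):
    """Collect the labels of all leaf nodes below (or at) the given node."""
    label, children = node
    if not children:
        return {label}
    return set().union(*(leaf_labels(child) for child in children))


def element_similarity(leaves_a, leaves_b):
    """Eq.(2): size of the intersection over size of the union."""
    union = leaves_a | leaves_b
    return len(leaves_a & leaves_b) / len(union) if union else 0.0


# Reference table for the candidates B and F of Fig.10.
sub_b = ("B", [("C", []), ("D", []), ("E", [])])
sub_f = ("F", [("C", []), ("D", []), ("E", [])])
reference_table = {"B": leaf_labels(sub_b), "F": leaf_labels(sub_f)}
print(element_similarity(reference_table["B"], reference_table["F"]))  # 1.0
```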
The steps above determine the most relevant pair of common unmatched elements for documents D1 and D2
and the corresponding element similarity can be computed according to eq.(2). Such a similarity measure can be
used to further qualify the XML document similarity. By replacing the zero contribution of the unmatched
element(s) in the path in eq.(1) with the ES in eq.(2), a combined method called PTR&ES is proposed and the new
similarity computation is defined as
$$\mathrm{Sim}(D_1, D_2) = \frac{1}{N+1}\left(\sum_{t=1}^{N}\frac{1}{N_t}\sum_{p=1}^{N_t}\frac{m_{t,p} + c_{t,p}}{M_{t,p}} + MR\right) \tag{3}$$
where ct,p is the sum of ES of the common unmatched elements in the (t,p)-th path, i.e.,
$$c_{t,p} = \sum_{\forall (e_i, e_j)\ \text{pairs}} \mathrm{Similarity}(e_i, e_j) \tag{4}$$
The PTR&ES similarity computation is exemplified in Fig.10.
[Figure: Tree X (root A, child B, leaves C, D, E), Tree Y (root A, child F, leaves C, D, E) and their common tree (root A, child *, leaves C, D, E)]
Paths of Tree X: A, (, B, (, C; A, (, B, (, D; A, (, B, (, E
Paths of Tree Y: A, (, F, (, C; A, (, F, (, D; A, (, F, (, E
Maximal common path between X and Y: A, (, (, C; A, (, (, D; A, (, (, E
Structural Similarity (X, Y): 83.33%
Element Similarity of (B, F): |{C, D, E} ∩ {C, D, E}| / |{C, D, E} ∪ {C, D, E}| = 1
Maximal common path between X and Y (use * to represent B and F and similarity value of * is 100%): A, (, *, (, C; A, (, *, (, D; A, (, *, (, E
Combined Structural Similarity (X, Y): 100%
Fig.10 A PTR&ES similarity computation example
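The combined measure of eq.(3) and eq.(4) can be sketched in the same way. The sketch assumes that MR lies inside the 1/(N+1) normalization and that each path is described by its count m, the list of ES values of its common unmatched element pairs, and its count M. The values m = 2 and M = 3 are assumptions chosen because they are consistent with the 83.33% and 100% shown in Fig.10; the paper does not state them explicitly.

```python
def combined_similarity(subtrees, same_root):
    """Eq.(3): each path is (m, es_values, M), where es_values holds the
    eq.(2) scores of the common unmatched element pairs on that path."""
    total = 0.0
    for paths in subtrees:
        ratios = [(m + sum(es_values)) / M      # sum(es_values) is c of eq.(4)
                  for m, es_values, M in paths]
        total += sum(ratios) / len(paths)
    mr = 1 if same_root else 0
    return (total + mr) / (len(subtrees) + 1)


# Trees X and Y of Fig.10: the pair (B, F) has ES = 1 on every path.
with_es = [(2, [1.0], 3)] * 3
without_es = [(2, [], 3)] * 3
print(combined_similarity([with_es], same_root=True))                 # 1.0
print(round(combined_similarity([without_es], same_root=True), 4))    # 0.8333
```

With an empty list of ES values the function reduces to eq.(1), which is why the second call gives the plain structural similarity.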
4. Experimental Results
The goals of our experiments are to validate the idea of incorporating hierarchical information for better
similarity computation and to test the effectiveness of the proposed methods. To do so, Lee et al.’s method was
chosen for comparison. The experiments were conducted as follows. The following three DTDs were downloaded
from ACM’s SIGMOD Record homepage [20]: OrdinaryIssuePage.dtd, SigmodRecord.dtd and Record.dtd, where
Record.dtd is a modified version of SigmodRecord.dtd. Specifically, the hierarchical structure of Record.dtd is
more similar to that of OrdinaryIssuePage.dtd than the hierarchical structure of SigmodRecord.dtd is. We also
downloaded the XML document generator from IBM’s homepage [21]. This generator accepts the above DTDs as
input and creates the sets of XML documents for simulations. We added the option of generating skewed documents
with maximum height equal to seven and maximum repeated pattern equal to three, where some tag names appear
more frequently than others, as is generally the case in real-life documents. Based upon the three sets of XML
documents with similar characteristics, their similarities were computed, analyzed and reported as follows.
4.1 Similarity of documents of same DTD
In this experiment, five XML documents were generated from each DTD and the similarities between documents
generated from the same DTD were computed. As the XML documents come
from the same DTD, this is called homogeneous XML document similarity. Fig.11 shows a summary of the results.
It can be seen that the similarity values obtained by the proposed methods, i.e., preorder tree representation (PTR)
and element similarity (ES), are close to those of Lee et al.’s method. In fact, they are slightly higher
because our methods take into account both the common elements and the structure (via left parentheses) in the
hierarchy. Besides, the similarity values obtained by the PTR and PTR&ES methods are the same. The reason
is that the documents have no common unmatched elements and share the same hierarchical structure. On the
other hand, the proposed methods fluctuate less than Lee et al.’s method. This is because Lee et al.’s method only
considers the common elements in the path while ours also consider the level of hierarchy, making them less sensitive
to unmatched elements in the paths.
[Figure: two plots of Similarity (ratio) for the document pairs 1,2 to 4,5, each comparing Lee et al.'s method, the PTR method and the PTR+ES method. a) Document Similarity of SigmodRecord.dtd, x-axis DocumentSet(SigmodRecord-id, SigmodRecord-id); b) Document Similarity of Record.dtd, x-axis DocumentSet(Record-id, Record-id)]
Fig.11 Results of homogeneous XML document similarity using different methods
4.2 Similarity of documents of different DTDs
In this experiment, the similarities between documents of different DTDs were analyzed. The XML documents
from OrdinaryIssuePage.dtd were adopted as the base documents while those from Record.dtd and
SigmodRecord.dtd were used as query documents. The experimental results are shown in Fig.12, where
DocumentSet(x,y,z) denotes the similarity between document x from OrdinaryIssuePage.dtd and
document y from Record.dtd and that between document x and document z from SigmodRecord.dtd. As the XML
documents come from different DTDs, this is called heterogeneous XML document similarity.
Here, the proposed PTR method was found superior to Lee et al.’s method. For DocumentSet(1,2,2),
DocumentSet(2,3,3), DocumentSet(3,2,2), DocumentSet(4,1,1) and DocumentSet(4,2,2), Lee et al.’s method shows
that the similarity values of Rec(ord)-Ord(inaryIssuePage) and Sig(modRecord)-Ord(inaryIssuePage) are the
same. This is because the Record.dtd document and the SigmodRecord.dtd document contain the same number of
elements in common with the paths of the OrdinaryIssuePage.dtd document. When comparing these two sets of
documents, our PTR method can determine which of them is more similar. Again, this is due to the
consideration of the hierarchical information in the proposed method. Lee et al.’s method identifies only the
common elements and hence the similarity values are the same. The proposed PTR method not only
discovers these common elements, but also determines the positions of the common unmatched elements, which will be
considered as synonym terms. This property provides an additional capability for calculating the distance among XML
documents.
For DocumentSet(2,2,2) and DocumentSet(5,2,2), the PTR method obtained a different result from Lee et al.’s
method. The reason is that their method found more common elements in the paths for the
SigmodRecord-OrdinaryIssuePage case and hence generated higher similarity values. As mentioned at the beginning of
this section, the hierarchical structure of Record.dtd is more similar to that of OrdinaryIssuePage.dtd than the
hierarchical structure of SigmodRecord.dtd is. Hence, the proposed PTR method has obtained more reasonable
results. It has also been validated that documents with dissimilar hierarchical structures but many common paths
(located at different levels of the hierarchies) do not receive unreasonably high similarity values from the
proposed PTR method. The aforementioned distinctive features of the PTR method are also possessed by the
PTR&ES method because it makes use of the PTR method to encode the paths.
In Fig.13, the three methods are further compared. DocumentSet(x,y) is used to denote the similarity between
document x from OrdinaryIssuePage.dtd and document y from SigmodRecord.dtd. For DocumentSet(4,1),
DocumentSet(4,2), DocumentSet(4,3), DocumentSet(4,4) and DocumentSet(4,5), both the PTR method and Lee et al.’s
method show that the similarity values are the same (as indicated in Fig.13(a)&(b)). The reason is that
all the SigmodRecord.dtd documents contain the same number of common elements and unmatched elements in the
paths for the OrdinaryIssuePage.dtd document. When comparing these documents, our PTR&ES
method can determine the most similar one, i.e., document 4 of SigmodRecord.dtd. It found
common unmatched elements in the paths and hence generated non-zero unmatched element similarity values.
(Similar cases have been observed in other heterogeneous XML document similarity experiments but are omitted
here.) This provides an example for understanding the differences between the PTR method and the PTR&ES method.
Here, the common elements and common unmatched elements are also determined by the PTR method. There are
the same number of common elements and common unmatched elements in the SigmodRecordx.xml trees, and so
neither the PTR method nor Lee et al.’s method can distinguish which of the trees is most similar to
OrdinaryIssuePage.xml. Our PTR&ES method determines not only these common elements and
common unmatched elements, but also the similarity of the common unmatched elements. The positions of the common
unmatched elements are considered as synonym terms and are used as additional information to calculate the
distance among the XML documents, so that the most similar XML document can be distinguished among
documents with the same number of common elements and common unmatched elements.
[Figure: three plots of Similarity (Ratio) against Document Set (OrdinaryIssuePage-id, Record-id, SigmodRecord-id) for the sets 1,1,1 to 5,5,5, each with the two series Rec-Ord and Sig-Ord. a) Document Similarity Using Lee et al's Method; b) Document Similarity Using the PTR Method; c) Document similarity using the PTR+ES method]
Fig.12 Results of heterogeneous XML document similarity using different methods
[Figure with three panels, each plotting Similarity (Ratio) against Document Set (OrdinaryIssuePage-id,
SigmodRecord-id) for the series SigmodRecord-OrdinaryIssuePage over document sets 4,1 to 4,5: a) document
similarity using Lee et al.’s method; b) document similarity using the PTR method; c) document similarity using
the PTR&ES method.]
Fig.13 Further results of heterogeneous XML document similarity using different methods
5. Conclusions and Future Work
XML has become increasingly popular, and there is a strong need for tools that can effectively and
automatically retrieve target XML documents. Previous work has extracted paths from XML
documents and found the maximal common paths among them using the sequential pattern mining
approach. To determine the similarity between XML documents, a previous algorithm takes advantage
of the common elements in the paths but ignores the hierarchical information. In this paper, a new pre-processing
step for preparing XML documents for similarity computation using sequential pattern mining is proposed. It makes
use of a preorder tree representation (PTR) to encode the XML tree’s paths, so that both the
elements’ semantics and the hierarchical structure of the document are included in the computation of similarity
between XML documents. A novel post-processing step is also proposed to calculate the common unmatched element
similarity obtained from the hierarchical-information-based mining stage. It estimates an element’s semantics from
its sub-tree or leaf node. The experimental results showed that the PTR method is an attractive alternative to Lee
et al.’s method [3] and can overcome some of its weaknesses. The combined PTR and element similarity (PTR&ES) method
provides a further improvement in computing the structural similarity of XML documents and compensates for the
shortcomings of the PTR method and Lee et al.’s method. We believe that the proposed methods can provide
valuable help for the development and evaluation of XML similarity.
In this paper we focus on the computation of structural similarity using a data mining method. It would be
interesting to compare its performance and quality with those of other methods (e.g. tree theory). We plan to
investigate this direction in our future work.
References
[1] W3C’s XML home page: http://www.w3.org/XML/
[2] W3C’s SGML home page: http://www.w3.org/MarkUp/SGML/
[3] J.W. Lee, K. Lee and W. Kim, “Preparations for semantics-based XML mining,” Proceedings of the 2001 IEEE
International Conference on Data Mining, pp.345-352, San Jose, California, December, 2001.
[4] A. Nierman and H.V. Jagadish, “Evaluating structural similarity in XML documents,” Proceedings of the Fifth International
Workshop on the Web and Databases (WebDB), Madison, Wisconsin, June 2002.
[5] S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, T. Yu, “Approximate XML joins,” Proceedings of the ACM SIGMOD
Conference on Management of Data, Madison, Wisconsin, pp.287-298, June 2002.
[6] H. Galhardas, D. Florescu, D. Shasha, E. Simon, C.A. Saita, “Declarative data cleaning: Language, model, and algorithms,”
Proceedings of 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, pp.371-380, August
2002.
[7] J. Pereira, F. Fabret, H.A. Jacobsen, F. Llirbat, D. Shasha, “WebFilter: A High-throughput XML-based publish and
subscribe system,” Proceedings of 27th International Conference on Very Large Data Bases (VLDB), Roma, Italy,
pp.723-724, September 2001.
[8] M. Altinel, M. J. Franklin, “Efficient filtering of XML documents for selective dissemination of information,” Proc. of 26th
International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, pp.53-64, Sept. 2000.
[9] C.Y. Chan, P. Felber, M.N. Garofalakis, R. Rastogi, “Efficient filtering of XML Documents with XPath expressions,”
Proceedings of 18th International Conference on Data Engineering (ICDE), San Jose, California, pp.235-244,
February 26-March 1, 2002.
[10] C.Y. Chan, W. Fan, P. Felber, M. Garofalakis, R. Rastogi, “Tree pattern aggregation for scalable XML data dissemination,”
Proceedings of 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, pp.826-837, August,
2002.
[11] Hartmut Liefke, Dan Suciu, “XMill: An efficient compressor for XML data,” Proc. of the 2000 ACM SIGMOD
International Conference on Management of Data, Dallas, Texas, pp.153-164, May 2000.
[12] C.H. Moh, E.P. Lim and W.K. Ng, “DTD-Miner: A tool for mining DTD from XML documents,” Proceedings of the 2nd
Int. Workshop on Advance Issues of E-Commerce and Web-Based Information Systems, Milpitas, California, pp.144-151,
June, 2000.
[13] S. Nestorov, S. Abiteboul and R. Motwani, “Extracting schema from semi-structured data,” Proceedings of ACM SIGMOD
International Conference on Management of Data, Seattle, Washington, pp.295-306, June 1998.
[14] H. Bunke and K. Shearer, “A graph distance metric based on the maximal common sub-graph,” Pattern Recognition Letters,
vol.19, no.3-4, pp.227-381, 1998.
[15] T. Miyahara, T. Shoudai, T. Uchida, K. Takahashi and H. Ueda, “Discovery of frequent tree structured patterns in semistructured Web documents,” Proceedings of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD), Hong Kong, China, pp.47-52, April 2001.
[16] C.H. Chang, S.C. Lui and Y.C. Wu, “Applying pattern mining to Web information extraction,” Proceedings of the Fifth
Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hong Kong, China, pp.4-16, April 2001.
[17] R. Sedgewick, An introduction to the analysis of algorithms. Addison-Wesley, 1996.
[18] W3C’s Document Object Model home page: http://www.w3.org/DOM/
[19] R. Agrawal and R. Srikant, “Mining sequential patterns,” Proceedings of the Eleventh International Conference on Data
Engineering (ICDE), Taipei, Taiwan, pp.3-14, March 1995.
[20] ACM SIGMOD Record home page: http://www.acm.org/sigmod/record/xml.
[21] IBM’s XML Generator homepage: http://www.alphaworks.ibm.com.