INode 1

INode: An Effective Numbering Scheme for Storing
XML Data in Relational Databases
Lau Ho Kit and Vincent Ng
Department of Computing
Hong Kong Polytechnic University
Abstract

XML has become the standard for representing and exchanging information on the Internet. This poses the challenge of storing and querying XML data efficiently. Several model-mapping approaches have been proposed to store XML data in relational database systems. The key features of the model-mapping approach are that fixed database schemas are used for all XML documents and that DTD information is not needed. In this paper, we present a new model-mapping approach, called INode. This approach is based on a numbering scheme for elements. It enables quick retrieval of parent-child and ancestor-descendant relationships between elements in the XML data graph. Experiments with two data sets and two query sets show that INode can reduce the storage requirement while achieving better query performance.

1. Introduction

Extensible Markup Language (XML) has become the standard for exchanging information on the Internet. XML documents comprise hierarchically nested collections of elements and tags describing the semantics of the data. XML provides a flexible way to exchange data between different platforms. Several XML query languages have been proposed, such as XQuery [1], XPath [2], XML-QL [3], Lorel [4] and Quilt [5], and their common feature is the use of regular path expressions to query XML data.

As most existing systems use relational database systems, a large fraction of XML documents will be stored in relational DBMSs in order to minimize the management cost of using XML for data exchange. Therefore, the modeling issue between XML data and relational databases has received special attention.

In [6], the authors classified the approaches for mapping XML documents to relational DBMSs into two classes: the structure-mapping approach and the model-mapping approach. The database schemas of the former are based on the logical structure of the XML documents, mostly on the DTD (Document Type Definition), as in X-Ray [7]. Since the schemas are based on the logical structure, this approach is not suitable for data with a dynamic structure. The database schemas of the latter are fixed for all XML documents. Examples include Edge [8], XRel [6] and XParent [9]. The model-mapping approach can support XML documents whose DTDs are not known in the design phase, or that have no DTDs. It is therefore more flexible and convenient for managing XML data in relational DBMSs.

In this paper, we propose a new model-mapping approach. The approach uses unique element identifiers (UIDs) as a numbering scheme for node IDs. With this scheme, we can obtain the UIDs of the ancestors and descendants directly from the UID of an element. Each element in the XML document is assigned a UID as an identifier. Therefore, we can find the relationship between different elements based on their UIDs.

The paper is structured as follows. Section 2 briefly discusses a sample XML document. Section 3 reviews three existing model-mapping approaches. Section 4 introduces our new model-mapping approach, INode. Section 5 discusses query processing. Section 6 presents the experimental results. Finally, we conclude the paper in Section 7.

2. Overview of XML document

Extensible Markup Language (XML) is a simplified subset of SGML created by the World Wide Web Consortium (W3C) [10]. XML documents comprise hierarchically nested collections of elements, where each element can be either atomic or composite. Further, tags stored with elements in an XML document describe the semantics of the data rather than simply specifying how the elements are to be displayed (as in HTML). Figure 1 shows a simplified XML document of SigmodRecord [15], and Figure 2 shows the corresponding data graph.

<SigmodRecord>
  <issue>
    <volume>11</volume>
    <number>1</number>
    <articles>
      <article>
        <title>Annotated Bibliography on Data Design.</title>
        <initPage>45</initPage>
        <endPage>77</endPage>
        <authors>
          <author position="00">Anthony I. Wasserman</author>
          <author position="01">Karen Botnich</author>
        </authors>
      </article>
    </articles>
  </issue>
</SigmodRecord>
Figure 1. A Simplified XML Document of SigmodRecord.

Root
  SigmodRecord (1)
    issue (2)
      volume (7): 11
      number (8): 1
      articles (9)
        article (42)
          title (207): Annotated Bibliography on Data Design.
          initPage (208): 45
          endPage (209): 77
          authors (210)
            author (1047): Anthony I. Wasserman
              @position (5232): 00
            author (1048): Karen Botnich
              @position (5237): 01
Figure 2. Data graph of the Simplified XML Document of SigmodRecord. (Numbers in parentheses are node IDs; the legend of the original figure distinguishes element nodes, text nodes and attribute nodes.)
3. Three Model-Mapping Approaches

Edge [8], XRel [6] and XParent [9] are three model-mapping approaches that store different structures of XML documents in relational DBMSs. Consider the XML data graph shown in Figure 3, where the nodes in the graph can be either element nodes or attribute nodes, and text nodes are not included.

[Figure 3. A Sample XML Data Graph. Nodes n0 (root), n1, ..., nn are connected by edges labelled l1, ..., ln.]

The data graph contains nodes with IDs n0, n1, n2, ..., nn, where n0 is the root, and l1, l2, l3, ..., ln are the labels between pairs of nodes. The Edge approach records each pair of node IDs with the corresponding label in a single table. For example, the label between the nodes n0 and n1 is l1, and they are stored in the attributes Source, Target and Label, respectively. Unlike Edge, XRel stores all the simple path expressions, each represented as a sequence of labels such as (l1, l4), in a table, and keeps the region of each node to preserve the precedence and the ancestor-descendant relation among nodes. The region is specified by a pair of numbers: the start and end positions of a node in an XML document. The XParent approach also uses a table to store all the path expressions of an XML document. Instead of using the region to maintain the ancestor-descendant relationship, XParent uses a separate table to keep the parent-child relationships among nodes and retrieves the ancestor-descendant relationships by joining that table.

3.1. Edge

The Edge approach [8] stores the XML data graph of Figure 2 in a single table called Edge.

Edge(Source, Ordinal, Target, Label, Flag, Value)

Each node in the data graph is assigned a number. Each tuple in the table corresponds to an edge in the data graph. For each edge, the table stores the source ID, the target ID and the label of the edge. The Ordinal attribute keeps the ordinal of the edge among its siblings. The Flag attribute indicates whether the edge refers to an inter-object reference (ref) or a value (val). Further, Edge uses the inlining approach to store the text in the Value attribute when the node has a text child.

Src  Ord  Tgt  Label         Flag  Value
0    1    1    SigmodRecord  ref
1    1    2    issue         ref
2    1    3    volume        val   11
2    2    4    number        val   1
2    3    5    articles      ref
5    1    6    article       ref
6    1    7    title         val   Annotated Bibliography on Data Design.
6    2    8    initPage      val   45
6    3    9    endPage       val   77
6    4    10   authors       ref
10   1    11   author        val   Anthony I. Wasserman
11   1    12   @position     val   00
10   2    13   author        val   Karen Botnich
13   1    14   @position     val   01
Table 1. Edge Table for the XML Data Graph in Figure 2.

3.2. XRel

The XRel approach [6] uses a schema of four tables to store the XML data graph. The tables are Path, Element, Text and Attribute.
Path(PathID, PathExp)
Element(DocID, PathID, Start, End, Ordinal)
Text(DocID, PathID, Start, End, Value)
Attribute(DocID, PathID, Start, End, Value)

The database attributes DocID, PathID, Start, End and Value represent the document identifier, simple path expression identifier, start position of a region, end position of a region and string value, respectively [6]. The region is used to uniquely identify the occurrence of an element node or a text node. The region of a node is identified by the start and end positions of this node in the XML document. As given in [12], the region can be computed based on the Absolute Region Coordinate (ARC) or the Relative Region Coordinate (RRC). ARC expresses the absolute location of a node in relation to the root node in the XML document. RRC expresses the location of a node in relation to its parent node rather than the root node. The advantage of using RRC is that it can minimize the cost of position updates when the document changes. The key feature of XRel is that each node in the XML document is represented by the combination of a simple path expression and a region.

PathID  PathExp
1       #/SigmodRecord
2       #/SigmodRecord#/issue
3       #/SigmodRecord#/issue#/volume
4       #/SigmodRecord#/issue#/number
5       #/SigmodRecord#/issue#/articles
6       #/SigmodRecord#/issue#/articles#/article
7       #/SigmodRecord#/issue#/articles#/article#/title
8       #/SigmodRecord#/issue#/articles#/article#/initPage
9       #/SigmodRecord#/issue#/articles#/article#/endPage
10      #/SigmodRecord#/issue#/articles#/article#/authors
11      #/SigmodRecord#/issue#/articles#/article#/authors#/author
12      #/SigmodRecord#/issue#/articles#/article#/authors#/author#/@position
(i) Path Table

DocID  PathID  Start  End  Value
0      12      184    184  00
0      12      235    235  01
(ii) Attribute Table

DocID  PathID  Start  End  Ordinal
0      1       0      332  1
0      2       14     317  1
0      3       21     40   1
0      4       40     58   2
0      5       58     309  3
0      6       68     298  1
0      7       77     130  1
0      8       130    153  2
0      9       153    174  3
0      10      174    288  4
0      11      183    234  1
0      11      234    278  2
(iii) Element Table

DocID  PathID  Start  End  Value
0      3       29     31   11
0      4       48     49   1
0      7       84     122  Annotated Bibliography on Data Design.
0      8       140    142  45
0      9       162    164  77
0      11      191    211  Anthony I. Wasserman
0      11      242    255  Karen Botnich
(iv) Text Table

Table 2. The XRel Schema for the XML Data Graph in Figure 2.

3.3. XParent

The XParent approach [9] uses a four-table schema to store XML data. The tables are LabelPath, DataPath, Element and Data.
LabelPath(PathID, Len, Path)
DataPath(Pid, Cid)
Element(PathID, Ordinal, Did)
Data(PathID, Did, Ordinal, Value)

The database attributes PathID, Len, Pid, Cid, Did and Value represent the label-path identifier, number of edges of the label path, parent-node id, child-node id, data-path identifier and string value, respectively. Since the DataPath table stores the attributes Pid and Cid, it maintains the parent-child relationships. Table joins are needed in order to check the ancestor-descendant relationship. To speed up this processing, the authors proposed to use another table, named Ancestor, to keep this relationship. However, this table needs more disk space and increases the overhead of update operations. The key feature of XParent is that it uses the LabelPath and DataPath tables to maintain the structural information of the XML documents. The DataPath table is also used to keep the parent-child relationship instead of using regions as in XRel.

PathID  Len  Path
1       1    ./SigmodRecord
2       2    ./SigmodRecord./issue
3       3    ./SigmodRecord./issue./volume
4       3    ./SigmodRecord./issue./number
5       3    ./SigmodRecord./issue./articles
6       4    ./SigmodRecord./issue./articles./article
7       5    ./SigmodRecord./issue./articles./article./title
8       5    ./SigmodRecord./issue./articles./article./initPage
9       5    ./SigmodRecord./issue./articles./article./endPage
10      5    ./SigmodRecord./issue./articles./article./authors
11      6    ./SigmodRecord./issue./articles./article./authors./author
12      7    ./SigmodRecord./issue./articles./article./authors./author./@position
(i) LabelPath Table

Pid  Cid        PathID  Ordinal  Did
1    2          1       1        1
2    3          2       1        2
2    4          3       1        3
2    5          4       1        4
5    6          5       1        5
5    15         6       1        6
6    7          7       1        7
6    8          8       1        8
6    9          9       1        9
6    10         10      1        10
10   11         11      1        11
10   12         12      1        12
10   13         11      2        13
10   14         12      1        14
(ii) DataPath Table     (iii) Element Table

Table 3. The XParent Schema for the XML Data Graph in Figure 2.

4. INode

In this section, we introduce a new model-mapping approach called INode. The Edge approach is easy to maintain because it uses a single-table schema. However, since it stores edges individually, it needs a large number of equijoins to check the edge connections. When a query needs to retrieve the ancestor-descendant relationship, the query performance degrades.

XRel uses a table to store all simple path expressions. This improves the query performance and makes regular path expressions easy to evaluate. XRel also uses the concept of region to maintain the ancestor-descendant relationship. A node i is reachable from another node j if the region of i is included in the region of j. As a result, XRel can identify the containment relationship by using θ-joins. However, a θ-join is more costly than an equijoin.

Unlike XRel, XParent uses a table to maintain the parent-child relationship instead of using the concept of region. Therefore, it can use equijoins to test the relationship, and the performance can be improved. However, for some complex queries that require checking the ancestor-descendant relationship, XParent requires a large number of equijoins. Although it can use another table, Ancestor, to keep the ancestor-descendant relationship, this requires more storage space.

Therefore, we introduce a new model-mapping approach that has good query performance for complex queries while reducing the storage usage.

Lee et al. [11] proposed inverted index and signature file schemes in order to reduce the storage needed for all index entries. They interpreted a document structure as a k-ary tree, where k is the maximum number of child nodes of a node in the structure. Since the document tree is made complete, it contains some virtual nodes that do not exist in the document. They then assigned UIDs to the nodes in the order of a level-order tree traversal. Figure 4 shows a document tree and the assignment of UIDs to nodes. For a node whose UID is i, the parent's UID can be calculated directly by the following function.

$$ \mathrm{Parent}(i) = \left\lfloor \frac{i - 2}{k} \right\rfloor + 1 $$

[Figure 4. 3-ary document tree with UIDs. Nodes are numbered 1 to 16 in level order; the legend distinguishes real nodes from virtual nodes.]

Our approach uses the concept of UID for assigning the ID to each node in the XML data graph. It represents an XML document as an n_c-ary complete tree, where n_c is the maximum number of child nodes of a node in the structure. Given a node n, the path of n is denoted as (a_1, a_2, a_3, ..., a_k), where k is the depth of n and 1 ≤ a_i ≤ n_c. Figure 5 shows an example with n_c equal to 3.

[Figure 5. 3-ary Document Tree with Path Sequences.]

Based on the Parent(i) function, we can calculate the node id from the path of a node as shown in Equation 1.

$$ f(\mathit{path}) = \begin{cases} \sum_{i=1}^{2} a_i, & k \le 2 \\ n_c^{\,k-2} \sum_{i=1}^{2} a_i - n_c^{\,k-2} + \sum_{i=0}^{k-3} n_c^{\,i}\, a_{k-i} + 1, & k > 2 \end{cases} \tag{1} $$

For example, with n_c = 5 the path (1, 1, 3) of the articles element in Figure 2 gives 5·(1 + 1) − 5 + 3 + 1 = 9, which is the node ID shown in the figure.

In our numbering scheme, we can embed the document id in the node id in order to save storage space. The function to calculate the node id is shown in Equation 2. It has two parts: the first part is the path function and the second part carries the document id. Given a document id, docid, and m, where m is the number of decimal places reserved for the document id, we have

$$ \mathit{nodeid} = f(\mathit{path}) + \frac{\mathit{docid}}{10^{m}} \tag{2} $$

With the node ids, we can retrieve the ancestor-descendant relationship by calculation. Given the node ids of i and j, where i is an ancestor of j and the depth difference between these two nodes is d, we can use Equation 3 to confirm that relationship.

$$ \mathit{nodeid}_i = \left\lfloor \frac{\mathit{nodeid}_j - 2 - \sum_{t=1}^{d-1} n_c^{\,t}}{n_c^{\,d}} \right\rfloor + 1 + \mathit{nodeid}_j - \lfloor \mathit{nodeid}_j \rfloor \tag{3} $$

Since the path sequence of each node in an XML document tree increases from top to bottom and from left to right, the result of the path function always increases as the path sequence increases. Therefore, the path function is monotonically increasing. As the document id is embedded in the node id, the function must be reversible in order to retrieve the document id. The requirement for reversibility is that the document id contributes less than one when it is embedded in the node id. Therefore, we use decimal places to store the document id. The reverse functions that retrieve the value of the path function and the document id are:

$$ \mathit{docid} = \left( \mathit{nodeid} - \lfloor \mathit{nodeid} \rfloor \right) \cdot 10^{m} $$

$$ f(\mathit{path}) = \lfloor \mathit{nodeid} \rfloor $$
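The numbering scheme above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`parent`, `child`, `f_path`, `f_path_closed`, `node_id`, `split_node_id`, `ancestor`) are hypothetical, the root is assumed to have UID 1 and path (1), and `Decimal` is used as an assumption so that the embedded document id survives the floor and fraction operations exactly.

```python
from decimal import Decimal


def parent(uid: int, nc: int) -> int:
    """Parent(i) = floor((i - 2) / nc) + 1 in a complete nc-ary tree rooted at UID 1."""
    return (uid - 2) // nc + 1


def child(uid: int, a: int, nc: int) -> int:
    """UID of the a-th child (1 <= a <= nc); this inverts parent()."""
    return nc * (uid - 1) + 1 + a


def f_path(path, nc: int) -> int:
    """Path function computed step by step from the root (path[0] is the root, UID 1)."""
    uid = 1
    for a in path[1:]:
        uid = child(uid, a, nc)
    return uid


def f_path_closed(path, nc: int) -> int:
    """Closed form of Equation 1, as reconstructed in the text."""
    k = len(path)
    if k <= 2:
        return sum(path[:2])
    head = nc ** (k - 2) * (path[0] + path[1]) - nc ** (k - 2)
    # a_{k-i} is path[k - i - 1] because Python lists are 0-indexed.
    tail = sum(nc ** i * path[k - i - 1] for i in range(k - 2))
    return head + tail + 1


def node_id(path, docid: int, nc: int, m: int) -> Decimal:
    """Equation 2: f(path) + docid / 10^m."""
    return Decimal(f_path(path, nc)) + Decimal(docid) / (Decimal(10) ** m)


def split_node_id(nodeid: Decimal, m: int):
    """Reverse functions: return (f(path), docid)."""
    whole = int(nodeid)  # floor, since node ids are non-negative
    return whole, int((nodeid - whole) * (Decimal(10) ** m))


def ancestor(nodeid: Decimal, d: int, nc: int) -> Decimal:
    """Equation 3: node id of the ancestor d levels above, same document.

    The integer part is used in the numerator; the fractional part is
    smaller than one and does not change the floor.
    """
    whole = int(nodeid)
    offset = sum(nc ** t for t in range(1, d))
    return Decimal((whole - 2 - offset) // nc ** d + 1) + (nodeid - whole)
```

With n_c = 5, the path of the first @position attribute in Figure 2 is (1, 1, 3, 1, 4, 1, 1), and both `f_path` and `f_path_closed` map it to 5232; `ancestor` applied with d = 2 then yields the authors node, 210, with the document fraction preserved.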
INode uses a three-table schema. The tables are Path, Element and Attribute.

INode 1
Path(PathID, PathExp)
Element(NodeID, PathID, Ordinal, Value)
Attribute(NodeID, PathID, Value)

The Path table stores the path information of the XML document collection. Each path expression is stored in the attribute PathExp and is assigned a unique ID. In the Element and Attribute tables, NodeID is the node identifier (node id) that is calculated based on Equation 1. The attribute PathID serves as a foreign key referencing the ID in the Path table and is used to identify the path expression of a node. The database attribute Ordinal in the Element table represents the occurrence order of a node among its sibling nodes in document order.

Table 4 shows the tables that store the XML data graph with n_c = 5. Like XRel [6], we use '#/' to separate the tag names so that regular path expressions can be evaluated correctly with SQL.

PathID  PathExp
1       #/SigmodRecord
2       #/SigmodRecord#/issue
3       #/SigmodRecord#/issue#/volume
4       #/SigmodRecord#/issue#/number
5       #/SigmodRecord#/issue#/articles
6       #/SigmodRecord#/issue#/articles#/article
7       #/SigmodRecord#/issue#/articles#/article#/title
8       #/SigmodRecord#/issue#/articles#/article#/initPage
9       #/SigmodRecord#/issue#/articles#/article#/endPage
10      #/SigmodRecord#/issue#/articles#/article#/authors
11      #/SigmodRecord#/issue#/articles#/article#/authors#/author
12      #/SigmodRecord#/issue#/articles#/article#/authors#/author#/@position
(i) Path Table

NodeID   PathID  Ordinal  Value
7.00     3       1        11
8.00     4       2        1
207.00   7       1        Annotated Bibliography on Data Design.
208.00   8       2        45
209.00   9       3        77
1047.00  11      1        Anthony I. Wasserman
1048.00  11      2        Karen Botnich
210.00   10      4
42.00    6       1
9.00     5       3
2.00     2       1
1.00     1       1
(ii) Element Table

NodeID   PathID  Value
5232.00  12      00
5237.00  12      01
(iii) Attribute Table

Table 4. The INode Schema for the XML Data Graph in Figure 2.

The key features of the INode schema can be summarized as follows:

1. INode is a node-oriented approach. The schema does not maintain the edge information explicitly. Therefore, it does not need to concatenate edges to form a simple path for query processing.
2. INode uses a table to store all simple path expressions. This reduces the database size and makes queries with regular path expressions more efficient.
3. Unlike XRel, INode uses the numbering scheme to replace the region concept. The parent-child relationship is embedded in the NodeID. In addition, the document identifier is also embedded in this attribute instead of being stored in an extra column. This further reduces the database size.
4. INode uses the attribute NodeID to retrieve the parent-child relationship by calculation. This can reduce the number of equijoins or θ-joins.

To assess the effectiveness of embedding the document id in the node id (denoted as INode1), we have implemented another schema that uses a separate attribute to store the document id. This schema is denoted as INode2. The Path table is unchanged; the Element and Attribute tables change as shown below.

INode 2
Path(PathID, PathExp)
Element(NodeID, DocID, PathID, Ordinal, Value)
Attribute(NodeID, DocID, PathID, Value)
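To make the schema concrete, the following Python sketch shreds the sample document of Figure 1 into Path, Element and Attribute rows. It is a rough illustration under stated assumptions rather than the authors' loader: the function name `shred` is hypothetical, node ids are kept as plain integers (the document id is left out, as in INode 2), attributes are assumed to occupy the first child slots of their element, and only the leading text of an element is kept as its value.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<SigmodRecord>
  <issue>
    <volume>11</volume>
    <number>1</number>
    <articles>
      <article>
        <title>Annotated Bibliography on Data Design.</title>
        <initPage>45</initPage>
        <endPage>77</endPage>
        <authors>
          <author position="00">Anthony I. Wasserman</author>
          <author position="01">Karen Botnich</author>
        </authors>
      </article>
    </articles>
  </issue>
</SigmodRecord>
"""


def shred(xml_text: str, nc: int):
    """Return (paths, elements, attributes) for one document.

    paths:      dict PathExp -> PathID
    elements:   list of (NodeID, PathID, Ordinal, Value)
    attributes: list of (NodeID, PathID, Value)
    """
    paths, elements, attributes = {}, [], []

    def path_id(exp: str) -> int:
        # Path ids are handed out in order of first appearance.
        return paths.setdefault(exp, len(paths) + 1)

    def visit(elem, uid: int, parent_exp: str, ordinal: int) -> None:
        exp = f"{parent_exp}#/{elem.tag}"
        text = " ".join((elem.text or "").split()) or None
        elements.append((uid, path_id(exp), ordinal, text))

        n_attr = len(elem.attrib)
        if n_attr + len(elem) > nc:
            raise ValueError(f"{exp} has more than nc={nc} children")

        first_child = nc * (uid - 1) + 2  # UID of child slot 1
        # Assumption: attributes take the first slots, child elements follow.
        for slot, (name, value) in enumerate(elem.attrib.items()):
            attributes.append((first_child + slot, path_id(f"{exp}#/@{name}"), value))
        for child_ordinal, sub in enumerate(elem, start=1):
            visit(sub, first_child + n_attr + child_ordinal - 1, exp, child_ordinal)

    visit(ET.fromstring(xml_text.strip()), 1, "", 1)
    return paths, elements, attributes


paths, elements, attributes = shred(SAMPLE, nc=5)
```

For this sample and n_c = 5, the rows produced agree with the node ids, path ids and ordinals listed in Table 4 (without the ".00" document fraction).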
5. Query Processing

In [9], the authors discussed the translation between XML queries and SQL statements for the Edge, XRel and XParent approaches. Here, we focus on the translation for INode. Example 1 shows an XML query using the XPath syntax [2].

Example 1. "Select the authors of all articles whose end page is equal to 77", with the XML data shown in Figure 2.

Q1: /SigmodRecord/issue/articles/article[endPage=77]/authors

SQL-1 shows the translated SQL query using INode with the document id embedded in the node id. SQL-1 uses two equijoins and two selections to identify the two path identifiers. Then it uses two equijoins to check the edge connections. In total, four equijoins and three selections are used.

SQL-1 A translated SQL query for the XPath query Q1 using INode 1
select e1.value
from path p1, path p2, element e1, element e2
where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
  and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
  and e1.pathid = p1.pathid
  and e2.pathid = p2.pathid
  and mod(e1.nodeid, 1) = mod(e2.nodeid, 1)
  and floor(((e1.nodeid - 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)
  and e2.value = '77'

SQL-2 shows the translated SQL query using INode without the document id embedded in the node id. Like SQL-1, it uses four equijoins to process the query. However, it can check the document condition directly without calculation.

SQL-2 A translated SQL query for the XPath query Q1 using INode 2
select e1.value
from path p1, path p2, element e1, element e2
where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
  and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
  and e1.pathid = p1.pathid
  and e2.pathid = p2.pathid
  and e1.docid = e2.docid
  and floor(((e1.nodeid - 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)
  and e2.value = '77'

SQL-3 A translated SQL query for the XPath query Q1 using Edge
select authors.value
from edge sigmodrecord, edge issue, edge articles, edge article,
     edge authors, edge endpage
where sigmodrecord.label = 'SigmodRecord'
  and issue.label = 'issue'
  and articles.label = 'articles'
  and article.label = 'article'
  and authors.label = 'authors'
  and endpage.label = 'endPage'
  and sigmodrecord.source = '0'
  and sigmodrecord.target = issue.source
  and issue.target = articles.source
  and articles.target = article.source
  and article.target = authors.source
  and endpage.source = authors.source
  and endpage.value = '77'

SQL-4 A translated SQL query for the XPath query Q1 using XRel
select t1.value
from path p1, path p2, path p3, element e1, text t1, text t2
where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article'
  and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
  and p3.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
  and e1.pathid = p1.pathid
  and t1.pathid = p2.pathid
  and t2.pathid = p3.pathid
  and e1.start < t1.start
  and e1.end > t1.end
  and e1.start < t2.start
  and e1.end > t2.end
  and t2.value = '77'

SQL-5 A translated SQL query for the XPath query Q1 using XParent
select d1.value
from labelpath lp1, labelpath lp2, datapath dp1, datapath dp2,
     data d1, data d2
where lp1.path = './SigmodRecord./issue./articles./article./authors'
  and lp2.path = './SigmodRecord./issue./articles./article./endPage'
  and d1.pathid = lp1.pathid
  and d2.pathid = lp2.pathid
  and d1.did = dp1.cid
  and d2.did = dp2.cid
  and dp1.pid = dp2.pid
  and d2.value = '77'
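As a quick way to see the INode 2 translation at work, the sketch below loads a few rows of Table 4 into an in-memory SQLite database and runs a query shaped like SQL-2. Several things here are assumptions made for illustration and are not from the paper: SQLite stands in for the Oracle 8i system used in the experiments, node ids are stored as integers so that SQLite's integer division plays the role of floor() (which is only valid for node ids of 2 or more), the query returns node id and document id instead of the value, and the rows with DocID 1 are hypothetical distractors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE path (pathid INTEGER PRIMARY KEY, pathexp TEXT);
    CREATE TABLE element (
        nodeid INTEGER, docid INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT
    );
    """
)

conn.executemany(
    "INSERT INTO path VALUES (?, ?)",
    [
        (6, "#/SigmodRecord#/issue#/articles#/article"),
        (9, "#/SigmodRecord#/issue#/articles#/article#/endPage"),
        (10, "#/SigmodRecord#/issue#/articles#/article#/authors"),
    ],
)

conn.executemany(
    "INSERT INTO element VALUES (?, ?, ?, ?, ?)",
    [
        # Rows taken from Table 4 (document 0).
        (42, 0, 6, 1, None),
        (209, 0, 9, 3, "77"),
        (210, 0, 10, 4, None),
        # Hypothetical second document with a different end page.
        (42, 1, 6, 1, None),
        (209, 1, 9, 3, "90"),
        (210, 1, 10, 4, None),
    ],
)

# Same join structure as SQL-2, with nc = 5. Both operands of "/" are
# integers, so SQLite truncates the quotient, which equals floor() here.
Q1 = """
SELECT e1.nodeid, e1.docid
FROM path p1, path p2, element e1, element e2
WHERE p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
  AND p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
  AND e1.pathid = p1.pathid
  AND e2.pathid = p2.pathid
  AND e1.docid = e2.docid
  AND (e1.nodeid - 2) / 5 + 1 = (e2.nodeid - 2) / 5 + 1
  AND e2.value = '77'
"""

rows = conn.execute(Q1).fetchall()
```

Only the authors node of document 0 qualifies: it shares the computed parent 42 with the endPage node whose value is 77, and no join against an edge or parent table is needed to establish that.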
As mentioned in [9], the number of equijoins and selections of the Edge approach is determined by the number of edge connections. For the above sample XPath query, a total of five equijoins and eight selections are used for Edge, as shown in SQL-3. SQL-4 shows the translated SQL query for XRel: three equijoins, four θ-joins and four selections are used to process the sample query. In SQL-5, XParent uses five equijoins and three selections. Therefore, INode uses fewer table joins to process the query, and this can improve the performance.
6. Experiment Results

To evaluate the performance of INode, experimental studies have been conducted. We studied Edge, XRel, XParent and the two versions of the proposed approach. All experiments were conducted on a 733MHz Pentium III with 256M RAM and a 20G hard disk. The RDBMS used was Oracle 8i Enterprise Edition. We conducted our experiments using the data set of the Bosak Shakespeare collection [13] and the XML benchmark project [14]. Detailed information about the former data set is shown below:

Number of documents                                                 37
Total size                                                          7.5MB
Total number of element nodes                                       179,689
Total number of attribute nodes                                     0
Total number of text nodes                                          147,442
Total number of unique paths                                        57
The largest number of child elements of an element in the structure 434
Maximum depth of documents                                          6
Table 5: Data Set Details of the Bosak Shakespeare Collection

As shown in Table 5, the largest number of child elements of a node in the structure is 434, so we choose 435 as the maximum number of child nodes. We use 2 decimal places to store the document id since the total number of XML documents is only 37. We first consider the storage usage of the tables in the relational DBMS under the five approaches for the first data set. Figure 6 shows the details.

[Figure 6. Storage Usage with Edge, XRel, XParent and INode. Bar chart of storage usage (MB) for Edge, XRel, XParent, INode 1 and INode 2.]

INode uses less storage than the other three approaches. For Edge, each tuple has an attribute to store the corresponding label, which increases the storage usage. Unlike Edge, INode reduces the storage by storing the path expressions in a table. XRel and XParent use the region and a table, respectively, to maintain the parent-child and ancestor-descendant relationships. This also increases the storage usage. INode reduces the storage usage by using the node id to maintain the parent-child and ancestor-descendant relationships.

For the second data set, we generated the data with different scale factors. Three different sizes of data are used. Table 6 shows the details of the second data set.

Name   Parameters Used               Size (MB)  Maximum depth  The largest number of child elements
Set 1  scale factor=0.05, split=10   5.61       12             50
Set 2  scale factor=0.1, split=10    11.3       12             100
Set 3  scale factor=0.2, split=10    22.8       12             200
Table 6. Data Set Details of the XML Benchmark Project.

Data Set  INode 1 / XParent  INode 2 / XParent
Set 1     0.819              0.844
Set 2     0.832              0.856
Set 3     0.835              0.860
Table 7. Storage Usage Ratio between INode and XParent.

[Figure 7: Storage Usage of The XML Benchmark Project Data Sets. Storage usage (MB) of INode 1, INode 2 and XParent for Set 1, Set 2 and Set 3.]

Table 7 shows that INode reduces the storage usage by about 15% compared with XParent. Figure 7 shows the storage usage of the three data sets using XParent and the two versions of INode. By observation, the growth rate of INode is lower than that of XParent because XParent uses an extra table to store the parent-child relationship, and the number of tuples in this table increases as the number of nodes increases. For INode, the parent-child relationship is embedded in the attribute NodeID. This reduces the storage usage and hence the growth rate.

Indexes on the relational DBMS were built to improve query processing as follows:

- For Edge, we created indexes as proposed in [8].
- For XParent, three B+-tree indexes were created on DataPath(Pid), DataPath(Cid) and Data(Value).
- For XRel, a B+-tree index was created on Text(Value).
- For the proposed approach, two function-based indexes were created on Element(nodeid, ((nodeid - 2) / nc) + 1) and Attribute(nodeid, ((nodeid - 2) / nc) + 1). Two B+-tree indexes were created on Element(Value) and Attribute(Value).

6.1. Query Performances

We conducted our experiments using the Bosak Shakespeare collection and the same set of queries as in [6] and [9]. The queries are listed below.

- QS1: /PLAY/ACT
- QS2: /PLAY/ACT/SCENE/SPEECH/LINE/STAGEDIR
- QS3: //SCENE/TITLE
- QS4: //ACT//TITLE
- QS5: /PLAY/ACT[2]
- QS6: (/PLAY/ACT)[2]/TITLE
- QS7: /PLAY/ACT/SCENE/SPEECH[SPEAKER = 'CURIO']
- QS8: /PLAY/ACT/SCENE[//SPEAKER = 'Steward']/TITLE

[Figure 8. Query Elapsed Time: using The Bosak Shakespeare Collection. Elapsed time (seconds) of queries QS1 to QS8 for Edge, XRel, XParent, INode 1 and INode 2.]

The query elapsed times are shown in Figure 8. Every query is run ten times for each approach and the average elapsed time is taken. For the queries with short and simple paths, such as QS1, QS3, QS5 and QS6, Edge performs similarly to INode because the number of table joins needed to form the simple path is not large. For QS2 and QS7, Edge needs many table joins to connect the edges. For the queries with regular path expressions, such as QS4 and QS8, Edge needs to traverse nearly the whole XML data tree. INode stores simple path expressions to limit the search space, and no table join is needed for edge connections. Therefore, INode outperforms Edge significantly for these queries.

XRel performs similarly to INode for the queries QS1-QS5, which are one-path queries. For QS6-QS8, INode outperforms XRel because INode uses calculation to retrieve the parent-child relationship instead of using θ-joins.

Comparing INode with XParent, we found that they have similar performance except for QS7. The reason is that INode does not need table joins to retrieve the parent-child relationship.

The two versions of INode have similar performance for all queries. The main difference is that INode2 needs more space to store the document id.

6.2. Scalability Test: INode vs XParent

In this section, we investigate the scalability of the two versions of INode in comparison with XParent using the three data sets generated from the XML benchmark project as listed in Table 6. We conducted the test using eight queries. Details of the eight selected queries are listed below.

- Query 1: Return the name of the person with ID 'person0'.
- Query 2: Return the initial increases of all open auctions.
- Query 4: List the reserves of those open auctions where a certain person issued a bid before another person.
- Query 8: List the names of persons and the number of items they bought.
- Query 9: List the names of persons and the names of the items they bought in Europe.
- Query 15: Print the keywords in emphasis in annotations of closed auctions.
- Query 17: Which persons don't have a homepage?
- Query 19: Give an alphabetically ordered list of all items along with their location.

[Figure 9. Scalability Test with INode and XParent. Panels: (a) Query Elapsed Time Ratio for INode 1, (b) Query Elapsed Time Ratio for INode 2, (c) Query Elapsed Time Ratio for XParent. Each panel plots the ratio for Q1, Q2, Q4, Q8, Q9, Q15, Q17 and Q19 over Set1, Set2 and Set3.]

Figure 9 shows the elapsed time ratios for the two versions of INode and for XParent. The query elapsed time ratio is defined as t2/t1, where t1 is the elapsed time of a query using Set 1 and t2 is the elapsed time of the same query using Set 2 or Set 3. The finding is that the scalability of INode, in terms of data size, is superior to that of XParent for the queries that retrieve more ancestor-descendant relationships, such as Q8 and Q9. The main factor is that XParent needs more table joins to retrieve the relationship.
7. Conclusions

In this paper, we proposed a new model-mapping approach, INode, which is based on a numbering scheme. Instead of table joins, it can retrieve the parent-child and ancestor-descendant relationships by calculation. In the first experimental study, we studied the storage usage of INode using two data sets in comparison with Edge, XRel and XParent. INode uses less storage space, and it also outperforms Edge, XRel and XParent in most cases in the query performance study. As future work, we will investigate an indexing scheme based on INode to further improve the query performance.

8. Acknowledgement

The work of the authors is supported in part by the Central Grant of The Hong Kong Polytechnic University, research project code HZJ89.

9. References

[1] Chamberlin, D., Florescu, D., et al. (2001). XQuery: A query language for XML. In W3C Working Draft, http://www.w3.org/TR/xquery
[2] Clark, J. and DeRose, S. (1999). XML path language (XPath). In W3C Recommendation 16 November 1999, http://www.w3.org/TR/xpath
[3] Deutsch, A., Fernandez, M., and Florescu, D. (1999). A query language for XML. In Proceedings of the 8th International World Wide Web Conference.
[4] Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. L. (1997). The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1), pages 68-88.
[5] Chamberlin, D. D., Robie, J., and Florescu, D. (2000). Quilt: An XML query language for heterogeneous data sources. In WebDB (Informal Proceedings), pages 53-62.
[6] Yoshikawa, M. and Amagasa, T. (2001). XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases. ACM Transactions on Internet Technology, 1(1), pages 110-141.
[7] Kappel, G., Kapsammer, E., Rausch-Schott, S., and Retschitzegger, W. (2000). X-Ray - Towards Integrating XML and Relational Database Systems. International Conference on Conceptual Modeling (ER), pages 339-353.
[8] Florescu, D. and Kossmann, D. (1999). A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. Technical report.
[9] Jiang, H., Lu, H., Wang, W., and Yu, J. X. (2002). Path Materialization Revisited: An Efficient Storage Model for XML Data. Thirteenth Australasian Database Conference (ADC2002).
[10] World Wide Web Consortium (2000). Extensible Markup Language (XML) 1.0 (Second Edition). http://www.w3.org/TR/2000/REC-xml-20001006
[11] Lee, Y. K., Yoo, S. J., Yoon, K., and Berra, P. B. (1996). Index Structures for Structured Documents. In Proc. Digital Library '96, pages 91-99.
[12] Kha, D. D., Yoshikawa, M., and Uemura, S. (2001). An XML indexing structure with relative region coordinate. In Proceedings of the 17th IEEE International Conference on Data Engineering. IEEE Computer Society Press, Los Alamitos, CA, pages 313-320.
[13] The Bosak Shakespeare collection, http://metalab.unc.edu/bosak/xml/eg/shaks200.zip
[14] Schmidt, A. R., Waas, F., Kersten, M. L., Florescu, D., Manolescu, I., Carey, M. J., and Busse, R. (2001). The XML benchmark project. Technical report, CWI, Amsterdam, The Netherlands.
[15] SigmodRecord archive, http://www.acm.org/sigmod/record/