INode: An Effective Numbering Scheme for Storing XML Data in Relational Databases

Lau Ho Kit and Vincent Ng
Department of Computing
Hong Kong Polytechnic University

Abstract

XML has become the standard for representing and exchanging information on the Internet. This poses the challenge of storing and querying XML data efficiently. Several model-mapping approaches have been proposed to store XML data in relational database systems. The key features of the model-mapping approach are that fixed database schemas are used for all XML documents and that DTD information is not needed. In this paper, we present a new model-mapping approach, called INode. This approach is based on a numbering scheme for elements. It enables quick retrieval of parent-child and ancestor-descendant relationships between elements in the XML data graph. Experiments with two data sets and two query sets show that INode can reduce the storage requirement while having a better query performance.

1. Introduction

Extensible Markup Language (XML) has become the standard for exchanging information on the Internet. XML documents comprise hierarchically nested collections of elements and tags describing the semantics of the data. XML provides a flexible way to exchange data between different platforms. Several XML query languages have been proposed, such as XQuery [1], XPath [2], XML_QL [3], Lorel [4] and Quilt [5], and their common feature is the use of regular path expressions to query XML data.

As most existing systems use relational database systems, a large fraction of XML documents will be stored in relational DBMS in order to minimize the management cost of using XML for data exchange. Therefore, the modeling issue between XML data and relational databases has received special attention.

In [6], the authors classified the approaches for mapping XML documents to relational DBMS into two classes: the structure-mapping approach and the model-mapping approach. The database schemas of the former approach are based on the logical structure of the XML documents. Mostly, they are based on the DTD (Document Type Definition), as in X-Ray [7]. Since the schemas are based on the logical structure, this approach is not suitable for data with a dynamic structure. The database schemas of the latter approach are fixed for all XML documents. Examples include Edge [8], XRel [6] and XParent [9]. The model-mapping approach is able to support XML documents whose DTDs are not known in the design phase, or that have no DTD at all. Therefore, it is more flexible and convenient for managing XML data in relational DBMS.

In this paper, we propose a new model-mapping approach. The approach uses unique element identifiers (UIDs) as a numbering scheme for node IDs. Each element in the XML document is assigned a UID as its identifier. With this scheme, we can obtain the UIDs of the ancestors and descendants directly from the UID of an element. Therefore, we can find the relationship between different elements based on their UIDs.

The paper is structured as follows. Section 2 briefly discusses a sample XML document. Section 3 reviews three existing model-mapping approaches. Section 4 introduces our new model-mapping approach, INode. In Section 5, we discuss the query processing issue. In Section 6, experimental results are shown. Finally, we conclude the paper in Section 7.

    <SigmodRecord>
      <issue>
        <volume>11</volume>
        <number>1</number>
        <articles>
          <article>
            <title>Annotated Bibliography on Data Design.</title>
            <initPage>45</initPage>
            <endPage>77</endPage>
            <authors>
              <author position="00">Anthony I. Wasserman</author>
              <author position="01">Karen Botnich</author>
            </authors>
          </article>
        </articles>
      </issue>
    </SigmodRecord>

Figure 1. A Simplified XML Document of SigmodRecord.

2. Overview of XML document

Extensible Markup Language (XML) is a simplified subset of SGML that is created by the World Wide Web Consortium (W3C) [10].
XML documents comprise hierarchically nested collections of elements, where each element can be either atomic or composite. Further, tags stored with elements in an XML document describe the semantics of the data rather than simply specifying how the elements are to be displayed (as in HTML). Figure 1 shows a simplified XML document of SigmodRecord [15], and Figure 2 shows the corresponding data graph.

    Root
      SigmodRecord (1)
        issue (2)
          volume (7): 11
          number (8): 1
          articles (9)
            article (42)
              title (207): Annotated Bibliography on Data Design.
              initPage (208): 45
              endPage (209): 77
              authors (210)
                author (1047): Anthony I. Wasserman
                  @position (5232): 00
                author (1048): Karen Botnich
                  @position (5237): 01

    Legend: element nodes and attribute nodes (prefixed with @) carry the number shown in parentheses; text nodes follow the colon.

Figure 2. Data graph of the Simplified XML Document of SigmodRecord.

3. Three Model-Mapping Approaches

Edge [8], XRel [6] and XParent [9] are three model-mapping approaches that store different structures of XML documents in relational DBMS. Consider the XML data graph shown in Figure 3, where the nodes in the graph can be either element nodes or attribute nodes, and text nodes are not included. The data graph contains nodes with IDs n0, n1, n2, ..., nn, where n0 is the root, and l1, l2, l3, ..., ln are the labels between pairs of nodes.

[Figure 3: a data graph rooted at n0 whose nodes n0, n1, ..., nn are connected by edges labelled l1, l2, ..., ln.]

Figure 3. A Sample XML Data Graph.

The Edge approach records each pair of node IDs with the corresponding label in a single table. For example, the label between the nodes n0 and n1 is l1, and they are stored in the attributes Source, Target and Label, respectively. Unlike Edge, XRel stores all the simple path expressions, each represented as a sequence of labels such as (l1, l4), in a table, and keeps the region of each node to preserve the precedence and the ancestor-descendant relation among nodes. The region is specified by a pair of numbers: the start and end positions of a node in an XML document. The XParent approach also uses a table to store all the path expressions of an XML document. Instead of using the region to maintain the ancestor-descendant relationship, XParent uses a separate table to keep the parent-child relationships among nodes and retrieves the ancestor-descendant relationships by joining that table.

3.1. Edge

The Edge approach [8] stores the XML data graph of Figure 2 in a single table called Edge.

    Edge(Source, Ordinal, Target, Label, Flag, Value)

Each node in the data graph is assigned a number. Each tuple in the table corresponds to an edge in the data graph. For each edge in the data graph, the table stores the source ID, the target ID and the label of the edge. The Ordinal attribute keeps the ordinal of the edge among its siblings. The Flag attribute indicates whether the edge refers to an inter-object reference (ref) or a value (val). Further, Edge uses the inlining approach to store the text in the Value attribute when the node has a text child.

    Source  Ordinal  Target  Label         Flag  Value
    0       1        1       SigmodRecord  ref
    1       1        2       issue         ref
    2       1        3       volume        val   11
    2       2        4       number        val   1
    2       3        5       articles      ref
    5       1        6       article       ref
    6       1        7       title         val   Annotated Bibliography on Data Design.
    6       2        8       initPage      val   45
    6       3        9       endPage       val   77
    6       4        10      authors       ref
    10      1        11      author        val   Anthony I. Wasserman
    11      1        12      @position     val   00
    10      2        13      author        val   Karen Botnich
    13      1        14      @position     val   01

Table 1. Edge Table for the XML Data Graph in Figure 2.

3.2. XRel

The XRel approach [6] uses a schema of four tables to store the XML data graph. The tables are Path, Element, Text and Attribute.
    Path(PathID, PathExp)
    Element(DocID, PathID, Start, End, Ordinal)
    Text(DocID, PathID, Start, End, Value)
    Attribute(DocID, PathID, Start, End, Value)

The database attributes DocID, PathID, Start, End and Value represent the document identifier, the simple path expression identifier, the start position of a region, the end position of a region and the string value, respectively [6]. The region is used to uniquely identify the occurrence of an element node or a text node. The region of a node is identified by the start and end positions of this node in the XML document. As given in [12], the region can be computed based on the Absolute Region Coordinate (ARC) or the Relative Region Coordinate (RRC). ARC expresses the absolute location of a node in relation to the root node of the XML document. RRC expresses the location of a node in relation to its parent node rather than the root node. The advantage of using RRC is that it can minimize the cost of document updates on positions. The key feature of XRel is that each node in the XML document is represented by the combination of a simple path expression and a region.

(i) Path Table

    PathID  PathExp
    1       #/SigmodRecord
    2       #/SigmodRecord#/issue
    3       #/SigmodRecord#/issue#/volume
    4       #/SigmodRecord#/issue#/number
    5       #/SigmodRecord#/issue#/articles
    6       #/SigmodRecord#/issue#/articles#/article
    7       #/SigmodRecord#/issue#/articles#/article#/title
    8       #/SigmodRecord#/issue#/articles#/article#/initPage
    9       #/SigmodRecord#/issue#/articles#/article#/endPage
    10      #/SigmodRecord#/issue#/articles#/article#/authors
    11      #/SigmodRecord#/issue#/articles#/article#/authors#/author
    12      #/SigmodRecord#/issue#/articles#/article#/authors#/author#/@position

(ii) Attribute Table

    DocID  PathID  Start  End  Value
    0      12      184    184  00
    0      12      235    235  01

(iii) Element Table

    DocID  PathID  Start  End  Ordinal
    0      1       0      332  1
    0      2       14     317  1
    0      3       21     40   1
    0      4       40     58   2
    0      5       58     309  3
    0      6       68     298  1
    0      7       77     130  1
    0      8       130    153  2
    0      9       153    174  3
    0      10      174    288  4
    0      11      183    234  1
    0      11      234    278  2

(iv) Text Table

    DocID  PathID  Start  End  Value
    0      3       29     31   11
    0      4       48     49   1
    0      7       84     122  Annotated Bibliography on Data Design.
    0      8       140    142  45
    0      9       162    164  77
    0      11      191    211  Anthony I. Wasserman
    0      11      242    255  Karen Botnich

Table 2. The XRel Schema for the XML Data Graph in Figure 2.

3.3. XParent

The XParent approach [9] uses a four-table schema to store XML data. The tables are LabelPath, DataPath, Element and Data.

    LabelPath(PathID, Len, Path)
    DataPath(Pid, Cid)
    Element(PathID, Ordinal, Did)
    Data(PathID, Did, Ordinal, Value)

The database attributes PathID, Len, Pid, Cid, Did and Value represent the label-path identifier, the number of edges of the label path, the parent-node id, the child-node id, the data-path identifier and the string value, respectively. Since the DataPath table stores the attributes Pid and Cid, it maintains the parent-child relationships. Table joins are needed in order to check the ancestor-descendant relationship. To speed up this processing, the authors proposed to use another table, named Ancestor, to keep this relationship. However, it needs more disk space to store this table and it increases the overhead for update operations. The key feature of XParent is that it uses the LabelPath and DataPath tables to maintain the structural information of the XML documents. The DataPath table is also used to keep the parent-child relationship instead of using the region as in XRel.

(i) LabelPath Table

    PathID  Len  Path
    1       1    ./SigmodRecord
    2       2    ./SigmodRecord./issue
    3       3    ./SigmodRecord./issue./volume
    4       3    ./SigmodRecord./issue./number
    5       3    ./SigmodRecord./issue./articles
    6       4    ./SigmodRecord./issue./articles./article
    7       5    ./SigmodRecord./issue./articles./article./title
    8       5    ./SigmodRecord./issue./articles./article./initPage
    9       5    ./SigmodRecord./issue./articles./article./endPage
    10      5    ./SigmodRecord./issue./articles./article./authors
    11      6    ./SigmodRecord./issue./articles./article./authors./author
    12      7    ./SigmodRecord./issue./articles./article./authors./author./@position

(ii) DataPath Table

    Pid  Cid
    1    2
    2    3
    2    4
    2    5
    5    6
    5    15
    6    7
    6    8
    6    9
    6    10
    10   11
    10   12
    10   13
    10   14

(iii) Element Table

    PathID  Ordinal  Did
    1       1        1
    2       1        2
    3       1        3
    4       1        4
    5       1        5
    6       1        6
    7       1        7
    8       1        8
    9       1        9
    10      1        10
    11      1        11
    12      1        12
    11      2        13
    12      1        14

Table 3. The XParent Schema for the XML Data Graph in Figure 2.

4. INode

In this section, we introduce a new model-mapping approach called INode.

The Edge approach is easy to maintain because it uses a single-table schema. Since it only stores individual edges, it needs a large number of equijoins to check the edge connections. When a query needs to retrieve the ancestor-descendant relationship, the query performance degrades.

XRel uses a table to store all simple path expressions. This improves the query performance and makes regular path expressions easy to evaluate. It also uses the concept of region to maintain the ancestor-descendant relationship. A node i is reachable from another node j if the region of i is included in the region of j. As a result, XRel can identify the containment relationship by using θ-joins. However, a θ-join is more costly than an equijoin.

Unlike XRel, XParent uses a table to maintain the parent-child relationship instead of using the concept of region. Therefore, it can use equijoins to test the relationship, and the performance can be improved. However, for some complex queries that require checking the ancestor-descendant relationship, XParent requires a large number of equijoins. Although it can use another table, Ancestor, to keep the ancestor-descendant relationship, this requires more storage space.

Therefore, we introduce a new model-mapping approach that has a good query performance for complex queries while reducing the storage usage.

Lee et al. [11] proposed inverted index and signature file schemes in order to reduce the storage needed for all index entries.
They interpreted a document structure as a k-ary tree, where k is the maximum number of child nodes of a node in the structure. Since the document tree is made complete, it contains some virtual nodes that do not exist in the document. They then assigned a UID to each node according to the order of the level-order tree traversal. Figure 4 shows a document tree and the assignment of UIDs to nodes. For a node whose UID is i, the parent's UID can be calculated directly by the following function.

\[ \mathrm{Parent}(i) = \left\lfloor \frac{i - 2}{k} \right\rfloor + 1 \]

[Figure 4: a 3-ary document tree whose nodes are numbered with UIDs in level order; the legend distinguishes real nodes from virtual nodes.]

Figure 4. 3-ary document tree with UIDs.

Our approach uses the concept of UID for assigning the ID to each node in the XML data graph. It represents an XML document as an n_c-ary complete tree, where n_c is the maximum number of child nodes of a node in the structure. Given a node n, the path of n is denoted as (a_1, a_2, a_3, ..., a_k), where k is the depth of n and 1 ≤ a_i ≤ n_c. Figure 5 shows an example with n_c equal to 3.

[Figure 5: a 3-ary document tree whose nodes are annotated with their path sequences.]

Figure 5. 3-ary Document Tree with Path Sequences.

Based on the Parent(i) function, we can calculate the node id from the path of a node as shown in Equation 1, where a_1 = 1 for the root.

\[ f(path) = \begin{cases} \sum_{i=1}^{k} a_i, & k \le 2 \\ n_c^{\,k-2} \sum_{i=1}^{2} a_i \;-\; n_c^{\,k-2} \;+\; \sum_{i=0}^{k-3} n_c^{\,i}\, a_{k-i} \;+\; 1, & k > 2 \end{cases} \qquad (1) \]

In our numbering scheme, we can embed the document id in the node id in order to save storage space. The function to calculate the node id is shown in Equation 2. It has two parts. The first part is the path function and the second part carries the document id. Given a document id, docid, and m, where m is the number of decimal places reserved for the document id, we have

\[ nodeid = f(path) + docid \times 10^{-m} \qquad (2) \]

With the node ids, we can retrieve the ancestor-descendant relationship by calculation. Given the node ids of i and j, where i is the ancestor of j and the depth difference between these two nodes is d, we can use Equation 3 to confirm that relationship.

\[ nodeid_i = \left\lfloor \frac{nodeid_j - 2 - \sum_{t=1}^{d-1} n_c^{\,t}}{n_c^{\,d}} \right\rfloor + 1 + \left( nodeid_j - \lfloor nodeid_j \rfloor \right) \qquad (3) \]

Since the path sequence of each node in an XML document tree increases from top to bottom and from left to right, the result of the path function always increases as the path sequence increases. Therefore, the path function is a monotonically increasing function. As the document id is embedded in the node id, the function must be reversible in order to retrieve the document id. The requirement for reversibility is that the document id contributes less than one when it is embedded in the node id. Therefore, we use decimal places to store the document id. The following functions recover the value of the path function and the document id.

\[ docid = \left( nodeid - \lfloor nodeid \rfloor \right) \times 10^{m}, \qquad f(path) = \lfloor nodeid \rfloor \]

INode uses a three-table schema. The tables are Path, Element and Attribute.

INode 1
    Path(PathID, PathExp)
    Element(NodeID, PathID, Ordinal, Value)
    Attribute(NodeID, PathID, Value)

The Path table stores the path information of the XML document collection. Each path expression is stored in the attribute PathExp and is assigned a unique ID. In the Element and Attribute tables, NodeID is the node identifier (node id) that is calculated based on Equation 1. The attribute PathID serves as a foreign key referencing the ID in the Path table and is used to identify the path expression of a node. The attribute Ordinal in the Element table represents the occurrence order of a node among its sibling nodes in document order. Table 4 shows the tables that store the XML data graph with n_c = 5. Like XRel [6], we use '#/' to separate the tag names so that regular path expressions can be evaluated correctly in SQL.

(i) Path Table

    PathID  PathExp
    1       #/SigmodRecord
    2       #/SigmodRecord#/issue
    3       #/SigmodRecord#/issue#/volume
    4       #/SigmodRecord#/issue#/number
    5       #/SigmodRecord#/issue#/articles
    6       #/SigmodRecord#/issue#/articles#/article
    7       #/SigmodRecord#/issue#/articles#/article#/title
    8       #/SigmodRecord#/issue#/articles#/article#/initPage
    9       #/SigmodRecord#/issue#/articles#/article#/endPage
    10      #/SigmodRecord#/issue#/articles#/article#/authors
    11      #/SigmodRecord#/issue#/articles#/article#/authors#/author
    12      #/SigmodRecord#/issue#/articles#/article#/authors#/author#/@position

(ii) Element Table

    NodeID   PathID  Ordinal  Value
    7.00     3       1        11
    8.00     4       2        1
    207.00   7       1        Annotated Bibliography on Data Design.
    208.00   8       2        45
    209.00   9       3        77
    1047.00  11      1        Anthony I. Wasserman
    1048.00  11      2        Karen Botnich
    210.00   10      4
    42.00    6       1
    9.00     5       3
    2.00     2       1
    1.00     1       1

(iii) Attribute Table

    NodeID   PathID  Value
    5232.00  12      00
    5237.00  12      01

Table 4. The INode Schema for the XML Data Graph in Figure 2.

The key features of the INode schema can be summarized as follows:

1. INode is a node-oriented approach. The schema does not maintain the edge information explicitly. Therefore, it does not need to concatenate the edges to form a simple path for query processing.
2. INode uses a table to store all simple path expressions. This reduces the database size and makes queries with regular path expressions more efficient.
3. Unlike XRel, INode uses the numbering scheme to replace the region concept. The parent-child relationship is embedded in the NodeID. In addition, the document identifier is also embedded in this attribute instead of being stored in an extra column. This further reduces the database size.
4. INode uses the attribute NodeID to retrieve the parent-child relationship by calculation. This can reduce the number of equijoins or θ-joins.

To assess the effectiveness of embedding the document id in the node id (denoted as INode 1), we have implemented another schema that uses a separate attribute to store the document id. This schema is denoted as INode 2. The Path table is unchanged; the new schema changes the Element table and the Attribute table as shown below.

INode 2
    Path(PathID, PathExp)
    Element(NodeID, DocID, PathID, Ordinal, Value)
    Attribute(NodeID, DocID, PathID, Value)
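The numbering scheme of Section 4 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the root is taken to carry UID 1 with path (1), as the NodeIDs in Table 4 suggest; the helper names (path_to_uid, parent, ancestor, make_node_id, split_node_id) are hypothetical and do not come from the paper; and Decimal is used so that the embedded document id is recovered exactly, which is a choice of this sketch rather than something the paper prescribes.

```python
from decimal import Decimal


def path_to_uid(path, nc):
    """Integer part f(path) of the node id in a complete nc-ary tree.

    path[0] is the root step (assumed to be 1); each later entry a,
    with 1 <= a <= nc, selects a child slot. The update below is the
    inverse of Parent(i) = floor((i - 2) / nc) + 1.
    """
    uid = 1
    for a in path[1:]:
        if not 1 <= a <= nc:
            raise ValueError("child slot out of range")
        uid = nc * (uid - 1) + 1 + a
    return uid


def parent(uid, nc):
    """Parent(i) = floor((i - 2) / nc) + 1 for a non-root UID."""
    return (uid - 2) // nc + 1


def ancestor(uid, d, nc):
    """Integer part of Equation 3: the UID of the ancestor d levels up."""
    offset = sum(nc ** t for t in range(1, d))
    return (uid - 2 - offset) // nc ** d + 1


def make_node_id(path, docid, nc, m):
    """Equation 2: nodeid = f(path) + docid * 10^-m (requires docid < 10^m)."""
    if not 0 <= docid < 10 ** m:
        raise ValueError("docid does not fit in m decimal places")
    return Decimal(path_to_uid(path, nc)) + Decimal(docid).scaleb(-m)


def split_node_id(node_id, m):
    """Inverse of Equation 2: returns (f(path), docid)."""
    uid = int(node_id)  # truncation equals floor for non-negative ids
    docid = int((node_id - uid).scaleb(m))
    return uid, docid


if __name__ == "__main__":
    # Path of the first @position attribute in Figure 2, with nc = 5.
    position_path = [1, 1, 3, 1, 4, 1, 1]
    uid = path_to_uid(position_path, 5)
    print(uid)                  # 5232
    print(parent(uid, 5))       # 1047 (author)
    print(ancestor(uid, 2, 5))  # 210 (authors)
    print(split_node_id(make_node_id([1, 1, 1], 3, 5, 2), 2))  # (7, 3)
```

The values printed for the sample path agree with the NodeIDs listed in Table 4, which gives some confidence that the recurrence used here matches Equation 1 for that data.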
5. Query Processing

In [9], the authors discussed the translation between XML queries and SQL statements for the Edge, XRel and XParent approaches. Here, we focus on the translation for INode. Example 1 shows an XML query using the XPath syntax [2].

Example 1. "Select the authors of all articles whose end page is equal to 77", with the XML data shown in Figure 2.

    Q1: /SigmodRecord/issue/articles/article[endPage=77]/authors

SQL-1. A translated SQL query for the XPath query Q1 using INode 1

    select e1.value
    from path p1, path p2, element e1, element e2
    where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
      and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
      and e1.pathid = p1.pathid
      and e2.pathid = p2.pathid
      and mod(e1.nodeid, 1) = mod(e2.nodeid, 1)
      and floor(((e1.nodeid - 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)
      and e2.value = '77'

SQL-1 shows the translated SQL query using INode with the document id embedded in the node id. SQL-1 uses two equijoins and two selections to identify the two path identifiers. Then it uses two equijoins to check the edge connections. In total, four equijoins and three selections are used.

SQL-2 shows the translated SQL query using INode without embedding the document id in the node id. Similar to SQL-1, it uses four equijoins to process the query. However, it can check the document condition directly without any calculation.

SQL-2. A translated SQL query for the XPath query Q1 using INode 2

    select e1.value
    from path p1, path p2, element e1, element e2
    where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
      and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
      and e1.pathid = p1.pathid
      and e2.pathid = p2.pathid
      and e1.docid = e2.docid
      and floor(((e1.nodeid - 2) / 5) + 1) = floor(((e2.nodeid - 2) / 5) + 1)
      and e2.value = '77'

SQL-3. A translated SQL query for the XPath query Q1 using Edge

    select authors.value
    from edge sigmodrecord, edge issue, edge articles, edge article,
         edge authors, edge endpage
    where sigmodrecord.label = 'SigmodRecord'
      and issue.label = 'issue'
      and articles.label = 'articles'
      and article.label = 'article'
      and authors.label = 'authors'
      and endpage.label = 'endPage'
      and sigmodrecord.source = '0'
      and sigmodrecord.target = issue.source
      and issue.target = articles.source
      and articles.target = article.source
      and article.target = authors.source
      and endpage.source = authors.source
      and endpage.value = '77'

SQL-4. A translated SQL query for the XPath query Q1 using XRel

    select t1.value
    from path p1, path p2, path p3, element e1, text t1, text t2
    where p1.pathexp = '#/SigmodRecord#/issue#/articles#/article'
      and p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
      and p3.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
      and e1.pathid = p1.pathid
      and t1.pathid = p2.pathid
      and t2.pathid = p3.pathid
      and e1.start < t1.start
      and e1.end > t1.end
      and e1.start < t2.start
      and e1.end > t2.end
      and t2.value = '77'

SQL-5. A translated SQL query for the XPath query Q1 using XParent

    select d1.value
    from labelpath lp1, labelpath lp2, datapath dp1, datapath dp2,
         data d1, data d2
    where lp1.path = './SigmodRecord./issue./articles./article./authors'
      and lp2.path = './SigmodRecord./issue./articles./article./endPage'
      and d1.pathid = lp1.pathid
      and d2.pathid = lp2.pathid
      and d1.did = dp1.cid
      and d2.did = dp2.cid
      and dp1.pid = dp2.pid
      and d2.value = '77'

As mentioned in [9], the numbers of equijoins and selections of the Edge approach are determined by the number of edge connections. For the above sample XPath query, five equijoins and eight selections in total are used for Edge, as shown in SQL-3. SQL-4 shows the translated SQL query for XRel: three equijoins, four θ-joins and four selections are used to process the sample query. In SQL-5, XParent uses five equijoins and three selections. Therefore, INode uses fewer table joins to process the query, and this can improve the performance.
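To see the INode 2 translation of Q1 run end to end, the following Python sketch loads a few rows of Table 4 into an in-memory SQLite database and executes a query shaped like SQL-2. Several things here are assumptions of the sketch and not part of the paper: SQLite stands in for Oracle 8i, integer division replaces floor() (NodeID is an integer in INode 2), DocID 0 is assumed for the sample document, and the query selects e1.nodeid instead of e1.value because the authors element has no text value in Table 4.

```python
import sqlite3

NC = 5  # maximum fan-out used for Table 4

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE path(pathid INTEGER PRIMARY KEY, pathexp TEXT);
    CREATE TABLE element(
        nodeid INTEGER, docid INTEGER, pathid INTEGER,
        ordinal INTEGER, value TEXT
    );
    """
)

# A subset of the Path and Element rows from Table 4.
conn.executemany(
    "INSERT INTO path VALUES (?, ?)",
    [
        (9, "#/SigmodRecord#/issue#/articles#/article#/endPage"),
        (10, "#/SigmodRecord#/issue#/articles#/article#/authors"),
        (11, "#/SigmodRecord#/issue#/articles#/article#/authors#/author"),
    ],
)
conn.executemany(
    "INSERT INTO element VALUES (?, ?, ?, ?, ?)",
    [
        (209, 0, 9, 3, "77"),
        (210, 0, 10, 4, None),
        (1047, 0, 11, 1, "Anthony I. Wasserman"),
        (1048, 0, 11, 2, "Karen Botnich"),
    ],
)

# Same join structure as SQL-2: two path selections, two path equijoins,
# one document equijoin and one computed-parent equijoin.
QUERY = """
SELECT e1.nodeid
FROM path p1, path p2, element e1, element e2
WHERE p1.pathexp = '#/SigmodRecord#/issue#/articles#/article#/authors'
  AND p2.pathexp = '#/SigmodRecord#/issue#/articles#/article#/endPage'
  AND e1.pathid = p1.pathid
  AND e2.pathid = p2.pathid
  AND e1.docid = e2.docid
  AND (e1.nodeid - 2) / :nc + 1 = (e2.nodeid - 2) / :nc + 1
  AND e2.value = '77'
"""

rows = conn.execute(QUERY, {"nc": NC}).fetchall()
print(rows)  # [(210,)]
```

Nodes 209 (endPage) and 210 (authors) both map to parent 42, so the computed-parent condition pairs them without consulting any edge or parent-child table.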
6. Experimental Results

To evaluate the performance of INode, experimental studies have been conducted. We studied Edge, XRel, XParent and the two versions of the proposed approach. All experiments were conducted on a 733 MHz Pentium III with 256 MB of RAM and a 20 GB hard disk. The RDBMS used was Oracle 8i Enterprise Edition. We conducted our experiments using the Bosak Shakespeare collection [13] and the XML benchmark project [14]. Detailed information about the former data set is shown below:

    Number of documents                                                 37
    Total size                                                          7.5 MB
    Total number of element nodes                                       179,689
    Total number of attribute nodes                                     0
    Total number of text nodes                                          147,442
    Total number of unique paths                                        57
    The largest number of child elements of an element in the structure 434
    Maximum depth of documents                                          6

Table 5. Data Set Details of the Bosak Shakespeare Collection.

As shown in Table 5, the largest number of child elements of a node in the structure is 434, so we choose 435 as the maximum number of child nodes. We use 2 decimal places to store the document id since the total number of XML documents is only 37. We first consider the storage usage of the tables in the relational DBMS under the five approaches for the first data set. Figure 6 shows the details.

[Figure 6: bar chart of storage usage (MB) for Edge, XRel, XParent, INode 1 and INode 2.]

Figure 6. Storage Usage with Edge, XRel, XParent and INode.

INode uses less storage than the other three approaches. For Edge, each tuple has an attribute to store the corresponding label, which increases the storage usage. Unlike Edge, INode reduces the storage by storing the path expressions in a table. XRel and XParent use the region and a table, respectively, to maintain the parent-child and ancestor-descendant relationships. This also increases the storage usage. INode reduces the storage usage by using the node id to maintain the parent-child and ancestor-descendant relationships.

For the second data set, we generated the data with different scale factors. Three different sizes of data are used. Table 6 shows the details of the second data set.

    Name   Parameters Used               Size (MB)  Maximum depth  The largest number of child elements
    Set 1  scale factor=0.05, split=10   5.61       12             50
    Set 2  scale factor=0.1, split=10    11.3       12             100
    Set 3  scale factor=0.2, split=10    22.8       12             200

Table 6. Data Set Details of the XML Benchmark Project.

    Data Set  INode 1 / XParent  INode 2 / XParent
    Set 1     0.819              0.844
    Set 2     0.832              0.856
    Set 3     0.835              0.860

Table 7. Storage Usage Ratio between INode and XParent.

[Figure 7: bar chart of storage usage (MB) for INode 1, INode 2 and XParent on data sets Set 1, Set 2 and Set 3.]

Figure 7. Storage Usage of the XML Benchmark Project Data Sets.

Table 7 shows that INode reduces the storage usage by about 15% compared with XParent. Figure 7 shows the storage usage of the three data sets using XParent and the two versions of INode. By observation, the rate of increase of INode is lower than that of XParent, because XParent uses an extra table to store the parent-child relationship and the number of tuples of this table increases when the number of nodes increases. For INode, the parent-child relationship is embedded in the attribute NodeID. This reduces the storage usage and hence reduces the rate of increase.

Indexes on the relational DBMS were built to improve query processing as follows:

    For Edge, we created indexes as proposed in [8].
    For XParent, three B+-tree indexes were created on DataPath(Pid), DataPath(Cid) and Data(Value).
    For XRel, a B+-tree index was created on Text(Value).
    In the proposed approach, two function-based indexes were created on Element(nodeid, ((nodeid - 2) / nc) + 1) and Attribute(nodeid, ((nodeid - 2) / nc) + 1). Two B+-tree indexes were created on Element(Value) and Attribute(Value).

6.1. Query Performances

We conducted our experiments using the Bosak Shakespeare collection and the same set of queries as in [6] and [9]. The queries are listed below.

    QS1: /PLAY/ACT
    QS2: /PLAY/ACT/SCENE/SPEECH/LINE/STAGEDIR
    QS3: //SCENE/TITLE
    QS4: //ACT//TITLE
    QS5: /PLAY/ACT[2]
    QS6: (/PLAY/ACT)[2]/TITLE
    QS7: /PLAY/ACT/SCENE/SPEECH[SPEAKER = 'CURIO']
    QS8: /PLAY/ACT/SCENE[//SPEAKER = 'Steward']/TITLE

[Figure 8: bar chart of elapsed time (seconds) for queries QS1 to QS8 under Edge, XRel, XParent, INode 1 and INode 2.]

Figure 8. Query Elapsed Time Using the Bosak Shakespeare Collection.

The query elapsed times are shown in Figure 8. Every query is run ten times for each approach and the average elapsed time is taken. For the queries with short and simple paths, such as QS1, QS3, QS5 and QS6, Edge performs similarly to INode because the number of table joins needed to form the simple path is not too large. For QS2 and QS7, Edge needs many table joins to connect the edges. For the queries with regular path expressions, such as QS4 and QS8, Edge needs to traverse nearly the whole XML data tree. INode stores simple path expressions to limit the search space, and no table join is needed for edge connection. Therefore, INode outperforms Edge significantly for these queries.

XRel performs similarly to INode for the queries QS1-QS5, which are one-path queries. For QS6-QS8, INode outperforms XRel because INode uses calculation to retrieve the parent-child relationship instead of using θ-joins.

Comparing INode with XParent, we found that they have similar performance except for QS7. The reason is that INode does not need table joins to retrieve the parent-child relationship.

The two versions of INode have similar performance for all queries; the main difference is that INode 2 needs more space to store the document id.

6.2. Scalability Test: INode vs XParent

In this section, we investigate the scalability of the two versions of INode in comparison with XParent using the three data sets generated from the XML benchmark project as listed in Table 6. We conducted the test using eight queries. Details of the eight selected queries are listed below.

    Query 1   Return the name of the person with ID `person0'.
    Query 2   Return the initial increases of all open auctions.
    Query 4   List the reserves of those open auctions where a certain person issued a bid before another person.
    Query 8   List the names of persons and the number of items they bought.
    Query 9   List the names of persons and the names of the items they bought in Europe.
    Query 15  Print the keywords in emphasis in annotations of closed auctions.
    Query 17  Which persons don't have a homepage?
    Query 19  Give an alphabetically ordered list of all items along with their location.

[Figure 9: three line charts of the query elapsed time ratio (y-axis: Ratio; x-axis: Set1, Set2, Set3) for Q1, Q2, Q4, Q8, Q9, Q15, Q17 and Q19. Panels: (a) Query Elapsed Time Ratio for INode 1; (b) Query Elapsed Time Ratio for INode 2; (c) Query Elapsed Time Ratio for XParent.]

Figure 9. Scalability Test with INode and XParent.

Figure 9 shows the elapsed time ratios for the two versions of INode and for XParent. The query elapsed time ratio is defined as t2/t1, where t1 is the elapsed time of a query using Set 1 and t2 is the elapsed time of the same query using Set 2 or Set 3. The finding is that the scalability of INode, in terms of data size, is superior to that of XParent for the queries that retrieve more ancestor-descendant relationships, such as Q8 and Q9. The main factor is that XParent needs more table joins to retrieve the relationship.

7. Conclusions

In this paper, we proposed a new model-mapping approach, INode, which is based on a numbering scheme. Instead of table joins, it can retrieve the parent-child and ancestor-descendant relationships by calculation. In the first experimental study, we studied the storage usage of INode using two data sets in comparison with Edge, XRel and XParent.
INode uses less storage space, and it also outperforms Edge, XRel and XParent in most cases in the query performance study. As future work, we will investigate an indexing scheme based on INode to further improve the query performance.

8. Acknowledgement

The work of the authors is supported in part by the Central Grant of The Hong Kong Polytechnic University, research project code HZJ89.

9. References

[1] Chamberlin, D., Florescu, D., Robie, J., et al. (2001). XQuery: A query language for XML. W3C Working Draft, http://www.w3.org/TR/xquery
[2] Clark, J. and DeRose, S. (1999). XML path language (XPath). W3C Recommendation 16 November 1999, http://www.w3.org/TR/xpath
[3] Deutsch, A., Fernandez, M., and Florescu, D. (1999). A query language for XML. In Proceedings of the 8th International World Wide Web Conference.
[4] Abiteboul, S., Quass, D., McHugh, J., Widom, J., and Wiener, J. L. (1997). The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1), pages 68-88.
[5] Chamberlin, D. D., Robie, J., and Florescu, D. (2000). Quilt: An XML query language for heterogeneous data sources. In WebDB (Informal Proceedings), pages 53-62.
[6] Yoshikawa, M. and Amagasa, T. (2001). XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases. ACM Transactions on Internet Technology, 1(1), pages 110-141.
[7] Kappel, G., Kapsammer, E., Rausch-Schott, S., and Retschitzegger, W. (2000). X-Ray: Towards Integrating XML and Relational Database Systems. International Conference on Conceptual Modeling (ER), pages 339-353.
[8] Florescu, D. and Kossmann, D. (1999). A Performance Evaluation of Alternative Mapping Schemes for Storing XML Data in a Relational Database. Technical report.
[9] Jiang, H., Lu, H., Wang, W., and Yu, J. X. (2002). Path Materialization Revisited: An Efficient Storage Model for XML Data. Thirteenth Australasian Database Conference (ADC2002).
[10] World Wide Web Consortium (2000). Extensible Markup Language (XML) 1.0 (Second Edition). http://www.w3.org/TR/2000/REC-xml-20001006
[11] Lee, Y. K., Yoo, S. J., Yoon, K., and Berra, P. B. (1996). Index Structures for Structured Documents. Proc. Digital Library '96, pages 91-99.
[12] Kha, D. D., Yoshikawa, M., and Uemura, S. (2001). An XML indexing structure with relative region coordinate. In Proceedings of the 17th IEEE International Conference on Data Engineering. IEEE Computer Society Press, Los Alamitos, CA, pages 313-320.
[13] The Bosak Shakespeare collection, http://metalab.unc.edu/bosak/xml/eg/shaks200.zip
[14] Schmidt, A. R., Waas, F., Kersten, M. L., Florescu, D., Manolescu, I., Carey, M. J., and Busse, R. (2001). The XML benchmark project. Technical report, CWI, Amsterdam, The Netherlands.
[15] SigmodRecord archive, http://www.acm.org/sigmod/record/