International Journal of Research In Science & Engineering Volume: 2 Issue: 3 e-ISSN: 2394-8299 p-ISSN: 2394-8280

ISSUES IN DATABASE DESIGN THROUGH XML

Akanksha Goel, Abha Jain, Mily Lal
1 Assistant Professor, Computer Department, DYPIEMR, [email protected]
2 Assistant Professor, Computer Department, DYPIEMR, [email protected]
3 Assistant Professor, Computer Department, DYPIEMR, [email protected]

ABSTRACT

Recent years have seen a surge in the popularity of XML, a markup language for representing semi-structured data. Some of this popularity can be attributed to the success that the semi-structured data model has had in environments where the relational data model has been insufficiently expressive. It is thus natural to consider native XML databases, which are designed from the ground up to support XML data. Developing a native XML database introduces many challenges, some of which we consider here. The first major problem is that XML data is ordered, whereas relational databases operate on set-based data. We examine the ordering problem in detail in this paper, and show that while it imposes an unavoidable performance penalty on the database, this penalty can be reduced to an acceptable level in practice. We do this by making use of type information, which is often present in XML data, and by improving existing results in the literature. XML data is frequently queried using XPath, a de facto standard query language. It is widely believed that one of the most promising approaches to evaluating XPath queries is the use of structural joins. Previous work in the literature has found theoretically optimal algorithms for such joins. We provide practical improvements to these algorithms, which result in significant speed-ups.

Keywords: XML, XPath, XQuery, structural joins

----------------------------------------------------------------------------------------------------------------------------

1. Introduction

XML [12] has achieved widespread popularity as a data representation language in a remarkably short period of time, due to its ability to model heterogeneous data in a homogeneous fashion. Along with this rising popularity comes an increasing need to store and query large XML repositories efficiently. This has proved to be an extremely challenging task which has provoked several years of active database research, with XML currently a major research area within the field of databases. While XML has many peculiarities and subtle issues arising from the design-by-committee approach of the W3C, at heart it is an extremely simple data markup language. Ignoring the many uninteresting subtleties present in the specification, we can describe XML as a markup language for the representation of ordered, unranked, finite, node-labeled trees of data. Figure 1.1(a) gives a sample XML document, with the corresponding logical tree given in Figure 1.1(b). This small XML document represents a portion of a database of books, and highlights several of the reasons XML has proved popular as a data representation language:

XML is semi-structured: as can be seen from the figure, there is no fixed tuple structure for a record in the database of books. For instance, the first record contains a date element, which does not occur in the second record. In a relational database, the standard approach to this variability is the use of optional attributes, implemented through null values. The presence of null values makes query formulation and optimization considerably more difficult.
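To make this concrete, the following Python sketch (our own illustration, using only the standard library; the element names come from Figure 1.1) parses the sample document and handles the optional date element directly. Where a relational schema would need a nullable column, here the element is simply absent:

```python
import xml.etree.ElementTree as ET

# The sample document of Figure 1.1, inlined so the sketch is self-contained.
DOC = """
<books>
  <book>
    <author>Frank Herbert</author>
    <title>Dune</title>
    <genre>Science Fiction</genre>
    <date>January 2003</date>
  </book>
  <book>
    <author>Arthur C. Clarke</author>
    <author>Gentry Lee</author>
    <title>Garden of Rama</title>
    <genre>Sci-Fi</genre>
  </book>
</books>
"""

root = ET.fromstring(DOC)
for book in root.findall("book"):
    title = book.findtext("title")
    # The date element is optional: findtext simply returns None when it is
    # absent, with no fixed tuple structure and no null placeholder to store.
    date = book.findtext("date")
    # Repeated elements are allowed, and document order is preserved.
    authors = [a.text for a in book.findall("author")]
    print(title, authors, date)
```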
Similarly, while both records have a genre element, and both records refer to science fiction books, the actual text of the elements differs ("Science Fiction" versus "Sci-Fi"). In a traditional relational database, this would complicate the storage and querying of this attribute, because it would most likely have to be stored as a string, instead of, for instance, an enumeration. This heterogeneity also complicates the process of querying the database in a consistent fashion.

XML is ordered: in the record for Garden of Rama, two authors are listed. In practice, the order of the authors is frequently significant, with the first author generally being the principal author. Semi-structured data [1], on the other hand, is unordered, and hence ordering constraints would have to be manually specified by the user, e.g., through the use of a numeric attribute specifying the relative order. Similarly, traditional relational databases operate on unordered sets of tuples, which is a considerably easier task.

XML is hierarchical: the figure demonstrates a simple two-level XML document which could be mapped into a relational table with little difficulty. However, XML has no restrictions on depth, and trees with three or more levels arise frequently in practice. For example, instead of a simple flat structure as given in the figure, the data could be given grouped by publisher or genre. While this could be mapped into a group of relational tables, it is frequently the case in practice that decomposing such hierarchical data into flat relations significantly complicates both storage and querying.

  <books>
    <book>
      <author>Frank Herbert</author>
      <title>Dune</title>
      <genre>Science Fiction</genre>
      <date>January 2003</date>
    </book>
    <book>
      <author>Arthur C. Clarke</author>
      <author>Gentry Lee</author>
      <title>Garden of Rama</title>
      <genre>Sci-Fi</genre>
    </book>
  </books>

Figure 1.1: A sample XML document. (a) An XML document; (b) the corresponding tree.

2. Experimental Results

We performed two sets of experiments, each measuring different aspects of the algorithms presented above. In the first set, we evaluated the quality of the average case of the randomized algorithm; in the second, we evaluated the worst case of all the algorithms. All experiments were run on a dual-processor 750 MHz Pentium III machine with 512 MB RAM and a 30 GB, 10,000 rpm SCSI hard drive.

Experiment Set 1: Evaluation of the Randomized Algorithm

We performed our first set of experiments using the DBLP database. We tested both Bender algorithms (the O(log n) and O(1) variants) and the randomized algorithm of Section 2.4. For each algorithm, we inserted 100, 1000, and 10,000 DBLP records into a new database. The insertions were done in two stages. The first half of the insertions were appended to the end of the database, and hence simulated a bulk load. The second half were performed at random locations in the database; that is, if we consider the document as a linked list in document order, the insertions happened at random locations throughout the list. This stage simulated further updates upon a pre-initialized database. While these inserts were distributed over the database, at the physical level the records were still appended to the end of the database file. This resulted in a database which was not clustered in document order, which meant that traversing the database in document order could incur many disk accesses. We hypothesize that while many document-centric XML databases will be clustered in document order, data-centric XML databases will not be, as they will most likely be clustered through the use of indices such as B-trees on the values of particular elements. Hence, our tests were structured to simulate these kinds of environments, in which the document ordering problem is more difficult.
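To illustrate why random-position inserts are the hard case for a document ordering index, consider the following deliberately naive gap-labeling sketch. It is our own illustration, not Bender's algorithm, the randomized algorithm, or the schema-based algorithm evaluated below; all names and constants in it are hypothetical:

```python
class OrderIndex:
    """Naive gap-labeling sketch of a document-order index (illustrative only)."""

    GAP = 1 << 16  # initial spacing between adjacent labels; an arbitrary choice

    def __init__(self):
        self.labels = []  # integer labels, kept in document order

    def _relabel(self):
        # The O(n) worst case: every record's label must be rewritten.
        self.labels = [(i + 1) * self.GAP for i in range(len(self.labels))]

    def insert(self, pos):
        """Insert a new node at document-order position pos; return its label."""
        lo = self.labels[pos - 1] if pos > 0 else 0
        hi = self.labels[pos] if pos < len(self.labels) else lo + 2 * self.GAP
        if hi - lo < 2:  # no integer strictly between lo and hi: must relabel
            self._relabel()
            lo = self.labels[pos - 1] if pos > 0 else 0
            hi = self.labels[pos] if pos < len(self.labels) else lo + 2 * self.GAP
        label = (lo + hi) // 2
        self.labels.insert(pos, label)
        return label

    def before(self, a, b):
        """True iff the node at position a precedes the node at position b."""
        return self.labels[a] < self.labels[b]
```

Appends always find room in the trailing gap, so the bulk-load stage never relabels; inserts at random positions eventually exhaust a gap and trigger the full relabeling. Bounding exactly this relabeling cost is what the algorithms compared in these experiments are designed to do.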
At the end of each set of insertions, there were n records in the database, where n ∈ {100, 1000, 10,000}. We then additionally performed 10n and 100n reads upon the database. Each read operation chose two random nodes from the database and compared their document order. The nodes were not chosen uniformly, as uniform access does not accurately reflect real-world database access patterns.

Experiment Set 2: Evaluation of All Algorithms

Experimental Setup

Our experiments were designed to investigate the worst-case behavior of each of the document order maintenance algorithms implemented. We chose to focus on the worst case for several reasons. Firstly, our previous work found an algorithm which has very good average-case performance (for a particular definition of average), but very poor worst-case performance. As our new work has good worst-case performance, we emphasize this in our experiments.

Performance on a Bulk Insert

In this experiment, we evaluated the performance of the algorithms under a uniform query distribution. The experiment began with an empty database, which was then gradually initialized with the DBLP database. After every insertion, on average r reads were performed, where r was a fixed parameter taken from the set {0.01, 0.10, 1.00, 10.0}. Each read operation picked two nodes at random from the underlying database, using a uniform probability distribution, and compared their document order. In all of our experiments, we measured the total time of the combined read and write operations, the number of read and write operations, and the number of relabelings. However, due to space considerations, and the fact that the other results were fairly predictable, we only include the graphs for total time. We note that, as the ratio of reads increases, the performance of both Bender's O(1) algorithm and the schema-based algorithm degrades relative to the other algorithms. We attribute this in both cases to the extra indirection involved in reading from the index. Also, because of the extremely heavy paging, even the small paging overhead incurred by an algorithm such as the schema-based algorithm, which only infrequently loads an additional page due to a read from the index, has a massive effect on performance. Thus, although this experiment is slightly contrived, it does demonstrate that in some circumstances the indirection involved becomes unacceptable, given that values of r in real life will often be 100 or 1000. We note that one advantage of the schema-based algorithm is that we can remove the indirection if necessary, whereas with the O(1) algorithm, we cannot.
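A minimal harness for the read/write mix just described might look as follows. This is a sketch under our own naming, reusing the OrderIndex class from the earlier sketch; the real experiments measured disk-resident databases, not an in-memory list:

```python
import random
import time

def run_mixed_workload(n_inserts, r, seed=0):
    """Sketch of the bulk-insert experiment: one insert at a random position,
    followed on average by r uniform document-order comparisons."""
    rng = random.Random(seed)
    index = OrderIndex()
    start = time.perf_counter()
    for i in range(n_inserts):
        index.insert(rng.randrange(i + 1))  # i + 1 valid insert positions
        # Achieve an expectation of r reads per insert, even for fractional r
        # such as 0.01, by flipping a biased coin for the fractional part.
        reads = int(r) + (1 if rng.random() < r - int(r) else 0)
        for _ in range(reads):
            a = rng.randrange(i + 1)  # i + 1 nodes are now present
            b = rng.randrange(i + 1)
            index.before(a, b)        # the measured document-order comparison
    return time.perf_counter() - start

# Example: 10,000 inserts with an average of one read per insert.
print(run_mixed_workload(10_000, 1.00))
```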
Performance on a Non-Uniform Query Distribution

This experiment was identical to the first experiment, except that the reads were sampled from a normal distribution over the node positions rather than a uniform one, and we took r ∈ {0.01, 0.10, 1.00, 100.0}. The idea was to reduce the heavy paging of the first experiment, and instead simulate a database "hot-spot", a phenomenon which occurs in practice. As can be seen from the results of Figure 2.11, this experiment took substantially less time to complete than the first experiment. It can be seen that, apart from the randomized algorithm (which again performed the minimal possible work), the schema-based algorithm is clearly the best algorithm. Indeed, it is impressive that it came so close to the randomized algorithm in performance.

Experiment 3: Worst-Case Performance of the Randomized Algorithm

The previous two experiments showed that the randomized algorithm had very good performance. We demonstrate in this experiment that, in some cases, it has very bad performance, far worse than the other algorithms. In spite of this, we believe the randomized algorithm is worth considering, as in many situations it is almost unbeatable. This experiment was identical to the first experiment, except that the workload was constructed to exercise the randomized algorithm's worst case.

Experiment 4: Verifying the Theory

Our final experiment was performed to verify the theory that the expected runtime cost of the randomized variant depends logarithmically on the parameter c. For this experiment, we simply loaded DBLP into a database, maintaining ordering information using Algorithm 2.5, for values of c from the set {1, 5, 10, 50, 100, 200, 1000}. The total running time for each value of c, as a proportion of the c = 1 running time, is shown in Figure 2.13. The line of best fit was determined using a standard logarithmic regression. As can be seen, the expected logarithmic dependence on c is exhibited very closely.
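The regression itself is straightforward. The sketch below fits total time against ln c by ordinary least squares; the timing values are placeholders for illustration only, not our measured results (those appear in Figure 2.13):

```python
import math

def fit_log_dependence(cs, times):
    """Least-squares fit of times ~ a + b * ln(c), the regression used to
    check the predicted logarithmic dependence on c. Returns (a, b)."""
    xs = [math.log(c) for c in cs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(times) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, times)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

cs = [1, 5, 10, 50, 100, 200, 1000]
# Hypothetical relative running times (placeholder data, not measurements).
times = [1.00, 1.21, 1.33, 1.55, 1.64, 1.73, 1.95]
a, b = fit_log_dependence(cs, times)
print(f"t(c) ~ {a:.2f} + {b:.2f} ln c")
```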
3. Conclusion

The contributions of this paper are threefold. Firstly, we have presented a simple randomized algorithm which performs well in many cases, but has poor worst-case performance. Secondly, we have presented an improvement to the work of Bender et al. [11], which yields a parameterized family of algorithms giving fine control over the cost of updates to the document ordering index versus the cost of querying this index. We believe this work is of particular relevance in XML databases, and anticipate a wide range of application domains which will utilize such databases in the future, with each domain having different requirements. Our final contribution is a general scheme that utilizes type information to improve the speed of document ordering indices. This work is especially significant due to its wide applicability, as it is not tied to any particular ordering algorithm. We have found that in practice many large XML repositories have a DTD or some other schema for constraint purposes, and hence we expect that this work will have great practical impact.

Applications

Ancestor-Descendant Relationships

Our method can be applied to efficiently determine ancestor-descendant relationships. The key insight is due to Dietz [36], who noted that the ancestor query problem can be answered using the following fact: for two given nodes x and y of a tree T, x is an ancestor of y if and only if x occurs before y in the preorder traversal of T and after y in the postorder traversal of T. We note that while we have framed our discussion in terms of document order (that is, preorder traversal), our results apply equally well to postorder traversal. Therefore, by maintaining two indices, one for preorder traversal and one for postorder traversal, each allowing ordering queries to be executed quickly, we can determine ancestor-descendant relationships efficiently.
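Dietz's fact translates directly into code. The following sketch (with a hypothetical parent-to-children dictionary encoding of the tree, chosen for brevity) assigns preorder and postorder numbers in a single traversal and answers strict ancestor queries with two integer comparisons:

```python
def number_tree(root, children):
    """Assign preorder and postorder numbers to every node reachable from
    root; `children` maps a node to its ordered list of children."""
    pre, post = {}, {}
    counters = [0, 0]  # [next preorder number, next postorder number]

    def visit(node):
        pre[node] = counters[0]; counters[0] += 1
        for child in children.get(node, []):
            visit(child)
        post[node] = counters[1]; counters[1] += 1

    visit(root)
    return pre, post

def is_ancestor(x, y, pre, post):
    """Dietz: x is an ancestor of y iff x precedes y in preorder
    and follows y in postorder."""
    return pre[x] < pre[y] and post[x] > post[y]

# The book tree of Figure 1.1, flattened to parent -> children edges
# (node names are hypothetical identifiers for illustration).
children = {
    "books": ["book1", "book2"],
    "book1": ["author:Herbert", "title:Dune", "genre:SF", "date:Jan2003"],
    "book2": ["author:Clarke", "author:Lee", "title:Rama", "genre:SciFi"],
}
pre, post = number_tree("books", children)
assert is_ancestor("books", "title:Dune", pre, post)
assert not is_ancestor("book1", "author:Lee", pre, post)
```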
Query Optimization

A future application of our work is in the area of XML query optimization. As it is now possible to efficiently sort sets of nodes from an XML database into document order, it is also possible to use storage mechanisms in the database which cluster the data along common access paths. For instance, it may be possible to use B-trees to allow fast access to numerical data, as we can now sort the result back into document order quickly. In order to take advantage of this, query optimizers will have to be augmented with an accurate cost estimation function for the ordering operator. This will allow the optimizer to choose optimal locations in the query plan at which to sort intermediate results into document order.

REFERENCES

[1] Serge Abiteboul. Querying semi-structured data. In Proceedings of ICDT, volume 1186 of Lecture Notes in Computer Science, pages 1-18, Delphi, Greece, January 1997. Springer.
[2] Serge Abiteboul, Haim Kaplan, and Tova Milo. Compact labeling schemes for ancestor queries. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 547-556. Society for Industrial and Applied Mathematics, 2001.
[3] Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. Estimating the selectivity of XML path expressions for internet scale applications. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB '01), pages 591-600, Orlando, September 2001. Morgan Kaufmann.
[4] Ashraf Aboulnaga and Surajit Chaudhuri. Self-tuning histograms: building histograms without looking at data. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 181-192. ACM Press, 1999.
[5] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for approximate query answering. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 275-286. ACM Press, 1999.
[6] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216. ACM Press, 1993.
[7] Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, and Jignesh M. Patel. Structural joins: A primitive for efficient XML query pattern matching. In Proceedings of ICDE. IEEE Computer Society, 2002.
[8] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, pages 20-29. ACM Press, 1996.
[9] Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. QRS: A robust numbering scheme for XML documents. In Proceedings of IEEE ICDE, March 2003.
[10] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1-10, 2002.
[11] Michael A. Bender, Richard Cole, Erik D. Demaine, Martin Farach-Colton, and Jack Zito. Two simplified algorithms for maintaining order in a list. In Proceedings of the 10th Annual European Symposium on Algorithms (ESA 2002), volume 2461 of Lecture Notes in Computer Science, pages 152-164, Rome, Italy, September 2002.