
International Journal of Research In Science & Engineering
Volume: 2 Issue: 3
e-ISSN: 2394-8299
p-ISSN: 2394-8280
ISSUES IN DATABASE DESIGN THROUGH XML
Akanksha Goel1, Abha Jain2, Mily Lal3
1 Assistant Professor, Computer Department, DYPIEMR, [email protected]
2 Assistant Professor, Computer Department, DYPIEMR, [email protected]
3 Assistant Professor, Computer Department, DYPIEMR, [email protected]
ABSTRACT
Recent years have seen a surge in the popularity of XML, a markup language for representing semi-structured
data. Some of this popularity can be attributed to the success that the semi-structured data model has had in
environments where the relational data model has been insufficiently expressive. It is thus natural to consider
native XML databases, which are designed from the ground up to support XML data.
Developing a native XML database introduces many challenges, some of which we consider here.
The first major problem is that XML data is ordered, whereas relational databases operate on set-based data. We
examine the ordering problem in great detail in this paper, and show that while it imposes an unavoidable
performance penalty on the database, this penalty can be reduced to an acceptable level in practice. We do this by
making use of type information, which is often present in XML data, and by improving existing results in the
literature.
XML data is frequently queried using XPath, a de facto standard query language. It is widely believed that one of the most promising approaches to evaluating XPath queries is through the use of structural joins.
Previous work in the literature has found theoretically optimal algorithms for such joins. We provide practical
improvements to these algorithms, which result in significant speed-ups.
Keywords: XML, XPath, XQuery, structural joins
1. Introduction
XML [12] has achieved widespread popularity as a data representation language in a remarkably short period of time, due to its ability to model heterogeneous data in a homogeneous fashion. Along with this rising popularity comes an increasing need to store and query large XML repositories efficiently. This has proved to be an extremely challenging task which has provoked several years of active database research, with XML currently a major research area within the field of databases.
While XML has many peculiarities and subtle issues arising from the design-by-committee approach of the W3C, at heart it is an extremely simple data markup language. Ignoring the many uninteresting subtleties present in the specification, we can describe XML as a markup language for the representation of ordered, unranked, finite, node-labeled trees of data. Figure 1.1(a) gives a sample XML document, with the corresponding logical tree given in Figure 1.1(b). This small XML document represents a portion of a database of books, and highlights several of the reasons XML has proved popular as a data representation language:
XML is semi-structured: as can be seen from the figure, there is no fixed tuple structure for a record in the
database of books. For instance, in the first record, there is a date element, which does not occur in the second
record. In a relational database, the standard approach to this variability is the use of optional attributes,
implemented through the use of null values. The presence of null values makes query optimization and formulation
considerably more difficult.
Similarly, while both records have a genre element, and both records refer to science fiction books, the actual text of the elements is different ("Science Fiction" versus "Sci-Fi"). In a traditional relational database, this would
complicate the storage and querying of this attribute, because it would most likely have to be stored as a string,
instead of, for instance, an enumeration. This heterogeneity also complicates the process of querying the database in
a consistent fashion.
XML is ordered: in the record for Garden of Rama, two authors are listed. In practice, it is frequently the case that
the order of the authors is significant, with the first author generally being the principal author. Semi-structured data
[1], on the other hand, is unordered, and hence ordering constraints would have to be manually specified by the user,
e.g., through the use of a numeric attribute specifying the relative order. Similarly, traditional relational databases
operate on unordered sets of tuples, which is a considerably easier task.
XML is hierarchical: the figure demonstrates a simple two-level XML document which could be mapped into a
relational table with little difficulty. However, XML has no restrictions on depth, and trees with three or more levels
arise frequently in practice. For example, instead of a simple flat structure as given in the figure, the data could be
given grouped by publisher or genre. While this could be mapped into a group of relational tables, it is frequently the case in practice that decomposing such hierarchical data into flat relations is awkward and obscures the document's natural structure.
+
<author>Frank Herbert</author>
<title>Dune</title>
<genre>Science Fiction</genre>
<date>January 2003</date>
</book
>
<book>
<author>Arthur C. Clarke</author>
<author>Gentry Lee</author>
<title>Garden of Rama</title>
<genre>Sci-Fi</genre>
</book> </books>
(a) An XML document (b) The corresponding tree
Figure 1.1: A sample XML document and its corresponding tree
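To make the querying issues concrete, the short sketch below (in Python, using the standard library's xml.etree.ElementTree and its XPath subset; the code is illustrative and not part of the system described in this paper) runs a structural query and a value-based query over the document of Figure 1.1. Note how the heterogeneous genre values force the query to enumerate both spellings:

import xml.etree.ElementTree as ET

doc = """<books>
<book><author>Frank Herbert</author><title>Dune</title>
<genre>Science Fiction</genre><date>January 2003</date></book>
<book><author>Arthur C. Clarke</author><author>Gentry Lee</author>
<title>Garden of Rama</title><genre>Sci-Fi</genre></book>
</books>"""

root = ET.fromstring(doc)
# A structural query works uniformly over both records:
titles = [t.text for t in root.findall("./book/title")]
print(titles)  # ['Dune', 'Garden of Rama']
# A value-based query on genre must enumerate both spellings, because the
# semi-structured data gives no enumeration type to normalize against:
scifi = [b.find("title").text
         for b in root.findall("./book")
         if b.find("genre").text in ("Science Fiction", "Sci-Fi")]
print(scifi)   # ['Dune', 'Garden of Rama']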
2. Experimental Results
We performed two sets of experiments, each measuring different aspects of the algorithms presented above. In our
first set of experimental results, we evaluated the quality of the average case for the randomized algorithm above. In
the second set, we evaluated the worst case of all the algorithms. All experiments were run on a dual processor 750
MHz Pentium III machine with 512 MB RAM and a 30 GB, 10,000 rpm SCSI hard drive.
Experiment Set 1: Evaluation of Randomized Algorithm
We performed our first set of experiments using the DBLP database. We tested both Bender algorithms (the O(log n) and O(1) variants) and the randomized algorithm of Section 2.4.
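All three are order-maintenance schemes: each node carries an integer label, document-order comparison is a label comparison, and insertions occasionally force relabeling. The sketch below is a deliberately naive illustration of this idea, with O(n) worst-case relabeling; it is not the Bender or randomized algorithms actually tested:

GAP = 1024

class OrderIndex:
    """Illustrative gap-based order maintenance: labels are integers,
    document-order comparison is a plain integer comparison, and an
    exhausted gap triggers a naive global relabeling. (In a real system,
    relabeling must also update the labels stored in the nodes.)"""
    def __init__(self):
        self.labels = []          # node labels, sorted in document order

    def insert_after(self, pos):
        """Insert a new node after position pos (pos = -1 inserts at the
        front); returns the new node's label."""
        lo = self.labels[pos] if pos >= 0 else 0
        hi = self.labels[pos + 1] if pos + 1 < len(self.labels) else lo + 2 * GAP
        if hi - lo < 2:           # no room left: relabel the whole list
            self.labels = [(i + 1) * GAP for i in range(len(self.labels))]
            return self.insert_after(pos)
        label = (lo + hi) // 2
        self.labels.insert(pos + 1, label)
        return label

    @staticmethod
    def precedes(a, b):
        """True iff label a comes before label b in document order: O(1)."""
        return a < b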
For each algorithm, we inserted 100, 1000, and 10,000 DBLP records into a new database. The insertions were done
in two stages. The first half of the insertions were appended to the end of the database, and hence simulated a bulk
load. The second half of the insertions were done at random locations in the database; that is, if we consider the
document as a linked list in document order, the insertions happened at random locations throughout the list - this
stage simulated further updates upon a pre-initialized database. While the inserts were distributed over the database,
at the physical level the database records were still inserted at the end of the database file. This resulted in a database
which was not clustered in document order, which meant that traversing through the database in document order could incur many disk accesses. We hypothesize that while many document-centric XML databases will be
clustered in document order, data-centric XML databases will not be, as they will most likely be clustered through
the use of indices such as B-trees on the values of particular elements. Hence, our tests were structured to simulate
these kinds of environments, in which the document ordering problem is more difficult.
At the end of each set of insertions, there were n records in the database, where n ∈ {100, 1000, 10000}.
We then additionally performed 10n and 100n reads upon the database. Each read operation chose two random nodes from the database and compared their document order. The nodes were not chosen uniformly, since a uniform choice does not accurately reflect real-world database access patterns.
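A sketch of this experimental driver is given below; the database API (append, insert_after, size, compare_document_order) is hypothetical, and since the exact non-uniform read distribution is not specified above, a clamped Gaussian is used as a stand-in:

import random

def experiment_set_1(db, records, read_multiplier):
    """Two-stage load followed by document-order reads (sketch).
    read_multiplier is 10 or 100, giving 10n or 100n reads."""
    n = len(records)
    for rec in records[:n // 2]:                 # stage 1: simulated bulk load
        db.append(rec)
    for rec in records[n // 2:]:                 # stage 2: random-position inserts
        db.insert_after(random.randrange(db.size()), rec)

    def pick():                                  # non-uniform node choice
        i = int(random.gauss(n / 2, n / 10))     # assumed stand-in skew
        return min(n - 1, max(0, i))             # clamp to a valid position

    for _ in range(read_multiplier * n):
        db.compare_document_order(pick(), pick())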
Experiment Set 2: Evaluation of all Algorithms
Experimental Setup
Our experiments were designed to investigate the worst case behavior of each of the document maintenance
algorithms implemented. We chose to focus on the worst case for several reasons. Firstly, our previous work found
an algorithm which has very good average case performance (for a particular definition of average), but very poor
worst case performance. As our new work has good worst case performance, we emphasize this in our experiments.
Performance on a Bulk Insert
In this experiment, we evaluated the performance of the algorithms under a uniform query distribution. The
experiment began with an empty database, which was then gradually initialized with the DBLP database. After
every insertion, on average r reads were performed, where r was a fixed parameter taken from the set {0.01, 0.10, 1.00, 10.0}. Each read operation picked two nodes at random from the underlying database, using a uniform
probability distribution, and compared their document order. In all of our experiments, we measured the total time
of the combined read and write operations, the number of read and write operations, and the number of relabelings.
However, due to space considerations, and the fact that the other results were fairly predictable, we only include the
graphs for total time.
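Note that r can be fractional: r = 0.01 means one read per hundred insertions on average. A sketch of how the per-insertion read count can be randomized to achieve this, again with the hypothetical database API used above:

import random

def reads_after_insertion(db, r):
    """Perform, on average, r uniformly distributed document-order reads."""
    # Issue floor(r) reads, plus one more with the fractional probability.
    reads = int(r) + (1 if random.random() < r - int(r) else 0)
    for _ in range(reads):
        n = db.size()
        db.compare_document_order(random.randrange(n), random.randrange(n))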
We note that, as the ratio of reads increases, the performance of both Bender's O(1) algorithm and the schema-based algorithm degrades relative to the other algorithms. We attribute this in both cases to the extra indirection involved in reading from the index. Also, because of the extremely heavy paging, even the small paging overhead incurred by an algorithm such as the schema-based algorithm, which only infrequently loads an additional page due to a read from the index, has a massive effect on the performance. Thus, although this experiment is slightly contrived, it does demonstrate that in some circumstances the indirection involved becomes unacceptable, given that values of r in real life will often be 100 or 1000. We note that one advantage of the schema-based algorithm is that we can remove the indirection if necessary, whereas with the O(1) algorithm, we cannot.
Performance on a Non-Uniform Query Distribution
This experiment was identical to the first experiment, except that the reads were sampled from a normal distribution whose mean and variance were set relative to the database size, and we took r ∈ {0.01, 0.10, 1.00, 100.0}. The idea was to reduce the heavy paging of the first experiment, and instead simulate a database "hot-spot", a phenomenon which occurs in practice.
As can be seen from the results of Figure 2.11, this experiment took substantially less time to complete than the first
experiment. It can be seen that, apart from the randomized algorithm (which again performed the minimal possible
work), the schema-based algorithm is clearly the best. Indeed, it is impressive that it came so close to the
randomized algorithm in performance.
Experiment 3: Worst-Case Performance for the Randomized Algorithm
The previous two experiments showed that the randomized algorithm had very good performance. We demonstrate in this experiment that, in some cases, it has very bad performance, far worse than the other algorithms. In spite of this, we believe the randomized algorithm is worth considering, as in many situations it is almost unbeatable. This experiment was identical to the first experiment, except that the reads were chosen to exercise the randomized algorithm's worst case.
Experiment 4: Verifying the Theory
Our final experiment was performed to verify the theoretical prediction for the expected runtime cost of the randomized variant. For this experiment, we simply loaded DBLP into a database, maintaining ordering information using Algorithm 2.5, for values of c from the set {1, 5, 10, 50, 100, 200, 1000}. The total running time of each of these algorithms, as a
proportion of the c = 1 running time, is found in Figure 2.13. The line of best fit was determined using a standard
logarithmic regression. As can be seen, the expected logarithmic dependence on c is exhibited very closely.
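The fitting step itself is a one-liner; the sketch below shows the form of the regression on synthetic data generated to be logarithmic in c (the paper's actual measurements are those of Figure 2.13, not these):

import numpy as np

c = np.array([1, 5, 10, 50, 100, 200, 1000], dtype=float)
rng = np.random.default_rng(0)
t = 1.0 + 0.5 * np.log(c) + rng.normal(0, 0.05, c.size)  # synthetic demo data

# Fit t ≈ a * ln(c) + b: a logarithmic regression via least squares.
a, b = np.polyfit(np.log(c), t, 1)
print(f"fit: time ≈ {a:.3f} * ln(c) + {b:.3f}")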
3. CONCLUSION
The contributions of this paper are three-fold. Firstly, we have presented a simple randomized algorithm which performs well in many cases, but has poor worst case performance. Secondly, we have presented an improvement to the work of Bender et al., which yields a parameterized family of algorithms giving fine control over the cost of
updates to the document ordering index versus querying this index. We believe this work is of particular relevance
in XML databases, and anticipate a wide range of application domains which will utilize such databases in the
future, with each domain having different requirements.
Our final contribution is a general scheme to utilize type information to improve the speed of document ordering
indices. This work is especially significant due to its wide applicability, as it is not tied to any particular ordering
algorithm. We have found that in practice many large XML repositories have a DTD or some other schema for
constraint purposes, and hence we expect that this work will have great practical impact.
Applications
Ancestor-Descendant Relationships
Our method can be applied to efficiently determine ancestor-descendant relationships. The key insight is due to Dietz [36], who noted that the ancestor query problem can be answered using the following fact: for two given nodes x
and y of a tree T, x is an ancestor of y if and only if x occurs before y in the preorder traversal of T and after y in the
postorder traversal of T.
We note that while we have framed our discussion in terms of document order (that is, preorder traversal), our
results could be equally well applied to the postorder traversal as well. Therefore, by maintaining two indices, one
for preorder traversal, and one for postorder traversal, which allow ordering queries to be executed quickly, we can
determine ancestor-descendant relationships efficiently.
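A minimal sketch of this technique: assign every node its preorder and postorder ranks in one traversal, after which each ancestry test is two integer comparisons.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def number_tree(root):
    """Assign (preorder, postorder) ranks to every node; O(|T|) total."""
    ranks = {}
    pre = post = 0
    def visit(node):
        nonlocal pre, post
        my_pre = pre
        pre += 1
        for child in node.children:
            visit(child)
        ranks[node] = (my_pre, post)   # postorder rank set after children
        post += 1
    visit(root)
    return ranks

def is_ancestor(x, y, ranks):
    """Dietz: x is a (proper) ancestor of y iff x precedes y in preorder
    and follows y in postorder."""
    (xpre, xpost), (ypre, ypost) = ranks[x], ranks[y]
    return xpre < ypre and xpost > ypost

# e.g. books -> book -> title: books is an ancestor of title
title = Node("title"); book = Node("book", [title]); books = Node("books", [book])
assert is_ancestor(books, title, number_tree(books))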
Query Optimization
A future application of our work is in the area of XML query optimization. As it is now possible to efficiently sort
sets of nodes from an XML database into document order, it is also possible to use storage mechanisms in the
database which cluster the data along common access paths. For instance, it may be possible to use B-trees to allow
fast access to numerical data, as we can now sort the result back into document order quickly. In order to take
advantage of this, query optimizers will have to be augmented with an accurate cost estimation function for the
ordering operator. This will allow the optimizer to choose optimal locations in the query plan to sort the intermediate
results into document order.
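Concretely, if each node carries a label from the document-ordering index, restoring document order on an intermediate result is a plain key sort; a sketch (the label accessor and B-tree query are hypothetical names, not the system's actual interface):

def sort_into_document_order(nodes, label_of):
    """Restore document order on an intermediate result set.
    label_of(node) returns the node's document-order label from the
    ordering index (hypothetical accessor)."""
    return sorted(nodes, key=label_of)

# e.g. nodes fetched through a B-tree on a numeric value, then re-sorted:
# results = sort_into_document_order(btree_range_query(lo, hi), index.label_of)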
REFERENCES
[1] Serge Abiteboul. Querying semi-structured data. In Proceedings of ICDT, volume 1186 of Lecture Notes in
Computer Science, pages 1-18, Delphi, Greece, 8-10 January 1997. Springer.
[2] Serge Abiteboul, Haim Kaplan, and Tova Milo. Compact labeling schemes for ancestor queries. In Proceedings
of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 547-556. Society for Industrial and
Applied Mathematics, 2001.
[3] Ashraf Aboulnaga, Alaa R. Alameldeen, and Jeffrey F. Naughton. Estimating the selectivity of XML path
expressions for internet scale applications. In Proceedings of the 27th International Conference on Very Large Data
Bases (VLDB '01), pages 591-600, Orlando, September 2001. Morgan Kaufmann.
[4] Ashraf Aboulnaga and Surajit Chaudhuri. Self-tuning histograms: building histograms without looking at data.
In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 181-192. ACM
Press, 1999.
[5] Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, and Sridhar Ramaswamy. Join synopses for
approximate query answering. In Proceedings of the 1999 ACM SIGMOD international conference on Management
of data, pages 275-286. ACM Press, 1999.
[6] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, pages 207-216. ACM Press, 1993.
[7] Shurug Al-Khalifa, H. V. Jagadish, Nick Koudas, and Jignesh M. Patel. Structural Joins: A Primitive for
Efficient XML Query Pattern Matching. In ICDE. IEEE Computer Society, 2002.
[8] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments.
In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 20-29. ACM Press,
1996.
[9] Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. QRS: A Robust Numbering Scheme for
XML Documents. In Proceedings of IEEE ICDE, March 2003.
[10] Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. Counting distinct elements in a
data stream. In Proc. 6th International Workshop on Randomization and Approximation Techniques in Computer
Science, pages 1-10, 2002.
[11] Michael A. Bender, Richard Cole, Erik D. Demaine, Martin Farach-Colton, and Jack Zito. Two simplified
algorithms for maintaining order in a list. In Proceedings of the 10th Annual European Symposium on Algorithms
(ESA 2002), volume 2461 of Lecture Notes in Computer Science, pages 152-164, Rome, Italy, September 17-21
2002.