A Cost-based Approach For Converting Relational Schemas to XML

A Cost-based Approach For
Converting
Relational Schemas To XML
Ramon Lawrence
University of Iowa
[email protected]
http://www.cs.uiowa.edu/~rlawrenc/
Overview
The goal of this research is to convert existing relational
schemas to XML schemas with minimal user input.
The contributions of this work are:
A fully automated relational to XML schema conversion
algorithm that minimizes a user-specified cost function (in this
case space efficiency) that does not require the user annotate
the relational schema or map to an intermediate model.
The algorithm incorporates user preferences during the
mapping if they are present. Thus, the mapping can be
completed with no user involvement, or as much involvement
as the user desires.
Empirical results demonstrating how XML nesting improves on
flat translation by creating more space efficient schemas
without introducing redundancy.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 2
Motivation
Many applications are being migrated from using relational
databases to using XML. This conversion can be very costly if
the application designers must also re-engineer the schema.
The ability to automatically convert a relational schema to XML
schema is valuable. Further, it is useful if the translation is not
a “flat translation” but rather uses nesting to improve the
efficiency and readability of the resulting XML schema.
Previous conversion algorithms required the user to specify a
complete mapping or did not allow user-specified constraints.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 3
Previous Work
The previous work can be put in 4 categories:
1) Flat translation - converts to XML without using nesting.
2) Query-based translation - conversion occurs by using a query
extraction language that is an extension of SQL or a new XML
query language. e.g. SilkRoute [Fernandez02] and XML
publishing [Carey00].
3) Model-based translation - converts the relational model to an
intermediate model and then maps to XML using translation
rules. e.g. [Bird00,Du01,Embley01]
4) Dependency-based translation - uses dependency information
in schema to translate to XML. e.g. NeT and CoT [Lee02].
No system is fully automated and yet at the same time respects
user constraints on the final schema result. No previous work
uses a cost function to return the optimal nesting.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 4
Nesting and Efficiency
Exploiting nesting in XML has three major advantages:
1) Improves readability by grouping related concepts.
2) Improves space efficiency by avoiding repetition and
encoding of foreign keys that can be implicitly defined by the
hierarchical structure.
3) Improves query efficiency by clustering concepts together
and avoiding joins.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 5
Space Efficiency Metric
Space efficient XML documents can be achieved by nesting as it
eliminates foreign keys. For example, in TPC-H, order
information can be nested under customer information, and the
foreign key orders.o_custkey is not represented explicitly:
<!ELEMENT customer (c_custkey, c_name, ..., orders*)>
<!ELEMENT orders (o_orderkey, o_orderdate, ...)>
Amount of space savings:
Number of data values not stored is equal to the size of the
relation (in this case orders) *(# of foreign key attributes).
Let |R| denote the number of tuples of R. The space savings of
nesting relational schema R under S on key K is denoted by
savings(R,S,K) and is approximated by sizeOf(K)*|R|, where
sizeOf(K) is the maximum schema size of the foreign key K of R.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 6
Nest Graph
A nest graph represents the nesting possibilities and their
desirability by using existing relational schema information.
A Nest-Graph G = (V,E) is a directed, weighted graph consisting
of a set of nodes V and a directed edge set E, such that for
each relation R in the database schema, there exists a node VR
V, and for each non-nullable foreign key FKR(S) there is an
edge e=(VS,VR,w)  E, where w=savings(R,S,FKR(S)).
Nest-graph for TPC-H:
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 7
XML Mapping Algorithm
The mapping from the relational model to an XML schema has
these general steps:
Extract relational schema information including foreign keys
and relation sizes from the database.
Use the information to build a Nest-Graph G.
Use the classical algorithm by Edmonds to calculate the
maximum weight arborescence T of G.
The maximum weight arborescence is a maximal spanning tree
(MST) T. Use T to generate the nesting in the XML schema.
Assume that given an edge e=(S,R,w) of T, the two relations are
S(A1,A2,...,Am) and R(B1,B2,...,Bn) respectively. Let A1=PKS, B1=PKR, and
Bk=FKR(S).
The primary keys and foreign keys are encoded as XML attributes, and
all other relational attributes are encoded as XML elements.
Note: Introduce a new node r' to G, and a set of edges E' where each
e=(r',Vi,0)  E' for all i=1..|V|. This guarantees that a MST will exist.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 8
XML Mapping Algorithm (2)
Using the maximal spanning tree T:
For each edge e=(S,R,w) of T, we nest R under S (and omit Bk):
<!ELEMENT S (A_2, ..., A_m, R*)>
<!ATTLIST S A_1 ID>
<!ELEMENT R (B_2, B_3,..., B_k-1, B_k+1,...B_n)>
<!ATTLIST R B_1 ID>
each edge e=(S,R,w) where e G and e  T, this
relationship will be captured using ID/IDREF. Under the element
R is a IDREF attribute like:
For
<!ELEMENT R (B_2, B_3,..., B_k-1, B_k+1,..., B_n>
<!ATTLIST R B_k IDREF>
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 9
Maximal Spanning Tree Result
Maximal spanning tree:
Space savings achieved:
For TPC-H at scale factor 1, the space savings is 57,849,820
characters or 14,462,455 data values. Equivalently, the savings
is 43.5% of all foreign keys or 12.4% of all data values.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 10
XML Result DTD
<!ELEMENT root (part*, region*)>
<!ELEMENT part (p_name, … , p_retailprice, partsupp*)>
<!ATTLIST part p_partkey ID>
<!ELEMENT region (r_name, r_comment, nation*)>
<!ATTLIST region r_regionkey ID>
<!ELEMENT partsupp (ps_availqty, … , lineitem*)>
<!ATTLIST partsupp ps_suppkey IDREF>
<!ELEMENT nation (n_name, … , customer*, supplier*)>
<!ATTLIST nation n_nationkey ID>
<!ELEMENT lineitem (l_linenumber, …, l_comment)>
<!ATTLIST lineitem l_orderkey IDREF>
<!ELEMENT customer(c_name, … , c_comment, orders*)>
<!ATTLIST customer c_custkey ID>
<!ELEMENT supplier (s_name, …, s_acctbal)>
<!ATTLIST supplier s_suppkey ID>
<!ELEMENT orders (o_orderdate, … , o_comment)>
<!ATTLIST orders o_orderkey ID>
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 11
User-Directed Mapping
The algorithm determines the optimal space efficient nesting of
relational schemas into an XML schema given no constraints.
However, the user can control the XML schema output by
specifying constraints on the XML schema that should be
satisfied without having to specify a total mapping.
Four types of constraints can be incorporated into the mapping
algorithm by appropriate modifications to the Nest-Graph:
1) Non-nested nodes - by removing node’s incoming edges.
2) Edges (nestings) that must be present - edge weight = 
3) Edges (nestings) that must not be present - remove edge
4)Handling nullable foreign keys - user decides on encoding;
may use null-record.
User may also specify any cost function, just not the space
efficiency one discussed here.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 12
Experimental Results
Results summary:
40-50% of foreign keys do not have to be encoded. Elimination
of all foreign keys is only possible if the Nest-Graph is a tree.
10 to 30% of all attributes are not encoded.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 13
Future Work and Conclusions
Developed an efficient and user-friendly tool for automating
construction of XML schemas from relational schemas.
System is an improvement over current approaches which either
do not have formal methods for capturing user constraints or
require the user to exactly specify the entire XML schema.
The space efficiency cost-metric proposed is useful, although
other metrics are possible including those that factor in query
costs as well.
Future work involves expanding the algorithm to consider cost
functions and nestings that introduce redundancy to increase
query performance at the sacrifice of space efficiency. Also, a
investigation on nesting query results will be performed.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 14
A Cost-based Approach For
Converting
Relational Schemas To XML
Ramon Lawrence
University of Iowa
[email protected]
http://www.cs.uiowa.edu/~rlawrenc/
Thank You!
Extra Slides...
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 16
TPC-H Schema
Table
Part
Attributes
partkey, name, mfgr, brand, type, size, container, retailprice, comment
Supplier
supkey, name, address, nationkey, phone, acctbal, comment
PartSupp
partkey, suppkey, availqty, supplycost, comment
Customer
custkey, name, address, nationkey, phone, acctbal, mktsegment, comment
Order
orderkey, custkey, orderstatus, totalprice, orderdate, orderpriority, clerk,
shippriority, comment
LineItem
orderkey, partkey, suppkey, linenumber, quantity, extendedprice, discount,
returnflag, tax, linestatus, shipdate, commitdate, receiptdate, shipinstruct,
shipmode, comment
Nation
nationkey, name, regionkey, comment
Region
regionkey, name, comment
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 17
Foreign Keys in TPC-H
Table with
Foreign Key
Table
Referenced
LineItem
Part
LineItem
Supplier
l_suppkey  s_suppkey
LineItem
PartSupp
l_partkey, l_suppkey  ps_partkey, ps_suppkey
LineItem
Order
l_orderkey  o_orderkey
PartSupp
Part
ps_partkey  p_partkey
PartSupp
Supplier
Supplier
Nation
s_nationkey  n_nationkey
Customer
Nation
c_nationkey  n_nationkey
Order
Customer
Nation
Region
Primary Keys
Functional Dependencies
l_partkey  p_partkey
ps_suppkey  s_suppkey
o_custkey  c_custkey
n_regionkey  r_regionkey
Foreign Keys
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 18
Query Result Nesting
A related problem is the problem of converting the result of a
query over a relational schema into XML. Space savings can be
achieved by taking the de-normalized query result and applying
some normalization by exploiting nesting. A query example:
SELECT n_name, c_name, o_orderdate, o_orderkey,
l_partkey, l_quantity, s_name
FROM customer C, orders O,lineitem L,supplier S, nation N
WHERE C.c_custkey = O.o_custkey AND L.l_orderkey =
O.o_orderkey AND L.l_suppkey = S.s_suppkey
AND C.c_nationkey = N.n_nationkey
This
query returns over 6 million records. Using flat translation
wastes space. For instance, since the cardinality of nation is 25,
the XML encoding would redundantly encode approximately 6
million nation names.
The algorithm can be used with queries involving selection,
projection, and joins between relations on primary/foreign keys.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 19
Query Nesting Results
Results summary:
For the example query, nesting the query result reduces the
number of data values in the XML document by 50%, which
corresponds to 20,904,839 fewer values.
For other benchmark TPC-H queries, 20-30% of data values
were not encoded. Even with compression, this is significant as
it results in smaller files to compress.
Ramon Lawrence A Cost-based Approach For Converting
Relational Schemas to XML
The University of Iowa.
Copyright© 2005
Page 20