here - CS HUJI Home Page

Relational to XML Transformations
 Background & Issues
 Preliminaries
Execution strategies
The SilkRoute System
2005
rel-xml-i
1
Background & Issues
• XML – a de facto standard for data exchange (B2B)
• Business data is, and will continue to be, stored in relational DBs:
reliability, optimized query processing, established applications
Need efficient generation of XML data from relational DBs
Subject was investigated in several projects (early 2000’s) :
• Xperanto (IBM Almaden)
• SilkRoute (AT&T)
• PRATA (Bell Labs/Lucent)
Issues :
Relations are flat, unordered, tuples are pure data
XML is nested, tagged, and ordered
• Need a language/interface to specify needed data and its form
• Which parts of the work should be performed inside, and which
outside, the relational engine?
Relational engines are very good at
• Optimizing SQL queries (efficient execution)
• Sorting
But do not deal with tagging, do not generate XML
Xperanto :
The IBM team had access to DB2 internals
• Extended SQL by a few primitives to generate XML,
Implemented these in the DB2 relational engine
• Analyzed the space of execution strategies using simulations, and
concluded that two strategies, both doing almost all the work as one
(extended) SQL query, are best
But:
We do not have access to db internals
So the interesting part for us is their analysis/simulations
SilkRoute :
• Relational db is presented as an XML view (standard transformation)
• Desired XML specified in XML query language
– Initial version: home-brewed XML query language
– Last version: XQuery
Main idea:
• Composing the user query with the query that defines the standard view
allows the data to be generated by SQL queries + tagging
• Found that one big SQL query is not always best
PRATA :
• Use DTD’s as description of desired XML, & a generalization
of attribute grammars with query actions to specify the needed
data
• Can handle recursion in DTD’s (former approaches cannot)
• Can still optimize to use a small number of SQL queries for data
generation
Comment :
We are now back in the GAV (global-as-view) approach
• View: standard XML view of relational
• Desired XML: query on this view
• Main idea: query composition
Complications:
• XML data is tagged, relational is not
• Nested data
• Different nesting in view and query target
 Need to change structure
 May need fusion
Execution strategies
The issue:
Data is stored in relational tables, can be retrieved with
one/few/many SQL queries
• Which approach is more efficient?
• How can the approach be implemented, assuming the
transformation is specified in some XML-ish query language?
Xperanto execution strategies :
Base example :
The source relational schema:
Space of evaluation strategies :
• Early/late tagging
• Early/late structuring (to form the nested XML structure)
• All work inside the engine, or (at least part) outside the engine
Some combinations are meaningless, e.g.
early tagging/late structuring
1st strategy : early tagging, early structuring, outside the engine
An application issues a sequence of SQL queries, matching the
structure of the result, e.g.:
For each customer do
1. retrieve root – customer info –
cust. name & id retrieved, tagged & output
2. retrieve, tag, & output customer account info
3. retrieve, tag, & output customer purchase orders
4. for each PO, retrieve, tag, output items, then payment info
Early structuring : queries follow structure of generated doc
Early tagging : each element is tagged when retrieved
Outside the engine : obvious
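The loop above can be sketched in a few lines. This is only an illustration, not the Xperanto code: it uses Python's sqlite3 as a stand-in for the relational engine, the table names follow the SQL on the later slides, and the columns (name, descr), the tag names, and the sample data are assumptions. Payment info is left out for brevity. The query counter shows how the number of SQL calls grows with the number of customers and purchase orders.

```python
import sqlite3
from xml.sax.saxutils import escape

def build_db():
    # Hypothetical schema and data; only the table names come from the slides.
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE Customer(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE Account(id INTEGER PRIMARY KEY, custId INTEGER);
        CREATE TABLE PurchaseOrder(id INTEGER PRIMARY KEY, custId INTEGER);
        CREATE TABLE Item(id INTEGER PRIMARY KEY, poId INTEGER, descr TEXT);
        INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO Account VALUES (10, 1), (11, 1);
        INSERT INTO PurchaseOrder VALUES (100, 1);
        INSERT INTO Item VALUES (1000, 100, 'pen'), (1001, 100, 'ink');
    """)
    return db

def export_customers(db):
    out, n_queries = [], 0

    def run(sql, *args):
        # Every call is one round trip to the engine.
        nonlocal n_queries
        n_queries += 1
        return db.execute(sql, args).fetchall()

    # Early structuring: the loops follow the shape of the document.
    # Early tagging: each value is wrapped in its tag as soon as it is fetched.
    for cid, name in run("SELECT id, name FROM Customer ORDER BY id"):
        out.append(f'<customer id="{cid}"><name>{escape(name)}</name>')
        for (aid,) in run("SELECT id FROM Account WHERE custId = ? ORDER BY id", cid):
            out.append(f'<account id="{aid}"/>')
        for (pid,) in run("SELECT id FROM PurchaseOrder WHERE custId = ? ORDER BY id", cid):
            out.append(f'<po id="{pid}">')
            for (descr,) in run("SELECT descr FROM Item WHERE poId = ? ORDER BY id", pid):
                out.append(f"<item>{escape(descr)}</item>")
            out.append("</po>")
        out.append("</customer>")
    return "".join(out), n_queries

xml, n_queries = export_customers(build_db())
```

With two customers and one purchase order the sketch already issues six queries; the application, not the engine, fixes the join order.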
Shortcomings :
• Many small granularity queries – several queries per
“object” – serious performance problems
• Performs a nested loop join – a fixed join order and join
strategy – the relational engine might explore others
2nd strategy: Early structuring, tagging inside the engine :
For this, augment the db engine with
• New data type : xml
• Constructors for the kinds of elements in the document, e.g.
Now, can express query as:
XMLAGG aggregates several XML fragments into one
For example, two accounts for the customer
The XML fragments have variable size, so they are represented as
character large objects (CLOBs)
Problems:
• CLOBs are stored separately from their tuples, hence may need
separate fetches
• Each XML constructor copies the input CLOBs to form its
output CLOB – a lot of copying
Advantage :
One large query, rather than many small ones
Still nested loop join, but possibly engine can select another
strategy (?)
3rd strategy: Late structuring & tagging :
If both structuring and tagging are done late (possibly outside the engine),
we can separate the process into two steps:
• Content creation – retrieve the data from the db, inside the
engine
• Structuring and tagging – the 1st possibly inside, the 2nd outside
Content creation – outer join approach :
Select cust.*, acct.*, porder.*, pay.*, item.*
From Customer cust
left join Account acct on cust.id = acct.custId
left join PurchaseOrder porder on cust.id = porder.custId
left join Item item on porder.id = item.poId
left join Payment pay on porder.id = pay.poId
Left join: a customer should occur in the result even if it has no account, or
has an account but no purchase orders, etc.
The result for a customer w/o some fields is padded with nulls
Join is performed for each path in tree (root to leaf) in some order
Disadvantages : ??
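One way to look at the cost of the single outer join is to count the rows it returns on a small data set. The sketch below is an illustration only: SQLite stands in for the engine, and the id and name columns and the sample data are assumptions (the table names and join conditions follow the query above).

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical schema and data, for illustration only.
    CREATE TABLE Customer(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Account(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE PurchaseOrder(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE Item(id INTEGER PRIMARY KEY, poId INTEGER);
    CREATE TABLE Payment(id INTEGER PRIMARY KEY, poId INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Account VALUES (10, 1), (11, 1);
    INSERT INTO PurchaseOrder VALUES (100, 1);
    INSERT INTO Item VALUES (1000, 100), (1001, 100), (1002, 100);
    INSERT INTO Payment VALUES (500, 100), (501, 100);
""")

rows = db.execute("""
    SELECT cust.id, acct.id, porder.id, item.id, pay.id
    FROM Customer cust
    LEFT JOIN Account acct ON cust.id = acct.custId
    LEFT JOIN PurchaseOrder porder ON cust.id = porder.custId
    LEFT JOIN Item item ON porder.id = item.poId
    LEFT JOIN Payment pay ON porder.id = pay.poId
    ORDER BY 1, 2, 3, 4, 5
""").fetchall()

# Ann: 2 accounts x 1 order x 3 items x 2 payments = 12 rows; Bob: 1 null-padded row.
print(len(rows))
```

Ann has only 8 child tuples in the base tables, but the join returns 12 rows for her, because sibling branches (accounts, items, payments) are multiplied against each other.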
Content creation – (unsorted) path outer union approach :
Select cust.*, acct.*, 1 as type
From Customer cust
left join Account acct on cust.id = acct.custId
Outer union
Select cust.*, porder.*, item.*, 2 as type
From Customer cust left join PurchaseOrder porder on cust.id = porder.custId
left join Item item on porder.id = item.poId
Outer union
Select cust.*, porder.*, pay.*, 3 as type
From Customer cust left join PurchaseOrder porder on cust.id = porder.custId
left join Payment pay on porder.id = pay.poId
Outer union: pads with nulls, like the left/right join, but does not
duplicate data (as much)
Each sub-query is a join for one leaf-to-root path
Note: the Customer/PurchaseOrder join is repeated in two sub-queries
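A runnable approximation of the path outer union, under stated assumptions: SQLite has no OUTER UNION operator, so the sketch uses UNION ALL and pads the missing columns with explicit NULLs. The id and name columns and the sample data are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical schema and data, for illustration only.
    CREATE TABLE Customer(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Account(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE PurchaseOrder(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE Item(id INTEGER PRIMARY KEY, poId INTEGER);
    CREATE TABLE Payment(id INTEGER PRIMARY KEY, poId INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Account VALUES (10, 1), (11, 1);
    INSERT INTO PurchaseOrder VALUES (100, 1);
    INSERT INTO Item VALUES (1000, 100), (1001, 100), (1002, 100);
    INSERT INTO Payment VALUES (500, 100), (501, 100);
""")

# One branch per root-to-leaf path; columns a branch does not produce are NULL.
# Columns: type, custId, name, acctId, poId, itemId, payId
rows = db.execute("""
    SELECT 1 AS type, cust.id, cust.name, acct.id, NULL, NULL, NULL
    FROM Customer cust
    LEFT JOIN Account acct ON cust.id = acct.custId
    UNION ALL
    SELECT 2, cust.id, cust.name, NULL, porder.id, item.id, NULL
    FROM Customer cust
    LEFT JOIN PurchaseOrder porder ON cust.id = porder.custId
    LEFT JOIN Item item ON porder.id = item.poId
    UNION ALL
    SELECT 3, cust.id, cust.name, NULL, porder.id, NULL, pay.id
    FROM Customer cust
    LEFT JOIN PurchaseOrder porder ON cust.id = porder.custId
    LEFT JOIN Payment pay ON porder.id = pay.poId
""").fetchall()

# Ann: 2 + 3 + 2 = 7 rows (a sum over paths, not a product); Bob: 3 padded rows.
print(len(rows))
```

On this data the branches add up to 10 rows, and Ann's name is still repeated in each of her 7 rows.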
Content creation – (unsorted) node outer union approach :
The previous strategy still has some redundancy:
A parent’s info is replicated with its descendants
Can avoid this by carrying only the ids of the parents in the descendants
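A minimal sketch of the node variant, with the same caveats: SQLite and UNION ALL with NULL padding stand in for the engine's outer union, and the columns and data are assumptions. Each branch emits one kind of node; a row carries that node's own data plus only the ids of its ancestors.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical schema and data, for illustration only.
    CREATE TABLE Customer(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Account(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE PurchaseOrder(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE Item(id INTEGER PRIMARY KEY, poId INTEGER);
    CREATE TABLE Payment(id INTEGER PRIMARY KEY, poId INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Account VALUES (10, 1), (11, 1);
    INSERT INTO PurchaseOrder VALUES (100, 1);
    INSERT INTO Item VALUES (1000, 100), (1001, 100), (1002, 100);
    INSERT INTO Payment VALUES (500, 100), (501, 100);
""")

# Columns: type, custId, name, acctId, poId, itemId, payId
rows = db.execute("""
    SELECT 0 AS type, id AS custId, name, NULL AS acctId, NULL AS poId,
           NULL AS itemId, NULL AS payId
    FROM Customer
    UNION ALL
    SELECT 1, custId, NULL, id, NULL, NULL, NULL FROM Account
    UNION ALL
    SELECT 2, custId, NULL, NULL, id, NULL, NULL FROM PurchaseOrder
    UNION ALL
    SELECT 3, po.custId, NULL, NULL, po.id, item.id, NULL
    FROM Item item JOIN PurchaseOrder po ON item.poId = po.id
    UNION ALL
    SELECT 4, po.custId, NULL, NULL, po.id, NULL, pay.id
    FROM Payment pay JOIN PurchaseOrder po ON pay.poId = po.id
""").fetchall()

# The customer's own data (name) now appears once per customer, not once per row.
names = sorted(r[2] for r in rows if r[2] is not None)
```

Here there is exactly one row per node (10 in total), and descendant rows reference the customer only by id.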
Structuring & tagging (for unsorted outer union) :
Use a hash table (a hash-based tagger), with the hash key being the type
and ancestor ids of an element in the XML tree
For each tuple in the relational result that defines a node in the tree, find
out (by hashing) if its parent is present:
Yes – just add this new node (tag)
No – add nodes for all the missing ancestors along the path to the
root (hashing repeatedly for shorter paths)
Main disadvantage :
For large outputs, main memory shortage
Also, does not necessarily satisfy required order – may need sorting
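The tagger described above might look roughly as follows. This is a sketch under assumptions, not the Xperanto implementation: rows are taken to be node outer union tuples of the hypothetical form (tag, custId, acctId, poId, itemId, payId), the tag names are invented, and the hash key is the root-to-node path of (tag, id) pairs.

```python
def path_of(row):
    """Root-to-node path of (tag, id) pairs for one tuple (hypothetical layout)."""
    tag, cust, acct, po, item, pay = row
    path = [("customer", cust)]
    if po is not None:
        path.append(("po", po))
    if tag not in ("customer", "po"):
        path.append((tag, {"account": acct, "item": item, "payment": pay}[tag]))
    return path

class Node:
    def __init__(self, tag, ident):
        self.tag, self.ident, self.children = tag, ident, []

def hash_tag(rows):
    table, roots = {}, []   # the table ends up holding every node: O(output) memory
    for row in rows:
        path = path_of(row)
        depth = len(path)
        # Hash repeatedly for shorter paths until an existing ancestor is found.
        while depth > 0 and tuple(path[:depth]) not in table:
            depth -= 1
        # Add the missing nodes, from the deepest existing ancestor downward.
        for d in range(depth + 1, len(path) + 1):
            node = Node(*path[d - 1])
            table[tuple(path[:d])] = node
            parent = table.get(tuple(path[:d - 1]))
            (parent.children if parent else roots).append(node)
    return roots

def to_xml(node):
    inner = "".join(to_xml(child) for child in node.children)
    return f'<{node.tag} id="{node.ident}">{inner}</{node.tag}>'

# Tuples may arrive in any order; here an item arrives before its ancestors.
rows = [
    ("item", 1, None, 100, 1000, None),
    ("account", 1, 10, None, None, None),
    ("customer", 1, None, None, None, None),
    ("payment", 1, None, 100, None, 500),
    ("po", 1, None, 100, None, None),
]
xml = "".join(to_xml(root) for root in hash_tag(rows))
```

In this run the account ends up after the purchase order simply because its tuple arrived later, which illustrates why the output may still need sorting.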
Last strategy : early structuring & late tagging :
The idea: order the relational content in the same order in which it needs
to appear in the (flattened) XML file
Then, tagging (& nesting) can be performed in constant space
• All info about a node X appears before/with the information of
its children
• The info of X & its descendants appears together (no mixing
with descendants of other nodes)
• The children are ordered as required by the XML def
Content creation – sorted path/node outer union approach :
Same as the outer union approach, but add a final sort step
Relational engines are sorting experts, including external sorting
• Sort on id fields, with ids of higher nodes preceding those of
lower nodes
In example: CustId, AcctId, PoId, ItemId, PaymentId
• Nulls should be accounted for in sorting, and null values should
precede non-nulls
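A sketch of the sorted node outer union followed by a streaming tagger, under the same assumptions as before (SQLite as the engine, UNION ALL with NULL padding in place of outer union, hypothetical columns, tag names, and data). SQLite places NULLs first in ascending order, which matches the null rule above. The tagger keeps only a stack of the currently open elements, so its memory is bounded by the depth of the tree rather than by the size of the output.

```python
import sqlite3

SORTED_NODE_OUTER_UNION = """
    SELECT 'customer' AS tag, id AS custId, NULL AS acctId, NULL AS poId,
           NULL AS itemId, NULL AS payId FROM Customer
    UNION ALL SELECT 'account', custId, id, NULL, NULL, NULL FROM Account
    UNION ALL SELECT 'po', custId, NULL, id, NULL, NULL FROM PurchaseOrder
    UNION ALL SELECT 'item', po.custId, NULL, po.id, item.id, NULL
              FROM Item item JOIN PurchaseOrder po ON item.poId = po.id
    UNION ALL SELECT 'payment', po.custId, NULL, po.id, NULL, pay.id
              FROM Payment pay JOIN PurchaseOrder po ON pay.poId = po.id
    ORDER BY 2, 3, 4, 5, 6
"""

def path_of(row):
    """Root-to-node path of (tag, id) pairs for one tuple (hypothetical layout)."""
    tag, cust, acct, po, item, pay = row
    path = [("customer", cust)]
    if po is not None:
        path.append(("po", po))
    if tag not in ("customer", "po"):
        path.append((tag, {"account": acct, "item": item, "payment": pay}[tag]))
    return path

def stream_tag(rows):
    stack = []  # open elements, root first; never deeper than the tree height
    for row in rows:
        path = path_of(row)
        # Close every open element that is not an ancestor of the new node.
        while stack and stack != path[:len(stack)]:
            yield f"</{stack.pop()[0]}>"
        tag, ident = path[-1]
        yield f'<{tag} id="{ident}">'
        stack.append(path[-1])
    while stack:
        yield f"</{stack.pop()[0]}>"

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical schema and data, for illustration only.
    CREATE TABLE Customer(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Account(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE PurchaseOrder(id INTEGER PRIMARY KEY, custId INTEGER);
    CREATE TABLE Item(id INTEGER PRIMARY KEY, poId INTEGER);
    CREATE TABLE Payment(id INTEGER PRIMARY KEY, poId INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Account VALUES (10, 1), (11, 1);
    INSERT INTO PurchaseOrder VALUES (100, 1);
    INSERT INTO Item VALUES (1000, 100), (1001, 100);
    INSERT INTO Payment VALUES (500, 100);
""")
xml = "".join(stream_tag(db.execute(SORTED_NODE_OUTER_UNION)))
```

With this particular key order and nulls first, each customer's purchase orders come out before its accounts, and payments before items; a different sibling order would need a different arrangement of the sort keys.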
Performance comparisons (fig. 13):
• Outer join performs badly – too much data redundancy (not shown)
• One large query is better than many small ones (stored proc.)
• Inside the engine outperforms outside the engine (for similar strategies):
outside the engine needs to copy data and bind to external
variables; binding-out time is a significant component for all
approaches (shown in black in the figure)