SAL: An Algebra for Semistructured Data and XML

Catriel Beeri and Yariv Tzaban
The Hebrew University
{beeri,yarivi}@cs.huji.ac.il
1 Introduction
Semistructured Data (SSD) in general, and XML [14]
in particular, are receiving growing attention in contemporary research in databases and related fields. A
data model for SSD has been established and widely
accepted, namely OEM [1], the data model of the
Lore system. Several query languages have been proposed for SSD [1, 2, 9], along with systems for storing, browsing and querying such data. Algebras for
object-oriented (OO) databases were extended to deal with features relevant to SSD; e.g., the algebra for handling nested
queries presented in [6, 7] was extended in [4] to handle general path expressions (GPEs). [12] extended
the traditional SPJ algebra into a navigational algebra for the Web by adding two new operators
(unnesting a nested list inside an HTML page, and
following an HTML link). It also proposed a cost
model for query execution plans, and gave some cost-based optimizations. No algebra was presented for Lore, but its authors did sketch a semi-algebraic query
execution model with pipelined execution and optimization possibilities.
Lately, with the emergence of XML as a standard
format for Web data, attempts are being made to apply
SSD research to XML issues such as modeling and
querying. [8] proposed XML-QL, a query language
for XML based on pattern matching. In [10] an OQL-based language in the spirit of Lorel was suggested.
Many other languages and insights about this subject
can be found in [13].
This paper is a preliminary report on SAL, a
Semistructured ALgebra. SAL is an algebraic query
language that we are designing to serve as a target language for translation from declarative user-oriented query languages for XML. Like existing algebras, it can provide a concise representation
of query execution that still leaves room for various
decisions about actual execution plans, and supports
rewrite-based and cost-based optimizations. It has
been influenced by the work in [6, 7], with extensions
that we believe are relevant to the XML format.
The XML data format presents some challenges.
Being a textual format, it has order built into it. A
query language needs to allow the users to view the
data as ordered. Although lists have been treated
in languages for complex values and OO models, order
has not been a central issue. Additionally, the format,
as in the OEM model, combines lists and records, in
that attributes can be multi-valued; often a request
for an attribute value returns a list of values. Our
algebra has been designed with these issues in mind.
While XML does not impose a rigid structure, as in
classical databases, it does have a notion of schema,
namely a DTD. When a schema is given, there is
room for static detection of type errors, which can
help to detect many user mistakes. There is also
room for type-based optimizations, like ignoring uninteresting sections of a graph, or algebraic rewriting
optimizations based on knowledge of the structure of
the input data. DTD-based schemas are much more
flexible than classical database schemas (see [3]); hence
schema-based optimization needs to be studied carefully. Although an important subject, it is not considered in this paper.
In classical databases, a query system could detect
all type errors statically, before accessing the data.
We believe this is no longer the case in SSD: schemas
may allow quite irregular structure; sometimes, no
schema is given. Thus, one has to be prepared for
run-time errors, such as missing field values. In some
situations these are not real errors; in others, a user who is notified about the errors, possibly with some sample data, might improve the query formulation. We propose a mechanism, called data exceptions, for dealing with such errors. We believe this
mechanism opens the way for flexible treatment of
various kinds of exceptional data, but its development
requires more work.
In short, our algebra can handle multi-valued attributes, has support for detection and handling of
run-time errors, and most of its operations preserve
order (if present in the input).
Remark The algebra is still under development.
This paper represents the current state of our work.
SAL assumes an extensible collection of functions and
predicates (boolean functions). We assume as given an
equality/identity predicate, ’=’, that accepts any two
values, relational comparisons (’<’, ’>’, etc.) on numbers and strings, and closure of predicates under the
boolean connectives (’and’, ’or’, ’not’). We also assume as given the append operation on lists, denoted ’||’,
membership and containment predicates (where
the lists are interpreted as sets), denoted ’∈’ and ’⊂’,
and a dupelim operator, which makes a set out of
a list (based on the equality predicate). String functions, e.g. like (regular expression matching between
strings) and contains (between strings), can be included. The collection of functions is closed under
composition; furthermore, operators of the algebra,
applied to functions or predicates, are also functions.
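The assumed list functions can be illustrated with a minimal Python sketch. The function names append, dupelim, member, contained and compose are hypothetical stand-ins for '||', dupelim, '∈', '⊂' and composition; Python's == plays the role of the '=' predicate, and reading containment as non-strict is an assumption.

```python
def append(xs, ys):
    """The '||' operation: concatenate two lists, keeping order."""
    return list(xs) + list(ys)


def dupelim(xs):
    """Drop repeated values, keeping the first occurrence of each.

    Equality is Python's ==, standing in for the '=' predicate.
    """
    out = []
    for x in xs:
        if x not in out:  # `in` compares with ==, so nested lists work too
            out.append(x)
    return out


def member(x, xs):
    """The membership predicate, with the list read as a set."""
    return x in xs


def contained(xs, ys):
    """Containment, both lists read as sets (non-strict: an assumption)."""
    return all(x in ys for x in xs)


def compose(f, g):
    """Closure under composition: compose(f, g)(x) is f(g(x))."""
    return lambda x: f(g(x))
```

Keeping the first occurrence in dupelim is one way to obtain an ordered result; the text itself only says that the result is a set.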
2 The Data Model
Our data model for XML is essentially a variation
of OEM [1]. It views the data as an edge-labeled
directed graph, with a distinguished single root, and
order on the outgoing edges of every node. A node
in the graph may have an ID (which serves as its
identity), but does not have to have one (such a node
is treated as a complex value). A node with an ID can
be referenced from more than one node in the graph,
thereby allowing the graph to contain directed cycles.
In [11] the notions of identity test vs. equality are
discussed. In SAL, both are defined; whether one or
both are used depends on the query language being
translated to the algebra.
Some comments: (i) We treat XML elements and
attributes in the same manner; it is easy to extend
the algebra to make a distinction, (ii) we assume
XML entities and default values (for attributes) are
eliminated during the document processing, and (iii)
IDREFs are represented as edges in the graph, hence
an XML document is a general graph rather than a
tree. For a short discussion of the model and of the
view of XML documents as graphs, see [8].
Here is our model in brief: an XmlNode is either an AtomicValue or a list of XmlValues. An
XmlValue is a pair of the form [XmlLabel, XmlNode], where XmlLabel is of type L (the set of labels,
as defined in [14]):
XmlValue    := '[' XmlLabel ',' XmlNode ']'
XmlNode     := AtomicValue | '<' XmlValue (',' XmlValue)* '>'
XmlLabel    := defined in [14]
AtomicValue := defined in [14]
A graph is an XmlValue. The concepts XmlValue,
XmlNode and XmlLabel correspond to the element,
content and tag variables used in [8].
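As an illustration of this grammar, the following Python sketch encodes an XmlValue as a (label, node) tuple and an XmlNode as either an atomic Python value or a list of such tuples. The encoding, the helper names show_value and show_node, and the sample data are assumptions made for the example; IDREF edges (and hence cycles) are not modeled.

```python
def show_value(p):
    """Render an XmlValue (label, node) in the bracket notation [label,node]."""
    label, node = p
    return "[" + label + "," + show_node(node) + "]"


def show_node(n):
    """Render an XmlNode: a list becomes <v1,...,vm>, an atom is printed as is."""
    if isinstance(n, list):
        return "<" + ",".join(show_value(q) for q in n) + ">"
    return str(n)


# Made-up sample: a book with one title and two authors, so the
# label 'author' is multi-valued and the order of edges is kept.
book = ("book", [("title", "T1"), ("author", "A1"), ("author", "A2")])
print(show_value(book))
```

Using a list for the children keeps the order on outgoing edges that the model requires.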
3 The Algebra
SAL consists of operators defined on lists: selection
(σ), mapping (χ), extend or list-mapping (χ_l), join
(⋈), group-by (Γ), regular-expression matching (ρ)
and variable binding (VBind). Most preserve order,
and unordered versions are easily defined. Many of
them are well known; we present them to clarify their
behavior on lists. All the operators are polymorphic,
to support the arbitrarily nested structure of XML.
3.1 Functions and Predicates
In addition, the following simple operators and abbreviations are used:
• Where p is an XmlValue, p.label and p.node denote p’s XmlLabel and XmlNode respectively.
• Where a is an XmlLabel and n1, ..., nm are
XmlNodes, a = <n1, ..., nm> abbreviates
<[a, n1], ..., [a, nm]>, and it is an XmlNode.
• Where p is an XmlValue, <p> is the singleton list containing p, and it is an XmlNode.
3.2 The Operators
We now present signatures and definitions of the algebraic operators. We use variables for types (τ for
XmlValue, ν for XmlNode) and for values (p, q for
XmlValues, N, n for XmlNodes, a for XmlLabels).
Selection Operator The selection operator preserves order.
σ: if Pred : τ → B, then
σ(Pred) : <τ> → <τ>, where
σ(Pred)(N) ≡ <p | p ⇐ N, Pred(p)>
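A minimal Python sketch of the selection operator, assuming lists are Python lists and predicates are ordinary functions (the name select is hypothetical):

```python
def select(pred):
    """sigma(Pred): keep the elements of N that satisfy pred, in input order."""
    return lambda N: [p for p in N if pred(p)]


# Made-up example: keep the XmlValues labeled 'author'.
values = [("title", "T1"), ("author", "A1"), ("author", "A2")]
authors = select(lambda p: p[0] == "author")(values)
print(authors)
```

Because the list comprehension walks N from left to right, the order of the input is preserved in the output.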
Mapping Operators We introduce two mapping
operators. χ is the classical mapping operator on
lists, and χ_l is a generalized version of χ, for the case
when the mapping function maps a value to a list of
values.
χ: if f : τ1 → τ2, then
χ(f) : <τ1> → <τ2>, where
χ(f)(N) ≡ <f(p) | p ⇐ N>
χ_l: if f : τ1 → <τ2>, then
χ_l(f) : <τ1> → <τ2>, where
χ_l(f)(N) ≡ append_{p ⇐ N} f(p)
When the mapping is onto a set of given fields, we
denote it by Π.
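The difference between the two mapping operators can be seen in a short Python sketch (the names chi and chi_l are hypothetical): χ produces one output per input, while χ_l appends the lists produced for the individual inputs.

```python
def chi(f):
    """chi(f): apply f to every element of N, keeping order."""
    return lambda N: [f(p) for p in N]


def chi_l(f):
    """chi_l(f): f returns a list per element; the lists are appended in order."""
    def run(N):
        out = []
        for p in N:
            out.extend(f(p))
        return out
    return run


doubled = chi(lambda x: 2 * x)([1, 2, 3])
repeated = chi_l(lambda x: [x] * x)([1, 2, 3])
print(doubled, repeated)
```

An element for which f returns the empty list simply contributes nothing to the result of chi_l, which is what makes it suitable for multi-valued attributes.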
Join Operator The join preserves order, using lexicographic ordering.
⋈: if Pred : τ1, τ2 → B and f : τ1, τ2 → τ3, then
⋈(Pred, f) : <τ1>, <τ2> → <τ3>, where
N1 ⋈(Pred, f) N2 ≡ <f(p1, p2) | p1 ⇐ N1, p2 ⇐ N2, Pred(p1, p2)>
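Read literally, the definition is a nested loop with N1 as the outer list, which yields the lexicographic order mentioned above. A minimal Python sketch (the name join is hypothetical, and a nested loop is only the simplest evaluation strategy, not a claim about the intended implementation):

```python
def join(pred, f):
    """Join of N1 and N2: combine matching pairs with f, N1-major order."""
    return lambda N1, N2: [f(p1, p2)
                           for p1 in N1      # outer loop: order of N1 first
                           for p2 in N2      # inner loop: then order of N2
                           if pred(p1, p2)]


pairs = join(lambda x, y: x < y, lambda x, y: (x, y))([1, 2], [2, 3])
print(pairs)
```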
Group-By Operator The Group By operator also
has an ordered version, although we believe the unordered version will be used more often.
Γ: if f : τ1 → τ2, then
Γ(a, f) : <τ1> → <['groupedpair', <τ2, [a, <τ1>]>]>, where
Γ(a, f)(N) ≡ <['groupedpair', <ni, [a, <p | p ⇐ N, ni = f(p)>]>] | ni ⇐ f(N)>
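A Python sketch of an ordered group-by follows. It assumes that f(N) in the definition stands for the distinct values of f over N, taken in order of first occurrence; the definition does not spell this out, so that reading, like the name group_by, is an assumption.

```python
def group_by(a, f):
    """Gamma(a, f): one 'groupedpair' per distinct key f(p), in first-seen order."""
    def run(N):
        keys = []
        for p in N:
            k = f(p)
            if k not in keys:          # assumed: duplicate keys are eliminated
                keys.append(k)
        # Each result pairs the key with the labeled list of its group members.
        return [("groupedpair", [k, (a, [p for p in N if f(p) == k])])
                for k in keys]
    return run


groups = group_by("items", lambda x: x % 2)([1, 2, 3, 4])
print(groups)
```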
Variable Binding Operator This operator is definable by means of χ and χ_l, but is defined separately
since it plays a special part in translating queries into
the algebra. Assume that an environment, a mapping
of variables to nodes, is given in XML format. Then
VBind adds to it a binding for another variable.
VBind: if f : τ → <ν>, then
VBind(a, f) : <τ> → <τ, [a, ν]>, where
VBind(a, f)(N) ≡ χ_l(λp.χ(λq.['binding', p.node || <[a = q]>]) (f(p.node))) (N)
The operation of VBind is similar to that of the
d-join (dependent join) defined in [6] and [1].
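The definition can be traced in a small Python sketch. It assumes an environment is encoded as the tuple ('binding', env), where env is a list of (variable, node) pairs, and it reads [a = q] as the pair (a, q); the name vbind and the sample data are hypothetical.

```python
def vbind(a, f):
    """VBind(a, f): extend each environment with a bound to every node in f(env)."""
    def run(N):
        out = []
        for p in N:                      # p is ('binding', env)
            env = p[1]                   # p.node in the definition
            for q in f(env):             # f may return zero, one or many nodes
                out.append(("binding", env + [(a, q)]))
        return out
    return run


# Made-up data: authors per book; 'book2' has no author.
authors = {"book1": ["A1", "A2"], "book2": []}
envs = [("binding", [("b", "book1")]), ("binding", [("b", "book2")])]
extended = vbind("a", lambda env: authors[dict(env)["b"]])(envs)
print(extended)
```

An environment whose f(env) is empty produces no output environment, which matches the dependent-join behavior: a book without authors drops out of the result.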
First-Order Logic Quantifiers The quantifiers
exists(Pred) and all(Pred), both of type <τ> → B,
are defined as usual (details omitted).
Regular-Expression Matching Operator The
definition of the ρ operator is declarative, and it is
given by means of the data graph: the result of the
evaluation of ρ(RegExp)(G) is the list of all nodes in
G such that there exists a path leading to them from
rootG (the root of G) that matches the given regular expression RegExp. We provide the ρ operator as a
“black box”, so that smart implementations can be
used for it.
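Since ρ is left as a black box, the following Python sketch is only one naive reading of its declarative definition, not the intended implementation. It assumes the graph is a tree encoded as nested (label, node) tuples (no IDREF edges, hence no cycles), that a path is the dot-joined sequence of labels starting at the root label, and that the regular expression uses Python's re syntax; the name rho is hypothetical.

```python
import re


def rho(regexp, graph):
    """Return the nodes whose root-to-node label path fully matches regexp."""
    pattern = re.compile(regexp)
    out = []

    def walk(value, path):
        label, node = value
        path = path + [label]
        if pattern.fullmatch(".".join(path)):
            out.append(node)
        if isinstance(node, list):       # descend into children in order
            for child in node:
                walk(child, path)

    walk(graph, [])
    return out


# Made-up document with a title at two different depths.
doc = ("document", [("title", "T0"),
                    ("section", [("title", "T1"), ("para", "x")])])
titles = rho(r"document(\.\w+)*\.title", doc)
print(titles)
```

A real implementation would have to handle cycles introduced by IDREF edges, for example by tracking visited (node, automaton state) pairs; this sketch sidesteps that by assuming a tree.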
We mention now, without formal definition, two
operations that we are now considering. Both can be
used as a basis for more efficient scans of data.
Scan Operator In XML data, items of different
types may occur in the input in any order. For example, a bibliography database might contain books
and articles. A query that joins books and articles
might require two scans of the input. This can be
avoided if the algebra has a scan operator that has
one input, but several outputs. While such an operator has not been previously included in known algebras, in a model that has pairs and admits nesting,
there is no reason not to have an operator that returns a pair of lists. Such an operator can be used
for situations where multiple scans of a collection are
used as input to one operation. Cost-based optimization for this operator will need to take issues such as
buffering into account.
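The idea of a single-input, multi-output scan can be sketched in Python as a one-pass partition by label. The operator is not formally defined in the text, so the name scan, its signature and the sample data are assumptions.

```python
def scan(labels):
    """Partition a list of XmlValues by label in one pass; one output per label."""
    def run(N):
        buckets = {lab: [] for lab in labels}
        for p in N:                      # single pass over the input
            if p[0] in buckets:
                buckets[p[0]].append(p)  # order within each output is kept
        return tuple(buckets[lab] for lab in labels)
    return run


# Made-up bibliography in which books and articles are interleaved.
bib = [("book", "B1"), ("article", "P1"), ("book", "B2")]
books, articles = scan(["book", "article"])(bib)
print(books, articles)
```

Both outputs are materialized here, which is the simplest form of the buffering issue mentioned above.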
Pattern Matching Operator Pattern matching
is used in functional languages to compare a tree-structured input against a tree-structured pattern,
simultaneously testing for some simple conditions and
producing variable bindings. It is used in the query
language XML-QL to compare a pattern against a
tree-structured item, and to produce bindings to components. This operator is introduced to support
query languages based on patterns, but having it in
the algebra opens the way to grouping several bindings
that are performed on a localized part of the input
into one binding-producing scan. Our operator produces lexicographically ordered bindings. To be effective at the algebra level, the operator has to be
restricted to operate on items that are close together;
in the user-level QL, one may use patterns that span
IDREFs, thus connecting far-apart items. This issue,
as well as optimization issues, is now under consideration. An interesting idea here is to take a general pattern, evaluate its local part, and dynamically
transform the remainder into select boxes to be evaluated later.
3.3 Abbreviations
Many operators are definable by means of other operators; e.g., ordered versions of set-theoretic operators
are definable using append and dupelim. We mention
some that are used below and that may be implemented as primitives for optimization reasons.
• flatten: let id' : [L, <τ>] → <τ>, id'([a, n]) ≡ n, then
flatten(N) : <[L, <τ>]> → <τ> ≡ χ_l(id')(N)
• GetFld(ai) : <[a1, τ1], ..., [am, τm]> → <[ai, τi]>, where
GetFld(ai)(N) ≡ σ(λp.(p.label = ai))(N)
• GetLabel(n)¹ : <[a1, τ1], ..., [am, τm]> → <L>, where
GetLabel(n)(N) ≡ χ(λp.['label', p.label]) ◦ σ(λp.(p.node = n)) (N)
• e[x]² : <τ> → <['binding', <[x, <τ>]>]>, where
e[x] ≡ χ(λp.['binding', <[x, <p>]>]) (e)
• → ai : <[L, <[a1, τ1], ..., [am, τm]>]> → <[ai, τi]>, where
N → ai ≡ GetFld(ai)(flatten(N))
We also abbreviate GetFld(ai)(N) by ai(N).
¹ The operators GetFld and GetLabel are two sides of the
same operation: retrieving the nodes with a specified label, and
retrieving the labels with a specified node. GetLabel enables
simple querying of the schema, as is sometimes expected from
an SSD query language.
² The notation e[x] is consistent with the notation of [6].
4 Data Exceptions
Even in classical databases, data items may be missing, and this requires special treatment; witness the
null values of SQL. In SSD, flexibility is a virtue, not
a deficiency, but it requires flexible handling of exceptional situations, such as ‘a restaurant without an address’. Lorel offers a standard treatment: essentially,
a missing value in a selection predicate is evaluated
to false.
We offer a more general mechanism: the data
exception. Unlike exceptions in programming languages, such an exception is bound to a data item. It
has a name, and possibly more associated data. To
use such exceptions, one needs to define which operators produce them, how operators treat data with
exceptions in their input, and so on. An interesting
issue here is to be able to generate queries from a
given query and a set of exceptions, to allow a user
to see how real data deviates from the original query.
5 Translation Examples
In this section we demonstrate translations of queries
in an SQL-like language into the algebra. The choice
of queries is inspired by [8] and [5].
The method used is similar to the one presented in
[6]: (i) an e[x] map operation binds the first variable
(e.g. book(IN)[b]), (ii) subsequent VBind map operations bind the other variables (e.g. VBind(p, b →
book)), one at a time (before optimizations), (iii) selections (σ), joins (⋈) between different branches of
the query and groupings (Γ) are performed, and (iv) a
final Π map operation formats the output as required.
Selection Query This query retrieves titles and
authors of books whose publisher’s name is “Addison-Wesley”. In each VBind step we extend the variable
environment with one binding. Since attributes are
multi-valued, each environment produces a list of environments; the resulting environments are combined
in lexicographical ordering.
select t, a
from book b, b.publisher p, p.name n,
b.title t, b.author a
where n = "Addison-Wesley"
Π([flatten(p → t), flatten(p → a)]) ◦
σ(λp.p → n = "Addison-Wesley") ◦
VBind(a, b → book → author) ◦
VBind(t, b → book → title) ◦
VBind(n, p → publisher → name) ◦
VBind(p, b → book → publisher) ◦
book(IN)[b]
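The plan can be followed step by step in a Python sketch. It simplifies the algebra in several labeled ways: variables are bound directly to nodes rather than to singleton lists, path steps such as b → book → publisher are replaced by a hypothetical helper children, environments are ('binding', env) tuples, and the input data is made up (only the publisher name comes from the query).

```python
def children(n, label):
    """Nodes of the children of node n that carry the given label, in order."""
    return [c[1] for c in n if c[0] == label] if isinstance(n, list) else []


def bind_first(x, nodes):
    """Plays the role of e[x]: one single-variable environment per node."""
    return [("binding", [(x, n)]) for n in nodes]


def vbind(a, f):
    """Extend each environment with a bound to every node returned by f."""
    return lambda N: [("binding", env + [(a, q)])
                      for _, env in N for q in f(dict(env))]


def select(pred):
    """Keep the environments that satisfy pred, in order."""
    return lambda N: [p for p in N if pred(dict(p[1]))]


def project(names):
    """Final mapping: output the values of the named variables."""
    return lambda N: [[dict(p[1])[v] for v in names] for p in N]


# Made-up input: the children of the root, two book elements.
IN = [
    ("book", [("title", "T1"), ("author", "A1"), ("author", "A2"),
              ("publisher", [("name", "Addison-Wesley")])]),
    ("book", [("title", "T2"), ("author", "A3"),
              ("publisher", [("name", "Other")])]),
]

# The plan is read bottom-up: bind b, then p, n, t, a, then select, then project.
envs = bind_first("b", children(IN, "book"))
envs = vbind("p", lambda e: children(e["b"], "publisher"))(envs)
envs = vbind("n", lambda e: children(e["p"], "name"))(envs)
envs = vbind("t", lambda e: children(e["b"], "title"))(envs)
envs = vbind("a", lambda e: children(e["b"], "author"))(envs)
envs = select(lambda e: e["n"] == "Addison-Wesley")(envs)
result = project(["t", "a"])(envs)
print(result)
```

The first book has two authors, so its environment splits into two when a is bound; both survive the selection, in document order.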
An obvious optimization of this plan is to push the
selection right after the binding of n. A more powerful optimization, a scan optimization, is to bind the
variables p (the book’s publisher), t (the book’s title)
and a (the book’s author) in one step, thereby doing
one scan for each book element, instead of three.
Join Query This is a join between articles and
books, on one of the authors.
select a, b
from article a, book b, a.author aa,
b.author ba
where aa = ba
Π(a, b) ◦
( VBind(aa, a → author) ◦ article(IN)[a] )
⋈(aa = ba, append)
( VBind(ba, b → author) ◦ book(IN)[b] )
A different scan optimization can be used here; we can
combine article(IN)[a] and book(IN)[b], and bind a
and b with a single scan of the root.
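A Python sketch of this join plan, under stated simplifications: variables are bound directly to nodes, the hypothetical helper children replaces the path steps, environments are ('binding', env) tuples, the final projection outputs titles instead of whole elements for readability, and the input data is made up.

```python
def children(n, label):
    """Nodes of the children of node n that carry the given label, in order."""
    return [c[1] for c in n if c[0] == label] if isinstance(n, list) else []


def bind_first(x, nodes):
    """Plays the role of e[x]: one single-variable environment per node."""
    return [("binding", [(x, n)]) for n in nodes]


def vbind(a, f):
    """Extend each environment with a bound to every node returned by f."""
    return lambda N: [("binding", env + [(a, q)])
                      for _, env in N for q in f(dict(env))]


def join(pred, f):
    """Nested-loop join; the left input determines the major order."""
    return lambda N1, N2: [f(p1, p2) for p1 in N1 for p2 in N2 if pred(p1, p2)]


# Made-up input: one article and two books under the root.
IN = [
    ("article", [("title", "P1"), ("author", "A1"), ("author", "A2")]),
    ("book", [("title", "B1"), ("author", "A2")]),
    ("book", [("title", "B2"), ("author", "A3")]),
]

left = vbind("aa", lambda e: children(e["a"], "author"))(
    bind_first("a", children(IN, "article")))
right = vbind("ba", lambda e: children(e["b"], "author"))(
    bind_first("b", children(IN, "book")))

same_author = lambda p1, p2: dict(p1[1])["aa"] == dict(p2[1])["ba"]
merge = lambda p1, p2: ("binding", p1[1] + p2[1])  # 'append' of environments

joined = join(same_author, merge)(left, right)
result = [(children(dict(p[1])["a"], "title")[0],
           children(dict(p[1])["b"], "title")[0]) for p in joined]
print(result)
```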
Group-By Query This simple group-by query
counts the number of books per author, while in the
input database the authors are nested inside the book
elements.
select a, count(b)
from book b, b.author a
group-by a
Π(a, count(articles → b)) ◦ Γ(articles, a) ◦
VBind(a, b → author) ◦ book(IN)[b]
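The group-by plan can be sketched in Python as follows. The same simplifications as elsewhere apply: variables are bound directly to nodes, children is a hypothetical helper for path steps, group keys are taken in order of first occurrence (an assumption), and the data is made up. The group label 'articles' is kept from the plan above.

```python
def children(n, label):
    """Nodes of the children of node n that carry the given label, in order."""
    return [c[1] for c in n if c[0] == label] if isinstance(n, list) else []


def bind_first(x, nodes):
    """Plays the role of e[x]: one single-variable environment per node."""
    return [("binding", [(x, n)]) for n in nodes]


def vbind(a, f):
    """Extend each environment with a bound to every node returned by f."""
    return lambda N: [("binding", env + [(a, q)])
                      for _, env in N for q in f(dict(env))]


def group_by(a, f):
    """One 'groupedpair' per distinct key, keys in first-seen order."""
    def run(N):
        keys = []
        for p in N:
            if f(p) not in keys:
                keys.append(f(p))
        return [("groupedpair", [k, (a, [p for p in N if f(p) == k])])
                for k in keys]
    return run


# Made-up input: authors are nested inside the book elements.
IN = [
    ("book", [("title", "B1"), ("author", "A1"), ("author", "A2")]),
    ("book", [("title", "B2"), ("author", "A2")]),
]

envs = vbind("a", lambda e: children(e["b"], "author"))(
    bind_first("b", children(IN, "book")))
groups = group_by("articles", lambda p: dict(p[1])["a"])(envs)

# Final projection: the author and the number of environments in its group.
result = [(key, len(group[1])) for _, (key, group) in groups]
print(result)
```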
GPE Query This query uses the ρ operator for
a general path expression searching in the database.
Note that most of the work is done by ρ.
select t
from document.*.title t
where t contains "Web"
Π(t) ◦ σ(t contains "Web") ◦
ρ(document → ∗ → title)[t]
Nested Query
select t
from document d, d.section s, s.title t
where exists (select t2
from d.title t2
where t2 contains "Web")
Π(t) ◦ σ(exists(TRUE) ◦
σ(t2 contains "Web") ◦
(d → title)[t2]) ◦
VBind(t, s → title) ◦
VBind(s, d → section) ◦
document(IN)[d]
A significant optimization of this query plan can be
obtained by pushing the σ just after the binding of
d, and thus possibly saving redundant binding for s
and t.
6 Conclusions and Future Work
Although our algebra is a generalization of previous
algebras for OODBs, the XML data format poses
special challenges, first because of its specific data
model, and second because of its flexibility. We have
hinted at some of those, and we are currently investigating some of them: detection and handling of data
exceptions, exploiting schema information when it is
available, and designing special operators.
References
[1] S. Abiteboul, D. Quass, J. McHugh, J. Widom and J.L. Wiener. The Lorel query language for semistructured data. In Journal on Digital Libraries, 1(1), 1997.
[2] P. Buneman, S. Davidson, G. Hillebrand and D. Suciu. Adding structure to unstructured data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 505-516, June 1996.
[3] C. Beeri and T. Milo. Schemas for integration and translation of structured and semi-structured data. In Proc. Int. Conf. on Database Theory, ICDT’99, 1999.
[4] V. Christophides, S. Cluet and G. Moerkotte. Evaluating queries with generalized path expressions. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 413-422, Montreal, Quebec, Canada, June 1996.
[5] S. Cluet. Modeling and querying semi-structured data. LNIA 97. 1997.
[6] S. Cluet and G. Moerkotte. Nested queries in object bases. In Proc. DBPL, 1993.
[7] S. Cluet and G. Moerkotte. Nested queries in object bases. Technical Report 95-6, RWTH-Aachen, 1995.
[8] A. Deutsch, M. Fernandez, D. Florescu, A. Levy and D. Suciu. XML-QL: A query language for XML. http://www.w3.org/TR/1998/NOTE-xml-ql-19980819.
[9] M. Fernandez, D. Florescu, A. Levy and D. Suciu. A query language for a Web-site management system. SIGMOD Record, 26(3):4-11, September 1997.
[10] T. Lahiri, S. Abiteboul and J. Widom. Ozone: Integrating structured and semistructured data. http://www-db.stanford.edu/pub/papers/ozone.ps
[11] T.W. Leung, G. Mitchell, B. Subramanian, B. Vance, S.L. Vandenberg and S.B. Zdonik. The AQUA data model and algebra. In Proc. DBPL, 1993.
[12] G. Mecca, A. O. Mendelzon and P. Merialdo. Efficient queries over Web views. In EDBT’98.
[13] QL’98 Position Papers. http://www.w3.org/TandS/QL/QL98/pp.html
[14] Extensible Markup Language (XML) 1.0. http://www.w3.org/TR/REC-xml