Taxonomy of XML Schema Languages using Formal Language Theory
Makoto Murata, Dongwon Lee, Murali Mani, Kohsuke Kawaguchi

Norman May
[email protected]
3 December 2004
Outline
● Motivation
● Definitions of Tree Grammars
● Properties of Tree Grammars
● Properties of XML Schema Languages
● Document Validation Algorithms
● Summary
Motivation
[Figure: diagram relating a grammar, a validator, and a language]
Regular Tree Grammar
Definition 1 A regular tree grammar (RTG) is a 4-tuple
G = (N, T, S, P), where:
● N is a finite set of non-terminals,
● T is a finite set of terminals,
● S is a set of start symbols, where S ⊆ N, and
● P is the set of production rules of the form X → a r, where
X ∈ N, a ∈ T, and r is a regular expression over N; r is
called the content model of this production rule.
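As a minimal sketch, the 4-tuple can be written down as a small Python record. The class name `Grammar`, the `check` method, and the encoding of content models as strings such as "Para1, Para2" are assumptions made for illustration; storing P as a dictionary also assumes one production rule per non-terminal, which the definition itself does not require.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # N
    terminals: frozenset     # T
    start: frozenset         # S, must be a subset of N
    rules: dict              # P: maps X to (a, r), read as X -> a r

    def check(self):
        """Raise ValueError if the tuple violates Definition 1."""
        if not self.start <= self.nonterminals:
            raise ValueError("S must be a subset of N")
        for x, (a, r) in self.rules.items():
            if x not in self.nonterminals:
                raise ValueError(f"{x} is not a non-terminal")
            if a not in self.terminals:
                raise ValueError(f"{a} is not a terminal")
            # Every name used in the content model r must belong to N.
            unknown = set(re.findall(r"[A-Za-z_]\w*", r)) - self.nonterminals
            if unknown:
                raise ValueError(f"content model of {x} uses {sorted(unknown)}")
        return self


# A small grammar for documents with two paragraphs (empty string = epsilon).
doc_grammar = Grammar(
    nonterminals=frozenset({"Doc", "Para1", "Para2", "Pcdata"}),
    terminals=frozenset({"doc", "para", "pcdata"}),
    start=frozenset({"Doc"}),
    rules={
        "Doc": ("doc", "Para1, Para2"),
        "Para1": ("para", ""),
        "Para2": ("para", "Pcdata"),
        "Pcdata": ("pcdata", ""),
    },
).check()
```

Calling `check()` on a tuple whose start symbols are not all in N raises `ValueError`, so malformed grammars are rejected before any validation is attempted.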
Interpretation of Grammar
Definition 2 An interpretation I of a tree t against a regular
tree grammar G is a mapping from each node e in t to a
non-terminal, denoted I(e), such that:
● I(e) is a start symbol when e is the root of t, and
● for each node e and its child nodes e_0, e_1, ..., e_m, there exists
a production rule X → a r in G such that
  ○ I(e) is X,
  ○ the terminal symbol (label) of e is a, and
  ○ I(e_0) I(e_1) ... I(e_m) matches r.
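The three conditions can be checked mechanically. The sketch below is one possible reading, under these assumptions: a tree is a pair (label, children), a node is identified by its path of child indices, production rules are a dictionary X -> (a, r) with one rule per non-terminal, and content models are strings over non-terminal names that are translated to Python regular expressions. The names `matches` and `is_interpretation` are hypothetical.

```python
import re


def matches(model, names):
    """True if the sequence of non-terminal names matches the content model.

    Each name in the model is rewritten to a ';'-terminated group so that
    Python's re module can match whole names rather than characters.
    """
    pattern = re.sub(r"[A-Za-z_]\w*", lambda m: f"(?:{m.group()};)", model)
    pattern = re.sub(r"[\s,]", "", pattern)  # commas and spaces mean sequence
    return re.fullmatch(pattern, "".join(n + ";" for n in names)) is not None


def is_interpretation(tree, interp, start, rules, path=()):
    """Check Definition 2 for the node at `path` and everything below it."""
    label, children = tree
    x = interp.get(path)
    if path == () and x not in start:
        return False  # the root must map to a start symbol
    if x not in rules:
        return False
    a, r = rules[x]
    child_names = [interp.get(path + (i,)) for i in range(len(children))]
    if a != label or None in child_names or not matches(r, child_names):
        return False
    return all(
        is_interpretation(child, interp, start, rules, path + (i,))
        for i, child in enumerate(children)
    )


rules = {
    "Doc": ("doc", "Para1, Para2"),
    "Para1": ("para", ""),
    "Para2": ("para", "Pcdata"),
    "Pcdata": ("pcdata", ""),
}
tree = ("doc", [("para", []), ("para", [("pcdata", [])])])
interp = {(): "Doc", (0,): "Para1", (1,): "Para2", (1, 0): "Pcdata"}
print(is_interpretation(tree, interp, {"Doc"}, rules))
```

Swapping the non-terminals assigned to the two para nodes makes the check fail, because the sequence Para2 Para1 does not match the content model of Doc.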
Validation, Regular Tree Language
Definition 3 A tree t is valid against a regular tree grammar G
if there is an interpretation of t against G.
A set of trees is a (regular) tree language if, for some (regular)
tree grammar, all trees in this set are valid and no other trees
are valid.
Example: Regular Tree Grammar
N = {Doc, Para1, Para2, Pcdata}
T = {doc, para, pcdata}
S = {Doc}
P = {Doc → doc(Para1, Para2), Para1 → para(ε),
     Para2 → para(Pcdata), Pcdata → pcdata(ε)}
tree:           doc(para, para(pcdata))
interpretation: Doc(Para1, Para2(Pcdata))
Local Tree Grammar
Definition 4 Two different non-terminals A and B are said to
be competing with each other if
● one production rule has A in the left-hand side,
● another production rule has B in the left-hand side, and
● these two production rules share the same terminal in the
right-hand side.
Definition 5 A local tree grammar is a regular tree grammar
without competing non-terminals.
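Definitions 4 and 5 translate almost directly into code. The following sketch assumes production rules are stored as a dictionary X -> (a, r); the function names `competing_pairs` and `is_local` are hypothetical.

```python
from itertools import combinations


def competing_pairs(rules):
    """Pairs of different non-terminals whose rules share a terminal (Definition 4).

    rules: dict mapping a non-terminal X to (a, r), read as X -> a r.
    """
    return {
        frozenset((x1, x2))
        for (x1, (a1, _)), (x2, (a2, _)) in combinations(rules.items(), 2)
        if a1 == a2
    }


def is_local(rules):
    """A grammar is local when no two non-terminals compete (Definition 5)."""
    return not competing_pairs(rules)


book_rules = {
    "Book": ("book", "Author1"),
    "Article": ("article", "Author2"),
    "Author1": ("author", "Son"),
    "Author2": ("author", "Daughter"),
    "Son": ("son", ""),
    "Daughter": ("daughter", ""),
}
print(competing_pairs(book_rules))
print(is_local(book_rules))
```

In `book_rules`, Author1 and Author2 both produce the terminal author, so they are reported as a competing pair and the grammar is not local.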
Single-Type Tree Grammar
Definition 6 A single-type tree grammar is a regular tree
grammar such that
● for each production rule, non-terminals in its content model
do not compete with each other, and
● start symbols do not compete with each other.
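The two conditions of Definition 6 can be tested group by group: the start symbols form one group, and the non-terminals of each content model form another. This is a sketch under the assumption that rules are a dictionary X -> (a, r) with content models written as strings over non-terminal names; `names_in`, `compete`, and `is_single_type` are hypothetical helper names.

```python
import re
from itertools import combinations


def names_in(model):
    """Non-terminal names that occur in a content model string."""
    return set(re.findall(r"[A-Za-z_]\w*", model))


def compete(x, y, rules):
    """Two different non-terminals compete if their rules share a terminal."""
    return x != y and rules[x][0] == rules[y][0]


def is_single_type(start, rules):
    """Check Definition 6: no competition among start symbols or within
    any single content model."""
    groups = [set(start)] + [names_in(r) for _, r in rules.values()]
    return not any(
        compete(x, y, rules)
        for group in groups
        for x, y in combinations(sorted(group), 2)
    )


book_rules = {
    "Book": ("book", "Author1"),
    "Article": ("article", "Author2"),
    "Author1": ("author", "Son"),
    "Author2": ("author", "Daughter"),
    "Son": ("son", ""),
    "Daughter": ("daughter", ""),
}
doc_rules = {
    "Doc": ("doc", "Para1, Para2"),
    "Para1": ("para", ""),
    "Para2": ("para", "Pcdata"),
    "Pcdata": ("pcdata", ""),
}
print(is_single_type({"Book", "Article"}, book_rules))
print(is_single_type({"Doc"}, doc_rules))
```

The book grammar passes because its competing non-terminals never share a content model, while the doc grammar fails because Para1 and Para2 compete inside the content model of Doc.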
Example: Local/Single-Type Tree Grammar
N = {Book, Author1, Author2, Son, Article, Daughter}
T = {book, author, son, article, daughter}
S = {Book, Article}
P = {Book → book(Author1), Article → article(Author2),
     Author1 → author(Son), Author2 → author(Daughter),
     Son → son(ε), Daughter → daughter(ε)}
● This is not a local tree grammar because Author1 and Author2 are
competing (both rules use the terminal author).
● It is a single-type tree grammar because Author1 and Author2
do not appear in the same content model, and the start symbols
Book and Article do not compete.
Expressive Power
[Figure: nested sets, Local Tree Grammar ⊂ Single-Type Tree Grammar ⊂ Regular Tree Grammar]
Uniqueness of Interpretation
⇒ An interpretation is unique if validation of a tree against a
grammar yields exactly one interpretation.
Grammar                     Unique Interpretation
Local Tree Grammar          yes
Single-Type Tree Grammar    yes
Regular Tree Grammar        not always
Boolean Closure
Boolean Operation
Language            Union        Difference   Intersection
Local Tree          Not Closed   Not Closed   Closed
Single-Type Tree    Not Closed   Not Closed   Closed
Regular Tree        Closed       Closed       Closed
Approach
● Map structural features of XML Schema Languages to
production rules
● All simple types are Pcdata
● Ignore integrity constraints etc.
● Classify by the structure of production rules
Results
XML Schema Language    Grammar
DTD                    Local Tree Grammar
W3C XML Schema         Single-Type Tree Grammar
Relax NG               Regular Tree Grammar
Restrictions of the Framework
● Wildcards in W3C XML Schema make it more expressive
(Restrained Competition Grammar)
● Attribute-element constraints and interleaving in Relax NG
cannot be captured in this framework
Approach
● traverse the tree depth-first
● find the production rule(s) for the current element
● check the sequence of non-terminals obtained from validating the
child nodes against the content model of the production rule(s)
● use an automaton for checking content models
Validation of Local & Single-Type Tree Grammars
Validation is easy because
● the interpretation of a tree is unique
● the production rules are restricted:
  ○ for local tree grammars, the production rule is determined
  by the terminal
  ○ for single-type tree grammars, there are no competing
  non-terminals among the start symbols or within a content
  model.
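A minimal sketch of what these restrictions buy in practice, for the single-type case: the label of the root picks its start symbol, and below that the label of each child picks its non-terminal from the parent's content model, so one top-down pass suffices. The encoding is an assumption made for illustration: trees are (label, children) pairs, rules are a dictionary X -> (a, r) with one rule per non-terminal, and content models are strings matched with Python's `re` module rather than an explicit automaton. The function names are hypothetical, and the code does not verify that the grammar really is single-type.

```python
import re


def matches(model, names):
    """True if the sequence of non-terminal names matches the content model."""
    pattern = re.sub(r"[A-Za-z_]\w*", lambda m: f"(?:{m.group()};)", model)
    pattern = re.sub(r"[\s,]", "", pattern)
    return re.fullmatch(pattern, "".join(n + ";" for n in names)) is not None


def validate_single_type(tree, start, rules):
    """Validate `tree`, assuming (start, rules) is a single-type grammar."""

    def by_label(candidates, label):
        # Single-type property: at most one candidate carries this terminal.
        found = [x for x in candidates if rules[x][0] == label]
        return found[0] if found else None

    def visit(node, x):
        _, children = node
        _, model = rules[x]
        allowed = set(re.findall(r"[A-Za-z_]\w*", model))
        assigned = [by_label(allowed, child[0]) for child in children]
        if None in assigned or not matches(model, assigned):
            return False
        return all(visit(child, y) for child, y in zip(children, assigned))

    root_symbol = by_label(start, tree[0])
    return root_symbol is not None and visit(tree, root_symbol)


book_rules = {
    "Book": ("book", "Author1"),
    "Article": ("article", "Author2"),
    "Author1": ("author", "Son"),
    "Author2": ("author", "Daughter"),
    "Son": ("son", ""),
    "Daughter": ("daughter", ""),
}
start = {"Book", "Article"}
print(validate_single_type(("book", [("author", [("son", [])])]), start, book_rules))
print(validate_single_type(("book", [("author", [("daughter", [])])]), start, book_rules))
```

The second tree is rejected: under book, the author element can only be Author1, whose content model requires a son child.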
Validation of Regular Tree Grammars
Validation is more complicated because
● the interpretation of a tree is not necessarily unique
⇒ must keep track of multiple interpretations of the input tree.
⇒ extend the automaton to check against a sequence of sets
of non-terminals (instead of sequences of non-terminals).
⇒ accept when the last set in the sequence contains a final state of
the automaton.
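The idea of matching against a sequence of sets can be sketched as a bottom-up pass that computes, for every node, the set of non-terminals some interpretation could assign to it. This is an illustration under stated assumptions, not the algorithm from the slides: instead of an explicit automaton it tracks the set of reachable input positions for a content model given as a small tuple-based syntax tree ("eps", "sym", "seq", "alt", "star"), and all function names are hypothetical.

```python
def ends(r, sets, i):
    """Positions j such that content model r matches some choice of
    non-terminals drawn from sets[i], ..., sets[j-1] (one per position)."""
    kind = r[0]
    if kind == "eps":
        return {i}
    if kind == "sym":
        return {i + 1} if i < len(sets) and r[1] in sets[i] else set()
    if kind == "seq":
        pos = {i}
        for part in r[1:]:
            pos = {k for j in pos for k in ends(part, sets, j)}
        return pos
    if kind == "alt":
        return set().union(*(ends(part, sets, i) for part in r[1:]))
    if kind == "star":
        seen, frontier = {i}, {i}
        while frontier:  # stops once no new position is reachable
            frontier = {k for j in frontier for k in ends(r[1], sets, j)} - seen
            seen |= frontier
        return seen
    raise ValueError(f"unknown content model node: {kind}")


def candidates(tree, rules):
    """Non-terminals that some interpretation can assign to the root of `tree`."""
    label, children = tree
    child_sets = [candidates(child, rules) for child in children]
    return {
        x
        for x, (a, r) in rules.items()
        if a == label and len(child_sets) in ends(r, child_sets, 0)
    }


def is_valid(tree, start, rules):
    """A tree is valid if its root can be assigned a start symbol."""
    return bool(candidates(tree, rules) & set(start))


doc_rules = {
    "Doc": ("doc", ("seq", ("sym", "Para1"), ("sym", "Para2"))),
    "Para1": ("para", ("eps",)),
    "Para2": ("para", ("sym", "Pcdata")),
    "Pcdata": ("pcdata", ("eps",)),
}
tree = ("doc", [("para", []), ("para", [("pcdata", [])])])
print(is_valid(tree, {"Doc"}, doc_rules))
```

Because each node carries a set rather than a single non-terminal, competing non-terminals such as Para1 and Para2 need no backtracking: both are considered for every para element, and only the choices consistent with the children survive.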
Summary
● We have a framework for XML Schema languages based on
tree grammars
● Local Tree Grammars (DTD) are too restricted in their
expressiveness
● Regular Tree Grammars (Relax NG) are the most expressive and
have nice closure properties but allow ambiguous
interpretations (is this needed? ...)