A Normal Form for XML Documents

Well-designed XML Data
Marcelo Arenas and Leonid Libkin
University of Toronto
Outline
 Part 1 - Database Normalization from the 1970s and
1980s
 Part 2 - Classical theory revisited: normalizing XML
documents
 Part 3 - Classical theory re-done: new justifications
for normalization
Part 1: Classical Normalization
 Design: decide how to represent the information in a
particular data model.
• Even for simple application domains there is a large number
of ways of representing the data of interest.
 We have to design the schema of the database.
• Set of relations.
• Set of attributes for each relation.
• Set of data dependencies.
Designing a Database: An Example
 Attributes: number, title, section, room.
 Data dependency: every course number is
associated with only one title.
 Relational Schema:
BAD alternative: R(number, title, section, room), number → title
GOOD alternative: S(number, title), T(number, section, room), number → title

Problems with BAD: Update Anomaly
number   title                   section   room
CSC258   Computer Organization   1         LP266
CSC258   Computer Organization   2         GB258
CSC258   Computer Organization   3         GB248
CSC434   Database Systems        1         GB248
Title of CSC258 is changed to Computer Organization I.
Problems with BAD: Update Anomaly
number   title                     section   room
CSC258   Computer Organization I   1         LP266
CSC258   Computer Organization I   2         GB258
CSC258   Computer Organization I   3         GB248
CSC434   Database Systems          1         GB248
Title of CSC258 is changed to Computer Organization I.
The instance stores redundant information.
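The update anomaly can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relation is held as an in-memory list of tuples, and the helper names (satisfies_number_determines_title, rename_course) are hypothetical, not part of the talk.

```python
# Hypothetical in-memory copy of the BAD relation R(number, title, section, room).
R = [
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC258", "Computer Organization", 3, "GB248"),
    ("CSC434", "Database Systems", 1, "GB248"),
]

def satisfies_number_determines_title(rows):
    """Return True if no course number is paired with two different titles."""
    titles = {}
    for number, title, _section, _room in rows:
        if titles.setdefault(number, title) != title:
            return False
    return True

def rename_course(rows, number, new_title, only_first=False):
    """Rewrite the title of a course; only_first=True simulates a partial update."""
    out, done = [], False
    for row in rows:
        if row[0] == number and not (only_first and done):
            row = (row[0], new_title, row[2], row[3])
            done = True
        out.append(row)
    return out

full = rename_course(R, "CSC258", "Computer Organization I")
partial = rename_course(R, "CSC258", "Computer Organization I", only_first=True)

# Number of rows that had to change to rename a single course.
rows_touched = sum(1 for old, new in zip(R, full) if old != new)
```

Renaming one course touches every row that repeats its title, and a partial update leaves an instance that violates number → title.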
Deletion Anomaly
number   title                     section   room
CSC258   Computer Organization I   1         LP266
CSC258   Computer Organization I   2         GB258
CSC258   Computer Organization I   3         GB248
CSC434   Database Systems          1         GB248
CSC434 is not given in this term.
Deletion Anomaly
number   title                     section   room
CSC258   Computer Organization I   1         LP266
CSC258   Computer Organization I   2         GB258
CSC258   Computer Organization I   3         GB248
CSC434 is not given in this term.
Additional effect: all the information about CSC434 was deleted.
Insertion Anomaly
number   title                     section   room
CSC258   Computer Organization I   1         LP266
CSC258   Computer Organization I   2         GB258
CSC258   Computer Organization I   3         GB248
A new course is created: (CSC336, Numerical Methods)
Insertion Anomaly
number   title                     section   room
CSC258   Computer Organization I   1         LP266
CSC258   Computer Organization I   2         GB258
CSC258   Computer Organization I   3         GB248
CSC336   Numerical Methods         ?         ?
A new course is created: (CSC336, Numerical Methods)
The instance stores attributes that are not directly related.
Avoiding Update Anomalies
S:
number   title
CSC258   Computer Organization
CSC434   Database Systems
T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434   1         GB248
Title of CSC258 is changed to Computer Organization I.
Avoiding Update Anomalies
S:
number   title
CSC258   Computer Organization I
CSC434   Database Systems
T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434   1         GB248
Title of CSC258 is changed to Computer Organization I.
The instance does not store redundant information.
Avoiding Update Anomalies
S:
number   title
CSC258   Computer Organization I
CSC434   Database Systems
T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434   1         GB248
CSC434 is not given in this term.
Avoiding Update Anomalies
S:
number   title
CSC258   Computer Organization I
CSC434   Database Systems
T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434 is not given in this term.
The title of CSC434 is not removed from the instance.
Avoiding Update Anomalies
S:
number   title
CSC258   Computer Organization I
CSC434   Database Systems
T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
A new course is created: (CSC336, Numerical Methods)
Avoiding Update Anomalies
S:
number   title
CSC258   Computer Organization I
CSC434   Database Systems
CSC336   Numerical Methods
T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
A new course is created: (CSC336, Numerical Methods)
No information about sections has to be provided.
Each relation stores attributes that are directly related.
Normalization Theory
 Main idea: a normal form defines a condition that a well-designed database should satisfy.
 Normal form: syntactic condition on the database schema.
• Defined for a class of data dependencies.
 Main problems:
• How to test whether a database schema is in a particular normal form.
• How to transform a database schema into an equivalent one satisfying a particular normal form.
BCNF: a Normal Form for FDs
 Functional dependency (FD) over R(A1, …, An): X → Y, where X, Y ⊆ {A1, …, An}.
 X → Y: two rows with the same X-values must have the same Y-values.
• number → title: two rows with the same course number must have the same title.
 Key dependency: X → A1 … An
• X is a key: two distinct rows must have distinct X-values.
BCNF: a Normal Form for FDs
 Σ is a set of FDs over R(A1, …, An).
 Relation schema (R(A1, …, An), Σ) is in BCNF if for every nontrivial X → Y in Σ, X is a key.
Not in BCNF: R(number, title, section, room), number → title
 A relational schema is in BCNF if every relation schema is in BCNF.
In BCNF: S(number, title), T(number, section, room), number → title
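The BCNF condition can be tested with the standard attribute-closure computation. The sketch below is illustrative only: the function names closure and is_bcnf are hypothetical, FDs are encoded as pairs of frozensets, and only the FDs listed for a relation are checked (T is given no FDs, on the assumption that number → title does not apply to it because title is not an attribute of T).

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    """Check the listed FDs: each nontrivial X -> Y must have X as a key."""
    schema = set(schema)
    for lhs, rhs in fds:
        if rhs <= lhs:                      # trivial FD, nothing to check
            continue
        if closure(lhs, fds) != schema:     # X does not determine every attribute
            return False
    return True

number_title = (frozenset({"number"}), frozenset({"title"}))

bad = is_bcnf({"number", "title", "section", "room"}, [number_title])
good_s = is_bcnf({"number", "title"}, [number_title])
good_t = is_bcnf({"number", "section", "room"}, [])
```

Here bad is False because the closure of {number} is only {number, title}, while both relations of the GOOD alternative pass the check.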

Normalization Theory Today
 Normalization theory for relational databases was
developed in the 70s and 80s.
 Why do we need normalization theory today?
• New data models have emerged: XML.
• XML documents can contain redundant information.
 Redundant information in XML documents:
• Can be discovered if the user provides semantic information.
• Can be eliminated.
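XML Documents
[Figure: XML tree. The root courses has course children with @cno “CSC258” and @cno “CSC434”; each course has a taken_by child holding student elements with attributes @sno, @name and @grade. The student with @sno “st1” and @name “Fox” appears under both courses, with grades “B+” and “A+”.]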
XML Databases
XML Schema: (D, Σ)
D:
courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno, @name, @grade
student → ε
Σ: Two students with the same @sno value must have the same name.
Redundancy in XML
[Figure: the same XML tree. The pair @sno “st1”, @name “Fox” is stored once under each course (with grades “B+” and “A+”); an info element under courses carries @sno “st1” and @name “Fox”.]
XML Database Normalization
DTD:
courses → course*
course → @cno
course → taken_by
taken_by → student*
student → @sno, @name, @grade
student → ε
Data dependency: Two students with the same @sno value must have the same name.
XML Database Normalization
DTD:
courses → course*, info*
course → @cno
course → taken_by
taken_by → student*
student → @sno, @grade
student → ε
info → @sno, @name
Data dependency: @sno is the identifier of info elements (this replaces “two students with the same @sno value must have the same name”).
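The restructuring from the first DTD to the second can be sketched with Python's standard xml.etree.ElementTree module. This is an illustrative sketch, not the normalization algorithm of the talk: the sample document is hypothetical, attribute names are written without the @ prefix as XML requires, and the function name normalize is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical document conforming to the first DTD.
doc = """
<courses>
  <course cno="CSC258">
    <taken_by><student sno="st1" name="Fox" grade="B+"/></taken_by>
  </course>
  <course cno="CSC434">
    <taken_by><student sno="st1" name="Fox" grade="A+"/></taken_by>
  </course>
</courses>
"""

def normalize(root):
    """Move @name from student elements into info elements keyed by @sno."""
    names = {}
    for student in root.iter("student"):
        sno, name = student.get("sno"), student.attrib.pop("name")
        # The data dependency: one @sno value, one name.
        if names.setdefault(sno, name) != name:
            raise ValueError(f"sno {sno} has two different names")
    for sno, name in names.items():
        ET.SubElement(root, "info", sno=sno, name=name)
    return root

root = normalize(ET.fromstring(doc))
```

After the call, each student keeps only @sno and @grade, and the name “Fox” is stored once, in a single info element.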
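A “Non-relational” Example
[Figure: DBLP tree. The root DBLP has conf children; a conf has @title (“ICDT”) and issue children; each issue has article children with @title and @year. All articles of one issue carry the same @year (“1999” in one issue, “2001” in another); the figure also shows @year attached to the issue elements.]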
XNF: XML Normal Form
 Proposed in [AL02].
 It eliminates two types of anomalies.
 It was defined for XML functional
dependencies:
DBLP.conf.@title → DBLP.conf
DBLP.conf.issue → DBLP.conf.issue.article.@year
Part 3: What was Missing? Justification!
 What is a good database design?
• Well-known solutions: BCNF, 4NF, …
 But what is it that makes a database design good?
• Elimination of update anomalies.
• Existence of algorithms that produce good designs: lossless
decomposition, dependency preservation.
 Previous work was specific for the relational model.
• Classical problems have to be revisited in the XML context.
Justification of Normal Forms
 Problematic to evaluate XML normal forms.
• No XML update language has been standardized.
• No XML query language yet has the same “yardstick” status as relational algebra.
• We do not even know if implication of XML FDs is decidable!
 We need a different approach.
• It must be based on some intrinsic characteristics of the data.
• It must be applicable to new data models.
• It must be independent of query/update/constraint issues.
 Our approach is based on information theory.
Information Theory
 Entropy measures the amount of information
provided by a certain event.
 Assume that an event can have n different outcomes
with probabilities p1, …, pn.
Amount of information gained by knowing that event i occurred: $\log \frac{1}{p_i}$
Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$
Entropy is maximal if each $p_i = 1/n$: $\log n$
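The entropy formula is easy to evaluate numerically. The sketch below assumes base-2 logarithms (entropy measured in bits) and follows the usual convention that outcomes with probability 0 contribute nothing; the function name entropy is chosen here for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

uniform = entropy([1 / 5] * 5)   # five equally likely outcomes
certain = entropy([1.0])         # a single outcome that always occurs
```

The uniform distribution over n outcomes reaches the maximum log n, while a distribution concentrated on one outcome has entropy 0.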
Entropy and Redundancies
 Database schema: R(A,B,C), A → B
 Instance I:
A   B   C
1   2   3
1   2   4
 Pick a domain properly containing adom(I): {1, …, 6}
• One position: probability distribution P(4) = 0 and P(a) = 1/5, a ≠ 4
• Entropy: log 5 ≈ 2.322
• Another position: probability distribution P(2) = 1 and P(a) = 0, a ≠ 2
• Entropy: log 1 = 0
Entropy and Normal Forms
 Let Σ be a set of FDs over a schema S.
Theorem (S,Σ) is in BCNF if and only if for every instance I of (S,Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).
 This is a clean characterization of BCNF, but the measure is not accurate enough ...
Problems with the Measure
 The measure cannot distinguish between different
types of data dependencies.
 It cannot distinguish between different instances of
the same schema:
R(A,B,C), A → B
A   B   C          A   B   C
1   2   3          1   2   3
1   2   4          1   2   4
                   1   2   5
entropy = 0        entropy = 0
A General Measure
Instance I of schema R(A,B,C), A → B:
A   B   C
1   2   3
1   2   4
Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.
A General Measure
Instance I of schema R(A,B,C), A → B:
A   B   C
1   2   3
1   2   4
Computation: for every X ⊆ Pos(I) – {p}, compute probability distribution P(a | X), a ∈ {1, …, k}.
Example with k = 7: p is the B-position of the first row, and X keeps the values 1 and 3 of the first row and the value 2 in the B-position of the second row. The two remaining positions (marked _) range over {1, …, 7}:
A   B   C
1   a   3
_   2   _
• a = 2: 48 of the 49 substitutions satisfy A → B and keep the two rows distinct.
• a ≠ 2: 42 substitutions for each of the 6 values (the second row cannot have A = 1).
P(2 | X) = 48 / (48 + 6 · 42) = 0.16
For a ≠ 2, P(a | X) = 42 / (48 + 6 · 42) = 0.14
Entropy ≈ 2.8057
(log 7 ≈ 2.8073)
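The numbers on this slide can be reproduced by brute force. The sketch below rests on two assumptions read off the slide rather than stated on it: p is the B-position of the first row, with X fixing the first row's A = 1 and C = 3 and the second row's B = 2, and a substitution is counted only if it satisfies A → B and leaves two distinct rows. All names are illustrative.

```python
import math
from itertools import product

K = 7                        # domain {1, ..., 7}
DOMAIN = range(1, K + 1)

def satisfies_fd(rows):
    """A -> B: rows that agree on A must agree on B."""
    seen = {}
    for a, b, _c in rows:
        if seen.setdefault(a, b) != b:
            return False
    return True

def count_substitutions(b_value):
    """Count fillings of the two open positions (A and C of the second row)."""
    count = 0
    for a2, c2 in product(DOMAIN, repeat=2):
        rows = {(1, b_value, 3), (a2, 2, c2)}
        # Assumption: the instance must keep two distinct rows.
        if len(rows) == 2 and satisfies_fd(rows):
            count += 1
    return count

counts = {a: count_substitutions(a) for a in DOMAIN}
total = sum(counts.values())
probs = {a: n / total for a, n in counts.items()}
entropy = sum(p * math.log2(1 / p) for p in probs.values() if p > 0)
```

Under these assumptions the counts are 48 for a = 2 and 42 for every other value, which gives the probabilities 0.16 and 0.14 and an entropy slightly below log 7.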
A General Measure
Instance I of schema R(A,B,C), A → B:
A   B   C
1   2   3
1   2   4
Value: we consider the average over all sets X ⊆ Pos(I) – {p}.
• Average: 2.4558 < log 7 (maximal entropy)
• It corresponds to conditional entropy.
• It depends on the value of k ...
A General Measure
 Previous value: $\mathrm{Inf}_I^k(p \mid \Sigma)$
 For each k, we consider the ratio: $\mathrm{Inf}_I^k(p \mid \Sigma) / \log k$
• How close the given position p is to having the maximum possible information content.
General measure: $\mathrm{Inf}_I(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{Inf}_I^k(p \mid \Sigma) / \log k$
Basic Properties
 The measure is well defined: for every set Σ of first-order constraints defined over a schema S, every I ∈ inst(S,Σ), and every p ∈ Pos(I): $\mathrm{Inf}_I(p \mid \Sigma)$ exists.
 Bounds: $0 \le \mathrm{Inf}_I(p \mid \Sigma) \le 1$
Basic Properties
 The measure does not depend on a particular representation of constraints. If Σ1 and Σ2 are equivalent: $\mathrm{Inf}_I(p \mid \Sigma_1) = \mathrm{Inf}_I(p \mid \Sigma_2)$
 It overcomes the limitations of the simple measure:
R(A,B,C), A → B
A   B   C          A   B   C
1   2   3          1   2   3
1   2   4          1   2   4
                   1   2   5
0.875              0.781
Well-Designed Databases
Definition A database specification (S,Σ) is well-designed if for every I ∈ inst(S,Σ) and every p ∈ Pos(I), $\mathrm{Inf}_I(p \mid \Sigma) = 1$.
In other words, every position in every instance carries the maximum possible amount of information.
We would like to test this definition in the relational world ...
Relational Databases
Σ is a set of data dependencies over a schema S:
 Σ = ∅: (S,Σ) is well-designed.
 Σ is a set of FDs: (S,Σ) is well-designed if and only if (S,Σ) is in BCNF.
 Σ is a set of FDs and MVDs: (S,Σ) is well-designed if and only if (S,Σ) is in 4NF.
 Σ is a set of FDs and JDs:
• If (S,Σ) is in PJ/NF or in 5NFR, then (S,Σ) is well-designed. The converse is not true.
• A syntactic characterization of being well-designed is given in [AL03].
Relational Databases
 If (S,Σ) is in DK/NF, then (S,Σ) is well-designed. The converse is not true.
 The problem of verifying whether a relational schema is well-designed is undecidable.
 If the schema contains only universal constraints (FDs, MVDs, JDs, …), then the problem is co-NEXPTIME-complete.
• If each relation in S has at most m attributes, then the problem is $\Pi_2^p$-complete.
Now we would like to apply our definition in the XML world ...
XML Databases
 XML schema: (D,Σ).
• D is a DTD.
• Σ is a set of data dependencies over D.
 We would like to evaluate XML normal forms.
 The notion of being well-designed extends from relations to XML.
• The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).
Positions in an XML Tree
[Figure: DBLP tree. The root DBLP has conf children; one conf has @title “ICDT” and two issue children; the issues contain three article elements in total, with @title “. . .” and @year “1999”, “1999” and “2001”.]
Well-Designed XML Data
 We consider k such that adom(T) ⊆ {1, …, k}.
 For each k: $\mathrm{Inf}_T^k(p \mid \Sigma)$
 We consider the ratio: $\mathrm{Inf}_T^k(p \mid \Sigma) / \log k$
General measure: $\mathrm{Inf}_T(p \mid \Sigma) = \lim_{k \to \infty} \mathrm{Inf}_T^k(p \mid \Sigma) / \log k$
XNF: XML Normal Form
 For arbitrary XML data dependencies:
Definition An XML specification (D,Σ) is well-designed if for every T ∈ inst(D,Σ) and every p ∈ Pos(T), $\mathrm{Inf}_T(p \mid \Sigma) = 1$.
 For functional dependencies:
Theorem An XML specification (D,Σ) is in XNF if and only if (D,Σ) is well-designed.
Normalization Algorithms: BCNF
 Relation schema: R(X,Y,Z), Σ
• Not in BCNF: Σ ⊢ X → Y and Σ ⊬ X → A, for every A ∈ Z.
 Basic decomposition: replace R(X,Y,Z) by S(X,Y) and T(X,Z).
 Example: R(number, title, section, room), number → title
decomposes into S(number, title), T(number, section, room), number → title

Normalization Algorithms: BCNF
R:
number   title                   section   room
CSC258   Computer Organization   1         LP266
CSC258   Computer Organization   2         GB258
CSC434   Database Systems        1         GB248
π_{number, title}(R):
number   title
CSC258   Computer Organization
CSC434   Database Systems
π_{number, section, room}(R):
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC434   1         GB248
Normalization Algorithms: BCNF
R = S ⋈ T:
number   title                   section   room
CSC258   Computer Organization   1         LP266
CSC258   Computer Organization   2         GB258
CSC434   Database Systems        1         GB248
S:
number   title
CSC258   Computer Organization
CSC434   Database Systems
T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC434   1         GB248
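The two slides above (project R onto S and T, then join them back) can be sketched as follows. Relations are modelled as Python sets of tuples, and the helper name natural_join is an assumption made for this illustration.

```python
# The instance of R(number, title, section, room) used on the slides.
R = {
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC434", "Database Systems", 1, "GB248"),
}

# Projections of R onto (number, title) and (number, section, room).
S = {(number, title) for number, title, _s, _r in R}
T = {(number, section, room) for number, _t, section, room in R}

def natural_join(s_rows, t_rows):
    """Join S and T on their shared attribute, number."""
    return {
        (n1, title, section, room)
        for n1, title in s_rows
        for n2, section, room in t_rows
        if n1 == n2
    }

lossless = natural_join(S, T) == R
```

Because number → title holds in R, joining the projections returns exactly R, so no information is lost by the decomposition.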
Normalization Algorithms: XNF
The algorithm applies two transformations until the
schema is in XNF.
 If there is an anomalous FD of the form:
DBLP.conf.issue → DBLP.conf.issue.article.@year
then apply the “DBLP example rule”.
 Otherwise: choose a minimal anomalous FD and apply
the “University example rule”.
Normalization Algorithms
 The information-theoretic measure can also
be used for reasoning about normalization
algorithms.
 For BCNF and XNF decomposition algorithms:
Theorem After each step of these decomposition
algorithms, the amount of information in each
position does not decrease.
Future Work
 We would like to consider more complex XML
constraints and characterize good designs they give
rise to.
 We would like to characterize 3NF by using the
measure developed in this paper.
• In general, we would like to characterize “non-perfect”
normal forms.
 We would like to develop better characterizations of
normalization algorithms using our measure.
• Why is the “usual” BCNF decomposition algorithm good?
Why does it always stop?
Backup Slides
XNF: XML Normal Form
 Given a DTD D and a set of functional dependencies Σ ∪ {φ}:
(D, Σ) ⊢ φ if for any XML tree T conforming to D and satisfying Σ, it is the case that T ⊨ φ
 (D, Σ)+ = {φ | (D, Σ) ⊢ φ}
 Functional dependency φ is trivial if it is implied by the DTD alone: (D, ∅) ⊢ φ
XNF: XML Normal Form
 XML specification: a DTD D and a set of functional dependencies Σ.
 A relational DB is in BCNF if for every non-trivial functional dependency X → Y in the specification, X is a key.
 (D, Σ) is in XNF if:
For each non-trivial FD X → p.@l in (D, Σ)+, X → p is in (D, Σ)+.
A Normal Form for FDs and JDs
Let Σ be a set of FDs and JDs over a schema S:
Theorem (S,Σ) is well-designed if and only if for every R ∈ S and every nontrivial JD
$\forall (R(\bar{x}_1) \wedge R(\bar{x}_2) \wedge \cdots \wedge R(\bar{x}_m) \rightarrow R(\bar{x}))$
implied by Σ, there exists M ⊆ {1, ..., m} such that:
1. $\bar{x} \subseteq \bigcup_{i \in M} \bar{x}_i$
2. For every i, j ∈ M, Σ implies $\forall (R(\bar{x}_1) \wedge R(\bar{x}_2) \wedge \cdots \wedge R(\bar{x}_m) \rightarrow \bar{x}_i = \bar{x}_j)$
A Normal Form for FDs and JDs (cont’d)
Schema: S = { R(A,B,C) } and Σ = { ⋈[AB, AC, BC], AB → C, AC → B }.
 (S, Σ) is not in PJ/NF: {AB → ABC, AC → ABC} does not imply ⋈[AB, AC, BC].
 (S, Σ) is not in 5NFR: ⋈[AB, AC, BC] is strong-reduced and BC is not a superkey.
 (S,Σ) is well-designed.
Tree Tuples
 Paths(D): all paths in a DTD D
courses.course
courses.course.@cno
courses.course.student.@name
 We distinguish two kinds of elements: attributes (@)
and element types.
 FDs are defined by means of a relational representation
of XML documents.
XML Trees
[Figure: XML tree. The root courses (v0) has course children; course v1 has @cno “cs100” and student children v2 (@sno “123”, @name “Fox”, @grade “B+”) and v3 (@sno “456”, @name “Smith”, @grade “A-”).]
Tree Tuples
Relational representation: tree tuples - mappings
t : Paths(D) → Vertices ∪ Strings ∪ {⊥}
A tree tuple represents an XML tree:
t(courses) = v0
t(courses.course) = v1
t(courses.course.@cno) = “cs100”
t(courses.course.student) = v2
t(p) = ⊥, for the remaining paths
[Figure: the XML tree represented by t: courses (v0) with child course (v1), which has @cno “cs100” and child student (v2).]
XML Tree: set of Tree Tuples
[Figure: the XML tree of the “XML Trees” slide (courses v0; course v1 with @cno “cs100”; students v2 and v3), viewed as a set of tree tuples.]
Functional Dependencies for XML
 Expressions of the form: X → Y defined over a DTD D, where X, Y are finite non-empty subsets of Paths(D).
 XML tree T can be tested for satisfaction of X → Y if:
X ∪ Y ⊆ Paths(T) ⊆ Paths(D)
 T ⊨ X → Y if for every pair u, v of tree tuples in T:
u.X = v.X and u.X ≠ ⊥ implies u.Y = v.Y
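The satisfaction condition can be checked directly once tree tuples are written down. In the sketch below a tree tuple is a Python dict from paths to values, None stands for ⊥, and the condition u.X ≠ ⊥ is read as "no component of u.X is ⊥", which is an assumption about the slide's shorthand. The helper names and the second course (v4, “cs200”) are hypothetical.

```python
BOTTOM = None   # stands for the null value ⊥

def satisfies_fd(tree_tuples, X, Y):
    """Check X -> Y over tree tuples given as dicts from paths to values."""
    for u in tree_tuples:
        ux = [u.get(p, BOTTOM) for p in X]
        if BOTTOM in ux:                 # the condition u.X ≠ ⊥
            continue
        for v in tree_tuples:
            vx = [v.get(p, BOTTOM) for p in X]
            if ux == vx and any(u.get(p, BOTTOM) != v.get(p, BOTTOM) for p in Y):
                return False
    return True

COURSE = "courses.course"
SNO = "courses.course.student.@sno"
NAME = "courses.course.student.@name"
GRADE = "courses.course.student.@grade"

def tup(course, cno, student, sno, name, grade):
    """Build one tree tuple for the university DTD."""
    return {
        "courses": "v0",
        COURSE: course,
        "courses.course.@cno": cno,
        "courses.course.student": student,
        SNO: sno, NAME: name, GRADE: grade,
    }

T = [
    tup("v1", "cs100", "v2", "123", "Fox", "B+"),
    tup("v1", "cs100", "v3", "456", "Smith", "A-"),
    tup("v4", "cs200", "v5", "123", "Fox", "A+"),   # hypothetical second course
]

ok_name = satisfies_fd(T, [SNO], [NAME])
ok_grade = satisfies_fd(T, [COURSE, SNO], [GRADE])
broken = satisfies_fd(T + [tup("v4", "cs200", "v6", "456", "Smyth", "C")], [SNO], [NAME])
```

Both dependencies hold on T; adding a tuple that gives student “456” a second name makes the first one fail.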
FD: Examples
 University DTD:
courses → course*
course → @cno, student*
student → @sno, @name, @grade
 Two students with the same @sno value must have the same name:
courses.course.student.@sno → courses.course.student.@name
 Every student can have at most one grade in every course:
{ courses.course, courses.course.student.@sno } → courses.course.student.@grade