A Normal Form for XML Documents

Database Normalization Revisited: An Information-Theoretic Approach
Leonid Libkin
Joint work with Marcelo Arenas and Solmaz Kolahi
Sources
 M. Arenas, L. An information-theoretic approach to normal forms for relational and XML data, PODS’03; J.ACM, 2005.
 S. Kolahi, L. Redundancy vs dependency-preservation in normalisation: an information-theoretic analysis of 3NF, PODS’06.
Outline
 Part 1: Database normalization from the 1970s and 1980s
 Part 2: Classical theory redone: new justification for normal forms:
• BCNF and relatives (academic, eliminate redundancies)
• 3NF (practical, may leave some redundancies)
 Part 3: An XML application
If you haven’t taught “Intro to DB” lately…
 Design: decide how to represent the information in a
particular data model.
• Even for simple application domains there is a large number
of ways of representing the data of interest.
 We have to design the schema of the database.
• Set of relations.
• Set of attributes for each relation.
• Set of data dependencies.
Normalization Theory Today
 Normalization theory for relational databases was
developed in the 70s and 80s.
 Why do we need normalization theory today?
• New data models have emerged: XML.
• XML documents can contain redundant information.
 Redundant information in XML documents:
• Can be discovered if the user provides semantic information.
• Can be eliminated.
Designing a Database: An Example
 Attributes: number, title, section, room.
 Data dependency: every course number is
associated with only one title.
 Relational Schema:
BAD alternative:
R(number, title, section, room), number → title
GOOD alternative:
S(number, title), number → title
T(number, section, room), Ø
Problems with BAD: Redundancies and Update Anomalies
number   title                  section  room
CSC258   Computer Organization  1        LP266
CSC258   Computer Organization  2        GB258
CSC258   Computer Organization  3        GB248
CSC434   Database Systems       1        GB248
Deletion Anomaly
number   title                    section  room
CSC258   Computer Organization I  1        LP266
CSC258   Computer Organization I  2        GB258
CSC258   Computer Organization I  3        GB248
CSC434   Database Systems         1        GB248
CSC434 is not given in this term.
Deletion Anomaly
number   title                    section  room
CSC258   Computer Organization I  1        LP266
CSC258   Computer Organization I  2        GB258
CSC258   Computer Organization I  3        GB248
CSC434 is not given in this term.
Additional effect: all the information about CSC434 was deleted.
Avoiding Update Anomalies
number   title
CSC258   Computer Organization
CSC434   Database Systems

number   section  room
CSC258   1        LP266
CSC258   2        GB258
CSC258   3        GB248
CSC434   1        GB248
Avoiding Update Anomalies
number   title
CSC258   Computer Organization I
CSC434   Database Systems

number   section  room
CSC258   1        LP266
CSC258   2        GB258
CSC258   3        GB248
CSC434   1        GB248
The instance does not store redundant information.
Avoiding Update Anomalies
number   title
CSC258   Computer Organization I
CSC434   Database Systems

number   section  room
CSC258   1        LP266
CSC258   2        GB258
CSC258   3        GB248
CSC434   1        GB248
CSC434 is not given in this term.
Avoiding Update Anomalies
number   title
CSC258   Computer Organization I
CSC434   Database Systems

number   section  room
CSC258   1        LP266
CSC258   2        GB258
CSC258   3        GB248
CSC434 is not given in this term.
The title of CSC434 is not removed from the instance.
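The contrast between the two designs can be replayed in a few lines of Python. This is only an illustrative sketch: the in-memory representation (lists and sets of tuples) and the variable names are assumptions made for the example, not part of the talk.

```python
# Rows of the single-table design R(number, title, section, room).
R = [
    ("CSC258", "Computer Organization I", 1, "LP266"),
    ("CSC258", "Computer Organization I", 2, "GB258"),
    ("CSC258", "Computer Organization I", 3, "GB248"),
    ("CSC434", "Database Systems", 1, "GB248"),
]

# The decomposed design: S(number, title) and T(number, section, room).
S = {(number, title) for number, title, _, _ in R}
T = {(number, section, room) for number, _, section, room in R}

# CSC434 is not given this term: drop its sections in both designs.
R_after = [row for row in R if row[0] != "CSC434"]
T_after = {row for row in T if row[0] != "CSC434"}

titles_in_R = {number: title for number, title, _, _ in R_after}
titles_in_S = dict(S)  # S is not touched by the deletion

print("CSC434" in titles_in_R)   # False: the title disappeared with the sections
print(titles_in_S["CSC434"])     # Database Systems
```

In the single-table design the title lives only inside the section rows, so it cannot survive their deletion; in the decomposed design it is stored once in S.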
Normalization Theory
 Main idea: a normal form defines a condition that a
well designed database should satisfy.
 Normal form: syntactic condition on the database
schema.
• Defined for a class of data dependencies.
 Main problems:
• How to test whether a database schema is in a particular normal form.
• How to transform a database schema into an equivalent one satisfying a particular normal form.
BCNF: a Normal Form for FDs
 Functional dependency (FD) over R(A1, …, An): X → Y, where X, Y ⊆ {A1, …, An}.
 X → Y: two rows with the same X-values must have the same Y-values.
• number → title in our example
 Key dependency: X → A1 … An
• X is a key: two distinct rows must have distinct X-values.
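The row-based reading of X → Y and of key dependencies translates directly into a check on a concrete instance. The sketch below assumes rows are Python dicts; the function names `satisfies_fd` and `is_key` are made up for illustration. Note that an instance can only refute a dependency declared on the schema, never prove it.

```python
def satisfies_fd(rows, lhs, rhs):
    """True if any two rows with the same lhs-values have the same rhs-values."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first rhs-value seen for x; any later mismatch violates the FD.
        if seen.setdefault(x, y) != y:
            return False
    return True

def is_key(rows, attrs):
    """True if distinct rows of this instance have distinct attrs-values."""
    distinct = {tuple(sorted(row.items())) for row in rows}
    projected = {tuple(dict(r)[a] for a in attrs) for r in distinct}
    return len(projected) == len(distinct)

COLUMNS = ("number", "title", "section", "room")
rows = [dict(zip(COLUMNS, t)) for t in [
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC258", "Computer Organization", 3, "GB248"),
    ("CSC434", "Database Systems", 1, "GB248"),
]]

print(satisfies_fd(rows, ["number"], ["title"]))  # True
print(satisfies_fd(rows, ["room"], ["number"]))   # False: GB248 hosts two courses
print(is_key(rows, ["number", "section"]))        # True
print(is_key(rows, ["number"]))                   # False
```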
BCNF: a Normal Form for FDs
 Σ is a set of FDs over R(A1, …, An).
 Relation schema (R(A1, …, An), Σ) is in BCNF if for every nontrivial X → Y in Σ, X is a key.
Not in BCNF:
R(number, title, section, room), number → title
 A relational schema is in BCNF if every relation schema is in BCNF.
In BCNF:
S(number, title), number → title
T(number, section, room), Ø
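At the schema level, "X is a key" can be tested with the textbook attribute-closure computation. The following is a minimal sketch under the assumption that FDs are given as pairs of Python sets; `closure` and `is_bcnf` are illustrative names, and the test inspects exactly the FDs listed in Σ, as in the definition above.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    """True if every nontrivial FD in fds has a key as its left-hand side."""
    return all(rhs <= lhs or closure(lhs, fds) >= schema for lhs, rhs in fds)

course = {"number", "title", "section", "room"}
number_title = [({"number"}, {"title"})]

print(is_bcnf(course, number_title))               # False: number is not a key of R
print(is_bcnf({"number", "title"}, number_title))  # True: number is a key of S
print(is_bcnf({"number", "section", "room"}, []))  # True: no FDs on T
```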
BCNF Decomposition
 Relation schema: R(X,Y,Z), Σ
• Not in BCNF: Σ implies X → Y but not X → A, for every A ∈ Z.
 Basic decomposition: replace R(X,Y,Z) by S(X,Y) and T(X,Z).
 Example:
R(number, title, section, room), number → title
is replaced by
S(number, title), number → title
T(number, section, room), Ø
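The basic decomposition step is a pair of set operations. The sketch below covers only the attribute split; it does not project Σ onto the two new schemas, and the function name `basic_decomposition` is an assumption for illustration.

```python
def basic_decomposition(schema, lhs, rhs):
    """Split R(X, Y, Z) on a violating FD X -> Y into S(X, Y) and T(X, Z)."""
    x = set(lhs)
    y = set(rhs) - x            # attributes determined by X, excluding X itself
    return x | y, set(schema) - y

s, t = basic_decomposition(
    {"number", "title", "section", "room"}, {"number"}, {"title"}
)
print(sorted(s))  # ['number', 'title']
print(sorted(t))  # ['number', 'room', 'section']
```

Repeating this step on any component that still violates BCNF gives the usual decomposition procedure.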
Lossless Decomposition
R:
number   title                  section  room
CSC258   Computer Organization  1        LP266
CSC258   Computer Organization  2        GB258
CSC434   Database Systems       1        GB248

∏_{number, title}(R):
number   title
CSC258   Computer Organization
CSC434   Database Systems

∏_{number, section, room}(R):
number   section  room
CSC258   1        LP266
CSC258   2        GB258
CSC434   1        GB248
Lossless Decomposition
number   title                  section  room
CSC258   Computer Organization  1        LP266
CSC258   Computer Organization  2        GB258
CSC434   Database Systems       1        GB248

S ⋈ T

S:
number   title
CSC258   Computer Organization
CSC434   Database Systems

T:
number   section  room
CSC258   1        LP266
CSC258   2        GB258
CSC434   1        GB248
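Losslessness on this instance can be checked mechanically: project, join, and compare with the original. In the sketch below a row is a frozenset of (attribute, value) pairs; that representation and the helper names `project` and `natural_join` are assumptions made for the example.

```python
ATTRS = ("number", "title", "section", "room")
R = {frozenset(zip(ATTRS, t)) for t in [
    ("CSC258", "Computer Organization", 1, "LP266"),
    ("CSC258", "Computer Organization", 2, "GB258"),
    ("CSC434", "Database Systems", 1, "GB248"),
]}

def project(rows, attrs):
    """Projection onto attrs; duplicates collapse because the result is a set."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rows}

def natural_join(left, right):
    """Natural join: combine rows that agree on all shared attributes."""
    joined = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                joined.add(frozenset({**ld, **rd}.items()))
    return joined

S = project(R, {"number", "title"})
T = project(R, {"number", "section", "room"})
print(natural_join(S, T) == R)   # True: the split on number -> title is lossless

# A split with no shared attribute degenerates into a cross product.
U = project(R, {"section", "room"})
print(len(natural_join(S, U)))   # 6 rows, although R has only 3
```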
How to justify good designs?
 What is a good database design?
• Well-known solutions: BCNF, 4NF, 3NF…
 But what is it that makes a database design good?
• Elimination of update anomalies.
• Existence of algorithms that produce good designs: lossless
decomposition, dependency preservation.
Problems with traditional approaches
 Many papers tried to justify normal forms.
 Problem: tied very closely to the relational model.
 Relied on well-defined notions of queries/updates.
 These days we want to deal with other data models,
in particular XML.
 We need an approach that extends to other models,
in particular, XML.
Justification of Normal Forms
 Problematic to evaluate XML normal forms.
• No XML update language has been standardized.
• No XML query language yet has the same “yardstick” status as relational algebra.
• We do not even know if implication of XML FDs is decidable!
 We need a different approach.
• It must be based on some intrinsic characteristics of the data.
• It must be applicable to new data models.
• It must be independent of query/update/constraint issues.
 Our approach is based on information theory.
Information Theory
 Entropy measures the amount of information
provided by a certain event.
 Assume that an event can have n different outcomes
with probabilities p1, …, pn.
Amount of information gained by knowing that event i occurred: $\log \frac{1}{p_i}$
Average amount of information gained (entropy): $\sum_{i=1}^{n} p_i \log \frac{1}{p_i}$
Entropy is maximal if each $p_i = 1/n$: $\log n$
Entropy and Redundancies
 Database schema: R(A,B,C), A → B
 Instance I:
A  B  C
1  2  3
1  2  4
 Pick a domain properly containing adom(I): {1, …, 6}
 For the position holding 3 in column C:
• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4
• Entropy: log 5 ≈ 2.322
 For a position in column B:
• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2
• Entropy: log 1 = 0
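The two entropy values can be reproduced numerically. This is a small sketch using base-2 logarithms, which is the base that matches log 5 ≈ 2.322; the function name `entropy` and the list encoding of the distributions over {1, …, 6} are illustrative choices.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Distributions over the domain {1, ..., 6}, indexed by value - 1.
uniform_except_4 = [1/5, 1/5, 1/5, 0, 1/5, 1/5]   # P(4) = 0, P(a) = 1/5 otherwise
forced_to_2 = [0, 1, 0, 0, 0, 0]                  # P(2) = 1

print(round(entropy(uniform_except_4), 3))  # 2.322
print(entropy(forced_to_2))                 # 0.0
```

A position whose value is forced by the dependency carries no information, which is the redundancy the measure is meant to expose.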
Entropy and Normal Forms
 Let Σ be a set of FDs over a schema S.
Theorem (S,Σ) is in BCNF if and only if for every instance I of (S,Σ) and for every domain properly containing adom(I), each position carries a non-zero amount of information (entropy > 0).
 A similar result holds for 4NF and MVDs.
 This is a clean characterization of BCNF and 4NF, but the measure is not accurate enough ...
Problems with the Measure
 The measure cannot distinguish between different
types of data dependencies.
 It cannot distinguish between different instances of
the same schema:
R(A,B,C), A → B
Entropy of a position in column B:
A  B  C          A  B  C
1  2  3          1  2  3
1  2  4          1  2  4
                 1  2  5
entropy = 0      entropy = 0
A General Measure
Instance I of schema R(A,B,C), A → B:
A  B  C
1  2  3
1  2  4
Initial setting: pick a position p ∈ Pos(I) and pick k such that adom(I) ⊆ {1, …, k}. For example, k = 7.
Computation: for every X ⊆ Pos(I) – {p}, compute probability distribution P(a | X), a ∈ {1, …, k}.
Example: p is the B-position of the first row, and X keeps the C-value of the first row and the A- and B-values of the second row. The value a is placed in p and the two blanks range over {1, …, 7}:
A  B  C
_  a  3
1  2  _
• a = 2: all 7 × 7 fillings satisfy A → B, but one of them (A = 1, C = 3) makes the two rows identical, which leaves 48.
• a ≠ 2: the blank A-value must differ from 1, which leaves 6 × 7 = 42.
P(2 | X) = 48 / (48 + 6 * 42) = 0.16
For a ≠ 2, P(a | X) = 42 / (48 + 6 * 42) = 0.14
Entropy ≈ 2.8057 (log 7 ≈ 2.8073)
Value: we consider the average over all sets X ⊆ Pos(I) – {p}.
• Average: 2.4558 < log 7 (maximal entropy)
• It corresponds to conditional entropy.
• It depends on the value of k ...
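The computation can be sketched as a brute-force enumeration. The code below rests on an assumption inferred from the counts 48 and 42 rather than stated on the slide: a filling is counted only if it satisfies A → B and keeps all rows distinct. All names (`count_fillings`, `conditional_entropy`, `ric_k`) are illustrative, and the enumeration is exponential, so it is only usable on toy instances like this one.

```python
import math
from itertools import combinations, product

ROWS = [(1, 2, 3), (1, 2, 4)]   # instance I of R(A, B, C) with A -> B
POSITIONS = [(r, c) for r in range(len(ROWS)) for c in range(3)]
P = (0, 1)                      # position p: column B of the first row
K = 7

def admissible(rows):
    """Assumed condition: rows stay pairwise distinct and A -> B holds."""
    if len(set(rows)) != len(rows):
        return False
    seen = {}
    return all(seen.setdefault(a, b) == b for a, b, _ in rows)

def count_fillings(a, known):
    """Ways to fill the positions outside known and p from 1..K when p holds a."""
    free = [pos for pos in POSITIONS if pos != P and pos not in known]
    total = 0
    for values in product(range(1, K + 1), repeat=len(free)):
        grid = [list(row) for row in ROWS]
        grid[P[0]][P[1]] = a
        for (r, c), v in zip(free, values):
            grid[r][c] = v
        total += admissible([tuple(row) for row in grid])
    return total

def conditional_entropy(known):
    """Entropy of the distribution P(a | X) for X = known."""
    counts = [count_fillings(a, known) for a in range(1, K + 1)]
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c)

def ric_k():
    """Average of the conditional entropies over all X in Pos(I) - {p}."""
    others = [pos for pos in POSITIONS if pos != P]
    subsets = [set(s) for n in range(len(others) + 1)
               for s in combinations(others, n)]
    return sum(conditional_entropy(x) for x in subsets) / len(subsets)

X = {(0, 2), (1, 0), (1, 1)}    # the example set: C of row 1, A and B of row 2
print(count_fillings(2, X), count_fillings(3, X))  # 48 42
print(round(conditional_entropy(X), 4))            # 2.8057
print(ric_k())                                     # average over all 32 sets X
```

For the example set X the sketch reproduces the counts and the entropy shown above; whether the printed average agrees with 2.4558 depends on the assumed counting rule matching the definition used in the talk.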
A General Measure: Relative Information Content (RIC)
 Previous value: RIC_I^k(Σ|p)
 For each k, we consider the ratio: RIC_I^k(Σ|p) / log k
• How close the given position p is to having the maximum possible information content.
 General measure (Arenas, L. 2003):
RIC_I(Σ|p) = lim_{k → ∞} RIC_I^k(Σ|p) / log k
Basic Properties
 The measure is well defined:
For every set of first-order constraints Σ, every instance I of Σ, and every position p in I, RIC_I(Σ|p) exists.
 Bounds: 0 ≤ RIC_I(Σ|p) ≤ 1
 Closer to 1 = Less redundancy
Basic Properties
 The measure does not depend on a particular
representation of constraints.
 It overcomes the limitations of the simple
measure: R(A,B,C), A → B
RIC of a position in column B:
A  B  C          A  B  C
1  2  3          1  2  3
1  2  4          1  2  4
                 1  2  5
0.875            0.781
Well-Designed Databases
Definition A database specification (S,Σ) is well-designed if for every I ∈ inst(S,Σ) and every p ∈ Pos(I), RIC_I(Σ|p) = 1.
In other words, every position in every instance carries the maximum possible amount of information.
Relational Databases (Arenas, L.’03)
Σ is a set of data dependencies over a schema S:
 Σ = Ø: (S,Σ) is well-designed.
 Σ is a set of FDs: (S,Σ) is well-designed if and only if (S,Σ) is in BCNF.
 Σ is a set of FDs and MVDs: (S,Σ) is well-designed if and only if (S,Σ) is in 4NF.
 Σ is a set of FDs and JDs:
• If (S,Σ) is in PJ/NF or in 5NFR, then (S,Σ) is well-designed.
• The converse is not true.
• A syntactic characterization of being well-designed is given in [AL03].
Decidability Issues
 If Σ is a set of first-order integrity constraints, then the problem of verifying whether a relational schema is well-designed is undecidable.
 If Σ contains only universal constraints (FDs, MVDs, JDs, …), then the problem becomes decidable.
 High complexity (coNEXPTIME), by reduction to the complement of Bernays-Schönfinkel satisfiability.
3NF
 BCNF is the most popular textbook normal form.
 In practice 3NF is much more common.
 From Oracle's “General Database Design FAQ”:
after defining 1NF, 2NF, and 3NF, it says that there are other normal forms but “their definitions are of academic concern only, and are rarely required for practical purposes”.
Reminder: 3NF
 A candidate key: a minimal (w.r.t. subset) key.
 A prime attribute: an attribute that belongs to a candidate key.
 BCNF: For a nontrivial FD X → A, where A is an attribute, X must be a key.
 3NF (Bernstein/Zaniolo): For a nontrivial FD X → A, X must be a key OR A must be prime.
Why 3NF?
 Because some relational schemas do not have decompositions that both:
• Are in BCNF, and
• Preserve all functional dependencies.
 Example: ABC, AB → C, C → A
 On the other hand, there always exists a lossless dependency-preserving 3NF decomposition.
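The example schema ABC with AB → C and C → A can be checked against both definitions. The sketch below enumerates candidate keys by brute force, which is fine for three attributes but exponential in general; the function names and the `allow_prime_rhs` flag are assumptions made for illustration, and only the FDs listed in Σ are inspected.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            s = frozenset(combo)
            if closure(s, fds) == schema and not any(k <= s for k in keys):
                keys.append(s)
    return keys

def violations(schema, fds, allow_prime_rhs):
    """FDs X -> A that break BCNF (allow_prime_rhs=False) or 3NF (True)."""
    prime = set().union(*candidate_keys(schema, fds))
    bad = []
    for lhs, rhs in fds:
        for a in sorted(rhs - lhs):
            if closure(lhs, fds) == schema:
                continue                      # X is a key
            if allow_prime_rhs and a in prime:
                continue                      # A is prime: allowed in 3NF
            bad.append((lhs, a))
    return bad

schema = {"A", "B", "C"}
fds = [({"A", "B"}, {"C"}), ({"C"}, {"A"})]

print(sorted(sorted(k) for k in candidate_keys(schema, fds)))  # [['A', 'B'], ['B', 'C']]
print(violations(schema, fds, allow_prime_rhs=False))          # [({'C'}, 'A')]
print(violations(schema, fds, allow_prime_rhs=True))           # []
```

Every attribute is prime here, so C → A is tolerated by 3NF although C is not a key. Splitting on C → A would give the schemas CA and CB, neither of which contains all of A, B and C, so AB → C could no longer be checked within a single relation.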
Redundancies vs Dependency-Preservation
 To achieve complete elimination of redundancies (BCNF), one has to pay in terms of dependency preservation.
 Losing constraints is often undesirable (database integrity must be enforced).
 What if we only consider normal forms that guarantee lossless dependency-preserving decompositions?
 Which is best? Is it 3NF?
3NF: how low can one go?
 Is there a lower bound for RIC_I(Σ|p) if Σ is in 3NF?
PROPOSITION (Kolahi DBPL 2005)
For every ε > 0, there exists a 3NF schema Σ, an instance I and a position p such that RIC_I(Σ|p) < ε.
 BUT: I has many attributes (increasing with 1/ε)
 Σ can be further decomposed into better 3NF designs using the standard synthesis algorithm.
How good is 3NF?
 Let NF be a dependency-preserving normal form (guaranteeing lossless dependency-preserving decompositions) based on functional dependencies.
 The guaranteed information content of NF is
sup { c in [0,1] | for all schemas Σ, there exists an NF-decomposition Σ1, …, Σm such that RIC_I(Σi|p) ≥ c for all positions p in all instances I of Σi }
 PRICE(NF) = 1 – Guaranteed Information Content(NF)
Why 3NF? -- Answer
 PRICE(NF): the smallest amount of information content loss one needs to tolerate to achieve dependency-preservation.
 PRICE(NF) > 0 (BCNF isn’t dependency-preserving)
 PRICE(NF) is lower ==> NF is better.
 THEOREM (Kolahi, L.)
• PRICE(3NF) = ½.
• PRICE(NF) ≥ ½ for every other dependency-preserving NF.
Why is PRICE(3NF)=1/2?
 We said earlier that RIC_I(Σ|p) could be below any given ε > 0.
 But those schemas are “bad” 3NFs that can be further decomposed into “good” 3NFs, and for “good” 3NFs we guarantee PRICE=1/2.
 “Good” 3NF = 3NF schemas produced by the standard synthesis algorithm.
 So the result justifies not only 3NF but also the algorithm that is most commonly used to produce 3NF designs.
Comparing normal forms
 We can use the information-theoretic measure to compare normal forms.
 Define, for a condition P, the set of possible values of the information-theoretic measure:
POSS_P(m) = { RIC_I(Σ|p) | I has m attributes, Σ satisfies P }
 Define the GAIN function:
GAIN_{P1/P2}(m) = inf POSS_{P1}(m) / inf POSS_{P2}(m)
Comparison of normal forms
THEOREM (Kolahi, L.) For all m > 2:
• GAIN_{3NF/All}(m) = 2
• GAIN_{“good” 3NF/All}(m) = 2^(m-2)
• GAIN_{“good” 3NF/3NF}(m) = 2^(m-3)
The measure extends beyond relations
 It can be used to reason about designs in other data models:
• Nested relational
• XML
 In particular, it can be used to justify a normal form proposed recently for XML:
• Called XNF (Arenas, L., 2002)
• Generalizes BCNF to XML documents
XML Databases
 XML schema: (D,Σ).
• D is a DTD.
• Σ is a set of data dependencies over D.
 We would like to evaluate XML normal forms.
 The notion of being well-designed extends from relations to XML.
• The measure is robust; we just need to define the set of positions in an XML tree T: Pos(T).
Positions in an XML Tree
DBLP
  conf
    title: “ICDT”
    issue
      article
        author: “Dong”
        title: “. . .”
        @year: “1999”
      article
        author: “Jarke”
        title: “. . .”
        @year: “1999”
    issue
      article
        title: “. . .”
        @year: “2001”
  conf
XML normalization
[Figure: the DBLP tree after normalization. The @year attribute moves from each article up to its issue: “1999” on the first issue, “2001” on the second.]
XNF: an XML normal form
 XNF is achieved by repeated transformations of two
kinds:
• As above in the DBLP example, and
• Splitting multiple attributes of the same element
type in the same manner as in the case of BCNF
normalization for relations.
 There is also a formal definition which is a natural
analog of BCNF in the XML context.
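The first kind of transformation, moving @year from the articles up to their issue as in the DBLP example, can be sketched with Python's standard `xml.etree.ElementTree`. The sample document, the attribute name `year`, and the function `move_year_to_issue` are assumptions made for illustration, as is the dependency the code relies on (all articles of one issue share the same year); this is not the formal XNF algorithm.

```python
import xml.etree.ElementTree as ET

DOC = """
<DBLP>
  <conf>
    <title>ICDT</title>
    <issue>
      <article year="1999"><author>Dong</author><title>...</title></article>
      <article year="1999"><author>Jarke</author><title>...</title></article>
    </issue>
    <issue>
      <article year="2001"><title>...</title></article>
    </issue>
  </conf>
</DBLP>
"""

def move_year_to_issue(root):
    """Store the year once per issue instead of once per article."""
    for issue in root.iter("issue"):
        articles = issue.findall("article")
        years = {a.get("year") for a in articles if a.get("year") is not None}
        if len(years) > 1:
            # The assumed dependency does not hold, so the move would lose data.
            raise ValueError("articles of one issue disagree on year")
        for year in years:
            issue.set("year", year)
        for article in articles:
            article.attrib.pop("year", None)
    return root

root = move_year_to_issue(ET.fromstring(DOC))
print([issue.get("year") for issue in root.iter("issue")])      # ['1999', '2001']
print([article.get("year") for article in root.iter("article")])  # [None, None, None]
```

After the move the year "1999" is stored once instead of twice, which removes the redundant positions from the tree.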
Well-Designed XML Data
 We consider k such that adom(T) ⊆ {1, …, k}.
 For each k: RIC_T^k(Σ|p)
 We consider the ratio: RIC_T^k(Σ|p) / log k
 General measure:
RIC_T(Σ|p) = lim_{k → ∞} RIC_T^k(Σ|p) / log k
XNF: XML Normal Form
 For arbitrary XML data dependencies:
Definition An XML specification (D,Σ) is well-designed if for every T ∈ inst(D,Σ) and every p ∈ Pos(T), RIC_T(Σ|p) = 1.
 For functional dependencies:
Theorem An XML specification (D,Σ) is in XNF if and only if (D,Σ) is well-designed.
Future Work
 What is an analog of 3NF for XML?
 We would like to develop better characterizations of
normalization algorithms using our measure.
• Why is the “usual” BCNF decomposition algorithm good?
• Why does it always stop?
 What else can this measure be used for?
 What about nonuniform distributions?
• Are they meaningful here?
• If so, how do the results change?