Database Normalization
Revisited:
An information-theoretic
approach
Leonid Libkin
Joint work with Marcelo Arenas and Solmaz Kolahi
Sources
M. Arenas, L. An information-theoretic approach to
normal forms for relational and XML data,
PODS’03; J.ACM, 2005.
S. Kolahi, L. Redundancy vs dependency-preservation
in normalisation: an information-theoretic analysis of 3NF,
PODS’06.
Outline
Part 1 - Database Normalization from the 1970s and
1980s
Part 2: Classical theory re-done: new justification for
normal forms:
• BCNF and relatives (academic, eliminate redundancies)
• 3NF (practical, may leave some redundancies)
Part 3: An XML application
If you haven’t taught “Intro to DB” lately…
Design: decide how to represent the information in a
particular data model.
• Even for simple application domains there is a large number
of ways of representing the data of interest.
We have to design the schema of the database.
• Set of relations.
• Set of attributes for each relation.
• Set of data dependencies.
Normalization Theory Today
Normalization theory for relational databases was
developed in the 70s and 80s.
Why do we need normalization theory today?
• New data models have emerged: XML.
• XML documents can contain redundant information.
Redundant information in XML documents:
• Can be discovered if the user provides semantic
information.
• Can be eliminated.
Designing a Database: An Example
Attributes: number, title, section, room.
Data dependency: every course number is
associated with only one title.
Relational schema:
BAD alternative:
R(number, title, section, room), number → title
GOOD alternative:
S(number, title), number → title
T(number, section, room), Ø
Problems with BAD: Redundancies and
Update Anomalies
number   title                     section   room
CSC258   Computer Organization     1         LP266
CSC258   Computer Organization     2         GB258
CSC258   Computer Organization     3         GB248
CSC434   Database Systems          1         GB248
Deletion Anomaly
number   title                     section   room
CSC258   Computer Organization I   1         LP266
CSC258   Computer Organization I   2         GB258
CSC258   Computer Organization I   3         GB248
CSC434   Database Systems          1         GB248
CSC434 is not given in this term.
Deletion Anomaly
number   title                     section   room
CSC258   Computer Organization I   1         LP266
CSC258   Computer Organization I   2         GB258
CSC258   Computer Organization I   3         GB248
CSC434 is not given in this term.
Additional effect: all the information about CSC434 was
deleted.
Avoiding Update Anomalies
number   title
CSC258   Computer Organization
CSC434   Database Systems

number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434   1         GB248
Avoiding Update Anomalies
number   title
CSC258   Computer Organization I
CSC434   Database Systems

number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434   1         GB248
The instance does not store redundant information.
Avoiding Update Anomalies
number   title
CSC258   Computer Organization I
CSC434   Database Systems

number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434   1         GB248
CSC434 is not given in this term.
Avoiding Update Anomalies
number   title
CSC258   Computer Organization I
CSC434   Database Systems

number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC258   3         GB248
CSC434 is not given in this term.
The title of CSC434 is not removed from the instance.
Normalization Theory
Main idea: a normal form defines a condition that a
well designed database should satisfy.
Normal form: syntactic condition on the database
schema.
• Defined for a class of data dependencies.
Main problems:
• How to test whether a database schema is in a particular
normal form.
• How to transform a database schema into an equivalent
one satisfying a particular normal form.
BCNF: a Normal Form for FDs
Functional dependency (FD) over R(A1, …, An):
X → Y, where X, Y ⊆ {A1, …, An}.
X → Y: two rows with the same X-values must
have the same Y-values.
• number → title in our example
Key dependency: X → A1 … An
• X is a key: two distinct rows must have distinct X-values.
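The definition of X → Y can be checked directly on an instance. The following is a minimal sketch, assuming rows are represented as Python dictionaries; the function name satisfies_fd and the variable courses are illustrative and not part of the original slides.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if any two rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first row with a given lhs-value fixes the rhs-value for that key.
        if seen.setdefault(key, val) != val:
            return False
    return True


# The course instance used in the earlier slides.
courses = [
    {"number": "CSC258", "title": "Computer Organization", "section": "1", "room": "LP266"},
    {"number": "CSC258", "title": "Computer Organization", "section": "2", "room": "GB258"},
    {"number": "CSC258", "title": "Computer Organization", "section": "3", "room": "GB248"},
    {"number": "CSC434", "title": "Database Systems", "section": "1", "room": "GB248"},
]

print(satisfies_fd(courses, ["number"], ["title"]))  # number -> title holds
print(satisfies_fd(courses, ["room"], ["number"]))   # room -> number fails (GB248)
```

A key dependency is the special case where rhs lists all attributes of the relation.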
BCNF: a Normal Form for FDs
Σ is a set of FDs over R(A1, …, An).
Relation schema R(A1, …, An), Σ is in BCNF if for
every nontrivial X → Y in Σ⁺ (the FDs implied by Σ), X is a key.
Not in BCNF: R(number, title, section, room), number → title
A relational schema is in BCNF if every relation
schema is in BCNF.
In BCNF:
S(number, title), number → title
T(number, section, room), Ø
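The BCNF condition can be tested with attribute closures. This is a sketch under the assumption that FDs are given as (lhs, rhs) pairs of Python sets; the names closure and is_bcnf are illustrative. For a single relation schema it is enough to test the listed FDs: if some implied FD violates BCNF, one of the listed FDs does as well.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_bcnf(schema, fds):
    """True if every nontrivial listed FD has a left-hand side that is a key."""
    schema = set(schema)
    for lhs, rhs in fds:
        if rhs <= lhs:                      # trivial FD, nothing to check
            continue
        if not schema <= closure(lhs, fds):  # lhs does not determine all attributes
            return False
    return True


fds = [({"number"}, {"title"})]
print(is_bcnf({"number", "title", "section", "room"}, fds))  # the BAD schema R
print(is_bcnf({"number", "title"}, fds))                     # S
print(is_bcnf({"number", "section", "room"}, []))            # T, no FDs
```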
BCNF Decomposition
Relation schema: R(X,Y,Z), Σ
• Not in BCNF: Σ implies X → Y but, for every A ∈ Z,
Σ does not imply X → A.
Basic decomposition: replace R(X,Y,Z) by S(X,Y)
and T(X,Z).
Example:
R(number, title, section, room), number → title
is replaced by
S(number, title), number → title
T(number, section, room), Ø
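One basic decomposition step can be sketched as follows. The sketch takes Y to be everything that X determines (the closure of X), which is one common way to choose the split; the name bcnf_split is illustrative, and a full algorithm would also have to project the FDs onto S and T and repeat.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_split(schema, fds):
    """Return (S, T) for the first listed FD that violates BCNF, else None."""
    schema = frozenset(schema)
    for lhs, rhs in fds:
        x_plus = frozenset(closure(lhs, fds)) & schema
        if not rhs <= lhs and x_plus != schema:
            s = x_plus                              # S(X, Y)
            t = frozenset(lhs) | (schema - x_plus)  # T(X, Z)
            return s, t
    return None


fds = [({"number"}, {"title"})]
print(bcnf_split({"number", "title", "section", "room"}, fds))
```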
Lossless Decomposition
R:
number   title                   section   room
CSC258   Computer Organization   1         LP266
CSC258   Computer Organization   2         GB258
CSC434   Database Systems        1         GB248

∏number, title (R):
number   title
CSC258   Computer Organization
CSC434   Database Systems

∏number, section, room (R):
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC434   1         GB248
Lossless Decomposition
R = S ⋈ T, where S = ∏number, title (R) and T = ∏number, section, room (R):
number   title                   section   room
CSC258   Computer Organization   1         LP266
CSC258   Computer Organization   2         GB258
CSC434   Database Systems        1         GB248

S:
number   title
CSC258   Computer Organization
CSC434   Database Systems

T:
number   section   room
CSC258   1         LP266
CSC258   2         GB258
CSC434   1         GB248
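The lossless property can be checked on this instance by projecting and joining back. This is a small sketch with set semantics; the helper names project and natural_join are illustrative, and the second split (on attribute sets with nothing in common) is added only as a contrast.

```python
def project(rows, attrs):
    """Projection with set semantics; each tuple is a frozenset of (attr, value)."""
    return {frozenset((a, row[a]) for a in attrs) for row in rows}


def natural_join(left, right):
    """Natural join of two sets of tuples produced by project()."""
    joined = set()
    for l in left:
        for r in right:
            ld, rd = dict(l), dict(r)
            # Tuples combine only if they agree on all shared attributes.
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                joined.add(frozenset({**ld, **rd}.items()))
    return joined


R = [
    {"number": "CSC258", "title": "Computer Organization", "section": "1", "room": "LP266"},
    {"number": "CSC258", "title": "Computer Organization", "section": "2", "room": "GB258"},
    {"number": "CSC434", "title": "Database Systems", "section": "1", "room": "GB248"},
]
R_set = project(R, ["number", "title", "section", "room"])

S = project(R, ["number", "title"])
T = project(R, ["number", "section", "room"])
print(natural_join(S, T) == R_set)       # the decomposition from the slide

S_bad = project(R, ["number", "title"])
T_bad = project(R, ["section", "room"])
print(natural_join(S_bad, T_bad) == R_set)  # a split that loses information
```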
How to justify good designs?
What is a good database design?
• Well-known solutions: BCNF, 4NF, 3NF…
But what is it that makes a database design good?
• Elimination of update anomalies.
• Existence of algorithms that produce good designs: lossless
decomposition, dependency preservation.
Problems with traditional approaches
Many papers tried to justify normal forms.
Problem: tied very closely to the relational model.
Relied on well-defined notions of queries/updates.
These days we want to deal with other data models,
in particular XML.
We need an approach that extends to other models,
in particular, XML.
Justification of Normal Forms
Problematic to evaluate XML normal forms.
• No XML update language has been standardized.
• No XML query language yet has the same “yardstick” status
as relational algebra.
• We do not even know if implication of XML FDs is decidable!
We need a different approach.
• It must be based on some intrinsic characteristics of the
data.
• It must be applicable to new data models.
• It must be independent of query/update/constraint issues.
Our approach is based on information theory.
Information Theory
Entropy measures the amount of information
provided by a certain event.
Assume that an event can have n different outcomes
with probabilities p1, …, pn.
Amount of information gained
by knowing that event i occurred: log(1/p_i)
Average amount of
information gained (entropy): ∑_{i=1}^{n} p_i log(1/p_i)
Entropy is maximal if each p_i = 1/n: log n
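The entropy formula is short enough to state as code. This is a minimal sketch, assuming base-2 logarithms (which matches the values log 5 ≈ 2.322 and log 7 ≈ 2.807 quoted on later slides); the function name entropy is illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits: sum of p * log2(1/p) over outcomes with p > 0."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


print(entropy([1 / 7] * 7))         # uniform over 7 outcomes: log2(7)
print(entropy([0.5, 0.25, 0.25]))   # a skewed distribution has lower entropy
print(entropy([1.0, 0.0, 0.0]))     # a certain outcome carries no information
```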
Entropy and Redundancies
Database schema: R(A,B,C), A → B
Instance I:
A   B   C
1   2   3
1   2   4
Pick a domain properly containing adom(I): {1, …, 6}
For the position holding the value 3 in column C:
• Probability distribution: P(4) = 0 and P(a) = 1/5, a ≠ 4
• Entropy: log 5 ≈ 2.322
For a position in column B:
• Probability distribution: P(2) = 1 and P(a) = 0, a ≠ 2
• Entropy: log 1 = 0
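The two distributions above can be reproduced by brute force. This sketch rests on one reading of the simple measure, which is an assumption here: every value that keeps the FD satisfied and keeps the two tuples distinct is considered equally likely. All function names are illustrative.

```python
import math


def satisfies_fd(rows, lhs, rhs):
    """Return True if any two rows that agree on lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def possible_values(rows, pos, domain, fds):
    """Values that can stand at pos without breaking an FD or merging two rows."""
    r, attr = pos
    ok = []
    for a in domain:
        candidate = [dict(row) for row in rows]
        candidate[r][attr] = a
        distinct = len({tuple(sorted(row.items())) for row in candidate}) == len(rows)
        if distinct and all(satisfies_fd(candidate, lhs, rhs) for lhs, rhs in fds):
            ok.append(a)
    return ok


def simple_entropy(rows, pos, domain, fds):
    """Entropy of the uniform distribution over the possible values."""
    return math.log2(len(possible_values(rows, pos, domain, fds)))


instance = [{"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 2, "C": 4}]
FDS = [(("A",), ("B",))]
DOMAIN = range(1, 7)

print(possible_values(instance, (0, "C"), DOMAIN, FDS))  # 4 is excluded
print(simple_entropy(instance, (0, "C"), DOMAIN, FDS))   # log2(5)
print(simple_entropy(instance, (0, "B"), DOMAIN, FDS))   # only 2 is possible
```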
Entropy and Normal Forms
Let Σ be a set of FDs over a schema S.
Theorem (S,Σ) is in BCNF if and only if for every instance I of
(S,Σ) and for every domain properly containing adom(I), each
position carries a non-zero amount of information (entropy > 0).
A similar result holds for 4NF and MVDs.
This is a clean characterization of BCNF and 4NF, but
the measure is not accurate enough ...
Problems with the Measure
The measure cannot distinguish between different
types of data dependencies.
It cannot distinguish between different instances of
the same schema:
R(A,B,C), A → B
A   B   C          A   B   C
1   2   3          1   2   3
1   2   4          1   2   4
                   1   2   5
entropy = 0        entropy = 0
(in both instances, the entropy shown is that of a position in column B)
A General Measure
Instance I of schema
R(A,B,C), A → B :
A   B   C
1   2   3
1   2   4
Initial setting: pick a position p ∈ Pos(I) and pick k
such that adom(I) ⊆ {1, …, k}. For example, k = 7.
A General Measure
Instance I of schema
R(A,B,C), A → B :
A   B   C
1   2   3
1   2   4
Computation: for every X ⊆ Pos(I) – {p}, compute
probability distribution P(a | X), a ∈ {1, …, k}.
Example with k = 7, where p is the B-position of the first row and the
values of A in the first row and C in the second row are unknown:
P(2 | X) = 48 / (48 + 6 * 42) = 0.16
For a ≠ 2, P(a | X) = 42 / (48 + 6 * 42) = 0.14
Entropy ≈ 2.8057
(log 7 ≈ 2.8073)
A General Measure
Instance I of schema
R(A,B,C), A → B :
A   B   C
1   2   3
1   2   4
Value: we consider the average over all sets
X ⊆ Pos(I) – {p}.
• Average: 2.4558 < log 7 (maximal entropy)
• It corresponds to conditional entropy.
• It depends on the value of k ...
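The computation can be sketched by brute force for this small instance. The sketch assumes that P(a | X) is the share of completions of the unknown positions, with values from {1, …, k}, that satisfy A → B and keep the two tuples distinct; this reading reproduces the counts 48 and 42 quoted for the example with k = 7. Names such as distribution and average_entropy are illustrative.

```python
import math
from itertools import combinations, product

INSTANCE = [(1, 2, 3), (1, 2, 4)]  # rows over attributes A, B, C
POSITIONS = [(r, c) for r in range(len(INSTANCE)) for c in range(3)]


def consistent(rows):
    """Rows are pairwise distinct and satisfy A -> B."""
    if len(set(rows)) != len(rows):
        return False
    b_of = {}
    for a, b, _ in rows:
        if b_of.setdefault(a, b) != b:
            return False
    return True


def distribution(p, unknown, k):
    """P(a | X) for a = 1..k, counting consistent fillings of the unknown positions."""
    counts = []
    for a in range(1, k + 1):
        n = 0
        for fill in product(range(1, k + 1), repeat=len(unknown)):
            rows = [list(row) for row in INSTANCE]
            rows[p[0]][p[1]] = a
            for (r, c), v in zip(unknown, fill):
                rows[r][c] = v
            n += consistent([tuple(row) for row in rows])
        counts.append(n)
    total = sum(counts)
    return [n / total for n in counts]


def entropy(probs):
    """Shannon entropy in bits."""
    return sum(q * math.log2(1 / q) for q in probs if q > 0)


def average_entropy(p, k):
    """Average of the entropies over all subsets of the positions other than p."""
    others = [q for q in POSITIONS if q != p]
    subsets = [s for n in range(len(others) + 1) for s in combinations(others, n)]
    return sum(entropy(distribution(p, s, k)) for s in subsets) / len(subsets)


p = (0, 1)                                     # B-position of the first row
probs = distribution(p, [(0, 0), (1, 2)], 7)   # A of row 1 and C of row 2 unknown
print(probs[1], probs[0])                      # P(2 | X) and P(a | X) for a != 2
print(entropy(probs))
print(average_entropy(p, 7), math.log2(7))     # compare with the average quoted above
```

Because the average runs over all subsets of the other positions, it does not matter whether X is read as the set of known or of unknown positions.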
A General Measure: Relative Information
Content (RIC)
Previous value: RIC_I^k(Σ|p)
For each k, we consider the ratio: RIC_I^k(Σ|p) / log k
• How close the given position p is to having the maximum possible
information content.
General measure (Arenas, L. 2003):
RIC_I(Σ|p) = lim_{k→∞} RIC_I^k(Σ|p) / log k
Basic Properties
The measure is well defined:
For every set of first-order constraints Σ, every
instance I of Σ, and every position p in I,
RIC_I(Σ|p) exists.
Bounds: 0 ≤ RIC_I(Σ|p) ≤ 1
Closer to 1 = Less redundancy
Basic Properties
The measure does not depend on a particular
representation of constraints.
It overcomes the limitations of the simple
measure: R(A,B,C), A → B
A   B   C          A   B   C
1   2   3          1   2   3
1   2   4          1   2   4
                   1   2   5
Information content of a position in column B:
0.875              0.781
Well-Designed Databases
Definition A database specification (S,Σ) is well-designed
if for every I ∈ inst(S,Σ) and every p ∈ Pos(I),
RIC_I(Σ|p) = 1.
In other words, every position in every instance
carries the maximum possible amount of
information.
Relational Databases (Arenas, L.’03)
Σ is a set of data dependencies over a schema S:
Σ = Ø: (S,Σ) is well-designed.
Σ is a set of FDs: (S,Σ) is well-designed if and only if
(S,Σ) is in BCNF.
Σ is a set of FDs and MVDs: (S,Σ) is well-designed if
and only if (S,Σ) is in 4NF.
Σ is a set of FDs and JDs:
• If (S,Σ) is in PJ/NF or in 5NFR, then (S,Σ) is well-designed.
• The converse is not true.
• A syntactic characterization of being well-designed is given
in [AL03].
Decidability Issues
If Σ is a set of First-Order integrity constraints, then
the problem of verifying whether a relational schema
is well-designed is undecidable.
If Σ contains only universal constraints (FDs, MVDs,
JDs, …), then the problem becomes decidable.
High complexity (coNEXPTIME) by reduction to the
(complement) of Bernays-Schönfinkel satisfiability.
3NF
BCNF is the most popular textbook normal form.
In practice 3NF is much more common.
From Oracle's “General Database Design FAQ”:
after defining 1NF, 2NF, and 3NF, it says
that there are other normal forms but “their
definitions are of academic concern only, and are
rarely required for practical purposes”.
Reminder: 3NF
A candidate key: a minimal (wrt subset) key
A prime attribute: an attribute that belongs to a
candidate key.
BCNF: For a nontrivial FD X → A, where A is an
attribute, X must be a key.
3NF (Bernstein/Zaniolo): For a nontrivial FD X → A,
X must be a key OR A must be prime.
Why 3NF?
Because some relational schemas do not have
decompositions that are both:
• In BCNF, and
• Preserve all functional dependencies
Example: ABC, AB → C, C → A
On the other hand, there always exists a lossless
dependency preserving 3NF decomposition.
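The example schema can be checked against both definitions. This is a sketch assuming FDs are given as (lhs, rhs) pairs of sets; candidate keys are found by exhaustive search, which is only practical for small schemas, and the function names are illustrative.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(schema, fds):
    """All minimal keys, trying attribute sets from smallest to largest."""
    schema = frozenset(schema)
    keys = []
    for n in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), n):
            s = frozenset(combo)
            if any(k <= s for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(s, fds) >= schema:
                keys.append(s)
    return keys


def is_bcnf(schema, fds):
    """Every nontrivial listed FD has a key on its left-hand side."""
    schema = frozenset(schema)
    return all(rhs <= lhs or closure(lhs, fds) >= schema for lhs, rhs in fds)


def is_3nf(schema, fds):
    """Every nontrivial listed FD X -> A has X a key or A prime."""
    schema = frozenset(schema)
    prime = set().union(*candidate_keys(schema, fds))
    for lhs, rhs in fds:
        if closure(lhs, fds) >= schema:
            continue
        if any(a not in prime for a in rhs - lhs):
            return False
    return True


abc_fds = [({"A", "B"}, {"C"}), ({"C"}, {"A"})]
print(candidate_keys("ABC", abc_fds))                  # AB and BC
print(is_3nf("ABC", abc_fds), is_bcnf("ABC", abc_fds))
```

Here C → A violates BCNF because C is not a key, but it is allowed in 3NF because A is prime.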
Redundancies vs Dependency-Preservation
To achieve complete elimination of redundancies
(BCNF), one has to pay in terms of dependency
preservation.
Losing constraints is often undesirable (database
integrity must be enforced).
What if we only consider normal forms that
guarantee lossless dependency-preserving
decompositions?
Which is best? Is it 3NF?
3NF: how low can one go?
Is there a lower bound for RIC_I(Σ|p) if Σ is in 3NF?
PROPOSITION (Kolahi DBPL 2005)
For every ε > 0, there exists a 3NF schema Σ, an
instance I and a position p so that
RIC_I(Σ|p) < ε.
BUT: I has many attributes (increasing with 1/ε)
Σ can be further decomposed into better 3NF
designs using the standard synthesis algorithm.
How good is 3NF?
Let NF be a dependency-preserving normal form (guaranteeing
lossless dependency-preserving decompositions) based on
functional dependencies.
The guaranteed information content of NF is
sup { c ∈ [0,1] | for every schema Σ there
exists an NF-decomposition
Σ1, …, Σm such that RIC_I(Σi|p) ≥ c
for all positions p in all instances I of Σi }
PRICE(NF) = 1 – Guaranteed Information Content(NF)
Why 3NF? -- Answer
PRICE(NF): the smallest amount of information
content loss one needs to tolerate to achieve
dependency-preservation.
PRICE(NF) > 0 (BCNF isn’t dependency-preserving)
PRICE(NF) is lower ==> NF is better.
THEOREM (Kolahi, L.)
• PRICE(3NF) = ½.
• PRICE(NF) ≥ ½ for every other dependency-preserving NF.
Why is PRICE(3NF)=1/2?
We said earlier that RIC_I(Σ|p) could be below any
given ε > 0.
But those schemas are “bad” 3NFs that can be
further decomposed into “good” 3NFs, and for
“good” 3NFs we guarantee PRICE=1/2.
“Good” 3NF = 3NF schemas produced by the
standard synthesis algorithm.
So the result justifies not only 3NF but also the
algorithm that is most commonly used to produce
3NF designs.
Comparing normal forms
We can use the information-theoretic measure to
compare normal forms.
Define, for a condition P, the set of possible values
of the information-theoretic measure:
POSS_P(m) = { RIC_I(Σ|p) | I has m attributes,
Σ satisfies P }
Define the GAIN function:
GAIN_{P1/P2}(m) = inf POSS_{P1}(m) / inf POSS_{P2}(m)
Comparison of normal forms
THEOREM (Kolahi, L.) For all m > 2:
• GAIN_{3NF/All}(m) = 2
• GAIN_{“good” 3NF/All}(m) = 2^{m-2}
• GAIN_{“good” 3NF/3NF}(m) = 2^{m-3}
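As a consistency check on the three values, the definition of GAIN as a ratio of infima lets the 3NF term cancel, assuming that infimum is nonzero. Read with exponents m-2 and m-3, the values in the theorem fit together:

```latex
\mathrm{GAIN}_{\text{good 3NF}/\text{All}}(m)
  = \frac{\inf \mathrm{POSS}_{\text{good 3NF}}(m)}{\inf \mathrm{POSS}_{\text{3NF}}(m)}
    \cdot
    \frac{\inf \mathrm{POSS}_{\text{3NF}}(m)}{\inf \mathrm{POSS}_{\text{All}}(m)}
  = \mathrm{GAIN}_{\text{good 3NF}/\text{3NF}}(m) \cdot \mathrm{GAIN}_{\text{3NF}/\text{All}}(m)
  = 2^{m-3} \cdot 2
  = 2^{m-2}.
```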
The measure extends beyond relations
It can be used to reason about designs in other
data models:
• Nested relational
• XML
In particular it can be used to justify a normal form
proposed recently for XML:
• Called XNF (Arenas, L., 2002)
• Generalizes BCNF to XML documents
XML Databases
XML schema: (D,Σ).
• D is a DTD.
• Σ is a set of data dependencies over D.
We would like to evaluate XML normal forms.
The notion of being well-designed extends from
relations to XML.
• The measure is robust; we just need to define the set of
positions in an XML tree T: Pos(T).
Positions in an XML Tree
DBLP
  conf
    title “ICDT”
    issue
      article
        author “Dong”, title “. . .”, @year “1999”
      article
        author “Jarke”, title “. . .”, @year “1999”
    issue
      article
        title “. . .”, @year “2001”
  conf
XML normalization
The @year attribute moves from each article to its issue:
DBLP
  conf
    title “ICDT”
    issue, @year “1999”
      article
        author “Dong”, title “. . .”
      article
        author “Jarke”, title “. . .”
    issue, @year “2001”
      article
        title “. . .”
  conf
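The transformation in the DBLP example can be sketched with Python's standard xml.etree.ElementTree module. The document string below is a hypothetical encoding of the tree on the slide, and move_year_to_issue is an illustrative name; the sketch assumes the dependency that all articles of one issue share the same year.

```python
import xml.etree.ElementTree as ET

DOC = """
<DBLP>
  <conf>
    <title>ICDT</title>
    <issue>
      <article year="1999"><author>Dong</author><title>...</title></article>
      <article year="1999"><author>Jarke</author><title>...</title></article>
    </issue>
    <issue>
      <article year="2001"><title>...</title></article>
    </issue>
  </conf>
</DBLP>
"""


def move_year_to_issue(root):
    """Store @year once on each issue whose articles all carry the same year."""
    for issue in root.iter("issue"):
        articles = issue.findall("article")
        years = {a.get("year") for a in articles}
        if len(years) == 1 and None not in years:
            issue.set("year", years.pop())
            for a in articles:
                del a.attrib["year"]  # the value is now stored only on the issue


root = ET.fromstring(DOC.strip())
move_year_to_issue(root)
print([issue.get("year") for issue in root.iter("issue")])
print(ET.tostring(root, encoding="unicode"))
```

After the move, the year of an issue is stored in one position instead of once per article.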
XNF: an XML normal form
XNF is achieved by repeated transformations of two
kinds:
• As above in the DBLP example, and
• Splitting multiple attributes of the same element
type in the same manner as in the case of BCNF
normalization for relations.
There is also a formal definition which is a natural
analog of BCNF in the XML context.
Well-Designed XML Data
We consider k such that adom(T) ⊆ {1, …, k}.
For each k: RIC_T^k(Σ|p)
We consider the ratio: RIC_T^k(Σ|p) / log k
General measure:
RIC_T(Σ|p) = lim_{k→∞} RIC_T^k(Σ|p) / log k
XNF: XML Normal Form
For arbitrary XML data dependencies:
Definition An XML specification (D,Σ) is well-designed
if for every T ∈ inst(D,Σ) and every
p ∈ Pos(T), RIC_T(Σ|p) = 1.
For functional dependencies:
Theorem An XML specification (D,Σ) is in XNF if and
only if (D,Σ) is well-designed.
Future Work
What is an analog of 3NF for XML?
We would like to develop better characterizations of
normalization algorithms using our measure.
• Why is the “usual” BCNF decomposition algorithm good?
• Why does it always stop?
What else can this measure be used for?
What about nonuniform distributions?
• Are they meaningful here?
• If so, how do the results change?