Normalization
David J. Stucki
Outline

Informal Design Guidelines
Normal Forms
  1NF
  2NF
  3NF
  BCNF
  4NF

What makes a “good” model?

How do we judge the quality of a relational model?
  What makes one model of the data better than another model?
Implicit goals:
  Information preservation
    Maintain all concepts described by our higher-level models (ER)
  Minimum redundancy
    Keep redundant storage of the same information to a minimum; the fewer copies of each fact, the better

Informal Design Guidelines (1)
S#   SNAME   STATUS   CITY     P#   PNAME   COLOR   WEIGHT   QTY
S1   Smith   20       London   P1   Nut     Red     12       300
S1   Smith   20       London   P2   Bolt    Green   17       200
S1   Smith   20       London   P3   Screw   Blue    17       400
S1   Smith   20       London   P4   Screw   Red     14       200
S1   Smith   20       London   P5   Cam     Blue    12       100
S1   Smith   20       London   P6   Cog     Red     19       100
S2   Jones   10       Paris    P1   Nut     Red     12       300
S2   Jones   10       Paris    P2   Bolt    Green   17       400
S3   Blake   10       Paris    P2   Bolt    Green   17       200
S4   Clark   20       London   P2   Bolt    Green   17       200
S4   Clark   20       London   P4   Screw   Red     14       300
S4   Clark   20       London   P5   Cam     Blue    12       400

What entity does this relation describe?
  Hard to put into words
    Not “Parts”: the same part shows up multiple times
    Not “Suppliers”: the same supplier shows up multiple times
    Not easy to describe as a single entity: it is a list of parts and the suppliers who supply those parts

The problem here is that the semantics of the relation are unclear
  Semantics: the meaning behind the attribute values in a tuple
  We’re better off if we design our relations to have clear semantics

Guideline 1
  Design each relation schema so that it has an easy-to-explain meaning
  Do not combine attributes from multiple entity types or relationship types into a single relation

Poor Design:
  SUPPLIER-PARTS (S#, SNAME, STATUS, CITY, P#, PNAME, COLOR, WEIGHT, QTY)
Better Design:
  SUPPLIER (S#, SNAME, STATUS, CITY)
  PARTS (P#, PNAME, COLOR, WEIGHT, QTY)

Poor Design:
  EMP_DEPT (Ssn, Ename, Bdate, Address, Dnumber, Dname, Dmgr_ssn)
Better Design:
  EMP (Ssn, Ename, Bdate, Address, Dnumber)
  DEPT (Dnumber, Dname, Dmgr_ssn)

Informal Design Guidelines (2)
What about this redundant information?
  Part number shows up multiple times
  Supplier city shows up multiple times
  Supplier name shows up multiple times

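The amount of repetition can be counted directly. The following is a minimal Python sketch, assuming the twelve rows of the table are transcribed as tuples; the helper name `project` is an illustrative choice, not something from the slides. It projects the relation onto the supplier columns and onto the part columns and compares the number of distinct facts with the number of stored rows.

```python
# Rows of the combined supplier/parts relation, transcribed from the slides:
# (S#, SNAME, STATUS, CITY, P#, PNAME, COLOR, WEIGHT, QTY)
SUPPLIER_PARTS = [
    ("S1", "Smith", 20, "London", "P1", "Nut", "Red", 12, 300),
    ("S1", "Smith", 20, "London", "P2", "Bolt", "Green", 17, 200),
    ("S1", "Smith", 20, "London", "P3", "Screw", "Blue", 17, 400),
    ("S1", "Smith", 20, "London", "P4", "Screw", "Red", 14, 200),
    ("S1", "Smith", 20, "London", "P5", "Cam", "Blue", 12, 100),
    ("S1", "Smith", 20, "London", "P6", "Cog", "Red", 19, 100),
    ("S2", "Jones", 10, "Paris", "P1", "Nut", "Red", 12, 300),
    ("S2", "Jones", 10, "Paris", "P2", "Bolt", "Green", 17, 400),
    ("S3", "Blake", 10, "Paris", "P2", "Bolt", "Green", 17, 200),
    ("S4", "Clark", 20, "London", "P2", "Bolt", "Green", 17, 200),
    ("S4", "Clark", 20, "London", "P4", "Screw", "Red", 14, 300),
    ("S4", "Clark", 20, "London", "P5", "Cam", "Blue", 12, 400),
]


def project(rows, indexes):
    """Relational projection: keep the chosen columns and drop duplicate tuples."""
    return sorted({tuple(row[i] for i in indexes) for row in rows})


suppliers = project(SUPPLIER_PARTS, [0, 1, 2, 3])  # supplier facts only
parts = project(SUPPLIER_PARTS, [4, 5, 6, 7])      # part facts only

print(len(SUPPLIER_PARTS), "rows store", len(suppliers),
      "distinct suppliers and", len(parts), "distinct parts")
```

Every supplier or part fact that appears in more than one row is stored redundantly.
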
Every repeated entry wastes storage
More data in the database also hurts performance
  Retrieval time, insertion time
The best designs waste the least storage

Redundant information can cause other problems too
  “Update anomalies”
    Logical problems that stem from poor choices in relational representation

Insertion Anomalies (2)
If we want to add a new part, we need to:
  Have a supplier for it, OR
  Put NULL values in for the supplier
But S# is part of the primary key for this relation!
  NULLs in S# violate integrity constraints
  We must have both a supplier and a part before we can add either of them
  Even if S# were not part of the primary key, this would still be bad
    We would need a Supplier for every Part, even though they are two different things

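The insertion anomaly can be reproduced with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table and column names (`supplier_parts`, `s_no`, `p_no`), the composite key (S#, P#), and the new part P7 "Washer" are all hypothetical, since the slides only say that S# is part of the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed DDL for the combined relation; the key (s_no, p_no) is an assumption.
# NOT NULL is spelled out because SQLite does not imply it for composite keys.
conn.execute("""
    CREATE TABLE supplier_parts (
        s_no  TEXT NOT NULL,
        sname TEXT,
        p_no  TEXT NOT NULL,
        pname TEXT,
        qty   INTEGER,
        PRIMARY KEY (s_no, p_no)
    )
""")

# Try to record a new part that no supplier ships yet.
try:
    conn.execute(
        "INSERT INTO supplier_parts (s_no, p_no, pname) VALUES (NULL, 'P7', 'Washer')"
    )
    inserted = True
except sqlite3.IntegrityError as err:
    inserted = False
    print("rejected:", err)

# With parts in their own relation, the same part can be added on its own.
conn.execute("CREATE TABLE parts (p_no TEXT PRIMARY KEY NOT NULL, pname TEXT)")
conn.execute("INSERT INTO parts VALUES ('P7', 'Washer')")
part_count = conn.execute("SELECT COUNT(*) FROM parts").fetchone()[0]
print("parts stored:", part_count)
```
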
Deletion Anomalies (2)
Suppose we stop needing part P2 and delete it from the database
  We’ve now deleted all of the information related to supplier S3
  Not good: just because we’ve stopped using a part doesn’t mean we’ve dropped its supplier entirely

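A short Python sketch of this deletion anomaly, using only the (S#, SNAME, P#) columns transcribed from the table in the slides. The variable names are illustrative.

```python
# (S#, SNAME, P#) combinations from the combined relation in the slides
rows = [
    ("S1", "Smith", "P1"), ("S1", "Smith", "P2"), ("S1", "Smith", "P3"),
    ("S1", "Smith", "P4"), ("S1", "Smith", "P5"), ("S1", "Smith", "P6"),
    ("S2", "Jones", "P1"), ("S2", "Jones", "P2"),
    ("S3", "Blake", "P2"),
    ("S4", "Clark", "P2"), ("S4", "Clark", "P4"), ("S4", "Clark", "P5"),
]

suppliers_before = {s_no for s_no, _, _ in rows}

# Delete every row that mentions part P2.
remaining = [row for row in rows if row[2] != "P2"]
suppliers_after = {s_no for s_no, _, _ in remaining}

# Suppliers whose only rows were P2 rows have vanished along with the part.
lost = suppliers_before - suppliers_after
print("suppliers lost:", sorted(lost))
```
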
Modification Anomalies (2)
Suppose supplier S1 changes its name to “Consolidated Parts Inc.”
  We now need to update every row describing a part that comes from supplier S1
  If we miss any, we leave our relation in an inconsistent state

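The inconsistency can be shown in a few lines of Python. This sketch uses the six S1 rows from the slides; the "careless" update that touches only parts P1 to P3 is a made-up scenario for illustration.

```python
# (S#, SNAME, P#) rows for supplier S1, taken from the table in the slides
rows = [
    ["S1", "Smith", "P1"],
    ["S1", "Smith", "P2"],
    ["S1", "Smith", "P3"],
    ["S1", "Smith", "P4"],
    ["S1", "Smith", "P5"],
    ["S1", "Smith", "P6"],
]

# Hypothetical careless update: only the rows for P1..P3 get the new name.
for row in rows:
    if row[0] == "S1" and row[2] in {"P1", "P2", "P3"}:
        row[1] = "Consolidated Parts Inc."

# One supplier number should map to exactly one name.
names_for_s1 = {row[1] for row in rows if row[0] == "S1"}
consistent = len(names_for_s1) == 1
print("names recorded for S1:", sorted(names_for_s1), "consistent:", consistent)
```
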
Guideline 2
  Design relation schemas so that no insertion, deletion, or modification anomalies are present in the relations
  This ends up being equivalent to designing relation schemas so that no redundant information exists in tuples

Informal Design Guidelines (3)
EMP

Ssn    Ename    Bdate   Address   Dnumber   ManagesDno
123…   Bob…     …       …         05        NULL
345…   Mary…    …       …         05        05
789…   Tom…     …       …         05        NULL
444…   June…    …       …         05        NULL
076…   Alice…   …       …         05        NULL

What’s wrong here?
  Every employee has an attribute for which department they manage
  Most employees are not managers
    All of those employees have NULL values in the ManagesDno column
    These take up space for no good reason

Reasons to avoid attributes with many NULL values
  They take up storage space for no good reason
  They can make entities harder to understand
    More difficult to figure out how to JOIN properly
  Different NULL values have different meanings
    Does not apply
    Unknown
    Known but absent

Guideline 3
  Design relation schemas to avoid attributes that will frequently have NULL values
  If NULLs must be used, make sure they are the exceptional case and not the typical value of the attribute for the majority of tuples

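One way to follow Guideline 3 for the EMP example is to move the mostly-NULL attribute into its own relation. The Python sketch below is only illustrative: the identifiers `emp1` to `emp5` are hypothetical stand-ins (the slides elide the real Ssn values), and the relation name `MANAGES` is an assumption.

```python
# Hypothetical stand-ins for the five EMP rows: (emp_id, dnumber, manages_dno).
# None plays the role of NULL.
EMP = [
    ("emp1", "05", None),
    ("emp2", "05", "05"),
    ("emp3", "05", None),
    ("emp4", "05", None),
    ("emp5", "05", None),
]

null_count = sum(1 for _, _, manages in EMP if manages is None)
print(null_count, "of", len(EMP), "rows hold NULL in ManagesDno")

# Split: EMP_SLIM keeps the attributes every employee has;
# MANAGES has a row only for employees who actually manage a department.
EMP_SLIM = [(emp_id, dno) for emp_id, dno, _ in EMP]
MANAGES = [(emp_id, manages) for emp_id, _, manages in EMP if manages is not None]
print("MANAGES:", MANAGES)
```

After the split, neither relation needs a NULL for this data.
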
Informal Design Guidelines (4)
SUPPLIER (S#, Name, Status, City)
SUPPLIER_INFO (S#, Name, Status)
SUPPLIER_LOCATION (Name, City)

Consider the above rethinking of the SUPPLIER schema into two separate schemas
  Can we easily recover the relation SUPPLIER from SUPPLIER_INFO and SUPPLIER_LOCATION?
  No: note that Name does not have to be unique in SUPPLIER or in SUPPLIER_INFO
    But it IS unique in combination with City in SUPPLIER_LOCATION
    This can lead to spurious tuples when we combine the tables
    Example: two different companies that both have the same name

SUPPLIER_INFO
S#   Name    Status
01   Smith   Y
02   Smith   N

SUPPLIER_LOC
Name    City
Smith   Columbus
Smith   Boston

SUPPLIER
S#   Name    Status   City
01   Smith   Y        Columbus
02   Smith   N        Boston

SUPPLIER_INFO * SUPPLIER_LOC
S#   Name    Status   Name    City
01   Smith   Y        Smith   Columbus
01   Smith   Y        Smith   Boston
02   Smith   N        Smith   Columbus
02   Smith   N        Smith   Boston



Two different companies with the same name
  The original SUPPLIER table shows one row for each
  Querying the modified schema produces spurious tuples
    Why? Name is a poor choice of match condition for the SUPPLIER_LOC table

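The spurious tuples can be reproduced in plain Python with the two SUPPLIER rows from the slides. This is a minimal sketch; the join is written as a nested comprehension that matches on Name, and the variable names are illustrative.

```python
# Original relation from the slides: (S#, Name, Status, City)
SUPPLIER = [
    ("01", "Smith", "Y", "Columbus"),
    ("02", "Smith", "N", "Boston"),
]

# The two projections used in the decomposition
SUPPLIER_INFO = [(s_no, name, status) for s_no, name, status, _ in SUPPLIER]
SUPPLIER_LOC = [(name, city) for _, name, _, city in SUPPLIER]

# Join the projections back together, matching on Name only
joined = [
    (s_no, name, status, city)
    for s_no, name, status in SUPPLIER_INFO
    for loc_name, city in SUPPLIER_LOC
    if name == loc_name
]

# Tuples in the join that were never in the original relation
spurious = sorted(set(joined) - set(SUPPLIER))
print(len(joined), "joined rows,", len(spurious), "of them spurious:", spurious)
```
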
Guideline 4
  Design relation schemas so that they can be joined with equality conditions on appropriate attributes (primary key/foreign key pairs)
  Don’t build relations that contain matching attributes that are not primary key/foreign key matches

Normalization

What is normalization?
  A process where we examine and revise our relation schemas based on their functional dependencies and primary keys
Normalization attempts to improve the quality of our database by:
  Minimizing redundancy in our stored data (relation sets)
  Minimizing the number of update anomalies in our stored data
Normalization provides database designers with:
  A formal framework for analyzing relation schemas
  A series of tests that allow different degrees of normalization of our data, depending on our needs

Define: Normal Form
  A criterion for determining how vulnerable a relation is to logical inconsistencies (update anomalies or redundancy)
  Different levels of normal form
    1st normal form (1NF), 2nd normal form (2NF), …
    Highest normal form (HNF) of a relation: the highest level of normal form criteria that the relation meets

Normal forms by themselves do not guarantee good database design
  No “magic formula” for making a good design
  Designers must confirm additional properties in their designs:
    Lossless join property: guarantees that the spurious tuple generation discussed previously does not occur
    Dependency preservation: guarantees that each dependency that existed before altering the schema is still represented in the altered schema

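A rough way to probe the lossless join property is to project a sample relation, join the projections, and compare the result with the original. The Python sketch below does this for one instance only; the real property must hold for every legal instance, so a passing check here is evidence, not proof. The helper names (`project`, `natural_join`, `rejoins_exactly`) are illustrative, and the data are the two SUPPLIER rows from the earlier slide.

```python
def project(rows, attrs):
    """Project dict rows onto attrs, dropping duplicate tuples."""
    distinct = {tuple(row[a] for a in attrs) for row in rows}
    return [dict(zip(attrs, values)) for values in sorted(distinct)]


def natural_join(left, right):
    """Join dict rows on every attribute name the two sides share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined


def rejoins_exactly(rows, attrs_a, attrs_b):
    """True if joining the two projections gives back exactly the original rows."""
    def canon(rs):
        return {tuple(sorted(r.items())) for r in rs}
    rebuilt = natural_join(project(rows, attrs_a), project(rows, attrs_b))
    return canon(rebuilt) == canon(rows)


SUPPLIER = [
    {"S#": "01", "Name": "Smith", "Status": "Y", "City": "Columbus"},
    {"S#": "02", "Name": "Smith", "Status": "N", "City": "Boston"},
]

# Splitting on Name loses information; splitting on the key S# does not.
split_on_name = rejoins_exactly(SUPPLIER, ["S#", "Name", "Status"], ["Name", "City"])
split_on_key = rejoins_exactly(SUPPLIER, ["S#", "Name", "Status"], ["S#", "City"])
print("split on Name rejoins exactly:", split_on_name)
print("split on S# rejoins exactly:", split_on_key)
```
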
First Normal Form (1NF)

1NF is the most basic normal form
  So basic that, since it was defined, it has become part of the definition of a relation in the relational model
A relation is in 1NF if:
  It has only atomic attributes, AND
  The value of any attribute in a tuple is a single value
    No multivalued attributes
    No composite attributes

Here’s an example of a schema that is NOT in 1NF
  Note that this violates our definition of a relation: Dlocations is a set rather than an atomic value
To fix this we can do one of two things:
  1. Change Dlocations to be atomic and expand tuples with multiple locations into multiple tuples (adding redundant information)
  2. Break Dlocations out into its own relation, using Dnumber as a foreign key

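Both fixes can be sketched in Python. The rows below are hypothetical sample data (only the attribute names Dnumber and Dlocations come from the slides; Dname and all values are assumptions), with Dlocations held as a set to mimic the non-atomic attribute.

```python
# Hypothetical non-1NF rows: (Dnumber, Dname, Dlocations), Dlocations is a set
DEPARTMENT = [
    (5, "Research", {"Bellaire", "Sugarland", "Houston"}),
    (4, "Administration", {"Stafford"}),
    (1, "Headquarters", {"Houston"}),
]

# Fix 1: one tuple per (department, location); Dname is repeated per location.
flat = sorted(
    (dno, dname, loc) for dno, dname, locs in DEPARTMENT for loc in locs
)

# Fix 2: keep one row per department and move locations to their own relation,
# where Dnumber acts as a foreign key back to DEPT.
DEPT = sorted((dno, dname) for dno, dname, _ in DEPARTMENT)
DEPT_LOCATIONS = sorted((dno, loc) for dno, _, locs in DEPARTMENT for loc in locs)

print("fix 1 rows:", len(flat))
print("fix 2 rows:", len(DEPT), "departments +", len(DEPT_LOCATIONS), "locations")
```

Fix 1 repeats the department name once per location, while fix 2 stores it once.
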
Second Normal Form (2NF)

2NF is a stricter normal form than 1NF
  A relation schema R is in 2NF if it is in 1NF and no nonprime attribute A in R is functionally dependent on a proper subset of the primary key of R

Testing for 2NF
  No test is needed if the primary key has a single attribute
    A relation is always in 2NF if it is in 1NF and has a single-attribute primary key
  If the primary key is made of multiple attributes:
    Examine your primary key and nonprime attributes
    Can you remove an attribute from your primary key and still have a dependency with at least one nonprime attribute?
      If so, the relation is not in 2NF

Consider the EMP_PROJ relation (Ssn, Pnumber, Hours, Ename, Pname, Plocation)
  Primary key is Ssn + Pnumber
    The pair uniquely determines Hours
  Remove Pnumber from the key
    Ssn → Ename
    A unique Ssn determines what the Ename will be; no need for Pnumber at all
  Remove Ssn from the key
    Pnumber → {Pname, Plocation}
    Pname and Plocation are both independent of the employee’s Ssn
  EMP_PROJ is not in 2NF

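The 2NF test can be mechanized once the functional dependencies are written down. The following Python sketch assumes the primary key is the only candidate key (so "nonprime" means "not in the primary key") and represents each FD as a (determinant, dependents) pair of sets; `closure` and `partial_dependencies` are illustrative names. The COURSES dependency {Dept, CourseNo} → Cname is included as a second case.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure: everything determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, fds):
    """Map each proper subset of the key to the non-key attributes it determines."""
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = closure(subset, fds) - set(key)
            if determined:
                found[subset] = determined
    return found


EMP_PROJ_FDS = [
    ({"Ssn", "Pnumber"}, {"Hours"}),
    ({"Ssn"}, {"Ename"}),
    ({"Pnumber"}, {"Pname", "Plocation"}),
]
COURSES_FDS = [({"Dept", "CourseNo"}, {"Cname"})]

emp_proj_partial = partial_dependencies({"Ssn", "Pnumber"}, EMP_PROJ_FDS)
courses_partial = partial_dependencies({"Dept", "CourseNo"}, COURSES_FDS)
print("EMP_PROJ partial dependencies:", emp_proj_partial)
print("COURSES partial dependencies:", courses_partial)
```

An empty result means no partial dependency was found, so the relation passes this 2NF test.
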
INSTRUCTOR_SECTIONS (EmpId, SecId, Cname, Cnumber, Iname)

Consider the above relation schema
  Iname: instructor name
  Cname, Cnumber: course name and number
Is this schema in 2NF?
  What are the dependencies?
    EmpId → Iname
    SecId → {Cname, Cnumber}
  2NF?

COURSES (Dept, CourseNo, Cname)

Consider the above relation schema
  Dept: department name
  Cname: course name
Is this schema in 2NF?
  What are the functional dependencies?
    {Dept, CourseNo} → Cname
  2NF?

EMP_PROJ had three FDs
  {Ssn, Pnumber} → Hours
  Ssn → Ename
  Pnumber → {Pname, Plocation}
EMP_PROJ becomes three separate relations, one for each dependency

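The decomposition can be sketched in Python by building one relation per dependency and projecting sample data onto it. The attribute names and FDs come from the slides; the three EMP_PROJ rows are hypothetical values invented for illustration.

```python
ATTRS = ("Ssn", "Pnumber", "Hours", "Ename", "Pname", "Plocation")

# Hypothetical sample rows, in ATTRS order
EMP_PROJ = [
    ("111", 1, 10, "Ann", "Alpha", "Boston"),
    ("111", 2, 5, "Ann", "Beta", "Denver"),
    ("222", 1, 20, "Raj", "Alpha", "Boston"),
]

# The three FDs from the slides, as (determinant, dependents)
FDS = [
    (("Ssn", "Pnumber"), ("Hours",)),
    (("Ssn",), ("Ename",)),
    (("Pnumber",), ("Pname", "Plocation")),
]


def project(rows, attrs):
    """Keep only the named attributes and drop duplicate tuples."""
    idx = [ATTRS.index(a) for a in attrs]
    return sorted({tuple(row[i] for i in idx) for row in rows})


# One relation per FD: its schema is determinant + dependents.
relations = {lhs + rhs: project(EMP_PROJ, lhs + rhs) for lhs, rhs in FDS}
for schema, rows in relations.items():
    print(schema, rows)
```

Each employee name and each project description now appears once, instead of once per (employee, project) pair.
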
INSTRUCTOR_SECTIONS (EmpId, SecId, Cname, Cnumber, Iname)
  IS1 (EmpId, Iname)
  IS2 (SecId, Cname, Cnumber)
  IS3 (EmpId, SecId)

INSTRUCTOR_SECTIONS
  EmpId → Iname
  SecId → {Cname, Cnumber}
  Becomes two separate relations, one for each dependency, plus IS3
    IS3 is our original primary key
    It keeps the relationship between instructors and sections, but nothing else

Third Normal Form (3NF)

3NF is even stricter than 2NF
  A relation schema R is in 3NF if it is in 2NF and no nonprime attribute of R is transitively dependent on the primary key

Testing for 3NF
  First make sure the relation schema is in 2NF
    If it isn’t in 2NF, it can’t be in 3NF
  Next determine whether any nonkey attributes are functionally determined by other nonkey attributes
    If so, you have a transitive dependency and the relation is not in 3NF

Consider EMP_DEPT above
  Is it in 3NF?
  First, is it in 2NF?
    Yes: the primary key has only a single attribute, so it is in 2NF
  Are there any transitive dependencies?
    Yes:
      Ssn → Dnumber
      Dnumber → {Dname, Dmgr_ssn}
  Not in 3NF

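The second step of the 3NF test (nonkey attributes determined by other nonkey attributes) can be written as a small Python function. This is a simplified sketch that assumes the primary key is the only candidate key and that the FD list is given explicitly; `transitive_dependencies` is an illustrative name. The FDs are the ones listed in the slides for EMP_DEPT, COURSE_SECTIONS, and COURSES.

```python
def transitive_dependencies(key, fds):
    """Return FDs whose determinant is entirely nonkey and that determine nonkey attributes."""
    key = set(key)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not (set(lhs) & key) and (set(rhs) - key)
    ]


EMP_DEPT_FDS = [
    ({"Ssn"}, {"Dnumber"}),
    ({"Dnumber"}, {"Dname", "Dmgr_ssn"}),
]
COURSE_SECTIONS_FDS = [
    ({"SecId"}, {"CourseId", "Cname", "Cnumber", "Iname"}),
    ({"CourseId"}, {"Cname", "Cnumber"}),
]
COURSES_FDS = [({"Dept", "CourseNo"}, {"Cname"})]

emp_dept_hits = transitive_dependencies({"Ssn"}, EMP_DEPT_FDS)
sections_hits = transitive_dependencies({"SecId"}, COURSE_SECTIONS_FDS)
courses_hits = transitive_dependencies({"Dept", "CourseNo"}, COURSES_FDS)

print("EMP_DEPT:", emp_dept_hits)
print("COURSE_SECTIONS:", sections_hits)
print("COURSES:", courses_hits)
```

A non-empty list means the relation fails this 3NF test; an empty list means no violation was found among the listed FDs.
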
COURSE_SECTIONS (SecId, CourseId, Cname, Cnumber, Iname)

Consider COURSE_SECTIONS above
  SecId: section ID
  CourseId: special unique ID for courses
Is it in 3NF?
  First, is it in 2NF?
    Dependencies:
      SecId → {CourseId, Cname, Cnumber, Iname}
      CourseId → {Cname, Cnumber}
  Are there any transitive dependencies?

COURSES (Dept, CourseNo, Cname)

Consider the above relation schema
  Dept: department name
  Cname: course name
Is this schema in 3NF?
  Is it in 2NF?
    Dependencies:
      {Dept, CourseNo} → Cname
  Are there any transitive dependencies?

EMP_DEPT had one transitive dependency
  Ssn → Dnumber
  Dnumber → {Dname, Dmgr_ssn}
  Break out Dnumber, Dname and Dmgr_ssn into their own relation schema
  Keep the original relation schema with Ssn as primary key

COURSE_SECTIONS (SecId, CourseId, Cname, Cnumber, Iname)
  CS1 (SecId, CourseId, Iname)
  CS2 (CourseId, Cname, Cnumber)

COURSE_SECTIONS had one transitive dependency
  SecId → CourseId
  CourseId → {Cname, Cnumber}
  Break out CourseId, Cname and Cnumber into their own relation schema
  Keep the original relation schema with SecId as primary key

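The COURSE_SECTIONS decomposition can be checked on sample data in Python: project onto CS1 and CS2, then join them back on CourseId and confirm that the original rows come back. The rows below are hypothetical values made up for illustration; only the attribute names and the CS1/CS2 split come from the slides.

```python
# Hypothetical rows: (SecId, CourseId, Cname, Cnumber, Iname)
COURSE_SECTIONS = [
    ("sec1", "c10", "Databases", "CS 340", "Lee"),
    ("sec2", "c10", "Databases", "CS 340", "Kim"),
    ("sec3", "c20", "Networks", "CS 350", "Lee"),
]

# CS1 keeps the section facts; CS2 holds each course description once.
CS1 = sorted({(sec, course, iname) for sec, course, _, _, iname in COURSE_SECTIONS})
CS2 = sorted({(course, cname, cnum) for _, course, cname, cnum, _ in COURSE_SECTIONS})

# Join on CourseId (primary key of CS2, foreign key in CS1).
rejoined = sorted(
    (sec, course, cname, cnum, iname)
    for sec, course, iname in CS1
    for course2, cname, cnum in CS2
    if course == course2
)

print("CS1:", CS1)
print("CS2:", CS2)
print("rejoin matches original:", rejoined == sorted(COURSE_SECTIONS))
```

Because the join uses a primary key/foreign key match, no spurious tuples appear for this sample.
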
1NF, 2NF and 3NF based on Primary Keys: Summary