Week 10 - DCU School of Computing

Chapter 9:
Normalization
• Part 1: A Simple Example
• Part 2: Another Example & The
Formal Stuff
A Problem: Keeping Track of Invoices (cont’d)
Suppose we have some invoices that we may or may not
want to refer to later…
A Problem: Keeping Track of Invoices (cont’d)
Fig. 9.1
Could store these in an Excel file but, as seen, there may be problems when asking
complex questions about the data:
1. How many 4” bolts did Frankenstein Parts order in 2002?
2. What items were sold on a certain date?
Solution: A Normalized Database
• First Normal Form (NF1):
No Repeating Elements or Groups of Elements
• In Fig. 9.1, rows 2, 3, 4 represent invoice 125, which in
DB terms is a single tuple
• In NF1 want to get rid of repeating elements, which
are:
– column H2 to H4, column J2 to J4, column K2 to K4 etc
– these contain lists of values, and these are hated by NF1
– NF1 wants atomicity: each attribute is simple & indivisible
– the repeating data for invoice 125 is cells: H2-M2, H3-M3, H4-M4
• Can satisfy NF1, simply by separating each item in
these lists into its own row (See Fig. 9.2).
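The row-splitting step can be sketched in Python. This is only an illustration: the invoice values and the helper name to_nf1_rows are made up (not taken from Fig. 9.1), and only two of the repeating columns are shown.

```python
# Hypothetical invoice in the un-normalised shape of Fig. 9.1:
# one record per invoice, with list-valued item columns.
invoice = {
    "order_id": 125,
    "order_date": "2002-09-13",
    "customer_id": 56,
    "item_id": [563, 851, 652],
    "item_qty": [4, 32, 5],
}

def to_nf1_rows(inv, list_columns):
    """Return one flat row per item, repeating the single-valued columns."""
    scalars = {k: v for k, v in inv.items() if k not in list_columns}
    rows = []
    for values in zip(*(inv[c] for c in list_columns)):
        row = dict(scalars)                    # copy order-level data
        row.update(zip(list_columns, values))  # add one item's values
        rows.append(row)
    return rows

rows = to_nf1_rows(invoice, ["item_id", "item_qty"])
for r in rows:
    print(r)
```

Every value in every row is now atomic, at the cost of repeating the order-level columns on each row.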
Solution: NF1 Cont’d
Fig. 9.2
But we were trying to reduce & simplify, and have now introduced more data!
No matter, this will be addressed later (with NF3)
Solution: NF1 Cont’d
• Have only done half of NF1. NF1 addresses:
1. A row of data can't have repeating groups of similar data (atomicity) ✓
2. Each row of data must have a unique identifier (or Primary Key)
• In order to look at 2., have to convert Fig 9.2 into an
RDBMS table (see the orders table in MS Access Fig. 9.3)
Fig. 9.3
• As can be seen, no one column ids each row, so have to
use two together: order_id & item_id
• Together the concatenated primary key ids each row
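A concatenated primary key can be tried out with Python's built-in sqlite3 module. This is a sketch only: SQLite stands in for the MS Access table of Fig. 9.3, most columns are left out, and the data values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id INTEGER NOT NULL,
        item_id  INTEGER NOT NULL,
        item_qty INTEGER,
        PRIMARY KEY (order_id, item_id)   -- the concatenated key
    )
""")
conn.execute("INSERT INTO orders VALUES (125, 563, 4)")
conn.execute("INSERT INTO orders VALUES (125, 851, 32)")   # same order, other item: ok
conn.execute("INSERT INTO orders VALUES (126, 563, 500)")  # same item, other order: ok

try:
    # Same (order_id, item_id) pair again: violates the primary key.
    conn.execute("INSERT INTO orders VALUES (125, 563, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

Neither column is unique on its own; only the pair is.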
Solution: NF1 Cont’d
• The underlying structure of the orders
table can be represented as Fig. 9.4
• Identify the columns that make up the
key with the PK notation.
• Fig. 9.4 begins the Entity Relationship
Diagram (or ERD).
• DB schema now satisfies the 2
requirements of NF1: atomicity
& uniqueness. Thus it meets the
most basic criteria of a relational db.
Fig. 9.4
orders
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
item_id(PK)
item_description
item_qty
item_price
item_total_price
order_total_price
Solution: NF2
• Second Normal Form (NF2):
No Partial Dependencies on a Concatenated Key
• Next have to test each table for partial dependencies on a
concatenated key
• Means that for a table with a concatenated primary key, each
column that is not part of the primary key must depend upon the
entire concatenated key for its existence.
• If a column depends upon only 1 part of the concatenated key,
then entire table has failed NF2 & must create another table to fix
it.
• For each column must ask the question:
– Can this column exist without one or the other part of the
concatenated primary key?
– If answer is “yes” – even once – table fails NF2
Solution: NF2 Cont’d
• Refer to Fig. 9.4 again to recall orders table structure.
• Recall the meaning of the two columns
in the primary key:
– order_id identifies the invoice this item comes from.
– item_id is the inventory item's unique identifier.
Can think of it as a part number.
• Don't analyze these columns (since
they are part of the primary key).
• Instead consider the remaining columns...
Fig. 9.4
orders
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
item_id(PK)
item_description
item_qty
item_price
item_total_price
order_total_price
Solution: NF2 Cont’d
• order_date is the date on which the order was made.
– relies on order_id; an order date has to have an order,
otherwise it is only a date
– can an order date exist without an item_id? yes: order_date
relies on order_id, not item_id (a specific order doesn't have
to have a specific item)
– so order_date fails NF2 ✗
• customer_id is ID of the customer who placed the order
– does it rely on order_id? No: a customer can exist without
placing any orders.
– does it rely on item_id? No (same reason).
– customer_id does not rely on either member of the PK
– What to do? NF3 will come to the rescue here, hence the ? against
customer_id and the rest of the customer_* columns
Solution: NF2 Cont’d
• item_description is next column not itself part of PK. It is
the plain-language description of the inventory item.
– relies on item_id, but can it exist without an order_id?
– Yes! An inventory item (& "description") could sit on a shelf, and
never be purchased... It can exist independent of an order.
– item_description fails the test. ✗
• item_qty is no. of items purchased on a particular
invoice.
– can it exist without an item_id? No: can't have "amount of
nothing"
– can it exist without an order_id? No: a quantity purchased with
an invoice is meaningless without an invoice.
– So this column does not violate NF2
– item_qty depends on both parts of our concatenated PK. ✓
Solution: NF2 Cont’d
• item_price is similar to item_description. It depends on
the item_id but not on order_id, so it does violate NF2. ✗
• item_total_price is tricky:
– seems to depend on both order_id & item_id, so passes NF2.
– but it is a derived value: it is item_qty times item_price.
– so, in fact, it doesn’t belong in the db at all.
– can easily be reconstructed outside of db; to include it would be
redundant (and could quite possibly introduce corruption).
– therefore can discard it
• order_total_price, the sum of all the item_total_price
fields for a particular order, is another derived value.
– can discard this field too for the same reason as
item_total_price
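Reconstructing the two derived values outside the db can be sketched as below. The rows are made up, and item_price is held in whole cents here (an assumption, to keep the arithmetic exact).

```python
# Hypothetical order lines; item_price is in cents (assumption).
order_lines = [
    {"order_id": 125, "item_id": 563, "item_qty": 4,  "item_price": 350},
    {"order_id": 125, "item_id": 851, "item_qty": 32, "item_price": 25},
    {"order_id": 126, "item_id": 563, "item_qty": 10, "item_price": 350},
]

def item_total_price(line):
    """Derived value: quantity times unit price."""
    return line["item_qty"] * line["item_price"]

def order_total_price(lines, order_id):
    """Derived value: sum of the item totals for one order."""
    return sum(item_total_price(l) for l in lines if l["order_id"] == order_id)

print(order_total_price(order_lines, 125))
```

Since both totals are recomputed on demand, they can never disagree with the stored quantities and prices.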
Solution: NF2 Cont’d
Fig. 9.4
orders
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
item_id(PK)
item_description
item_qty
item_price
item_total_price
order_total_price
Fig. 9.4 (New)
orders
order_id(PK)
order_date ✗
customer_id ?
customer_name ?
customer_address ?
customer_city ?
customer_state ?
item_id(PK)
item_description ✗
item_qty ✓
item_price ✗
item_total_price
order_total_price
Solution: NF2 Cont’d
• What to do with a table that fails NF2, as this one has?
– First take out the second half of the concatenated PK (item_id)
& put it in its own table.
– All columns that depend on item_id - whether in whole or in
part - follow it into the new table, order_items (see Fig. 9.5).
– The other fields — those that rely on just the first half of the PK
(order_id) and those we aren't sure about — stay where they
are.
orders
Fig. 9.5
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
order_items
order_id(PK)
item_id(PK)
item_description
item_qty
item_price
Solution: NF2 Cont’d
• Things to notice about Fig. 9.5:
1. have brought a copy of order_id to the order_items table to
allow each order_item to "remember" which order it is a part
of.
2. orders table has fewer rows than before & no longer has a
concatenated PK. PK consists of a single column, order_id.
3. order_items table does have a concatenated primary key.
• The crow's feet in Fig. 9.5 mean:
– each order can be associated with any number of order-items, but at
least one;
– each order-item is associated with one order, and only one.
orders
Fig. 9.5
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
order_items
order_id(PK)
item_id(PK)
item_description
item_qty
item_price
Solution: NF2 Phase II
• Remember, NF2 only applies to tables with a
concatenated PK. Now orders has a single-column PK,
it has passed NF2.
• order_items, however, still has a concatenated PK.
– have to pass it through NF2 analysis again to see if it passes.
– ask the same question we did before:
– Can this column exist without one or the other part of the
concatenated PK?
• Fig. 9.6 shows order_items table structure.
• item_description relies on item_id, but
not order_id, so this again fails NF2 ✗
• item_qty relies on both parts of PK,
does not violate NF2 ✓
• item_price relies on item_id but not on order_id, so it
does violate NF2 ✗
Fig. 9.6
order_items
order_id(PK)
item_id(PK)
item_description
item_qty
item_price
Solution: NF2 Phase II Cont’d
Fig. 9.6
order_items
order_id(PK)
item_id(PK)
item_description
item_qty
item_price
Fig. 9.6 (New)
order_items
order_id(PK)
item_id(PK)
item_description ✗
item_qty ✓
item_price ✗
• On first pass through NF2 test, lost all fields relying on item_id & put
them into new table. This time, only taking fields failing the test:
i.e. item_qty stays. What's different this time?
• First pass, removed item_id key from orders altogether because of
the 1:M relationship between orders & order-items.
– Therefore item_qty field had to follow item_id into the new table.
• Second pass, item_id wasn't taken from order-items table because of
the M:1 relationship between order-items & items.
– Therefore, since item_qty does not violate NF2 this time, it is
permitted to stay in the table with the two PK parts that it relies on.
Solution: NF2 Phase II Cont’d
• The crow's feet in Fig. 9.7 mean:
– each item can be associated with any number of lines on any number
of invoices, including zero;
– each order-item is associated with one item, and only one.
– These two lines are examples of 1:M relationships.
• This 3-table structure is how to express an M:N relationship:
– Each order can have many items; each item can belong to many
orders.
• Notes:
– Didn't bring a copy of order_id column into new table because individual
items needn't know the orders they are part of, as order_items
remembers this relationship via the order_id & item_id columns. Taken
together these columns comprise the PK of order_items, but taken
separately they are FKs to rows in other tables.
– New table does not have a concatenated PK, so it passes NF2.
orders
Fig. 9.7
order_id(PK)
order_date
customer_id
customer_name
customer_address
customer_city
customer_state
order_items
order_id(PK)
item_id(PK)
item_qty
items
item_id(PK)
item_description
item_price
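The 3-table structure can be tried out with sqlite3. A sketch under assumptions: SQLite is used for convenience, the customer_* columns of orders are omitted for brevity, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_date TEXT);
    CREATE TABLE items  (item_id INTEGER PRIMARY KEY,
                         item_description TEXT, item_price INTEGER);
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL,
        item_id  INTEGER NOT NULL,
        item_qty INTEGER NOT NULL,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO orders VALUES (125, '2002-09-13'), (126, '2002-09-14');
    INSERT INTO items  VALUES (563, 'bolt', 350), (851, 'washer', 25),
                              (652, 'hex nut', 40);
    INSERT INTO order_items VALUES (125, 563, 4), (125, 851, 32), (126, 563, 10);
""")

# One order -> many items.
items_on_125 = [r[0] for r in conn.execute("""
    SELECT i.item_description
    FROM order_items AS oi JOIN items AS i ON i.item_id = oi.item_id
    WHERE oi.order_id = 125 ORDER BY i.item_id
""")]

# One item -> many orders.
orders_with_563 = [r[0] for r in conn.execute(
    "SELECT order_id FROM order_items WHERE item_id = 563 ORDER BY order_id")]

# An item may appear on zero invoices.
lines_for_652 = conn.execute(
    "SELECT COUNT(*) FROM order_items WHERE item_id = 652").fetchone()[0]

print(items_on_125, orders_with_563, lines_for_652)
```

Both directions of the M:N relationship are answered through order_items; items itself holds no order_id.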
Solution: NF3
• Third Normal Form (NF3):
No Dependencies on Non-Key Attributes
• Can return to the repeating Customer info problem. As the db stands, if a
customer places >1 order, have to input the customer's contact info
again because there are columns in orders that rely on "non-key
attributes".
• To understand this, consider order_date. Can it exist independent
of order_id?
– No!: an "order date" is meaningless without an order.
– order_date depends on a key attribute (order_id is "key attribute"
because it is table’s PK).
• What about customer_name — can it exist on its own, outside of
the orders table?
– Yes. It is meaningful to talk about a customer name without referring
to an order or invoice.
Solution: NF3 Cont’d
• Same goes for customer_address, customer_city, &
customer_state. These 4 columns actually rely on customer_id,
which is not a key in this table (it is a non-key attribute).
• These fields belong in their own table, customers, with
customer_id as PK (see Fig 9.8).
• However, notice in Fig 9.8 that the relationship has been severed between
the orders table and the Customer data that used to inhabit it.
orders
order_id(PK)
customer_id(FK)
order_date
order_items
order_id(PK)
item_id(PK)
item_qty
items
item_id(PK)
item_description
item_price
Fig. 9.8
customers
customer_id(PK)
customer_name
customer_address
customer_city
customer_state
Solution: NF3 Cont’d
• Restore relationship by creating a foreign key (indicated by (FK))
in orders
– As we know, an FK is a column that points to the PK in another table.
– Fig 9.9 describes this relationship, and shows our completed ERD.
• Relationship between orders & customers may be expressed in
this way:
– each order is made by one, and only one customer;
– each customer can make any number of orders, including zero
orders
order_id(PK)
customer_id(FK)
order_date
order_items
order_id(PK)
item_id(PK)
item_qty
items
item_id(PK)
item_description
item_price
Fig. 9.9
customers
customer_id(PK)
customer_name
customer_address
customer_city
customer_state
Solution: NF3 Cont’d
• Last point to note:
– order_id and item_id columns in order_items perform a dual
purpose: not only do they function as the (concatenated) PK for
order_items, they also individually serve as FKs to the orders
table and items table respectively.
– This is shown in Fig. 9.10
orders
order_id(PK)
customer_id(FK)
order_date
order_items
order_id(FK)
item_id(FK)
item_qty
items
item_id(PK)
item_description
item_price
Fig. 9.10
customers
customer_id(PK)
customer_name
customer_address
customer_city
customer_state
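The completed schema can be sketched in sqlite3 and used to answer question 1 from the start of the chapter. Assumptions: SQLite stands in for the RDBMS, the customer address columns are omitted for brevity, item_price is in cents, and every row (including the '4in bolt' description and the quantities) is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        order_date  TEXT NOT NULL
    );
    CREATE TABLE items (
        item_id          INTEGER PRIMARY KEY,
        item_description TEXT NOT NULL,
        item_price       INTEGER NOT NULL
    );
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL REFERENCES orders(order_id),
        item_id  INTEGER NOT NULL REFERENCES items(item_id),
        item_qty INTEGER NOT NULL,
        PRIMARY KEY (order_id, item_id)
    );
    INSERT INTO customers VALUES (56, 'Frankenstein Parts'), (57, 'Other Customer');
    INSERT INTO orders VALUES (125, 56, '2002-09-13'), (126, 56, '2003-01-05'),
                              (127, 57, '2002-10-01'), (128, 56, '2002-11-20');
    INSERT INTO items VALUES (563, '4in bolt', 350), (851, 'washer', 25);
    INSERT INTO order_items VALUES (125, 563, 4), (125, 851, 32), (126, 563, 10),
                                   (127, 563, 7), (128, 563, 6);
""")

# How many 4in bolts did Frankenstein Parts order in 2002?
bolts_2002 = conn.execute("""
    SELECT SUM(oi.item_qty)
    FROM order_items AS oi
    JOIN orders    AS o ON o.order_id    = oi.order_id
    JOIN customers AS c ON c.customer_id = o.customer_id
    JOIN items     AS i ON i.item_id     = oi.item_id
    WHERE c.customer_name = 'Frankenstein Parts'
      AND i.item_description = '4in bolt'
      AND o.order_date BETWEEN '2002-01-01' AND '2002-12-31'
""").fetchone()[0]
print(bolts_2002)

# The FK stops an order from pointing at a customer that does not exist.
try:
    conn.execute("INSERT INTO orders VALUES (999, 12345, '2002-01-01')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

The customer name is stored once, in customers, however many orders that customer places.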
Normalisation cont’d
Introduction to Database Design
• As we have seen, an important part of database design
is deciding on a suitable logical structure or schema to
implement ... called database design.
• Considering supplier parts example (S,P,SP)
there is a feeling of correctness.
• Normalisation theory is a
formalism of simple ideas with a
practical application
in logical database schema design.
• Normalisation theory should allow us to
recognise relations with undesirable
properties, tell us what is "wrong" & how to "correct" it.

S
S#   SName  Status  City
S1   Smith  20      Paris
S2   Jones  10      Paris
S3   Blake  30      Rome

P
P#   PName  Colour  Weight  City
P1   Nut    Red     12      London
P2   Bolt   Green   17      Paris
P3   Screw  Blue    27      Rome
P4   Screw  Red     14      London

SP
S#   P#   QTY
S1   P1   300
S1   P2   200
S1   P3   400
S2   P1   300
S2   P2   400
S3   P2   200
Intro to Database Design Cont’d
• Normalisation theory is built around normal forms - each normal
form has a set of satisfiable criteria.
• Normal forms exist in a hierarchy:
– 1NF -> 2NF -> 3NF -> BCNF -> 4NF -> PJ/NF (5NF)
• Codd defined 1NF, 2NF, 3NF in 1972.
• 3NF had inadequacies so revised in '74 by Boyce/Codd (BCNF).
• 1977 Fagin defined 4NF, 1979 defined 5NF.
• 6NF, 7NF ?... dependency theory suggests there may be higher
NFs but not practicable in database environment.
• DB designers should aim for higher NFs but this is not law - just
recommended, as normalisation simply provides guidelines for
database design.
• There are often good reasons for not using normalisation theory.
Introduction to Database Design Cont’d
• In order to describe the various normal forms we must
first introduce some definitions:
• Functional Dependency
– Given relation R, attribute Y of R is functionally dependent on X
of R, R.X -> R.Y, or R.X functionally determines R.Y ...
– ... iff each R.X value has associated with it precisely one R.Y
value, where X and/or Y may be composite.
– R.X called the determinant, R.Y called the dependent
• S.SNAME, S.STATUS and S.CITY are each functionally
dependent on S.S#
• If R.X is a candidate key or if R.X is the primary key,
then all R.Y must be functionally dependent on R.X
• In SP we have a composite primary key so
SP.(S#,P#) -> SP.QTY
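The definition can be checked against the sample S and SP tuples with a small Python function. The name holds is hypothetical. Note that a sample of tuples can only refute a dependency; a True result shows only that these tuples do not contradict it.

```python
def holds(rows, X, Y):
    """True iff no two rows agree on attributes X but differ on attributes Y."""
    seen = {}
    for r in rows:
        x = tuple(r[a] for a in X)
        y = tuple(r[a] for a in Y)
        # Remember the first Y value seen for this X value; any later
        # row with the same X must carry the same Y.
        if seen.setdefault(x, y) != y:
            return False
    return True

S = [
    {"S#": "S1", "SNAME": "Smith", "STATUS": 20, "CITY": "Paris"},
    {"S#": "S2", "SNAME": "Jones", "STATUS": 10, "CITY": "Paris"},
    {"S#": "S3", "SNAME": "Blake", "STATUS": 30, "CITY": "Rome"},
]
SP = [
    {"S#": "S1", "P#": "P1", "QTY": 300},
    {"S#": "S1", "P#": "P2", "QTY": 200},
    {"S#": "S1", "P#": "P3", "QTY": 400},
    {"S#": "S2", "P#": "P1", "QTY": 300},
    {"S#": "S2", "P#": "P2", "QTY": 400},
    {"S#": "S3", "P#": "P2", "QTY": 200},
]

print(holds(S, ["S#"], ["SNAME", "STATUS", "CITY"]))  # S.S# determines the rest
print(holds(SP, ["S#", "P#"], ["QTY"]))               # composite determinant
print(holds(SP, ["S#"], ["QTY"]))                     # S1 has three different QTYs
```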
Introduction to Database Design Cont’d
• There is no requirement in the definition of functional
dependence that R.X be a candidate key, thus:
R.X -> R.Y iff whenever 2 tuples of R agree on their X value then
they also agree on their Y value.
– R.Y is fully functionally dependent on R.X ….
– …. iff it is functionally dependent on R.X & not functionally
dependent on any proper subset of R.X
– Example:
S.(S#,STATUS) -> S.CITY is true but not full functional
dependence as S.S# -> S.CITY
– If R.X -> R.Y but not fully then R.X must be composite
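Full functional dependence can be sketched the same way: test the dependency, then test every non-empty proper subset of the determinant. The function names are hypothetical, and as before the check is against sample tuples only.

```python
from itertools import combinations

def holds(rows, X, Y):
    """True iff no two rows agree on attributes X but differ on attributes Y."""
    seen = {}
    for r in rows:
        x = tuple(r[a] for a in X)
        y = tuple(r[a] for a in Y)
        if seen.setdefault(x, y) != y:
            return False
    return True

def fully_dependent(rows, X, Y):
    """True iff X -> Y holds and no non-empty proper subset of X determines Y."""
    if not holds(rows, X, Y):
        return False
    for k in range(1, len(X)):            # subset sizes 1 .. len(X)-1
        for subset in combinations(X, k):
            if holds(rows, list(subset), Y):
                return False
    return True

S = [
    {"S#": "S1", "SNAME": "Smith", "STATUS": 20, "CITY": "Paris"},
    {"S#": "S2", "SNAME": "Jones", "STATUS": 10, "CITY": "Paris"},
    {"S#": "S3", "SNAME": "Blake", "STATUS": 30, "CITY": "Rome"},
]
SP = [
    {"S#": "S1", "P#": "P1", "QTY": 300},
    {"S#": "S1", "P#": "P2", "QTY": 200},
    {"S#": "S1", "P#": "P3", "QTY": 400},
    {"S#": "S2", "P#": "P1", "QTY": 300},
    {"S#": "S2", "P#": "P2", "QTY": 400},
    {"S#": "S3", "P#": "P2", "QTY": 200},
]

print(holds(S, ["S#", "STATUS"], ["CITY"]))            # the FD itself holds
print(fully_dependent(S, ["S#", "STATUS"], ["CITY"]))  # but S# alone is enough
print(fully_dependent(SP, ["S#", "P#"], ["QTY"]))      # needs both parts
```

A single-attribute determinant has no non-empty proper subsets, so a dependency on it is always full; this is why a non-full dependency needs a composite R.X.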
Normalisation: Example 2
• Given the report in Fig 9.11, need to put it in a tidy DB.
• Problems with current form:
– PROJ_NUM is supposed to be PK or part of PK but contains nulls.
Maybe PROJ_NUM+EMP_NUM will define each row.
– The table entries contain inconsistencies (e.g. JOB_CLASS
“Elect. Engineer” could be “EE” or “E. Eng” or others)
Fig. 9.11
Normalisation: Example 2 Cont’d
• Further problems with current form:
– The table has data redundancies leading to the following
anomalies:
1. Update Anomalies: Modifying (e.g.) JOB_CLASS for Employee 105 requires
lots of alterations (one for each row in which employee 105 appears).
2. Insertion Anomalies: To complete a row definition, a new employee must
be given a project; if not yet assigned, this must be assumed to complete
the employee tuple.
3. Deletion Anomalies: If employee 103 quits, every row with EMP_NUM=103
must be deleted with the potential loss of other data.
– Inefficiency: If a large number of new employees are hired, a
lot of redundant/unassigned data must be assumed and input.
– Integrity: Possible data integrity problems may arise out of the
above.
Example 2: Conversion to NF1
• So… Problems with Fig. 9.11:
– Data cannot be as shown in Fig. 9.11 because have to be able to
identify all tuples with a PK.
– PROJ_NUM cannot be PK in Fig. 9.11 because of nulls
– Cannot have the repeating groups shown in Fig. 9.11 so have
to alter table to remove them.
• Step 1. Eliminate the repeating groups
– Eliminate the null values.
– Now have Fig. 9.12
Fig. 9.12
Example 2: Conversion to NF1 Cont’d
• Step 2. Identify the Primary Key
– Layout in Fig. 9.12 is only a cosmetic change – need a PK to
uniquely identify all tuples.
– This may be seen to be PROJ_NUM+EMP_NUM
• Step 3. Identify all dependencies
– The identification of the PK means already have the following:
PROJ_NUM, EMP_NUM -> PROJ_NAME, EMP_NAME, JOB_CLASS, CHG_HOUR, HOURS
Fig. 9.12
Example 2: Conversion to NF1 Cont’d
• Step 3. Cont’d
– But there are additional dependencies:
1. The project number determines the project name:
PROJ_NUM -> PROJ_NAME
2. If know employee number, also know their name, job classification and
their charge per hour:
EMP_NUM -> EMP_NAME, JOB_CLASS, CHG_HOUR
3. Also knowing job classification means also know the charge per hour:
JOB_CLASS -> CHG_HOUR
– These dependencies are shown in the Dependency Diagram in
Fig. 9.13
– Dependency Diagrams are useful for getting an overall view of
relationships among attributes.
Fig. 9.13: Dependency diagram over PROJ_NUM, PROJ_NAME, EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR and HOURS, with arrows labelled Normal, Partial and Transitive.
Example 2: Conversion to NF1 Cont’d
• Looking at Fig. 9.13, can see that:
1. PK attributes are bold, underlined and a different colour.
2. Arrows above (blue) denote desirable FDs (those based on PK)
3. Arrows below the diagram (red and green) are less desirable:
a) Partial Dependencies: dependencies based on part of composite PK
– Need only know PROJ_NUM to know PROJ_NAME, so PROJ_NAME is only
dependent on part of the PK.
– Need only know EMP_NUM to find the EMP_NAME, JOB_CLASS,
CHG_HOUR.
b) Transitive Dependencies: Dependency of 1 non-prime attribute on another
– From Fig. 9.13, can see that CHG_HOUR is dependent on JOB_CLASS
– Neither of these is part of the PK (i.e. neither is a Prime Attribute).
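The three kinds of arrow can be told apart mechanically from the PK and the list of dependencies. A sketch only: the function name classify and the set-based encoding are assumptions, and the "transitive" test is the simplified one used on this slide (a non-prime determinant with non-prime dependents).

```python
PK = {"PROJ_NUM", "EMP_NUM"}

# (determinant, dependents) pairs, as listed in Step 3.
FDS = [
    ({"PROJ_NUM", "EMP_NUM"},
     {"PROJ_NAME", "EMP_NAME", "JOB_CLASS", "CHG_HOUR", "HOURS"}),
    ({"PROJ_NUM"}, {"PROJ_NAME"}),
    ({"EMP_NUM"}, {"EMP_NAME", "JOB_CLASS", "CHG_HOUR"}),
    ({"JOB_CLASS"}, {"CHG_HOUR"}),
]

def classify(determinant, dependents, pk):
    """Label a dependency as normal, partial, transitive or other."""
    if determinant == pk:
        return "normal"        # based on the whole PK
    if determinant < pk:
        return "partial"       # based on a proper subset of the PK
    if determinant.isdisjoint(pk) and dependents.isdisjoint(pk):
        return "transitive"    # non-prime attribute determines non-prime
    return "other"

labels = [classify(x, y, PK) for x, y in FDS]
for (x, y), label in zip(FDS, labels):
    print(sorted(x), "->", sorted(y), ":", label)
```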
Fig. 9.13: Dependency diagram over PROJ_NUM, PROJ_NAME, EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR and HOURS, with arrows labelled Normal, Partial and Transitive.
Example 2: Conversion to NF1 Cont’d
• Properties of NF1: A table in NF1 must have:
1. All key attributes defined
2. No repeating groups in the table (i.e. each row/column entry
must have only one value)
• Problem with Fig. 9.13 is the partial dependencies.
• These can be eliminated with NF2
Example 2: Conversion to NF2
• Step 1. Identify all key components:
PROJ_NUM
EMP_NUM
PROJ_NUM, EMP_NUM
– Each component becomes the key of a new table.
– Three new tables project, employee, assign
• Step 2. Identify the dependent attributes
– Use Fig. 9.13 to determine which attributes are dependent on
which others, using the arrows in the dependency diagram
project(PROJ_NUM, PROJ_NAME)
employee(EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)
assign(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
– Results are shown in Fig. 9.14
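The split into three tables is a projection of the NF1 rows onto each attribute group, with duplicate tuples dropped. A sketch: the rows below are made up (they are not the data of Fig. 9.12) and project_onto is a hypothetical helper name.

```python
ATTRS = ["PROJ_NUM", "PROJ_NAME", "EMP_NUM", "EMP_NAME",
         "JOB_CLASS", "CHG_HOUR", "HOURS"]

# Hypothetical NF1 rows, one per (project, employee) pair.
DATA = [
    (1, "Alpha", 101, "Ann", "Engineer", 65, 10),
    (1, "Alpha", 102, "Bob", "Analyst",  50, 20),
    (2, "Beta",  101, "Ann", "Engineer", 65, 5),
    (2, "Beta",  103, "Cat", "Engineer", 65, 8),
]
rows = [dict(zip(ATTRS, t)) for t in DATA]

def project_onto(rows, attrs):
    """Relational projection: keep attrs, drop duplicates, keep first-seen order."""
    return list(dict.fromkeys(tuple(r[a] for a in attrs) for r in rows))

project = project_onto(rows, ["PROJ_NUM", "PROJ_NAME"])
employee = project_onto(rows, ["EMP_NUM", "EMP_NAME", "JOB_CLASS", "CHG_HOUR"])
# HOURS is the column that the slides rename to ASSIGN_HOURS.
assign = project_onto(rows, ["PROJ_NUM", "EMP_NUM", "HOURS"])

print(project)
print(employee)
print(assign)
```

Each project name and each employee's details are now stored once, but employee still repeats the charge rate for every employee in the same job class.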
Example 2: Conversion to NF2 Cont’d
• At this point, most anomalies discussed above have been
eliminated e.g. if want to add/change/delete a project record,
only need to alter 1 row of project
• So a table is in NF2 iff
1. It is in NF1
And
2. It has no partial dependencies (can still have transitive
dependencies)
• Fig. 9.14 still has a transitive dependency which can generate
anomalies e.g. if charge per hour changes for a job classification
held by many employees, that change must be made for all
(leading to possible update anomalies)
• Resolve transitive dependencies in NF3
Fig. 9.14
project(PROJ_NUM, PROJ_NAME)
employee(EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)
assign(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
Example 2: Conversion to NF3
• Step 1. Identify each new determinant
– For each transitive dependency, write its determinant as a PK for
a new table (recall: determinant is any attribute whose value
determines other values within a row).
– If have 3 transitive dependencies, have 3 different determinants
– Here only have one: JOB_CLASS
• Step 2. Identify the dependent attributes
– Identify the attributes dependent on each determinant identified
in Step 1. Here, have
JOB_CLASS -> CHG_HOUR
– Name the table to reflect its contents & function, here JOB is ok
• Step 3. Remove dependent attributes from transitive
dependencies
– For each transitive dependency, remove its dependent attributes
from the table in which the dependency occurs (here, CHG_HOUR
leaves the employee table)
– JOB_CLASS remains in the employee table as FK
Example 2: Conversion to NF3
• Final dependency diagram is shown in Fig. 9.15
Fig. 9.15: Final dependency diagram for the project, employee, job and assign tables.
• Or 4 Tables:
project(PROJ_NUM, PROJ_NAME)
assign(EMP_NUM, PROJ_NUM, ASSIGN_HOURS)
employee(EMP_NUM, EMP_NAME, JOB_CLASS)
job(JOB_CLASS, CHG_HOUR)
• A table is in NF3 iff
– It is in NF2
And
– It contains no transitive dependencies.
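The 4-table result can be tried out in sqlite3 to see the update anomaly disappear. A sketch under assumptions: SQLite stands in for the RDBMS, column types are guessed, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE project  (PROJ_NUM INTEGER PRIMARY KEY, PROJ_NAME TEXT);
    CREATE TABLE job      (JOB_CLASS TEXT PRIMARY KEY, CHG_HOUR INTEGER);
    CREATE TABLE employee (EMP_NUM INTEGER PRIMARY KEY, EMP_NAME TEXT,
                           JOB_CLASS TEXT REFERENCES job(JOB_CLASS));
    CREATE TABLE assign   (PROJ_NUM INTEGER REFERENCES project(PROJ_NUM),
                           EMP_NUM  INTEGER REFERENCES employee(EMP_NUM),
                           ASSIGN_HOURS INTEGER,
                           PRIMARY KEY (PROJ_NUM, EMP_NUM));
    INSERT INTO project  VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO job      VALUES ('Engineer', 65), ('Analyst', 50);
    INSERT INTO employee VALUES (101, 'Ann', 'Engineer'), (102, 'Bob', 'Analyst'),
                                (103, 'Cat', 'Engineer');
    INSERT INTO assign   VALUES (1, 101, 10), (1, 102, 20), (2, 101, 5), (2, 103, 8);
""")

# Changing the rate for a job class held by several employees touches one row.
changed = conn.execute(
    "UPDATE job SET CHG_HOUR = 70 WHERE JOB_CLASS = 'Engineer'"
).rowcount

# Total charge per employee, rebuilt by joining the tables.
charges = conn.execute("""
    SELECT e.EMP_NAME, SUM(a.ASSIGN_HOURS * j.CHG_HOUR)
    FROM assign AS a
    JOIN employee AS e ON e.EMP_NUM   = a.EMP_NUM
    JOIN job      AS j ON j.JOB_CLASS = e.JOB_CLASS
    GROUP BY e.EMP_NUM, e.EMP_NAME
    ORDER BY e.EMP_NUM
""").fetchall()

print(changed)
print(charges)
```

Both engineers pick up the new rate through the join, with no second copy of CHG_HOUR to fall out of step.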