Relational Algebra
Introduction to databases
CSCC43 Winter 2011
Ryan Johnson
Thanks to Arnold Rosenbloom and Renee Miller for material in these slides

What is the relational model?
• Logical representation of data
– Two-dimensional tables (relations)
• Formal system for manipulating relations
– Relational algebra
• Result
– High-level (logical, declarative) description of data
– Mechanical rules for rewriting/optimizing low-level access
– Formal methods to reason about soundness
Relational algebra is the key

Why the relational model?
• Sounds good: matches how we think about data
• Real reason: data independence!
• Earlier models tied to physical data layout
– Procedural access to data (low-level, explicit access)
– Relationships stored in data (linked lists, trees, etc.)
– Change in data layout => application rewrite
• Relational model
– Declarative access to data (system optimizes for you)
– Relationships specified by queries (schemas help, too)
– Develop, maintain apps and data layout separately
Similar battle today with languages

Comparing data models
Student job example
Mary (M) and Xiao (X) both work at Tim Hortons (T)
Jaspreet (J) works at both Bookstore (B) and Wind (W)
• Hierarchical (tree)
• Network (graph)
• Relational (table)
[Figure: the same student/employer data drawn as a hierarchical tree, as a network graph, and as relational tables]
Relational tables:
S: J …, M …, X …
E: B …, T …, W …
R: (M, T), (X, T), (J, B), (J, W)

Relations and tuples
Relation name: R
Heading (schema): set of attributes (columns) a1 … am; arity: m=|schema(R)|
Body: set of tuples (rows) t1 … tn; cardinality: n=|R|
Value (field, component) vi,j: atomic (no sub-tuples)

R    a1    …  am
t1   v1,1  …  v1,m
…    …     …  …
tn   vn,1  …  vn,m

Set-based: arbitrary row/col ordering
Logical: physical layout might be *very* different!

What is an algebra?
• Operands (values)
– Variables, constants
– Closed domain
• Operators
+ “Addition”
* “Multiplication”
• Expressions
– Combine operations with parentheses (explicit)
– OR using operator precedence (implied)
• Laws
– Identify semantically equivalent expressions
– Commutativity, associativity, etc.

Example algebra: integer arithmetic
• Domain: integers
… -100, … -1, 0, 1, … 100, …
• Operators: -, +, *
‘-’ is unary negation
Why is division not there?
• Expressions
2*a + 5*(c - d) + e
– ((2*a) + ((5*(c + (-d))) + e))
• Laws
– a*b = b*a (Commutative)
– a*(b*c) = (a*b)*c (Associative)
– a*(b+c) = a*b + a*c (Distributive)
Allows formal, sound, mechanical rewriting

Writing expressions
• Fully parenthesized
((2*a) + ((5*(c + (-d))) + e))
• Operator precedence
2*a + 5*(c - d) + e
• Linear form
x = 2*a
y = 5*(c-d)
z = x + y + e
• Tree form
[Figure: expression tree for 2*a + 5*(c - d) + e]
Allows compilers to reason about, optimize
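
To make the tree form and the laws concrete, here is a minimal sketch in Python. The class names (`Var`, `Const`, `Neg`, `Add`, `Mul`), the `evaluate` and `distribute` helpers, and the sample values in `env` are all illustrative assumptions, not part of the slides. It builds the fully parenthesized expression as a tree, evaluates it, and applies the distributive law as a mechanical rewrite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Neg:
    operand: object


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


def evaluate(expr, env):
    """Evaluate an expression tree bottom-up over the integers."""
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Var):
        return env[expr.name]
    if isinstance(expr, Neg):
        return -evaluate(expr.operand, env)
    if isinstance(expr, Add):
        return evaluate(expr.left, env) + evaluate(expr.right, env)
    if isinstance(expr, Mul):
        return evaluate(expr.left, env) * evaluate(expr.right, env)
    raise TypeError(f"not an expression: {expr!r}")


def distribute(expr):
    """Rewrite x*(y+z) into x*y + x*z at the root; leave anything else alone."""
    if isinstance(expr, Mul) and isinstance(expr.right, Add):
        return Add(Mul(expr.left, expr.right.left),
                   Mul(expr.left, expr.right.right))
    return expr


# Tree form of ((2*a) + ((5*(c + (-d))) + e))
tree = Add(
    Mul(Const(2), Var("a")),
    Add(Mul(Const(5), Add(Var("c"), Neg(Var("d")))), Var("e")),
)

env = {"a": 3, "c": 7, "d": 4, "e": 1}  # arbitrary sample values (assumption)

# Linear form: x = 2*a, y = 5*(c-d), z = x + y + e
x = 2 * env["a"]
y = 5 * (env["c"] - env["d"])
z = x + y + env["e"]

inner = Mul(Const(5), Add(Var("c"), Neg(Var("d"))))
rewritten = distribute(inner)  # 5*c + 5*(-d)

print(evaluate(tree, env), z, evaluate(inner, env), evaluate(rewritten, env))
# prints: 22 22 15 15
```

The rewrite changes the shape of the tree but not its value, which is the property a query optimizer relies on when it applies algebraic laws.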

Relational algebra
• Values
– Finite relations (cardinality and arity both bounded)
– Attributes may or may not be typed
• Operators
– Unary: σ, π, ρ
– “Additive” (set): ∪, ∩, −
– “Multiplicative”: ×, ⋈
– [details to come]
• Expressions
– Same as arithmetic, but called “queries”
• Laws
– Allow “query rewriting”
– Basis for query optimization
– [details to come]
Expressive power equivalent to 1st order logic

Unary operators: select (σ)
• σ_P(R) outputs tuples of R which satisfy P
Removes unwanted rows from relation

Unary operators: project (π)
• π_{A,B,C}(R) outputs attributes A,B,C of relation R
Removes unwanted columns from relation

Unary operators: rename (ρ)
• ρ_{S(A,B,C)}(R) renames attributes of R to A,B,C and calls the result S
– ρ_S(R) renames relation R (same attributes)
– ρ_{A=X,C=Y}(R) renames attributes A and C only
Modifies schema only (body unchanged)
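
The three unary operators can be sketched over a toy in-memory relation. This is only an illustration under stated assumptions: the `Relation` type, the function names, and the attribute names `student` and `employer` are invented here, and the data is the student job example from the earlier slide.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    schema: tuple      # attribute names, in column order
    body: frozenset    # set of value tuples (set semantics: no duplicates)


def select(pred, r):
    """sigma_P(R): keep the tuples for which the predicate holds."""
    rows = frozenset(t for t in r.body if pred(dict(zip(r.schema, t))))
    return Relation(r.schema, rows)


def project(attrs, r):
    """pi_attrs(R): keep only the named columns; the set drops duplicates."""
    idx = [r.schema.index(a) for a in attrs]
    rows = frozenset(tuple(t[i] for i in idx) for t in r.body)
    return Relation(tuple(attrs), rows)


def rename(mapping, r):
    """rho(R): change attribute names only; the body is untouched."""
    return Relation(tuple(mapping.get(a, a) for a in r.schema), r.body)


works = Relation(
    ("student", "employer"),
    frozenset({
        ("Mary", "Tim Hortons"),
        ("Xiao", "Tim Hortons"),
        ("Jaspreet", "Bookstore"),
        ("Jaspreet", "Wind"),
    }),
)

at_tims = select(lambda t: t["employer"] == "Tim Hortons", works)
students = project(["student"], works)        # Jaspreet appears once
renamed = rename({"student": "name"}, works)  # same body, new heading

print(len(at_tims.body), len(students.body), renamed.schema)
# prints: 2 3 ('name', 'employer')
```

Because the body is a set, projecting onto `student` collapses Jaspreet's two jobs into a single tuple, which is the set-semantics behaviour the later bag-semantics slides revisit.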

Additive operators (∪, ∩, −)
• Standard set operators
• Operate on tuples within input relations
Input schemas must match
[Figure: union of two relations, {1, 2, 3} ∪ {A, B, C}]

Cartesian product (×)
• T = R × S contains every pairwise combination of tuples from R and S
– schema(T) = schema(R) ∪ schema(S)
– |T| = |R|*|S|
Input schemas must not overlap
Example: {1, 2, 3} × {A, B, C} = {(1,A), (1,B), (1,C), (2,A), (2,B), (2,C), (3,A), (3,B), (3,C)}

Natural join (⋈)
• T = R ⋈ S merges tuples from R and S having equal values where their schemas overlap
– schema(R) ∩ schema(S) ≠ Ø
– |T| ≤ |R|*|S|, usually ≈ max(|R|, |S|) (“join cardinality”)
Equivalent to π(σ(R × ρ(S)))
“Generalized intersection”
[Figure: natural join of two relations that share one attribute]

Comparison: ∩ vs. ⋈ vs. ×
• Same general operation
– Union schemas (as with ×)
– Test “overlapping” parts of tuples for equality
– Combine “matching” pairs (ignore others)
• Differing degrees of schema overlap
– Intersection (full overlap)
– Natural join (partial overlap)
– Cartesian product (no overlap)
• Degenerate cases
– No overlap: ×
– Full overlap: ∩
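
The claim that ∩ and × are degenerate cases of ⋈ can be checked with a small sketch. The representation (each tuple as a frozenset of attribute/value pairs) and the helper names `rel` and `natural_join` are assumptions made for this illustration; the nested loop is the naive definition, not an efficient join algorithm.

```python
def rel(*rows):
    """Build a relation (a set of tuples) from dicts of attribute -> value."""
    return {frozenset(row.items()) for row in rows}


def natural_join(r, s):
    """Merge every pair of tuples that agree on all shared attributes."""
    out = set()
    for t in r:
        for u in s:
            dt, du = dict(t), dict(u)
            shared = dt.keys() & du.keys()
            if all(dt[a] == du[a] for a in shared):
                out.add(t | u)  # union of the two tuples' attribute/value pairs
    return out


r = rel({"A": 1, "B": 2}, {"A": 3, "B": 4})

# Partial overlap (shared attribute B): an ordinary natural join.
s = rel({"B": 2, "C": 5}, {"B": 9, "C": 6})
partial = natural_join(r, s)

# No overlap: every pair matches, so the result is the Cartesian product.
p = rel({"C": 5}, {"C": 6})
no_overlap = natural_join(r, p)

# Full overlap: a pair matches only if the tuples are identical (intersection).
r2 = rel({"A": 1, "B": 2}, {"A": 7, "B": 8})
full_overlap = natural_join(r, r2)

print(len(partial), len(no_overlap), full_overlap == (r & r2))
# prints: 1 4 True
```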

Mathematical power vs. efficiency
• Note that × expresses both ∩ and ⋈
=> Mathematically, intersection and joins unnecessary
• Why bother with them? Two big reasons
• Notation
– π(σ(R × ρ(S))) vs. R ∩ S
– Cartesian product seldom useful
Why not?
• Performance
– Efficient algorithms compute result directly
=> |R|*|S| rows vs. min(|R|,|S|)
Consider |R|=|S|=10^6

Equijoin
• Like natural join, but using arbitrary attributes
– Very common due to foreign keys in relations
• Written as R ⋈_{A=X,B=Y,…} S
– Attribute names in R and S can differ
– Still compare values for equality
• Equivalent to R ⋈ ρ(S)

Theta join
• Written as T = R ⋈_P S
– Outputs pairwise combinations of tuples which satisfy P
– Join cardinality: |T| ≤ |R|*|S|
• Most general join
– Arbitrary join predicate (not just equality)
• Equivalent to σ_P(R × ρ(S))
– Schemas must not overlap
– Does not project away any attributes

Limitations of relational algebra
• Relational algebra is set-based
• Real-life applications need more
– Expensive (and often unnecessary) to eliminate duplicates
– Important (and often expensive) to order output
– Need a way to apply scalar expressions to values
– Database exists to distill data into useful knowledge
– What’s *not* there often as important as what is
Why not?
Answer: non-set extensions
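
Returning to the performance argument on the join slides above, the sketch below contrasts a theta join written literally as a selection over the Cartesian product with an equijoin computed directly through a hash index. The function names and the positional-tuple layout are assumptions for illustration, and a hash join is just one common technique, not necessarily the one the course has in mind. The data is the Emp and Sales example used later in the deck.

```python
from collections import defaultdict
from itertools import product


def theta_join(r, s, pred):
    """sigma_P(R x S): enumerate all |R|*|S| pairs, keep those satisfying P."""
    return [t + u for t, u in product(r, s) if pred(t, u)]


def hash_equijoin(r, s, r_col, s_col):
    """Equijoin on r[r_col] == s[s_col] without building the full product."""
    index = defaultdict(list)
    for u in s:                       # one pass over S to build the index
        index[u[s_col]].append(u)
    # one pass over R, probing the index for matching S tuples
    return [t + u for t in r for u in index.get(t[r_col], [])]


emp = [(1, "Mary"), (2, "Xiao"), (3, "Jaspreet")]   # (EID, Name)
sales = [(1, 20), (3, 10), (3, 15)]                 # (EID, Value)

via_product = theta_join(emp, sales, lambda t, u: t[0] == u[0])
via_hash = hash_equijoin(emp, sales, 0, 0)

# A non-equality predicate still works with the general theta join.
lower_id = theta_join(emp, sales, lambda t, u: t[0] < u[0])

print(via_product == via_hash, len(via_hash), len(lower_id))
# prints: True 3 4
```

With |R| = |S| = 10^6 the product-based version would examine 10^12 pairs, while the hash-based version makes one pass over each input plus the output.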

Extension: bag semantics
• In practice, relations are bags (multisets)
– Sometimes people purposefully insert duplicates
– Projections produce duplicates
• Example: {1,2,1,1,3} is a bag (still unordered!)
• Most operators still work
– Select, rename unchanged
– Project no longer eliminates duplicates
– Set operations need tweaks
– Joins tend to multiply the number of duplicates
• Some laws no longer apply

Projection and duplicates
• Consider a relation R modeling cars for sale
• What does π_Make(R) return?
• What *should* π_Make(R) return?

Make    Model   Color
Toyota  Prius   Gray
Toyota  Prius   Red
Honda   Accord  Green
Honda   Accord  Red
Honda   Accord  Red
Ford    Echo    Red
Ford    Echo    Gray
Ford    Echo    White

Bag versions of set operations
• Union
– Concatenation (except unordered)
– {1, 1, 2, 3} ∪ {2, 2, 3, 4} = {1, 1, 2, 3, 2, 2, 3, 4}
• Intersection
– Take minimum count of each value
– {1, 1, 2, 3} ∩ {2, 2, 3, 4} = {2, 3}
Similar to a join?
• Difference
– Each occurrence on right can cancel one occurrence on left
– {1, 1, 2, 3} − {1, 2, 3, 4} = {1}
• Union, intersection no longer distribute
– {1} ∩ ({1} ∪ {1}) vs. ({1} ∩ {1}) ∪ ({1} ∩ {1})
– {1} ∩ {1, 1} vs. {1} ∪ {1}
– {1} ≠ {1,1}

Duplicate elimination (δ)
• Consider a relation R modeling cars for sale
• What does π_Make(R) return?
• δ(π_Make(R)) is a set

Make    Model   Color
Toyota  Prius   Gray
Toyota  Prius   Red
Honda   Accord  Green
Honda   Accord  Red
Honda   Accord  Red
Ford    Echo    Red
Ford    Echo    Gray
Ford    Echo    White

δ(π_Make(R)):
Make
Honda
Ford
Toyota

Duplicates important for summaries (“how many”)
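
The bag operations above map closely onto Python's `collections.Counter`, so the examples on the slides can be replayed directly. This is a sketch, with `bag` as a hypothetical helper name; note that the slides' bag union (concatenation) corresponds to Counter's `+`, not its `|` operator, which takes the maximum count.

```python
from collections import Counter


def bag(*values):
    """A bag (multiset): maps each value to its number of occurrences."""
    return Counter(values)


left, right = bag(1, 1, 2, 3), bag(2, 2, 3, 4)

union = left + right                           # concatenation: counts add up
intersection = left & right                    # minimum count of each value
difference = bag(1, 1, 2, 3) - bag(1, 2, 3, 4) # right cancels left, one for one

# Intersection no longer distributes over union under bag semantics.
one = bag(1)
lhs = one & (one + one)          # {1} ∩ {1, 1} = {1}
rhs = (one & one) + (one & one)  # {1} ∪ {1}   = {1, 1}

# Projection keeps duplicates; duplicate elimination turns the bag into a set.
cars = [
    ("Toyota", "Prius", "Gray"), ("Toyota", "Prius", "Red"),
    ("Honda", "Accord", "Green"), ("Honda", "Accord", "Red"),
    ("Honda", "Accord", "Red"), ("Ford", "Echo", "Red"),
    ("Ford", "Echo", "Gray"), ("Ford", "Echo", "White"),
]
makes = Counter(make for make, _model, _color in cars)  # pi_Make(R) as a bag
distinct_makes = set(makes)                             # delta(pi_Make(R))

print(sorted(union.elements()), sorted(intersection.elements()),
      sorted(difference.elements()), lhs == rhs)
# prints: [1, 1, 2, 2, 2, 3, 3, 4] [2, 3] [1] False
```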

Summarizing groups of tuples (1)

Student  Year  Dept   Course  Grade
Xiao     2009  CS     A08     B-
Xiao     2009  CS     A48     B
Xiao     2009  CS     A65     B+
Xiao     2009  Math   A23     B
Xiao     2009  Math   A30     B+
Xiao     2009  Math   A37     A
Xiao     2010  CS     B07     B
Xiao     2010  CS     B09     B-
Xiao     2010  CS     B36     B-
Xiao     2010  CS     B58     B
Xiao     2010  Math   B24     A-
Xiao     2010  Math   B41     B
Xiao     2010  Stats  B52     B-
Xiao     2011  CS     C24     B+
Xiao     2011  CS     C43     A-
Xiao     2011  CS     C69     A

All courses Xiao has taken
All courses Xiao took in 2010
All math courses Xiao took in 2010

Summarizing groups of tuples (2)
How to summarize this??

Student  Dept   Year  Course  Grade
Xiao     CS     2009  ?       ?
Xiao     Math   2009  ?       ?
Xiao     CS     2010  ?       ?
Xiao     Math   2010  ?       ?
Xiao     Stats  2010  B52     B-
Xiao     CS     2011  ?       ?

These columns are easy… equal for every tuple in a group
Show the best grade? Worst grade? Average?

Summarizing groups of tuples (3)
• Description #1: want to output a single tuple which summarizes a set of related tuples
• Description #2: want to “collapse” a set of tuples into a single, “representative” tuple
• Questions
– How to identify related tuples (set to collapse)?
=> Grouping key: a subset of attributes to test for equality
– How to collapse a column into a value (summarize it)?
=> Use an aggregation function (sum, count, avg, min, max, …)

Grouping (γ)
• Duplicates useful when computing statistics
– min, max, sum, count, average, …
• γ_{A,B,C,f(X),g(Y),h(Z)}(R) computes aggregate values using some attributes as a grouping key
– Implicit projection (drops unreferenced attributes)
– A, B, C is the grouping key
– X, Y, Z are attributes to aggregate
– f, g, h are aggregating functions to apply
• Aggregating function: commutative, associative
– f(x,y) = f(y,x), f(x, f(y, z)) = f(f(x, y), z)
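
As a rough sketch of the grouping operator, the code below groups Xiao's courses by (Dept, Year) and fills in the "?" columns with a count and a best grade. The `group_by` helper, the output column names, and the `GRADE_ORDER` ranking (which covers only the grades that appear in the table) are assumptions made for this example.

```python
from collections import defaultdict

GRADE_ORDER = ["B-", "B", "B+", "A-", "A"]  # worst to best (assumed ranking)

COLUMNS = ("Year", "Dept", "Course", "Grade")
DATA = [
    (2009, "CS", "A08", "B-"), (2009, "CS", "A48", "B"),
    (2009, "CS", "A65", "B+"), (2009, "Math", "A23", "B"),
    (2009, "Math", "A30", "B+"), (2009, "Math", "A37", "A"),
    (2010, "CS", "B07", "B"), (2010, "CS", "B09", "B-"),
    (2010, "CS", "B36", "B-"), (2010, "CS", "B58", "B"),
    (2010, "Math", "B24", "A-"), (2010, "Math", "B41", "B"),
    (2010, "Stats", "B52", "B-"), (2011, "CS", "C24", "B+"),
    (2011, "CS", "C43", "A-"), (2011, "CS", "C69", "A"),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]


def group_by(rows, key, aggregates):
    """gamma: one output tuple per distinct key, plus one value per aggregate.

    aggregates maps an output column name to (input attribute, function).
    Attributes that are neither key nor aggregated are dropped.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    out = []
    for key_values, members in groups.items():
        result = dict(zip(key, key_values))
        for name, (attr, fn) in aggregates.items():
            result[name] = fn([m[attr] for m in members])
        out.append(result)
    return out


def best(grades):
    """Pick the highest grade according to GRADE_ORDER."""
    return max(grades, key=GRADE_ORDER.index)


summary = group_by(rows, ["Dept", "Year"],
                   {"Courses": ("Course", len), "Best": ("Grade", best)})
for line in summary:
    print(line)
```

Swapping `best` for a worst-grade or average function changes only the aggregate, not the grouping logic, which mirrors the split between grouping key and aggregating function on the slide.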

Grouping (γ)
• All tuples having the same key go to same group
– One output tuple for each unique key
– Output “group total” for each non-key attribute in group
– e.g. f(x1, f(x2, f(x3, …)))
[Figure: γ_{A,f(B),g(C)} applied to a relation with attributes A, B, C]

Duplicates and grouping
• Consider a relation R modeling cars for sale
• γ_{Make,count(*)}(R) returns?
– The number of cars of each make

Make    Model   Color
Toyota  Prius   Gray
Toyota  Prius   Red
Honda   Accord  Green
Honda   Accord  Red
Honda   Civic   Red
Ford    Echo    Red
Ford    Echo    Gray
Ford    Echo    White

γ_{Make,count(*)}(R):
Make    Count
Toyota  2
Honda   3
Ford    3

Duplicates important for summaries (“how many”)

Sorting (τ)
• τ_A(R) sorts tuples in R on attribute A
– Additional attributes serve as tie-breakers
– Local convention: ‘-’ reverse-sorts (descending order)
• Example: τ_{-Count,Make}(R)

Input R:
Make    Count
Toyota  2
Honda   3
Ford    3

Output:
Make    Count
Ford    3
Honda   3
Toyota  2

Descending count
Alphabetical order when count is equal
• Sorted relations not in the R.A. value domain!
=> Must be root operator of query tree**

The dangling tuple problem
• Consider the following query
– τ_{-Total}(ρ_{Name,Total}(γ_{Name,sum(Value)}(Emp ⋈ Sales)))
– “List employees and their total sales in descending order”

Emp:
EID  Name
1    Mary
2    Xiao
3    Jaspreet

Sales:
EID  Value  …
1    20     …
3    10     …
3    15     …

Result:
Name      Total
Jaspreet  25
Mary      20
Xiao?

• Join hides fact that Xiao has no sales!
– Challenge: rewrite query to include Xiao’s zero
What’s not there can be very important
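
The dangling tuple problem is easy to reproduce. The sketch below runs the query from the slide step by step (join, group and sum, sort descending) over plain Python lists; the variable names are illustrative, and only the EID and Value columns of Sales are modelled since the remaining columns are elided on the slide.

```python
from collections import defaultdict

emp = [(1, "Mary"), (2, "Xiao"), (3, "Jaspreet")]   # (EID, Name)
sales = [(1, 20), (3, 10), (3, 15)]                 # (EID, Value)

# Emp natural-join Sales on EID: employees without sales produce no pairs.
joined = [(name, value)
          for eid, name in emp
          for sale_eid, value in sales
          if eid == sale_eid]

# gamma_{Name, sum(Value)}: one total per employee that survived the join.
totals = defaultdict(int)
for name, value in joined:
    totals[name] += value

# tau_{-Total}: sort by total, descending.
ranked = sorted(totals.items(), key=lambda row: -row[1])

print(ranked)
# prints: [('Jaspreet', 25), ('Mary', 20)]
print("Xiao" in totals)
# prints: False
```

Xiao never reaches the grouping step, so no aggregate can report a zero for her; the fix has to happen at the join.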

Extending the “inner” join
• All joins so far resemble intersection
=> Tuples with no match discarded (“dangling”)
• Sometimes desirable to output dangling tuples
– List all employees and their sales total (even if zero)
• Problem: how to show missing part(s) of tuple?
– Introduce special value ⊥ (null)
– Pad dangling tuples as needed to match schema
– Note: ⊥ technically outside R.A. value domain

Outer join (⟗)
• T = R ⟗ S computes the “outer” join of R and S
– Like normal join, but all tuples from R and S appear in output
– Pad (left, right, or all) dangling tuples with ⊥
– |T| ≥ max(|R|, |S|)
• Natural, equi-, and theta- variants still apply
[Figure: left (L), right (R), and full outer joins of two relations, with dangling tuples padded by ⊥]

Outer join in action
• Consider the following query
– τ_{-Total}(ρ_{Name,Total}(γ_{Name,sum(Value)}(Emp ⟗ Sales)))
– “List employees and their total sales in descending order”

Emp:
EID  Name
1    Mary
2    Xiao
3    Jaspreet

Sales:
EID  Value  …
1    20     …
3    10     …
3    15     …

Result:
Name      Total
Jaspreet  25
Mary      20
Xiao      ⊥

Extended projection
• π_{x=E}(R) computes column x from expression E
– Arithmetic (z=3*x + y)
– String manipulation (substring, capitalization)
– Some conditional expressions
• Example:
τ_{-Total}(π_{Name,Total=T?T:0}(ρ_{Name,T}(γ_{Name,sum(Value)}(Emp ⟗ Sales))))

Emp:
EID  Name
1    Mary
2    Xiao
3    Jaspreet

Sales:
EID  Value  …
1    20     …
3    10     …
3    15     …

Result:
Name      Total
Jaspreet  25
Mary      20
Xiao      0
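
Putting the last two slides together, here is a minimal sketch of the outer-join query followed by the extended projection. Several choices are assumptions: Python's `None` stands in for the null value, the helper names are invented, and a left outer join is used (for this data it gives the same result as a full outer join, since every sale has a matching employee).

```python
from collections import defaultdict

emp = [(1, "Mary"), (2, "Xiao"), (3, "Jaspreet")]   # (EID, Name)
sales = [(1, 20), (3, 10), (3, 15)]                 # (EID, Value)


def left_outer_join(emp, sales):
    """Join on EID; employees with no sales are kept, padded with None."""
    out = []
    for eid, name in emp:
        matches = [value for sale_eid, value in sales if sale_eid == eid]
        if matches:
            out.extend((name, value) for value in matches)
        else:
            out.append((name, None))  # dangling tuple, padded with null
    return out


def sum_or_null(values):
    """sum() that ignores nulls and returns null if nothing is left."""
    present = [v for v in values if v is not None]
    return sum(present) if present else None


# gamma_{Name, sum(Value)} over the outer join
groups = defaultdict(list)
for name, value in left_outer_join(emp, sales):
    groups[name].append(value)
totals = {name: sum_or_null(values) for name, values in groups.items()}

# Extended projection Total = T ? T : 0 replaces the null with zero.
projected = [(name, t if t else 0) for name, t in totals.items()]

# tau_{-Total}
ranked = sorted(projected, key=lambda row: -row[1])

print(totals["Xiao"], ranked)
# prints: None [('Jaspreet', 25), ('Mary', 20), ('Xiao', 0)]
```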