Denormalizing Data with PROC SQL
Is that a real word? Spell Check…
Demoralizing Data with PROC SQL
Grocery Store
What You Have

customers
CUSTOMER_ID  NAME
-----------  ------
1            Ansel
2            Fiona
3            James
4            Kathy
5            Ying
6            Otto
7            Costas
8            Abdul
9            Enrico
10           Mitzu

purchases
CUSTOMER_ID  ITEM_ID
-----------  -------
1            1
1            2
1            4
1            6
2            2
2            3
2            4
2            7
2            8
…            …
10           4
10           5
10           8
10           10

groceries
ITEM_ID  ITEM_NAME
-------  ---------
1        eggs
2        milk
3        bread
4        chicken
5        beef
6        broccoli
7        carrots
8        apples
9        peaches
10       dog food

• Relational tables in normal form
• Purchase events in many-to-many relation
• Good for relational storage, bad for computing stats
• Data step: join tables using complex merge
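For comparison, the DATA step route might start like the sketch below. It assumes the three tables exist as SAS datasets named customers, purchases, and groceries (the names used in the queries later in the deck); the Work.* dataset names are hypothetical and exist only for illustration.

```sas
/* Sketch only: attach item names to purchases with a sorted MERGE. */
proc sort data=purchases out=Work.P_sorted;
  by item_id;
run;

proc sort data=groceries out=Work.G_sorted;
  by item_id;
run;

data Work.Purchase_Items;
  /* One grocery row matches many purchase rows (one-to-many merge). */
  merge Work.P_sorted(in=inP) Work.G_sorted;
  by item_id;
  if inP;  /* keep only items that were actually purchased */
run;
```

Getting from here to one row per customer would still need another sort by customer_id, a merge with customers, and a transpose or array step, which is the complexity the last bullet refers to.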
What You Want

CUSTOMER_ID  NAME    EGGS  MILK  BREAD  …  DOG_FOOD
-----------  ------  ----  ----  -----  -  --------
1            Ansel   1     1     0      …  0
2            Fiona   1     1     1      …  1
3            James   0     1     0      …  0
4            Kathy   1     0     1      …  0
5            Ying    1     1     1      …  1
6            Otto    0     0     1      …  1
7            Costas  0     0     1      …  1
8            Abdul   0     1     0      …  0
9            Enrico  1     0     1      …  1
10           Mitzu   0     1     0      …  1

• Matrix shows who bought what
• One row per customer
• One column per item
• Easy to compute stats
How do you get this?
A few SQL examples to build up to a solution…
SQL Examples
What items did customer #1 buy?
select item_id from purchases
where customer_id = 1;
ITEM_ID
---------
1
2
4
6
What items did customer #1 buy?
Join with grocery table to get item name
select P.item_id, G.item_name
from purchases P, groceries G
where G.item_id = P.item_id
and P.customer_id = 1;

ITEM_ID    ITEM_NAME
---------  ---------
1          eggs
2          milk
4          chicken
6          broccoli
How many customers bought eggs?
Use SQL aggregate function count().
select count(*)
from purchases P, groceries G
where P.item_id = G.item_id
and G.item_name = 'eggs';

COUNT(*)
--------
5
Did customer #1 buy eggs?
Restrict by customer: count() then returns 0 or 1, i.e., no or yes
select count(*)
from groceries G, purchases P
where P.item_id = G.item_id
and G.item_name = 'eggs'
and P.customer_id = 1;

COUNT(*)
--------
1
Did customer #10 buy eggs?
select count(*)
from groceries G, purchases P
where P.item_id = G.item_id
and G.item_name = 'eggs'
and P.customer_id = 10;

COUNT(*)
--------
0
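The 0/1 reading of count(*) holds only if each customer/item pair appears at most once in purchases. If repeat purchases are possible, one option is to compare the count with zero, since a comparison in PROC SQL evaluates to 1 or 0. A minimal sketch, with the column alias bought_eggs made up for illustration:

```sas
proc sql;
  /* (count(*) > 0) is 1 when at least one matching row exists, else 0, */
  /* so repeat purchases of the same item still yield a yes/no flag.    */
  select (count(*) > 0) as bought_eggs
  from groceries G, purchases P
  where P.item_id = G.item_id
    and G.item_name = 'eggs'
    and P.customer_id = 1;
quit;
```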
Subqueries
• In SQL, the select clause can include a subquery that returns a scalar value
select
name,
(select count(*) from purchases) num_purchases
from customers;

NAME         NUM_PURCHASES
-----------  -------------
Ansel        58
Fiona        58
James        58
Kathy        58
Ying         58
Otto         58
Costas       58
Abdul        58
Enrico       58
Mitzu        58
Correlated Subqueries
• Relate inner and outer queries via alias
select
name,
(select count(*)
 from purchases
 where customer_id = C.customer_id) num_purchases
from customers C;

NAME         NUM_PURCHASES
-----------  -------------
Ansel        4
Fiona        9
James        6
Kathy        3
Ying         8
Otto         7
Costas       7
Abdul        2
Enrico       7
Mitzu        5
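The same per-customer counts can also be written as an outer join with GROUP BY, which some readers may find easier to follow than a correlated subquery. A sketch, assuming the same customers and purchases tables; count(P.item_id) counts only matched rows, so a customer with no purchases would get 0:

```sas
proc sql;
  select C.name,
         count(P.item_id) as num_purchases
  from customers C
       left join purchases P
         on P.customer_id = C.customer_id
  /* Group on the key as well as the name, in case two customers share a name. */
  group by C.customer_id, C.name;
quit;
```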
Putting the pieces together
• Joins to get data from multiple tables
• count() to get 0/1, yes/no
• Correlated subqueries rotate rows to columns
• Aliases to name columns
Final query
select
  customer_id,
  name,
  (select count(*) from purchases P, groceries G
   where G.item_id = P.item_id
   and G.item_name = 'eggs'
   and P.customer_id = C.customer_id) eggs,
  (select count(*) from purchases P, groceries G
   where G.item_id = P.item_id
   and G.item_name = 'milk'
   and P.customer_id = C.customer_id) milk,
  (select count(*) from purchases P, groceries G
   where G.item_id = P.item_id
   and G.item_name = 'bread'
   and P.customer_id = C.customer_id) bread
from customers C;
SAS Code
proc sql;
  create table Work.Purchase_Matrix as
  select
    customer_id,
    name,
    (select count(*) from purchases P, groceries G
     where G.item_id = P.item_id
     and G.item_name = 'eggs'
     and P.customer_id = C.customer_id) eggs,
    (select count(*) from purchases P, groceries G
     where G.item_id = P.item_id
     and G.item_name = 'milk'
     and P.customer_id = C.customer_id) milk,
    (select count(*) from purchases P, groceries G
     where G.item_id = P.item_id
     and G.item_name = 'bread'
     and P.customer_id = C.customer_id) bread
  from customers C;
quit;
Final Dataset

CUSTOMER_ID  NAME         EGGS        MILK        BREAD
-----------  -----------  ----------  ----------  ----------
1            Ansel        1           1           0
2            Fiona        1           1           1
3            James        0           1           0
4            Kathy        1           0           1
5            Ying         1           1           1
6            Otto         0           0           1
7            Costas       0           0           1
8            Abdul        0           1           0
9            Enrico       1           0           1
10           Mitzu        0           1           0
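With the matrix in place, the "easy to compute stats" claim can be illustrated with a short sketch. It assumes Work.Purchase_Matrix was created by the PROC SQL step in this deck; the alias eggs_and_milk is made up for illustration.

```sas
/* Column sums count buyers per item; means give the share of customers. */
proc means data=Work.Purchase_Matrix n sum mean;
  var eggs milk bread;
run;

/* Co-purchase: how many customers bought both eggs and milk? */
proc sql;
  select sum(eggs = 1 and milk = 1) as eggs_and_milk
  from Work.Purchase_Matrix;
quit;
```

The sum for eggs should agree with the count(*) query for eggs shown earlier.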
Resources
SQL may not always be the most appropriate choice for a
given problem. This technique starts to get untenable as the
number of columns needed in the output increases.
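One possible way to keep the query manageable as the item list grows is to let PROC SQL generate the column expressions into a macro variable. This is a sketch under assumptions, not the method of this deck: it swaps the correlated subqueries for outer joins with max() over a 0/1 comparison, it assumes item names contain only letters, digits, and spaces (spaces become underscores in the column names), and the names item_cols and Work.Purchase_Matrix_All are hypothetical.

```sas
/* Sketch only: build one 0/1 column expression per row of groceries,  */
/* e.g.  max(G.item_name = "eggs" ) as eggs                            */
proc sql noprint;
  select catx(' ',
              'max(G.item_name =', quote(strip(item_name)), ') as',
              translate(strip(item_name), '_', ' '))
    into :item_cols separated by ', '
    from groceries;
quit;

proc sql;
  create table Work.Purchase_Matrix_All as
  select C.customer_id,
         C.name,
         &item_cols
  from customers C
       left join purchases P on P.customer_id = C.customer_id
       left join groceries G on G.item_id = P.item_id
  /* max() of a 1/0 comparison is 1 if any purchase row matches the item. */
  group by C.customer_id, C.name;
quit;
```

Because the comparison evaluates to 0 when there is no matching row, customers with no purchases still get a row of zeros, and repeat purchases of the same item do not push a value above 1.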
DATA Step vs. PROC SQL: What’s a neophyte to do?
http://www2.sas.com/proceedings/sugi29/269-29.pdf
Proc SQL versus The Data Step
http://www.nesug.org/proceedings/nesug06/hw/hw06.pdf
Questions?