Denormalizing Data with PROC SQL

Is that a real word??
Spell Check… Demoralizing Data with PROC SQL

Grocery Store

What You Have

customers

    CUSTOMER_ID  NAME
              1  Ansel
              2  Fiona
              3  James
              4  Kathy
              5  Ying
              6  Otto
              7  Costas
              8  Abdul
              9  Enrico
             10  Mitzu

purchases

    CUSTOMER_ID  ITEM_ID
              1        1
              1        2
              1        4
              1        6
              2        2
              2        3
              2        4
              2        7
              2        8
              …        …
             10        4
             10        5
             10        8
             10       10

groceries

    ITEM_ID  ITEM_NAME
          1  eggs
          2  milk
          3  bread
          4  chicken
          5  beef
          6  broccoli
          7  carrots
          8  apples
          9  peaches
         10  dog food

• Relational tables in normal form
• Purchase events in many-to-many relation
• Good for relational storage, bad for computing stats
• Data step: join tables using complex merge

What You Want

    CUSTOMER_ID  NAME    EGGS  MILK  BREAD  …  DOG_FOOD
              1  Ansel      1     1      0  …         0
              2  Fiona      1     1      1  …         1
              3  James      0     1      0  …         0
              4  Kathy      1     0      1  …         0
              5  Ying       1     1      1  …         1
              6  Otto       0     0      1  …         1
              7  Costas     0     0      1  …         1
              8  Abdul      0     1      0  …         0
              9  Enrico     1     0      1  …         1
             10  Mitzu      0     1      0  …         1

• Matrix shows who bought what
• One row per customer
• One column per item
• Easy to compute stats

How do you get this?
A few SQL examples to build up to a solution…

SQL Examples

What items did customer #1 buy?

    select item_id
    from purchases
    where customer_id = 1;

    ITEM_ID
    ----------
             1
             2
             4
             6

What items did customer #1 buy?
Join with the groceries table to get the item name

    select P.item_id, G.item_name
    from purchases P, groceries G
    where G.item_id = P.item_id
    and P.customer_id = 1;

    ITEM_ID     ITEM_NAME
    ----------  ---------
             1  eggs
             2  milk
             4  chicken
             6  broccoli

How many customers bought eggs?
Use the SQL aggregate function count().

    select count(*)
    from purchases P, groceries G
    where P.item_id = G.item_id
    and G.item_name = 'eggs';

    COUNT(*)
    --------
           5

Did customer #1 buy eggs?
Restrict by customer; count() then returns 0 or 1, i.e., no or yes

    select count(*)
    from groceries G, purchases P
    where P.item_id = G.item_id
    and G.item_name = 'eggs'
    and P.customer_id = 1;

    COUNT(*)
    --------
           1

Did customer #10 buy eggs?

    select count(*)
    from groceries G, purchases P
    where P.item_id = G.item_id
    and G.item_name = 'eggs'
    and P.customer_id = 10;

    COUNT(*)
    --------
           0

Subqueries

• In SQL, the select clause can include a query that returns a scalar value

    select name,
           (select count(*) from purchases) num_purchases
    from customers;

    NAME         NUM_PURCHASES
    -----------  -------------
    Ansel                   58
    Fiona                   58
    James                   58
    Kathy                   58
    Ying                    58
    Otto                    58
    Costas                  58
    Abdul                   58
    Enrico                  58
    Mitzu                   58

Correlated Subqueries

• Relate inner and outer queries via alias

    select name,
           (select count(*)
            from purchases
            where customer_id = C.customer_id) num_purchases
    from customers C;

    NAME         NUM_PURCHASES
    -----------  -------------
    Ansel                    4
    Fiona                    9
    James                    6
    Kathy                    3
    Ying                     8
    Otto                     7
    Costas                   7
    Abdul                    2
    Enrico                   7
    Mitzu                    5

Putting the pieces together

• Joins to get data from multiple tables
• Count() to get 0/1, yes/no
• Correlated subqueries rotate rows to columns
• Aliases to name columns

Final query

    select customer_id, name,
           (select count(*)
            from purchases P, groceries G
            where G.item_id = P.item_id
            and G.item_name = 'eggs'
            and P.customer_id = C.customer_id) eggs,
           (select count(*)
            from purchases P, groceries G
            where G.item_id = P.item_id
            and G.item_name = 'milk'
            and P.customer_id = C.customer_id) milk,
           (select count(*)
            from purchases P, groceries G
            where G.item_id = P.item_id
            and G.item_name = 'bread'
            and P.customer_id = C.customer_id) bread
    from customers C;

SAS Code

    proc sql;
      create table Work.Purchase_Matrix as
      select customer_id, name,
             (select count(*)
              from purchases P, groceries G
              where G.item_id = P.item_id
              and G.item_name = 'eggs'
              and P.customer_id = C.customer_id) eggs,
             (select count(*)
              from purchases P, groceries G
              where G.item_id = P.item_id
              and G.item_name = 'milk'
              and P.customer_id = C.customer_id) milk,
             (select count(*)
              from purchases P, groceries G
              where G.item_id = P.item_id
              and G.item_name = 'bread'
              and P.customer_id = C.customer_id) bread
      from customers C;
    quit;

Final Dataset

    CUSTOMER_ID  NAME         EGGS        MILK        BREAD
    -----------  -----------  ----------  ----------  ----------
              1  Ansel                 1           1           0
              2  Fiona                 1           1           1
              3  James                 0           1           0
              4  Kathy                 1           0           1
              5  Ying                  1           1           1
              6  Otto                  0           0           1
              7  Costas                0           0           1
              8  Abdul                 0           1           0
              9  Enrico                1           0           1
             10  Mitzu                 0           1           0

Resources

SQL may not always be the most appropriate choice for a given problem. This technique becomes unwieldy as the number of output columns grows, because every column needs its own correlated subquery.

• DATA Step vs. PROC SQL: What's a neophyte to do?
  http://www2.sas.com/proceedings/sugi29/269-29.pdf
• Proc SQL versus The Data Step
  http://www.nesug.org/proceedings/nesug06/hw/hw06.pdf

Questions?
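
The caveat under Resources, that one hand-written subquery per column stops scaling as items are added, can be eased by letting PROC SQL generate the column list from the groceries table. The following is a minimal sketch, not the presenter's method. It assumes the customers, purchases, and groceries tables from the slides already exist in WORK, that every item_name becomes a valid SAS variable name once blanks are replaced with underscores, and that the default DQUOTE=SAS setting is in effect so double-quoted text is read as a string literal. The names item_cols and Work.Purchase_Matrix_All are hypothetical. Note that it swaps the correlated subqueries for outer joins with max(case ...), which is a different technique from the one shown in the slides.

```sas
/* Assumption: customers, purchases, and groceries (as in the slides)
   already exist in the WORK library. */
proc sql noprint;

  /* Step 1: build one "max(case ...) as <column>" expression per item
     and store the comma-separated list in the macro variable ITEM_COLS
     (hypothetical name). Blanks in item names become underscores,
     e.g. "dog food" -> dog_food. */
  select catx(' ',
              'max(case when G.item_name =',
              quote(strip(item_name)),
              'then 1 else 0 end) as',
              translate(strip(item_name), '_', ' '))
    into :item_cols separated by ', '
    from groceries;

  /* Step 2: one pass over the joined tables, one output row per customer.
     Left joins keep customers who have no purchases (all flags 0). */
  create table Work.Purchase_Matrix_All as
  select C.customer_id,
         C.name,
         &item_cols
    from customers C
         left join purchases P on P.customer_id = C.customer_id
         left join groceries G on G.item_id = P.item_id
   group by C.customer_id, C.name;

quit;

/* Write the generated column list to the log for inspection. */
%put &item_cols;
```

Using max() keeps each flag at 0 or 1 even if a customer has several purchase rows for the same item, whereas count(*) would return the number of rows. The generated list has to fit in a single macro variable, so this sketch suits tens or hundreds of items rather than many thousands.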