STAB22 Statistics I
Lecture 2

Describing Categorical Data

Categorical data: individual cases fall into one of several groups or categories
  E.g. Student course grades (categories: A, B, C, D, F)

  Student ID #   Grade
  1234           A
  2234           D
  ⁞              ⁞

Look at several methods for describing categorical data, using:
  Tables: Frequency, Relative Frequency, Contingency Tables
  Charts: Bar Chart, Pie Chart, Side-by-side Charts

Frequency Table

Frequency table: lists all categories and their frequencies (i.e. # of cases in each category)
  E.g. Student course grades (categories: A, B, C, D, F)

  Grade       A    B    C    D    F    Total
  Frequency   40   75   50   20   15   200

Describes the distribution of a categorical variable (i.e. its values & how frequently each occurs)

Relative Frequency Table

Relative frequency table: lists all categories and their relative frequencies (i.e. proportion / percentage of cases in each category)

  Grade       A    B    C    D    F    Total
  Frequency   40   75   50   20   15   200

  Grade            A    B      C    D    F     Total
  Rel. Freq. (%)   20   37.5   25   10   7.5   100

Bar Chart

Bar chart: categories are represented by bars, where the height of each bar is the simple / relative frequency
StatCrunch: Graphics > Bar Plot > with summary

Pie Chart

Pie chart: categories are represented by pie slices, where the size of each slice is proportional to the relative frequency
StatCrunch: Graphics > Pie Chart > with summary

Describing Two Categorical Variables

Consider two categorical variables for each individual case
  E.g. Student course grades (1st var), & year the course was taken in (2nd var)

  Student ID #   Grade   Year
  1234           A       2010
  2234           D       2011
  6744           B       2010
  ⁞              ⁞       ⁞

Want to describe the joint behavior of both categorical variables

Contingency Table

Contingency table: two-way table that lists frequencies for each combination of the two categorical variables
  E.g.
  Student Grade by Year

                    Grade
  Year    A     B     C     D    F    Total
  2010    40    75    50    20   15   200
  2011    45    70    58    19   8    200
  Total   85    145   108   39   23   400

Contingency Table

Contingency tables can also report relative frequencies as % of total # of cases

                          Grade
  Year    A        B        C       D       F       Total
  2010    10%      18.75%   12.5%   5%      3.75%   50%
  2011    11.25%   17.5%    14.5%   4.75%   2%      50%
  Total   21.25%   36.25%   27%     9.75%   5.75%   100%

A.k.a. joint distribution of two categorical variables
StatCrunch: Stat > Tables > Contingency > with summary

Contingency Table

Margins of the contingency table give the distribution of each categorical variable separately

                          Grade
  Year    A        B        C       D       F       Total
  2010    10%      18.75%   12.5%   5%      3.75%   50%
  2011    11.25%   17.5%    14.5%   4.75%   2%      50%
  Total   21.25%   36.25%   27%     9.75%   5.75%   100%

  Total column (right margin): distribution of Year
  Total row (bottom margin): distribution of Grade

A.k.a. marginal distributions (same as the individual distributions of each variable)

Example

University applications to professional schools, classified by Gender & Decision

  Cell format: Count (Total percent)

                    Decision
  Gender   Accepted       Rejected       Total
  Male     490 (40.83%)   210 (17.5%)    700 (58.33%)
  Female   280 (23.33%)   220 (18.33%)   500 (41.67%)
  Total    770 (64.17%)   430 (35.83%)   1200 (100.00%)

What proportion of applicants are females who get accepted?
What proportion of applicants are male?

Conditional Distributions

Often need the distribution of one variable for a particular value of the other
  E.g. What % of males get accepted / rejected?
Fix the value of one variable & look at the distribution of the other for that value only
  E.g. Conditional distribution of Decision, conditional on Gender = Male

                   Accepted    Rejected    Total
  Male count (%)   490 (70%)   210 (30%)   700 (100%)

Conditional Distributions

Can condition on either variable (i.e. rows or columns) of the contingency table
  E.g.
  Conditional distributions of Decision for Gender = Male/Female (row %)

  Cell format: Count (Row percent)

                    Decision
  Gender   Accepted       Rejected       Total
  Male     490 (70%)      210 (30%)      700 (100%)       <- cond. distr. of Decision for Gender = Male
  Female   280 (56%)      220 (44%)      500 (100%)       <- cond. distr. of Decision for Gender = Female
  Total    770 (64.17%)   430 (35.83%)   1200 (100.00%)

Conditional Distributions

  E.g. Conditional distributions of Gender for Decision = Accepted/Rejected (column %)

  Cell format: Count (Column percent)

                    Decision
  Gender   Accepted        Rejected        Total
  Male     490 (63.64%)    210 (48.84%)    700 (58.33%)
  Female   280 (36.36%)    220 (51.16%)    500 (41.67%)
  Total    770 (100.00%)   430 (100.00%)   1200 (100.00%)

  Accepted column: cond. distr. of Gender for Decision = Accepted
  Rejected column: cond. distr. of Gender for Decision = Rejected

Conditional Distributions

If the conditional distributions of one variable are the same for every value of the other, we say the two variables are independent
  E.g. If the conditional distribution of student course grades does not change with year, then Grade is independent of Year (& vice-versa)
Compare conditional distributions visually using a side-by-side bar plot

Side-by-side Bar Plot

StatCrunch: Graphics > Chart > Columns

Simpson’s Paradox

For professional school applications, females have lower acceptance rate (56%) than males (70%)! Is there discrimination?

  Cell format: Count (Row percent)

           Accepted    Rejected    Total
  Male     490 (70%)   210 (30%)   700 (100%)
  Female   280 (56%)   220 (44%)   500 (100%)

Let’s look at tables for Law & Business school admissions separately…

Simpson’s Paradox

  LAW      Accepted       Rejected       Total
  Male     10 (10%)       90 (90%)       100 (100%)
  Female   100 (33.33%)   200 (66.67%)   300 (100%)

  BUSINESS   Accepted    Rejected    Total
  Male       480 (80%)   120 (20%)   600 (100%)
  Female     180 (90%)   20 (10%)    200 (100%)

For both Law & Business schools, females have higher acceptance rates (33.3% & 90%) than males (10% & 80%) => no discrimination!
How is it possible that female overall acceptance rate is lower?
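
The row percentages in the Law, Business and overall tables can be checked with a few lines of Python. This is only a sketch for verifying the arithmetic, not part of the StatCrunch workflow used in the lecture; the names counts, acceptance_rate, by_school, overall and applied_share are made up for this illustration. It also prints the share of each gender's applications that went to each school, which is one place to look when thinking about the question above.

```python
# Counts from the Law and Business tables: (accepted, rejected).
counts = {
    "Law": {"Male": (10, 90), "Female": (100, 200)},
    "Business": {"Male": (480, 120), "Female": (180, 20)},
}
genders = ("Male", "Female")


def acceptance_rate(accepted, rejected):
    """Row percent: % of the applicants in one row who were accepted."""
    return 100 * accepted / (accepted + rejected)


# Conditional distribution of Decision (accepted %) by gender within each school.
by_school = {
    school: {g: acceptance_rate(*rows[g]) for g in genders}
    for school, rows in counts.items()
}

# Pool both schools to recover the overall table, then recompute the row percents.
overall = {}
totals = {}
for g in genders:
    accepted = sum(rows[g][0] for rows in counts.values())
    rejected = sum(rows[g][1] for rows in counts.values())
    totals[g] = accepted + rejected
    overall[g] = acceptance_rate(accepted, rejected)

# % of each gender's applications that went to each school.
applied_share = {
    g: {school: 100 * sum(rows[g]) / totals[g] for school, rows in counts.items()}
    for g in genders
}

for school, rates in by_school.items():
    print(school, {g: round(r, 2) for g, r in rates.items()})
print("Overall", {g: round(r, 2) for g, r in overall.items()})
print("Applied", {g: {s: round(p, 2) for s, p in shares.items()}
                  for g, shares in applied_share.items()})
```

Within each school the female rate comes out higher, while the pooled rate comes out lower, matching the tables above.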