Lecture 2

STAB22 Statistics I
Describing Categorical Data

Categorical Data: individual cases fall into
one of several groups or categories

E.g. Student course grades
(categories: A, B, C, D, F)
Student ID #   Grade
1234           A
2234           D
⁞              ⁞
Look at several methods for describing
categorical data, using:


Tables: Frequency, Relative Frequency,
Contingency Tables
Charts: Bar Chart, Pie Chart, Side-by-side Charts
Frequency Table

Frequency table: List all categories and their
frequencies (i.e. # of cases in each category)


Describes distribution of a categorical variable
(i.e. its values & how frequently each occurs)

E.g. Student course grades (categories: A, B, C, D, F)

Grade   Frequency
A          40
B          75
C          50
D          20
F          15
Total     200
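
A minimal Python sketch of how such a frequency table could be tallied from raw data. The list `grades` is hypothetical case-level data constructed to match the counts above, and `frequency_table` is an illustrative helper name, not part of the course material.

```python
from collections import Counter

# Hypothetical raw data: one grade per student, constructed to match the
# counts in the example table (the original case-level data is not shown).
grades = ["A"] * 40 + ["B"] * 75 + ["C"] * 50 + ["D"] * 20 + ["F"] * 15


def frequency_table(values):
    """Count how many cases fall into each category."""
    return dict(Counter(values))


freq = frequency_table(grades)

# Print one row per category, then the total number of cases.
for category in sorted(freq):
    print(f"{category:<6}{freq[category]:>4}")
print(f"{'Total':<6}{sum(freq.values()):>4}")
```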
Relative Frequency Table

Relative Frequency Table: List all
categories and their relative frequencies (i.e.
proportion or percentage of cases in each category)
Grade   Frequency   Rel. Freq. (%)
A          40          20
B          75          37.5
C          50          25
D          20          10
F          15           7.5
Total     200         100
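
The conversion from frequencies to relative frequencies is one division per category. A short sketch, assuming the counts from the table above; `relative_frequencies` is an illustrative name.

```python
# Frequencies from the example table.
freq = {"A": 40, "B": 75, "C": 50, "D": 20, "F": 15}


def relative_frequencies(freq):
    """Express each category count as a percentage of all cases."""
    total = sum(freq.values())
    return {category: 100 * n / total for category, n in freq.items()}


rel = relative_frequencies(freq)

for category, pct in rel.items():
    print(f"{category:<6}{pct:>6.1f} %")
print(f"{'Total':<6}{sum(rel.values()):>6.1f} %")
```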
Bar Chart

Bar Chart: Categories represented by bars,
where the height of each bar is the category's
frequency or relative frequency

StatCrunch: Graphics > Bar Plot > with summary
Pie Chart

Pie Chart: Categories represented by pie
slices, where size of each slice is proportional
to relative frequency

StatCrunch: Graphics > Pie Chart > with summary
Describing Two Categorical
Variables

Consider two categorical variables for each
individual case

E.g. Student course grades (1st var), & year the
course was taken in (2nd var)
Student ID #   Grade   Year
1234           A       2010
2234           D       2011
6744           B       2010
⁞              ⁞       ⁞

Want to describe joint behavior of both
categorical variables
Contingency Table

Contingency Table: Two-way table lists
frequencies for each combination of the two
categorical variables

E.g. Student Grade by Year
                 Grade
Year       A     B     C     D     F   Total
2010      40    75    50    20    15     200
2011      45    70    58    19     8     200
Total     85   145   108    39    23     400
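
A sketch of how a two-way table like this could be built from case-level (Year, Grade) pairs. The `records` list is hypothetical data constructed to reproduce the counts above, and `contingency_table` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical case-level records, constructed to reproduce the counts above.
counts_by_year = {
    2010: {"A": 40, "B": 75, "C": 50, "D": 20, "F": 15},
    2011: {"A": 45, "B": 70, "C": 58, "D": 19, "F": 8},
}
records = [
    (year, grade)
    for year, grade_counts in counts_by_year.items()
    for grade, n in grade_counts.items()
    for _ in range(n)
]


def contingency_table(pairs):
    """Tally (row value, column value) pairs into a nested dict of counts."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    return {r: {c: cells[(r, c)] for c in cols} for r in rows}


table = contingency_table(records)

# Row and column totals form the margins of the table.
row_totals = {r: sum(row.values()) for r, row in table.items()}
col_totals = {c: sum(row[c] for row in table.values()) for c in table[2010]}
grand_total = sum(row_totals.values())

print(table)
print(row_totals, col_totals, grand_total)
```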
Contingency Table

Contingency tables can also report relative
frequencies as % of total # of cases
                 Grade
Year     A         B         C        D        F        Total
2010     10 %      18.75 %   12.5 %   5 %      3.75 %   50 %
2011     11.25 %   17.5 %    14.5 %   4.75 %   2 %      50 %
Total    21.25 %   36.25 %   27 %     9.75 %   5.75 %   100 %

StatCrunch: Stat > Tables > Contingency > with summary

A.k.a. joint distribution of two categorical
variables
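
The joint distribution divides every cell count by the grand total. A minimal sketch using the Grade by Year counts; `joint_distribution` is an illustrative name.

```python
# Cell counts from the Grade by Year contingency table.
counts = {
    2010: {"A": 40, "B": 75, "C": 50, "D": 20, "F": 15},
    2011: {"A": 45, "B": 70, "C": 58, "D": 19, "F": 8},
}


def joint_distribution(counts):
    """Convert each cell count to a percentage of the grand total."""
    total = sum(sum(row.values()) for row in counts.values())
    return {
        r: {c: 100 * n / total for c, n in row.items()}
        for r, row in counts.items()
    }


joint = joint_distribution(counts)

for year, row in joint.items():
    print(year, {grade: f"{pct:g} %" for grade, pct in row.items()})
```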
Contingency Table

Margins of contingency table give distributions
of each categorical variable separately
                 Grade
Year     A         B         C        D        F        Total
2010     10 %      18.75 %   12.5 %   5 %      3.75 %   50 %
2011     11.25 %   17.5 %    14.5 %   4.75 %   2 %      50 %
Total    21.25 %   36.25 %   27 %     9.75 %   5.75 %   100 %

Total column: distribution of Year
Total row: distribution of Grade

A.k.a. marginal distributions (same as the
individual distributions of each variable)
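
The margins can be recovered by summing across rows or columns and dividing by the grand total. A sketch under the same Grade by Year counts; `marginal_distributions` is an illustrative name.

```python
# Cell counts from the Grade by Year contingency table.
counts = {
    2010: {"A": 40, "B": 75, "C": 50, "D": 20, "F": 15},
    2011: {"A": 45, "B": 70, "C": 58, "D": 19, "F": 8},
}


def marginal_distributions(counts):
    """Return (row marginal, column marginal) as percentages of all cases."""
    total = sum(sum(row.values()) for row in counts.values())

    # Row marginal: add up each row, ignoring the column variable.
    row_marginal = {r: 100 * sum(row.values()) / total for r, row in counts.items()}

    # Column marginal: add up each column, ignoring the row variable.
    col_sums = {}
    for row in counts.values():
        for c, n in row.items():
            col_sums[c] = col_sums.get(c, 0) + n
    col_marginal = {c: 100 * n / total for c, n in col_sums.items()}

    return row_marginal, col_marginal


year_dist, grade_dist = marginal_distributions(counts)
print("Distribution of Year: ", year_dist)
print("Distribution of Grade:", grade_dist)
```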
Example

University applications to professional
schools, classified by Gender & Decision
Cell format: Count (Total percent)

                      Decision
Gender     Accepted        Rejected        Total
Male       490 (40.83%)    210 (17.5%)     700 (58.33%)
Female     280 (23.33%)    220 (18.33%)    500 (41.67%)
Total      770 (64.17%)    430 (35.83%)    1200 (100.00%)

What proportion of applicants are females who get
accepted?
What proportion of applicants are male?
Conditional Distributions

Often need distribution of one variable for a
particular value of the other


E.g. What % of males get accepted / rejected?
Fix value of one variable & look at distribution
of the other for that value only

E.g. Conditional distribution of decision,
conditional on gender=male
Male      Accepted   Rejected   Total
count        490        210       700
(%)
Conditional Distributions


Can condition on either variable (i.e. rows or
columns) of contingency table
E.g. Conditional distributions of Decision for
Gender = Male/Female (row %)
Cell format: Count (Row percent)

                      Decision
Gender     Accepted        Rejected        Total
Male       490 (70%)       210 (30%)       700 (100%)       <- cond. distr. of Decision for Gender=Male
Female     280 (56%)       220 (44%)       500 (100%)       <- cond. distr. of Decision for Gender=Female
Total      770 (64.17%)    430 (35.83%)    1200 (100.00%)
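
Row percentages come from dividing each count by its own row total rather than by the grand total. A small sketch with the Gender by Decision counts; `conditional_on_row` is an illustrative name.

```python
# Counts from the Gender by Decision table (rows: Gender, columns: Decision).
counts = {
    "Male": {"Accepted": 490, "Rejected": 210},
    "Female": {"Accepted": 280, "Rejected": 220},
}


def conditional_on_row(counts, row):
    """Distribution of the column variable for one fixed row value, in %."""
    row_total = sum(counts[row].values())
    return {c: 100 * n / row_total for c, n in counts[row].items()}


male = conditional_on_row(counts, "Male")
female = conditional_on_row(counts, "Female")
print("Decision | Gender=Male:  ", male)
print("Decision | Gender=Female:", female)
```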
Conditional Distributions

E.g. Conditional distributions of Gender for
Decision = Accepted/Rejected (column %)
Cell format: Count (Column percent)

                      Decision
Gender     Accepted         Rejected         Total
Male       490 (63.64%)     210 (48.84%)     700 (58.33%)
Female     280 (36.36%)     220 (51.16%)     500 (41.67%)
Total      770 (100.00%)    430 (100.00%)    1200 (100.00%)

Accepted column: cond. distr. of Gender for Decision = Accepted
Rejected column: cond. distr. of Gender for Decision = Rejected
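
Conditioning on a column works the same way, with the column total as the denominator. A sketch assuming the same counts; `conditional_on_column` is an illustrative name.

```python
# Counts from the Gender by Decision table (rows: Gender, columns: Decision).
counts = {
    "Male": {"Accepted": 490, "Rejected": 210},
    "Female": {"Accepted": 280, "Rejected": 220},
}


def conditional_on_column(counts, column):
    """Distribution of the row variable for one fixed column value, in %."""
    col_total = sum(row[column] for row in counts.values())
    return {r: 100 * row[column] / col_total for r, row in counts.items()}


accepted = conditional_on_column(counts, "Accepted")
rejected = conditional_on_column(counts, "Rejected")
print("Gender | Decision=Accepted:", {g: round(p, 2) for g, p in accepted.items()})
print("Gender | Decision=Rejected:", {g: round(p, 2) for g, p in rejected.items()})
```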
Conditional Distributions

If conditional distributions of one variable are
the same for every value of the other, we say
the two variables are independent


E.g. If conditional distribution of student course
grades does not change with year, then Grade is
independent of Year (& vice-versa)
Compare conditional distributions visually
using side-by-side bar plot
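
The comparison can also be done numerically. The sketch below measures the largest gap between row-conditional distributions; a gap of 0 means the conditional distributions are identical. The function name `max_conditional_gap` and the table `proportional` are hypothetical, introduced only for illustration.

```python
def max_conditional_gap(counts):
    """Largest difference (in percentage points) between the conditional
    distributions of the column variable across the rows of a table."""
    # Row percentages: each row's conditional distribution.
    row_pcts = []
    for row in counts.values():
        row_total = sum(row.values())
        row_pcts.append({c: 100 * n / row_total for c, n in row.items()})

    # For each column, compare the highest and lowest row percentage.
    columns = row_pcts[0].keys()
    return max(
        max(p[c] for p in row_pcts) - min(p[c] for p in row_pcts)
        for c in columns
    )


# Gender by Decision counts from the example.
admissions = {
    "Male": {"Accepted": 490, "Rejected": 210},
    "Female": {"Accepted": 280, "Rejected": 220},
}

# Hypothetical table whose rows are exactly proportional to each other.
proportional = {
    "X": {"a": 10, "b": 30},
    "Y": {"a": 20, "b": 60},
}

print(max_conditional_gap(admissions))    # conditional distributions differ
print(max_conditional_gap(proportional))  # conditional distributions match
```

In real samples the conditional distributions rarely match exactly, so in practice one looks at whether the gap is small rather than exactly zero.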
Side-by-side Bar Plot
StatCrunch: Graphics > Chart > Columns
Simpson’s Paradox

For professional school applications, females
have a lower acceptance rate (56%) than
males (70%)! Is there discrimination?
Cell format: Count (Row percent)

           Accepted     Rejected     Total
Male       490 (70%)    210 (30%)    700 (100%)
Female     280 (56%)    220 (44%)    500 (100%)

Let’s look at tables for Law & Business school
admissions separately…
Simpson’s Paradox
LAW        Accepted        Rejected        Total
Male       10 (10%)        90 (90%)        100 (100%)
Female     100 (33.33%)    200 (66.67%)    300 (100%)

BUSINESS   Accepted        Rejected        Total
Male       480 (80%)       120 (20%)       600 (100%)
Female     180 (90%)       20 (10%)        200 (100%)

For both Law & Business schools, females have
higher acceptance rates (33.3% & 90%) than
males (10% & 80%) => no discrimination!
How is it possible that female overall acceptance
rate is lower?
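
One way to explore this question is to recompute the rates from the counts and also look at where each group applied. The sketch below uses the Law and Business counts from the tables above; the helper names (`school_rate`, `overall_rate`, `law_share`) are illustrative.

```python
# (accepted, applicants) by school and gender, from the two tables above.
admissions = {
    "Law": {"Male": (10, 100), "Female": (100, 300)},
    "Business": {"Male": (480, 600), "Female": (180, 200)},
}


def school_rate(school, gender):
    """Acceptance rate (%) for one gender within one school."""
    accepted, applicants = admissions[school][gender]
    return 100 * accepted / applicants


def overall_rate(gender):
    """Acceptance rate (%) for one gender, pooling both schools."""
    accepted = sum(admissions[s][gender][0] for s in admissions)
    applicants = sum(admissions[s][gender][1] for s in admissions)
    return 100 * accepted / applicants


def law_share(gender):
    """Percentage of one gender's applicants who applied to Law."""
    total = sum(admissions[s][gender][1] for s in admissions)
    return 100 * admissions["Law"][gender][1] / total


for gender in ("Male", "Female"):
    print(
        f"{gender:<7}"
        f"Law {school_rate('Law', gender):5.1f} %  "
        f"Business {school_rate('Business', gender):5.1f} %  "
        f"Overall {overall_rate(gender):5.1f} %  "
        f"Applied to Law {law_share(gender):5.1f} %"
    )
```

Each overall rate is a weighted average of the two school rates, with weights given by where that group applied, so the last column is the place to look when the pooled and per-school comparisons disagree.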