Association between Categorical Variables

Chapter 5
Copyright © 2011 Pearson Education, Inc.
5.1 Contingency Tables
Which hosts send more buyers to
Amazon.com?

To answer this question we must gather data on
two categorical variables: Host and Purchase

Host identifies the originating site: MSN,
RecipeSource, or Yahoo; Purchase indicates
whether or not the visit results in a sale
Consider Two Categorical Variables
Simultaneously

A table that shows counts of cases on one
categorical variable contingent on the value of
another (for every combination of both variables)

Cells in a contingency table are mutually
exclusive
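As a minimal sketch of how such a table is tallied, the Python below counts cases for every combination of two categorical variables. The six visit records are hypothetical placeholders, not the Amazon.com data, and the helper name `contingency_table` is an assumption made for illustration.

```python
from collections import Counter

def contingency_table(records):
    """Count cases for every (row_value, column_value) combination."""
    counts = Counter(records)
    rows = sorted({r for r, _ in records})
    cols = sorted({c for _, c in records})
    # Every case falls in exactly one cell, so the cells are mutually exclusive.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Hypothetical (Purchase, Host) pairs, one per visit.
visits = [
    ("No", "MSN"), ("Yes", "MSN"), ("No", "Yahoo"),
    ("No", "RecipeSource"), ("No", "Yahoo"), ("Yes", "Yahoo"),
]

table = contingency_table(visits)
for purchase, row in table.items():
    print(purchase, row)
```

Combinations that never occur (here, a purchase from RecipeSource) still get a cell, with a count of zero.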
Contingency Table for Web Shopping
Marginal and Conditional Distributions
• Marginal distributions appear in the “margins” of a
contingency table and represent the totals
(frequencies) for each categorical variable
separately
• Conditional distributions refer to counts within a
row or column of a contingency table (restricted
to cases satisfying a condition)
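The two definitions can be sketched in a few lines of Python. The counts below are hypothetical (they are not the web shopping data), and the variable names are assumptions chosen for this illustration.

```python
# Hypothetical counts: rows are Purchase, columns are Host.
table = {
    "No":  {"MSN": 90, "Yahoo": 80},
    "Yes": {"MSN": 10, "Yahoo": 20},
}

hosts = list(next(iter(table.values())))

# Marginal distributions: totals for each variable taken separately.
row_totals = {p: sum(row.values()) for p, row in table.items()}
col_totals = {h: sum(table[p][h] for p in table) for h in hosts}

# Conditional distribution of Purchase given Host: shares within a column.
conditional = {
    h: {p: table[p][h] / col_totals[h] for p in table} for h in hosts
}

print(row_totals)
print(col_totals)
print(conditional)
```

If the conditional distributions differ from one column to the next, as they do for these made-up counts, the two variables are associated.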
Conditional Distribution of Purchase for each
Host (Column Counts and Percentages)
Conditional Distribution
• The conditional distributions reveal that the percentage of
purchases among visitors from RecipeSource is much lower
than among visitors from MSN and Yahoo
• Because the purchase rate depends on the host,
Host and Purchase are associated
Segmented Bar Charts
• Used to display conditional distributions
• Divides the bars in a bar chart into
segments that are proportional to the
percentage in each category of a second
variable
Contingency Table of Purchase by Region
Segmented Bar Chart Shows Association
Mosaic Plots

Alternative to segmented bar chart

A plot in which the size of each “tile” is
proportional to the count in a cell of a
contingency table
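One common way to lay out the tiles, sketched here under stated assumptions, is to give each column a width equal to its share of all cases and each tile a height equal to its share of the column, so that tile area equals the cell's share of the total. The function name `mosaic_tiles`, the style names, and the counts are all hypothetical.

```python
def mosaic_tiles(table):
    """Return {(row, col): (width, height)} for a unit-square mosaic plot.

    `table` maps row label -> {column label -> count}. Each column's width is
    its share of all cases; each tile's height is its share of the column.
    """
    cols = list(next(iter(table.values())))
    n = sum(sum(row.values()) for row in table.values())
    tiles = {}
    for c in cols:
        col_total = sum(table[r][c] for r in table)
        for r in table:
            tiles[(r, c)] = (col_total / n, table[r][c] / col_total)
    return tiles

# Hypothetical counts of shirt size (rows) by style (columns).
shirts = {
    "Small": {"Button-Down": 10, "Polo": 30},
    "Large": {"Button-Down": 30, "Polo": 30},
}

tiles = mosaic_tiles(shirts)
for cell, (w, h) in tiles.items():
    print(cell, "area =", round(w * h, 3))
```

Unequal tile heights across columns are the visual sign of association; with no association, the horizontal splits would line up.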
Contingency Table of Shirt Size by Style
Mosaic Plot Shows Association
4M Example 5.1: CAR THEFT
Motivation
Should insurance companies vary the
premiums for different car models (are
some cars more likely to be stolen than
others)?
Method
Data obtained from the National Highway Traffic
Safety Administration (NHTSA) on car theft for
seven popular models (two categorical variables:
type of car and whether the car was stolen).
Mechanics
Message
The Dodge Intrepid is more likely to be stolen than
other popular models. The data suggest that
higher premiums for theft insurance should be
charged for models that are more likely to be
stolen.
5.2 Lurking Variables and Simpson’s Paradox
Association Not Necessarily Causation

Lurking Variable: a concealed variable that
affects the apparent relationship between two
other variables

Simpson’s Paradox: a change in the association
between two variables when data are separated
into groups defined by a third variable
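A small numerical sketch may make the reversal concrete. The flight counts, airline names, and destination names below are hypothetical and are not taken from any example in this chapter; they are chosen only so that the pooled comparison and the within-group comparisons disagree.

```python
# Hypothetical (on_time, total) flight counts per airline and destination.
flights = {
    "Airline A": {"Dest 1": (90, 100), "Dest 2": (600, 1000)},
    "Airline B": {"Dest 1": (850, 1000), "Dest 2": (50, 100)},
}

def on_time_rate(cells):
    """Pooled on-time rate over an iterable of (on_time, total) pairs."""
    cells = list(cells)
    return sum(ok for ok, _ in cells) / sum(total for _, total in cells)

# Association when destination is ignored (the lurking variable is hidden).
overall = {a: on_time_rate(cells.values()) for a, cells in flights.items()}

# Association within each destination (data separated by the third variable).
by_dest = {
    d: {a: on_time_rate([flights[a][d]]) for a in flights}
    for d in ("Dest 1", "Dest 2")
}

print(overall)
print(by_dest)
```

Here Airline B looks better overall, yet Airline A has the higher on-time rate at each destination, because Airline A flies mostly to the destination where delays are common.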
4M Example 5.2: AIRLINE ARRIVALS
Motivation
Does it matter which of two airlines a
corporate CEO chooses when flying to
meetings if he wants to avoid delays?
Method
Data obtained from the US Bureau of
Transportation Statistics on flight delays for
two airlines (two categorical variables:
airline and whether the flight arrived on
time).
Mechanics
Mechanics –
Is destination a lurking variable?
Mechanics –
This is Simpson’s Paradox
Message
The CEO should book on US Airways as it is
more likely to arrive on time regardless of
destination.
5.3 Strength of Association
Chi-Squared Statistic

A measure of association in a contingency
table

Calculated based on a comparison of the
observed contingency table to an artificial
table with the same marginal totals but no
association
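The artificial table can be built from the marginal totals alone: each expected count is (row total × column total) / n. The sketch below assumes a 2×2 layout for the observed counts 30, 70, 50, 50 that appear in this section's worked calculation; the arrangement into rows and columns is inferred from the arithmetic, and the function name `expected_counts` is hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under no association, keeping the marginal totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed counts (row/column arrangement is an assumption).
observed = [[30, 70],
            [50, 50]]

expected = expected_counts(observed)
print(expected)  # [[40.0, 60.0], [40.0, 60.0]]
```

Both rows of the expected table have the same conditional distribution, which is what "no association" means.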
Contingency Table
Calculating the Chi-Squared Statistic
\[
\begin{aligned}
\chi^2 &= \frac{(30-40)^2}{40} + \frac{(70-60)^2}{60} + \frac{(50-40)^2}{40} + \frac{(50-60)^2}{60} \\
       &= \frac{(-10)^2}{40} + \frac{10^2}{60} + \frac{10^2}{40} + \frac{(-10)^2}{60} \\
       &= 2.5 + 1.67 + 2.5 + 1.67 \\
       &= 8.33
\end{aligned}
\]
Cramer’s V

Derived from the Chi-Squared Statistic

Ranges in value from 0 (variables are not
associated) to 1 (variables are perfectly
associated)
Calculating Cramer’s V
\[
V = \sqrt{\frac{\chi^2}{n \, \min(r - 1,\; c - 1)}}
\]
V = 0.20 for our example
There is a weak association between group
(students or staff) and attitude toward sharing
copyrighted music
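The two statistics for this example can be checked with a short Python sketch. The observed counts come from the worked chi-squared calculation; their arrangement into rows and columns is inferred from the arithmetic, and the function names are hypothetical. Here n is the total number of cases, and r and c are the numbers of rows and columns.

```python
import math

def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

def cramers_v(observed):
    """Cramer's V: sqrt(chi-squared / (n * min(r - 1, c - 1)))."""
    n = sum(sum(row) for row in observed)
    r, c = len(observed), len(observed[0])
    return math.sqrt(chi_squared(observed) / (n * min(r - 1, c - 1)))

# Observed counts (row/column arrangement is an assumption).
observed = [[30, 70], [50, 50]]

print(round(chi_squared(observed), 2))  # 8.33
print(round(cramers_v(observed), 2))    # 0.2
```

With n = 200 and a 2×2 table, V = sqrt(8.33 / 200), which rounds to the 0.20 reported above.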
Checklist: Chi-Squared and Cramer’s V

Verify that variables are categorical

Verify that there are no obvious lurking
variables
4M Example 5.3: REAL ESTATE
Motivation
Do people who heat their homes with gas
prefer to cook with gas as well? What
heating systems and appliances should a
developer select for newly built homes?
Method
The developer contacts homeowners to
obtain the data. Two categorical variables:
type of fuel used for home heating (gas or
electric) and type of fuel used for cooking
(gas or electric).
Mechanics
Chi-Squared = 98.62; Cramer’s V = 0.47
Message
Homeowners prefer gas to electric heat by
about 2 to 1. The developer should build
about two-thirds of new homes with gas
heat. Put electric appliances in all homes
with electric heat and in half of the homes
with gas heat (assuming that buyers for
new homes have the same preferences).
Best Practices

Use contingency tables to find and summarize
association between two categorical variables.

Be on the lookout for lurking variables.

Use plots to show association.

Exploit the absence of association.
Pitfalls

Don’t interpret association as causation.

Don’t display too many numbers in a table.