+ Categorical Variables

+
Chapter 1: Exploring Data
Section 1.1
Analyzing Categorical Data
+
What is Statistics?
Statistics is the science of
 Collecting data
 Analyzing data
 Drawing conclusions from data
+
Major Branches of Statistics
1.
Descriptive Statistics
 Organizing, Summarizing Information
 Graphical techniques
 Numerical techniques
2.
Inferential Statistics
 Estimation
 Decision making
+
Important Terms
Variable – A variable is any characteristic whose
value may change from one individual to another
Examples:
 Brand of television
 Height of a building
 Number of students in a class
Examples from Algebra?
+
Important Terms
Data results from making observations either on a single variable
or simultaneously on two or more variables.
A univariate data set consists of observations on a single
variable made on individuals in a sample or population.
A bivariate data set consists of observations on two variables
made on individuals in a sample or population.
A multivariate data set consists of observations on two or more
variables made on individuals in a sample or population.
+
Data Sets
 A univariate
data set is categorical (or
qualitative) if the individual observations are
categorical responses.
 A univariate
data set is numerical (or
quantitative) if the individual observations
are numerical responses where numerical
operations generally have meaning.
+
Categorical
Variables place
individuals into one of several
groups or categories
EXAMPLES?
The distribution of a categorical variable lists the count or
percent of individuals who fall into each category.
Frequency Table
Format
Variable
Relative Frequency Table
Count of Stations
Format
Percent of Stations
Adult Contemporary
1556
Adult Contemporary
Adult Standards
1196
Adult Standards
8.6
Contemporary Hit
4.1
Contemporary Hit
569
11.2
Country
2066
Country
14.9
News/Talk
2179
News/Talk
15.7
Oldies
1060
Oldies
Religious
2014
Religious
Rock
869
Spanish Language
750
Other Formats
Total
1579
13838
7.7
14.6
Rock
6.3
Count
Spanish Language
5.4
Other Formats
11.4
Total
99.9
Percent
Analyzing Categorical Data
Example, page 8
+

+
categorical data
Frequency tables can be difficult to read. Sometimes
is is easier to analyze a distribution by displaying it
with a bar graph or pie chart.
Format
Frequency Table
Count
of Stations
Count of Stations
2500
Adult
Contemporary
2000
Adult Standards
Contemporary Hit
1500
Country
1000
Format
Percent of Stations
1556
Adult Contemporary
1196
Adult Standards
569
2066
News/Talk
2179
Oldies
1060
500
Relative Frequency Table
Percent
of Stations
Contemporary
11% Hit11%
5%
Country
6%
News/Talk
15%
869
Rock
750
Spanish Language
Total
1579
13838
Religious
Other Formats
Total
16%
14.9
News/Talk 15.7
Oldies
Oldies
15%
8%
Country
4%
0
Other Formats
4.1
9%
2014
Spanish Language
8.6
Contemporary hit
Religious
Rock
Adult
Contemporary
11.2
Adult Standards
Religious
7.7
14.6
Rock
6.3
Spanish
5.4
Other
11.4
99.9
Analyzing Categorical Data
 Displaying
+
Types of Numerical Data
Numerical data is discrete if the possible
values are isolated points on the number line.
Numerical data is continuous if the set of
possible values form an entire interval on the
number line.
+
Examples of Discrete Data
 The
number of students served in the MHS lunchroom
over a one hour time period for a sample of 7 different
days:
230
220
310
280
210
270
320
 The
number of textbooks bought by students at a given
school during a semester for a sample of 16 students
5
3
3
5
6
7
8
6
6
7
1
5
3
4
6
12
+
Examples of Continuous Data
 The
height of students that are taking AP Statistics for a
sample of 10 students.
72.1”
64.3”
68.2”
74.1”
66.3”
61.2”
68.3”
71.1”
65.9”
70.8”
Note: Even though the heights are only measured accurately to
1 tenth of an inch, the actual height could be any value in some
reasonable interval.
+
Examples of Continuous Data
The crushing strength of a sample of four jacks used to
support trailers.
7834 lb
8248 lb
9817 lb 8141 lb
Gasoline mileage (miles per gallon) for a brand of car is
measured by observing how far each of a sample of seven
cars of this brand of car travels on ten gallons of gasoline.
23.1 26.4
22.6 24.3
29.8
25.0
25.9
+
To decide if a given data set represents
continuous or discrete numerical data, use
the following guidelines:
 Discrete
data is typically gathered by
counting during observation
 Continuous
data is typically found by a
measuring process
+
End of Day 1
It is important to be an informed consumer
by:
•
•
•
•
Extracting information from charts
and graphs
Following numerical arguments
Knowing the basics of how data
should be gathered, summarized
and analyzed to draw statistical
conclusions
Understanding the validity and
appropriateness of processes and
decisions that affect your life
+
Deceptive Graphs
+
The NY Times ran an article in the “Week in
Review” on the housing bubble on 9-23-2007.
Included was a graphic that, with some clever
scaling, exaggerates the changes in housing
prices over the last 20 years.
+
The following bar graph gives the percent of owners of
three brands of trucks who are satisfied with their truck.
Truck Brands
From this graph we may legitimately conclude
A) there is very little difference in the satisfaction of owners for
the three brands.
B) Chevrolet owners are substantially more satisfied than Ford
or Toyota owners.
C) owners of other brands of trucks are less
satisfied than the owners of these three
brands.
D) Chevrolet probably sells more trucks
than Ford or Toyota.
Good and Bad
Our eyes react to the area of the bars as well as
height. Be sure to make your bars equally wide.
Avoid the temptation to replace the bars with pictures
for greater appeal…this can be misleading!
Alternate Example
This ad for DIRECTV
has multiple problems.
How many can you
point out?
Analyzing Categorical Data
Bar graphs compare several quantities by comparing
the heights of bars that represent those quantities.
+
 Graphs:
When a dataset involves two categorical variables, we begin by
examining the counts or percents in various categories for
one of the variables.
Definition:
Two-way Table – describes two categorical
variables, organizing counts according to a row
variable and a column variable.
Example, p. 12
Young adults by gender and chance of getting rich
Female
Male
Total
Almost no chance
96
98
194
Some chance, but probably not
426
286
712
A 50-50 chance
696
720
1416
A good chance
663
758
1421
Almost certain
486
597
1083
Total
2367
2459
4826
What are the variables
described by this twoway table?
How many young
adults were surveyed?
+
Tables and Marginal Distributions
Analyzing Categorical Data
 Two-Way
Definition:
The Marginal Distribution of one of the
categorical variables in a two-way table of
counts is the distribution of values of that
variable among all individuals described by the
table.
Note: Are percents or counts more informative when
comparing groups of different sizes?
To examine a marginal distribution,
1)Use the data in the table to calculate the marginal
distribution (in percents) of the row or column totals.
2)Make a graph to display the marginal distribution.
+
Tables and Marginal Distributions
Analyzing Categorical Data
 Two-Way
Example, p. 13
Young adults by gender and chance of getting rich
Female
Male
Total
Almost no chance
96
98
194
Some chance, but probably not
426
286
712
A 50-50 chance
696
720
1416
A good chance
663
758
1421
Almost certain
486
597
1083
Total
2367
2459
4826
Response
Almost no chance
Examine the marginal
distribution of chance
of getting rich.
Percent
Chance of being wealthy by age 30
194/4826 = 4.0%
A 50-50 chance
A good chance
Almost certain
Percent
Some chance
+
Tables and Marginal Distributions
Analyzing Categorical Data
 Two-Way
35
30
25
20
15
10
5
0
Almost
none
Some
50-50
Good
chance
chance
chance
Survey Response
Almost
certain

Marginal distributions tell us nothing about the relationship
between two variables.
Definition:
A Conditional Distribution of a variable
describes the values of that variable among
individuals who have a specific value of another
variable.
To examine or compare conditional distributions,
1)Select the row(s) or column(s) of interest.
2)Use the data in the table to calculate the conditional
distribution (in percents) of the row(s) or column(s).
3)Make a graph to display the conditional distribution.
• Use a side-by-side bar graph or segmented bar
graph to compare distributions.
+
between Categorical Variables
Analyzing Categorical Data
 Relationships
Example, p. 15
Young adults by gender and chance of getting rich
Female
Male
Total
Almost no chance
96
98
194
Some chance, but probably not
426
286
712
A 50-50 chance
696
720
1416
A good chance
663
758
1421
Almost certain
486
597
1083
Total
2367
2459
4826
Male
Female
Almost no chance
98/2459 =
4.0%
96/2367 =
4.1%
Some chance
286/2459 =
11.6%
426/2367 =
18.0%
A 50-50 chance
720/2459 =
29.3%
696/2367 =
29.4%
758/2459 =
30.8%
663/2367 =
28.0%
597/2459 =
24.3%
486/2367 =
20.5%
A good chance
Almost certain
Chance
by
age
Chanceofofbeing
being wealthy
wealthy
age
3030
Chance
being
wealthyby
by
age
30
100%
90%
80%
70%
35
60%
30
25
50%
20
40%
15
10
30%
5
20%
0
10%Almost
Almost no
Some
no Some
chance
chance
chance
0% chance
Percent
Percent
Percent
Response
Calculate the conditional
distribution of opinion
among males.
Examine the relationship
between gender and
opinion.
Almost certain
Good chance
Males
Males
50-50 chance
50-50
50-50
chance
chance
Good
Good
chance
chance
Males
Females
Opinion
Opinion
Opinion
Females
Almost
Some chance
Almost
certain
certain
Almost no chance
+
Tables and Conditional Distributions
Analyzing Categorical Data
 Two-Way

As you learn more about statistics, you will be asked to solve
more complex problems.

Here is a four-step process you can follow.
How to Organize a Statistical Problem: A Four-Step Process
State: What’s the question that you’re trying to answer?
Plan: How will you go about answering the question? What
statistical techniques does this problem call for?
Do: Make graphs and carry out needed calculations.
Conclude: Give your practical conclusion in the setting of the
real-world problem.
+
a Statistical Problem
Analyzing Categorical Data
 Organizing