Statistical Consulting Topics
In R, dealing with...
• NA values for factors
• Unused factor levels
Topic 1: NA values for factors
Recall Assignment 4
> my.data=read.csv("my_data.csv")
> my.data
   subject    sex salary
1        1   male    100
2        2   male     40
3        3 female     NA
4        4 female     37
5        5            88
6        6           150
7        7 female     95
8        8   male     27
9        9   male     33
10      10 female     NA
11      11            56
12      12   male     60
> attach(my.data)
> t.test(salary~sex)
Error in t.test.formula(salary ~ sex) :
  grouping factor must have exactly 2 levels
Because salary is a continuous (numeric) variable, its
empty cells were correctly read in as missing (NA).
Because sex is categorical, its empty cells were instead
read in as an additional factor level, the empty string "".
> levels(sex)
[1] ""       "female" "male"
Most common solutions in homework:
1. Remove offending observations
(subset to observations with male/female listed).
2. Manually force empty sex cells to be NA.
Easiest (most universal) solution:
> my.data=read.csv("my_data.csv",na.strings="")
> levels(my.data$sex)
[1] "female" "male"
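The difference between the two imports can be reproduced on a small inline CSV. This is only a sketch: the six rows below are hypothetical stand-ins for my_data.csv, and stringsAsFactors=TRUE is set explicitly as an assumption, because from R 4.0.0 onward read.csv() keeps text columns as character vectors unless asked to convert them to factors.

```r
# Hypothetical CSV contents, inlined so the example is self-contained.
csv_text <- "subject,sex,salary
1,male,100
2,female,
3,,88
4,female,37
5,,150
6,male,27"

# Default import: an empty numeric cell becomes NA, but an empty text cell
# becomes the string "", which then appears as a third factor level.
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
levels(raw$sex)
sum(is.na(raw$salary))

# Declaring "" as a missing-value code turns the empty sex cells into NA too.
clean <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)
levels(clean$sex)
sum(is.na(clean$sex))
```

With the second import, sex has exactly two levels, so a two-group comparison such as t.test(salary~sex) no longer fails on the level count.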
Another reason to correctly state NA upon import:
> subset=my.data[my.data$sex=="male"|
                 my.data$sex=="female", ]
> attach(subset)
> plot(salary~sex)
[Figure: box plot of salary by sex; x-axis: sex, y-axis: salary]
> levels(sex)
[1] ""       "female" "male"
Here, the R tools that work on factors still see
sex as having 3 levels. This is true for either of the
homework solutions listed earlier: subsetting and
forcing NA both leave the empty level in place.
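A minimal sketch of why both homework solutions leave the third level behind. The factor below is a hypothetical stand-in for the sex column, not the assignment data.

```r
# Toy factor standing in for the sex column (hypothetical values).
sex <- factor(c("male", "", "female", "male", "", "female"))

# Solution 1: keep only the observations with male/female listed.
kept <- sex[sex == "male" | sex == "female"]
levels(kept)      # "" is still a level, although no observation uses it

# Solution 2: overwrite the empty cells with NA.
forced <- sex
forced[forced == ""] <- NA
levels(forced)    # "" is still a level here too

# Dropping the unused level afterwards leaves the two real groups.
levels(droplevels(kept))
levels(droplevels(forced))
```

Subsetting and assignment change the values of a factor but not its set of levels, which is why functions that loop over levels still see three groups.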
Topic 2: Unused factor levels
Consider an anthropologist’s data set in which the
variable called Material has 28 levels:
> levels(beads$Material)
 [1] "Amber"              "Amethyst"
  .
  .
  .
[27] "Unidentified Shell" "Variscite"
> length(levels(beads$Material))
[1] 28
We subsetted down to only two levels of Material
for investigation with the variable Region...
> subset=beads[beads$Material=="Calcite" |
               beads$Material=="Slate",]
> table(subset$Material,subset$Region)
                    Alentejo Algarve Estremadura Ribatejo Torres Vedras
Amber                      0       0           0        0             0
Amethyst                   0       0           0        0             0
Amphibolite                0       0           0        0             0
Biotite Schist             0       0           0        0             0
Bone                       0       0           0        0             0
Bone or Ivory              0       0           0        0             0
Bronze                     0       0           0        0             0
Calcite                    8       0          54       23           694
Calcite (shell)            0       0           0        0             0
Ceramic                    0       0           0        0             0
Chlorite Schist            0       0           0        0             0
Clay                       0       0           0        0             0
Crinoid Fossil             0       0           0        0             0
Crystal Quartz             0       0           0        0             0
Dentalium                  0       0           0        0             0
Dolomite                   0       0           0        0             0
Fish Tooth                 0       0           0        0             0
Fish Vertibra              0       0           0        0             0
Glycimeris                 0       0           0        0             0
Lignite                    0       0           0        0             0
Muscovite Schist           0       0           0        0             0
Nassarius                  0       0           0        0             0
Slate                      0       0         798       31          1007
Slate or Schist            0       0           0        0             0
Talc Schist                0       0           0        0             0
Unidentifed Stone          0       0           0        0             0
Unidentified Shell         0       0           0        0             0
Variscite                  0       0           0        0             0
To make this a more useful table, we want to
redefine the Material factor within the data
frame called subset to only include the levels
that are present.
Here’s one option:
> subset$Material=factor(subset$Material)
> table(subset$Material,subset$Region)
          Alentejo Algarve Estremadura Ribatejo Torres Vedras
Calcite          8       0          54       23           694
Slate            0       0         798       31          1007
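The effect of calling factor() again can be checked on a toy example. The vectors below are hypothetical and much smaller than the beads data; they only mimic its structure.

```r
# Toy stand-ins for Material and Region (hypothetical values).
material <- factor(c("Amber", "Calcite", "Slate", "Calcite", "Slate", "Bone"))
region   <- factor(c("Algarve", "Alentejo", "Ribatejo",
                     "Ribatejo", "Alentejo", "Algarve"))

keep         <- material == "Calcite" | material == "Slate"
sub_material <- material[keep]
sub_region   <- region[keep]

# Unused levels survive the subset, so the table keeps their all-zero rows.
nlevels(sub_material)
dim_before <- dim(table(sub_material, sub_region))

# Rebuilding the factor from the values present drops the unused levels.
sub_material <- factor(sub_material)
levels(sub_material)
dim_after <- dim(table(sub_material, sub_region))
```

Only the factor passed through factor() is rebuilt. In this sketch sub_region keeps its unused level, just as Region keeps the all-zero Algarve column in the table above.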
Before redefining Material (Y ~ Material):
> plot(log(subset$Diameter..mm.)~subset$Material,las=3)
[Figure: box plots of log(subset$Diameter..mm.) by subset$Material; all 28 level names appear on the x-axis, y-axis from about 1.5 to 3.0]
After redefining Material (Y ~ Material):
> plot(log(subset$Diameter..mm.)~subset$Material)
[Figure: box plots of log(subset$Diameter..mm.) by subset$Material; x-axis shows only Calcite and Slate, y-axis from about 1.5 to 3.0]
You can drop all unused levels for all factors in
a data frame using the droplevels() function:
> subset=beads[beads$Material=="Calcite" |
beads$Material=="Slate",]
> subset=droplevels(subset)
> levels(subset$Material)
[1] "Calcite" "Slate"
> levels(subset$Region)
[1] "Alentejo"      "Estremadura"   "Ribatejo"
[4] "Torres Vedras"
> table(subset$Material,subset$Region)
          Alentejo Estremadura Ribatejo Torres Vedras
Calcite          8          54       23           694
Slate            0         798       31          1007
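As the output above shows, droplevels() on a data frame also removed Algarve from Region, because it acts on every factor column. A small sketch on a hypothetical toy data frame (beads_toy and its values are made up for illustration) shows this, along with the except argument for leaving selected columns untouched.

```r
# Toy data frame (hypothetical values, not the real beads data).
beads_toy <- data.frame(
  Material = factor(c("Amber", "Calcite", "Slate", "Calcite", "Slate", "Bone")),
  Region   = factor(c("Algarve", "Alentejo", "Ribatejo",
                      "Ribatejo", "Alentejo", "Algarve")),
  Diameter = c(5.1, 7.3, 9.8, 6.4, 12.0, 4.2)
)

# %in% is shorthand for the chain of == comparisons joined by |.
sub_toy <- beads_toy[beads_toy$Material %in% c("Calcite", "Slate"), ]

# Drop unused levels in every factor column at once.
dropped <- droplevels(sub_toy)
levels(dropped$Material)
levels(dropped$Region)

# Leave column 2 (Region) untouched while still cleaning Material.
kept_region <- droplevels(sub_toy, except = 2)
levels(kept_region$Region)
```

Keeping the Region levels can be useful when an all-zero column is itself informative, for example to show that no Calcite or Slate beads came from a given region.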
Another option for a continuous vs. categorical
exploratory plot (violin plot):
> library(ggplot2)
> p=ggplot(subset,aes(Material,log(Diameter..mm.)))
> p + geom_violin()
[Figure: violin plots of log(Diameter..mm.) by Material (Calcite, Slate); y-axis from about 1.5 to 3.0]
> table(subset$Material)
Calcite   Slate
    779    1836