how to speak ggplot2 like a native

DC R Meetup
Predictive Analytics World
October 19th, 2010
Harlan D. Harris, PhD
ggplot's philosophy



Graphics are (should be!) created by combining a specification with data. (Wilkinson, 2005)
The specification is not the name of the visual form (bar graph, scatterplot, histogram).
The specification is a collection of rules that together describe how to build a graph: a Grammar of Graphics.
graphics as grammar
[Figure: data (columns: date, ct, sz, z) + specification (x=date, y=ct/sz, bars, group by z) = bar chart]
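The specification on this slide (x=date, y=ct/sz, bars, group by z) can be sketched in ggplot2 roughly as follows. The data frame and its values are made up for illustration, and reading "group by z" as dodged bars filled by z is an assumption, not something the slide states.

```r
library(ggplot2)

# Hypothetical data with the columns named on the slide: date, ct, sz, z.
d <- data.frame(
  date = rep(as.Date(c("2010-10-01", "2010-10-02", "2010-10-03")), times = 2),
  ct   = c(12, 10, 15, 6, 4, 8),
  sz   = c(2, 2, 3, 2, 1, 4),
  z    = rep(c("a", "b"), each = 3)
)

# x=date, y=ct/sz, bars, one bar per level of z at each date.
p <- ggplot(d, aes(x = date, y = ct / sz, fill = z)) +
  geom_bar(stat = "identity", position = "dodge")
```

Swapping in a different data frame with the same column names reuses the specification unchanged.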
advantages

Flexible
    can define new graph types by changing specifications
    can combine many forms into single graphs
Smart
    compact: rules have useful defaults
    graphs always have meaning
Reusable
    can plug new data into old specification
    can explore many types of plots from a set of data
ggplot2

Hadley Wickham (Rice Univ.)
    also: reshape2, plyr, etc.
Extends & implements The Grammar of Graphics (Wilkinson, 1999, 2005)
    Focus on layers; based on grid
    Specification as R objects constructed by functions
    Large library of components with good defaults
ggplot2: Elegant Graphics for Data Analysis (Wickham, 2009)
my gripes



Specification is hierarchical structure; grammar is left-to-right R expression; graph is spatial
    Can't see the structure (usefully)
Abuses both notation and R semantics
    Deep Magic with lazy evaluation, proto objects
Existing tutorials lead to conceptual confusion and require relearning of fundamentals
    Start with the structure, not with the shortcuts
goal
data to plot
ggplot likes “long” data
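A minimal sketch of what "long" means here, using base R's reshape(). The wide table and its numbers are invented for illustration; only the column names Parameter, Errors, and Condition echo the names used elsewhere in these slides.

```r
# Hypothetical "wide" data: one column per condition.
wide <- data.frame(
  Parameter = 1:3,
  Model     = c(10, 20, 30),
  Empirical = c(12, 18, 33)
)

# "Long" data: one row per observation, with the condition as a variable.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("Model", "Empirical"),  # columns to stack
  v.names   = "Errors",                 # name of the stacked value column
  timevar   = "Condition",              # name of the new key column
  times     = c("Model", "Empirical"),  # key values, one per stacked column
  idvar     = "Parameter"
)
rownames(long) <- NULL
```

In the long form each aesthetic can map to a single column (for example color=Condition), which is what the mappings in the later slides rely on.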
will plot model vs. empirical
simplest plot
aes = "aesthetics" = "create mapping"
structure
(you don't need to know this!)
ggplot(data=d.long.EI, mapping=aes(x=Parameter, y=Errors,
                                   color=Condition)) +
  layer(geom="line", stat="identity", position="identity")
[Diagram: the ggplot object holds data (copy), layers, mapping (x=Param., y=Errs, color=Cond.), scales, coords, facets, and options. layer[1] holds data, mapping, geom (line), geom_params, stat (identity), and stat_params; fields marked Ø are empty.]
structure(p), str(p)
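A runnable sketch of poking at that structure. The data frame stands in for d.long.EI and its values are made up. Recent ggplot2 releases require stat and position to be given explicitly in layer(), so they are spelled out here; the accessors used (p$mapping, p$layers) are one convenient way in, and str(p) shows everything at once.

```r
library(ggplot2)

# Hypothetical stand-in for d.long.EI.
d <- data.frame(
  Parameter = rep(1:5, times = 2),
  Errors    = c(90, 80, 72, 66, 62, 85, 78, 70, 61, 55),
  Condition = rep(c("A", "B"), each = 5)
)

p <- ggplot(data = d,
            mapping = aes(x = Parameter, y = Errors, color = Condition)) +
  layer(geom = "line", stat = "identity", position = "identity")

mapped     <- names(p$mapping)             # aesthetics mapped at the top level
n_layers   <- length(p$layers)             # number of layers added so far
layer_geom <- class(p$layers[[1]]$geom)[1] # geom object used by layer 1
layer_stat <- class(p$layers[[1]]$stat)[1] # stat object used by layer 1
```

Note that aes() stores color under the name colour.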
add empirical data and chance
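One way this step might look, as a sketch rather than the code from the talk: a line for the model, points for the empirical data (a layer with its own data), and a dashed horizontal line for chance. The data values and the Source column are invented; the chance level of 64 is borrowed from the yint=[64] label in the structure diagram.

```r
library(ggplot2)

# Hypothetical model predictions and empirical observations.
model <- data.frame(Parameter = 1:5, Errors = c(90, 80, 72, 66, 62))
empirical <- data.frame(Parameter = 1:5, Errors = c(88, 83, 70, 68, 60))

p <- ggplot(model, aes(x = Parameter, y = Errors)) +
  geom_line() +                                   # model: inherits data and mapping
  geom_point(data = empirical, size = 3) +        # empirical: own data, same mapping
  geom_hline(yintercept = 64, color = "black",    # chance: constant, no data needed
             linetype = 2)
```

Each layer can override the plot-level data and mapping, which is why the points and the reference line can sit on top of the model line without touching the original specification.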
structure so far
(you don't need to know this!)
[Diagram: the ggplot object holds data (copy), layers, mapping (x=Param., y=Errs, color=Cond.), scales, coords, facets, and options. The layers use the geoms line, point, hline, and hline, with stats identity, identity, hline, and hline. Some layers carry their own data subsets, labeled (U) and (K), and their own mappings (yint=Errs, yint=[64]). Parameters shown include size=3, size=2, color="black", linetype=2, and size=.5.]
scales
coordinates & scales

coordinates affect display of axes
    cartesian, polar, map, etc.
scales affect data mapping
    colors, shapes, lines
source of confusion
    set axis ticks/breaks and labels with scale_x_continuous() or scale_y_discrete(), but:
        restrict DATA range with scale_*(limits=c(1,10))
        restrict AXIS (plotted) range with coord_cartesian(xlim=c(1,10))
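The DATA vs. AXIS distinction is easiest to see with a layer that computes a statistic. In this sketch (made-up data, limits on y rather than x for convenience), the same boxplot is restricted both ways and the computed median is read back with layer_data().

```r
library(ggplot2)

# Made-up data: the values 1..20 in a single group.
d <- data.frame(g = "all", y = 1:20)
base <- ggplot(d, aes(x = g, y = y)) + geom_boxplot()

# Scale limits act on the DATA: values outside 1..10 are dropped
# (with a warning) before the boxplot statistics are computed.
p_scale <- base + scale_y_continuous(limits = c(1, 10))

# Coordinate limits act on the VIEW: statistics use all 20 values,
# and the plot is only zoomed to 1..10.
p_coord <- base + coord_cartesian(ylim = c(1, 10))

median_scale <- layer_data(p_scale)$middle  # median of the retained values only
median_coord <- layer_data(p_coord)$middle  # median of the full data
```

The two medians differ, so the two plots show different boxes even though both axes run from 1 to 10.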
options
shortcuts




All those layer() calls are tedious!
geom_*() creates a layer with a specific geom
(and various defaults, including a stat)
stat_*() creates a layer with a specific stat
(and various defaults, including a geom)
qplot() creates a ggplot and a layer
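A small sketch of what the shortcuts expand to. It builds layers three ways and records which geom and stat each one ends up with; no data is needed just to construct the layer objects.

```r
library(ggplot2)

# The long way: every component named explicitly.
explicit <- layer(geom = "line", stat = "identity", position = "identity")

# geom shortcut: picks the geom, fills in a default stat.
shortcut <- geom_line()

# stat shortcut: picks the stat, fills in a default geom.
from_stat <- stat_summary()

shortcut_stat  <- class(shortcut$stat)[1]   # default stat chosen by geom_line()
from_stat_geom <- class(from_stat$geom)[1]  # default geom chosen by stat_summary()
```

Either default can be overridden, for example stat_summary(geom="crossbar") as on a later slide.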
quick note on stats


stat="identity"
stat="smooth", method="lm"
    fit y=f(x) with lm(), generate new data to be plotted by geom_line(), CIs with geom_ribbon()
stat="smooth"
    fit y=f(x) with loess()
stat="summary"
    y=f(x) with arbitrary f()
stat="bin"
    histograms
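To make "generate new data to be plotted" concrete, this sketch pulls the computed data out of two stat layers with layer_data(). The data is simulated, and the linear fit is requested through stat_smooth(method="lm"), which is how current ggplot2 exposes it.

```r
library(ggplot2)

set.seed(1)
d <- data.frame(x = 1:30, y = 2 * (1:30) + rnorm(30))

# Linear fit: the stat replaces the raw data with fitted values and a CI band.
p_lm <- ggplot(d, aes(x, y)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)
fit <- layer_data(p_lm, 2)   # columns include y, ymin, ymax

# Binning: the stat replaces the raw data with one row per bin.
p_bin <- ggplot(d, aes(y)) + stat_bin(bins = 5)
bins <- layer_data(p_bin)    # columns include count
```

The geom then draws whatever the stat returned: a line and ribbon for the fit, bars for the bin counts.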
simplest faceted plot
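A minimal faceting sketch, assuming long data with a Condition column as in the earlier slides (the values here are made up): adding facet_wrap() splits one specification into one panel per condition.

```r
library(ggplot2)

# Hypothetical long data with two conditions.
d <- data.frame(
  Parameter = rep(1:5, times = 2),
  Errors    = c(90, 80, 72, 66, 62, 85, 78, 70, 61, 55),
  Condition = rep(c("A", "B"), each = 5)
)

p <- ggplot(d, aes(x = Parameter, y = Errors)) +
  geom_line() +
  facet_wrap(~ Condition)   # one panel per level of Condition

n_panels <- length(unique(layer_data(p)$PANEL))
```

Everything else in the specification (mapping, layers, scales) is shared across the panels.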
everything else (+alpha)
other things I find useful






scale_x_continuous(breaks=seq(1,9,2),
                   labels=c("one", "", "five", "", "nine"))
geom_text(aes(x=.., y=.., label=..))
annotate(geom="text", x=14, y=19, label="outlier!")
geom_density()
stat_summary(fun.data="mean_cl_boot",
             geom="crossbar")
geom_jitter(position=position_jitter(width=.5))
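Two of these in context, as a sketch on made-up data: custom breaks and labels on the x scale, plus a one-off text annotation next to a single point. The hjust value is an arbitrary choice to nudge the label to the left of the point.

```r
library(ggplot2)

# Made-up data with one point far from the rest.
d <- data.frame(x = c(1, 3, 5, 7, 9, 14), y = c(2, 4, 5, 8, 9, 19))

p <- ggplot(d, aes(x, y)) +
  geom_point() +
  scale_x_continuous(breaks = seq(1, 9, 2),
                     labels = c("one", "", "five", "", "nine")) +
  annotate(geom = "text", x = 14, y = 19, label = "outlier!", hjust = 1.2)

note <- layer_data(p, 2)   # the annotation is an ordinary layer with its own data
```

annotate() takes fixed values rather than an aes() mapping, which suits labels that are not driven by the data frame.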
“fizzy bubbly” plot
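library(ggplot2)
# movies shipped with ggplot2 in 2010; newer releases keep it in ggplot2movies
rated.movies <- subset(movies, mpaa != "")
# re-create the factor to drop the now-empty "" level
rated.movies$mpaa <- factor(rated.movies$mpaa)
p <- ggplot(rated.movies, aes(mpaa, rating)) +
  geom_jitter(alpha=.5) +
  # mean_sdl needs the Hmisc package
  stat_summary(fun.data="mean_sdl", geom="crossbar",
               color="red", size=1)
ggsave("movies.png", p, dpi=150)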
takehomes





a ggplot graph is generated by a specification +
data
ggplot specifications are a core object plus
layers
mappings among data, x/y, scales, and other
attributes are fundamental
geom and stat shortcuts allow smart/compact
construction of graphs
ggplot encourages good graphs, with facets,
good use of color, minimal chartjunk
2010 case study competition winner
resources





Wickham, H. (2009) ggplot2: Elegant Graphics
for Data Analysis. Springer.
http://had.co.nz/ggplot2/
http://groups.google.com/group/ggplot2
http://stackoverflow.com/questions/tagged/r
http://github.com/hadley/ggplot2/wiki
thanks!