Why I love R - University of Birmingham

Why I love R
Alastair Sanderson
School of Physics & Astronomy, University of Birmingham
2012-03-20
Outline
1 Introduction
2 Key attributes
3 Capabilities
4 Examples
5 Summary
Introduction
Outline
Key attributes of R:
free/open source
widely used
well documented
a high level programming language
Capabilities of R:
data handling
statistical/numerical analysis & modelling
data visualisation
reproducible research
Some examples using R
Summary
Key attributes
R is...
Freely available
Free/open source software
... with free/open source IDEs (integrated development environments), e.g.:
RStudio: http://rstudio.org/
Emacs Speaks Statistics (ESS): http://ess.r-project.org/
According to John Chambers, in his excellent book Software for Data Analysis:
Programming with R, the mission of R is "... to enable the best and most
thorough exploration of data possible", with the associated prime directive
that "the computations and the software should be trustworthy: they should do
what they claim, and be seen to do so."
Widely used (1/2)
Mature software (v1.0 released in 2000); runs on a wide range
of platforms; with an annual development update cycle
Large & growing user base (1-2 million)
Many user contributed packages (>3500), very easy to install:
install.packages("mypkg")
library("mypkg")
Figure: Google searches for R plot: solidly growing interest
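A common pattern is to install a package only when it is missing. The sketch below assumes a helper called ensure_package, which is a made-up name and not part of R; installation is switched off by default, so the code runs without network access.

```r
# Hypothetical helper (not part of R): check for a package and optionally
# install it if it is missing, then report whether it is available.
ensure_package <- function(pkg, install = FALSE) {
  # requireNamespace() returns TRUE if the package can be loaded
  available <- requireNamespace(pkg, quietly = TRUE)
  if (!available && install) {
    install.packages(pkg)  # needs network access and a CRAN mirror
    available <- requireNamespace(pkg, quietly = TRUE)
  }
  available
}

has_stats <- ensure_package("stats")  # 'stats' ships with every R installation
```

Calling ensure_package("mypkg", install = TRUE) would attempt the installation shown on the slide.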
Widely used (2/2)
R is used by major institutions, e.g. Google, Facebook, NY
Times, New Scientist etc.
Track R's popularity: http://r4stats.com/popularity
R is the favourite tool for users of Kaggle (a platform for
predictive modelling and analytics competitions):
http://blog.kaggle.com/2011/11/27/kagglers-favorite-tools
Well documented
Excellent help pages, richly cross-linked:
help(package="base")     # List package contents
?data.frame              # help on a specific task
?Syntax                  # help on a general topic
news(package="ggplot2")  # details of recent changes
demo("plotmath")         # demonstrate maths annotation
browseVignettes()        # view index of vignettes in web browser
See documentation links at http://www.r-project.org for manuals,
wiki, R journal, books etc.
CRAN task views: http://cran.r-project.org/web/views/
Type function name without brackets to view its R source code
Many R bloggers, aggregated at http://www.r-bloggers.com/
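As a small illustration of exploring R from within R, the following sketch captures a function's source as text and searches for functions by name. The variable names are chosen for this example only.

```r
# Typing a function name without brackets (e.g. sd) prints its source;
# deparse() captures the same source as a character vector:
sd_source <- deparse(sd)

# formals() lists a function's arguments without opening the help page:
sd_args <- names(formals(sd))

# apropos() finds objects on the search path whose names match a
# regular expression, here everything starting with "read.":
read_functions <- apropos("^read\\.")
```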
A high level programming language
Functional, object oriented language:
http://cran.r-project.org/doc/manuals/R-lang.html
http://cran.r-project.org/doc/manuals/R-ints.html
Debugger, code profiling (Rprof) etc.:
http://cran.r-project.org/doc/manuals/R-exts.html
Can link to compiled code (C, C++, Fortran): see ?Foreign;
e.g. seamless integration of R with C++:
http://cran.r-project.org/web/packages/Rcpp/index.html
Support for parallel computing:
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Writing packages in R:
http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf
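A minimal sketch of the functional and object-oriented sides of the language, using a closure and an S3 method. The class name "galaxy" and the object values are invented for illustration.

```r
# Functions are first-class values: make_power() returns a closure
make_power <- function(exponent) {
  function(x) x^exponent
}
cube <- make_power(3)
cubes <- sapply(1:4, cube)  # apply the closure to each element

# S3 object orientation: a constructor plus a method for the generic
# summary() ("galaxy" is a made-up class for illustration)
galaxy <- function(name, velocity) {
  structure(list(name = name, velocity = velocity), class = "galaxy")
}
summary.galaxy <- function(object, ...) {
  sprintf("%s: %.0f km/s", object$name, object$velocity)
}
g <- galaxy("NGC 0000", 2500)  # hypothetical object
g_summary <- summary(g)        # dispatches to summary.galaxy()
```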
Capabilities
Data handling
Data input/output:
http://cran.r-project.org/doc/manuals/R-data.html
Data structures: e.g. vectors, factors, arrays/matrices,
lists/data frames: http://cran.r-project.org/doc/manuals/R-intro.html
?Extract # details of operators to extract/replace parts of data structures
?apply; ?sapply; ?aggregate; ?sweep # vectorised operations in R
?subset; ?transform; ?match; ?merge # data access & join commands
?regex                       # details of regular expressions capabilities
install.packages("stringr")  # make it easier to work with strings
install.packages("RODBC")    # Open Database Connectivity interface
Very powerful & convenient data manipulation packages:
plyr : http://cran.r-project.org/web/packages/plyr/index.html
reshape2 :
http://cran.r-project.org/web/packages/reshape2/index.html
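To show how a few of the base R commands listed above fit together, here is a short sketch on a toy data frame. The data and column names are made up for this example.

```r
# Toy data (made up for illustration)
obs <- data.frame(id = c(1, 2, 3, 4),
                  group = factor(c("a", "b", "a", "b")),
                  value = c(10, 20, 30, 40))
info <- data.frame(group = c("a", "b"), label = c("control", "treated"))

big <- subset(obs, value > 15)                      # row selection
obs <- transform(obs, scaled = value / max(value))  # add a derived column
joined <- merge(obs, info, by = "group")            # join on a key column
means <- aggregate(value ~ group, data = obs, FUN = mean)  # grouped summary
col_classes <- sapply(obs, class)                   # apply over columns
```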
Statistical analysis & modelling
A brief summary:
?Distributions         # details of supported statistical distributions
?RNG                   # details of random number generation
help(package="stats")  # list contents of base stats package
?NA                    # support for missing data
library("cluster")     # cluster analysis
library("boot")        # bootstrap resampling
library("survival")    # survival analysis
?lm; ?nls; ?anova      # linear/non-linear regression/ANOVA
See also these CRAN task views:
http://cran.r-project.org/web/views/Distributions.html
http://cran.r-project.org/web/views/Multivariate.html
http://cran.r-project.org/web/views/MachineLearning.html
http://cran.r-project.org/web/views/Spatial.html
http://cran.r-project.org/web/views/Robust.html
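A brief sketch of a linear regression on simulated data, including one missing value. The true intercept (2) and slope (3) are assumptions of the simulation, not results from the talk.

```r
set.seed(42)                        # make the simulated data reproducible
x <- 1:50
y <- 2 + 3 * x + rnorm(50, sd = 1)  # straight line with Gaussian noise
y[5] <- NA                          # one missing value, dropped by lm()

fit <- lm(y ~ x)                    # ordinary least-squares fit
estimates <- coef(fit)              # intercept and slope
fit_table <- anova(fit)             # analysis of variance table
n_used <- nobs(fit)                 # number of observations actually used
```

summary(fit) prints the usual table of coefficients, standard errors and R-squared.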
Numerical analysis & modelling
Wide variety of capabilities, e.g:
?integrate                   # numerical integration
?optim; ?nlminb              # general-purpose optimisation
?D                           # symbolic differentiation
install.packages("deSolve")  # solve differential equations
library("Matrix")            # sparse and dense matrix classes and methods
?spline; ?smooth.spline      # spline interpolation
library("splines")           # regression spline functions and classes
?prcomp                      # Principal Components Analysis
See also the following CRAN task views:
http://cran.r-project.org/web/views/Optimization.html
http://cran.r-project.org/web/views/HighPerformanceComputing.html
http://cran.r-project.org/web/views/ChemPhys.html
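The following sketch exercises a few of the base R numerical tools listed above on simple test problems. The functions being integrated, minimised and differentiated are chosen purely for illustration.

```r
# Numerical integration: area under the standard normal density
area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# General-purpose optimisation: minimise a simple quadratic bowl
bowl <- function(p) (p[1] - 1)^2 + (p[2] - 2)^2
best <- optim(par = c(0, 0), fn = bowl, method = "BFGS")

# Symbolic differentiation, then numerical evaluation at x = 2
slope_expr <- D(expression(x^3), "x")
slope_at_2 <- eval(slope_expr, list(x = 2))

# Spline interpolation through a few points of sin(x)
xs <- seq(0, pi, length.out = 9)
interp <- spline(xs, sin(xs), n = 50)
```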
Data visualisation
help(package="graphics")     # contents of base graphics package
?Devices                     # details of available output devices
library(lattice)             # excellent for highly structured data
library(grid)                # lower-level hierarchical graphics
install.packages("ggplot2")  # fantastic graphics package!
install.packages("rgl")      # interactive graphics
CRAN task view on graphics:
http://cran.r-project.org/web/views/Graphics.html
ggplot2 web resources:
http://had.co.nz/ggplot2/
http://crantastic.org/packages/ggplot2
see Andy's talk
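As a minimal sketch of base graphics and output devices, the code below writes a histogram of a built-in data set to a temporary PDF file. The file name and plot labels are arbitrary choices for this example.

```r
plot_file <- tempfile(fileext = ".pdf")  # temporary output file
pdf(plot_file, width = 6, height = 4)    # open a PDF graphics device
hist(faithful$eruptions, breaks = 20,    # 'faithful' is a built-in data set
     main = "Old Faithful eruptions", xlab = "Duration (minutes)")
dev.off()                                # close the device to write the file
```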
Reproducible research
"The term reproducible research was first proposed by Jon Claerbout at Stanford
University and refers to the idea that the ultimate product of research is the
paper along with the full computational environment used to produce the results
in the paper such as the code, data, etc. necessary for reproduction of the
results and building upon the research."
quote from http://en.wikipedia.org/wiki/Reproducibility
Sweave (see ?Sweave): http://www.statistik.lmu.de/~leisch/Sweave/
xtable package - export tables to LaTeX or HTML
Using Emacs Org mode (http://orgmode.org) with R:
http://orgmode.org/worg/org-contrib/babel/languages/ob-doc-R.html
http://orgmode.org/worg/org-contrib/babel/uses.html
RC package (see Ian's talk next meeting)
CRAN task view: http://cran.r-project.org/web/views/ReproducibleResearch.html
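A rough sketch of the Sweave workflow: R code and results are embedded in a LaTeX source file, which Sweave() turns into a .tex file. The document content and file names here are invented for the example, and compiling the result to PDF would additionally need a LaTeX installation.

```r
# Write a tiny Sweave (.Rnw) document to a temporary file
rnw <- tempfile(fileext = ".Rnw")
tex <- sub("\\.Rnw$", ".tex", rnw)
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "The mean of 1 to 10 is \\Sexpr{mean(1:10)}.",
  "<<summary-chunk, echo=TRUE>>=",
  "summary(c(1, 2, 3, 4))",
  "@",
  "\\end{document}"
), rnw)

# Run the embedded R code and weave code and output into a LaTeX file
utils::Sweave(rnw, output = tex, quiet = TRUE)
tex_lines <- readLines(tex)
```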
Examples
Hybrid R plot/image graphics
Figure: Integrating spatial data with a map, using ggplot2 in R
http://blog.revolutionanalytics.com/2012/02/what-are-the-most-popular-bike-routes-in-london.html
A New York Times graphic using R
Figure: Michael Jackson's billboard rankings vs. the Beatles (top, in red)
and U2 (bottom; in red)
These charts were done mostly in R and were published within
hours of Michael Jackson's death:
http://blog.revolutionanalytics.com/2009/06/nyt-charts-michael-jacksons-pop-hits.html
The full, interactive graphic is here:
http://www.nytimes.com/interactive/2009/06/25/arts/0625-jackson-graphic.html
Dates & time series
> dt <- as.Date("2012-03-20")
> dt - as.Date("01/Jan/12", format="%d/%b/%y")
Time difference of 79 days
> months(dt)
[1] "March"
> weekdays(dt)
[1] "Tuesday"
For more information:
?DateTimeClasses; ?Dates
Time series handling & modelling:
?acf    # auto-correlation
?ccf    # cross-correlation
?arima  # Fit ARIMA models to univariate time series
install.packages("zoo")       # } very useful time series
install.packages("forecast")  # } packages
CRAN task view: http://cran.r-project.org/web/views/TimeSeries.html
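To make the time series functions concrete, here is a sketch that simulates an AR(1) series and recovers its coefficient. The coefficient 0.7 and the series length are assumptions of the simulation.

```r
set.seed(1)                                        # reproducible simulation
sim <- arima.sim(model = list(ar = 0.7), n = 500)  # simulated AR(1) series

autocorr <- acf(sim, plot = FALSE)                 # sample auto-correlation
fit <- arima(sim, order = c(1, 0, 0))              # fit an AR(1) model
ar1 <- coef(fit)[["ar1"]]                          # estimated AR coefficient

# Regular date sequences build on the Date class
monthly <- seq(as.Date("2012-03-20"), by = "month", length.out = 3)
```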
Visualising multivariate galaxy data
2D kernel density estimates; semi-transparency;
colour/shape/size encoding
http://www.sr.bham.ac.uk/~ajrs/talks/SandersonAlastair_user2011_talk.pdf
Figure: Example (galaxy spatial distribution): Declination vs. Right Ascension, with legends for Luminosity (Solar), Scaled velocity and Morphology (Early, Late, ?)
Figure: Example (histogram of galaxy velocities): Number of galaxies vs. Galaxy velocity (km/s), with a legend for Morphology (Early, Late, ?)
World Bank data demo - part 1
Load package, download & save data
require("WDI") # load R package to access World Bank data
MSdata <- WDI(indicator="IT.CEL.SETS.P2", start=1990, end=2010)
save(MSdata, file="MSdata.RData")
Plot curves for each country & median line for all countries
## Use shorter name for data column:
MSdata <- transform(MSdata, MSpc = IT.CEL.SETS.P2)
require(ggplot2)
ggplot(data=MSdata, aes(year, MSpc, group=country)) +
  geom_line(alpha=0.1) +  # semi-transparent lines: 10% of normal opacity
  ## add median curve for all countries (i.e. over-ride grouping):
  stat_summary(fun=median, geom="line", aes(group=1), colour="blue") +
  scale_y_log10() +
  ylab("Mobile cellular subscriptions per 100 people")
Fraction of population with mobile phones
Figure: Mobile cellular subscriptions per 100 people (log scale) vs. year (1990-2010)
World Bank data demo - part 2
Plot 20 most recent countries to get mobile phones
require(plyr) # extremely useful package
## Year when a country with no subscriptions first registered some
## (NB: any(MSpc>0) is evaluated over all rows, not per country):
firstMS <- ddply(subset(MSdata, MSpc==0 & any(MSpc>0)), .(country),
                 summarise, year1 = max(year) + 1)
## sort by the year mobile phone use first starts:
firstMS <- firstMS[order(firstMS$year1), ]
## Create dotplot:
ggplot(data=tail(firstMS, 20), aes(year1, reorder(country, year1))) +
  geom_point() +
  xlab("Year of first mobile cellular subscriptions") + ylab("")
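In the subset() call above, any(MSpc>0) is tested across all rows rather than within each country. As an alternative sketch (not the code used for the talk), the same quantity can be computed per country in base R. The toy data frame and the helper first_year are hypothetical and only mimic the shape of MSdata.

```r
# Toy data in the same shape as MSdata (values are made up)
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 4),
  year    = rep(2000:2003, times = 3),
  MSpc    = c(0, 0, 1.5, 3,    # A: subscriptions start in 2002
              0, 2, 4,   6,    # B: subscriptions start in 2001
              0, 0, 0,   0)    # C: none in this period
)

# For one country's rows: the year after the last zero-subscription year,
# or NA if the country has no zero years or never registers subscriptions
first_year <- function(d) {
  if (!any(d$MSpc == 0, na.rm = TRUE) || !any(d$MSpc > 0, na.rm = TRUE)) {
    return(NA_real_)
  }
  max(d$year[which(d$MSpc == 0)]) + 1
}

year1 <- sapply(split(toy, toy$country), first_year)
year1 <- sort(year1[!is.na(year1)])  # named vector, earliest first
```

Countries that never register a subscription (C in the toy data) are dropped instead of being assigned the year after the last observation.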
Year that mobile phone usage starts
Figure: dotplot of the year of first mobile cellular subscriptions (x-axis 1998-2008) for 20 countries, listed top to bottom: Korea, Dem. Rep.; Tuvalu; Eritrea; Guinea-Bissau; Comoros; Bhutan; Sao Tome and Principe; Micronesia, Fed. Sts.; Mayotte; Iraq; Afghanistan; Somalia; Sierra Leone; Mauritania; Liberia; Chad; Syrian Arab Republic; Nepal; Ethiopia; Swaziland
Summary
Conclusions
It is free/open source software
It has a rapidly growing user base (1-2 million)
It is widely used in academia, business & industry
Today, all of the Fortune 500 companies use R for their
data analyses
http://www.r-bloggers.com/open-source-is-opening-data-to-predictive-analytics/
I love R because it empowers the individual by enabling
cutting-edge processing, analysis & visualisation of data,
based on trustworthy computations and software.
Birmingham R User Meeting (BRUM)
http://www.birminghamR.org
Alastair Sanderson:
http://www.sr.bham.ac.uk/~ajrs