Revolutions
Learn more about using open source R for big data analysis,
predictive modeling, data science and more from the staff of
Revolution Analytics.
September 20, 2013
R and The Journal of Computational and Graphical Statistics
by Joseph Rickert
I don’t think most people find reading articles in statistical journals to be easy going. In my
experience, the going is particularly rough when trying to learn something completely new, and I
don’t expect it could be any other way: there is no getting around the hard work. However, at
least in the field of computational statistics, things seem to be getting a little easier. These days
it is very likely that you will find some code included in the supplementary material for a journal
article; at least in the Journal of Computational and Graphical Statistics (JCGS), anyway. JCGS,
which was started in 1992 with the mission of presenting “the very latest techniques on
improving and extending the use of computational and graphical methods in statistics and data
analysis”, still seems to be the place to publish. (Stanford's Rob Tibshirani published an article in
Volume 1, Issue 1 back in 1992, with Michael LeBlanc, and another in the most recent issue,
with Noah Simon, Jerome Friedman and Trevor Hastie.) Driven by the imperative to produce
reproducible research, most authors in this journal include some computer code to facilitate
independent verification of their results. Of the 80 non-editorial articles published in the last 6
issues of JCGS, all but 9 included computer code as part of the supplementary materials. The
following table lists the counts by type of software included. (Note that a few articles included
code in multiple languages, R and C++ for example.)
                 June13  March13  Dec12  Sept12  June12  March12  total_by_code
R                     9        9      5       5       7        7             42
Matlab                6        0      1       3       4        4             18
C                     0        0      1       2       1        0              4
C++                   0        0      0       1       1        2              4
other                 0        1      0       3       0        2              6
none                  0        6      2       0       1        0              9
total_by_month       15       16      9      14      14       15             83
R code accounted for 57% of the 74 instances of software included in the supplementary
materials. I think an important side effect of the inclusion of code is that studying the article
becomes much easier for everyone. Seeing the R code is like walking into a room full of people
and spotting a familiar face: you know where to start. And it at least seems feasible to “reverse
engineer” the article: look at the input data, run the code, see what it produces and map it back to
the math.
The following code comes from the supplementary material included in the survey article
“Computational Statistical Methods for Social Network Models” by Hunter, Krivitsky and
Schweinberger in the December 2012 issue of JCGS.
# Some of the code from the Appendix of the article
# "Computational Statistical Methods for Social Network Models"
# by Hunter, Krivitsky and Schweinberger in the December 2012 issue of JCGS.

# Two-dimensional Euclidean latent space model with three clusters and
# random receiver effects
library(latentnet)
data(sampson)
monks.d2G3r <- ergmm(samplike ~ euclidean(d=2, G=3) + rreceiver)
Z <- plot(monks.d2G3r, rand.eff="receiver", pie=TRUE, vertex.cex=2)
text(Z, label=1:nrow(Z))

# Three-dimensional Euclidean latent space model with three clusters and
# random receiver effects
library(latentnet)
data(sampson)
monks.d3G3r <- ergmm(samplike ~ euclidean(d=3, G=3) + rreceiver)
plot(monks.d3G3r, rand.eff="receiver", use.rgl=TRUE, labels=TRUE)
The first block of code produces the graph below.
The sampson data set contains social network data that Samuel F. Sampson collected in the late
‘60s when he was a resident experimenter at a monastery in New England. The call to ergmm()
fits a “latent space model” by embedding the data in a two-dimensional Euclidean space,
clustering it into three groups and including a random “receiver” effect. The second block of
code produces a way cool, interactive three-dimensional plot that you can rotate.
Posted by Joseph Rickert at 08:27 in academia, graphics, packages, R
September 19, 2013
Real Pirate Attacks, Charted with R
To mark Talk Like a Pirate Day, Bob Rudis uses R to animate a map of the cumulative real-world
pirate attacks since 1978. Looks like the Caribbean and the West Indies, traditional pirate haunts,
are still active, but the real hot spot in modern times is Africa. Find the R code behind the
animation at the blog post linked below.
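For a flavor of how such an animation can be put together in R, here is a minimal sketch that plots cumulative attack locations year by year and writes one PNG per frame. The attack locations below are invented for illustration; Bob Rudis's actual code is at the link that follows.

# Plot cumulative attack locations through each year and save one frame per year.
# The frames can then be stitched into an animated GIF, e.g. with ImageMagick.
library(maps)

set.seed(7)
attacks <- data.frame(
  year = sample(1978:2012, 500, replace = TRUE),   # made-up years
  lon  = runif(500, -180, 180),                    # made-up coordinates
  lat  = runif(500, -40, 40)
)

for (yr in sort(unique(attacks$year))) {
  png(sprintf("pirates_%d.png", yr), width = 800, height = 500)
  map("world", fill = TRUE, col = "grey85")
  with(subset(attacks, year <= yr),
       points(lon, lat, pch = 19, cex = 0.5, col = "red"))
  title(main = paste("Cumulative pirate attacks through", yr))
  dev.off()
}
# system("convert -delay 40 pirates_*.png pirates.gif")   # optional GIF step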
rud.is: Animated IRL Pirate Attacks In R
Posted by David Smith at 13:25 in graphics, R
September 09, 2013
An animated peek into the workings of Bayesian Statistics
One of the practical challenges of Bayesian statistics is being able to deal with all of the complex
probability distributions involved. You begin with the likelihood function of interest, but once
you combine it with the prior distributions of all the parameters, you end up with a complex
posterior distribution that you need to characterize. Since you usually can't calculate that
distribution analytically, the best you can do is to simulate from that distribution (generally,
using Markov-Chain Monte-Carlo techniques). Packages like RStan handle the simulation for
you, but it's fairly easy to use the Metropolis-Hastings algorithm to code it yourself, at least for
simple problems.
PhD student Maxwell Joseph did just that, using the R language. Beginning with a data set of 50
points, he set out to estimate the joint posterior distribution of the mean and variance, given
simple priors (Normal for the mean; Uniform for the variance). He ran three chains of the M-H
algorithm simultaneously, and created the animation below. You can see each of the chains
(purple, red and blue) as they progress through the joint distribution of the mean (horizontal axis)
and variance (vertical axis), and see how the posterior distribution evolves over time in the 3-D
image to the right.
I love the amoeba-like effect as the posterior converges to something close to a 2-D Gaussian
distribution, and as you'd expect the mode of that posterior gives excellent estimates for the true
mean and variance.
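For readers who want to see the mechanics, here is a minimal, self-contained sketch of such a random-walk Metropolis-Hastings sampler. This is not Maxwell's code (which is linked below); the data, priors and tuning constants are invented for illustration.

# Metropolis-Hastings sketch: sample the posterior of (mu, sigma2) for 50 normal
# observations, with a Normal prior on the mean and a Uniform(0, 100) prior on
# the variance. All numbers here are made up.
set.seed(42)
y <- rnorm(50, mean = 5, sd = 2)

log_post <- function(mu, sigma2) {
  if (sigma2 <= 0 || sigma2 > 100) return(-Inf)        # Uniform prior support
  sum(dnorm(y, mu, sqrt(sigma2), log = TRUE)) +        # likelihood
    dnorm(mu, 0, 10, log = TRUE)                       # Normal(0, 10^2) prior on mu
}

n_iter <- 5000
chain  <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("mu", "sigma2")))
chain[1, ] <- c(0, 1)                                  # arbitrary starting values

for (i in 2:n_iter) {
  prop <- chain[i - 1, ] + rnorm(2, sd = c(0.3, 0.5))  # random-walk proposal
  log_ratio <- log_post(prop[1], prop[2]) -
               log_post(chain[i - 1, 1], chain[i - 1, 2])
  chain[i, ] <- if (log(runif(1)) < log_ratio) prop else chain[i - 1, ]
}

colMeans(chain[-(1:1000), ])                           # posterior means after burn-in

Running two more chains from different starting values and tracing each chain's path over the joint density gives exactly the kind of animation shown above.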
Maxwell shares all of the R code for setting up the likelihood and priors, running the
Metropolis-Hastings chains, and animating the results at his blog, Ecology in silico. Note the use
of R's system() command to call ImageMagick's convert to stitch individual PNG frames into the
animated GIF you see above. (Another alternative is to use Yihui Xie's animation package, but
the direct method works just as well.)
Ecology in silico: Animating the Metropolis Algorithm (via Allison Barner)
Posted by David Smith at 16:47 in graphics, R, statistics
August 20, 2013
The financial meltdown, to a trance beat
With the FMS Symphony by csv soundsystem you can listen to the Global Financial Crisis as
you watch interest rates plunge while the Treasury floods the market with emergency funds.
The source data for the chart and music comes from daily emails (like this one) sent by the US
Treasury summarizing the cash spending and borrowing of the Federal government. (Incredibly,
this data is not available in any structured format ready for analysis.) The CSV Soundsystem
team then used R packages for data processing (plyr, reshape2), sonification (tuneR and ddr,
which makes musical sounds including "cheesy synth") and visualization (aplpack) to produce
the chart, animation and music.
We used principal component analysis to rotate the 52 line-items and plotted the 15
highest-loaded components as Chernoff faces. We plotted interest rate and federal account
balance with line-width as standard deviation of all the line-items.
We represented similar data in audio. Chords were selected based on the derivative of account
balance, and a melody was composed based on the federal interest rate. We also included a
contrapuntal riff driven by the distance between accumulated federal debt and the legal debt
ceiling.
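To make the Chernoff-face step a little more concrete, here is a rough sketch of the same idea on made-up data. This is not the csv soundsystem code: the dimensions and values are invented, and aplpack's faces() stands in for whatever exact plotting call they used.

# Rotate daily line-items with PCA and draw one Chernoff face per day.
library(aplpack)

set.seed(1)
line_items <- matrix(rnorm(30 * 52), nrow = 30, ncol = 52)  # 30 days x 52 line-items (made up)

pc     <- prcomp(line_items, scale. = TRUE)   # principal component analysis
scores <- pc$x[, 1:15]                        # first 15 components; faces() encodes up to 15 variables

faces(scores[1:12, ], face.type = 1)          # Chernoff faces for the first 12 days

The sonification side can be sketched in the same spirit: map a numeric series to pitches and write the result out with tuneR. Again, the numbers are invented and this is not the actual FMS mapping.

library(tuneR)

rates <- c(0.25, 0.5, 1.0, 2.0, 1.5, 0.75, 0.25)             # made-up daily interest rates
freqs <- 220 * 2 ^ scale(rates)[, 1]                         # higher rate -> higher pitch
notes <- lapply(freqs, sine, duration = 0.3, xunit = "time") # one short tone per day
writeWave(normalize(do.call(bind, notes), unit = "16"), "rates_melody.wav")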
If you enjoyed the house-trance stylings of federal treasury statistics, you can even download the
music track for your playlist. If you want to embark on a similar project, you can find more
behind-the-scenes details and the code at GitHub.
csv soundsystem: FMS Treasury
Posted by David Smith at 17:27 in finance, graphics, R
August 13, 2013
Visualizing Katrina's strongest winds, with R
Here's a new picture of the devastation wrought by Katrina in 2005. This image shows the
maximum wind speeds of the hurricane, not at any particular point of time, but over the duration
of the entire storm:
The data come from NOAA's H*Wind project, which makes windspeed data from sensors on the
ground and floating in the ocean available to the public. Catastrophe scientist Dr. Rob Hodges
analyzed the data using the R language to determine the maximum windspeed in small hexagons
tiled across the Gulf region and then visualize the results. While the strongest winds blew
offshore, they nonetheless contributed to the storm surge that ultimately swamped New Orleans.
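As a rough sketch of the hexagon-summary idea, here is a small example with simulated wind observations (not the H*Wind data), using ggplot2's stat_summary_hex in place of the spplot-based approach Dr. Hodges actually used.

# Summarise the maximum wind speed observed inside hexagonal bins tiled over a region.
library(ggplot2)   # stat_summary_hex also requires the hexbin package to be installed

set.seed(9)
obs <- data.frame(lon  = runif(5000, -92, -86),
                  lat  = runif(5000, 27, 31),
                  wind = rweibull(5000, shape = 2, scale = 60))   # made-up wind speeds

ggplot(obs, aes(x = lon, y = lat, z = wind)) +
  stat_summary_hex(fun = max, bins = 40) +
  scale_fill_continuous(name = "Max wind") +
  coord_quickmap()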
You can find Dr. Hodges's R code behind this analysis at GitHub. The visualization itself was
created using the spplot function with the panel.polygonsplot panel function. Open data
also led to this visualization of Hurricane Sandy's path as the disaster unfolded.
Catastrophe Science: Maximum Observed Windspeeds Using H*Wind Analyses
Posted by David Smith at 14:45 in applications, graphics, R
August 06, 2013
How to choose a new business location with R
This guest post is by Rodolfo Vanzini. Rodolfo is senior partner at eXponential.it — an asset
management consultancy based in Italy — and advises clients on investment management issues.
He taught at the University of Siena and is an analytics professional.
With an education in economics and expertise in financial markets, four years ago I thought I
couldn't offer much more than emotional support when my wife decided to partner with an
internationally renowned franchise and open a school of English in the city where we live,
Bologna, Italy. Two years later she realized the premises she had rented weren't large enough to
accommodate the foreseeable demand for courses in the near future, and decided to plan a move.
The first questions that popped up were: how far do our customers have to travel to reach the
school of English, and how much does proximity play a role in the decision to move?
Having advertised city-wide, our first assumption was that customers were spread over the city
more or less uniformly, so that ideally the location within the city wasn't a significant issue.
Nevertheless, I have learned what behavioral economics has taught us over the last thirty years:
individuals make biased decisions, basing their assumptions and conclusions on a limited and
approximate set of rules that often lead to sub-optimal outcomes. To avoid making that mistake,
we definitely had to pin-point our customers on a map and assess how important proximity was
to retaining customers in the neighboring area. My analytical background came into play, at last.
With a bit of R, I decided to perform some basic analytics to check how near or far customers
had to travel.
With no previous experience I had to look up something that could help me with the obvious
mapping issues I was to come across. Pretty soon I found what I was looking for in the ggmap
package. I first created a character variable with our school address and then imported addresses
from our local database into a data frame (note that I had previously geocoded the addresses
using latlon <- geocode(as.character(addr$Address), output='latlon')).
bologna <- "Via Dagnini, 42, Bologna, Emilia Romagna, Italy"
cust <- read.table(file = "addr_cust.csv", header = TRUE, dec = ",", sep = ";")
cust.2012 <- subset(cust, Year == 2012)
head(cust)
##   ID           Name                                                      Full.address latitude longitude Year
## 1  1 Name Last Name Via Brizzi, 10, 40068 San Lazzaro di Savena, Emilia Romagna Italy    44.46     11.38 2012
## 2  2 Name Last Name                Via Ruggi, 14, 40137 Bologna, Emilia Romagna Italy    44.48     11.37 2012
## 3  3 Name Last Name    Via Stradelli Guelfi, 78/3, 40138 Bologna, Emilia Romagna Italy    44.48     11.42 2012
## 4  4 Name Last Name          Via Degli Scalini 9, 40136 Bologna, Emilia Romagna Italy    44.48     11.35 2012
## 5  5 Name Last Name               Via Cavazza, 8, 40137 Bologna, Emilia Romagna Italy    44.48     11.37 2012
## 6  6 Name Last Name        Via Delle Fragole, 26, 40137 Bologna, Emilia Romagna Italy    44.47     11.37 2012
The nicest part of ggmap, I found out, is that you can choose among different kinds of map
sources providing different types of geo/graphic details.
require(ggmap)
## Loading required package: ggmap
## Loading required package: ggplot2
qmap(bologna, zoom = 12, source = "google", maptype = "roadmap")
## Map from URL :
## http://maps.googleapis.com/maps/api/staticmap?center=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&zoom=12&size=%20640x640&scale=%202&maptype=roadmap&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL :
## http://maps.googleapis.com/maps/api/geocode/json?address=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
After a few attempts fiddling with different map sources (google, osm and stamen), I decided
that the last of these offered the graphics I needed. Then I plotted the latitude and longitude
coordinates to show our customer addresses on the map.
bologna.map <- get_map(bologna, zoom = 12, source = "stamen", maptype = "toner")
## Map from URL :
## http://maps.googleapis.com/maps/api/staticmap?center=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&zoom=12&size=%20640x640&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL :
## http://maps.googleapis.com/maps/api/geocode/json?address=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
Bologna.Map.2012 <- ggmap(bologna.map, base_layer = ggplot(aes(x = longitude, y = latitude),
    data = cust.2012), extent = "device")
Bologna.Map.2012 + geom_point(size = I(3), alpha = 1/3)
Though the first map wasn't as informative as expected, we noticed with surprise how clustered
customers appeared in the south-eastern part of the city. What I was looking for was a way to
show the density of customers in the surrounding area, so I turned to an alternative graphic using
faceting to check whether a pattern had been developing over the years since start-up. I found
that, partly unexpectedly, customers had been clustering in that area since the beginning.
After checking the data on GDP per capita and population density made available locally by the
city hall and the chamber of commerce, the answer to our initial question was coming into shape.
There had been a natural selection of customers: customer density decreases as we move out
from the center of the cloud. Needless to say, we wanted to retain our customer relationships,
and it was becoming apparent that the new location had to be as near as possible to the middle of
that cloud.
Bologna.Map <- ggmap(bologna.map, base_layer = ggplot(aes(x = longitude, y = latitude),
    data = cust), extent = "device")
Bologna.Map + stat_density2d(aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..),
    size = 2, bins = 4, data = cust, geom = "polygon", show_guide = FALSE) +
    facet_wrap(~Year)
The following map, focused on 2012 to recap what we'd found out about our (then) current
location and our customers, demonstrates how central that south-eastern area has been, and still
is, to our local business.
Bologna.Map.2012 + stat_density2d(aes(x = longitude, y = latitude, fill = ..level.., alpha = ..level..),
    size = 2, bins = 4, data = cust.2012, geom = "polygon", show_guide = FALSE)
This little exercise demonstrated how relevant proximity is, and probably not only to our local
business, given the well-researched central place theory in urban economics. These findings
confirm, to a certain extent, some general theories of small business management, and they
enabled us to avoid a mistake driven by behavioral bias. I used to think of analytics and R as
tools of applied economics for hunting down arbitrage opportunities left unexpressed by Wall
Street firms; I now conclude they are just as useful for finding established patterns of
relationships on Main Street.
Posted by David Smith at 11:19 in applications, graphics, R
August 05, 2013
Explore smartphone market share with Nanocubes
Back in May, Twitter's Miguel Rios created some beautiful data visualizations to show that with
enough geotagged tweets (billions of them), you can reveal the geography of where people live,
work and commute. Now, a new interactive visualization of 210 million geotagged tweets by
AT&T Research Labs reveals the market share of iPhone, Android and Windows smartphones
down to the smallest geographic levels.
Simon Urbanek (known to R users not least as a member of the R Core Group) explained in
his Web-Based Interactive Graphics talk at JSM 2013 today that the visualization uses 32TB of
Twitter data, yet runs smoothly and interactively on a single machine with 16GB of RAM. When
you first start the application, it shows a view of the population density of the USA, as you might
expect from millions of geotagged tweets. But this application also lets you explore the mobile
device used to send each tweet. Across the USA (and excluding devices with a device type of
"none" or "iPad", which was hardly used), the proportion of tweets sent using each device was:
- 60.7% of 150.6M tweets were sent using an iPhone,
- 36.4% with an Android phone,
- and 2.8% with a Windows phone.
More interestingly, you can use the app to zoom down into specific geographic areas and see the
device market share just for that region. In the NYC area, the iPhone dominates, with 74.1% of
tweets from known smartphones:
However, in the Atlanta region, Android phones fare much better than in the USA as a whole,
being the source of 45.9% of smartphone tweets:
Windows phones appear to be best represented in the Seattle-Tacoma region, home to Microsoft
HQ:
Despite the massive number of data points and the beauty and complexity of the real-time data
visualization, it runs impressively quickly. The underlying data structure is based on Nanocubes,
a fast data structure for in-memory data cubes. The basic idea is that nanocubes aggregate data
hierarchically, so that as you zoom in and out of the interactive application, one pixel on the
screen is mapped to just one data point, aggregated from the many that sit "behind" that pixel.
Learn more about nanocubes, and try out the application yourself (modern browser required) at
the link below.
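The flavor of that pre-aggregation can be illustrated with a toy in R. This is only a sketch of the idea, not the Nanocubes implementation, which is a far more sophisticated in-memory structure.

# Pre-aggregate point counts at several grid resolutions, so a coarse
# "zoomed-out" view can be drawn from a small table instead of the raw points.
set.seed(3)
pts <- data.frame(lon = runif(1e5, -125, -67), lat = runif(1e5, 25, 49))  # made-up points

aggregate_level <- function(pts, cell) {
  key <- paste(floor(pts$lon / cell), floor(pts$lat / cell))
  as.data.frame(table(key), responseName = "count")
}

agg <- lapply(c(8, 4, 2, 1), function(cell) aggregate_level(pts, cell))
sapply(agg, nrow)   # far fewer rows at coarse resolutions than the 100,000 raw points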
Nanocubes: Fast Visualization of Large Spatiotemporal Datasets
Posted by David Smith at 14:48 in data science, graphics, R
July 22, 2013
Bike sharing in 100 cities
Many cities around the world have bike sharing programs: pick up a bike at a docking station,
ride it across town and drop it off at another station, and just pay for the time you use. (Even
Albacete, the Spanish college town hosting last month's UseR conference, had one.) Most of
these systems provide open data feeds of bike availability, and the data is available for all cities
via the CityBikes API. Ramnath Vaidyanathan used this API to create an on-line application
showing the real-time status of bike availability in over 100 cities. Here's London, for example:
Red dots indicate bikeshare stations with zero available bikes; green dots show stations with the
most availability. (I took the snapshot above around midnight London time: it looks like all the
city-center bikes have been taken to the outlying stations by commuters, and the overnight
redistribution hasn't yet taken place.)
Ramnath used an R language script to download the data from the JSON API (using the httr
package), and created the maps using the rCharts package (which provides an interface to
Leaflet maps). The interactive application, with city selection, pan and zoom, and pop-up detail
for each station, was created using RStudio's Shiny package for R. It's easy to imagine other
applications for this data, for example using statistical forecasting to optimize the overnight
redistribution, or to determine the most popular bike routes in a city.
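The data-access step is easy to reproduce. Here is a rough sketch against the current CityBikes v2 API; the endpoint, network id and field names below are assumptions on my part and may differ from what Ramnath used in 2013.

# Pull station data for one bike-share network and plot availability.
library(httr)
library(jsonlite)

resp <- GET("http://api.citybik.es/v2/networks/santander-cycles")   # assumed London network id
dat  <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
stations <- dat$network$stations                                    # assumed field names

plot(stations$longitude, stations$latitude,
     pch = 19, cex = 0.7,
     col = ifelse(stations$free_bikes == 0, "red", "darkgreen"),
     xlab = "Longitude", ylab = "Latitude",
     main = "Bike availability (red = no bikes available)")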
Click the map above to play around with the application yourself, or follow the link below for
more details from Ramnath on how the application was created.
Ramnath Vaidyanathan: Visualizing Bike Sharing Networks
Posted by David Smith at 16:36 in applications, graphics, R
July 04, 2013
R User Groups Worldwide
by Joseph Rickert
UseR! 2013 is less than a week away, and one task we have been doing in preparation here at
Revolution Analytics is to scrub the list of worldwide R user groups that we maintain in our
Local R User Group Directory. These are all of the R user groups we know about that appear to
be active (that is, they have had some activity within the past year). Please help us verify the
accuracy of the list. If you don't see a spot on the map where there should be one, can't find a
group you know about in this file (Download R_User_Groups_World_7_2013), or find
something else wrong, please let us know.
All of us at Revolution Analytics are proud to be a sponsor of UseR! 2013. If you are going to
Albacete, please come by our booth and say hello. Have a safe trip and see you in Spain!
Note: the map above was drawn with code adapted from a solution that Sandy Muspratt provided
to a question on stackoverflow last year. Sandy's code will allow you to recenter the map on your
favorite R user group.
Posted by Joseph Rickert at 09:05 in graphics, user groups
June 25, 2013
A comprehensive guide to time series plotting in R
As R has evolved over the past 20 years, its capabilities have improved in every area. The visual
display of time series is no exception, as the folks from Timely Portfolio note:
Through both quiet iteration and significant revolutions, the volunteers of R have made
analyzing and charting time series pleasant.
R began with the basics, a simple line plot:
and then evolved to true time series charts (with a calendar-aware time axis), high-frequency
charts, interactive charts, and even charts in the style of The Economist. In fact, more than a
dozen options for plotting time series have been introduced over the years as shown in the
timeline below:
Check out all the examples at the link below — you're sure to find a style of time series chart
that meets your needs.
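As a tiny illustration of the distance between the starting point and a calendar-aware plot, here is a minimal sketch (not taken from the Timely Portfolio post) showing the same made-up monthly series drawn with base graphics and with xts:

library(xts)

set.seed(1)
dates  <- seq(as.Date("2010-01-01"), by = "month", length.out = 48)
values <- cumsum(rnorm(48))                    # made-up monthly series

plot(values, type = "l")                       # the basic line plot: no date axis
plot(xts(values, order.by = dates),            # a time-series-aware plot with a calendar axis
     main = "Monthly series")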
Timely Portfolio (github): R Financial Time Series Plotting
Posted by David Smith at 17:13 in finance, graphics, R