Revolutions
Learn more about using open source R for big data analysis, predictive modeling, data science and more from the staff of Revolution Analytics.

September 20, 2013

R and The Journal of Computational and Graphical Statistics

by Joseph Rickert

I don't think that most people find reading the articles in the statistical journals to be easy going. In my experience, the going is particularly rough when trying to learn something completely new, and I don't expect it could be any other way. There is no getting around the hard work. However, at least in the field of computational statistics, things seem to be getting a little easier. These days it is very likely that you will find some code included in the supplementary material for a journal article, at least in the Journal of Computational and Graphical Statistics (JCGS). JCGS, which was started in 1992 with the mission of presenting "the very latest techniques on improving and extending the use of computational and graphical methods in statistics and data analysis", still seems to be the place to publish. (Stanford's Rob Tibshirani published an article in Issue 1 of Volume 1 back in 1992, with Michael LeBlanc, and another in the most recent issue, with Noah Simon, Jerome Friedman and Trevor Hastie.)

Driven by the imperative to produce reproducible research, most authors in this journal include some computer code to facilitate independent verification of their results. Of the 80 non-editorial articles published in the last six issues of JCGS, all but 9 included computer code as part of the supplementary materials. The following table lists the counts by type of software included. (Note that a few articles included code in multiple languages, R and C++ for example.)

                  June13  March13  Dec12  Sept12  June12  March12  total_by_code
  R                    9        9      5       5       7        7             42
  Matlab               6        0      1       3       4        4             18
  c                    0        0      1       2       1        0              4
  cpp                  0        0      0       1       1        2              4
  other                0        1      0       3       0        2              6
  none                 0        6      2       0       1        0              9
  total_by_month      15       16      9      14      14       15             83

R code accounted for 57% of the 74 instances of software included in the supplementary materials.

I think an important side effect of the inclusion of code is that studying the article is much easier for everyone. Seeing the R code is like walking into a room full of people and spotting a familiar face: you know where to start. And it at least seems feasible to "reverse engineer" the article: look at the input data, run the code, see what it produces and map it to the math. The following code comes from the supplementary material included in the survey article "Computational Statistical Methods for Social Networks Models" by Hunter, Krivitsky and Schweinberger in the December 2012 issue of JCGS.

# Some of the code from the Appendix of the article
# "Computational Statistical Methods for Social Networks Models"
# by Hunter, Krivitsky and Schweinberger, JCGS, December 2012.

# Two-dimensional Euclidean latent space model with three clusters and
# random receiver effects
library(latentnet)
data(sampson)
monks.d2G3r <- ergmm(samplike ~ euclidean(d=2, G=3) + rreceiver)
Z <- plot(monks.d2G3r, rand.eff="receiver", pie=TRUE, vertex.cex=2)
text(Z, label=1:nrow(Z))

# Three-dimensional Euclidean latent space model with three clusters and
# random receiver effects
library(latentnet)
data(sampson)
monks.d3G3r <- ergmm(samplike ~ euclidean(d=3, G=3) + rreceiver)
plot(monks.d3G3r, rand.eff="receiver", use.rgl=TRUE, labels=TRUE)

The first four lines produce the graph below.
The sampson data set contains social network data that Samuel F. Sampson collected in the late '60s while he was a resident experimenter at a monastery in New England. The call to ergmm() fits a "latent space model" by embedding the data in a two-dimensional Euclidean space, clustering it into three groups and including a random "receiver" effect. The last four lines of code produce a way cool, interactive three-dimensional plot that you can rotate.

Posted by Joseph Rickert at 08:27 in academia, graphics, packages, R

September 19, 2013

Real Pirate Attacks, Charted with R

To mark Talk Like a Pirate Day, Bob Rudis uses R to animate a map of cumulative real-world pirate attacks since 1978. Looks like the Caribbean and the West Indies, traditional pirate haunts, are still active. But the real hot spot in modern times is Africa. Find the R code behind the animation at the blog post linked below.

rud.is: Animated IRL Pirate Attacks In R

Posted by David Smith at 13:25 in graphics, R

September 09, 2013

An animated peek into the workings of Bayesian Statistics

One of the practical challenges of Bayesian statistics is being able to deal with all of the complex probability distributions involved. You begin with the likelihood function of interest, but once you combine it with the prior distributions of all the parameters, you end up with a complex posterior distribution that you need to characterize. Since you usually can't calculate that distribution analytically, the best you can do is simulate from it, generally using Markov chain Monte Carlo techniques. Packages like RStan handle the simulation for you, but it's fairly easy to code the Metropolis-Hastings algorithm yourself, at least for simple problems (a bare-bones sketch of such a sampler appears below).

PhD student Maxwell Joseph did just that, using the R language. Beginning with a data set of 50 points, he set out to estimate the joint posterior distribution of the mean and variance, given simple priors (Normal for the mean; Uniform for the variance). He ran three chains of the Metropolis-Hastings algorithm simultaneously and created the animation below. You can see each of the chains (purple, red and blue) as they progress through the joint distribution of the mean (horizontal axis) and variance (vertical axis), and see how the posterior distribution evolves over time in the 3-D image to the right. I love the amoeba-like effect as the posterior converges to something close to a 2-D Gaussian distribution, and, as you'd expect, the mode of that posterior gives excellent estimates of the true mean and variance.

Maxwell shares all of the R code for setting up the likelihood and priors, running the Metropolis-Hastings chains, and animating the results at his blog, Ecology in silico. Note the use of R's system command to call ImageMagick's convert utility to stitch the individual PNG frames into the animated GIF you see above. (An alternative is to use Yihui Xie's animation package, but the direct method works just as well.)

Ecology in silico: Animating the Metropolis Algorithm (via Allison Barner)

Posted by David Smith at 16:47 in graphics, R, statistics

August 20, 2013

The financial meltdown, to a trance beat

With the FMS Symphony by csv soundsystem you can listen to the Global Financial Crisis as you watch interest rates plunge while the Treasury floods the market with emergency funds.
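To make the Metropolis-Hastings example in the post above a little more concrete, here is a bare-bones sketch of such a sampler for the mean-and-variance problem. It is not Maxwell Joseph's code: the simulated data, the prior parameters, the proposal scale and the starting values are all illustrative assumptions.

# Bare-bones Metropolis-Hastings sampler for a Normal mean and variance.
# Illustrative assumptions: Normal(0, 10^2) prior on the mean,
# Uniform(0, 100) prior on the variance, symmetric random-walk proposals.
set.seed(123)
y <- rnorm(50, mean = 5, sd = 2)                 # 50 observed data points

log_post <- function(mu, sigma2) {
  if (sigma2 <= 0 || sigma2 > 100) return(-Inf)  # outside the Uniform prior
  sum(dnorm(y, mu, sqrt(sigma2), log = TRUE)) +  # log-likelihood
    dnorm(mu, 0, 10, log = TRUE)                 # log-prior on the mean
}

n_iter <- 5000
chain <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("mu", "sigma2")))
chain[1, ] <- c(0, 1)                            # starting values

for (i in 2:n_iter) {
  prop <- chain[i - 1, ] + rnorm(2, 0, 0.5)      # propose a joint move
  log_r <- log_post(prop[1], prop[2]) -
    log_post(chain[i - 1, 1], chain[i - 1, 2])   # log acceptance ratio
  chain[i, ] <- if (log(runif(1)) < log_r) prop else chain[i - 1, ]
}

colMeans(chain[-(1:1000), ])                     # posterior means after burn-in

Running a few chains like this from different starting values and plotting mu against sigma2 as the iterations accumulate reproduces, in miniature, the behavior the animation shows.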
The source data for the chart and music comes from daily emails (like this one) sent by the US Treasury summarizing the cash spending and borrowing of the Federal government. (Incredibly, this data is not available in any structured format ready for analysis.) The CSV Soundsystem team then used R packages for data processing (plyr, reshape2), sonification (tuneR and ddr, which makes musical sounds including "cheesy synth") and visualization (aplpack) to produce the chart, animation and music. In the team's words:

"We used principal component analysis to rotate the 52 line-items and plotted the 15 highest-loaded components as Chernoff faces. We plotted interest rate and federal account balance with line width as the standard deviation of all the line-items. We represented similar data in audio. Chords were selected based on the derivative of the account balance, and a melody was composed based on the federal interest rate. We also included a contrapuntal riff driven by the distance between accumulated federal debt and the legal debt ceiling."

If you enjoyed the house-trance stylings of federal treasury statistics, you can even download the music track for your playlist. If you want to embark on a similar project, you can find more behind-the-scenes details and the code at GitHub.

csv soundsystem: FMS Treasury

Posted by David Smith at 17:27 in finance, graphics, R

August 13, 2013

Visualizing Katrina's strongest winds, with R

Here's a new picture of the devastation wrought by Katrina in 2005. This image shows the maximum wind speeds of the hurricane, not at any particular point in time, but over the duration of the entire storm.

The data come from NOAA's H*Wind project, which makes wind-speed data from sensors on the ground and floating in the ocean available to the public. Catastrophe scientist Dr. Rob Hodges analyzed the data using the R language to determine the maximum wind speed in small hexagons tiled across the Gulf region and then visualized the results. While the strongest winds blew offshore, they nonetheless contributed to the storm surge that ultimately swamped New Orleans. You can find Dr. Hodges's R code behind this analysis at GitHub. The visualization itself was created using the spplot function with the panel.polygonsplot panel function. Open data also led to this visualization of Hurricane Sandy's path as the disaster unfolded.

Catastrophe Science: Maximum Observed Windspeeds Using H*Wind Analyses

Posted by David Smith at 14:45 in applications, graphics, R

August 06, 2013

How to choose a new business location with R

This guest post is by Rodolfo Vanzini. Rodolfo is senior partner at eXponential.it, an asset management consultancy based in Italy, and advises clients on investment management issues. He taught at the University of Siena and is an analytics professional.

With an education in economics and expertise in financial markets, four years ago I thought I couldn't be of any help, other than emotional, in supporting my wife in her decision to partner with an internationally renowned franchise and open a school of English in the city where we live: Bologna, Italy. Two years later she realized the premises she had rented weren't large enough to accommodate the foreseeable demand for courses in the near future, and she decided to plan a move. The first questions that popped up were: how far do our customers have to travel to reach the school of English, and how much of a role does proximity play in the decision to move?
Having advertised city-wide, our first assumption was that customers were spread over the city more or less uniformly, so that, ideally, location within the city wasn't a significant issue. Nevertheless, I have learned what behavioral economics has taught us over the last thirty years: individuals make biased decisions, basing their assumptions and conclusions on a limited and approximate set of rules that often lead to sub-optimal outcomes. Thus, to avoid making a mistake, we definitely had to pinpoint our customers on a map and assess how important it was to stay close to the customers in the neighboring area. My analytical background came into play, at last.

With a bit of R I decided to perform some basic analytics to check how near or far customers had to travel. With no previous experience, I had to look for something that could help me with the obvious mapping issues I was about to come across. Pretty soon I found what I was looking for in the ggmap package. I first created a character variable with our school address and then imported addresses from our local database into a data frame (note that I had previously geocoded the addresses using latlon <- geocode(as.character(addr$Address), output='latlon')).

bologna <- "Via Dagnini, 42, Bologna, Emilia Romagna, Italy"
cust <- read.table(file = "addr_cust.csv", header = TRUE, dec = ",", sep = ";")
cust.2012 <- subset(cust, Year == 2012)
head(cust)

##   ID Name Last.Name                                                       Full.address latitude longitude Year
## 1  1 Name Last Name  Via Brizzi, 10, 40068 San Lazzaro di Savena, Emilia Romagna Italy     44.46     11.38 2012
## 2  2 Name Last Name  Via Ruggi, 14, 40137 Bologna, Emilia Romagna Italy                    44.48     11.37 2012
## 3  3 Name Last Name  Via Stradelli Guelfi, 78/3, 40138 Bologna, Emilia Romagna Italy       44.48     11.42 2012
## 4  4 Name Last Name  Via Degli Scalini 9, 40136 Bologna, Emilia Romagna Italy              44.48     11.35 2012
## 5  5 Name Last Name  Via Cavazza, 8, 40137 Bologna, Emilia Romagna Italy                   44.48     11.37 2012
## 6  6 Name Last Name  Via Delle Fragole, 26, 40137 Bologna, Emilia Romagna Italy            44.47     11.37 2012

The nicest part of ggmap, I found out, is that you can choose among different map sources providing different types of geo/graphic detail.

require(ggmap)
## Loading required package: ggmap
## Loading required package: ggplot2

qmap(bologna, zoom = 12, source = "google", maptype = "roadmap")
## Map from URL:
## http://maps.googleapis.com/maps/api/staticmap?center=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&zoom=12&size=%20640x640&scale=%202&maptype=roadmap&sensor=false
## Google Maps API Terms of Service: http://developers.google.com/maps/terms
## Information from URL:
## http://maps.googleapis.com/maps/api/geocode/json?address=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&sensor=false
## Google Maps API Terms of Service: http://developers.google.com/maps/terms

After a few attempts fiddling with different map sources (google, osm and stamen) I decided the last of these offered the graphics I needed. Then I plotted the latitude and longitude coordinates to show our customer addresses on the map.
bologna.map <- get_map(bologna, zoom = 12, source = "stamen", maptype = "toner")
## Map from URL:
## http://maps.googleapis.com/maps/api/staticmap?center=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&zoom=12&size=%20640x640&maptype=terrain&sensor=false
## Google Maps API Terms of Service: http://developers.google.com/maps/terms
## Information from URL:
## http://maps.googleapis.com/maps/api/geocode/json?address=Via+Dagnini,+42,+Bologna,+Emilia+Romagna,+Italy&sensor=false
## Google Maps API Terms of Service: http://developers.google.com/maps/terms

Bologna.Map.2012 <- ggmap(bologna.map,
    base_layer = ggplot(aes(x = longitude, y = latitude), data = cust.2012),
    extent = "device")
Bologna.Map.2012 + geom_point(size = I(3), alpha = 1/3)

Though the first map wasn't as informative as expected, we noticed with surprise how clustered customers appeared in the south-eastern part of the city. What I was looking for was a way to show the density of customers in the surrounding area, so I tried an alternative graphic, using faceting to check whether a pattern had been developing over the years since start-up, and found that, partly unexpectedly, customers had been clustering in that area since the beginning. After checking the data on GDP per capita and population density made available locally by the city hall and the chamber of commerce, the answer to our initial question was taking shape. There had been a natural selection of customers: customer density decreases as we move out from the center of the cloud. Needless to say, we wanted to retain our customer relationships, and it was becoming apparent that the new location had to be as near as possible to the middle of that cloud.

Bologna.Map <- ggmap(bologna.map,
    base_layer = ggplot(aes(x = longitude, y = latitude), data = cust),
    extent = "device")
Bologna.Map + stat_density2d(aes(x = longitude, y = latitude,
        fill = ..level.., alpha = ..level..),
    size = 2, bins = 4, data = cust, geom = "polygon", show_guide = FALSE) +
    facet_wrap(~Year)

The following map, focusing on 2012 to recap what we'd found out about our (then) current location and our customers, shows how central that south-eastern area has been, and still is, to our local business.

Bologna.Map.2012 + stat_density2d(aes(x = longitude, y = latitude,
        fill = ..level.., alpha = ..level..),
    size = 2, bins = 4, data = cust.2012, geom = "polygon", show_guide = FALSE)

This little exercise has demonstrated how relevant proximity is, probably not only to our local business, considering for instance the well researched central place theory in urban economics. These findings confirm, to a certain extent, some general theories of small business management and have helped us avoid a mistake driven by behavioral bias. I have always considered analytics and R to be tools of applied economics used by Wall Street firms in search of unexpressed arbitrage opportunities; I now conclude they are at least as useful for finding established patterns of relationships on Main Street.

Posted by David Smith at 11:19 in applications, graphics, R

August 05, 2013

Explore smartphone market share with Nanocubes

Back in May, Twitter's Miguel Rios created some beautiful data visualizations showing that with enough geotagged tweets (billions of them), you can reveal the geography of where people live, work and commute.
Now, a new interactive visualization of 210 million geotagged tweets by AT&T Research Labs reveals the market share of iPhone, Android and Windows smartphones down to the smallest geographic levels. Simon Urbanek (known to R users not least as a member of the R Core Group) explained in his Web-Based Interactive Graphics talk at JSM 2013 today that the visualization uses 32TB of Twitter data, yet runs smoothly and interactively on a single machine with 16GB of RAM.

When you first start the application, it shows a view of the population density of the USA, as you might expect from millions of geotagged tweets. But this application also lets you explore the mobile device used to send each tweet. Across the USA (and excluding devices with a device type of "none" or "iPad", which was hardly used), the proportions of the 150.6M tweets sent from each device were: 60.7% from an iPhone, 36.4% from an Android phone, and 2.8% from a Windows phone. More interestingly, you can use the app to zoom in on specific geographic areas and see the device market share just for that region. In the NYC area, the iPhone dominates, with 74.1% of tweets from known smartphones. In the Atlanta region, however, Android phones fare much better than in the USA as a whole, accounting for 45.9% of smartphone tweets. Windows phones appear to be best represented in the Seattle-Tacoma region, home to Microsoft HQ.

Despite the massive number of data points and the beauty and complexity of the real-time data visualization, it runs impressively quickly. The underlying data structure is based on Nanocubes, a fast data structure for in-memory data cubes. The basic idea is that nanocubes aggregate data hierarchically, so that as you zoom in and out of the interactive application, one pixel on the screen is mapped to just one data point, aggregated from the many that sit "behind" that pixel. Learn more about nanocubes, and try out the application yourself (modern browser required), at the link below.

Nanocubes: Fast Visualization of Large Spatiotemporal Datasets

Posted by David Smith at 14:48 in data science, graphics, R

July 22, 2013

Bike sharing in 100 cities

Many cities around the world have bike sharing programs: pick up a bike at a docking station, ride it across town, drop it off at another station, and just pay for the time you use. (Even Albacete, the Spanish college town hosting last month's UseR! conference, had one.) Most of these systems provide open data feeds of bike availability, and the data for all cities is available via the CityBikes API.

Ramnath Vaidyanathan used this API to create an online application showing the real-time status of bike availability in over 100 cities. Here's London, for example: red dots indicate bike-share stations with no available bikes; green dots show stations with the most availability. (I took the snapshot above around midnight London time: it looks like all the city-center bikes have been taken to the outlying stations by commuters, and the overnight redistribution hasn't yet taken place.)

Ramnath used an R script to download the data from the JSON API (using the httr package) and created the maps using the rCharts package (which provides an interface to Leaflet maps). The interactive application, with city selection, pan and zoom, and pop-up detail for each station, was created using RStudio's Shiny package for R.
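As an illustration of that first step, here is a minimal sketch of pulling station data from the CityBikes API with httr. It is not Ramnath's script, and the endpoint, network id and field names used here are assumptions; check http://api.citybik.es for the current API that the real application queries.

library(httr)

# Fetch the station list for one bike-share network (assumed v2 endpoint and id)
resp <- GET("http://api.citybik.es/v2/networks/santander-cycles")
stop_for_status(resp)
net <- content(resp, as = "parsed")   # parsed JSON as nested R lists

# Flatten the station list into a data frame of coordinates and availability
stations <- do.call(rbind, lapply(net$network$stations, function(s)
  data.frame(name        = s$name,
             lat         = s$latitude,
             lon         = s$longitude,
             free_bikes  = s$free_bikes,
             empty_slots = s$empty_slots,
             stringsAsFactors = FALSE)))

head(stations)

From a table like this, each station's free_bikes count can be mapped to a dot color on a Leaflet map via rCharts, and the whole thing can be wrapped in a Shiny app that re-queries the API for whichever city the user selects.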
It's easy to imagine other applications for this data: for example, using statistical forecasting to optimize the overnight redistribution, or determining the most popular bike routes in a city. Click the map above to play around with the application yourself, or follow the link below for more details from Ramnath on how the application was created.

Ramnath Vaidyanathan: Visualizing Bike Sharing Networks

Posted by David Smith at 16:36 in applications, graphics, R

July 04, 2013

R User Groups Worldwide

by Joseph Rickert

UseR! 2013 is less than a week away, and one task that we have been doing in preparation here at Revolution Analytics is to scrub the list of worldwide R user groups that we maintain in our Local R User Group Directory. These are all of the active R user groups that we know about, that is, those that appear to have had some activity within the past year. Please help us verify the accuracy of the list. If you don't see a spot on the map where there should be one, can't find a group you know about in this file (Download R_User_Groups_World_7_2013), or find something else wrong, please let us know.

All of us at Revolution Analytics are proud to be a sponsor of UseR! 2013. If you are going to Albacete, please come by our booth and say hello. Have a safe trip and see you in Spain!

Note: the map above was drawn with code adapted from a solution that Sandy Muspratt provided to a question on Stack Overflow last year. Sandy's code will allow you to re-center the map on your favorite R user group.

Posted by Joseph Rickert at 09:05 in graphics, user groups

June 25, 2013

A comprehensive guide to time series plotting in R

As R has evolved over the past 20 years, its capabilities have improved in every area. The visual display of time series is no exception. As the folks at Timely Portfolio note: "Through both quiet iteration and significant revolutions, the volunteers of R have made analyzing and charting time series pleasant." R began with the basics, a simple line plot, and then evolved to true time series charts (with a calendar-aware time axis), high-frequency charts, interactive charts, and even charts in the style of The Economist. In fact, more than a dozen options for plotting time series have been introduced over the years, as shown in the timeline below.

Check out all the examples at the link below; you're sure to find a style of time series chart that meets your needs.

Timely Portfolio (github): R Financial Time Series Plotting

Posted by David Smith at 17:13 in finance, graphics, R
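To close, here is a small sketch, not taken from the Timely Portfolio post, contrasting the starting point of that progression, a plain line plot, with a calendar-aware time series chart built with the widely used xts package. The data are simulated for illustration.

library(xts)

# Simulated daily "prices" for illustration only
set.seed(1)
dates  <- seq(as.Date("2013-01-01"), by = "day", length.out = 250)
prices <- 100 + cumsum(rnorm(250))

# 1. The basics: a simple line plot with no notion of calendar time
plot(prices, type = "l", main = "Simple line plot")

# 2. A true time series chart: xts supplies a calendar-aware time axis
x <- xts(prices, order.by = dates)
plot(x, main = "xts plot with a calendar-aware time axis")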