
Big Data, R, Statistics and Hadoop
Data Science & Business Analytics
October 22, 2013
Joseph B. Rickert
Data Scientist, Revolution Analytics
The 3 Realms
From the Statistician's Point of View
The Hype
2008: Chris Anderson, "The End of Theory"
http://www.edge.org/3rd_culture/anderson08/anderson08_index.html
2013: "Big Data is one of THE biggest buzzwords around at the moment, and I believe big data will change the world."
Bernard Marr, 6/6/13: http://bit.ly/16X59iL
The collision of two cultures
This Talk
Putting the hype aside:
• What tools exist in R to meet the challenges of large data sets?
• What are the practical aspects of doing statistics on large data sets?
• Where would some theory help?
The Sweet Spot for "doing" Statistics
Data in memory (up to ~10^6 rows), as we have come to love it:
• Any algorithm you can imagine
• "In the flow" work environment
• A sense of always moving forward
• Quick visualizations
• You can get far without much real programming
The 3 Realms
By number of rows:
• Up to ~10^6 rows: data in memory. Feels like statistics.
• Up to ~10^11 rows: data in a file. The realm of "chunking".
• More than 10^12 rows: data in multiple files. The realm of massive data. Feels like machine learning.
The realm of "chunking" (data in a file, up to ~10^11 rows)
What's new here?
• External memory algorithms
• Distributed computing
• Change your way of working
The realm of "chunking"
External Memory Algorithms
Operate on the data chunk by chunk:

    Declare and initialize the variables needed
    for (i in 1 to number_of_chunks) {
        Perform the calculations for that chunk
        Update the variables being computed
    }
    When all chunks have been processed, do the final calculations

You only see a small part of the data at one time; some things, e.g. factors, are trouble.
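Before the RevoScaleR version on the next slide, here is a minimal base-R sketch of the same loop, assuming a comma-delimited file with a DayOfWeek column; the file name, chunk size, and naive header parsing are illustrative assumptions. Note that the category-alignment step is exactly the "factors are trouble" issue mentioned above.

# A minimal base-R sketch of the chunking pattern; fileName, chunkSize
# and the simple header parsing are illustrative assumptions
chunkTableBase <- function(fileName, var = "DayOfWeek", chunkSize = 100000) {
    con <- file(fileName, open = "r")
    on.exit(close(con))
    # Declare and initialize the variables needed
    hdr <- strsplit(readLines(con, n = 1), ",")[[1]]
    tableSum <- integer(0)
    repeat {
        # Perform the calculations for this chunk
        chunk <- tryCatch(
            read.csv(con, header = FALSE, col.names = hdr, nrows = chunkSize),
            error = function(e) NULL)   # no lines left to read
        if (is.null(chunk) || nrow(chunk) == 0) break
        chunkTable <- table(chunk[[var]])
        # Update the running total; the categories seen so far may differ
        # from chunk to chunk, which is exactly why factors are trouble
        lev <- union(names(tableSum), names(chunkTable))
        old <- ifelse(is.na(tableSum[lev]), 0, tableSum[lev])
        new <- ifelse(is.na(chunkTable[lev]), 0, chunkTable[lev])
        tableSum <- setNames(old + new, lev)
    }
    # When all chunks have been processed, return the final result
    tableSum
}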
The realm of "chunking"

# Each record of the data file contains information for individual
# commercial airline flights. One of the variables collected is the
# DayOfWeek of the flight. This function tabulates DayOfWeek.
chunkTable <- function(fileName, varsToKeep = NULL, blocksPerRead = 1)
{
    ProcessChunkAndUpdate <- function(dataList) {
        # Process the chunk
        chunkTable <- table(as.data.frame(dataList))
        # Update the running total kept in the transform environment
        tableSum <- chunkTable + .rxGet("tableSum")
        .rxSet("tableSum", tableSum)
        cat("Chunk number: ", .rxChunkNum, " tableSum = ", tableSum, "\n")
        return(NULL)
    }
    updatedObjects <- rxDataStep(inData = fileName,
                                 varsToKeep = varsToKeep,
                                 blocksPerRead = blocksPerRead,
                                 transformObjects = list(tableSum = 0),
                                 transformFunc = ProcessChunkAndUpdate,
                                 returnTransformObjects = TRUE,
                                 reportProgress = 0)
    return(updatedObjects$tableSum)
}
chunkTable(fileName = fileName, varsToKeep = "DayOfWeek")
> chunkTable(fileName=fileName, varsToKeep="DayOfWeek")
Chunk number: 1 tableSum = 33137 27267 27942 28141 28184 25646 29683
Chunk number: 2 tableSum = 65544 52874 53857 54247 54395 55596 63487
Chunk number: 3 tableSum = 97975 77725 78875 81304 82987 86159 94975
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
97975 77725 78875 81304 82987 86159 94975
The realm of "chunking"
Distributed Computing
• Must deal with cluster management
• Data storage and allocation strategies important

[Diagram: a master node coordinating several compute nodes, each holding its own portion of the data]
The realm of "chunking"
Change your way of working
• Might have to change your usual way of working (e.g. not feasible to "look at" residuals to validate a regression model)
• Don't compute things you are not going to use (e.g. residuals)
• Plotting what you want to see may be difficult
• Limited number of functions available
• Some real programming likely
R Tools for the realm of "chunking"
• External Memory Algorithms
– bigmemory: massive matrices in memory-mapped files
– ff and ffbase: file-based access to data sets
– SciDB-R: access massive SciDB matrices from R
– RevoScaleR
  • parallel external memory algorithms, e.g. rxDTree
  • distributed computing infrastructure
• Visualization
– bigvis: aggregation and smoothing applied to visualization
– tabplot
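As a taste of the file-based approach, here is a hedged sketch using ff and ffbase from the list above. The CSV name is an assumption, and the out-of-memory table() dispatch relies on ffbase's chunked methods; treat this as a sketch rather than a definitive recipe.

# A hedged sketch of file-backed access with ff/ffbase; the CSV name
# is an assumption for illustration
library(ff)
library(ffbase)

# read.csv.ffdf imports the file chunk by chunk into an on-disk ffdf,
# so only a small portion of the data is ever held in RAM
flights <- read.csv.ffdf(file = "AirlineData.csv")

# ffbase supplies chunked methods for many base R verbs, so this
# tabulation runs over the file without loading it all into memory
table(flights$DayOfWeek)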
rxDTree: trees for big data
• Based on an algorithm published by Ben-Haim and Yom-Tov in 2010
• Avoids sorting the raw data
• Builds trees using histogram summaries of the data
• Inherently parallel: each compute node sees 1/N of the data (all variables)
• Compute nodes build histograms for all variables
• Master node integrates the histograms and builds the tree

# Build a tree using rxDTree with a 2,021,019 row version of the
# segmentationData data set from the caret package
allvars <- names(segmentationData)
xvars <- allvars[-c(1, 2, 3)]
form <- as.formula(paste("Class", "~", paste(xvars, collapse = "+")))
#
cp <- 0.01      # Set the complexity parameter
xval <- 0       # Don't do any cross validation
maxdepth <- 5   # Set the maximum tree depth
#----------------------------------------------------------------
# Build a model with rxDTree. It looks like rpart() but has a
# parameter maxNumBins to control accuracy
dtree.model <- rxDTree(form,
                       data = "segmentationDataBig",
                       maxNumBins = NULL,
                       maxDepth = maxdepth,
                       cp = cp, xVal = xval,
                       blocksPerRead = 250)
The realm of massive data (data in multiple files, >10^12 rows)
What's new here?
• The cluster is given!!
• Restricted to the Map/Reduce paradigm
• Basic statistical tasks are difficult
• This is batch programming! The "flow" is gone.
• The data mining mindset
The realm of massive data
The cluster is given!!
• Parallel computing is necessary
• Distributed data-parallel computation favors ensemble methods (see the sketch below)
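A hedged in-memory sketch of the ensemble idea: each data partition (standing in for a compute node's share of the data) gets its own model, fit with no coordination, and predictions are combined by averaging. The use of iris and rpart here is purely illustrative.

# A hedged in-memory sketch of why distributed data favors ensembles;
# iris and rpart stand in for per-node data and per-node models
library(rpart)

set.seed(1)
# Split the data into three partitions, as if spread across three nodes
partitions <- split(iris, sample(1:3, nrow(iris), replace = TRUE))

# Each "node" fits a tree on its own partition, with no coordination
models <- lapply(partitions, function(d) rpart(Species ~ ., data = d))

# Combine by averaging the predicted class probabilities across trees
probs <- Reduce(`+`, lapply(models, predict, newdata = iris)) / length(models)
head(probs)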
The realm of massive data
The Map/Reduce Paradigm
• Very limited number of algorithms readily available
• Algorithms that need coordination among compute nodes are difficult or slow
• Serious programming is required
• Multiple languages likely
The realm of massive data
Basic statistical tasks are challenging
• Getting random samples of exact lengths is difficult
• Approximate sampling methods are common
• Independent parallel random number streams are required
A sketch of approximate sampling with parallel streams follows.
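A minimal sketch of the last two points: Bernoulli sampling (each record kept independently with probability p, so the sample size is only approximately p * N) combined with independent per-worker random number streams from R's parallel package. The chunk sizes and p are illustrative assumptions.

# A minimal sketch of approximate (Bernoulli) sampling with independent
# parallel random number streams; chunk sizes and p are illustrative
library(parallel)

cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 42)  # independent L'Ecuyer-CMRG streams

# Each worker flips one coin per record in its chunk; the resulting
# sample has length approximately p * N, not an exact length
kept <- parSapply(cl, rep(1e6, 4), function(n, p = 0.001) {
    sum(runif(n) < p)
})
stopCluster(cl)
kept   # roughly 1000 records kept per chunk of one million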
The realm of massive data
The Data Mining Mindset:

"Accumulated experience over the last decade has shown that in real-world settings, the size of the dataset is the most important ... Studies have repeatedly shown that simple models trained over enormous quantities of data outperform more sophisticated models trained on less data ...."
Lin and Ryaboy
RHadoop: Map-Reduce with R
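A minimal sketch in the spirit of the rmr2 tutorial, assuming a working RHadoop installation; the squaring job is the package's canonical toy example, used here only to show the shape of the API.

# A minimal rmr2 sketch, assuming a working RHadoop installation
library(rmr2)

# to.dfs() pushes a small R object into HDFS for experimentation
small.ints <- to.dfs(1:1000)

# mapreduce() submits a Hadoop job; the map function emits key/value
# pairs with keyval(), here squaring each input value
result <- mapreduce(input = small.ints,
                    map = function(k, v) keyval(v, v^2))

# from.dfs() pulls the result of the job back into the R session
head(from.dfs(result)$val)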
Essential References
• Statistics vs. Data Mining
– Statistical Modeling: The Two Cultures, Leo Breiman, 2001 http://bit.ly/15gO2oB
• Mathematical Formulations of Big Data Issues
– On Measuring and Correcting the Effects of Data Mining and Model Selection, Ye, 1998 http://bit.ly/12YpZN7
– High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality, Donoho, 2000 http://stanford.io/fbQoQU
• Machine Learning in the Hadoop Environment
– Large Scale Machine Learning at Twitter, Lin and Kolcz, 2012 http://bit.ly/JMQEhP
– Scaling Big Data Mining Infrastructure: The Twitter Experience, Lin and Ryaboy, 2012 http://bit.ly/10kVOca
– How-to: Resample from a Large Data Set in Parallel (with R on Hadoop), Laserson, 2013 http://bit.ly/YRQIDD
• Statistical Techniques for Big Data
– A Scalable Bootstrap for Massive Data, Kleiner et al., 2011 http://bit.ly/PfaO75
• Big Data Decision Trees
– Big Data Decision Trees with R, Calaway, Edlefsen and Gong http://bit.ly/10BtmrW
– A Streaming Parallel Decision Tree Algorithm, Ben-Haim and Yom-Tov, 2010
  • Short paper http://bit.ly/11BHdK4
  • Long paper http://bit.ly/11PJ0Kr
Bridging the Gaps
Statistics with Revolution R Enterprise
Model Building with RevoScaleR
Agenda:
• What is RevoScaleR?
• RevoScaleR and Hadoop
• Run some code
RevoScaleR

[Diagram: RevoScaleR capability areas: R Data Step, Predictive Models, Descriptive Statistics, Data Visualization, Statistical Tests, Machine Learning, Sampling, Simulation]
RevoScaleR
• An R package that ships exclusively with Revolution R Enterprise
• Implements Parallel External Memory Algorithms (PEMAs)
• Provides functions to:
– import, clean, explore and transform data
– perform statistical analysis and predictive analytics
– enable distributed computing
• Scales from small local data to huge distributed data
• The same code works on small and big data, and on workstation, server, cluster and Hadoop
RevoScaleR Functions
Data Prep, Distillation & Descriptive Analytics

R Data Step
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort / Merge / Split
• Aggregate by category (means, sums)
• Use any of the functionality of the R language to transform and clean data row by row!

Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance / Covariance
• Correlation
• Sum of Squares (cross-product matrix for set variables)
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross-Tabulations

Statistical Tests
• Chi-Square Test
• t-Test
• F-Test
• Plus 1,000's of other tests available in R!

Sampling
• Subsample (observations & variables)
• Random Sampling
• High quality, fast, parallel random number generators
Parallel External Memory Algorithms (PEMAs)
• The ScaleR analytics algorithms are all built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
• PEMAs process data a chunk at a time, in parallel across cores and nodes, in four steps:
1. Initialize
2. Process Chunk
3. Aggregate
4. Finalize
A sketch of the four steps follows.
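A hedged in-memory sketch of those four steps for a chunked mean; this illustrates the pattern only and is not the RevoScaleR implementation.

# A hedged in-memory sketch of the four PEMA steps for a chunked mean;
# an illustration of the pattern, not the RevoScaleR implementation
initState <- function() list(sum = 0, n = 0)                # 1. Initialize
processChunk <- function(state, chunk)                      # 2. Process Chunk
    list(sum = state$sum + sum(chunk), n = state$n + length(chunk))
aggregateStates <- function(states)                         # 3. Aggregate
    Reduce(function(a, b) list(sum = a$sum + b$sum, n = a$n + b$n), states)
finalize <- function(state) state$sum / state$n             # 4. Finalize

# Two "nodes" each reduce over their own chunks independently
node1 <- Reduce(processChunk, list(1:10, 11:20), initState())
node2 <- Reduce(processChunk, list(21:30), initState())
finalize(aggregateStates(list(node1, node2)))   # mean of 1:30 = 15.5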
RevoScaleR PEMAs
Statistical Modeling and Machine Learning

Predictive Models
• Covariance, Correlation, Sum of Squares (cross-product matrix for set variables) matrices
• Multiple Linear Regression
• Generalized Linear Models (GLM): all exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
• Logistic Regression
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models

Data Visualization
• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)

Variable Selection
• Stepwise Regression
• PCA

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests

Simulation
• Parallel random number generators for Monte Carlo
• Use the rich functionality of R for simulations
RevoScaleR Scalability and Performance
• Handles an arbitrarily large number of rows in a fixed amount of memory
• Scales linearly with the number of rows
• Scales linearly with the number of nodes
• Scales well with the number of cores per node
• Scales well with the number of parameters
• Independent of the "compute context":
– number of cores
– number of computers
– distributed computing platform
• Extremely high performance
GLM comparison using in-memory data: glm() and ScaleR's rxGlm()

[Figure: comparison of glm() and rxGlm() on in-memory data]
Specific speed-related factors
• Efficient computational algorithms
• Efficient memory management: minimize data copying and data conversion
• Heavy use of C++ templates; optimal code
• Efficient data file format; fast access by row and column
• Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Handle categorical variables efficiently
Write Once. Deploy Anywhere.
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE

Platforms:
• Hadoop: Hortonworks, Cloudera
• EDW: IBM, Teradata
• Clustered Systems: Platform LSF, Microsoft HPC
• Workstations & Servers: Desktop, Server, Linux
• In the Cloud: Microsoft Azure Burst, Amazon AWS

Product stack: DeployR / ConnectR / RevoScaleR / DistributedR
RRE in Hadoop
A Simple Goal: Hadoop As An R Engine
• Run Revolution R Enterprise Code In Hadoop Without Change
• Provide ScaleR Pre-Parallelized Algorithms
• No Need To "Think In MapReduce"
• Eliminate Data Movement To Slash Latencies
• Expanded Deployment Options
RRE in Hadoop

[Diagram: HDFS layer (Name Node with five Data Nodes) and MapReduce layer (Job Tracker with a Task Tracker running alongside each Data Node)]
RevoScaleR on Hadoop
• Each pass through the data is one MapReduce job
• Prediction (scoring), transformation, simulation:
– Map tasks store results in HDFS or return them to the client
• Statistics, model building, visualization:
– Map tasks produce "intermediate result objects" that are aggregated by a Reduce task
– A master process decides whether another pass through the data is required
• Data can be cached or stored in the XDF binary format for increased speed, especially for iterative algorithms
A sketch of the iteration driver follows.
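A hedged sketch of the iteration logic just described. The runMapReducePass() helper is hypothetical: it stands for one MapReduce job whose map tasks emit intermediate result objects and whose reduce task aggregates them into updated parameter estimates.

# A hedged sketch of the master's iteration loop; runMapReducePass()
# is a hypothetical stand-in for one MapReduce job
fitIteratively <- function(params, tol = 1e-6, maxPasses = 25) {
    for (pass in seq_len(maxPasses)) {
        result <- runMapReducePass(params)   # one pass = one MR job
        newParams <- result$updatedParams
        # The master process checks convergence to decide whether
        # another pass through the data is required
        if (max(abs(newParams - params)) < tol) break
        params <- newParams
    }
    params
}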
Let’s run some code.
Sample code: logit on a workstation
# Specify the local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month +
        DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)
Sample code: logit on Hadoop
# Change the "compute context"
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month +
        DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)
Demo: rxLinMod in Hadoop

[Screenshots: the rxLinMod job launching, in progress, and completed]
Theory that could help deflate the hype
• Provide a definition of big data that makes statistical sense
• Characterize the type of data mining classification problem in which more data does beat sophisticated models
• Describe the boundary where rpart-type algorithms should yield to rxDTree-type approaches