Big Data, R, Statistics and
Hadoop
Data Science & Business Analytics
October 22, 2013
Joseph B. Rickert
Data Scientist, Revolution Analytics
The 3 Realms
From the Statistician's Point of View
The Hype
2008: http://www.edge.org/3rd_culture/anderson08/anderson08_index.html
2013: "Big Data is one of THE biggest buzzwords around at the moment, and I believe big data will change the world."
Bernard Marr, 6/6/13, http://bit.ly/16X59iL
3
The collision of two cultures
4
This Talk
Where would some theory help?
Putting the hype aside:
• What tools exist in R to meet the challenges of large data sets?
• What are the practical aspects of doing statistics on large data sets?
5
The Sweet Spot for “doing” Statistics
as we have come to love it:
• Any algorithm you can imagine
• “In the flow” work environment
• A sense of always moving forward
• Quick visualizations
• You can get far without much real programming
[Diagram: number of rows up to ~10^6 – data in memory]
6
The 3 Realms
[Diagram: the three realms by number of rows]
• Up to ~10^6 rows: data in memory – feels like statistics
• Up to ~10^11 rows: data in a file – the realm of “chunking”
• More than 10^12 rows: data in multiple files – the realm of massive data; feels like machine learning
7
The realm of “chunking” (data in a file, ~10^11 rows)
What’s new here?
• External memory algorithms
• Distributed computing
• Change your way of working
8
The realm of “chunking” (data in a file, ~10^11 rows)
External Memory Algorithms
Operate on the data chunk by chunk:
    Declare and initialize the variables needed
    for (i in 1 to number_of_chunks) {
        Perform the calculations for that chunk
        Update the variables being computed
    }
    When all chunks have been processed, do the final calculations
You only see a small part of the data at one time – some things, e.g. factors, are trouble.
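For concreteness, here is a minimal base-R sketch of this pattern (not RevoScaleR): computing the mean arrival delay in a large CSV file one chunk at a time. The file name flights.csv and the ArrDelay column are hypothetical stand-ins for the airline data.
con <- file("flights.csv", open = "r")
nms <- strsplit(readLines(con, n = 1), ",")[[1]]       # read just the header line
total <- 0; n <- 0                                     # declare and initialize the variables
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = 100000, header = FALSE, col.names = nms),
    error = function(e) NULL)                          # NULL once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk$ArrDelay, na.rm = TRUE)   # perform the calculations for this chunk
  n <- n + sum(!is.na(chunk$ArrDelay))                 # update the variables being computed
}
close(con)
total / n                                              # final calculation over all chunks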
9
The realm of “chunking” (data in a file, ~10^11 rows)
# Each record of the data file contains information for an individual commercial airline flight.
# One of the variables collected is the DayOfWeek of the flight.
# This function tabulates DayOfWeek chunk by chunk.
chunkTable <- function(fileName, varsToKeep = NULL, blocksPerRead = 1)
{
    ProcessChunkAndUpdate <- function(dataList) {
        # Process this chunk of data
        chunkTable <- table(as.data.frame(dataList))
        # Update the running results
        tableSum <- chunkTable + .rxGet("tableSum")
        .rxSet("tableSum", tableSum)
        cat("Chunk number: ", .rxChunkNum, " tableSum = ", tableSum, "\n")
        return(NULL)
    }
    updatedObjects <- rxDataStep(inData = fileName,
                                 varsToKeep = varsToKeep,
                                 blocksPerRead = blocksPerRead,
                                 transformObjects = list(tableSum = 0),
                                 transformFunc = ProcessChunkAndUpdate,
                                 returnTransformObjects = TRUE,
                                 reportProgress = 0)
    return(updatedObjects$tableSum)
}
chunkTable(fileName = fileName, varsToKeep = "DayOfWeek")

> chunkTable(fileName = fileName, varsToKeep = "DayOfWeek")
Chunk number: 1 tableSum = 33137 27267 27942 28141 28184 25646 29683
Chunk number: 2 tableSum = 65544 52874 53857 54247 54395 55596 63487
Chunk number: 3 tableSum = 97975 77725 78875 81304 82987 86159 94975
   Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday
    97975     77725     78875     81304     82987     86159     94975
10
The realm of “chunking” (data in a file, ~10^11 rows)
Distributed Computing
• Must deal with cluster management
• Data storage and allocation strategies important
[Diagram: a master node coordinating several compute nodes, each holding its own portion of the data]
11
The realm of “chunking” (data in a file, ~10^11 rows)
Change your way of working
• Might have to change your usual way of working (e.g. not feasible to “look at” residuals to validate a regression model)
• Don’t compute things you are not going to use (e.g. residuals)
• Plotting what you want to see may be difficult
• Limited number of functions available
• Some real programming likely
12
R Tools for the realm of “chunking”
External Memory Algorithms:
– bigmemory: massive matrices in memory-mapped files
– ff and ffbase: file-based access to data sets
– SciDB-R: access massive SciDB matrices from R
– RevoScaleR:
  • Parallel external memory algorithms, e.g. rxDTree
  • Distributed computing infrastructure
Visualization:
– bigvis: aggregation and smoothing applied to visualization
– tabplot
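As a hedged illustration of the file-based approach, the sketch below uses ff to tabulate DayOfWeek without loading the file into memory; flights.csv is a hypothetical stand-in for the airline data.
library(ff)
flights <- read.csv.ffdf(file = "flights.csv", header = TRUE)   # data stays on disk
tab <- NULL
for (idx in chunk(flights$DayOfWeek)) {      # iterate over manageable row chunks
  t1 <- table(flights$DayOfWeek[idx])        # only this chunk is pulled into RAM
  tab <- if (is.null(tab)) t1 else tab + t1  # ff factors share one global set of levels,
}                                            # so the partial tables line up
tab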
13
rxDTree: trees for big data
• Based on an algorithm published by Ben-Haim and Yom-Tov in 2010
• Avoids sorting the raw data
• Builds trees using histogram summaries of the data
• Inherently parallel: each compute node sees 1/N of the data (all variables)
• Compute nodes build histograms for all variables
• Master node integrates the histograms and builds the tree
# Build a tree using rxDTree with a 2,021,019-row version of the
# segmentationData data set from the caret package
allvars <- names(segmentationData)
xvars <- allvars[-c(1, 2, 3)]
form <- as.formula(paste("Class", "~", paste(xvars, collapse = "+")))
#
cp <- 0.01        # Set the complexity parameter
xval <- 0         # Don't do any cross validation
maxdepth <- 5     # Set the maximum tree depth
## ---------------------------------------------
# Build a model with rxDTree
# Looks like rpart() but with a parameter maxNumBins to control accuracy
dtree.model <- rxDTree(form,
                       data = "segmentationDataBig",
                       maxNumBins = NULL,
                       maxDepth = maxdepth,
                       cp = cp, xVal = xval,
                       blocksPerRead = 250)
14
The realm of massive data (data in multiple files, >10^12 rows)
What’s new here?
• The cluster is given!!
• Restricted to the Map/Reduce paradigm
• Basic statistical tasks are difficult
• This is batch programming! The “flow” is gone.
• The data mining mindset
15
The realm of massive data (data in multiple files, >10^12 rows)
The cluster is given!!
• Parallel computing is necessary
• Distributed, data-parallel computation favors ensemble methods
16
The realm of massive data (data in multiple files, >10^12 rows)
The Map/Reduce Paradigm
• Very limited number of algorithms readily available
• Algorithms that need coordination among compute nodes are difficult or slow
• Serious programming is required
• Multiple languages likely
17
The realm of massive data (data in multiple files, >10^12 rows)
Basic statistical tasks are challenging:
• Getting random samples of exact lengths is difficult
• Approximate sampling methods are common
• Independent parallel random number streams are required
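A minimal sketch of that last requirement using base R's parallel package and L'Ecuyer-CMRG streams; this is workstation-scale code, but the same stream mechanism underlies distributed approaches.
library(parallel)
cl <- makeCluster(4)
clusterSetRNGStream(cl, iseed = 20131022)         # independent, reproducible stream per worker
draws <- parLapply(cl, 1:4, function(i) rnorm(5)) # each worker draws from its own stream
stopCluster(cl)
str(draws)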
18
The realm of massive data (data in multiple files, >10^12 rows)
The Data Mining Mindset:
"Accumulated experience over the last decade has shown that in real-world settings, the size of the dataset is the most important ... Studies have repeatedly shown that simple models trained over enormous quantities of data outperform more sophisticated models trained on less data ..."
– Lin and Ryaboy
19
RHadoop: Map-Reduce with R
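A minimal sketch of the map-reduce style in R, assuming the rmr2 package from the RHadoop project: it tabulates a toy vector of flight days with one MapReduce job. The "local" backend runs the same code without a Hadoop cluster; switch it to "hadoop" on a real cluster.
library(rmr2)
rmr.options(backend = "local")                    # no cluster needed for testing
days <- to.dfs(sample(1:7, 1000, replace = TRUE)) # toy stand-in for DayOfWeek
out <- mapreduce(
  input  = days,
  map    = function(k, v) keyval(v, 1),               # emit (day, 1) pairs
  reduce = function(k, counts) keyval(k, sum(counts)) # sum the counts for each day
)
from.dfs(out)                                     # key-value result: days and their counts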
20
Essential References
Statistics vs. Data Mining
– Statistical Modeling: The Two Cultures: Leo Breiman, 2001 http://bit.ly/15gO2oB
Mathematical Formulations of Big Data Issues
– On Measuring and Correcting the Effects of Data Mining and Model Selection: Ye, 1998 http://bit.ly/12YpZN7
– High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality: Donoho, 2000 http://stanford.io/fbQoQU
Machine Learning in the Hadoop Environment
– Large Scale Machine Learning at Twitter: Lin and Kolcz, 2012 http://bit.ly/JMQEhP
– Scaling Big Data Mining Infrastructure: The Twitter Experience: Lin and Ryaboy, 2012 http://bit.ly/10kVOca
– How-to: Resample from a Large Data Set in Parallel (with R on Hadoop): Laserson, 2013 http://bit.ly/YRQIDD
Statistical Techniques for Big Data
– A Scalable Bootstrap for Massive Data: Kleiner et al., 2011 http://bit.ly/PfaO75
Big Data Decision Trees
– Big Data Decision Trees with R: Calaway, Edlefsen and Gong http://bit.ly/10BtmrW
– A streaming parallel decision tree algorithm: Ben-Haim and Yom-Tov, 2010
  • Short paper http://bit.ly/11BHdK4
  • Long paper http://bit.ly/11PJ0Kr
21
Bridging the Gaps
Statistics with Revolution R Enterprise
Model Building with RevoScaleR
Agenda:
What is RevoScaleR?
RevoScaleR and Hadoop
Run some code
23
RevoScaleR
[Diagram: RevoScaleR functional areas – R Data Step, Predictive Models, Descriptive Statistics, Data Visualization, Statistical Tests, Machine Learning, Sampling, Simulation]
24
RevoScaleR
• An R package that ships exclusively with Revolution R Enterprise
• Implements Parallel External Memory Algorithms (PEMAs)
• Provides functions to:
  – Import, clean, explore and transform data
  – Perform statistical analysis and predictive analytics
  – Enable distributed computing
• Scales from small local data to huge distributed data
• The same code works on small and big data, and on a workstation, server, cluster, or Hadoop
25
RevoScaleR Functions
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import – delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort / Merge / Split
• Aggregate by category (means, sums)
• Use any of the functionality of the R language to transform and clean data row by row!
Descriptive Statistics
• Min / Max
• Mean
• Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance / Covariance
• Correlation
• Sum of Squares (cross-product matrix for set variables)
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross-Tabulations
Statistical Tests
• Chi-Square Test
• t-Test
• F-Test
• Plus 1,000’s of other tests available in R!
Sampling
• Subsample (observations & variables)
• Random Sampling
• High-quality, fast, parallel random number generators
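A short hedged sketch of the data-step and descriptive-statistics functions listed above; flights.csv and its column names are placeholder examples, not values from the talk.
library(RevoScaleR)
# Import a delimited file into the XDF format, adding a derived variable on the way
rxImport(inData = "flights.csv", outFile = "flights.xdf",
         transforms = list(Late = ArrDelay > 15), overwrite = TRUE)
rxSummary(~ ArrDelay + DepDelay, data = "flights.xdf")   # chunked descriptive statistics
rxHistogram(~ DayOfWeek, data = "flights.xdf")           # quick visualization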
26
Parallel External Memory Algorithms (PEMAs)
The ScaleR analytics algorithms are all built on a platform (DistributeR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms.
PEMAs process data a chunk at a time, in parallel across cores and nodes:
1. Initialize
2. Process Chunk
3. Aggregate
4. Finalize
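This is not the RevoScaleR internals, just a conceptual sketch of the four-step pattern using base R's parallel package on an in-memory toy vector.
library(parallel)
x <- rnorm(1e6)
# 1. Initialize: split the row indices into chunks
chunks <- split(seq_along(x), cut(seq_along(x), 8, labels = FALSE))
# 2. Process chunk: each worker returns the sufficient statistics for its chunk
partials <- mclapply(chunks, function(idx) c(sum = sum(x[idx]), n = length(idx)),
                     mc.cores = 4)             # use parLapply instead on Windows
# 3. Aggregate the intermediate results
totals <- Reduce(`+`, partials)
# 4. Finalize
totals["sum"] / totals["n"]                    # the overall mean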
27
RevoScaleR PEMAs
Statistical Modeling
Predictive Models
• Covariance, Correlation, Sum of Squares (cross-product matrix for set variables) matrices
• Multiple Linear Regression
• Generalized Linear Models (GLM) – all exponential-family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie; standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
• Logistic Regression
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models
Machine Learning
• Variable Selection: Stepwise Regression
• PCA
• Cluster Analysis: K-Means
• Classification: Decision Trees, Decision Forests
Data Visualization
• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
Simulation
• Parallel random number generators for Monte Carlo
• Use the rich functionality of R for simulations
28
RevoScaleR Scalability and Performance
• Handles an arbitrarily large number of rows in a fixed amount of memory
• Scales linearly with the number of rows
• Scales linearly with the number of nodes
• Scales well with the number of cores per node
• Scales well with the number of parameters
• Independent of the “compute context” (number of cores, computers, distributed computing platform)
• Extremely high performance
29
GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
30
Specific speed-related factors
• Efficient computational algorithms
• Efficient memory management – minimize data copying and data conversion
• Heavy use of C++ templates; optimized code
• Efficient data file format; fast access by row and column
• Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
• Categorical variables are handled efficiently
31
Write Once. Deploy Anywhere.
Hadoop: Hortonworks, Cloudera
EDW: IBM, Teradata
Clustered Systems: Platform LSF, Microsoft HPC
Workstations & Servers: Desktop, Server, Linux
In the Cloud: Microsoft Azure Burst, Amazon AWS
[Diagram: the DeployR, ConnectR, RevoScaleR and DistributedR layers deployed across all of these platforms]
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
32
RRE in Hadoop
33
A Simple Goal: Hadoop As An R Engine.
• Run Revolution R Enterprise code in Hadoop without change
• Provide ScaleR pre-parallelized algorithms
• No need to “think in MapReduce”
• Eliminate data movement to slash latencies
• Expanded deployment options
34
RRE in Hadoop
[Diagram: HDFS (Name Node and Data Nodes) and MapReduce (Job Tracker and Task Trackers) across a five-node cluster]
35
RevoScaleR on Hadoop
• Each pass through the data is one MapReduce job
• Prediction (scoring), transformation, simulation:
  – Map tasks store results in HDFS or return them to the client
• Statistics, model building, visualization:
  – Map tasks produce “intermediate result objects” that are aggregated by a Reduce task
  – The master process decides if another pass through the data is required
• Data can be cached or stored in the XDF binary format for increased speed, especially on iterative algorithms
37
Let’s run some code.
Sample code: logit on workstation
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)
39
Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)
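As a hedged illustration, one way myHadoopCluster and myHadoopDataSource might be defined; the host names, user names and paths below are site-specific placeholders, not values from the talk.
# Compute context pointing at a Hadoop cluster (placeholder connection details)
myHadoopCluster <- RxHadoopMR(sshUsername  = "analyst",
                              sshHostname  = "namenode.example.com",
                              hdfsShareDir = "/user/analyst/share",
                              shareDir     = "/var/RevoShare/analyst")
# Data source for a delimited file stored in HDFS (placeholder path)
myHadoopDataSource <- RxTextData("/user/analyst/airline.csv",
                                 fileSystem = RxHdfsFileSystem())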
40
Demo: rxLinMod in Hadoop – Launching [screenshot]
41
Demo: rxLinMod in Hadoop – In Progress [screenshot]
42
Demo: rxLinMod in Hadoop – Completed [screenshot]
43
Theory that could help deflate the hype
• Provide a definition of big data that makes statistical sense
• Characterize the type of data mining classification problem in which more data does beat sophisticated models
• Describe the boundary where rpart-type algorithms should yield to rxDTree-type approaches
44