Machine Learning at Scale
using h2o
Giri Tatavarty
Data Science Manager – R&D
dunnhumby inc
https://www.linkedin.com/in/giridhar-tatavarty-1a16503
What is Machine Learning?
● It is a field of computer science that draws on concepts from statistics,
pattern recognition, artificial intelligence and computational
learning theory. It is about algorithms that can teach themselves
to recognize patterns in data without being explicitly programmed.
● Supervised learning: The algorithm is presented with example
inputs and their desired outputs, and the goal is to learn a general
rule that maps inputs to outputs. Examples: image tagging, voice
recognition, fraud detection, time series forecasting, predictive analytics,
spam detection, face recognition, fingerprint recognition, handwriting
recognition, Netflix or Amazon recommendations
© dunnhumby 2015 | Confidential
● Unsupervised learning: No labels/examples are given to the
learning algorithm, leaving the algorithm on its own to find structure in
the data. Examples: clustering, segmentation
● Reinforcement learning: A computer program interacts with a
dynamic environment in which it must achieve a certain goal (such
as driving a vehicle or playing a game of chess), without a teacher
explicitly telling it whether it has come close to its goal.
Machine Learning in R
Machine Learning with R
● Different packages for different ML algorithms
– xgboost, rpart, randomForest, bigrf, glmnet, gbm
● The programmatic interfaces and parameter inputs are very different
across these packages
● Data needs to be standardized
– Convert categorical text data to multiple binary columns
– Standardize/scale numeric values
● A hyperparameter optimization framework is missing from the standard
packages
● Working with big data is hard, as these packages are limited by the
capacity of a single machine
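The manual preparation listed above can be sketched in base R. This is only an illustration on a made-up data frame (the column names `store_type` and `spend` are hypothetical), not code from the original slides:

```r
# Toy data; column names and values are made up for illustration
df <- data.frame(
  store_type = factor(c("small", "large", "medium", "large")),
  spend      = c(12.5, 40.0, 8.0, 55.5)
)

# Convert the categorical column into one binary (0/1) column per level;
# "- 1" drops the intercept so that every level gets its own column
dummies <- model.matrix(~ store_type - 1, data = df)

# Standardize the numeric column to mean 0 and standard deviation 1
spend_scaled <- as.numeric(scale(df$spend))

prepared <- data.frame(dummies, spend_scaled)
print(prepared)
```

Each package may expect this preparation in a slightly different form, which is part of the friction the slide describes.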
h2o solves some of these problems
● Standard interface across all algorithms
● No need to standardize or convert categorical variables
● Runs on big data. The same code can run on your laptop as well as
on a 100-node cluster.
● Generates production-ready, high-performance Java code for
scoring
● Can be used from other languages such as Python, through the REST API,
or through the web interface for data exploration and analysis
Cons
● Limited to what h2o itself implements (to be fair, h2o is
open source)
H2o architecture
H2o architecture - II
Data is compressed, chunked and distributed across the nodes.
Processing is done in a tree-based topology to minimize inter-node
communication and summarize the data locally as much as possible.
H2o Installation in R
# Install the latest version from CRAN
install.packages("h2o")
# or install a specific version from Amazon S3
install.packages("h2o", type="source", repos=(c("http://h2orelease.s3.amazonaws.com/h2o/rel-tibshirani/8/R")))
# Load the library and start up a local h2o engine using all available cores
library(h2o)
localH2O = h2o.init(nthreads=-1)
# Run the demo
demo(h2o.kmeans)
Baby Steps - Iris dataset
Iris data set
– 150 examples
– 4 attributes
– 3 classes
Task: Train a model to predict the species (class) of the flower from its attributes
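A minimal sketch of this task, assuming a local h2o engine and R's built-in `iris` data frame. The choice of a random forest and the values of `ntrees` and `seed` are illustrative assumptions, not necessarily what the original slide showed:

```r
library(h2o)
h2o.init(nthreads = -1)

# Copy R's built-in iris data frame into the h2o cluster
iris.hex <- as.h2o(iris)

# Train a random forest that predicts Species from the four measurements
# (ntrees and seed are arbitrary choices for this sketch)
iris.rf <- h2o.randomForest(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Species",
  training_frame = iris.hex,
  ntrees = 50,
  seed = 1
)

# Confusion matrix of actual versus predicted species
print(h2o.confusionMatrix(iris.rf))
```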
Model Statistics – Open http://localhost:54321 in a browser (free plots)
Step 2: Split data into training and test datasets; Report performance on unseen data
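One way this step might look, again on `iris`. The 75/25 ratio, the seed and the model settings are assumptions made for the sketch:

```r
library(h2o)
h2o.init(nthreads = -1)

iris.hex <- as.h2o(iris)

# Hold out roughly 25% of the rows as unseen test data
# (h2o.splitFrame splits approximately, not exactly)
splits <- h2o.splitFrame(iris.hex, ratios = 0.75, seed = 1)
train <- splits[[1]]
test  <- splits[[2]]

# Train only on the training split
model <- h2o.randomForest(
  x = names(iris)[1:4],
  y = "Species",
  training_frame = train,
  ntrees = 50,
  seed = 1
)

# Evaluate on rows the model never saw during training
perf <- h2o.performance(model, newdata = test)
print(h2o.confusionMatrix(perf))
```

Reporting metrics on the held-out split gives a more honest estimate than metrics computed on the rows used for training.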
Big Data Test Case – http://www.dunnhumby.com/sourcefiles.aspx
DataSet 1 - Transactions
– ~50 GB
– ~300M rows
– 22 columns
PREDICT IF CUSTOMER IS GOING TO VISIT THE
STORE NEXT WEEK,
BASED ON PREVIOUS VISITS
If yes, predict what the customer is likely to buy and activate the
necessary promotion channels.
How do we go about predictions?
● Create a model which takes a customer's spend over the previous 12 weeks (or
n weeks) to predict a visit in the current week
– Create a Training Data Set
– Create a Test Data Set
– Train the Model on Training data set
– Test the predictions on Test Dataset
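The shape of the modelling table described above can be sketched in base R on a small synthetic data set. The column names `CUST`, `WEEK` and `VISITED`, and the rule "any spend in the week counts as a visit", are assumptions for illustration only:

```r
# Hypothetical toy data: one row per customer per week with that week's spend
set.seed(42)
tx <- expand.grid(CUST = 1:5, WEEK = 1:13)
tx$SPEND <- round(runif(nrow(tx), 0, 50) * rbinom(nrow(tx), 1, 0.6), 2)

n_weeks     <- 12   # history window used as model input
target_week <- 13   # week whose visit we want to predict

# Features: one SPEND column per historical week, one row per customer
history  <- tx[tx$WEEK <= n_weeks, ]
features <- reshape(history, idvar = "CUST", timevar = "WEEK",
                    direction = "wide")

# Label: did the customer spend anything (assumed to mean "visited")
# in the target week?
target <- tx[tx$WEEK == target_week, c("CUST", "SPEND")]
target$VISITED <- as.integer(target$SPEND > 0)

model_data <- merge(features, target[, c("CUST", "VISITED")], by = "CUST")
print(dim(model_data))
```

A test data set can be built the same way by sliding the window forward, for example using weeks 2 to 13 to predict week 14.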
Large dataset ingestion on laptop
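A hedged sketch of ingestion with `h2o.importFile`. A tiny temporary CSV stands in for the real ~50 GB transaction file, and apart from `SPEND` the column names are assumptions:

```r
library(h2o)
h2o.init(nthreads = -1)

# Stand-in for the real transaction file: a tiny CSV written to a temp path
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    SHOP_WEEK = c(200801, 200801, 200802),
    CUST_CODE = c("C1", "C2", "C1"),
    SPEND     = c(3.50, 12.00, 7.25)
  ),
  csv_path,
  row.names = FALSE
)

# h2o.importFile parses the file inside the h2o engine, so the data
# does not have to be loaded into R's own memory first
transactions.hex <- h2o.importFile(path = csv_path,
                                   destination_frame = "transactions.hex")

print(dim(transactions.hex))
print(h2o.describe(transactions.hex))
```

For the real data set, `csv_path` would point at the downloaded transaction files instead of a temporary file.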
A bit of data munging
Plots from R using h2o.hist
h2o.hist(subset.hex$SPEND[subset.hex$SPEND<30 ])
Summarize Data and create features - h2o.group_by
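A small sketch of `h2o.group_by` on made-up data. The `CUST_CODE` column and the chosen aggregates are assumptions, and the aggregations on the original slide may have been different:

```r
library(h2o)
h2o.init(nthreads = -1)

# Toy transactions; the grouping column is a factor so that h2o
# treats it as a categorical (enum) column
tx.hex <- as.h2o(data.frame(
  CUST_CODE = factor(c("C1", "C1", "C2", "C2", "C2")),
  SPEND     = c(5, 10, 2, 3, 4)
))

# One row per customer with total spend and number of transactions
per_cust <- h2o.group_by(tx.hex, by = "CUST_CODE",
                         sum("SPEND"), nrow("SPEND"))
print(per_cust)
```

The aggregation runs inside the h2o engine, so the same call is meant to work on frames far larger than R could hold.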
Final Dataset before ML Model
Creating a Random Forest Model
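A minimal random forest sketch on synthetic visit data. The feature names `SPEND_PREV_12W` and `VISITS_PREV_12W`, the simulated relationship and the model settings are all hypothetical, not the author's actual features or parameters:

```r
library(h2o)
h2o.init(nthreads = -1)

# Synthetic stand-in for the final modelling table
set.seed(1)
n <- 500
visits <- data.frame(
  SPEND_PREV_12W  = runif(n, 0, 300),
  VISITS_PREV_12W = rpois(n, 4)
)
p <- plogis(-2 + 0.01 * visits$SPEND_PREV_12W + 0.3 * visits$VISITS_PREV_12W)
# The response must be a factor for h2o to treat this as classification
visits$VISITED <- factor(rbinom(n, 1, p))

train.hex <- as.h2o(visits)

rf <- h2o.randomForest(
  x = c("SPEND_PREV_12W", "VISITS_PREV_12W"),
  y = "VISITED",
  training_frame = train.hex,
  ntrees = 50,
  max_depth = 10,
  seed = 1
)

# AUC from the model's training metrics
rf_auc <- h2o.auc(rf)
print(rf_auc)
```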
Using Flow to explore the Model
Change and explore thresholds
Logistic Regression – h2o.glm
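A hedged sketch of logistic regression with `h2o.glm`, using `family = "binomial"` on synthetic data. The feature names and the simulated relationship are assumptions:

```r
library(h2o)
h2o.init(nthreads = -1)

# Synthetic stand-in for the modelling table (hypothetical columns)
set.seed(1)
n <- 500
visits <- data.frame(
  SPEND_PREV_12W  = runif(n, 0, 300),
  VISITS_PREV_12W = rpois(n, 4)
)
p <- plogis(-2 + 0.01 * visits$SPEND_PREV_12W + 0.3 * visits$VISITS_PREV_12W)
visits$VISITED <- factor(rbinom(n, 1, p))

train.hex <- as.h2o(visits)

# family = "binomial" makes the GLM a logistic regression
logit <- h2o.glm(
  x = c("SPEND_PREV_12W", "VISITS_PREV_12W"),
  y = "VISITED",
  training_frame = train.hex,
  family = "binomial"
)

# Fitted coefficients, including the intercept
coefs <- h2o.coef(logit)
print(coefs)
```

Note that `h2o.glm` applies regularization by default, so the coefficients may differ from those of base R's `glm`.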
Gradient Boosting Machines
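A sketch of a gradient boosting model with `h2o.gbm` on synthetic data. The features and the values of `ntrees`, `max_depth` and `learn_rate` are illustrative assumptions:

```r
library(h2o)
h2o.init(nthreads = -1)

# Synthetic stand-in for the modelling table (hypothetical columns)
set.seed(1)
n <- 500
visits <- data.frame(
  SPEND_PREV_12W  = runif(n, 0, 300),
  VISITS_PREV_12W = rpois(n, 4)
)
p <- plogis(-2 + 0.01 * visits$SPEND_PREV_12W + 0.3 * visits$VISITS_PREV_12W)
visits$VISITED <- factor(rbinom(n, 1, p))

train.hex <- as.h2o(visits)

gbm_model <- h2o.gbm(
  x = c("SPEND_PREV_12W", "VISITS_PREV_12W"),
  y = "VISITED",
  training_frame = train.hex,
  ntrees = 100,      # number of boosting rounds
  max_depth = 3,     # shallow trees, as is common for boosting
  learn_rate = 0.1,  # shrinkage applied to each tree
  seed = 1
)

# Relative importance of each input column
vi <- h2o.varimp(gbm_model)
print(vi)
```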
Deep Learning and Neural Networks
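A small feed-forward network sketch with `h2o.deeplearning`. The two hidden layers of 16 units, the epoch count and the synthetic data are assumptions chosen to keep the example fast:

```r
library(h2o)
h2o.init(nthreads = -1)

# Synthetic stand-in for the modelling table (hypothetical columns)
set.seed(1)
n <- 500
visits <- data.frame(
  SPEND_PREV_12W  = runif(n, 0, 300),
  VISITS_PREV_12W = rpois(n, 4)
)
p <- plogis(-2 + 0.01 * visits$SPEND_PREV_12W + 0.3 * visits$VISITS_PREV_12W)
visits$VISITED <- factor(rbinom(n, 1, p))

train.hex <- as.h2o(visits)

dl <- h2o.deeplearning(
  x = c("SPEND_PREV_12W", "VISITS_PREV_12W"),
  y = "VISITED",
  training_frame = train.hex,
  hidden = c(16, 16),  # two hidden layers with 16 neurons each
  epochs = 10,
  seed = 1
)

# Predicted class plus one probability column per class
pred <- h2o.predict(dl, train.hex)
print(head(pred))
```

The call has the same shape as for the other algorithms, which is the "standard interface" point made earlier.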
Machine Learning Algorithms supported
● K-Means
● GLM (generalized linear models)
● DRF (Distributed Random Forest)
● Naïve Bayes
● PCA (Principal Component Analysis)
● GBM (Gradient Boosting)
● Deep Learning
Meta Learning
● Grid Search & non-negative least squares (NNLS)
● Ensemble Models
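Grid search can be sketched with `h2o.grid`. The example below covers only the grid search item, not NNLS or ensembles. It tunes a GBM over a small, arbitrary set of hyperparameters, and the grid id `gbm_grid_sketch`, the data and the parameter values are all hypothetical:

```r
library(h2o)
h2o.init(nthreads = -1)

# Synthetic stand-in for the modelling table (hypothetical columns)
set.seed(1)
n <- 500
visits <- data.frame(
  SPEND_PREV_12W  = runif(n, 0, 300),
  VISITS_PREV_12W = rpois(n, 4)
)
p <- plogis(-2 + 0.01 * visits$SPEND_PREV_12W + 0.3 * visits$VISITS_PREV_12W)
visits$VISITED <- factor(rbinom(n, 1, p))

splits <- h2o.splitFrame(as.h2o(visits), ratios = 0.8, seed = 1)

# Train one GBM per combination of the listed hyperparameter values
grid <- h2o.grid(
  algorithm = "gbm",
  grid_id = "gbm_grid_sketch",
  x = c("SPEND_PREV_12W", "VISITS_PREV_12W"),
  y = "VISITED",
  training_frame = splits[[1]],
  validation_frame = splits[[2]],
  hyper_params = list(max_depth = c(2, 4), learn_rate = c(0.05, 0.2)),
  ntrees = 30,
  seed = 1
)

# Rank the models by AUC, best first, and fetch the winner
sorted <- h2o.getGrid("gbm_grid_sketch", sort_by = "auc", decreasing = TRUE)
best   <- h2o.getModel(sorted@model_ids[[1]])
print(sorted)
```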
Resources
● http://h2o.ai
● http://bit.ly/1Qh79Xr h2o booklet on R
● http://www.dunnhumby.com/sites/default/files/filepicker/1/dunnhumby_-_Let_s_Get_Sort-of-Real_User_Guide.pdf