Data Science with MRS 9

Marcin Szeliga
Data Science that scales
SQLSat Kyiv Team
Yevhen Nedashkivskyi
Alesya Zhuk
Eugene Polonichko
Oksana Borysenko
Oksana Tkach
Mykola Pobyivovk
Sponsor Sessions Starts at 13:10
Don’t miss them; they may provide
some interesting and valuable information!
Room A
Room B
Room C
13:10 - 13:30
DevArt
Microsoft
Eleks
13:30 - 13:50
DB Best
Intapp
DataArt
Our Awesome Sponsors
Session will begin very soon :)
Please complete the evaluation form
after the session.
Your feedback will help us improve
future conferences, and the speakers
will appreciate it!
Enjoy the conference!
Marcin Szeliga
Data philosopher
20 years of experience with SQL Server
Data Platform MVP & MCT
Microsoft Certified Solutions Expert
Data Platform
Data Management and Analytics
Cloud Platform and Infrastructure
[email protected]
Agenda
Tools
MRO (Microsoft R Open), MRC (Microsoft R Client), MRS
(Microsoft R Server)
Tips & tricks on performing a data science experiment
with 540 lines of R code
Data ingestion
Data preparation
Data profiling
Data enhancement
Data modeling
Model evaluation
Model improvement
Model operationalization
Microsoft R Open (MRO)
Based on R Open (Revolution R Open to be precise)
Free and Open Source R distribution
Compatible with all R-related software
MRAN website
https://mran.revolutionanalytics.com/
Enhanced and distributed by Microsoft
Intel MKL Library
Reproducible R toolkit
ParallelR
Rhadoop
AzureML
Microsoft R Client (MRC)
Free, community-supported data science tool for high-performance
analytics
http://aka.ms/rclient/download
Built on top of Microsoft R Open (MRO)
Brings together ScaleR technology and its proprietary
functions
Lets you work with production data, but only locally
Data to be processed must fit in local memory
Processing is limited to two threads for ScaleR functions
R Tools for Visual Studio (RTVS) is an integrated
development environment available as a free add-in for
any edition of Visual Studio
https://www.visualstudio.com/vs/rtvs/
Microsoft R Server (MRS) 9
R for the enterprise
Available for download from MSDN and Visual Studio Dev
Essentials
Adds support for
Remote execution
Remote compute contexts
Data chunking
Additional threads for multithreaded processing
Parallel processing and streaming
R Server platforms
R Server for Hadoop
R Server for Teradata DB
R Server for Linux
R Server for Windows
SQL Server R Services
What’s new in MRS 9
MRS 9.0 brings
State-of-the-art machine learning algorithms (MicrosoftML library)
Fast linear learner, with support for L1 and L2 regularization
Fast boosted decision tree
Fast random forest
Logistic regression, with support for L1 and L2 regularization
GPU-accelerated Deep Neural Networks (DNNs) with convolutions
Binary classification using a One-Class Support Vector Machine
Simplified operationalization of R Models (MRSDeploy)
New data sources for Apache Hive and Parquet
MRS 9.1 adds
Pre-trained cognitive models for sentiment analysis and image
featurization
New platform - Apache Spark on a HDInsight cluster
Real-time scoring
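As a flavor of the MicrosoftML additions, a minimal sketch of training one of the new learners, the fast boosted decision tree, on made-up fraud data. It assumes an MRS 9.x installation with the MicrosoftML package; the column names are hypothetical:

```r
library(MicrosoftML)

# Hypothetical labeled transactions: isFraud is the binary label
train <- data.frame(
  amount  = c(10, 5000, 20, 7500, 15, 9000),
  age     = c(30, 22, 45, 25, 50, 23),
  isFraud = c(0, 1, 0, 1, 0, 1)
)

# Fast boosted decision tree for binary classification
model  <- rxFastTrees(isFraud ~ amount + age, data = train,
                      type = "binary", numTrees = 50)
scores <- rxPredict(model, data = train)
```

The other new learners (rxFastLinear, rxFastForest, rxLogisticRegression, rxNeuralNet, rxOneClassSvm) follow the same formula-and-data interface.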
Data Science approach – follow
the data
SOURCE: CRISP-DM 1.0 http://www.crisp-dm.org/download.htm DESIGN: Nicole Leaper http://www.nicoleleaper.com
Solving problems with machine
learning
A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by
P, improves with experience E (Tom Mitchell)
Thanks to the law of large numbers this is possible
Hoeffding's inequality lets us measure how trustworthy
the results are
Our goal is to classify transactions as fraudulent or not
based on historical and demographic data
The training data are a set of cases (observations), each of
which is described by:
A list of attributes (input, explanatory, x, independent variables, or just
features)
Label classes (output, explained, y, dependent variable, or just labels)
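The Hoeffding bound mentioned above can be checked numerically; a quick base R sketch (the sample size and tolerance are made-up numbers):

```r
# Hoeffding's inequality: P(|nu - mu| > eps) <= 2 * exp(-2 * eps^2 * N),
# bounding how far the observed rate can drift from the true one.
hoeffding_bound <- function(eps, n) 2 * exp(-2 * eps^2 * n)

# With 10,000 labeled transactions, the probability that the observed
# fraud rate is off by more than 2 percentage points is below 0.1%:
hoeffding_bound(0.02, 10000)   # about 0.00067
```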
Data ingestion
Tidy your data
Each variable forms a column
Each observation forms a row
Each type of observational unit
forms a table
Ensure fast access to the data
Convert data into a compressed format
optimized for fast processing
XDF (eXternal Data Frame) is a binary
file format with an R interface that
optimizes row and column processing
and analysis
Move computation to where the data
is stored
Set remote compute context
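The import-and-relocate steps above might look like this with RevoScaleR (a sketch; the file paths and connection string are placeholders, and an MRC/MRS installation is assumed):

```r
library(RevoScaleR)

# Convert a CSV into the compressed, chunked XDF format
csv <- RxTextData("transactions.csv")
rxImport(inData = csv, outFile = "transactions.xdf", overwrite = TRUE)

# Move computation to where the data lives, e.g. a SQL Server instance
cc <- RxInSqlServer(connectionString =
        "Driver=SQL Server;Server=myServer;Database=myDb;...")
rxSetComputeContext(cc)
```

After rxSetComputeContext, subsequent rx* calls run on the remote platform rather than on the client.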
Data preparation
Predictive models should provide the most accurate and reliable
predictions
Feel free to add variables, transform them, and play with model
parameters
Find more data
Datasets can be combined if they have at least one common
variable
Impute missing values
Try to minimize changes in the variables' distributions
Correct bad data
Data that does not comply with business rules or common sense
Deal with outliers
Unusual values are those that fall outside the 1.5 * IQR range
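The 1.5 * IQR rule from the last bullet can be expressed directly in base R (the sample amounts are made up):

```r
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

amounts <- c(10, 12, 11, 13, 9, 500)   # made-up transaction amounts
flag_outliers(amounts)                 # only the 500 is flagged
```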
Data profiling
For each variable
Check how much information it contains (variance will help)
Assess its quality (range, number of missing observations, duplicates,
outliers)
Search for patterns
If a systematic relationship exists between two variables, it will
appear as a pattern in the data
When you spot a pattern, ask yourself
Could this pattern be due to coincidence?
How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Descriptive statistics are simplifications - graphs tend to be
more relevant and easier to interpret
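A tiny per-variable profile along these lines in base R (the sample vector is made up); in MRS, rxSummary and rxGetVarInfo do the equivalent at scale on XDF files:

```r
# Variance (information content), missing count, and range per variable
profile_var <- function(x) c(
  variance = var(x, na.rm = TRUE),
  missing  = sum(is.na(x)),
  range    = diff(range(x, na.rm = TRUE))
)

profile_var(c(1, 2, 2, NA, 10))
```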
Data enhancement
Add features
If you have knowledge of a given domain, you can calculate them
from other attributes
Computed variables can also be the result of technical
transformations
Split data into train, test, and control sets (cross-validation is
even better but slower)
Train set is used to detect patterns
Test set is used to detect errors that are always present in the data
Control set is used only once, for the final assessment of the data
mining model
If the distribution of the output variable is heavily skewed, you
should balance it
Accuracy paradox: a model with 99.99% accuracy can be completely
useless
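The split described above takes a couple of lines of base R (the 60/20/20 proportions are illustrative):

```r
set.seed(42)                      # reproducible split
n <- 1000
split <- sample(c("train", "test", "control"), n,
                replace = TRUE, prob = c(0.6, 0.2, 0.2))

df <- data.frame(id = seq_len(n), split = split)
train   <- df[df$split == "train",   ]
test    <- df[df$split == "test",    ]
control <- df[df$split == "control", ]
```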
Data modeling
Classification and regression are
methods of supervised learning
Source data contains ground truth
Most data mining algorithms can
be used for both tasks
Logistic regression
Linear regression
Boosted decision tree
Random forest
Neural net
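As a minimal illustration of one of the listed algorithms, a logistic regression on made-up fraud data in base R; in MRS the analogous, scalable call is rxLogit with the same formula interface:

```r
set.seed(1)
d <- data.frame(
  amount  = c(rnorm(50, mean = 20, sd = 10),   # legitimate
              rnorm(50, mean = 40, sd = 10)),  # fraudulent
  isFraud = rep(c(0, 1), each = 50)
)

# Fit the model and score the training rows
m <- glm(isFraud ~ amount, data = d, family = binomial)
p <- predict(m, type = "response")   # predicted fraud probabilities
```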
Model evaluation
There is no single best model
Ongoing evaluation of model performance is
a must
The best models are simple models that fit
data well
We need a balance between accuracy and
simplicity
In a binary classification scenario, the target
variable has only two possible outcomes
One is called positive (p), the second negative (n)
Since for each case the true value of the output
variable is known, we can simply submit
these records for classification and compare
the predictions with the true values
Model evaluation cont.
You can deduce from the confusion matrix a series of measures assessing
the quality of the classifier
accuracy = (TP + TN) / (TP + FN + FP + TN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F-score = 2 * (precision * recall) / (precision + recall)
A single measure is handy
High F-score means high precision and recall
AUC is equal to the probability that a classifier will rank a randomly
chosen positive instance higher than a randomly chosen negative one
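The measures above, computed from a confusion matrix in base R (the cell counts are made up); note how the skewed class distribution inflates accuracy relative to precision and recall:

```r
conf_measures <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall    = recall,
    f_score   = 2 * precision * recall / (precision + recall))
}

# 100 positives among 1,000 cases: accuracy 0.97, but recall only 0.80
conf_measures(tp = 80, fp = 10, fn = 20, tn = 890)
```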
Model improvement
Model quality depends on many elements
Gathering relevant and representative source data
Proper data preparation
Enriching train data
Selecting the appropriate algorithm
Hyperparameters tuning
Do you remember the data mining life cycle?
Model operationalization
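With MRS 9's MRSDeploy feature, publishing a trained model as a web service looks roughly like this (a sketch; the server URL, credentials, and service name are placeholders, and `model` is a previously trained model object):

```r
library(mrsdeploy)

# Authenticate against an operationalization-enabled R Server
remoteLogin("http://localhost:12800",
            username = "admin", password = "<password>",
            session = FALSE)

# Publish the model as a versioned scoring web service
api <- publishService(
  "fraudScoring",
  code    = "score <- rxPredict(model, newData)",
  model   = model,
  inputs  = list(newData = "data.frame"),
  outputs = list(score = "data.frame"),
  v = "1.0"
)
```

MRS 9.1's real-time scoring builds on the same publishing mechanism.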
Thank you
We moved from raw data to an
intelligent fraud detection system in
one hour
Take some time to walk through the
code at your pace
Please evaluate all sessions
After this session, you can speak
with me
In the conference venue
Via social media
https://www.linkedin.com/in/marcinszeliga/
Via email [email protected]