Driving Results Through Real-Time

Production Model
Lifecycle Management
Prepared for ODSC
Wed, Aug 24, 2016
http://www.meetup.com/San-FranciscoODSC/events/232754330/
[email protected]
Linkedin.com/in/GregMakowski
© 2016 LigaData, Inc. All Rights Reserved.
Contents
Develop a Robust Solution (or get fired)
Selecting the Best Model w/ Model Notebook
Describing the Model
Putting a Model in Production
Model Drift over Time (Non-Stationary)
Retrain, Refresh or update DBC Preprocessing
Kamanja Open Source PMML Scoring Platform
Develop a Robust Solution (or get fired)
Epsilon (owned by American Express then)
ACG's first neural network (1992) (~40 quants in the Analytic Consulting Group)
Score 250MM households every month, pick the best 5MM households
A neural net by a previous consultant:
  did great "in the lab"!!
  did "reasonable" in month 1
  did "worse" in month 2
  "bad" in month 3 (no lift over random)
  the prior consultant was fired
I was hired, and told why I was replacing him
My model captured the same response with 4MM households mailed
It was stable for 24+ months, saving $1MM / month
Why? A good KDD process (Knowledge Discovery in Databases)
Model Notebook
Bad vs. Good
Model Notebook
R package "caret"
The same parameter-search wrapper works over 217 algorithms
http://topepo.github.io/caret/index.html
A "section" of a model notebook
Still need to track the results of each section
(Bad vs. Good)
217 R Algorithms Covered
Do you really want a one-off solution?
• Experimenting with Algorithms
• Experimenting with Algorithm Parameters
• Variable description → refine preprocessing
• …
• Deep Learning architectures have many parameters and network designs
(Bad vs. Good)
Model Notebook
Q) What is the best outcome metric? ROC, R2, Lift, MAD ….
A) A deployment simulation of cost-value-strategy
Is the business deployment over all of the score range [0 … 1]?
Just over the top 1% or 5% of the score? (then NOT ROC, R2, corr)
Does the business problem use the 80-20 rule? Have a long tail?
Are some records 5× or 20× more valuable?
→ Use cost-profit weighting, or a more complex system
Is this taught in data mining competitions or classes?
(Bad vs. Good)
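As a sketch of a deployment simulation, the idea of "score only the top k% and total the realized $" can be expressed in a few lines. The scores and per-record dollar values below are made-up numbers, purely for illustration:

```python
import numpy as np

def top_k_profit(scores, profits, k=0.05):
    """Deployment simulation: act only on the top k fraction of scores
    and total the realized $ value, instead of a whole-range metric."""
    n_act = max(1, int(len(scores) * k))
    top = np.argsort(scores)[::-1][:n_act]   # indices of the best-scored records
    return float(profits[top].sum())

# Illustrative data: model scores and the $ value of acting on each record
scores = np.array([0.1, 0.9, 0.5, 0.7])
profits = np.array([1.0, 100.0, 10.0, 50.0])
print(top_k_profit(scores, profits, k=0.5))  # $ from acting on the top 50%
```

When the business only ever acts on the top 1-5% of scores, ranking candidate models by this number can disagree with ROC or R2 over the full score range.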
Calculate $ of "Business Pain"
Need to deeply understand business metrics – at least use Type I vs. Type II error weighting.
[Chart: $ of business pain vs. forecast error around zero. Equal-sized mistakes cause unequal PAIN in $:
a small under-stock costs ~15% in business pain $, a large under-stock ~30%,
while an over-stock costs only ~1% (a 4-week over-supply of a SKU → a 30%-off sale).]
Treat the two directions equally? No way – that could get you fired!
(New progress in getting feedback.)
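The asymmetric under-stock vs. over-stock pain above can be captured in an evaluation function. A minimal sketch, with the 15% / 1% weights taken from the chart as illustrative fractions of unit value, not real figures:

```python
import numpy as np

def business_pain(forecast, actual, under_cost=0.15, over_cost=0.01):
    """Asymmetric $ loss: an under-stock (forecast below actual demand)
    is weighted ~15x more painful than the same-sized over-stock.
    The under_cost / over_cost weights are illustrative, not real figures."""
    err = forecast - actual
    # Negative error = under-stock (expensive); positive = over-stock (cheap)
    pain = np.where(err < 0, -err * under_cost, err * over_cost)
    return float(pain.sum())
```

Ranking models by this pain function, rather than a symmetric MAD or RMSE, sorts them by the business cost they would actually incur in deployment.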
Model Notebook
Outcome Details
What would you do? (yours may be different)
• My Heuristic Design Objectives:
– Accuracy in deployment
– Reliability and consistent behavior, a general solution
  • Use one or more hold-out data sets to check consistency
  • Penalize more as the forecast becomes less consistent
– No penalty for model complexity (if it validates consistently)
  • Let me drive a car to work, instead of limiting me to a bike
– A message for the check writer
– Don't consider only Occam's Razor: value consistently good results
– Develop a "smooth, continuous metric" to sort and find models that perform "best" in future deployment
Model Notebook
Outcome Details
• Training = results on the training set
• Validation = results on the validation hold-out
• Gap = abs( Training – Validation )
A bigger gap (volatility) is a bigger concern for deployment – it is a symptom.
Minimize Senior VP heart attacks! (one penalty for volatility)
Set expectations & meet expectations. Regularization helps significantly.
• Conservative Result = worst( Training, Validation ) + Gap_penalty
  Corr / Lift / Profit → higher is better: Cons Result = min(Trn, Val) – Gap
  MAD / RMSE / Risk → lower is better: Cons Result = max(Trn, Val) + Gap
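The Conservative Result rule above is simple enough to sketch directly, using the gap itself as the penalty:

```python
def conservative_result(train, val, higher_is_better=True):
    """Penalize the train/validation gap so volatile models sort lower.
    Corr/Lift/Profit: min(train, val) - gap.  MAD/RMSE/Risk: max(train, val) + gap."""
    gap = abs(train - val)
    if higher_is_better:
        return min(train, val) - gap   # worst result, further penalized
    return max(train, val) + gap       # worst result, further penalized
```

Sorting a model notebook by this single number surfaces models that are both good and consistent, rather than models that merely peaked on one data set.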
Model Notebook
Bad vs. Good
Model Notebook Process
Tracking Detail ➔ Training the Data Miner
[Figure: "The Data Mining Battle Field" – a model-notebook grid with columns such as Input/Test, Outcome, Top 5%, Top 10%, Top 20%, More, and rows for algorithms such as Regression, AutoNeural, and Neural. ("Yippeee!")]
Heuristic Strategy:
1) Try a few models of many algorithm types (seed the search)
2) Opportunistically spend more effort on what is working (invest in top stocks)
3) Still try a few trials on medium successes (diversify, limited by the project time-box)
4) Try ensemble methods, combining model forecasts & top source vars w/ the model
When Rejecting Credit – The Law Requires 4 Record-Level Reasons
The law does not care how complex the model or ensemble was.
e.g. NOT sex, age, marital status, race, ….
e.g. "over 180 days late on 2+ bills"
There are solutions to this constraint, for an arbitrary black box.
The solutions have broad use in many areas of the model lifecycle.
Should a data miner cut algorithm choices, so they can come up with reasons?
97% of the time, NO! (or let me compete with you)
Focus on the most GENERAL & ACCURATE system first
"I understand how a bike works, but I drive a car to work"
"I can explain the model, to the level of detail needed to drive your business"
A VP does not need to know how to program a B+ tree in order to make a SQL vendor purchase decision. (Be a trusted advisor)
Description Solution – Sensitivity Analysis
(OAT) One At a Time
https://en.wikipedia.org/wiki/Sensitivity_analysis
[Diagram: (S) source fields → an Arbitrarily Complex Data Mining System → the target field; measure the delta in the forecast.]
Present record N, S times, each time with one input 5% bigger (a fixed input delta)
Record the delta change in the output, S times per record
Aggregate: average(abs(delta)) – the target change per input-field delta
For source fields with binned ranges, sensitivity tells you the importance of the range, i.e. "low", …, "high"
Can put sensitivity values in Pivot Tables, or Cluster them
Record-level "reason codes" can be extracted from the most important bins that apply to the given record
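The OAT procedure above can be sketched against any black-box scorer. This is a minimal illustration, not the deck's production code; `model_fn` stands in for the arbitrarily complex system:

```python
import numpy as np

def oat_sensitivity(model_fn, X, delta=0.05):
    """One-At-a-Time sensitivity for an arbitrary black-box scorer.
    Present each record S times, each time with one input 5% bigger,
    and aggregate average(abs(delta in forecast)) per input field."""
    base = model_fn(X)                       # forecasts on the unperturbed records
    n_fields = X.shape[1]
    sens = np.zeros(n_fields)
    for s in range(n_fields):
        Xp = X.copy()
        Xp[:, s] *= (1.0 + delta)            # bump one field only
        sens[s] = np.mean(np.abs(model_fn(Xp) - base))
    return sens                               # importance per source field
```

Because only the inputs and outputs are touched, the same code works for a neural net, a gradient-boosted ensemble, or a stacked combination of both.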
Description Solution – Sensitivity Analysis
Applying Reasons per record (independent of var ranking)
• Reason codes are specific to the model and the record
• Ranked predictive fields, with each record's binary indicators:

                          Mr. Smith (record 1)   Mr. Jones (record 2)
  max_late_payment_120d            0                      1
  max_late_payment_90d             0                      0
  bankrupt_in_last_5_yrs           1                      1
  max_late_payment_60d             1                      0

• Mr. Smith's reason codes include the top-ranked fields whose indicators apply to his record
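Selecting record-level reason codes from ranked fields can be sketched as a small lookup: among the indicator fields that apply to this record, return the most important ones. The function name and data layout here are illustrative assumptions, with importances such as those from an OAT sensitivity run:

```python
def reason_codes(field_importance, record, top_n=4):
    """Record-level reason codes: among binned indicator fields that
    apply to this record (value == 1), return the most important ones.
    field_importance maps field name -> importance score."""
    applies = [(name, imp) for name, imp in field_importance.items()
               if record.get(name) == 1]
    applies.sort(key=lambda kv: kv[1], reverse=True)   # most important first
    return [name for name, _ in applies[:top_n]]
```

Note the reasons are per record: two applicants with the same score can get different reason codes, because different bins apply to each of them.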
Description Solution – Alternatives
R's caret offers some feature selection:
• http://topepo.github.io/caret/featureselection.html
Wrapper methods
• Recursive feature elimination
• Genetic algorithms
• Simulated Annealing
Filter methods (univariate)
Variable Importance
• http://topepo.github.io/caret/varimp.html
• Algorithm specific (9 kinds)
• Model Independent Metrics
  If classification: ROC curve analysis (univariate) per predictor
  If regression: fit a linear model
With variable ranking, you still need to relate a field ranking to record-level reasons.
Univariate methods do NOT cover variable interactions in the model, or non-linearity.
Description Solution
Local Interpretable Model-agnostic Explanations (LIME)
"Why Should I Trust You?" Explaining the Predictions of Any Classifier – Knowledge Discovery in Databases (KDD) 2016 (August 13-17)
https://arxiv.org/abs/1602.04938 (PDF)
https://github.com/marcotcr/lime-experiments (Python code)
Describes models locally, in terms of their variables
Minimizes a locality-aware loss
Description Solution
Local Interpretable Model-agnostic Explanations (LIME)
Putting a Model in Production
Cut out extra preprocessed variables not used in the final model
Minimize passes over the data
In many situations, I have had to RECODE the prep and/or the model to meet production system requirements
• BAD: recode to Oracle, move SAS to the mainframe & create JCL – could take 2 months for conversion & full QA
• GOOD: generate PMML code for the model – build up a PMML preprocessing library, like Netflix
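To make the PMML path concrete, here is a minimal, hand-written sketch of what an exported model can look like: a single-predictor regression model as a PMML 4.2 document. The field name and coefficients are made up for illustration; real exporters generate this from the trained model:

```xml
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header description="Example model exported to PMML"/>
  <DataDictionary numberOfFields="2">
    <DataField name="max_late_payment_90d" optype="continuous" dataType="double"/>
    <DataField name="response" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="max_late_payment_90d"/>
      <MiningField name="response" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="0.12">
      <NumericPredictor name="max_late_payment_90d" coefficient="0.85"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the preprocessing (TransformationDictionary) and the model travel in one XML document, a PMML scoring engine such as Kamanja can deploy it without the months-long recode-and-QA cycle.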
Putting a Model in Production
www.DMG.org/PMML/products
Tracking Model Drift
(easy to see with 2 input dimensions vs. score)
[Figure: a cluster of Training Data vs. a cluster of Current Scoring Data, drifting apart.]
Tracking Model Drift
A trained model is only as general as:
• the variety of behavior in the training data
• the artifacts abstracted out by preprocessing
A good KDD process and variable design make the analysis universe like the general scoring universe
Over time, there is "drift" between the behavior represented in the scoring data and the original training data
Stock market cycles: Bull → Bear → Bull → …
Tracking Model Drift
MODEL DRIFT DETECTOR in N dimensions
• Change in the distribution of the target (alert when over a threshold)
  During training, find thresholds for 10 or 20 equal-frequency bins of the score
  During scoring, look at key thresholds around business decisions (act vs. not)
  Has the % over the fixed threshold changed much?
  Chi-square or KL Divergence (contingency-table metrics)
• Change in the distribution of the most important input fields
  Diagnose CAUSES – what is changing, and how much
  Out of the top 25% of the most important input fields, which had the largest change in the contingency-table metric?
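A sketch of the score-distribution half of the detector: learn equal-frequency bin edges at training time, then compare scoring-time bin proportions against the training proportions with KL divergence. The bin count and smoothing constant are assumptions; the same comparison applies to the top input fields:

```python
import numpy as np

def score_drift(train_scores, new_scores, n_bins=10, eps=1e-9):
    """Drift check on the score distribution: equal-frequency bins are
    learned from the training scores, then KL(train || new) is computed
    over the bin proportions.  Larger values = more drift."""
    # Interior edges of n_bins equal-frequency bins, fixed at training time
    edges = np.quantile(train_scores, np.linspace(0, 1, n_bins + 1)[1:-1])
    def bin_props(x):
        counts = np.bincount(np.searchsorted(edges, x, side="right"),
                             minlength=n_bins)
        return counts / len(x)
    p, q = bin_props(train_scores), bin_props(new_scores)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(1)
train = rng.random(20_000)
same = rng.random(20_000)          # same distribution -> KL near 0
shifted = rng.random(20_000) ** 2  # drifted distribution -> clearly larger KL
```

Alerting when this number crosses a threshold catches drift from the score side; running the same check per input field diagnoses which fields are causing it.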
Tracking Model Drift
The most frequent process in companies – RETRAIN EVERY DAY
• Does yesterday's 4th-of-July-sale training data best represent your 5th-of-July activity?
• Have you "forgotten" past lessons not in yesterday's data?
  The Stability vs. Plasticity dilemma, or:
  learn how to play the guitar without forgetting grandmother
  What about fraud cases from 6 months ago?
The same issues exist in online training
• Drifting vs. forgetting?
Whichever you do, choose robustness and transparency
Retrain, Refresh or Update DBC
Model Retrain
• Brute force: most effort, most expense, most reliable
• Repeat the full data mining model training project
• Re-evaluate all algorithms, preprocessing, ensembles
• Takes 1-2 months
Model Refresh
• "Minimal retraining"
• Just re-run the final 1-3 model trainings on "fresher" data
• Do not repeat exploring all algorithms and ensembles
• Assume the "structure" is a reasonable solution
• Go back to your prior Model Notebook – choose the best as a shortcut
• Takes about 1 week
DBC – Dependent By Category
(Powerful Preprocessing, Bayesian Priors)
Find the top ~10-20 most predictive variables to date
Explore interactions in a hierarchy – like mini-OLAP cubes, with the target average in each cell
Use the "best fit": the most granular cell that is significant
• Use the most granular OLAP cell, WITH A MINIMUM RECORD COUNT or significance test
• If the same number of dimensions qualifies, use the most extreme target value
The hierarchy of interactions:
  A*B*C*D
  A*B*C, A*B*D, A*C*D, B*C*D
  A*B, …, C*D
  A, B, C, D
Frequently produces 4-6 of the top 10 most predictive variables
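A minimal sketch of the DBC idea: pre-compute the average past target for every combination of category dimensions, keep only cells above a minimum record count, then score by the most granular significant cell. The field names are hypothetical, and the tie-break ("most extreme target value" among same-size combinations) is simplified here to first-match:

```python
from itertools import combinations
import statistics

def build_dbc(records, dims, target, min_count=25):
    """Pre-calculate lookup tables: for each combination of dims
    (A*B*C*D down to A, B, C, D), the average past target per cell,
    keeping only cells with at least min_count records."""
    tables = {}
    for k in range(len(dims), 0, -1):
        for combo in combinations(dims, k):
            cells = {}
            for r in records:
                cells.setdefault(tuple(r[d] for d in combo), []).append(r[target])
            tables[combo] = {cell: statistics.mean(v)
                             for cell, v in cells.items() if len(v) >= min_count}
    return tables

def dbc_score(tables, record, dims):
    """Apply: look up the record's conditions, using the most granular
    cell that was significant, falling back to coarser combinations."""
    for k in range(len(dims), 0, -1):
        for combo in combinations(dims, k):
            cell = tuple(record[d] for d in combo)
            if cell in tables.get(combo, {}):
                return tables[combo][cell]
    return None
```

Because the model consumes only the looked-up value, the tables can be recalculated weekly without retraining the model weights, which is exactly the "Update DBC" option.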
DBC Example
Average past Lift per category:
• Percent-off bin (i.e. 0%, 5%, 10%, 15% … 80%)
• Price Savings bin (i.e. $2, $4, $6 …)
• Store hierarchy
• Product hierarchy (50k to 100k SKUs, 4-6 levels)
  – Department, Sub-department, Category, Sub-Category
• Seasonality: time, month, week
• Reason codes (the event is a circular, clearance)
• Location on the page in the flyer (top right, top left …)
Multivariate combinations – powerful & scalable
DBC – Interactions
1) Pre-calculate a lookup table, with the past average target for each set of prior conditions
2) Apply by looking up the conditions for a given store-item, returning the target estimate
DBC – Dependent By Category
(Update Tables to help the model live longer)
Recalculate the cell values weekly or monthly (~1 hour)
• Low computational cost, low effort
• Capture the "latest fraud trends"
• The model weights on the field can remain the same
• Adapt to 1,000's of small, incremental changes, without having to Retrain or Refresh the model
• Can choose to keep pockets of past bad behavior, to recognize it in the future
"Balance Stability vs. Plasticity"
Solution Architecture for Threat and Compliance
Lambda Architecture with Continuous Decisioning
[Architecture diagram with six numbered components.]
Solution Stack for Threat and Compliance
Leveraging Primarily Open Source Big Data Technologies
Continuous Decisioning
Use Case: Cyber Threat Detection & Response
Use Kamanja to detect potential cyber security breaches
Problem
• Diverse inputs: structured and unstructured data, with varying latencies
• Data enrichment: a long and laborious process, manual and ad hoc
• Quality of threat intelligence: lots of false positives waste analyst resources
• Poor integrations with response teams: a manual and time-consuming process
Solution
• Ingest IP addresses, malware signatures, hash values, email addresses, etc. in real time
• Automatically enrich with third-party data
• Check historical logs against new threats continuously
• Predictive analytics based on machine learning flags suspicious activity before it becomes a problem
• Direct integration with dashboards to generate alerts and speed up investigation
Continuous Decisioning
Use Case: Application Monitoring
Use Kamanja to detect insider attacks on sensitive data
Problem
• Legacy system is batch oriented
• Months required to create and implement new alerts
• Slow speed-to-market developing new source system extracts; months required to assimilate new data
• Risks to PII and NPI, with compliance implications
Solution
• Use an open source big data stack to migrate to real-time data streaming, rapid model deployment, and alerts with no manual intervention
• Calculate the number of times PII/NPI is accessed over an eight-hour period, and calculate the risk to generate alerts
• Machine learning to identify the normal pattern of out-of-office-hours access; trigger automatic alerts when anomalies occur
• Rapid implementation of new models to deal with emerging threats
Continuous Decisioning
Use Case: Unauthorized Trading Detection
Use Kamanja to reduce the risk of rogue behavior at an investment bank
Problem
• Need timely alerting of potentially unauthorized trading activity
• Must tie together voluminous data, reports, and risk measures
• Meet increasingly stringent time requirements
Solution
• Create a Trader Surveillance Dashboard
• Provide a holistic view of a trader, based on all relevant information about the trader, the marketplace, and peers
• Build supervised and unsupervised machine learning models based on operational, transactional, and financial data
• Real-time analysis and monitoring of trader activity automatically highlights unusual activity and triggers alerts on trades to investigate
Continuous Decisioning
Use Case: Credit Card Fraud Detection
Use Kamanja to incrementally reduce fraud losses by applying multiple predictive models for transaction authorization
Problem
• $16.3 billion in credit card fraud losses annually
• Fraud is growing more quickly than transaction value
• New types of fraud are one step ahead of existing solutions
• Dependence on third-party proprietary systems means slow reaction times and expensive changes
Solution
• Apply Kamanja to IVR, web, and transactional data to trigger alerts
• Initial models detect suspicious web traffic, common purchase points, and application rarity
• Leverage existing infrastructure as well as existing third-party systems (Falcon and TSYS)
• Reduce costs by 80% with open source software
Summary
You can have it all: accurate, general & describable
• You may fully understand a bike – but drive a car to work (level of detail)
Control and plan complexity: track in a model notebook
• Reuse the notebook when you need to retrain
• Balance accuracy and generalization in the notebook outcomes
• Track business net value per model (be more competitive)
Model- and record-level description helps the model lifecycle
• Helps during model building, to improve preprocessing, DBC
• Helps gain trust
• Helps track model drift and degradation
Use Kamanja, a real-time decisioning engine, for production deployment
Thank You
Wed, Aug 24, 2016
[email protected]
www.Linkedin.com/in/GregMakowski
www.Kamanja.org (Apache open source licensed)