Boosting techniques

Machine learning techniques for
credit risk modeling in practice
Balaton Attila
OTP Bank
Analysis and Modeling Department
2017.02.23.
„Machine learning is the subfield of computer science that gives computers the ability to
learn without being explicitly programmed (Arthur Samuel, 1959). Evolved from the study of
pattern recognition and computational learning theory in artificial intelligence, machine
learning explores the study and construction of algorithms that can learn from and make
predictions on data…”*
* Wikipedia
New, complex databases needed new modeling tools
[Diagram: internal databases, GIRINFO, utility companies, retailers and social networks feed one complex, large database to be analyzed; machine learning mines this wide-ranging dataset about customer behavior, recognizes connections between different variables, and delivers powerful models and a better GINI.]
Why Machine Learning: to mine new, large complex datasets
Machine Learning vs. Traditional Stats

The actual phenomenon:
[Figure: the same GOOD/BAD separation problem as fitted by each approach]

Description:
• Traditional stats: fits a predetermined (linear, quadratic, logarithmic) function to the data.
• Machine Learning: ML algorithms do not use a predetermined function, so they can build a model that closely fits the data.

Self-learning:
• Traditional stats: not available; regular expert supervision is needed.
• Machine Learning: self-learning is possible to some extent (variable weights can be changed automatically).

Dataset and complexity:
• Traditional stats: adequate for well-structured databases; cannot handle complex, poorly structured datasets.
• Machine Learning: works well with small or poorly structured datasets; recognizes complex patterns.

Interpretation of results:
• Traditional stats: easy to interpret the results and the effect of explanatory variables.
• Machine Learning: model interpretation requires expertise.

Hardware capacity:
• Traditional stats: less computationally intensive.
• Machine Learning: demands more computational power.
Need for interpretability
[Chart: model interpretability (completely understandable, well comprehensible, „black box”) plotted against the forecast timeframe (sec, min, hour, day, week, month, year).]
New risk models are a sub-project of the Bank’s Digital Strategy

Roadmap (Jun. 2016, Dec. 2016, Jun. 2017, Dec. 2017):
• Establishing a Python server
• Involving the OTP HU Big Data environment: weblog and detailed transactional data for fraud prevention
• OTP HU application models: internal development of a new scorecard using AMM techniques
• OBRU and OBR application models: internal development of at least two scorecards with AMM
• Trainings for subsidiaries about the alternative (AMM) techniques; regular trainings about AMM
• AMM becomes part of the folklore during model development: subsidiaries test AMM techniques, external teams support the validation process
Machine Learning in practice
Machine Learning techniques cannot replace the whole „classic” model development lifecycle
We tested the following Machine Learning techniques…
Random forest: the „average” of many random trees. Did not perform well for scorecards.
Support Vector Machine: the goal is to find the hyperplane with optimal separating power. The extended version with kernel functions had capacity issues, so it was only a supporting algorithm combined with regression.
Neural network: cannot be used for real-time decision making.
Boosting techniques: supervised repetition of „weak classifiers” (decision trees, regression, …) leads to a strong classifier.
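For illustration, the sketch below shows how such a comparison of model families can be set up with scikit-learn; the synthetic, imbalanced good/bad dataset, the hyperparameters and the Gini = 2·AUC − 1 conversion are assumptions for this example only, not the data or models actually used at OTP.

```python
# Illustrative comparison of the tested model families on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced binary target, loosely mimicking a good/bad default flag.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": SVC(kernel="rbf", random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    "boosting": GradientBoostingClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name:>14}: AUC = {auc:.3f}, Gini = {2 * auc - 1:.3f}")
```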
Main types of boosting:
• AdaBoost: underweight well-classified and overweight misclassified elements in every round
• LogitBoost: a special type of AdaBoost where the loss function is logistic
• Gradient Boosting Trees: a special type of LogitBoost where the loss function is decreased along the gradient
Random Forest
AdaBoost
• Base classifiers: $C_1, C_2, \ldots, C_T$
• In step $m$, find the best $C_m$ in a predefined class using the weights $w_i$
• Error rate:
$$\varepsilon_m = \sum_{i=1}^{N} w_i \,\mathbf{1}\!\left\{ C_m(x_i) \neq y_i \right\}$$
• Importance of a classifier:
$$\alpha_m = \frac{1}{2} \ln\!\left( \frac{1-\varepsilon_m}{\varepsilon_m} \right)$$
AdaBoost
• Weight update:
$$w_i^{(m+1)} = \frac{w_i^{(m)}}{Z_m} \times
\begin{cases}
\exp(-\alpha_m) & \text{if } C_m(x_i) = y_i \\
\exp(\alpha_m) & \text{if } C_m(x_i) \neq y_i
\end{cases}$$
where $Z_m$ ensures $\sum_i w_i^{(m+1)} = 1$
• Classification:
$$C^{*}(x_i) = \arg\max_{y} \sum_{m=1}^{T} \alpha_m \,\mathbf{1}\!\left\{ C_m(x_i) = y \right\}$$
AdaBoost example
(from “A Tutorial on Boosting” by Yoav Freund and Rob Schapire)
[Figures: a worked AdaBoost example over several rounds of reweighting]
Exercise: compute ε, α and the weights of the instances in Round 2.
Main AdaBoost idea and a new idea
• “Shortcomings” are identified by high-weight data points
• The new model (e.g. a stump) is fit irrespective of the previous predictions
• In the next iteration, learn just the residual of the present model $F(x)$:
• Fit a model to $(x_1,\, y_1 - F(x_1)),\ (x_2,\, y_2 - F(x_2)),\ \ldots,\ (x_n,\, y_n - F(x_n))$
• Regression, no longer classification!
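A tiny sketch of this residual-fitting step on synthetic data: starting from a constant model $F$, the next weak learner is trained on the pairs $(x_i,\, y_i - F(x_i))$; the data, the constant start and the stump are illustrative assumptions.

```python
# One "fit the residuals" boosting step on synthetic regression data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F = np.full_like(y, y.mean())       # current model: just a constant prediction
residuals = y - F                   # what the current model still gets wrong
stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
F_next = F + stump.predict(X)       # improved model after one boosting step

print("MSE before:", np.mean((y - F) ** 2))
print("MSE after: ", np.mean((y - F_next) ** 2))
```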
LogitBoost: The additive logistic regression model
• Logistic regression learns a linear combination of classifiers for the “log odds ratio”:
$$\frac{1}{2}\log\frac{P(y=1 \mid x)}{P(y=-1 \mid x)} = F(x) = \sum_{m=1}^{M} f^{(m)}(x)$$
• The logit transformation guarantees that for any $F(x)$, $p(x)$ is a probability in $[0,1]$; inverting, we get:
$$p(x) := P(y=1 \mid x) = \frac{1}{1 + e^{-2F(x)}}$$
• A function of the real label, $y - p(x)$, will be the instance weight.
LogitBoost Algorithm
Step 1: Initialization
committee function: $F^{(0)}(x) = 0$
initial probabilities: $p^{(0)}(x) = \tfrac{1}{2}$

Step 2: LogitBoost iterations
For $m = 1, 2, \ldots, M$ repeat:

A. Fitting the weak learner:
1. Compute the working response and weights for $i = 1, \ldots, n$:
$$w_i^{(m)} = p^{(m-1)}(x_i)\,\bigl(1 - p^{(m-1)}(x_i)\bigr), \qquad
z_i^{(m)} = \frac{y_i - p^{(m-1)}(x_i)}{w_i^{(m)}}$$
2. Fit a stump $f^{(m)}$ by weighted regression:
$$f^{(m)} = \arg\min_{f} \sum_{i=1}^{n} w_i^{(m)} \bigl( z_i^{(m)} - f(x_i) \bigr)^2$$
Optimization is no longer for the error rate but for the (root) mean squared error (RMSE).
LogitBoost Algorithm
B. Updating and classifier output
$$F^{(m)}(x_i) = F^{(m-1)}(x_i) + \tfrac{1}{2} f^{(m)}(x_i)$$
$$p^{(m)}(x_i) = \frac{1}{1 + e^{-2 F^{(m)}(x_i)}}$$
$$C^{(m)}(x_i) = \operatorname{sign}\bigl( F^{(m)}(x_i) \bigr)$$
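Putting steps A and B together, a minimal sketch of the LogitBoost iterations, assuming 0/1 labels and scikit-learn regression stumps as the weak learners (illustrative choices, not a production implementation):

```python
# LogitBoost iterations: working response z, weights w, weighted stump, 1/2-step update.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost_fit(X, y, M=50, eps=1e-12):
    """y is assumed to take values in {0, 1}."""
    n = len(y)
    F = np.zeros(n)                               # F^(0)(x) = 0
    p = np.full(n, 0.5)                           # p^(0)(x) = 1/2
    stumps = []
    for _ in range(M):
        w = np.clip(p * (1.0 - p), eps, None)     # working weights w_i
        z = (y - p) / w                           # working response z_i
        stump = DecisionTreeRegressor(max_depth=1)
        stump.fit(X, z, sample_weight=w)          # weighted least-squares fit of f^(m)
        stumps.append(stump)
        F = F + 0.5 * stump.predict(X)            # F^(m) = F^(m-1) + 1/2 f^(m)
        p = 1.0 / (1.0 + np.exp(-2.0 * F))        # updated probabilities p^(m)
    return stumps

def logitboost_predict_proba(X, stumps):
    F = 0.5 * sum(s.predict(X) for s in stumps)
    return 1.0 / (1.0 + np.exp(-2.0 * F))         # classify with p >= 0.5, i.e. sign(F)
```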
Regression Tree, Regression Stump
• Regression stump example:
  if TIME_FROM_FIRST_SEND <= 25.5 (true):  predict  0.14627427718697633
  else (false):                            predict -0.14621897522944
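For illustration, a stump of exactly this shape is a depth-1 regression tree; the sketch below fits one with scikit-learn on synthetic data, reusing only the TIME_FROM_FIRST_SEND name from the example above (export_text requires a reasonably recent scikit-learn):

```python
# A regression stump = one split, two constant leaf values.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = rng.uniform(0, 60, size=(500, 1))            # stand-in for the single predictor
z = np.where(X[:, 0] <= 25.5, 0.15, -0.15) + rng.normal(scale=0.02, size=500)

stump = DecisionTreeRegressor(max_depth=1).fit(X, z)
print(export_text(stump, feature_names=["TIME_FROM_FIRST_SEND"]))
```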
Gradient Boosting
As in LogitBoost:
• Iterative prediction $y^{*}_m$
• Residuals: $y^{*}_{m+1} = y^{*}_m + h(x)$, where $h(x)$ is a simple regressor, e.g. a stump or a shallow tree

New idea: optimize by gradient descent
• If we minimize the mean squared error to the true values $y$, averaged over the training data, the (negative) derivative of $(y^{*} - y)^2$ can be computed and is proportional to the error $y^{*} - y$:
$$-\frac{d}{dF_j(x_i)} \bigl( y_i - F_j(x_i) \bigr)^2 = 2 \bigl( y_i - F_j(x_i) \bigr) = 2 \cdot \text{residual}$$
• Well… this is just LogitBoost without the logistic loss function
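A compact sketch of this gradient-descent view with squared loss: each new tree is fitted to the negative gradient, which is just the residual $y - F$; the depth, learning rate and data are illustrative assumptions.

```python
# Gradient boosting with squared loss: fit each tree to the residual y - F.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, learning_rate=0.1):
    f0 = y.mean()                                # F_0: best constant under squared loss
    F = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        neg_grad = y - F                         # -dL/dF for L = (y - F)^2 / 2
        tree = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)
        F = F + learning_rate * tree.predict(X)  # gradient-descent step in function space
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```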
Other loss functions
• Squared loss is overly sensitive to outliers
• Absolute loss is more robust to outliers but is not differentiable at zero:
$$L(y; F) = \lvert y - F(x) \rvert$$
• Huber loss:
$$L(y; F) =
\begin{cases}
\tfrac{1}{2} (y - F)^2 & \text{if } \lvert y - F \rvert \le \delta \\
\delta \bigl( \lvert y - F \rvert - \delta/2 \bigr) & \text{if } \lvert y - F \rvert > \delta
\end{cases}$$
• The negative gradient is:
$$-g(x_i) = -\frac{dL\bigl(y; F(x_i)\bigr)}{dF(x_i)} =
\begin{cases}
y_i - F(x_i) & \text{if } \lvert y_i - F(x_i) \rvert \le \delta \\
\delta \operatorname{sign}\bigl( y_i - F(x_i) \bigr) & \text{if } \lvert y_i - F(x_i) \rvert > \delta
\end{cases}$$
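These formulas translate directly into NumPy; the sketch below implements the Huber loss and its negative gradient, with the threshold delta as the only parameter (the default value is just an example).

```python
# Huber loss: quadratic near zero, linear (outlier-robust) beyond delta.
import numpy as np

def huber_loss(y, F, delta=1.0):
    r = y - F
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,                       # quadratic region
                    delta * (np.abs(r) - 0.5 * delta))  # linear region

def huber_negative_gradient(y, F, delta=1.0):
    r = y - F
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))
```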
Two main Gradient Boosting Loss Functions
Deviance (as in Logistic Regression), with predicted probability
$$p(x) = \frac{1}{1 + e^{-F(x)}}$$
Exponential (as in AdaBoost), with loss
$$L\bigl(y; F(x)\bigr) = \exp\bigl( -y\, F(x) \bigr)$$
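In scikit-learn's GradientBoostingClassifier these two losses correspond to the loss="log_loss" (called "deviance" in older releases, as at the time of this talk) and loss="exponential" options; the sketch below compares them on synthetic data, with every parameter chosen only for illustration.

```python
# Deviance (logistic) vs. exponential (AdaBoost-style) loss in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "log_loss" was named "deviance" in older scikit-learn versions.
for loss in ("log_loss", "exponential"):
    gbt = GradientBoostingClassifier(loss=loss, n_estimators=200,
                                     learning_rate=0.05, max_depth=3,
                                     random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, gbt.predict_proba(X_te)[:, 1])
    print(f"loss={loss}: Gini = {2 * auc - 1:.3f}")
```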
Python server
OTP Python server (2016Q2)
Red Hat Linux with Anaconda
Jupyter notebook
- More effective than localhost environments
- Background mode
- Web application
Application credit fraud prevention
SAS Fraud Solution – logical system architecture: system architecture and working logic of OTP HU’s fraud system

Hybrid approach:
• Automated business rules
• Anomaly detection
• Predictive modeling
• Social network analysis

Alert generation process:
• Inputs: application data, historical data, behavioural data
• Entity matching and network building
• Scenario analysis (scenario scores) and network analysis
• Fraud scoring produces the list of suspicious applications
• Manual investigation, ending in approval or rejection
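Purely as a toy illustration of the hybrid alert logic (not the SAS Fraud Solution's actual API or scoring), the sketch below combines rule hits, an anomaly score, a model score and a network score into one fraud score and flags applications for manual investigation; every field name, weight and threshold is hypothetical.

```python
# Hypothetical hybrid fraud-alert logic; all names, weights and thresholds are made up.
from dataclasses import dataclass

@dataclass
class Application:
    app_id: str
    rule_hits: int         # number of automated business rules triggered
    anomaly_score: float   # 0..1, from anomaly detection
    model_score: float     # 0..1, from predictive modeling
    network_score: float   # 0..1, from social network analysis

def fraud_score(app: Application) -> float:
    # Illustrative weighting; real scenario scores are calibrated separately.
    return (0.3 * min(app.rule_hits, 3) / 3
            + 0.2 * app.anomaly_score
            + 0.3 * app.model_score
            + 0.2 * app.network_score)

def suspicious_applications(apps, threshold=0.6):
    # Applications above the threshold go to manual investigation;
    # the rest continue in the normal approval process.
    return [a.app_id for a in apps if fraud_score(a) >= threshold]
```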
Thank you for your attention!