Advanced statistical methods for
credit risk modeling in practice
Kádár Ferenc
OTP Bank
Analysis and Modeling Department
2017.02.16
Modeling what? And why?
• Types of risk in a bank: Market risk, Operational risk, Credit risk
• How can we measure credit risk? Expected loss of lending:
Risk cost = PD·LGD·EAD
Risk parameters:
• PD – Probability of Default (0, 1)
• LGD – Loss Given Default (0, 1]
• EAD – Exposure at Default (0, ∞) - generally limited
Default: 90+ days of delinquency within 1 year.
The default flag is 0 or 1 (good client – bad client)
„Risk is not measurable (outside of
casinos or the minds of people who
call themselves ’risk experts’)”
PD modeling (scorecards)
The modelling database feeds four scorecard types:
• Application scorecard for walk-in clients: sociodemographic and financial status, info on employment
• Application scorecards for OTP clients: sociodemographic and financial status, info on employment, historical client data
• Behaviour, Collection scorecards: application data + behavior data of the credit
• Early default (Fraud) scorecards: application data + behavior data of the client, network data
CRISP-DM methodology
CRoss Industry Standard Process for Data Mining
Datasets used
[Timeline figure: observation period (one year in general) ending now; Modeling, Validation and Stability datasets]
MODELING dataset is divided as follows:
• Training dataset
• Testing dataset – out-of-sample validation
• Filtered dataset
VALIDATION dataset serves as out-of-time validation („More recent database”)
STABILITY dataset is used for checking stability („Most recent database”)
„Time series analysis is similar to
sending troops after the battle”
Data preparation
Missing value handling
Data transformation
Outliers
Correlated variables
Is our modeling method
sensitive to them?
Variable transformation
Categories
Weight of Evidence: $$\mathrm{WoE} = \ln\left(\frac{\text{Goods \%}}{\text{Bads \%}}\right)$$
Grouping in SAS Enterprise Miner
Information value
B(ads) % = share of all defaults, G(oods) % = share of all non-defaults

Variable X   Default   Nondefault   Total    B(ads) %   G(oods) %   Total %   WoE       IV
= No         8096      175821       183917   58.41%     88.30%      86.36%    0.4133    0.1236
= Yes        5765      23286        29051    41.59%     11.70%      13.64%    -1.2687   0.3793
Total        13861     199107       212968   100.00%    100.00%     100.00%             0.5029

WoE = ln(G/B)
IV = (G−B) ln(G/B) per category; the total IV is the sum over categories

Information Value   Predictive Power
< 0.02              Useless for prediction
0.02 to 0.1         Weak predictor
0.1 to 0.3          Medium predictor
0.3 to 0.5          Strong predictor
> 0.5               Suspicious or too good to be true
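The WoE and IV figures on this slide can be recomputed from the raw counts. The sketch below is a minimal illustration using only the default and non-default counts shown for Variable X; the function name `woe_iv` and the dictionary layout are assumptions made for the example, not part of the original material.

```python
import math

# Default ("bad") and non-default ("good") counts per category of Variable X,
# taken from the slide.
counts = {
    "No": {"bad": 8096, "good": 175821},
    "Yes": {"bad": 5765, "good": 23286},
}

def woe_iv(counts):
    """Return ({category: WoE}, total IV) from bad/good counts per category."""
    total_bad = sum(c["bad"] for c in counts.values())
    total_good = sum(c["good"] for c in counts.values())
    woe = {}
    iv = 0.0
    for category, c in counts.items():
        b = c["bad"] / total_bad      # share of all bads falling in this category
        g = c["good"] / total_good    # share of all goods falling in this category
        woe[category] = math.log(g / b)
        iv += (g - b) * woe[category]
    return woe, iv

woe, iv = woe_iv(counts)
for category, value in woe.items():
    print(f"WoE({category}) = {value:.4f}")
print(f"IV = {iv:.4f}")
```

A category with zero goods or zero bads would make the logarithm undefined, so a real implementation would need grouping or smoothing before this step.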
Variable selection
Variable strength
Correlation matrix
Loss functions
$X$ – the vector space of all inputs,
$Y$ – the space of binary targets,
$f: X \to \mathbb{R}$, the estimator of $y \in Y$
We seek to minimize empirical risk:
$$I[f] = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), y_i)$$
$L$ is the loss function
Hinge loss: $L(f(x), y) = |1 - y f(x)|_+$
Square loss: $L(f(x), y) = (1 - y f(x))^2$ (penalizes outliers very heavily)
Huber loss: quadratic for $|x| < r$ and linear for $|x| > r$
Logistic loss: $L(f(x), y) = \log(e^{-y f(x)} + 1)$
…
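To make the difference between these losses concrete, the sketch below evaluates them on a few hypothetical scores. It assumes targets coded as −1/+1 (which the margin form $y f(x)$ implies), and uses the standard Huber definition on a residual with threshold `r`. The sample scores are invented for illustration.

```python
import math

def hinge_loss(score, y):
    """|1 - y*f(x)|_+ : zero once the margin y*f(x) reaches 1."""
    return max(0.0, 1.0 - y * score)

def square_loss(score, y):
    """(1 - y*f(x))^2 : grows quadratically with the margin error."""
    return (1.0 - y * score) ** 2

def logistic_loss(score, y):
    """log(exp(-y*f(x)) + 1)."""
    return math.log(math.exp(-y * score) + 1.0)

def huber_loss(residual, r=1.0):
    """Quadratic for |residual| <= r, linear beyond it."""
    a = abs(residual)
    if a <= r:
        return 0.5 * a * a
    return r * (a - 0.5 * r)

def empirical_risk(loss, scores, targets):
    """Average loss over the sample: (1/n) * sum L(f(x_i), y_i)."""
    return sum(loss(s, y) for s, y in zip(scores, targets)) / len(scores)

# Hypothetical model outputs f(x_i) and targets in {-1, +1};
# the last observation is a badly misclassified outlier.
scores = [2.0, 0.5, -0.3, -4.0]
targets = [1, 1, -1, 1]

for loss in (hinge_loss, square_loss, logistic_loss):
    print(loss.__name__, round(empirical_risk(loss, scores, targets), 4))
```

On the outlier alone the hinge loss is 5 while the square loss is 25, which is the sensitivity the slide warns about.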
Logistic Regression
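We look for the probability of default in the form:
$$p = \frac{1}{1 + e^{-(\beta_0 + \sum_{i=1}^{n} \beta_i X_i)}}$$
or equivalently:
$$\mathrm{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{n} \beta_i X_i$$
Advanced log-loss:
$$\frac{1}{2} \sum_{i=1}^{n} \beta_i^2 + C \cdot \log\left(e^{-y(\beta_0 + \sum_{i=1}^{n} \beta_i x_i)} + 1\right)$$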
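The relationship between the two forms of the model, and the shape of the penalized loss, can be checked numerically. The following is a minimal sketch with hypothetical coefficients. It evaluates the penalized loss for a single observation with the target coded as −1/+1, as the formula on the slide is written. Summing the log term over all observations is left out on purpose.

```python
import math

def sigmoid(z):
    """p = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """log(p / (1 - p)), the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def linear_predictor(beta0, betas, x):
    """beta0 + sum(beta_i * x_i)."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

def penalized_log_loss(beta0, betas, x, y, C=1.0):
    """0.5 * sum(beta_i^2) + C * log(exp(-y * z) + 1), with y in {-1, +1}."""
    z = linear_predictor(beta0, betas, x)
    penalty = 0.5 * sum(b * b for b in betas)
    return penalty + C * math.log(math.exp(-y * z) + 1.0)

# Hypothetical coefficients and one hypothetical client.
beta0, betas = -2.0, [0.8, -0.5]
x = [1.0, 2.0]

z = linear_predictor(beta0, betas, x)
p = sigmoid(z)
print(f"linear predictor = {z:.4f}, PD = {p:.4f}, logit(PD) = {logit(p):.4f}")
print(f"penalized loss if the client defaults (y=+1): {penalized_log_loss(beta0, betas, x, 1):.4f}")
```

A larger `C` puts more weight on fitting the data relative to shrinking the coefficients.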
LogReg Formula Example
1/(1+exp(-(
    -2.252096 +                                           -- β0
    nvl(miota_ugyfel,0) * (-0.0026184) +                  -- β1
    nvl(tx_kozv_terh_db_avg,0) * (-0.66677) +             -- β2
    (case when nvl(eletkor,40) <= 25 then 0.62735
          when nvl(eletkor,40) >= 56 then -1.25598
          else 0 end) +
    (case when nvl(cs_hazib_mind_db_avg,0) > 0 then -0.42836
          else 0 end) +
    (case when (nvl(te_egy_lak_foly_avg,116000) > -5000 and
                nvl(te_egy_lak_foly_avg,116000) <= 0) then 0.49162413
          when (nvl(te_egy_lak_foly_avg,116000) > 40000 and
                nvl(te_egy_lak_foly_avg,116000) <= 120000) then -0.4852995
          when nvl(te_egy_lak_foly_avg,116000) > 120000 then -1.40977
          else 0 end) +
    ...
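The scoring expression combines three mechanisms: a default value for missing inputs (`nvl`), linear terms, and binned terms that add a fixed coefficient per interval. The Python sketch below mirrors only the first few terms shown on the slide. Because the original formula is truncated, the result is a partial linear predictor, not the model's PD. The helper names and the sample client are assumptions made for the example.

```python
def nvl(value, default):
    """Mimic SQL NVL: substitute a default when the value is missing."""
    return default if value is None else value

def age_term(eletkor):
    """Binned contribution of the eletkor variable (missing values default to 40)."""
    age = nvl(eletkor, 40)
    if age <= 25:
        return 0.62735
    if age >= 56:
        return -1.25598
    return 0.0

def partial_linear_predictor(client):
    """Sum of the intercept and the first three terms of the slide's formula.

    The remaining terms are omitted because the source formula is truncated.
    """
    z = -2.252096                                                  # intercept (β0)
    z += nvl(client.get("miota_ugyfel"), 0) * (-0.0026184)         # linear term (β1)
    z += nvl(client.get("tx_kozv_terh_db_avg"), 0) * (-0.66677)    # linear term (β2)
    z += age_term(client.get("eletkor"))                           # binned term
    return z

# Hypothetical client with one missing input.
client = {"miota_ugyfel": 100, "eletkor": 23}
print(round(partial_linear_predictor(client), 6))
```

Substituting a mid-range default such as 40 for a missing age places the client in the bin with coefficient 0, so the missing value contributes nothing to the score.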
Decision Tree
Classic methods are simple and easily interpretable.
Logistic regression is sensitive to missing values, outliers and
correlated variables; decision trees are not.
We generally use a combination of the two methods,
most often with one-level trees (decision stumps).
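A decision stump is a tree with a single split. The sketch below finds the threshold on one numeric variable that minimizes the weighted Gini impurity of the two resulting groups. The data are invented, and using the resulting split as a binned input to a regression is only one possible way of combining the two methods, assumed here for illustration.

```python
def gini_impurity(labels):
    """Gini impurity 2p(1-p) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def fit_stump(values, labels):
    """Return (threshold, weighted impurity) of the best single split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        impurity = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / n
        if impurity < best_impurity:
            # Place the threshold halfway between the two neighbouring values.
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_impurity = impurity
    return best_threshold, best_impurity

# Hypothetical ages and default flags (1 = default).
ages = [22, 24, 30, 41, 47, 58, 60]
defaults = [1, 1, 0, 0, 0, 0, 0]

threshold, impurity = fit_stump(ages, defaults)
print(threshold, impurity)
```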
Model evaluation – performance
AUC (Area Under Curve) = $P(X_P > X_N)$
Gini = $2 \cdot \mathrm{AUC} - 1$
$$\mathrm{Lift}(h) = \frac{DR_{x<h}}{DR}$$
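These three measures can be computed directly from their definitions. The sketch below assumes that the score is a predicted PD (higher means riskier) and that the positive class is default. It reads Lift(h) as the default rate among the riskiest fraction h of clients divided by the overall default rate, which is one interpretation of the slide's notation. The sample data are invented.

```python
def auc(scores, labels):
    """P(score of a random defaulter > score of a random non-defaulter); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    """Gini = 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0

def lift(scores, labels, h):
    """Default rate in the riskiest fraction h, relative to the overall default rate."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(h * len(ranked)))
    top_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

# Hypothetical predicted PDs and observed default flags.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

print(round(auc(scores, labels), 4))
print(round(gini(scores, labels), 4))
print(round(lift(scores, labels, 0.25), 4))
```

The pairwise AUC loop is quadratic in the sample size, which is fine for illustration but too slow for a full portfolio. A rank-based formula gives the same value faster.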
Model evaluation – distribution
Evaluate the distribution:
- Stability index
- Hosmer-Lemeshow test
- Binomial test
Stability index (PSI)
A = PD distr. on modeling data, B = PD distr. on stability data

Pool   Pool limit low   Pool limit high   A        B        (A-B)*LN(A/B)
1      0.00%            0.63%             10.00%   20.05%   0.06993
2      0.63%            0.92%             10.00%   12.80%   0.00691
3      0.92%            1.15%             10.00%   9.37%    0.00040
4      1.15%            1.40%             10.00%   7.30%    0.00848
5      1.40%            1.73%             10.00%   6.75%    0.01276
6      1.73%            2.38%             10.00%   6.91%    0.01144
7      2.38%            4.37%             10.00%   8.36%    0.00292
8      4.37%            10.75%            10.00%   9.44%    0.00032
9      10.75%           35.76%            10.00%   9.04%    0.00096
10     35.76%           100.00%           10.00%   9.97%    0.00000
Stability index                                             0.1141

PSI Value           Inference
Less than 0.1       Insignificant change
0.1 – 0.25          Some minor change
Greater than 0.25   Major shift in population
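The stability index on this slide can be reproduced from the two pool distributions. The sketch below uses the rounded percentages shown for the ten pools, so the result matches the slide's 0.1141 only up to rounding. The function names are assumptions made for the example.

```python
import math

# Share of clients per PD pool: modeling data (A) and stability data (B),
# taken from the slide (rounded percentages).
modeling = [0.10] * 10
stability = [0.2005, 0.1280, 0.0937, 0.0730, 0.0675,
             0.0691, 0.0836, 0.0944, 0.0904, 0.0997]

def psi(a_shares, b_shares):
    """Population stability index: sum over pools of (A - B) * ln(A / B)."""
    return sum((a - b) * math.log(a / b) for a, b in zip(a_shares, b_shares))

def interpret(value):
    """Map a PSI value to the inference bands given on the slide."""
    if value < 0.1:
        return "Insignificant change"
    if value <= 0.25:
        return "Some minor change"
    return "Major shift in population"

value = psi(modeling, stability)
print(f"PSI = {value:.4f}: {interpret(value)}")
```

Most of the index comes from the first pool, where the share of clients roughly doubled between the two samples.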
Classic vs. Ensemble methods
Keep a balance between power, stability, interpretability and simplicity.
„Less is more,
and usually
more effective”
What we do is not „high-frequency trading”: it takes months to reveal whether our
estimates are good or bad.
We should avoid black boxes.
We have to understand our models – the knowledge of business experts is essential.
Model error, Model risk
The model itself also entails risk!
Predicting the „Crisis” yet Blowing Up
Sources of model risk:
• Data errors
• Parameter uncertainty
• Misuse of the model
• …
„We go from reality to models
not from models to reality”
Propagation of error
Quantification of parameter uncertainty: confidence intervals $I_\alpha = \beta_i \pm z_{1-\alpha/2}\,\sigma_i$
• A large number ($n$) of random numbers $x$ is simulated from a uniform distribution between 0 and 1, $X \sim U(0,1)$.
• For each $k \in \{1..n\}$, a full set of $\beta_i$ estimators for the logistic regression is simulated through the inverse of their respective distribution functions, $\beta_i^k = F_i^{-1}(x_k)$.
• The entire portfolio is scored with each set of estimators.
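The simulation steps above can be sketched with the standard library. The example assumes that each coefficient estimate is normally distributed around its point estimate with its standard error. It draws one independent uniform number per coefficient, which ignores any covariance between the estimates. All coefficients, standard errors and the tiny portfolio are hypothetical.

```python
import math
import random
from statistics import NormalDist

# Hypothetical (estimate, standard error) pairs; the first pair is the intercept.
estimates = [(-2.25, 0.10), (-0.67, 0.05)]
# Hypothetical portfolio: one explanatory variable per client.
portfolio = [[0.0], [1.0], [2.5], [4.0]]

def confidence_interval(beta, sigma, alpha=0.05):
    """beta +/- z_(1 - alpha/2) * sigma."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return beta - z * sigma, beta + z * sigma

def simulate_mean_pd(estimates, portfolio, n, seed=0):
    """Score the portfolio with n simulated coefficient sets; return the mean PD of each."""
    rng = random.Random(seed)
    results = []
    for _ in range(n):
        betas = []
        for mean, sigma in estimates:
            u = rng.random()
            while u == 0.0:          # inv_cdf needs 0 < u < 1
                u = rng.random()
            betas.append(NormalDist(mean, sigma).inv_cdf(u))
        beta0, slopes = betas[0], betas[1:]
        pds = []
        for row in portfolio:
            z = beta0 + sum(b * x for b, x in zip(slopes, row))
            pds.append(1.0 / (1.0 + math.exp(-z)))
        results.append(sum(pds) / len(pds))
    return results

draws = sorted(simulate_mean_pd(estimates, portfolio, n=2000))
low = draws[int(0.025 * len(draws))]
high = draws[int(0.975 * len(draws)) - 1]
print("95% interval of the simulated mean PD:", round(low, 4), round(high, 4))
print("95% CI of the intercept:", confidence_interval(*estimates[0]))
```

The spread of `draws` shows how coefficient uncertainty carries through to the portfolio-level estimate.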
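$$f = A \cdot B$$
$$\frac{\sigma_f^2}{f^2} \approx \left(\frac{\sigma_A}{A}\right)^2 + \left(\frac{\sigma_B}{B}\right)^2 + 2\,\frac{\mathrm{cov}_{AB}}{AB}$$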
Risk cost = PD·LGD·EAD
„Understand model error
before you use a model”
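The product rule for error propagation can be applied to the risk cost formula. The sketch below propagates the uncertainty of PD and LGD into their product. It treats EAD as a known constant and assumes zero covariance between PD and LGD unless one is supplied. All numbers are hypothetical.

```python
import math

def product_sigma(a, sigma_a, b, sigma_b, cov_ab=0.0):
    """Return (f, sigma_f) for f = a * b using the first-order approximation
    (sigma_f / f)^2 ~ (sigma_a / a)^2 + (sigma_b / b)^2 + 2 * cov_ab / (a * b)."""
    f = a * b
    rel_var = (sigma_a / a) ** 2 + (sigma_b / b) ** 2 + 2.0 * cov_ab / (a * b)
    return f, abs(f) * math.sqrt(rel_var)

# Hypothetical risk parameters with their standard deviations.
pd_estimate, pd_sigma = 0.02, 0.002
lgd_estimate, lgd_sigma = 0.40, 0.04
ead = 1_000_000.0  # treated as known, so it only scales the result

loss_rate, loss_rate_sigma = product_sigma(pd_estimate, pd_sigma, lgd_estimate, lgd_sigma)
risk_cost = loss_rate * ead
risk_cost_sigma = loss_rate_sigma * ead
print(f"risk cost = {risk_cost:.0f} +/- {risk_cost_sigma:.0f}")
```

With a 10% relative error on each of PD and LGD, the relative error of the product is about 14%, not 20%, because independent errors add in quadrature.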
Software
Programming language
Graphical
Open source
Commercial
Modeling Streams
SAS Enterprise Miner
IBM SPSS Modeler
Modeling with Python
more alternative models
training the models
evaluation
Thanks to…
My colleagues in OTP Bank
Benczúr András, SZTAKI
Gáspár Csaba, BME dmlab
Nassim Nicholas Taleb (quotes from Antifragile and Silent Risk)
… and many more
Thank you for your attention!