
Proceedings of the 44th Hawaii International Conference on System Sciences - 2011
Comparisons of the Performance of Computational Intelligence Methods for
Loan Granting Decisions
Jozef Zurada
University of Louisville, College of Business
[email protected]
Abstract
The importance to financial institutions of accurately
evaluating the credit risk posed by their loan granting
decisions cannot be overstated; it is underscored
by recent credit assessment failures that contributed
greatly to the so-called “great recession” of the late
2000s. The paper compares the classification accuracy
rates of several traditional and computational
intelligence methods. We construct models and assess
their classification accuracy rates on five very versatile
real world data sets obtained from different loan
granting decision areas. The results obtained from
computer experiments provide a fruitful ground for
interpretation.
1. Introduction
The importance to financial institutions of
accurately evaluating the credit risk posed by their loan
granting decisions cannot be overstated; it is
underscored by recent credit assessment failures that
contributed greatly to the so-called “great recession” of
the late 2000s. Lending institutions have been using
credit scoring models for a few decades [1]. The
objective of credit scoring models is to assess the
likelihood that a customer will default on a credit
extension, allowing the credit granting institution to
determine whether or not to extend credit to a
customer. Typically customers are grouped into "good
credit" customers (those likely to repay credit
extended to them) and "bad credit" customers (those
likely to default on credit repayments). Getting these
decisions right has broad financial implications for
credit granting institutions as well as financial markets
as a whole: failure to identify good credit risks leads to
lost income, while extending credit to bad credit risks
is a threat to profitability. Studies have shown that
even a 1% improvement in the accuracy of a credit
scoring model can save millions of dollars in a large
loan portfolio [1]. Thus, the need for accurate credit
scoring models is essential in economies that rely on
credit availability for daily economic activity.
The research on credit scoring models continues to
grow and explore a variety of methods including
survival analysis [2], linear discriminant analysis
(LDA), logistic regression (LR) [3], k-nearest neighbor
(kNN) [4], classification trees (CT), neural networks
(NN) [1], [5], radial basis function neural networks
(RBFNN), support vector machines (SVM) [6], [7],
[8], [9], [10], [11], decision trees (DT) [12], [13], [14],
ensemble techniques [15], [16], genetic programming
[17], [18], [19], [20]. Survival analysis has been used
to predict the time to default or the time to early
repayment [2]; other methods have focused on
predicting the probability of defaulting or of not
defaulting on a loan. Although the application of
parametric statistical methods such as LDA —one of
the first credit scoring models— to the categorical,
non-normally distributed data found in credit data has
been criticized [1], more recently LDA has been
applied in conjunction with other models, for example
with SVM providing the input data [21]. The
application of NN models has been more widespread
[5], [22]. NNs make no assumptions about the
distribution of the data; some researchers have also
investigated the accuracy of hybrid models combining
NN with other models such as decision tables [23],
self-organizing maps [24], the k-nearest neighbor
clustering algorithm, and multivariate adaptive regression splines
(MARS) [25]. The use of SVM in credit scoring
models is more recent [11]. Bellotti and Crook [11]
compare SVM to LDA, LR, and kNN using a large
data set and find that a large number of support vectors
is required to achieve the best performance.
Chrzanowska, Alfaro, & Witkowska [15] and [26] use
classification trees and ensemble classification models
on credit scoring models with success. Recently, some
researchers have applied genetic algorithm (GA) or
genetic programming models to credit scoring models
[17], [18], [19], [20], [27]. Finlay [20] found that
1530-1605/11 $26.00 © 2011 IEEE
genetic algorithms perform comparably with LR and
linear regression models. While the integration of
multi-criteria decision making with machine learning
models has been used in other applications [28],
it is new to credit scoring models: Yu, Wang
& Lai [29], in a technique similar to ensemble models,
employ intelligent agents as decision experts that
generate varying credit scoring judgments that are
subsequently fuzzified and aggregated into a group
consensus decision. Ben-David and Frank [30]
compared a number of the above models to expert
systems (ES).
In many of the aforementioned and similar studies
researchers compare a model of interest with multiple
models usually with respect to accuracy [31].
However, these comparisons are regularly made using
a single data set. Comparisons based on a single data
set may be susceptible to the idiosyncrasies of the data
set, its context, and the computational method. In this
study we investigate the classification accuracy of six
models on five data sets from different financial
contexts: logistic regression (LR), neural network
(NN), radial basis function neural network (RBFNN),
support vector machine (SVM), k-nearest neighbor
(kNN), and decision tree (DT). For each of the six
models and five data sets 10-fold cross-validation is
applied, and each experiment is repeated ten times to
achieve reliable and unbiased error estimates. The
classification accuracy rates across 10 folds and 10
runs are averaged and a 2-tailed paired t-test at α=0.05
is used to verify if the classification accuracy rates at a
0.5 probability cut-off across the models and data sets
are significantly different from the LR model, which is
used as the baseline. This methodology is faithful to
the recommendations of [32]. ROC charts are also
employed to examine the performance of the models at
the probability cut-offs ≠ 0.5 which are more likely to
be used by financial institutions.
As indicated above and in Table 1, the models
proposed in this study have already been applied
successfully to credit scoring. The contribution
of our study lies in the evaluation of these models
on five versatile real-world data sets where each data
set has different characteristics with respect to the type
and number of variables, the distribution of “bad
credit” and “good credit” samples in the data (Table 2),
the extent of missing values, and the number of
samples. The obtained results offer fruitful ground for
interpretation (Tables 3-6). One can look at the
efficiency of the models or the predictive power of the
attributes contained in each of the five data sets. One
can examine the ROC curves to determine the
efficiency of the models at various operating points as
well (Figure 1).
The paper is organized as follows. Section 2 briefly
summarizes several previous studies regarding credit
scoring and loan granting decisions. Section 3
discusses the six methods used. Section 4 presents the
basic characteristics of the five data sets used, whereas
section 5 describes the computer simulation and the
results. Finally, section 6 concludes the paper and
outlines possible future research in this area.
2. Prior research
Machine learning techniques such as LR, NNs,
RBFNNs, DTs, fuzzy systems (FS), neuro-fuzzy
systems (NFS), GAs, and many other techniques have
been applied to credit scoring problems in a number of
studies. Most studies report classification accuracy
rates obtained on different competing models and
computer simulation scenarios; others concentrate on
addressing the typical problems surrounding the nature
of credit data, e.g., that the data is frequently highly
unbalanced and that part of the information may be
missing. A few studies focus on the models'
interpretability issues and feature reduction methods.
Table 1 is a summary of a representative sample of
some of the more recent research in the field.
Table 1. Summary of prior studies

| Study | Methods Used | Data Sets Properties | Results |
|---|---|---|---|
| [1] | Five NN architectures (MLP, mixture-of-experts, RBFNN, learning vector quantization, and fuzzy adaptive resonance) and LDA, LR, kNN, kernel density estimation, and DTs | German credit data (University of Hamburg) and Australian credit data | 10-fold cross-validation used. Among the neural architectures the mixture-of-experts and RBFNN did best, whereas among traditional methods LR analysis was most accurate. |
| [23] | LR, NN | A data set from a UK financial institution | The NN approach did not significantly outperform estimated proportional hazards models. |
| [5] | NN | The Australian credit data set | A training-to-validation ratio of 300:390 (43.5%-56.5%) is the best training scheme on the data, and a single-hidden-layer NN outperforms a double-hidden-layer NN. |
| [30] | ES, NN, LR, Bayes, DT, kNN, SVM, CT, RBFNN | A real-world data set from an Israeli financial institution | When the problem is treated as regression, some machine learning models outperform the expert system's accuracy, but most models do not. When the same problem is treated as classification, no machine learning model outperforms the ES's accuracy. |
| [27] | NN and GA | A data set from the UCI Repository | GA-based inverse classification allows creditors to suggest conditional acceptance and further explain the conditions used to reject applicants. |
| [6] | Hybrid SVM using CART, MARS, and grid search | A credit card data set from a bank in China | The hybrid SVM has the best classification rate and the lowest Type II error in comparison with CART and MARS. |
| [9] | SVM, NN | A real-world data set from Taiwan | SVM surpasses traditional NN models in generalization performance and visualization. |
| [15] | CTs: boosting and bagging | A real-world data set from a commercial bank in Poland | The best performer is an ensemble classifier using boosting, in terms of accuracy and recognition of non-creditworthy borrowers. |
| [33] | A proposed reassigning credit scoring model (RCSM) compared to LDA, LR, NN | Credit card data set obtained from the UCI repository | With RCSM, NN-rejected good credit records are reassigned to the preferable accepted class using a CBR-based classification technique. RCSM outperforms LDA, LR, and NN in terms of accuracy, Type I, and Type II errors. |
| [34] | CART, MARS, LDA, LR, SVM | A real-world bank credit card data set from China | CART and MARS outperform traditional LDA, LR, NN, and SVM in terms of accuracy. |
| [25] | A hybrid NN, MARS model compared to LDA, LR, NN, MARS | A real-world housing loan data set from a bank in China | The hybrid NN outperforms results from LDA, LR, NN, and MARS. |
| [26] | Subagged versions of kernel SVM, kNN, DTs, and Adaboost | A real-world data set of IBM Italian customers | Subagging, an ensemble classification technique for unbalanced data sets, improves the performance of the base classifier, and subagged decision trees obtain the best-performing classifier. |
| [11] | SVM, LR, LDA, and k-nearest neighbors (kNN) | A very large real-world data set of 25,000 records from a financial institution | SVMs are comparatively successful at classifying credit card customers who default. Unlike many other models, a large number of support vectors is required to achieve the best performance. |
| [13], [14] | LR, NN, DT, MBR, and an ensemble model | Data sets 3 and 4 from Table 2 | Both are comparative studies. DTs did best classification-wise. DTs are attractive tools as they can generate easy-to-interpret if-then rules. |
| [29] | Individual and ensemble methods for MLR, LR, NN, SVM, RBFNN; ensemble models' decisions based on fuzzy voting and averaging | Three data sets, including modified data set 1 (without missing values) and data set 3 from Table 2 | The fuzzy group decision making (GDM) model outperformed the other models on all 3 data sets. |

3. Description of the methods used in this study

This section briefly describes the six methods, i.e.,
DTs, RBFNNs, SVMs, kNN, LR, and NNs, used in this
study.

3.1. Decision trees

The operation of DTs is based on the ID3 or C4.5
divide-and-conquer algorithms [35] and search
heuristics which make the clusters at each node
gradually purer by progressively reducing disorder in
the original data set. The algorithms place the attribute
that has the most predictive power at the top node of
the tree, and they have to find the optimum number of
splits and determine where to partition the data to
maximize the information gain. The fewer the splits,
the more explainable the output is, as there are fewer
rules to understand. Selecting the best split is based on
the degree of impurity of the child nodes. For example,
a node which contains only cases of class good_loan or
class bad_loan has the smallest disorder = 0.
Similarly, a node that contains an equal number of
cases of class good_loan and class bad_loan has the
highest disorder = 1. Disorder is measured by the
well-established concepts of entropy and information gain,
which we formally introduce below.

Given a collection S, containing the positive
(good_loan) and negative (bad_loan) examples of
some target concept, the entropy of S relative to this
Boolean classification is

Entropy(S) \equiv -p_{good\_loan} \log_2 p_{good\_loan} - p_{bad\_loan} \log_2 p_{bad\_loan}    (1)

where p_{good_loan} is the proportion of positive examples
in S and p_{bad_loan} is the proportion of negative examples
in S. If the output variable takes on k different values,
then the entropy of S relative to this k-wise
classification is defined as

Entropy(S) = -\sum_{i=1}^{k} p_i \log_2 p_i    (2)

For example, if disorder is measured by entropy, the
information gain Gain(S, A) of an attribute A, relative
to a collection of examples S, can be computed as

Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)    (3)

where Values(A) is the set of all possible values for
attribute A, and S_v is the subset of S for which attribute
A has the value v (i.e., S_v = \{ s \in S \mid A(s) = v \}). For
more details on DTs, refer to [36], [37], and [38].
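The entropy and information-gain computations for selecting a split can be sketched as follows. This is a minimal illustration; the attribute and label names are toy examples, not from the study's data sets.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels (e.g. 'good_loan'/'bad_loan')."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Gain(S, A): the entropy reduction from splitting S on attribute A."""
    n = len(labels)
    # Partition the labels by the value each case takes on the attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# A pure node has disorder 0; a 50/50 node has disorder 1, as described above.
assert entropy(["good_loan"] * 4) == 0.0
assert entropy(["good_loan", "bad_loan"]) == 1.0
```

A perfectly predictive attribute yields a gain equal to the parent node's entropy, which is why the algorithm places such attributes near the top of the tree.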
3.2. Radial Basis Function Neural Network
An RBFNN differs from a feed-forward NN with
back-propagation in the way the hidden neurons
perform computations [39]. Each neuron represents a
point in input space, and its output for a given training
pattern depends on the distance between its point and
the pattern. The closer these two points are, the
stronger the activation. The RBFNN uses nonlinear
bell-shaped Gaussian activation functions whose width
may be different for each neuron. The output layer
forms a linear combination from the outputs of neurons
in the hidden layer which are fed to the sigmoid
function. A network learns two sets of parameters: the
centers and widths of the Gaussian functions by
employing clustering and the weights used to form the
linear combination of the outputs obtained from the
hidden layer. As the first set of parameters can be
obtained independently of the second set, the RBFNN
learns almost instantly if the number of hidden units is
much smaller than the number of training patterns.
Unlike a feed-forward NN with back-propagation, the
RBFNN, however, cannot learn to ignore irrelevant
attributes because it gives them the same weight in
distance computations.
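The hidden-layer computation described above can be sketched as follows. This is a sketch, not the Weka implementation: `centers`, `widths`, `weights`, and `bias` are hypothetical parameters that would be learned by clustering and by fitting the output-layer regression.

```python
import numpy as np

def rbf_hidden_layer(X, centers, widths):
    """Gaussian activations: one hidden neuron per center.
    Activation is strongest when a pattern lies close to the neuron's center."""
    # Squared Euclidean distance from every pattern to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * widths ** 2))

def rbfnn_output(X, centers, widths, weights, bias):
    """Linear combination of the hidden-layer outputs fed to a sigmoid."""
    h = rbf_hidden_layer(X, centers, widths)
    return 1.0 / (1.0 + np.exp(-(h @ weights + bias)))
```

Because the distance is computed over all attributes equally, an irrelevant attribute contributes to `d2` just like a relevant one, which illustrates why the RBFNN cannot learn to ignore irrelevant attributes.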
3.3. Support vector machines
Support vector machine (SVM), originally
developed by Vapnik, is a system that represents a
blend of linear modeling and instance-based learning to
implement nonlinear class boundaries [40]. This
system chooses several critical boundary patterns
called support vectors for each class (bad loan and
good loan of the output variable) and creates a linear
discriminant function that separates them as widely as
possible, applying linear, quadratic, cubic, or
higher-order polynomial decision boundaries. The
hyperplane that gives the greatest separation between
the classes is called the maximum margin hyperplane
and takes the form

x = b + \sum_i \alpha_i y_i (a(i) \cdot a)^n    (5)
where y_i is the class value of training pattern a(i),
while b and \alpha_i are parameters determined by the
learning algorithm. The vectors a and a(i) represent a
test pattern and the support vectors, respectively, while
the expression (a(i) \cdot a)^n, which computes the dot
product of the test pattern with one of the support
vectors and raises the result to the power n,
is called a polynomial kernel. Other kernel functions
could also be used to implement a different nonlinear
mapping. Constrained quadratic optimization is applied
to find support vectors for the pattern sets as well as
parameters b and α i . Compared with DTs, for
example, SVMs are slow but often yield accurate
classifiers because they create subtle and complex
decision boundaries.
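The SVM decision function above can be evaluated as follows. In this sketch the support vectors, their coefficients alpha_i, the class values y_i, and the bias b are assumed to have been found already by the constrained quadratic optimization the text describes; here they are supplied by hand for illustration.

```python
import numpy as np

def polynomial_kernel(u, v, n=2):
    """(u . v)^n: dot product of two patterns raised to the power n."""
    return np.dot(u, v) ** n

def svm_decision(a, support_vectors, y, alpha, b, n=2):
    """Evaluate x = b + sum_i alpha_i * y_i * (a(i) . a)^n for a test pattern a.
    Only the support vectors (a small subset of the training set) contribute.
    The sign of the result gives the predicted class, e.g. +1 good loan,
    -1 bad loan."""
    return b + sum(al * yi * polynomial_kernel(sv, a, n)
                   for sv, yi, al in zip(support_vectors, y, alpha))
```

Raising the kernel power n above 1 is what turns the maximum-margin hyperplane into a nonlinear boundary in the original attribute space.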
3.4. K-nearest neighbor
Broadly construed, kNN is the method of solving
new problems based on the solutions of similar past
cases [37]. The method requires no model to be fitted,
or function to be estimated. Instead it requires all cases
with their known solutions to be maintained in
memory, and when a prediction is required, the method
recalls items from memory and predicts the value of
the target. In solving a new case, the kNN approach
retrieves a case it deems sufficiently similar and uses
that case as a basis for solving the new case. The
method uses a k-nearest neighbor algorithm to classify
cases. The k-nearest neighbor algorithm takes a data set
of existing cases and a new case to be classified, where
each existing case in the data set is composed of a set
of variables and the new case has one value for each
variable. The algorithm computes the normalized
Euclidean or Manhattan distance for numeric attributes
or Hamming distance for nominal or ordinal attributes
between each existing case and the new case (to be
classified). The k existing cases that have the smallest
distances to the new case are the k-nearest neighbors to
that case. Based on the target values of the k-nearest
neighbors, each of the k-nearest neighbors votes on the
target value for a new case. The votes are the posterior
probabilities for the class dependent variable. There are
two challenging tasks in the successful application of
kNN: choosing the right value for k and the proper
distance measure.
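The classification step described above can be sketched as follows, assuming numeric attributes that have already been normalized and a Euclidean distance measure; the vote shares stand in for the posterior class probabilities mentioned in the text.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, new_case, k=10):
    """Classify new_case by majority vote of its k nearest neighbors.
    Returns the winning label and its vote share (a posterior estimate)."""
    # Euclidean distance from the new case to every stored case.
    distances = np.sqrt(((train_X - new_case) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k
```

Note that all training cases stay in memory and every prediction scans them, which is why kNN requires no model fitting but pays its cost at prediction time.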
3.5. Logistic regression
The purpose of the logistic regression model is to
obtain a regression equation that could predict in which
of two or more groups an object could be placed (i.e.
whether a loan should be classified as a good loan or a
bad loan). The logistic regression also attempts to
predict the probability that a binary or ordinal target
will acquire the event of interest (e.g. loan payoff or
loan default) as a function of one or more independent
variables (i.e. amount of loan, borrower job category,
reason of loan). The logit model is represented by the
logistic response function P(y) of the form:
The function P(y) describes a dependent variable y
containing two or more qualitative outcomes. z is the
function of m independent variables x called predictors,
and b represents the parameters. The x variables can be
categorical or continuous variables of any distribution.
The value of P(y) that varies from 0 to 1 denotes the
probability that a dependent variable y belongs to one
of two or more groups. The principle of maximum
likelihood is commonly used to compute estimates
of the b parameters. This means that the calculations
involve an iterative process of improving
approximations for the estimates until no further
changes can be made.
Unlike neural networks, logistic regression models
are designed to predict one dependent variable at a
time. On the positive side, one can note that logistic
regression output provides statistics on each variable
included in the model. Researchers then can analyze
these statistics to test the usefulness of specific
information.
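The logistic response function can be evaluated directly once the coefficients are available; in this sketch the intercept b0 and coefficient vector b are assumed to have been estimated already by maximum likelihood, as described above.

```python
import numpy as np

def logistic_probability(x, b0, b):
    """P(y) = 1 / (1 + e^{-z}) with z = b0 + sum_i b_i * x_i.
    Returns the probability that case x belongs to the event class
    (e.g. loan default); comparing it to a cut-off yields the class."""
    z = b0 + np.dot(b, x)
    return 1.0 / (1.0 + np.exp(-z))
```

With all coefficients zero, z = 0 and the model is maximally uncertain, returning a probability of 0.5 regardless of the input.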
3.6. Neural networks
Neural networks are mathematical models that are
inspired by the architecture of the human brain. They
are nonlinear systems built of highly interconnected
neurons that process information. The most attractive
features of these networks are their ability to adapt,
generalize, and learn from training patterns. Neural
network models are characterized by their three
properties: the computational property, the architecture
of the network, and the learning property.
A typical neuron contains a summation node and an
activation function. A neuron accepts input vectors
called training patterns. Neurons are organized in
layers and are connected by weights represented by
small numerical values. The most common type of
neural network architecture is a two-layer feed-forward
neural network with error back-propagation, which is
typically used for prediction and classification tasks.
Most commonly, the network has two layers: a hidden
layer and an output layer. The neurons at the hidden
layer receive the values of input vectors and propagate
them concurrently to the output layer.
Neural networks’ learning is a process in which a
set of input vectors is presented sequentially and
repeatedly to the input of the network in order to adjust
its weights in such a way that similar inputs give the
same output. In supervised learning the training set
consists of the training patterns/examples that appear
on input to the neural network and the corresponding
desired responses provided by a teacher. The
differences between the desired response and the
network’s actual response for each single training
pattern modify the weights of the network in all layers.
The training continues until the mean sum of squares
of the network errors for the entire training set
is reduced to a sufficiently small value close to zero.
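One forward pass of such a two-layer network, and a single momentum-style weight step, can be sketched as follows. The weight matrices here are hypothetical placeholders; in training they would be adjusted by back-propagating the error for every pattern.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_hidden, b_hidden, W_out, b_out):
    """One forward pass of a two-layer feed-forward network:
    input -> hidden layer -> output layer, sigmoid activations."""
    h = sigmoid(W_hidden @ x + b_hidden)   # hidden-layer activations
    return sigmoid(W_out @ h + b_out)      # network response

def weight_update(w, grad, lr=0.3, momentum=0.2, prev_step=0.0):
    """A gradient step with learning rate and momentum, the two
    parameters this study sets to 0.3 and 0.2 (Section 5.1)."""
    step = -lr * grad + momentum * prev_step
    return w + step, step
```

The momentum term reuses a fraction of the previous step, which smooths the weight trajectory and helps keep the network from diverging, as noted in Section 5.1.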
4. Data sets used in the study
Almost all data sets used in the loan decision context,
including the five data sets used in this study, contain
information about loan applicants that financial
institutions considered to be creditworthy individuals,
as all of them obtained a loan. Other applicants did not
qualify for a loan at the time of their application, and
they are not included in the data sets used for
modeling. Simply, we do not know if these applicants
would have paid a loan off or defaulted on it, had the
loan been granted. Though this situation does not
affect the validity of the analysis, we should keep it
in mind.
The five data sets used in this study are drawn from
different financial contexts and they describe financial,
family, social, and personal characteristics of loan
applicants. In two of the five data sets the names of the
attributes have not been revealed because of
confidentiality issues. Two of the data sets are publicly
available at the UCI Machine Learning Repository at
http://www.ics.uci.edu/~mlearn/databases/. Table 2
presents the general features of each of the five data
sets.
Table 2. The general characteristics of the five data
sets used in computer simulation

| Data set | # of cases | # of variables | Class values of target variable (B: bad loans, G: good loans) | Comments |
|---|---|---|---|---|
| 1 | 690 | 16 | B: 383, G: 307 | The Quinlan data set, used in a number of studies. |
| 2 | 252 | 14 | B: 71, G: 181 | |
| 3 | 1000 | 21 | B: 300, G: 700 | A data set from a German financial institution used in a number of studies. |
| 4 | 5960 | 13 | B: 1189, G: 4771 | The attributes describe financial, family, social, and personal characteristics of loan applicants. |
| 5 | 3364 | 13 | B: 300, G: 3064 | The attributes describe financial, family, social, and personal characteristics of loan applicants. |
Data set 1: The Quinlan data set describes financial
attributes of Japanese customers. It is available at the UCI
Machine Learning Repository. The names of the attributes
are not revealed; it is well balanced and bad loans are slightly
overrepresented. It contains numeric and nominal variables.
There are some missing values.
Data set 2: The names of the attributes are not available.
Unbalanced data set: bad loans are underrepresented. It
includes only financial data of loan applicants. There are no
missing values.
Data set 3: This is an unbalanced data set where bad
loans are underrepresented. It contains both numeric and
nominal variables. The names of the attributes are available,
and there are no missing values.
Data set 4: This is an unbalanced data set: bad loans are
underrepresented by a ratio of about 1:4. It contains a large
number of missing values which have not been replaced. The
data set is available from the SAS company.
Data set 5: This is a very unbalanced data set: bad loans
are significantly underrepresented by a ratio of about 1:10. It
is obtained from data set 4 by removing all missing values.
5. Model and parameter settings,
experiments and results
5.1. Model and parameter settings
The computer simulation for this study was
performed in Weka 3.7, written in Java
(www.cs.waikato.ac.nz/ml/weka/).
6
Proceedings of the 44th Hawaii International Conference on System Sciences - 2011
In this study the LR uses a quasi-Newton method
with a ridge estimator for parameter optimization [41].
The RBFNN implements a normalized Gaussian
radial basis function network. It uses the k-means
clustering algorithm to provide the basis functions and
learns a logistic regression on top of that. Symmetric
multivariate Gaussians are fit to the data from each
cluster. The minimum standard deviation for the
clusters varied between 0.2 and 1, and the number of
clusters varied from 2 to 10 for the 5 data sets.
The standard 2-layer feed-forward NN with back-propagation is used. Momentum was set to 0.2, and the
learning rate was initially set to 0.3. A decay
parameter, which causes the learning rate to decrease,
was enabled. This may help to stop the network from
diverging from the target output as well as improve
general performance. The number of neurons in the
hidden layer was computed as a=(number of attributes
+ number of classes)/2. For a 2-class target attribute
a = (number of attributes + 2)/2 = number of
attributes/2 + 1, and, depending on the data set,
varied from 15 to 23.
The SVM implements Platt's sequential minimal
optimization (SMO) algorithm for training a support
vector classifier [42], [43]. The complexity parameter
and the power of the polynomial kernels varied from 1
to 2.
The k-NN implements a k-nearest neighbor
classifier (k=10) according to the algorithm presented
by [44]. The Euclidean distance measure is used to
determine the similarity of the samples. The inverse
normalized distance weighting method and the brute
force linear search algorithm are used to search for the
nearest neighbors.
The DT generates a pruned C4.5 decision tree [35].
The confidence factor that determines the amount of
pruning is set to 0.2. The default factor is 0.25. Smaller
values assigned to the confidence factor incur more
pruning.
In all 5 data sets, missing values for numeric
attributes were replaced with the mean value of the
attribute, and for nominal attributes the missing values
are replaced with the mode value of the attribute for
the given class. Multi-valued nominal attributes are
transformed into binary attributes, replacing each
nominal attribute with m values by m-1 binary
attributes. No samples were allocated for the validation
data set. Depending on the method, the values of the
attributes were also normalized to the [-1,1] range or to
a zero mean and a unit variance.
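A rough pandas sketch of this preprocessing is shown below, with a hypothetical `loan_status` target column. Note one simplification: the study imputes the nominal mode for the given class, whereas this sketch uses the global mode; it also shows only the zero-mean, unit-variance normalization option.

```python
import pandas as pd

def preprocess(df, target="loan_status"):
    """Sketch of the study's preprocessing: impute numeric attributes with
    the mean and nominal attributes with the (global) mode, z-score the
    numeric attributes, then replace each m-valued nominal attribute with
    m-1 binary attributes."""
    X = df.drop(columns=[target]).copy()
    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            X[col] = X[col].fillna(X[col].mean())
            # normalize to zero mean and unit variance
            X[col] = (X[col] - X[col].mean()) / X[col].std()
        else:
            X[col] = X[col].fillna(X[col].mode()[0])
    # drop_first=True keeps m-1 dummies for an m-valued nominal attribute
    X = pd.get_dummies(X, drop_first=True)
    return X, df[target]
```

Using m-1 rather than m dummies avoids the redundant column whose value is implied by the others, which matters for methods such as LR that dislike perfectly collinear inputs.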
5.2. Experiments and results
We investigate the classification accuracy of six
models on five data sets from different financial
contexts: LR, NN, RBFNN, SVM, k-NN, and DT. For
each of the six models and five data sets 10-fold cross-validation is applied, and each experiment is repeated
10 times to achieve reliable and unbiased error
estimates. The classification accuracy rates across 10
folds and 10 runs are averaged and a 2-tailed paired t-test at α=0.05 is used to verify if the classification
accuracy rates at a 0.5 probability cut-off across the
models and data sets are significantly different from
LR model which is used as the baseline. This
methodology is faithful to the recommendations of
[32]. A ROC chart is also employed to examine the
performance of the models at the probability cut-offs ≠
0.5 which are more likely to be used by financial
institutions. For example, a 0.3 cut-off means that Type
II error (classifying a bad loan as a good loan) is 3.3
times more costly than the Type I error (classifying a
good loan as a bad loan). This cutoff may be applicable
to situations in which banks do not secure smaller
loans, i.e., do not hold any collateral, whereas the 0.7
cutoff implies that the cost of making a Type I error is
smaller than the cost of Type II error. This cut-off may
typically be used when a financial institution secures
larger loans by holding collateral such as customer’s
home.
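The evaluation protocol above, 10 repetitions of 10-fold cross-validation with a paired t-test against the LR baseline, can be sketched with scikit-learn and SciPy as follows. The study's five data sets are not public here, so a synthetic unbalanced data set stands in, and a plain paired t-test over the 100 fold scores is used as a simple stand-in for the corrected test recommended in [32].

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Hypothetical stand-in data: 500 cases, roughly 80% "good" / 20% "bad".
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=1)

# 10-fold cross-validation repeated 10 times -> 100 accuracy scores per model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
challenger = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=cv)

# Average accuracy over folds and runs; 2-tailed paired t-test vs the LR baseline.
t, p = ttest_rel(challenger, baseline)
print(f"LR {baseline.mean():.3f}  DT {challenger.mean():.3f}  p={p:.3f}")
```

Stratified folds preserve the bad/good class ratio in every fold, which matters for the unbalanced data sets described in Table 2.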
The results obtained offer a fruitful ground for
investigations which could go in several directions.
There are several dimensions to consider. For example,
(1) the methods, (2) the data sets, (3) the classification
performance for overall, bad loans, and good loans at
0.5 probability cut-off, and (4) the area under the ROC
curve and the ROC curves themselves which allow one
to examine the models’ performance at cut-offs
different from 0.5. For example, one may want to
compare the performance of the six methods on the five
versatile data sets in an attempt to find the one or two
methods that work best across all data sets. One may
also look at the five data sets to find
one or two data sets which contain the best selection of
features for reliable loan granting decisions. If
detecting bad loans is of paramount interest, one could
concentrate on finding the best model which does
exactly that, etc. Due to space constraints, we leave
most of these considerations to the reader and give
only a brief interpretation of the results.
The classification accuracy rates are reported in
Tables 3 through 5, and the area under the ROC curve,
which indicates the general detection power of the
models, is presented in Table 6. Generally, one can see
that the overall performance of the five models (out of
six) is the highest and most stable on data set 1. This
data set seems to contain the right balance of bad and
good loans, with bad loans slightly overrepresented
(Table 2). The results presented in Table 6 confirm
these observations. Looking at each Table 3 through 6,
however, enables one to draw more subtle conclusions.
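The area under the ROC curve reported in Table 6 can be computed directly from model scores via its probabilistic interpretation; the scores below are toy values, not from the study.

```python
def auc_from_scores(scores_bad, scores_good):
    """Area under the ROC curve via its probabilistic interpretation:
    the chance that a randomly chosen good loan is scored above a
    randomly chosen bad loan (ties count half). 1.0 means perfect
    separation; 0.5 means no detection power."""
    pairs = len(scores_good) * len(scores_bad)
    wins = sum((g > b) + 0.5 * (g == b)
               for g in scores_good for b in scores_bad)
    return wins / pairs
```

Because this quantity integrates over all cut-offs, it summarizes a model's performance at the operating points other than 0.5 that financial institutions are more likely to use.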
Table 3 presents the overall percentage
classification accuracy rates. Only RBFNNs appear to
perform significantly worse than LR and other methods
for data set 1. For data set 5, NNs, RBFNNs, SVM,
and k-NN methods classify worse than LR. Similar
patterns could be observed for the 3 remaining data
sets. However, DTs seem to outperform LR and the
remaining methods on data sets 4 and 5 of which both
are highly unbalanced (Table 2). The best overall
classification accuracy rate across all six models is for
data set 5, in which the bad loan class is
underrepresented by a ratio of 1:10. This occurs mainly
because good loans are classified almost perfectly.
Table 3. Overall correct classification accuracy
rates [%] for the 6 models at a 0.5 probability cutoff

Data Set   LR     NN      RBFNN   SVM     kNN     DT
1          85.3   85.5    79.5w   84.9    86.1    85.6
2          78.0   78.6    74.5w   75.6w   79.0    75.5w
3          75.6   75.2    73.5w   75.5    72.9w   71.6w
4          83.6   86.9b   83.5    82.9    78.9w   88.6b
5          92.5   92.1w   91.1w   91.2w   91.5w   94.4b

w,b Significantly worse/better than LR at the 0.05 significance
level.
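The paper does not state which statistical test produced the w/b markers. One common choice for comparing two classifiers evaluated on the same test set is McNemar's test, sketched below with hypothetical disagreement counts:

```python
# McNemar's test for comparing two classifiers on the same test set
# (e.g., LR vs. a competing model). Only the disagreements matter:
# b = cases LR got right and the rival got wrong, c = the reverse.
# Note: the paper does not say which test it used; this is one common choice.
def mcnemar_statistic(b, c):
    """Continuity-corrected McNemar chi-square statistic."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical counts: LR right / rival wrong 40 times, the reverse 18 times.
stat = mcnemar_statistic(b=40, c=18)
# Critical value of chi-square with 1 df at the 0.05 level is 3.841.
significant = stat > 3.841
print(round(stat, 3), significant)  # 7.603 True
```

Only the disagreements between the two classifiers enter the statistic, which is why two models with similar overall accuracy can still differ significantly.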
Table 4. Bad loans correct classification accuracy
rates [%] for the 6 models at a 0.5 probability cutoff

Data Set   LR     NN      RBFNN   SVM     kNN     DT
1          86.4   84.2w   65.0w   92.1b   86.1    84.1w
2          45.5   35.5w   29.8w   48.7    40.3w   37.0w
3          49.0   50.3b   42.2w   47.8w   27.1w   41.0w
4          30.4   59.0b   35.5    18.5w   31.6    54.8b
5          22.7   14.2w   5.1w    1.4w    6.1w    47.3b

w,b Significantly worse/better than LR at the 0.05 significance
level.
Table 5. Good loans correct classification accuracy
rates [%] for the 6 models at a 0.5 probability cutoff

Data Set   LR     NN      RBFNN   SVM      kNN     DT
1          84.5   86.6    91.2b   79.1w    86.1b   86.7b
2          90.6   95.2b   91.8    85.9w    93.9b   90.3w
3          87.0   85.8    86.9    87.4     92.5b   84.8w
4          96.9   93.8w   94.3w   98.9b    90.0w   97.4b
5          99.4   99.7b   99.5b   100.0b   99.9b   99.0w

w,b Significantly worse/better than LR at the 0.05 significance
level.
Table 6. The area under the ROC curve for the 6
models

Data Set   LR     NN      RBFNN   SVM      kNN     DT
1          84.5   86.6    91.2b   79.1w    86.1b   86.7b
2          90.6   95.2b   91.8    85.9w    93.9b   90.3w
3          87.0   85.8    86.9    87.4     92.5b   84.8w
4          96.9   93.8w   94.3w   98.9b    90.0w   97.4b
5          99.4   99.7b   99.5b   100.0b   99.9b   99.0w

w,b Significantly worse/better than LR at the 0.05 significance
level.
Table 4 shows that all models classify bad loans
consistently poorly on data sets 2 through 5. For these
four data sets, the best and worst classification
accuracy rates are 59.0% (NN) and 1.4% (SVM). The
latter does not appear to tolerate highly unbalanced
data sets well. However, for data set 1, in which bad
loans are slightly overrepresented, SVMs exhibit an
extraordinary performance of 92.1%. This is
important, especially when detecting bad loans is the
target event. DTs and NNs seem to be the most efficient
classifiers of bad loans for data sets 4 and 5, in which
bad loans are underrepresented by ratios of 1:4 and
1:10, respectively.
For four out of five data sets, the kNN method
generally outperforms the remaining methods in
detecting good loans (Table 5). It is also evident that
for data sets 4 and 5, in which good loans substantially
outnumber bad loans, the models' classification
performance for good loans is well above 90%.
The area under the ROC curve is an important
measure, as it illustrates the overall performance of the
models at various operating points. For example, if the
target event is detecting a bad loan and misclassifying
a bad loan as a good loan is three times more costly than
misclassifying a good loan as a bad loan, a lending
institution may choose a 0.3 cutoff as the decision
threshold. This means that a customer whose
probability of defaulting on a loan is ≥ 0.3 will be
denied the loan. Table 6 shows that data set 1 appears
to contain the best attributes for distinguishing between
good and bad loans across all six models. The RBFNN,
SVM, and DT models significantly underperform the
remaining three models. It is also apparent that more
experiments are needed to find the best settings for the
parameters of RBFNN and SVM.
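The mechanics behind Table 6 and the 0.3 cutoff can be sketched as follows. The labels and scores below are made up, and the trapezoidal rule is one common way to compute the area under the ROC curve (the paper does not specify its procedure):

```python
# Sketch of (a) the area under the ROC curve via the trapezoidal rule
# and (b) applying a cost-motivated probability cutoff, as in the
# 0.3-threshold example in the text. Data is illustrative only.
def roc_auc(labels, scores):
    """AUC by sweeping the score threshold; labels: 1 = bad loan (target event)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    auc = prev_fpr = prev_tpr = 0.0
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        tpr, fpr = tp / pos, fp / neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid area
        prev_fpr, prev_tpr = fpr, tpr
    return auc

def deny_loan(p_default, cutoff=0.3):
    """Deny the loan when the estimated default probability reaches the cutoff."""
    return p_default >= cutoff

labels = [1, 1, 0, 1, 0, 0, 0, 1]                    # 1 = defaulted (bad loan)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]   # model's P(default)
print(round(roc_auc(labels, scores), 3))             # 0.688
print([deny_loan(p) for p in scores])
```

A lending institution would tune the cutoff to its own misclassification costs; lowering it denies more loans but catches more defaulters, which is exactly the trade-off the ROC curve traces out.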
To avoid clutter on the ROC charts and make them
more transparent, we chose to illustrate the
performance of only the best three models for data set
1 (Figure 1): LR, NN, and kNN. The three curves
overlap to a large extent at most operating points,
exhibiting very good classification ability. However,
kNN appears to outperform LR and NN at the
operating points within the range [0.6, 0.7]. A
similar ROC chart could be developed for bad loans as
well.
[2] M. Stepanova, and L. Thomas, "Survival Analysis
Methods for Personal Loan Data", Operations Research, vol.
50, no. 2, pp. 277-289, Mar-Apr, 2002.
[3] S. Y. Sohn, and H. S. Kim, "Random Effects Logistic
Regression Model for Default Prediction of Technology
Credit Guarantee Fund", European Journal of Operational
Research, vol. 183, no. 1, pp. 472-478, Nov, 2007.
[4] A. Laha, "Building Contextual Classifiers by Integrating
Fuzzy Rule Based Classification Technique and k-NN
Method for Credit Scoring", Advanced Engineering
Informatics, vol. 21, no. 3, pp. 281-291, Jul, 2007.
[5] A. Khashman, "A Neural Network Model for Credit Risk
Evaluation", International Journal of Neural Systems, vol.
19, no. 4, pp. 285-294, Aug, 2009.
Figure 1. The ROC chart ("Loan Granted": sensitivity
vs. 1-specificity) for the LR, NN, and kNN models for
data set 1 (Quinlan).
6. Conclusions and suggestions for future
research
This study assesses the classification accuracy rates
of six models on five versatile real-world data sets
obtained from different financial fields. The Quinlan
data set (data set 1) appears to contain the best
attributes for building effective models to classify
consumer loans into the bad and good categories.
Increasing the complexity parameter C in the SVM
models slightly but steadily improves the
classification accuracy rates for bad loans, but it
significantly extends the time needed to build the
models. In general, the SVM method does not do well
when one of the classes (bad loans in this case) is
heavily underrepresented, as in data set 5.
More experimentation is needed with the parameter
settings for the RBFNN, SVM, and k-NN models,
which have proved to be very efficient classifiers in
many other applications. It is also advisable to explore
feature reduction methods for possible enhancement
of the results, as well as to investigate more thoroughly
DTs, which can generate if-then rules, to better
interpret the models.
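As an illustration of the last point, a decision tree's internal tests translate directly into if-then rules. A toy sketch follows (a single split on one synthetic attribute; real DT induction chooses splits over many attributes using measures such as information gain):

```python
# Toy illustration of how a decision-tree split yields an if-then rule.
# A single stump is fit on one attribute by maximizing correct
# classifications; the attribute name and data are synthetic.
def best_stump(values, labels):
    """Find the threshold on one attribute that best separates bad/good loans."""
    best = (None, -1)
    for t in sorted(set(values)):
        # Rule under test: predict "bad" when value <= t, "good" otherwise.
        correct = sum(
            (lab == "bad") == (v <= t) for v, lab in zip(values, labels)
        )
        if correct > best[1]:
            best = (t, correct)
    return best

income = [12, 18, 25, 31, 40, 55, 60, 75]   # synthetic attribute
label = ["bad", "bad", "bad", "good", "bad", "good", "good", "good"]
t, correct = best_stump(income, label)
print(f"IF income <= {t} THEN loan = bad ELSE loan = good "
      f"({correct}/{len(label)} correct)")
```

Each path through a full tree composes several such tests, which is what makes DT models directly interpretable as lending rules.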
7. References
[1] D. West, "Neural Network Credit Scoring Models",
Computers & Operations Research, vol. 27, no. 11-12, pp.
1131-1152, Sep-Oct, 2000.
[6] W. M. Chen, C. Q. Ma, and L. Ma, "Mining the Customer
Credit Using Hybrid Support Vector Machine Technique",
Expert Systems with Applications, vol. 36, no. 4, pp. 7611-7616, May, 2009.
[7] C. F. Tsai, "Financial Decision Support Using Neural
Networks and Support Vector Machines", Expert Systems,
vol. 25, no. 4, pp. 380-393, Sep, 2008.
[8] L. G. Zhou, K. K. Lai, and L. A. Yu, "Credit Scoring
Using Support Vector Machines with Direct Search for
Parameters Selection", pp. 149-155, 2009.
[9] S. T. Li, W. Shiue, and M. H. Huang, "The Evaluation of
Consumer Loans Using Support Vector Machines", Expert
Systems with Applications, vol. 30, no. 4, pp. 772-782, May,
2006.
[10] S. T. Luo, B. W. Cheng, and C. H. Hsieh, "Prediction
Model Building with Clustering-launched Classification and
Support Vector Machines in Credit Scoring", Expert Systems
with Applications, vol. 36, no. 4, pp. 7562-7566, May, 2009.
[11] T. Bellotti, and J. Crook, "Support Vector Machines for
Credit Scoring and Discovery of Significant Features",
Expert Systems with Applications, vol. 36, pp. 3302–3308,
2009.
[12] A. Owen, "Data Squashing by Empirical Likelihood",
Data Mining and Knowledge Discovery, vol. 7, no. 1, pp.
101-113, Jan, 2003.
[13] J. Zurada, "Rule Induction Methods for Credit Scoring",
Review of Business Information Systems, vol. 11, no. 2, pp.
11-22, 2007.
[14] J. Zurada, "Could Decision Trees Improve the
Classification Accuracy and Interpretability of Loan Granting
Decisions?", Proceedings of the 43rd Hawaii International
Conference on System Sciences (HICSS), (R. Sprague, Ed.),
IEEE Computer Society Press, Hawaii, January 5-8, 2010.
[15] M. Chrzanowska, E. Alfaro, and D. Witkowska, "The
Individual Borrowers Recognition: Single and Ensemble
Trees", Expert Systems with Applications, vol. 36, no. 3, pp.
6409-6414, Apr, 2009.
[16] D. West, S. Dellana, and J. X. Qian, "Neural Network
Ensemble Strategies for Financial Decision Applications",
Computers & Operations Research, vol. 32, no. 10, pp.
2543-2559, Oct, 2005.
[17] P. G. Espejo, S. Ventura, and F. Herrera, "A Survey on
the Application of Genetic Programming to Classification",
IEEE Transactions on Systems, Man, and Cybernetics - Part
C: Applications and Reviews, to be published.
[18] C. S. Ong, J. J. Huang, and G. H. Tzeng, "Building
Credit Scoring Models Using Genetic Programming", Expert
Systems with Applications, vol. 29, no. 1, pp. 41-47, Jul,
2005.
[19] J. J. Huang, G. H. Tzeng, and C. S. Ong, "Two-stage
Genetic Programming (2SGP) for the Credit Scoring Model",
Applied Mathematics and Computation, vol. 174, no. 2, pp.
1039-1053, Mar, 2006.
[20] S. Finlay, "Are We Modeling the Right Thing? The
Impact of Incorrect Problem Specification in Credit Scoring",
Expert Systems with Applications, vol. 36, no. 5, pp. 9065-9071, Jul, 2009.
[21] K. B. Schebesch, and R. Stecking, "Support Vector
Machines for Classifying and Describing Credit Applicants:
Detecting Typical and Critical Regions", Journal of the
Operational Research Society, vol. 56, no. 9, pp. 1082-1088,
Sep, 2005.
[22] K. A. Smith, Introduction to Neural Networks and Data
Mining for Business Applications, Australia: Eruditions
Publishing, 1999.
[23] B. Baesens, T. Van Gestel, M. Stepanova et al., "Neural
Network Survival Analysis for Personal Loan Data", Journal
of the Operational Research Society, vol. 56, no. 9, pp. 1089-1098, Sep, 2005.
[24] J. Huysmans, B. Baesens, J. Vanthienen et al., "Failure
Prediction with Self Organizing Maps", Expert Systems with
Applications, vol. 30, no. 3, pp. 479-487, Apr, 2006.
[25] T. S. Lee, and I. F. Chen, "A Two-stage Hybrid Credit
Scoring Model Using Artificial Neural Networks and
Multivariate Adaptive Regression Splines", Expert Systems
with Applications, vol. 28, no. 4, pp. 743-752, May, 2005.
[26] G. Paleologo, A. Elisseeff, and G. Antonini, "Subagging
for Credit Scoring Models", European Journal of
Operational Research, vol. 201, no. 2, pp. 490-499, Mar,
2010.
[27] M. C. Chen, and S. H. Huang, "Credit Scoring and
Rejected Instances Reassigning Through Evolutionary
Computation Techniques", Expert Systems with Applications,
vol. 24, no. 4, pp. 433-441, May, 2003.
[28] K. N. Kunene, and H. R. Weistroffer, "An Approach for
Predicting and Describing Patient Outcome Using
Multicriteria Decision Analysis and Decision Rules",
European Journal of Operational Research, vol. 185, no. 3,
pp. 984-997, 2008.
[29] L. Yu, S. Y. Wang, and K. K. Lai, "An Intelligent-agent-based
Fuzzy Group Decision Making Model for Financial
Multicriteria Decision Support: The Case of Credit Scoring",
European Journal of Operational Research, vol. 195, pp.
942-959, Jun 16, 2009.
[30] A. Ben-David, and E. Frank, "Accuracy of Machine
Learning Models versus "Hand Crafted" Expert Systems - A
Credit Scoring Case Study", Expert Systems with
Applications, vol. 36, no. 3, pp. 5264-5271, Apr, 2009.
[31] B. Baesens, R. Setiono, C. Mues et al., "Using Neural
Network Rule Extraction and Decision Tables for Credit-risk
Evaluation", Management Science, vol. 49, no. 3, pp. 312-329, 2003.
[32] I. H. Witten, and E. Frank, Data Mining: Practical
Machine Learning Tools and Techniques, Morgan Kaufmann
Publishers, 2005.
[33] C. L. Chuang, and R. H. Lin, "Constructing a
Reassigning Credit Scoring Model", Expert Systems with
Applications, vol. 36, no. 2, pp. 1685-1694, Mar, 2009.
[34] T.-S. Lee, C.-C. Chiu, Y.-C. Chou et al., "Mining the
Customer Credit Using Classification and Regression Tree
and Multivariate Adaptive Regression Splines",
Computational Statistics & Data Analysis, vol. 50, no. 4, pp.
1113-1130, 2006.
[35] J. R. Quinlan, "Simplifying Decision Trees",
International Journal of Man-Machine Studies, vol. 27, pp.
221-234, 1987.
[36] J. Han, and M. Kamber, Data Mining: Concepts and
Techniques. Morgan Kaufmann Publishers, San Francisco,
CA, 2001.
[37] T. M. Mitchell, Machine Learning, WCB/McGraw-Hill,
Boston, Massachusetts, 1997.
[38] P-N Tan, M. Steinbach, and V. Kumar, Introduction to
Data Mining, Addison-Wesley, 2006.
[39] T. Poggio, and F. Girosi, "Networks for Approximation
and Learning", Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1990.
[40] V. N. Vapnik, Statistical Learning Theory, New York:
Wiley, 1998.
[41] S. le Cessie, and J. C. van Houwelingen, "Ridge
Estimators in Logistic Regression", Applied Statistics, vol.
41, no. 1, pp. 191-201, 1992.
[42] J. Platt, "Fast Training of Support Vector Machines
Using Sequential Minimal Optimization", in B. Schoelkopf,
C. Burges, and A. Smola, editors, Advances in Kernel
Methods - Support Vector Learning, 1998.
[43] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, and
K.R.K. Murthy, "Improvements to Platt's SMO Algorithm
for SVM Classifier Design", Neural Computation, vol. 13, no
3, 637-649, 2001.
[44] D. Aha, and D. Kibler, "Instance-based Learning
Algorithms", Machine Learning, vol. 6, pp. 37-66, 1991.