Proceedings of the 44th Hawaii International Conference on System Sciences - 2011

Comparisons of the Performance of Computational Intelligence Methods for Loan Granting Decisions

Jozef Zurada, University of Louisville, College of Business, [email protected]
K. Niki Kunene, University of Louisville, College of Business, [email protected]

Abstract

The importance to financial institutions of accurately evaluating the credit risk posed by their loan granting decisions cannot be overestimated; it is underscored by recent credit assessment failures that contributed greatly to the so-called "great recession" of the late 2000s. The paper compares the classification accuracy rates of several traditional and computational intelligence methods. We construct models and assess their classification accuracy rates on five diverse real-world data sets obtained from different loan granting decision areas. The results obtained from computer experiments provide fruitful ground for interpretation.

1. Introduction

The importance to financial institutions of accurately evaluating the credit risk posed by their loan granting decisions cannot be overestimated; it is underscored by recent credit assessment failures that contributed greatly to the so-called "great recession" of the late 2000s. Lending institutions have been using credit scoring models for a few decades [1]. The objective of credit scoring models is to assess the likelihood that a customer will default on a credit extension, allowing the credit granting institution to determine whether or not to extend credit to a customer. Typically, customers are grouped into "good credit" customers (those likely to repay credit extended to them) and "bad credit" customers (those likely to default on credit repayments). Getting these decisions right has broad financial implications for credit granting institutions as well as financial markets as a whole: failure to identify good credit risks leads to lost income, while extending credit to bad credit risks is a threat to profitability.
Studies have shown that even a 1% improvement in the accuracy of a credit scoring model can save millions of dollars in a large loan portfolio [1]. Thus, the need for accurate credit scoring models is essential in economies that rely on credit availability for daily economic activity. The research on credit scoring models continues to grow and explores a variety of methods, including survival analysis [2], linear discriminant analysis (LDA), logistic regression (LR) [3], k-nearest neighbor (kNN) [4], classification trees (CT), neural networks (NN) [1], [5], radial basis function neural networks (RBFNN), support vector machines (SVM) [6], [7], [8], [9], [10], [11], decision trees (DT) [12], [13], [14], ensemble techniques [15], [16], and genetic programming [17], [18], [19], [20]. Survival analysis has been used to predict the time to default or the time to early repayment [2]; other methods have focused on predicting the probability of defaulting or of not defaulting on a loan. Although the application of parametric statistical methods such as LDA (one of the first credit scoring models) to the categorical, non-normally distributed data found in credit data has been criticized [1], more recently LDA has been applied in conjunction with other models, for example with SVM providing the input data [21]. The application of NN models has been more widespread [5], [22]. NNs make no assumptions about the distribution of the data; some researchers have also investigated the accuracy of hybrid models combining NNs with other models such as decision tables [23], self-organizing maps [24], the k-nearest neighbor clustering algorithm, and multivariate adaptive regression splines (MARS) [25]. The use of SVM in credit scoring models is more recent [11].
Bellotti and Crook [11] compare SVM to LDA, LR, and kNN using a large data set and find that a large number of support vectors is required to achieve the best performance. Chrzanowska, Alfaro, and Witkowska [15] and [26] apply classification trees and ensemble classification models to credit scoring with success. Recently, some researchers have applied genetic algorithm (GA) or genetic programming models to credit scoring [17], [18], [19], [20], [27]. Finlay [20] found that genetic algorithms perform comparably with LR and linear regression models. The integration of multi-criteria decision making with machine learning models has been used in other applications [28], but it is new to credit scoring: Yu, Wang and Lai [29], in a technique similar to ensemble models, employ intelligent agents as decision experts that generate varying credit scoring judgments, which are subsequently fuzzified and aggregated into a group consensus decision. Ben-David and Frank [30] compared a number of the above models to expert systems (ES). In many of the aforementioned and similar studies, researchers compare a model of interest with multiple models, usually with respect to accuracy [31]. However, these comparisons are regularly made using a single data set, and may therefore be susceptible to the idiosyncrasies of the data set, its context, and the computational method. In this study we investigate the classification accuracy of six models on five data sets from different financial contexts: logistic regression (LR), neural network (NN), radial basis function neural network (RBFNN), support vector machine (SVM), k-nearest neighbor (kNN), and decision tree (DT).
For each of the six models and five data sets, 10-fold cross-validation is applied, and each experiment is repeated ten times to achieve reliable and unbiased error estimates. The classification accuracy rates across the 10 folds and 10 runs are averaged, and a 2-tailed paired t-test at α=0.05 is used to verify whether the classification accuracy rates at a 0.5 probability cut-off across the models and data sets are significantly different from those of the LR model, which is used as the baseline. This methodology is faithful to the recommendations of [32]. ROC charts are also employed to examine the performance of the models at probability cut-offs ≠ 0.5, which are more likely to be used by financial institutions. As indicated above and in Table 1, the models proposed in this study have already been applied successfully to credit scoring. The contribution of our study lies in the evaluation of these models on five varied real-world data sets, where each data set has different characteristics with respect to the type and number of variables, the distribution of "bad credit" and "good credit" samples in the data (Table 2), the extent of missing values, and the number of samples. The obtained results offer fruitful ground for interpretation (Tables 3-6). One can look at the efficiency of the models or the predictive power of the attributes contained in each of the five data sets. One can also examine the ROC curves to determine the efficiency of the models at various operating points (Figure 1). The paper is organized as follows. Section 2 briefly summarizes several previous studies regarding credit scoring and loan granting decisions. Section 3 discusses the six methods used. Section 4 presents the basic characteristics of the five data sets used, whereas section 5 describes the computer simulation and the results. Finally, section 6 concludes the paper and outlines possible future research in this area.

2.
Prior research

Machine learning techniques such as LR, NNs, RBFNNs, DTs, fuzzy systems (FS), neuro-fuzzy systems (NFS), GAs, and many other techniques have been applied to credit scoring problems in a number of studies. Most studies report classification accuracy rates obtained on different competing models and computer simulation scenarios; others concentrate on addressing the typical problems surrounding the nature of credit data, e.g. that the data is frequently highly unbalanced, and that part of the information may be missing. A few studies focus on the models' interpretability issues and feature reduction methods. Table 1 is a summary of a representative sample of some of the more recent research in the field.

Table 1. Summary of prior studies

[1] Methods: Five NN architectures (MLP, mixture-of-experts, RBFNN, learning vector quantization, and fuzzy adaptive resonance) and LDA, LR, kNN, kernel density estimation, and DTs. Data: German credit data (University of Hamburg) and Australian credit data. Results: 10-fold cross-validation used. Among the neural architectures, the mixture-of-experts and RBFNN did best, whereas among the traditional methods LR analysis was most accurate.

[23] Methods: LR, NN. Data: A data set from a UK financial institution. Results: The NN approach did not significantly outperform estimated proportional hazards models.

[5] Methods: NN. Data: The Australian credit data set. Results: A training-to-validation ratio of 300:390 (43.5%:56.5%) is the best training scheme on the data, and a single-hidden-layer NN outperforms a double-hidden-layer NN.

[30] Methods: ES, and NN, LR, Bayes, DT, kNN, SVM, CT, RBFNN. Data: A real-world data set from an Israeli financial institution. Results: When the problem is treated as regression, some machine learning models outperform the expert system's accuracy, but most models do not. When the same problem is treated as classification, no machine learning model outperforms the ES's accuracy.

[27] Methods: NN and GA. Data: Data set from the UCI Repository. Results: Using GA-based inverse classification allows creditors to suggest conditional acceptance and further explain the conditions used to reject applicants.

[6] Methods: Hybrid SVM using CART, MARS, and grid search. Data: A credit card data set from a bank in China. Results: The hybrid SVM has the best classification rate and the lowest Type II error in comparison with CART and MARS.

[9] Methods: SVM, NN. Data: A real-world data set from Taiwan. Results: SVM surpasses traditional NN models in generalization performance and visualization.

[15] Methods: CTs: boosting and bagging. Data: A real-world data set from a commercial bank in Poland. Results: The best performer is an ensemble classifier using boosting, in terms of accuracy and recognition of non-creditworthy borrowers.

[33] Methods: A proposed reassigning credit scoring model (RCSM) compared to LDA, LR, NN. Data: Credit card data set obtained from the UCI repository. Results: With RCSM, good credit records rejected by the NN are reassigned to the preferable accepted class using a CBR-based classification technique. RCSM outperforms LDA, LR, and NN in terms of accuracy, Type I, and Type II errors.

[34] Methods: CART, MARS, LDA, LR, SVM. Data: A real-world bank credit card data set from China. Results: CART and MARS outperform traditional LDA, LR, NN, and SVM in terms of accuracy.

[25] Methods: A hybrid NN/MARS model compared to LDA, LR, NN, MARS. Data: A real-world housing loan data set from a bank in China. Results: The hybrid NN outperforms results from LDA, LR, NN, and MARS.

[26] Methods: Subagged versions of kernel SVM, kNN, DTs, and Adaboost. Data: A real-world data set of IBM Italian customers. Results: Subagging, an ensemble classification technique for unbalanced data sets, improves the performance of the base classifier, and subagged decision trees obtain the best-performing classifier.

[11] Methods: SVM, LR, LDA, and k-nearest neighbors (kNN). Data: A very large real-world data set of 25000 records from a financial institution. Results: SVMs are comparatively successful at classifying credit card customers who default; unlike many other models, a large number of support vectors is required to achieve the best performance.

[13], [14] Methods: LR, NN, DT, MBR, and an ensemble model. Data: Data sets 3 and 4 from Table 2. Results: Both are comparative studies. DTs did best classification-wise; DTs are attractive tools as they can generate easy-to-interpret if-then rules.

[29] Methods: Individual and ensemble methods for MLR, LR, NN, SVM, and RBFNN; the ensemble models' decisions are based on fuzzy voting and averaging. Data: Three data sets, including modified data set 1 (without missing values) and data set 3 from Table 2. Results: The fuzzy group decision making (GDM) model outperformed the other models on all 3 data sets.

3. Description of the methods used in this study

This section briefly describes the six methods, i.e., DTs, RBFNNs, SVMs, kNN, LR, and NNs, used in this study.

3.1. Decision trees

The operation of DTs is based on the ID3 or C4.5 divide-and-conquer algorithms [35] and search heuristics which make the clusters at each node gradually purer by progressively reducing disorder in the original data set. The algorithms place the attribute that has the most predictive power at the top node of the tree, and they have to find the optimum number of splits and determine where to partition the data to maximize the information gain. The fewer the splits, the more explainable the output is, as there are fewer rules to understand. Selecting the best split is based on the degree of impurity of the child nodes. For example, a node which contains only cases of class good_loan or class bad_loan has the smallest disorder = 0. Similarly, a node that contains an equal number of cases of class good_loan and class bad_loan has the highest disorder = 1. Disorder is measured by the well-established concepts of entropy and information gain, which we formally introduce below.

Given a collection S, containing the positive (good_loan) and negative (bad_loan) examples of some target concept, the entropy of S relative to this Boolean classification is

Entropy(S) ≡ -p_good_loan log2(p_good_loan) - p_bad_loan log2(p_bad_loan)    (1)

where p_good_loan is the proportion of positive examples in S and p_bad_loan is the proportion of negative examples in S. If the output variable takes on k different values, then the entropy of S relative to this k-wise classification is defined as

Entropy(S) = -Σ_{i=1}^{k} p_i log2(p_i)    (2)

If disorder is measured by entropy, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, can be computed as

Gain(S, A) ≡ Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)    (3)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has the value v (i.e., S_v = {s ∈ S | A(s) = v}). For more details on DTs, refer to [36], [37], and [38].

3.2.
Radial Basis Function Neural Network

An RBFNN differs from a feed-forward NN with back-propagation in the way the hidden neurons perform computations [39]. Each neuron represents a point in input space, and its output for a given training pattern depends on the distance between its point and the pattern. The closer these two points are, the stronger the activation. The RBFNN uses nonlinear bell-shaped Gaussian activation functions whose width may be different for each neuron. The output layer forms a linear combination of the outputs of the neurons in the hidden layer, which is fed to the sigmoid function. The network learns two sets of parameters: the centers and widths of the Gaussian functions, obtained by clustering, and the weights used to form the linear combination of the outputs obtained from the hidden layer. As the first set of parameters can be obtained independently of the second set, an RBFNN learns almost instantly if the number of hidden units is much smaller than the number of training patterns. Unlike a feed-forward NN with back-propagation, however, the RBFNN cannot learn to ignore irrelevant attributes because it gives them the same weight in distance computations.

3.3. Support vector machines

The support vector machine (SVM), originally developed by Vapnik, is a system that represents a blend of linear modeling and instance-based learning to implement nonlinear class boundaries [40]. This system chooses several critical boundary patterns called support vectors for each class (bad loan and good loan of the output variable) and creates a linear discriminant function that separates them as widely as possible by applying linear, quadratic, cubic, or higher-order polynomial decision boundaries.
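As a concrete illustration of the support-vector discriminant described here, the sketch below evaluates a decision value of the form b + Σ_i α_i y_i (a(i) · a)^n for a test pattern. The support vectors, multipliers, and bias are hypothetical values chosen purely for illustration, not parameters fitted by an actual SVM training algorithm.

```python
import numpy as np

def poly_svm_decision(a, support_vectors, y, alpha, b, n=2):
    """Evaluate b + sum_i alpha_i * y_i * (a(i) . a)^n (polynomial kernel)."""
    k = (support_vectors @ a) ** n      # (a(i) . a)^n for every support vector
    return b + np.sum(alpha * y * k)

# Hypothetical support vectors, class values, and multipliers, for illustration only.
sv = np.array([[1.0, 0.5], [-0.5, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.8, 0.8])
x = poly_svm_decision(np.array([0.5, 0.5]), sv, y, alpha, b=0.1, n=2)
# A test pattern is then assigned to the positive class when x >= 0.
```

A real classifier would obtain sv, alpha, and b from the constrained quadratic optimization discussed below; only the decision-function evaluation is shown here.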
A hyperplane that gives the greatest separation between the classes is called the maximum margin hyperplane and takes the form

x = b + Σ_i α_i y_i (a(i) · a)^n    (5)

where i indexes the support vectors, y_i is the class value of training pattern a(i), while b and α_i are parameters determined by the learning algorithm. The vectors a and a(i) represent a test pattern and the support vectors, respectively, while the expression (a(i) · a)^n, which computes the dot product of the test pattern with one of the support vectors and raises the result to the power n, is called a polynomial kernel. Other kernel functions could also be used to implement a different nonlinear mapping. Constrained quadratic optimization is applied to find the support vectors for the pattern sets as well as the parameters b and α_i. Compared with DTs, for example, SVMs are slow, but they often yield accurate classifiers because they create subtle and complex decision boundaries.

3.4. K-nearest neighbor

Broadly construed, kNN is the method of solving new problems based on the solutions of similar past cases [37]. The method requires no model to be fitted or function to be estimated. Instead, it requires all cases with their known solutions to be maintained in memory; when a prediction is required, the method recalls items from memory and predicts the value of the target. In solving a new case, the kNN approach retrieves a case it deems sufficiently similar and uses that case as a basis for solving the new case. The method uses a k-nearest neighbor algorithm to classify cases. The algorithm takes a data set of existing cases and a new case to be classified, where each existing case in the data set is composed of a set of variables and the new case has one value for each variable. The algorithm computes the normalized Euclidean or Manhattan distance for numeric attributes, or the Hamming distance for nominal or ordinal attributes, between each existing case and the new case (to be classified).
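A minimal sketch of this mixed distance computation (our own illustration, not the Weka implementation): numeric attributes contribute a range-normalized squared difference, while nominal attributes contribute a 0/1 Hamming term. The applicant attributes and ranges below are hypothetical.

```python
import math

def mixed_distance(case_a, case_b, numeric_ranges):
    """Normalized Euclidean distance for numeric attributes plus a
    Hamming (0/1 mismatch) term for nominal attributes."""
    total = 0.0
    for a, b, attr_range in zip(case_a, case_b, numeric_ranges):
        if attr_range is None:      # nominal attribute: Hamming contribution
            total += 0.0 if a == b else 1.0
        else:                       # numeric attribute: normalize by its range
            total += ((a - b) / attr_range) ** 2
    return math.sqrt(total)

# Hypothetical applicant records: (loan amount, years employed, job category).
ranges = [50000.0, 40.0, None]      # None marks the nominal attribute
d = mixed_distance((12000, 5, "office"), (15000, 7, "manual"), ranges)
```

The k cases with the smallest such distances then become the neighbors that vote on the class, as described next.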
The k existing cases that have the smallest distances to the new case are the k-nearest neighbors to that case. Based on their target values, each of the k-nearest neighbors votes on the target value for the new case. The votes form the posterior probabilities for the class dependent variable. There are two challenging tasks in the successful application of kNN, i.e., choosing the right value for k and the proper distance measure.

3.5. Logistic regression

The purpose of the logistic regression model is to obtain a regression equation that can predict in which of two or more groups an object should be placed (i.e., whether a loan should be classified as a good loan or a bad loan). Logistic regression thus attempts to predict the probability that a binary or ordinal target will acquire the event of interest (e.g., loan payoff or loan default) as a function of one or more independent variables (e.g., amount of loan, borrower job category, reason for loan). The logit model is represented by the logistic response function P(y) of the form

P(y) = 1 / (1 + e^(-z)),  where  z = b_0 + Σ_{i=1}^{m} b_i x_i    (4)

The function P(y) describes a dependent variable y containing two or more qualitative outcomes. z is a function of m independent variables x, called predictors, and b represents the parameters. The x variables can be categorical or continuous variables of any distribution. The value of P(y), which varies from 0 to 1, denotes the probability that the dependent variable y belongs to one of two or more groups. The principle of maximum likelihood is commonly used to compute estimates of the b parameters. This means that the calculations involve an iterative process of improving approximations for the estimates until no further changes can be made. Unlike neural networks, logistic regression models are designed to predict one dependent variable at a time. On the positive side, one can note that logistic regression output provides statistics on each variable included in the model.
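The logistic response function above can be evaluated directly; the coefficients and predictor values below are hypothetical, chosen purely for illustration.

```python
import math

def logistic_response(x, b):
    """P(y) = 1 / (1 + exp(-z)) with z = b_0 + sum_i b_i * x_i."""
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: intercept, scaled loan amount, years employed.
p_default = logistic_response([0.4, 6.0], b=[-1.0, 0.8, -0.1])
```

In a fitted model the b parameters would come from maximum likelihood estimation; here they simply demonstrate how a probability between 0 and 1 is produced.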
Researchers can then analyze these statistics to test the usefulness of specific information.

3.6. Neural networks

Neural networks are mathematical models inspired by the architecture of the human brain. They are nonlinear systems built of highly interconnected neurons that process information. The most attractive features of these networks are their ability to adapt, generalize, and learn from training patterns. Neural network models are characterized by three properties: the computational property, the architecture of the network, and the learning property. A typical neuron contains a summation node and an activation function. A neuron accepts input vectors called training patterns. Neurons are organized in layers and are connected by weights represented by small numerical values. The most common type of neural network architecture is a two-layer feed-forward neural network with error back-propagation, which is typically used for prediction and classification tasks. Most commonly, the network has two layers: a hidden layer and an output layer. The neurons in the hidden layer receive the values of the input vectors and propagate them concurrently to the output layer. Neural network learning is a process in which a set of input vectors is presented sequentially and repeatedly to the input of the network in order to adjust its weights in such a way that similar inputs give the same output. In supervised learning, the training set consists of training patterns/examples that appear at the input of the neural network and the corresponding desired responses provided by a teacher. The differences between the desired response and the network's actual response for each training pattern modify the weights of the network in all layers.
The training continues until the mean sum of squares of the network errors over the entire training set is reduced to a sufficiently small value close to zero.

4. Data sets used in the study

Almost all data sets used in a loan decision context, including the five data sets used in this study, contain information about loan applicants whom financial institutions considered to be creditworthy individuals, as all of them obtained a loan. Other applicants did not qualify for a loan at the time of their application, and they are not included in the data sets used for modeling. Simply, we do not know whether these applicants would have paid the loan off or defaulted on it, had the loan been granted. Though this situation does not affect the validity of the analysis, it should be kept in mind. The five data sets used in this study are drawn from different financial contexts, and they describe financial, family, social, and personal characteristics of loan applicants. In two of the five data sets the names of the attributes have not been revealed because of confidentiality issues. Two of the data sets are publicly available at the UCI Machine Learning Repository at http://www.ics.uci.edu/~mlearn/databases/. Table 2 presents the general features of each of the five data sets.

Table 2. The general characteristics of the five data sets used in computer simulation

Data set   # of cases   # of variables   Class values of target variable (B: bad loans, G: good loans)
1          690          16               B: 383, G: 307
2          252          14               B: 71, G: 181
3          1000         21               B: 300, G: 700
4          5960         13               B: 1189, G: 4771
5          3364         13               B: 300, G: 3064

Data set 1: The Quinlan data set, used in a number of studies, describes financial attributes of Japanese customers. It is available at the UCI Machine Learning Repository. The names of the attributes are not revealed. It is well balanced, with bad loans slightly overrepresented. It contains numeric and nominal variables, and there are some missing values.

Data set 2: The names of the attributes are not available. It is an unbalanced data set in which bad loans are underrepresented. It includes only financial data of loan applicants, and there are no missing values.

Data set 3: A data set from a German financial institution used in a number of studies. It is an unbalanced data set in which bad loans are underrepresented. It contains both numeric and nominal variables. The names of the attributes are available, and there are no missing values.

Data set 4: An unbalanced data set in which bad loans are underrepresented by a ratio of about 1:4. The attributes describe financial, family, social, and personal characteristics of loan applicants. It contains a large number of missing values, which have not been replaced. The data set is available from the SAS company.

Data set 5: A very unbalanced data set in which bad loans are significantly underrepresented by a ratio of about 1:10. The attributes describe financial, family, social, and personal characteristics of loan applicants. It is obtained from data set 4 by removing all cases with missing values.

5. Model and parameter settings, experiments and results

5.1. Model and parameter settings

The computer simulation for this study was performed in Weka 3.7, written in Java (www.cs.waikato.ac.nz/ml/weka/). In this study the LR uses a quasi-Newton method with a ridge estimator for parameter optimization [41]. The RBFNN implements a normalized Gaussian radial basis function network. It uses the k-means clustering algorithm to provide the basis functions and learns a logistic regression on top of them. Symmetric multivariate Gaussians are fit to the data from each cluster.
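The two-stage structure just described (Gaussian basis functions obtained from clustering, then a logistic output on top) can be sketched as follows. The centers, width, and output weights below are hypothetical stand-ins for the values Weka would learn via k-means and logistic regression.

```python
import numpy as np

def rbf_features(X, centers, width):
    """Hidden-layer activations: Gaussian of the squared distance to each center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def rbf_output(X, centers, width, w, b):
    """Linear combination of the hidden activations passed through a sigmoid."""
    h = rbf_features(X, centers, width)
    return 1.0 / (1.0 + np.exp(-(h @ w + b)))

# Hypothetical centers and output weights; in the Weka setup the centers come
# from k-means and (w, b) from the logistic regression fitted on top.
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
p = rbf_output(np.array([[0.1, 0.0], [0.9, 1.1]]), centers,
               width=0.5, w=np.array([2.0, -2.0]), b=0.0)
```

Patterns near the first center receive a probability above 0.5 and patterns near the second center a probability below it, illustrating how proximity to a center drives the activation.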
The minimum standard deviation for the clusters varied between 0.2 and 1, and the number of clusters varied from 2 to 10 across the 5 data sets. The standard 2-layer feed-forward NN with back-propagation is used. Momentum was set to 0.2, and the learning rate was initially set to 0.3. A decay parameter, which causes the learning rate to decrease, was enabled; this may help to stop the network from diverging from the target output as well as improve general performance. The number of neurons in the hidden layer was computed as a = (number of attributes + number of classes)/2. For a 2-class target attribute, a = (number of attributes + 2)/2 = number of attributes/2 + 1, and depending on the data set it varied from 15 to 23. The SVM implements Platt's sequential minimal optimization (SMO) algorithm for training a support vector classifier [42], [43]. The complexity parameter and the power of the polynomial kernels varied from 1 to 2. The kNN implements a k-nearest neighbor classifier (k=10) according to the algorithm presented by [44]. The Euclidean distance measure is used to determine the similarity of the samples. The inverse normalized distance weighting method and the brute force linear search algorithm are used to search for the nearest neighbors. The DT generates a pruned C4.5 decision tree [35]. The confidence factor that determines the amount of pruning is set to 0.2 (the default is 0.25); smaller values of the confidence factor incur more pruning. In all 5 data sets, missing values for numeric attributes were replaced with the mean value of the attribute, and missing values for nominal attributes were replaced with the mode value of the attribute for the given class. Multi-valued nominal attributes are transformed into binary attributes, replacing each nominal attribute with m values by m-1 binary attributes. No samples were allocated for a validation data set.
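A minimal sketch of this preprocessing: mean imputation for numeric attributes, mode imputation for nominal ones (the paper imputes the mode per class; this sketch uses the global mode for brevity), and expansion of an m-valued nominal attribute into m-1 binary attributes. The toy records are hypothetical.

```python
from collections import Counter

def impute_and_encode(rows, numeric_cols, nominal_cols):
    """Replace missing numerics (None) with the column mean and missing
    nominals with the mode, then expand each m-valued nominal attribute
    into m-1 binary attributes (the first level is dropped)."""
    means = {c: sum(r[c] for r in rows if r[c] is not None)
                / sum(1 for r in rows if r[c] is not None)
             for c in numeric_cols}
    modes = {c: Counter(r[c] for r in rows if r[c] is not None).most_common(1)[0][0]
             for c in nominal_cols}
    levels = {c: sorted({r[c] if r[c] is not None else modes[c] for r in rows})
              for c in nominal_cols}
    encoded = []
    for r in rows:
        out = [r[c] if r[c] is not None else means[c] for c in numeric_cols]
        for c in nominal_cols:
            val = r[c] if r[c] is not None else modes[c]
            out.extend(1 if val == lv else 0 for lv in levels[c][1:])  # m-1 dummies
        encoded.append(out)
    return encoded

# Hypothetical records: column 0 is a numeric loan amount, column 1 a job category.
rows = [{0: 1000.0, 1: "office"}, {0: None, 1: "manual"}, {0: 3000.0, 1: None}]
enc = impute_and_encode(rows, numeric_cols=[0], nominal_cols=[1])
```

Dropping one level when creating the dummies avoids the redundant column that a full one-hot encoding would introduce.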
Depending on the method, the values of the attributes were also normalized to the [-1, 1] range or to zero mean and unit variance.

5.2. Experiments and results

We investigate the classification accuracy of six models on five data sets from different financial contexts: LR, NN, RBFNN, SVM, kNN, and DT. For each of the six models and five data sets, 10-fold cross-validation is applied, and each experiment is repeated 10 times to achieve reliable and unbiased error estimates. The classification accuracy rates across the 10 folds and 10 runs are averaged, and a 2-tailed paired t-test at α=0.05 is used to verify whether the classification accuracy rates at a 0.5 probability cut-off across the models and data sets are significantly different from those of the LR model, which is used as the baseline. This methodology is faithful to the recommendations of [32]. A ROC chart is also employed to examine the performance of the models at probability cut-offs ≠ 0.5, which are more likely to be used by financial institutions. For example, a 0.3 cut-off means that the Type II error (classifying a bad loan as a good loan) is about 3.3 times more costly than the Type I error (classifying a good loan as a bad loan). This cut-off may be applicable to situations in which banks do not secure smaller loans, i.e., do not hold any collateral, whereas a 0.7 cut-off implies that the cost of making a Type I error is smaller than the cost of a Type II error. This cut-off may typically be used when a financial institution secures larger loans by holding collateral such as a customer's home. The results obtained offer fruitful ground for investigations which could go in several directions. There are several dimensions to consider: (1) the methods, (2) the data sets, (3) the classification performance overall and for bad loans and good loans at the 0.5 probability cut-off, and (4) the area under the ROC curve and the ROC curves themselves, which allow one to examine the models' performance at cut-offs different from 0.5.
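The evaluation protocol described above (ten repetitions of 10-fold cross-validation, then a paired t-test of each model against the LR baseline) can be sketched as follows. The per-fold accuracies at the end are toy numbers; a full analysis would compare |t| with the critical value for the appropriate degrees of freedom, and ideally apply a variance correction suited to repeated cross-validation.

```python
import math
import random

def paired_t(acc_a, acc_b):
    """Paired t statistic over matched per-fold accuracy pairs."""
    d = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n)

def ten_by_tenfold_indices(n_samples, seed=0):
    """Yield held-out index folds for 10 repetitions of 10-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(10):                  # 10 repetitions with fresh shuffles
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for k in range(10):              # 10 folds per repetition
            yield idx[k::10]             # the k-th held-out fold

folds = list(ten_by_tenfold_indices(100))          # 100 held-out folds in total
t = paired_t([0.85, 0.86, 0.84], [0.83, 0.85, 0.82])  # toy per-fold accuracies
```

In the paper's setup each of the six models would be trained and scored on all 100 train/test splits, and the resulting accuracy vectors compared pairwise against LR.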
For example, one may want to compare the performance of the six methods on the five data sets in an attempt to identify the one or two methods that work best across all data sets. One may also look at the five data sets to identify the one or two data sets which contain the best selection of features for reliable loan granting decisions. If detecting bad loans is of paramount interest, one could concentrate on finding the best model which does exactly that, etc. Due to space constraints, we leave most of these considerations to the reader and give only a brief interpretation of the results. The classification accuracy rates are reported in Tables 3 through 5, and the area under the ROC curve, which reflects the general detection power of the models, is presented in Table 6. Generally, one can see that the overall performance of five of the six models is the highest and most stable on data set 1. This data set seems to contain the right balance of bad and good loans, with bad loans slightly overrepresented (Table 2). The results presented in Table 6 confirm these observations. Looking at each of Tables 3 through 6, however, enables one to draw more subtle conclusions. Table 3 presents the overall percentage classification accuracy rates. Only RBFNNs appear to perform significantly worse than LR and the other methods for data set 1. For data set 5, the NN, RBFNN, SVM, and kNN methods classify worse than LR. Similar patterns can be observed for the 3 remaining data sets. However, DTs seem to outperform LR and the remaining methods on data sets 4 and 5, both of which are highly unbalanced (Table 2). The best overall classification accuracy rate across all six models is for data set 5, in which the bad loan class is underrepresented by a ratio of 1:10; this occurred mainly because good loans have been classified almost perfectly. Table 3.
Overall correct classification accuracy rates [%] for the 6 models at a 0.5 probability cut-off

Data set   LR     NN      RBFNN   SVM     kNN     DT
1          85.3   85.5    79.5w   84.9    86.1    85.6
2          78.0   78.6    74.5w   75.6w   79.0    75.5w
3          75.6   75.2    73.5w   75.5    72.9w   71.6w
4          83.6   86.9b   83.5    82.9    78.9w   88.6b
5          92.5   92.1w   91.1w   91.2w   91.5w   94.4b
w/b: significantly worse/better than LR at the 0.05 significance level.

Table 4. Bad loans correct classification accuracy rates [%] for the 6 models at a 0.5 probability cut-off

Data set   LR     NN      RBFNN   SVM     kNN     DT
1          86.4   84.2w   65.0w   92.1b   86.1    84.1w
2          45.5   35.5w   29.8w   48.7    40.3w   37.0w
3          49.0   50.3b   42.2w   47.8w   27.1w   41.0w
4          30.4   59.0b   35.5    18.5w   31.6    54.8b
5          22.7   14.2w   5.1w    1.4w    6.1w    47.3b
w/b: significantly worse/better than LR at the 0.05 significance level.

Table 5. Good loans correct classification accuracy rates [%] for the 6 models at a 0.5 probability cut-off

Data set   LR     NN      RBFNN   SVM     kNN     DT
1          84.5   86.6    91.2b   79.1w   86.1b   86.7b
2          90.6   95.2b   91.8    85.9w   93.9b   90.3
3          87.0   85.8w   86.9    87.4    92.5b   84.8w
4          96.9   93.8w   94.3w   98.9b   90.0w   97.4b
5          99.4   99.7b   99.5b   100.0b  99.9b   99.0w
w/b: significantly worse/better than LR at the 0.05 significance level.

Table 6. The area under the ROC curve for the 6 models

Data set   LR     NN      RBFNN   SVM     kNN     DT
1          84.5   86.6    91.2b   79.1w   86.1b   86.7b
2          90.6   95.2b   91.8    85.9w   93.9b   90.3
3          87.0   85.8w   86.9    87.4    92.5b   84.8w
4          96.9   93.8w   94.3w   98.9b   90.0w   97.4b
5          99.4   99.7b   99.5b   100.0b  99.9b   99.0w
w/b: significantly worse/better than LR at the 0.05 significance level.

Table 4 shows that all models classify bad loans consistently poorly on data sets 2 through 5. For these four data sets, the best and the worst classification accuracy rates are 59.0% (NN) and 1.4% (SVM). The latter does not appear to tolerate highly unbalanced data sets well. However, for data set 1, in which bad loans are slightly overrepresented, SVMs exhibit an extraordinary performance of 92.1%.
This is important, especially when detecting bad loans is the target event. DTs and NNs seem to be the most efficient classifiers of bad loans for data sets 4 and 5, in which bad loans are underrepresented by ratios of 1:4 and 1:10, respectively. For four out of five data sets, the kNN method generally outperforms the remaining methods in detecting good loans (Table 5). It is also evident that for data sets 4 and 5, in which good loans substantially outnumber bad loans, the models' classification performance for good loans is well above 90%.

The area under the ROC curve is an important measure as it illustrates the overall performance of the models at various operating points. For example, if the target event is detecting a bad loan and misclassifying a bad loan as a good loan is 3 times more costly than misclassifying a good loan as a bad loan, a lending institution may choose to use a 0.3 cutoff as the decision threshold. This means that a customer whose probability of defaulting on a loan is ≥0.3 will be denied the loan. Table 6 shows that data set 1 appears to contain the best attributes for distinguishing between good and bad loans across all six models. The RBFNN, SVM, and DT models significantly underperform the remaining three models. It is also apparent that more experiments are needed to find the best settings for the parameters of RBFNN and SVM. To avoid clutter on the ROC charts and make them more transparent, we illustrate the performance of only the best three models, LR, NN, and kNN, for data set 1 (Figure 1). The three curves overlap to a large extent at most operating points, exhibiting very good classification ability. However, kNN appears to outperform LR and NN at operating points within the range [0.6, 0.7]. A similar ROC chart could be developed for bad loans as well.

Figure 1. The ROC charts for the LR, NN, and kNN models for data set 1 (Quinlan). [The chart plots sensitivity against 1 − specificity for the "loan granted" event; the approximate operating point of interest is 0.6–0.7.]

6. Conclusions and suggestions for future research

This study assesses the classification accuracy rates of six models on five versatile real-world data sets obtained from different financial fields. The Quinlan data set 1 appears to contain the best attributes for building effective models to classify consumer loans into the bad and good categories. Increasing the complexity parameter C in the SVM models improves the classification accuracy rates for bad loans slightly but steadily, but it significantly extends the time needed to build the models. In general, the SVM method does not do well when one of the classes (bad loans in this case) is heavily underrepresented, as in data set 5. More experimentation is needed with the parameter settings for the RBFNN, SVM, and kNN models, which have proved to be very efficient classifiers in many other applications. It is also advisable to explore feature reduction methods for possible enhancement of the results, as well as to investigate more thoroughly DTs, which can generate if-then rules, to better interpret the models.

7. References

[1] D. West, "Neural Network Credit Scoring Models", Computers & Operations Research, vol. 27, no. 11-12, pp. 1131-1152, Sep-Oct, 2000.

[2] M. Stepanova, and L. Thomas, "Survival Analysis Methods for Personal Loan Data", Operations Research, vol. 50, no. 2, pp. 277-289, Mar-Apr, 2002.

[3] S. Y. Sohn, and H. S. Kim, "Random Effects Logistic Regression Model for Default Prediction of Technology Credit Guarantee Fund", European Journal of Operational Research, vol. 183, no. 1, pp. 472-478, Nov, 2007.

[4] A. Laha, "Building Contextual Classifiers by Integrating Fuzzy Rule Based Classification Technique and k-NN Method for Credit Scoring", Advanced Engineering Informatics, vol. 21, no. 3, pp. 281-291, Jul, 2007.

[5] A. Khashman, "A Neural Network Model for Credit Risk Evaluation", International Journal of Neural Systems, vol. 19, no. 4, pp. 285-294, Aug, 2009.

[6] W. M. Chen, C. Q. Ma, and L. Ma, "Mining the Customer Credit Using Hybrid Support Vector Machine Technique", Expert Systems with Applications, vol. 36, no. 4, pp. 7611-7616, May, 2009.

[7] C. F. Tsai, "Financial Decision Support Using Neural Networks and Support Vector Machines", Expert Systems, vol. 25, no. 4, pp. 380-393, Sep, 2008.

[8] L. G. Zhou, K. K. Lai, and L. A. Yu, "Credit Scoring Using Support Vector Machines with Direct Search for Parameters Selection", pp. 149-155, 2009.

[9] S. T. Li, W. Shiue, and M. H. Huang, "The Evaluation of Consumer Loans Using Support Vector Machines", Expert Systems with Applications, vol. 30, no. 4, pp. 772-782, May, 2006.

[10] S. T. Luo, B. W. Cheng, and C. H. Hsieh, "Prediction Model Building with Clustering-launched Classification and Support Vector Machines in Credit Scoring", Expert Systems with Applications, vol. 36, no. 4, pp. 7562-7566, May, 2009.

[11] T. Bellotti, and J. Crook, "Support Vector Machines for Credit Scoring and Discovery of Significant Features", Expert Systems with Applications, vol. 36, pp. 3302-3308, 2009.

[12] A. Owen, "Data Squashing by Empirical Likelihood", Data Mining and Knowledge Discovery, vol. 7, no. 1, pp. 101-113, Jan, 2003.

[13] J. Zurada, "Rule Induction Methods for Credit Scoring", Review of Business Information Systems, vol. 11, no. 2, pp. 11-22, 2007.

[14] J. Zurada, "Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions?", Proceedings of the 43rd Hawaii International Conference on System Sciences (HICSS), (R. Sprague, Ed.), IEEE Computer Society Press, Hawaii, January 5-8, 2010.
[15] M. Chrzanowska, E. Alfaro, and D. Witkowska, "The Individual Borrowers Recognition: Single and Ensemble Trees", Expert Systems with Applications, vol. 36, no. 3, pp. 6409-6414, Apr, 2009.

[16] D. West, S. Dellana, and J. X. Qian, "Neural Network Ensemble Strategies for Financial Decision Applications", Computers & Operations Research, vol. 32, no. 10, pp. 2543-2559, Oct, 2005.

[17] P. G. Espejo, S. Ventura, and F. Herrera, "A Survey on the Application of Genetic Programming to Classification", IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, to be published.

[18] C. S. Ong, J. J. Huang, and G. H. Tzeng, "Building Credit Scoring Models Using Genetic Programming", Expert Systems with Applications, vol. 29, no. 1, pp. 41-47, Jul, 2005.

[19] J. J. Huang, G. H. Tzeng, and C. S. Ong, "Two-stage Genetic Programming (2SGP) for the Credit Scoring Model", Applied Mathematics and Computation, vol. 174, no. 2, pp. 1039-1053, Mar, 2006.

[20] S. Finlay, "Are We Modeling the Right Thing? The Impact of Incorrect Problem Specification in Credit Scoring", Expert Systems with Applications, vol. 36, no. 5, pp. 9065-9071, Jul, 2009.

[21] K. B. Schebesch, and R. Stecking, "Support Vector Machines for Classifying and Describing Credit Applicants: Detecting Typical and Critical Regions", Journal of the Operational Research Society, vol. 56, no. 9, pp. 1082-1088, Sep, 2005.

[22] K. A. Smith, Introduction to Neural Networks and Data Mining for Business Applications, Australia: Eruditions Publishing, 1999.

[23] B. Baesens, T. Van Gestel, M. Stepanova et al., "Neural Network Survival Analysis for Personal Loan Data", Journal of the Operational Research Society, vol. 56, no. 9, pp. 1089-1098, Sep, 2005.

[24] J. Huysmans, B. Baesens, J. Vanthienen et al., "Failure Prediction with Self Organizing Maps", Expert Systems with Applications, vol. 30, no. 3, pp.
479-487, Apr, 2006.

[25] T. S. Lee, and I. F. Chen, "A Two-stage Hybrid Credit Scoring Model Using Artificial Neural Networks and Multivariate Adaptive Regression Splines", Expert Systems with Applications, vol. 28, no. 4, pp. 743-752, May, 2005.

[26] G. Paleologo, A. Elisseeff, and G. Antonini, "Subagging for Credit Scoring Models", European Journal of Operational Research, vol. 201, no. 2, pp. 490-499, Mar, 2010.

[27] M. C. Chen, and S. H. Huang, "Credit Scoring and Rejected Instances Reassigning Through Evolutionary Computation Techniques", Expert Systems with Applications, vol. 24, no. 4, pp. 433-441, May, 2003.

[28] K. N. Kunene, and H. R. Weistroffer, "An Approach for Predicting and Describing Patient Outcome Using Multicriteria Decision Analysis and Decision Rules", European Journal of Operational Research, vol. 185, no. 3, pp. 984-997, 2008.

[29] L. Yu, S. Y. Wang, and K. K. Lai, "An Intelligent-agent-based Fuzzy Group Decision Making Model for Financial Multicriteria Decision Support: The Case of Credit Scoring", European Journal of Operational Research, vol. 195, pp. 942-959, Jun 16, 2009.

[30] A. Ben-David, and E. Frank, "Accuracy of Machine Learning Models versus "Hand Crafted" Expert Systems - A Credit Scoring Case Study", Expert Systems with Applications, vol. 36, no. 3, pp. 5264-5271, Apr, 2009.

[31] B. Baesens, R. Setiono, C. Mues et al., "Using Neural Network Rule Extraction and Decision Tables for Credit-risk Evaluation", Management Science, vol. 49, no. 3, pp. 312-329, 2003.

[32] I. H. Witten, and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2005.

[33] C. L. Chuang, and R. H. Lin, "Constructing a Reassigning Credit Scoring Model", Expert Systems with Applications, vol. 36, no. 2, pp. 1685-1694, Mar, 2009.

[34] T.-S. Lee, C.-C. Chiu, Y.-C. Chou et al., "Mining the Customer Credit Using Classification and Regression Tree and Multivariate Adaptive Regression Splines", Computational Statistics & Data Analysis, vol. 50, no. 4, pp. 1113-1130, 2006.

[35] J. R. Quinlan, "Simplifying Decision Trees", International Journal of Man-Machine Studies, vol. 27, pp. 221-234, 1987.

[36] J. Han, and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 2001.

[37] T. M. Mitchell, Machine Learning, WCB/McGraw-Hill, Boston, Massachusetts, 1997.

[38] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006.

[39] T. Poggio, and F. Girosi, "Networks for Approximation and Learning", Proceedings of the IEEE, vol. 78, pp. 1481-1497, 1990.

[40] V. N. Vapnik, Statistical Learning Theory, New York: Wiley, 1998.

[41] S. le Cessie, and J. C. van Houwelingen, "Ridge Estimators in Logistic Regression", Applied Statistics, vol. 41, no. 1, pp. 191-201, 1992.

[42] J. Platt, "Fast Training of Support Vector Machines Using Sequential Minimal Optimization", in B. Schoelkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, 1998.

[43] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy, "Improvements to Platt's SMO Algorithm for SVM Classifier Design", Neural Computation, vol. 13, no. 3, pp. 637-649, 2001.

[44] D. Aha, and D. Kibler, "Instance-based Learning Algorithms", Machine Learning, vol. 6, pp. 37-66, 1991.