Maximize AUC in Default Prediction: Modeling and Blending
Liang Sun
[email protected]
Tomonori Honda
[email protected]
Vesselin Diev, Gregory Gancarz, Jeong-Yoon Lee, Ying Liu, Mona Mahmoudi, Raghav Mathur,
Shahin Rahman, Steve Wickert, Xugang Ye, Hang Zhang
Abstract
In this paper we present models and blending algorithms to maximize the Area Under the Receiver Operating Characteristic (ROC) curve in default prediction. We summarize all techniques and algorithms we applied in the Give Me Some Credit competition, including feature creation algorithms, single model constructions, and blending algorithms, so that future practitioners can benefit from our experience. In particular, we highlight the following aspects: (i) the different feature creation methods we explored, (ii) the diverse packages utilized for single model building and the parameter tuning for each model, (iii) different blending algorithms, and (iv) additional postprocessing methods.
1. Introduction
The recent Give Me Some Credit (GMSC) competition¹ organized by Kaggle focused on predicting the probability that bank customers would experience financial distress in the next two years. This is a very interesting classification problem with a limited number (11) of original features and a low target rate (about 5%), where the target represents customers who actually default. The approaches developed for this competition are applicable not only to the credit and risk community, but also to other industries with similar classification problems.
In this paper, we summarize all techniques and algorithms we applied in the GMSC competition, including feature creation, single models, and blending models. In particular, we would like to highlight (i) different feature creation methods, (ii) diverse packages and algorithms utilized for single model building, (iii) different blending algorithms, and (iv) additional postprocessing of these models.

¹ http://www.kaggle.com/c/GiveMeSomeCredit

Appearing in Proceedings of the 1st Technical and Analytical Conference of Opera Solutions, San Diego, USA, 2012.
During the GMSC competition, several data sets were created. One of the main differences between these data sets was the handling of missing values and outliers. Additionally, many data sets transformed the original features into weights of evidence. In particular, a novel and effective variable, the 2D weight of evidence, was created to capture the underlying information. Based on the created data sets, many different single models were developed, e.g., random forest, gradient boosting decision tree, alternating decision tree, logistic regression, artificial neural networks, support vector machine, elastic net, and k-nearest neighbors. In addition, we applied residual postprocessing to improve the performance of single models. For example, the residual of the Gradient Boosted Machine (GBM) model was successfully modeled using a random forest model and improved performance was achieved.
Many different blending algorithms were explored during the GMSC competition. The transformation that converts the probability of default into a rank tends to improve the performance of blending algorithms when the evaluation criterion is AUC. We designed a class of statistical aggregation algorithms to combine sets of predictions. In order to effectively utilize the public leaderboard score of each individual prediction, we further propose rank-based oracle blending, in which, in contrast to the statistical aggregation algorithms, the weight of each single model is a function of its public AUC score, so that better models receive larger weights.
Note that blending is typically the last step in creating
final predictions. However, we can further utilize the
blending result to improve the performance of single
models and the overall performance. One approach
we followed was to use predictions from blending as
new targets for single models. This can help remove
target outliers and improve performance. Another approach we followed was to determine the problematic
population (that segment of data which does not rank
order well) using the blended results and build separate models for this problematic population. Note
that the original blended prediction can be a new key
feature for this problematic segment. It works best
when we build many different single models again for
this problematic segment. A key requirement to implement this approach is that predictions on training,
validation and test data sets should be available for
each single model.
2. Evaluation Criterion: Area Under
ROC Curve
The evaluation criterion for this GMSC competition was the Area Under the ROC Curve (AUC). The Receiver Operating Characteristic (ROC) curve of a decision function f plots the true positive rate on the y-axis against the false positive rate on the x-axis. Thus the ROC curve characterizes every possible trade-off between false positives and true positives that can be achieved by comparing f(x) to a threshold. Note that the ROC curve is a 2-dimensional measure of classification performance, and the AUC is a scalar measure that summarizes the ROC curve. Formally, the AUC is equal to the probability that the decision function f assigns a higher value to a randomly drawn positive example $x^+$ than to a randomly drawn negative example $x^-$, i.e.,

$$\mathrm{AUC}(f) = \Pr\big(f(x^+) > f(x^-)\big). \tag{1}$$
Theoretically, the AUC refers to the true distribution of positive and negative instances, but it is usually estimated from samples. It can be shown that the normalized Wilcoxon-Mann-Whitney statistic gives the maximum likelihood estimate of the true AUC given $n^+$ positive and $n^-$ negative examples (Yan et al., 2003):

$$\widehat{\mathrm{AUC}}(f) = \frac{\sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \mathbf{1}_{f(x_i^+) > f(x_j^-)}}{n^+ n^-}, \tag{2}$$

where $\mathbf{1}_{f_i > f_j}$ is the indicator function, which is 1 if $f_i > f_j$ and 0 otherwise. In fact, the two sums in Eq. (2) iterate over all pairs of positive and negative examples. Each pair that satisfies $f(x_i^+) > f(x_j^-)$ contributes $1/(n^+ n^-)$ to the overall AUC estimate. As a result, maximizing AUC is equivalent to maximizing the number of pairs satisfying $f(x^+) > f(x^-)$. Note that the number of all pairs is $O(n^2)$, where n is the sample size in the training data set. We will discuss below how to focus attention on the problematic population, which happens to be the top 20% of people with the highest blended scores for the GMSC Competition.
The AUC can also be calculated in a few different ways, e.g., by numerically integrating the ROC curve. Another alternative is to compute it via the Wilcoxon rank sum test as follows:

$$\widehat{\mathrm{AUC}}(f) = \frac{U_1 - \frac{n^+(n^+ + 1)}{2}}{n^+ n^-}, \tag{3}$$

where $U_1$ is the sum of the ranks of members in the positive class (the members who defaulted). This formulation of the AUC helps us to blend models in the ranking space rather than the probability space.
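For concreteness, a minimal sketch of Eq. (3) in Python (the function name is ours; scipy's average ranks also handle ties gracefully):

import numpy as np
from scipy.stats import rankdata

def auc_rank_sum(y_true, scores):
    """Estimate AUC via the Wilcoxon rank-sum formulation of Eq. (3)."""
    ranks = rankdata(scores)                 # ranks in ascending order
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    u1 = ranks[y_true == 1].sum()            # sum of ranks of the positive class
    return (u1 - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

# Example: auc_rank_sum(np.array([0, 0, 1, 1]),
#                       np.array([0.1, 0.4, 0.35, 0.8])) -> 0.75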
3. Data Description and Feature
Creation
3.1. Raw Data Description

There were 10 raw feature variables and 1 target variable. The raw feature variables were the following: RevolvingUtilizationOfUnsecuredLines, Age, NumberOfTime30-59DaysPastDueNotWorse, DebtRatio, MonthlyIncome, NumberOfOpenCreditLinesAndLoans, NumberOfTimes90DaysLate, NumberRealEstateLoansOrLines, NumberOfTime60-89DaysPastDueNotWorse, and NumberOfDependents. The meaning of these variables should be self-explanatory. Note that there were some missing values in the variables DebtRatio and MonthlyIncome. Also, there were some obvious outliers in a few variables.
3.2. Data Imputation and Feature Creation
During the GMSC competition, at least 12 different
data sets were created and the main differences were
in the handling of the missing values and outliers. In
some data sets, new derived features such as the weight
of evidence were also created. In this subsection, we
focus on the imputation of missing values and the new
variable 2D weight of evidence.
In a few data sets that utilized binning, missing values were assigned separate bins. In some data sets, the missing values of the MonthlyIncome and NumberOfDependents variables were imputed with their median values. In other data sets, the missing value of MonthlyIncome was imputed by regressing on the remaining variables. Additionally, in some data sets, special attention was given to how to handle the missing values in the variable MonthlyIncome and the unreasonably large values in the variable DebtRatio, which almost always accompanied the missing values in MonthlyIncome. It is speculated that huge values in DebtRatio were the actual monthly debt rather than the DebtRatio, since a reasonable DebtRatio should be less than 1 or close to 1. Thus, it is reasonable to assume that when MonthlyIncome was missing, the actual MonthlyDebt was recorded as the DebtRatio, since MonthlyIncome is the denominator in computing DebtRatio. As a result, where MonthlyIncome was not missing, the MonthlyDebt was calculated by:

$$\mathrm{MonthlyDebt} = \mathrm{MonthlyIncome} \times \mathrm{DebtRatio}. \tag{4}$$

For those with missing MonthlyIncome, the MonthlyIncome was set to 0, and the MonthlyDebt took the original value in DebtRatio. This demonstrates the usefulness of spending time analyzing the raw data.
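A minimal pandas sketch of this treatment, under the assumptions above (the helper name and the derived MonthlyDebt column are ours):

import pandas as pd

def derive_monthly_debt(df: pd.DataFrame) -> pd.DataFrame:
    """Apply Eq. (4) and the missing-income rule described above."""
    df = df.copy()
    has_income = df["MonthlyIncome"].notna()
    # Eq. (4): when income is present, DebtRatio is a genuine ratio.
    df.loc[has_income, "MonthlyDebt"] = (df.loc[has_income, "MonthlyIncome"]
                                         * df.loc[has_income, "DebtRatio"])
    # When income is missing, DebtRatio is assumed to hold the raw monthly debt.
    df.loc[~has_income, "MonthlyDebt"] = df.loc[~has_income, "DebtRatio"]
    df.loc[~has_income, "MonthlyIncome"] = 0
    return df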
One of the common transformations for the original variables is the weight of evidence. Many of the generated data sets contain the weight of evidence for each original variable separately after binning. Additionally, one data set utilized moving bins rather than static bins by determining a local weight of evidence using a fixed percentile around each individual member's variable values.

In the credit and risk community, it is well known that high credit line utilization and a high number of days past due often lead to serious delinquency. Therefore, a 2D weight of evidence on these two variables could be a predictive variable. A new data set was created with this 2D weight of evidence as the variable dlqUtil. First, the three delinquency-related variables NumberOfTimes90DaysLate, NumberOfTime60-89DaysPastDueNotWorse, and NumberOfTime30-59DaysPastDueNotWorse were added up to create the new variable dlq. Then it was crossed with the variable RevolvingUtilizationOfUnsecuredLines to produce the risk table. The 2D risk table is presented in Table 1. It can be observed that the risk increases as the number of times past due increases. Also, the risk increases as the credit line utilization increases. One interesting feature is that people with strictly zero utilization of their credit line have higher risk than those with slight utilization (0-0.1). These are well-known phenomena in the credit and risk community and are thought to stem from the fact that people with no credit utilization tend to be inexperienced with financial management and hence higher risk.
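As an illustration, such a risk table can be computed directly from the raw variables; the following is a minimal pandas sketch (the target column name SeriousDlqin2yrs follows the competition data, and the helper name is ours):

import pandas as pd

def risk_table(df: pd.DataFrame) -> pd.DataFrame:
    """Cross total past-due counts (dlq) with utilization bins (util);
    each cell is the empirical default rate, as in Table 1."""
    dlq = (df["NumberOfTimes90DaysLate"]
           + df["NumberOfTime60-89DaysPastDueNotWorse"]
           + df["NumberOfTime30-59DaysPastDueNotWorse"]).clip(upper=3)  # 3 = "3+"
    util = pd.cut(df["RevolvingUtilizationOfUnsecuredLines"],
                  bins=[-0.001, 0, 0.1, 0.3, 0.6, 0.8, 0.9, 0.95, 1, float("inf")],
                  labels=["0", "0-0.1", "0.1-0.3", "0.3-0.6", "0.6-0.8",
                          "0.8-0.9", "0.9-0.95", "0.95-1", "1+"])
    return df["SeriousDlqin2yrs"].groupby([util, dlq]).mean().unstack()

The resulting table of rates can be mapped back onto each member as the dlqUtil variable, or converted to a conventional weight of evidence per cell.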
Table 1. Risk table crossing total times past due (Dlq) and credit line utilization (Util).

Util \ Dlq      0         1         2         3+
0           0.01430   0.05628   0.18868   0.38095
0 - 0.1     0.00873   0.05012   0.15085   0.28483
0.1 - 0.3   0.01773   0.07252   0.13501   0.25829
0.3 - 0.6   0.03625   0.10145   0.18249   0.34615
0.6 - 0.8   0.06380   0.13140   0.26316   0.42464
0.8 - 0.9   0.08148   0.18105   0.26346   0.43797
0.9 - 0.95  0.09560   0.18083   0.32273   0.48795
0.95 - 1    0.08108   0.24508   0.32388   0.51308
1+          0.13738   0.35062   0.40761   0.61748

4. Single Models

Based on the data sets created in Section 3.2, numerous single models were built that investigate the data from different perspectives. In this competition, the investigated single models include:

1. Linear regression and its variants, e.g., Ordinary Least Squares (OLS) and Elastic Net (Zou & Hastie, 2005).
2. Logistic Regression.
3. Decision Tree based algorithms, such as Random
Forest, Gradient Boosting Tree, and Alternating
Decision Tree.
4. Artificial Neural Networks, e.g., Multi-Layer Neural Networks (MLNs) and Restricted Boltzmann
Machine (RBM).
5. Classifiers based on Bayesian statistics, e.g., Naive
Bayes classifier.
6. Support Vector Machine (SVM).
7. k-Nearest Neighbor (kNN).
Note that this is not a complete list, but it outlines the most important single models attempted in this competition. Since different models make different assumptions, some data sets work better with one model than with others. For example, Decision Tree based algorithms can handle outliers better than Neural Network based algorithms, but they tend to overfit during model training.

In general, the best performing single models were Decision Tree based models such as Gradient Boosting Tree, Random Forest, Alternating Decision Tree, and REPTree. The next best algorithm was Artificial Neural Networks. These models were followed by Naive Bayes, SVM, Elastic Net, and Logistic Regression, which produced models with similar accuracy. These results imply that the data set may still contain outliers even after extensive outlier treatment, and also that there may be significant coupling among variables that must be exploited by the model.
As an example of a single model building effort, we describe the best single model, which is a gradient boosted decision tree with residual postprocessing by a random forest, and another method (SVMperf) that optimizes AUC directly.
4.1. Gradient Boosted Decision Tree with Residual Postprocessing by Random Forest
Gradient boosted decision tree (Friedman, 2001) is
a machine learning technique for regression problems
which produces a prediction model in the form of an
ensemble of weak prediction models, typically decision
trees. It builds the model in a stage-wise fashion like
other boosting methods do, and it generalizes them
by allowing optimization of an arbitrary differentiable
loss function. The gradient boosting method can also
be used for classification problems by reducing them
to regression with a suitable loss function.
In this approach, the residual of the GBM is further modeled using a random forest (Breiman, 2001). Specifically, a GBM model was built first to predict the labels of members in the training set, i.e., $Y_{Train} = f^{GBM}(X_{Train}) + \epsilon_{Train}$, where $\epsilon_{Train}$ is the residual from the GBM model. The parameters for the GBM model were as follows: 1500 trees, shrinkage parameter 0.01, depth of each tree 14, and minimal size of each leaf node 10. In the second stage, a random forest (RF) model was built to model the residuals $\epsilon_{Train}$. Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The RF model we built can be denoted as $\epsilon_{Train} = f^{RF}(X_{Train}) + \epsilon'_{Train}$. The parameters of the RF model were as follows: 3 variables in each tree, 500 trees, and a minimal leaf node size of 300.

Thus, the prediction on the test set was $\hat{Y}_{Test} = \hat{Y}^{GBM}_{Test} + \hat{\epsilon}^{RF}_{Test}$, where $\hat{Y}^{GBM}_{Test} = f^{GBM}(X_{Test})$ and $\hat{\epsilon}^{RF}_{Test} = f^{RF}(X_{Test})$.
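A sketch of this two-stage procedure with the parameters quoted above (scikit-learn stands in for whatever implementation was actually used, and the data here are placeholders):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Placeholder data; in the competition these were the engineered GMSC features.
rng = np.random.default_rng(0)
X_train, X_test = rng.random((2000, 10)), rng.random((500, 10))
y_train = (rng.random(2000) < 0.05).astype(float)   # ~5% target rate

# Stage 1: GBM (regression on the 0/1 label), 1500 trees, shrinkage 0.01,
# depth 14, minimal leaf size 10.
gbm = GradientBoostingRegressor(n_estimators=1500, learning_rate=0.01,
                                max_depth=14, min_samples_leaf=10)
gbm.fit(X_train, y_train)
residual = y_train - gbm.predict(X_train)            # epsilon_Train

# Stage 2: RF on the residuals; 3 variables per split, 500 trees, leaf size 300.
rf = RandomForestRegressor(n_estimators=500, max_features=3,
                           min_samples_leaf=300)
rf.fit(X_train, residual)

# Final score: Y_hat_Test = f_GBM(X_Test) + f_RF(X_Test).
y_hat_test = gbm.predict(X_test) + rf.predict(X_test)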
4.2. Support Vector Machine
A variation of SVM called SVMperf (Joachims, 2006)
maximizes the AUC directly as the algorithm minimizes 1-AUC as its loss function. A typical SVM loss
function based on minimizing the error rate is given
by:
$$\min_{w,\,\zeta_i \ge 0} \;\; \frac{1}{2} w^T w + \frac{C}{n} \sum_{i=1}^{n} \zeta_i \tag{5}$$

such that

$$y_i (w^T x_i) \ge 1 - \zeta_i, \quad \forall i \in \{1, \ldots, n\}, \tag{6}$$

where w is the weight vector, $\zeta_i$ are the slack variables indicating the amount of error we are allowing, and C is a complexity factor that balances the complexity of the model.
A loss function based on minimizing 1 − AUC (maximizing AUC) is:

$$\min_{w,\,\zeta_{i,j} \ge 0} \;\; \frac{1}{2} w^T w + \frac{C}{n^+ n^-} \sum_{i=1}^{n^+} \sum_{j=1}^{n^-} \zeta_{i,j} \tag{7}$$

such that

$$(w^T x_i) - (w^T x_j) \ge 1 - \zeta_{i,j}, \quad \forall i \in \{1, \ldots, n^+\}, \; \forall j \in \{1, \ldots, n^-\} \;\; \Leftrightarrow \;\; \forall (i,j): y_i > y_j. \tag{8}$$
One can see that these constraints relate directly to the pairwise comparisons in Eq. (2) for the AUC. The problem is that in this formulation we have $O(n^2)$ such constraints, one for each pair of observations. Joachims (2006) proposed a new approximation algorithm (called the cutting plane algorithm) that reduces the number of constraints to O(n), thus making the overall algorithm complexity O(n log n), driven by a sorting operation. This algorithm starts with an empty set of constraints W and adds the most violated constraint in each iteration.
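The cutting-plane solver itself is beyond the scope of this paper, but the objective in Eqs. (7)-(8) can be illustrated with a naive dense-pair subgradient sketch (linear scoring function; the function name is ours; this materializes all O(n⁺n⁻) pairs and is therefore only suitable for small data, unlike Joachims' algorithm):

import numpy as np

def pairwise_auc_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Naive subgradient descent on the 1-AUC hinge loss of Eqs. (7)-(8)."""
    pos, neg = X[y == 1], X[y == 0]
    w = np.zeros(X.shape[1])
    scale = C / (len(pos) * len(neg))
    for _ in range(epochs):
        margins = (pos @ w)[:, None] - (neg @ w)[None, :]   # w.x_i - w.x_j
        V = margins < 1.0                                    # violated pairs
        # Hinge subgradient: -(x_i - x_j) for every violated pair (i, j).
        g_pos = (V.sum(axis=1)[:, None] * pos).sum(axis=0)
        g_neg = (V.sum(axis=0)[:, None] * neg).sum(axis=0)
        w -= lr * (w + scale * (g_neg - g_pos))
    return w

# Scores X @ w are then used for ranking; the cutting-plane method optimizes
# the same objective while adding only O(n) constraints.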
5. Blending Algorithms
Note that the performance measure in the GMSC competition was AUC, which is different from the RMSE
criterion widely used in many competitions such as the
Heritage Health Prize and the Netflix Prize. There are broadly two types of blending algorithms: one that focuses on blending similar models and another that focuses on blending a wide variety of different models. For blending similar models, we successfully implemented (1) Genetic Algorithms to boost the performance of Artificial Neural Networks and (2) a Voting Method to boost the Naive Bayes Classifier and Particle Filter. These approaches are designed to search through a particular model space to optimize the AUC directly (see the GMSC Technical Report (Diev et al., 2012)).

In the rest of this section, we discuss the implemented blending algorithms, including genetic algorithms using populations of feedforward neural networks, statistical aggregation of predictions, and oracle blending that incorporates the public leaderboard AUC scores as feedback.
5.1. Genetic Algorithms Using Populations of
Neural Networks
Artificial neural networks trained using backpropagation (Rumelhart et al., 1986) typically seek to reduce
the mean squared error between actual and expected
values during training. The metric for the GMSC competition, however, was AUC. Training by backpropagation normally produces a neural network model with
good AUC as well, but these two performance metrics are not completely aligned. For competitions such
as the GMSC where even incremental gains are important, we may be able to boost the performance of
neural nets by optimizing AUC directly.
Genetic Algorithms (GAs) are inspired by evolution
and natural selection in biology (see, for example,
(Mitchell, 1996)). We can evolve populations of neural
nets. The great advantage of doing so is that we may
use AUC directly as the fitness function. We encoded
each neural net population member as a single linear
chromosome of real-valued numbers representing the
ordered weights and biases of each network unit. In
the simplest approach, all population members maintain the same network topology (inputs, numbers of
hidden units, and connectivity), though topological
variations can be accommodated as well.
The driving mechanism for fitness increases in the population is the creation of new population members that
are fitter than any of their parents. The population
represents a pool of genetic diversity from which newly
born entities can draw, and we used both single-point
mutations and multi-point crossovers to generate new
genetic diversity. We evaluated several different mutation schemes and found the most effective for this competition to be mutation of each allele by a randomly selected lognormal factor, i.e., by an amount proportional to the allele's current value. This scheme avoids large disruptions in the scale of alleles, while occasionally exploring larger deviations than Gaussian perturbations would provide.
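A sketch of this mutation scheme (the rate and sigma values are illustrative, not the tuned competition settings):

import numpy as np

def mutate(chromosome, rate=0.01, sigma=0.3, rng=None):
    """Mutate alleles multiplicatively by a random lognormal factor,
    i.e., by an amount proportional to each allele's current value."""
    if rng is None:
        rng = np.random.default_rng()
    genes = chromosome.copy()
    mask = rng.random(genes.shape) < rate              # alleles chosen to mutate
    factors = rng.lognormal(mean=0.0, sigma=sigma, size=genes.shape)
    genes[mask] *= factors[mask]                       # scale-preserving change
    return genes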
We explored a range of combinations for the tunable
parameters of the GA runs, including population size,
number of hidden units, crossover rate, mutation rate,
and various other parameters. Results for some representative runs are shown in Table 2.
5.2. Statistical Aggregation
The basic idea behind statistical aggregation is the so-called wisdom of crowds (Mitchell, 1997). Statistical aggregation works best when predictions from different models produce random, unbiased errors by looking at the data from different perspectives. Ideally, the error will be reduced by a factor proportional to the square root of the number of predictions.
Table 2. GMSC Results for Sample GA Runs Using Neural Networks.

GA Run     Pop. Size    Public AUC    Private AUC
garoc 7    500          0.862817      0.868301
garoc 8    800          0.862783      0.868267
garoc 9    1000         0.862814      0.868182
The statistical aggregation methods we applied to the predictions include: (1) Mean Blending, (2) Median Blending, and (3) Expected Value Blending. These
aggregation algorithms can be performed in both (a)
the probability space which includes original predictions from each single model and (b) the ranking space
which converts the probability into the ranking order in each prediction. Since AUC is a function of
ranking rather than probability, it is reasonable to
perform aggregation in the ranking space. We compared mean, median, and expected value blending algorithms in both the probability space and the ranking
space and the results are summarized in Table 3. It
can be observed that the aggregation using ranking
usually achieves better AUC performance. Furthermore, expected value aggregation considers covariance
of the predictions, but tends to underperform compared to mean and median blending. One possible
reason is that the Gaussian distribution assumption is
too strong to be true for real data.
Table 3. Comparison of AUC for Statistical Aggregation Blending Algorithms.

                       Probability Space       Ranking Space
Num Models   Alg.      Public     Private      Public     Private
32           EV        0.8614     0.8678       0.8621     0.8678
32           Mean      0.8620     0.8679       0.8621     0.8680
32           Median    0.8622     0.8680       0.8622     0.8681
63           EV        0.8609     0.8670       0.8610     0.8664
63           Mean      0.8626     0.8684       0.8628     0.8686
63           Median    0.8624     0.8681       0.8628     0.8684
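A sketch of mean and median blending in the ranking space (the helper name is ours; the final rescaling is cosmetic, since AUC depends only on the ordering):

import numpy as np
from scipy.stats import rankdata

def blend_ranks(preds, how="mean"):
    """Blend a list of probability vectors after converting each to ranks."""
    ranks = np.vstack([rankdata(p) for p in preds])    # one row per model
    agg = np.mean(ranks, axis=0) if how == "mean" else np.median(ranks, axis=0)
    return agg / agg.max()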
5.3. Rank Based Oracle Blending
Rank Based Oracle Blending is an extension of the
statistical aggregation algorithms. Compared with the
statistical aggregation algorithms, the weight of each
single model is a function of its public AUC score so
that better models have larger weights in rank based
oracle blending. Note that we assume that the public AUC score reflects the private AUC score accurately (which may not be true if the populations underlying the public and private scores have different distributions). Formally, given N predictions $\{\vec{p}_i, i = 1, \ldots, N\}$, the mathematical formulation of the rank based oracle is given in Eq. (9):

$$R_f(j) = \frac{\sum_i w(AUC_i)\, R(\vec{p}_i(j))}{\sum_i w(AUC_i)}, \tag{9}$$

where R is the operator that maps a prediction from the probability space to the ranking space (in ascending order), $R_f$ is the final ranking, $AUC_i$ is model i's public AUC score, and w represents the weighting as a function of $AUC_i$. Note that the final ranking is normalized by dividing by the sum of the weights.
Specifically, the following weighting functions were explored:
$$w_a(AUC_i) = (AUC_i - \overline{AUC}) \tag{10}$$

$$w_b(AUC_i) = (AUC_i - \overline{AUC})^2 \tag{11}$$

$$w_c(AUC_i) = (AUC_i - \overline{AUC})^4 \tag{12}$$

$$w_d(AUC_i) = R(AUC_i) \tag{13}$$

$$w_e(AUC_i) = \frac{1}{N - R(AUC_i) + 1}, \tag{14}$$
where $\overline{AUC}$ is a constant that should be optimized for the particular problem. However, most of these weighting formulas are not robust to mediocre predictions and require prescreening to reduce the noise. Interestingly, $w_e$, which is motivated by the weighting of the modified Borda count used in Nauru, is significantly more robust to mediocre and correlated predictions.
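A sketch of Eq. (9) with the Nauru-style weight $w_e$ of Eq. (14) (the helper name is ours):

import numpy as np
from scipy.stats import rankdata

def oracle_blend(preds, public_aucs):
    """Rank based oracle blend, Eq. (9), weighted by w_e of Eq. (14)."""
    n_models = len(preds)
    model_rank = rankdata(public_aucs)                 # 1 = worst, N = best
    w = 1.0 / (n_models - model_rank + 1)              # w_e: best model gets 1
    ranks = np.vstack([rankdata(p) for p in preds])    # ascending member ranks
    return (w[:, None] * ranks).sum(axis=0) / w.sum()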
6. Boosting Model Performance by Using Blended Results

Blending is not the last step in our modeling process. In our modeling, the blended results are further utilized to improve the performance of single models. Specifically, we attempted two approaches: (1) use the blended result as the new target and retrain the single model, which can remove target outliers and lead to a better single model; (2) isolate the problematic population using the blended results.

6.1. Single Model Trained on Blending AUC Scores

In this approach, after training several single models and building the blending model, the target on the training data set is replaced with the prediction of the blending model, and the new target is used to retrain single models. The effectiveness of this approach is confirmed by our empirical results. Specifically, we trained two models on the jl-WOE (jl-11) data set. Both models used gradient boosting trees as the first stage and then postprocessing by training a random forest model on the residual of the GBM scores and the true target. The difference between these two models is that different targets were used at the GBM stage: one used the true target and the other used the blended results as the target. The AUC scores are summarized in Table 4.

Table 4. Comparison of AUC between two models using different targets.

Target             Training    Validation    Test
True target        0.89169     0.86199       0.85961
Blending target    0.88925     0.86369       0.86099

6.2. Top 20% Segmentation

Another approach that successfully boosted the blended model is a method which segments the problematic population first and then builds new models on this segment. In the GMSC competition, we focused on the top 20% of the population most likely to default. The intuition behind this approach is that classifying 80% of the population as most likely NOT to default is much easier than classifying the remaining 20% of the population that is ranked near the borderline. Note that this cutoff will be different for different problems and is highly dependent on the problem details.
Formally, we assume that $\{\vec{p}_i^{\,trn}, i = 1, \ldots, N\}$, $\{\vec{p}_i^{\,vld}, i = 1, \ldots, N\}$, and $\{\vec{p}_i^{\,tst}, i = 1, \ldots, N\}$ are the sets of training, validation, and test predictions from N single models, respectively. Note that this approach requires the predictions on the training data set for all single models. For the N predictions on the training, validation, and test data sets, we aggregate them using either rank based oracle blending or statistical aggregation. Specifically, we applied weighted mean oracle blending with the inverse of the rank of the test AUC score as the weight, i.e., $w_e$ in Eq. (14). Note that the same weight was used for a specific single model in the training, validation, and test set aggregations. Based on the aggregated results, we can identify the top 20% of the population (most likely to default). We denote the top 20% of the population in the training, validation, and test data sets as $ID_{slct}^{trn}$, $ID_{slct}^{vld}$, and $ID_{slct}^{tst}$, respectively, and denote $X^{trn}$, $X^{vld}$, and $X^{tst}$ as the features of these three sets. Thus, we can build new models for the top 20% population using $X^{trn}(ID_{slct}^{trn})$, $X^{vld}(ID_{slct}^{vld})$, and $X^{tst}(ID_{slct}^{tst})$.
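Assuming per-model rank matrices and the $w_e$ weights are available, the selection step might look like the following sketch (names are ours):

import numpy as np

def select_top20(ranks, weights, q=0.80):
    """ranks: (N models x m members) matrix of ascending ranks;
    weights: per-model w_e. Returns the top 20% mask (ID_slct) plus the
    weighted mean and standard deviation of the ranks (used as features
    below)."""
    w = weights / weights.sum()
    mean_rank = w @ ranks                              # weighted mean ranking
    std_rank = np.sqrt(w @ (ranks - mean_rank) ** 2)   # weighted std of ranking
    idx = mean_rank >= np.quantile(mean_rank, q)       # most likely to default
    return idx, mean_rank, std_rank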
An interesting observation is that the weighted mean ranking and the weighted standard deviation of the ranking from the original blended single models (which utilized the whole population rather than the top 20%) can be used as new features. The addition of these two features resulted in a significant improvement of AUC on the top 20% population for many models, such as logistic regression, SVM, neural networks, and random forest. We summarize the performance comparison before and after adding these two features for several variants of logistic regression and SVM in Table 5. It can be observed in Table 5 that SVM and logistic regression were boosted from an AUC score of 0.68 (which is worse than the AUC score on the top 20% of the population using the blended model for the whole population) to 0.75.
Table 5. The AUC scores of different algorithms on the top 20% of the population on the validation data set, before and after adding the two new features based on the mean and standard deviation of rankings.

Algorithm                       Before     After
ℓ2-reg Logistic Regression      0.68127    0.75247
ℓ2-reg ℓ2-loss SVM              0.68106    0.74899
ℓ2-reg ℓ1-loss SVM              0.68036    0.72437
ℓ1-reg ℓ2-loss SVM              0.67620    0.75036
ℓ1-reg Logistic Regression      0.67767    0.75261
We performed median blending on the 8 best-performing models for the top 20% population and combined that with the oracle blended prediction for the bottom 80% of the population. The AUC was improved from 0.86165 to 0.86204.
7. Conclusions

We have summarized selected contributions that we made during the GMSC Competition. For feature creation, we focused on the significance of the imputation of missing values as well as weight of evidence transformations, including the 2D weight of evidence. In the single model section, we demonstrated the variety of models that can be built, and showed how to boost single model performance through residual postprocessing by describing the best single model. We elaborated on our exploration of different blending algorithms, both for similar models and for diverse model predictions. Finally, we discussed our investigation of utilizing blending results to boost performance further. Together, these techniques provide a coherent guide for future competitions, and we believe these approaches can be refined further to improve our model performance in the future.

7.1. Software and Data

All algorithms have been implemented and are available on SVN².

Acknowledgments

These algorithms were developed during the Give Me Some Credit (GMSC) Competition. We would like to thank everyone who contributed to the GMSC competition, including: Jacob Spoelstra, Jenny Zhang, Abhikesh Nag, Dan Nabutovsky, Kevin Chen, William Roberts, Kimi Minnick, Jason Lu, Michael Kennedy, Michael Alton, Yonghui Chen and Yun Wang.

References
Breiman, L. Random Forests. Machine Learning, 45(1):5–32, 2001.
Diev, V., Gancarz, G., Honda, T., Lee, J. Y., Liu, Y.,
Mahmoudi, M., Mathur, R., Rahman, S., Sun, L.,
Wickert, S., Ye, X., and Zhang, H. Technical Report
for Give Me Some Credit competition. Technical
Report, Opera Solutions, San Diego, CA, January
2012.
Friedman, J. H. Greedy Function Approximation: A
Gradient Boosting Machine. The Annals of Statistics, 29(5):1189–1232, 2001.
Joachims, T. Training Linear SVMs in Linear Time. In
Proceedings of the ACM Conference on Knowledge
Discovery and Data Mining (KDD), 2006.
Mitchell, M. An Introduction to Genetic Algorithms.
MIT Press, Cambridge, MA USA, 1996.
Mitchell, T. M. Machine Learning. McGraw-Hill, New
York, 1997.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J.
Learning Representations by Back-Propagating Errors. Nature, 323:533–536, 1986.
Yan, L., Dodier, R. H., Mozer, M., and Wolniewicz,
R. H. Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic. In Proceedings of the 20th International Conference on Machine Learning, pp. 848–855, 2003.
Zou, H. and Hastie, T. Regularization and Variable
Selection via the Elastic Net. Journal Of the Royal
Statistical Society: Series B, 67(2):301–320, 2005.
² http://subversion/repo/GMSC