
LEM Working Paper
2011-06
A PROBABILITY-MAPPING ALGORITHM FOR
CALIBRATING THE POSTERIOR PROBABILITIES: A
DIRECT MARKETING APPLICATION
Kristof Coussement*, Wouter Buckinx**
* IESEG School of Management (LEM-CNRS)
** Python Predictions, Brussels, Belgium
A Probability-Mapping Algorithm for Calibrating the Posterior
Probabilities: A Direct Marketing Application*
Kristof Coussement£, Wouter Buckinx+
£ IESEG School of Management (LEM-CNRS), Department of Marketing, 3 Rue de la Digue, F-59000 Lille, France.
+ Managing partner (PhD), Python Predictions, Avenue R. Van den Driessche 9, B-1150 Brussels, Belgium.
First and corresponding author: Kristof Coussement, [email protected], Tel.: +33320545892
Second author: Wouter Buckinx, [email protected]
This paper is accepted for publication in European Journal of Operational Research
A Probability-Mapping Algorithm for Calibrating the Posterior Probabilities:
A Direct Marketing Application
Abstract
Calibration refers to the adjustment of the posterior probabilities output by a classification
algorithm towards the true prior probability distribution of the target classes. This adjustment
is necessary to account for the difference in prior distributions between the training set and the
test set. This article proposes a new calibration method, called the probability-mapping
approach. Two types of mapping are proposed: linear and non-linear probability mapping.
These new calibration techniques are applied to 9 real-life direct marketing datasets. The
newly-proposed techniques are compared with the original, non-calibrated posterior
probabilities and the adjusted posterior probabilities obtained using the rescaling algorithm of
Saerens, Latinne, & Decaestecker (2002). The results indicate that marketing researchers should calibrate the posterior probabilities obtained from the classifier. Moreover, it is shown that a 'simple' rescaling algorithm is not an adequate first solution: the results suggest applying the newly-proposed non-linear probability-mapping approach for the best calibration performance.
Keywords: data mining, direct marketing, response modeling, calibration, decision support
systems
1. Introduction
Due to recent developments in IT infrastructure and the ever-increasing trust placed in
complex computer systems, analysts are showing an increasing interest in classification
modeling in a variety of disciplines such as credit scoring (Martens et al., 2010; Paleologo et
al., 2010), medicine (Conforti & Guido, 2010), text classification (Bosio & Righini, 2007),
SMEs fund management (Kim and Sohn, 2010), revenue management (Morales & Wang,
2010), and so on. The same interests are shared by the direct marketing community. Direct
marketing analysts have an increasing interest in building prediction models that assign a
probability of response to each and every individual customer in the database (Lamb et al.,
1994). The task of classification is made even more interesting by the fact that current marketing environments store vast amounts of customer information at very low cost, including socio-demographics, transactional buying behavior, attitudinal data, etc. (Naik et al., 2000), while at the same time there has been a tremendous increase in academic interest in direct marketing applications (e.g. Allenby et al., 1999; Baumgartner & Hruschka, 2005; Hruschka, 2010; Lee et al., 2010; Piersma & Jonker, 2004). In this context, response models are defined as classification models that attempt to discriminate between responders and non-responders to a particular company mailing.
In the past, purely statistical methods like logistic regression, discriminant analysis and naive Bayes models have been proposed to discriminate between responders and non-responders in a direct marketing context (Baesens et al., 2002; Bult, 1993; Deichmann et al., 2002). Although these techniques may be very effective, they make a stringent assumption about the underlying relationship between the independent variables and the dependent or response variable. In response to this, more advanced data mining algorithms like decision tree-generating techniques, artificial neural networks and support vector machines have been applied (Baesens et al., 2002; Bose & Chen, 2009; Crone et al., 2006; Haughton &
Oulabi, 1997; Zahavi & Levin, 1997). All these binary classification models are used for two
reasons. First, researchers rely on them to obtain robust parameter estimates of the
independent variables by modeling the probability of response as a function of the
independent variables. Second, these models are used to obtain consistent predicted
probabilities of response, which are then used (i) to rank the customers based on their
responsiveness to the campaign, (ii) to optimize the overall campaign strategy by offering the
customer the product with the highest response probability over the different response models
and (iii) for the discrimination task of the response event itself where one classifies customers
into responders and non-responders. For (ii) and (iii), the absolute size of the posterior
response probabilities is crucial. This study focuses on the process of obtaining correct
response probabilities, where calibrating the posterior probabilities could have a positive
impact on the optimization of the overall campaign strategy and the efficiency of the
discrimination task.
In practice, a classification model is built on a training set, i.e. a set of customers where both
the independent variables and the dependent variable are present. In order to correctly
measure the discrimination power of the trained classifier, the classification model is applied
to a group of customers who have not been used for training, called the scoring or test set. The
purpose is to obtain robust and consistent predictions for the response probability of these
unseen customers. As one is interested in dividing the customers into responders and non-responders, a judicious classification based on the posterior response probabilities of the customers is needed. In other words, customers with a response probability exceeding a certain threshold are classified as responders, while the remaining customers are classified as non-responders.
However, it often happens that a classifier is trained using a dataset that does not reflect the
true prior probabilities of the target classes in the real-life population. This may have serious
negative consequences on the discrimination performance because the posterior probabilities
do not reflect the true probability of interest. This phenomenon occurs in a direct marketing
context as well where the prior probabilities between the training set and the (out-of-sample)
test set are significantly different. More specifically, the training set consists of customers
who are preselected by an earlier response model as being customers with a high response
probability, while the test set does not make any restrictions based on the customer profiles in
the database. In such a case, a large discrepancy exists between the response distributions on
the training set and the test set. The incidence, which is the percentage of responders in a data
set, is much higher in the training set than the incidence of real response in the out-of-sample test set. This inconsistency has a negative effect on the discrimination performance on the test set, especially because the classifier's decision to classify customers into
responders or non-responders is based on setting a threshold on the raw posterior probabilities
of class membership. For instance, when a classifier is trained on a dataset with a higher
incidence than the one in the test set, the posterior probabilities on the test set are inflated.
Thus making a classification decision based on the absolute value of the posterior
probabilities may significantly harm the discrimination performance. Moreover, optimizing
the campaign strategy by offering the product with the highest response probability to the
customer becomes useless because the response probabilities for different products for a
particular customer are not comparable. This paper focuses on how researchers can adjust the
posterior probabilities based on the true prior distribution of the response variable. This
process of adjustment is called calibration.
This paper proposes a new methodology to be used to calibrate the posterior probabilities
from the test set with the real-world situation, a process called probability-mapping. It maps
the posterior response probabilities obtained from the classifier onto the prior distribution of
real response. The new probability-mapping approaches using generalized linear models and
non-parametric generalized additive models are compared with the original, non-calibrated
posterior probabilities and the calibrated probabilities using the rescaling methodology of
Saerens et al. (2002).
This paper is structured as follows. Section 2 describes the methodological framework, while
Section 3 explores the different calibration approaches (rescaling approaches and probability-mapping approaches). Section 4 explains the characteristics of the empirical validation, while
Section 5 explores the results. Section 6 gives managerial recommendations, and finally
Section 7 concludes this paper.
2. Methodological framework
Figure 1 shows the methodological framework for the different calibration methods applied in
this study.
[INSERT FIGURE 1 OVER HERE]
Define a training set TRAINM = {(x_i, y_i)}_{i=1}^{m} consisting of m customers. Each customer (x_i, y_i) is a combination of an input vector x_i representing the independent variables and a dependent variable y_i, with y_i ∈ {0,1} corresponding to whether or not the customer responded to a certain mailing. TRAINM consists of all customers who were selected by a previous response model, thus received a direct mailing to buy the product, and were therefore indicated as customers having a high response probability. During the training phase, a classifier C maps the input vector space onto the binary response variable using the training set observations. For the test set TESTN = {x_i}_{i=1}^{n} consisting of n customers, the trained classifier C is applied and for every customer in TESTN a response probability Porg is obtained. The purpose of this paper is to adjust the posterior probabilities Porg to the real response distribution, because the training sample TRAINM is not representative of TESTN, which corresponds to the true population. Therefore, for every observation x_i in TESTN the real response is collected and summarized in REALN = {y_i}_{i=1}^{n}, with y_i ∈ {0,1} corresponding to whether or not the customer spontaneously bought that particular product in a time window without direct mailing actions. The real response represents a response of pure interest in the product. In other words, REALN is used to represent the true prior probabilities.
The purpose of the calibration phase is to adjust Porg, the non-calibrated posterior probabilities
of TEST N , in order to truly represent the probability of response. With the aim of
methodologically benchmarking the different calibration methods, a k-fold cross-validation is
applied. In a k-fold cross-validation, the dataset is randomly split into k equal parts; each part is used in turn during the scoring phase, while the other k-1 parts are used for training the calibration model. Note that TESTkN (REALkN) represents the k-th fold of TESTN (REALN), while Pkorg represents the non-calibrated posterior probabilities of TESTkN.
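To make this cross-validated calibration set-up concrete, the following sketch (in Python, with hypothetical array and function names; an illustration rather than the authors' original code) splits the scored customers into k folds, trains a calibration model on k-1 folds and scores the held-out fold:

    # Illustrative sketch of the k-fold calibration framework; names are ours.
    import numpy as np
    from sklearn.model_selection import KFold

    def cross_validated_calibration(p_org, y_real, fit_calibrator, k=10, seed=1):
        """Return calibrated probabilities Pnew for every customer in TESTN.

        p_org          : non-calibrated posterior probabilities Porg output by classifier C
        y_real         : real responses (0/1) of the same customers, i.e. REALN
        fit_calibrator : function(p_train, y_train) -> callable that maps probabilities
        """
        p_new = np.empty_like(p_org, dtype=float)
        folds = KFold(n_splits=k, shuffle=True, random_state=seed)
        for train_idx, score_idx in folds.split(p_org.reshape(-1, 1)):
            # Train the calibration model on the k-1 remaining folds ...
            calibrate = fit_calibrator(p_org[train_idx], y_real[train_idx])
            # ... and apply it to the held-out fold (TESTkN -> NEWkN).
            p_new[score_idx] = calibrate(p_org[score_idx])
        return p_new

Any of the calibration methods of Section 3 can be plugged in as fit_calibrator.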
3. Calibration approaches
Two types of calibration methods are applied: (i) the rescaling algorithm of Saerens et al.
(2002) and (ii) the newly-proposed probability-mapping approaches. The former algorithm rescales Pkorg, the posterior probabilities of TESTkN, taking into account the real incidence of REALkN (Saerens et al., 2002), while the latter type adjusts the posterior probabilities of TESTkN by mapping them onto the real responses of REALkN.
3.1 Rescaling algorithm (SAERENS)
This section explains the methodology of Saerens et al. (2002). The starting point of the
Saerens et al. (2002) calibration approach is based on Bayes’ rule, i.e. the posterior
probabilities of response depend in a non-linear way on the prior probability distribution of
the target classes. The prior probability distribution of the target class is defined as the
incidence of the target class, or in this setting the percentage of responders in the dataset.
Therefore, a change in the prior probability distribution of the target classes changes the
posterior response probabilities of the classification model. Saerens et al. (2002) describe a
process that adjusts the posterior probabilities of response output by the classifier to the new
prior probability distribution of the target classes making use of a predefined rescaling
formula. In detail, the calibrated posterior probabilities of response for the customers in the
test set of fold k are obtained by weighting the non-calibrated posterior probabilities, Pkorg, by
the ratio of the response incidence of REAL kN , i.e. the new prior probability distribution, to
the response incidence in the training set, i.e. the old prior probability distribution. The
denominator is a scaling factor to make sure that the calibrated posterior probabilities sum up
to one.
In summary,

    Pknew = [ (Pk(c1) / Pkt(c1)) · Pkorg ] / [ (Pk(c0) / Pkt(c0)) · (1 − Pkorg) + (Pk(c1) / Pkt(c1)) · Pkorg ]        (1)

with Pknew representing the calibrated posterior response probabilities in fold k, and Pk(ci) and Pkt(ci) the new and old prior probabilities for class i, i ∈ {0,1}. A data set NEWkN is
obtained which contains Pknew, the calibrated posterior probabilities for the test data of
TEST kN .
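As an illustration of equation (1), a minimal sketch of the rescaling step is given below (Python; the function and argument names are ours, not from Saerens et al. (2002)):

    # Illustrative sketch of the rescaling of equation (1); names are ours.
    import numpy as np

    def saerens_rescale(p_org, prior_new, prior_old):
        """Rescale posterior probabilities towards new class priors, cf. equation (1).

        p_org     : non-calibrated posterior response probabilities Pkorg
        prior_new : new prior of the response class, e.g. the incidence in REALkN
        prior_old : old prior of the response class, i.e. the incidence in the training set
        """
        w1 = prior_new / prior_old                  # weight for the response class c1
        w0 = (1.0 - prior_new) / (1.0 - prior_old)  # weight for the non-response class c0
        return (w1 * p_org) / (w0 * (1.0 - p_org) + w1 * p_org)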
3.2 Probability-mapping approaches
The purpose of the probability-mapping approaches is to map Pkorg, the old posterior
probabilities of TEST kN , onto the real response probabilities of REAL kN . As such, one is able
to build a classification model that maps the non-calibrated probabilities onto the real
response probabilities. This model is then used to calibrate the old probabilities with the
corrected probabilities of response. However, the real probability distribution of the target
classes is not directly available from REALkN, which only contains the real responses y_i with y_i ∈ {0,1} on an individual customer level. In order to convert the real responses y_i, with y_i ∈ {0,1}, on an individual level in REALkN into a real response probability distribution, a
number of bins b are constructed. The incidence of response is calculated per bin and equals
the percentage of real response. This incidence is used as an approximation for the real
probability of response per bin. In practice, both TEST kN and REAL kN are split into a number
of bins b using the equal frequency binning approach based on the posterior probabilities of
TEST kN . TEST kb ( REAL kb ) represents the b-th bin in the k-fold of TEST kN ( REAL kN
respectively). TEST kb and REAL kb logically contain identical observations, while Pkborg is the
non-calibrated posterior probability average for the b-th bin in TEST kN and Pkbreal is the
percentage of real responders in the b-th bin of REAL kN . Pkbreal serves as a proxy for the true
prior probability. In order to formalize the relationship between the average posterior
probabilities of TEST kN and the approximate real probabilities obtained from REAL kN , a
formal mapping is obtained using the binned training set of fold k by
Pkbreal = fk(Pkborg)
(2)
with fk being the classifier that maps the non-calibrated posterior probabilities onto the real
probabilities in fold k. After the classifier fk is built, it is applied to the unseen test data of
This paper is accepted for publication in European Journal of Operational Research
8
TEST kN to obtain the new posterior probabilities, Pknew, for every individual in the test data
set of the k-th fold. A new data set is obtained NEWkN which contains Pknew, the calibrated
posterior probabilities.
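The binning step described above can be sketched as follows (Python; equal-frequency bins based on the non-calibrated probabilities, with names chosen for this illustration):

    # Illustrative sketch of the equal-frequency binning step; names are ours.
    import numpy as np

    def bin_probabilities(p_org, y_real, n_bins=200):
        """Equal-frequency binning of a fold.

        Returns, per bin, the average non-calibrated probability (Pkborg) and the
        observed incidence of real response (Pkbreal), which serves as a proxy for
        the true probability of response in that bin.
        """
        order = np.argsort(p_org)              # sort customers by their score
        bins = np.array_split(order, n_bins)   # equal-frequency bins
        p_org_bin = np.array([p_org[idx].mean() for idx in bins])
        p_real_bin = np.array([y_real[idx].mean() for idx in bins])
        return p_org_bin, p_real_bin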
There are several possibilities for fk, a function that links the estimated, non-calibrated probabilities of TESTkb to the approximated real probabilities of REALkb. This study uses one probability-mapping approach based on generalized linear models (Section 3.2.1.) and three non-linear approaches: one based on generalized linear models with log-transformed non-calibrated probabilities (Section 3.2.2.) and two approaches based on generalized additive
models (Section 3.2.3. and Section 3.2.4.).
3.2.1 Generalized linear model (GLM)
Given y_i as the dependent variable, with y_i ∈ [0,1] representing Pkbreal, the averaged true prior probabilities from REALkb, and x_i equal to Pkborg, the averaged posterior probabilities of TESTkb, a generalized linear model with logit link function is employed to model fk(x_i) ∈ [0,1]. Moreover, it assumes that the relationship between Pkborg and Pkbreal is linear in the log-odds via

    logit(y_i) = log( y_i / (1 − y_i) ) = α_k + β_ki x_i        (3)

or

    y_i ≡ fk(x_i) = logit⁻¹( α_k + β_ki x_i )        (4)

with α_k as the intercept and β_ki x_i as the predictor. The parameters α_k and β_ki are estimated using maximum likelihood (Tabachnick & Fidell, 1996).
3.2.2. Generalized linear model with log transformation (LOG)
This paper is accepted for publication in European Journal of Operational Research
9
Another approach is to log-transform x_i in equations (3) and (4); in this way one captures the non-linearity in the log-odds space between y_i (Pkbreal, the true prior probabilities from REALkb) and x_i (Pkborg, the posterior probabilities of TESTkb).
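As a sketch of the mappings in Sections 3.2.1 and 3.2.2, the snippet below fits the logit-linear relation of equations (3)-(4) on the binned data by iteratively reweighted least squares; setting log_transform=True gives the LOG variant. It is a simplified stand-in with names of our own choosing, not the authors' implementation:

    # Illustrative sketch of the GLM / LOG mappings; names are ours.
    import numpy as np

    def inv_logit(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logit_mapping(p_org_bin, p_real_bin, log_transform=False, n_iter=50):
        """Fit Pkbreal = logit^-1(alpha_k + beta_k * x) by IRLS (maximum quasi-likelihood).

        x is Pkborg for the GLM approach, or log(Pkborg) when log_transform=True (LOG).
        Returns a function that calibrates new probabilities.
        """
        x = np.log(p_org_bin) if log_transform else p_org_bin
        X = np.column_stack([np.ones_like(x), x])
        coef = np.zeros(2)
        for _ in range(n_iter):
            mu = inv_logit(X @ coef)
            w = np.clip(mu * (1.0 - mu), 1e-12, None)  # IRLS weights for the logit link
            z = X @ coef + (p_real_bin - mu) / w       # working response
            coef = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))

        def calibrate(p_org):
            x_new = np.log(p_org) if log_transform else p_org
            return inv_logit(coef[0] + coef[1] * x_new)

        return calibrate

The returned function can be plugged into the cross-validation sketch of Section 2.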
3.2.3. Generalized additive models
An attractive alternative to standard generalized linear models is generalized additive models (Hastie & Tibshirani, 1986, 1987, 1990). Generalized additive models relax the linearity constraint and apply a non-parametric non-linear fit to the data. In other words, the data themselves decide on the functional form between the independent variable and the dependent variable. Define y_i as the dependent variable, with y_i ∈ [0,1] representing Pkbreal, the true prior probabilities from REALkb, and x_i equal to Pkborg, the posterior probabilities of TESTkb. To model fk(x_i) ∈ [0,1], generalized additive models with logit link function are
employed. Methodologically, generalized additive models generalize the generalized linear model principle by replacing the linear predictor β_ki x_i in equation (4) with an additive component, where

    y_i ≡ fk(x_i) = logit⁻¹( α_k + s_ki(x_i) )        (5)
with s_ki(x_i) as a smooth function. This study uses penalized regression splines s_ki(x_i) to estimate the non-parametric trend for the dependency of y_i on x_i (Wahba, 1990; Green and Silverman, 1994). These smooth functions use a large number of knots, leading to a model quite insensitive to the knot locations, while the penalty term is used to avoid the danger of over-fitting that would otherwise accompany the use of many knots. The complexity of the model is controlled by a parameter λ, which is inversely related to the degrees of freedom (df).
If λ is small (i.e. the df are large), a very complex model that closely matches the data is
employed. When λ is large (i.e. the df are small), a smooth model is considered. In order to
optimize the generalized additive model, the fitting amounts to penalized likelihood
maximization by penalized iteratively reweighted least squares (Wood, 2000; 2004; 2008).
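To give an idea of the penalized-spline principle, the sketch below fits a cubic penalized regression spline on the logit-transformed bin incidences; a larger lambda means a smoother fit (fewer effective df). This is a deliberately simplified stand-in: the paper itself uses penalized likelihood maximization via penalized iteratively reweighted least squares (Wood, 2000; 2004; 2008), for which an off-the-shelf GAM implementation would normally be used. All function and parameter names are ours.

    # Simplified penalized regression spline sketch (not the paper's GAM fit); names are ours.
    import numpy as np

    def spline_basis(x, knots):
        """Cubic truncated-power basis: [1, x, x^2, x^3, (x - k)_+^3 for each knot k]."""
        cols = [np.ones_like(x), x, x ** 2, x ** 3]
        cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
        return np.column_stack(cols)

    def fit_penalized_spline(p_org_bin, p_real_bin, n_knots=20, lam=1.0):
        """Penalized spline fitted on logit(Pkbreal); lambda penalizes the knot coefficients."""
        eps = 1e-6
        y = np.clip(p_real_bin, eps, 1.0 - eps)
        z = np.log(y / (1.0 - y))                                  # logit of the bin incidences
        knots = np.quantile(p_org_bin, np.linspace(0.05, 0.95, n_knots))
        B = spline_basis(p_org_bin, knots)
        P = np.diag([0.0] * 4 + [1.0] * n_knots)                   # penalize only the knot terms
        coef = np.linalg.solve(B.T @ B + lam * P, B.T @ z)

        def calibrate(p_org):
            return 1.0 / (1.0 + np.exp(-spline_basis(p_org, knots) @ coef))

        return calibrate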
3.2.4. Generalized additive models with monotonicity constraint
Due to the fact that generalized additive models produce a non-linear relationship between the
independent variable Pkborg and the dependent variable Pkbreal, the original ranking of the
posterior probabilities of TEST kN and its calibrated version may change. However, marketing
analysts could argue that the mapping from TRAIN M onto TEST N and the corresponding
ranking of the customers in TEST N (and respectively TEST kN ) given by the initial classifier
C should be conserved. As such, a non-decreasing monotonicity constraint on the generalized additive model predictions is introduced to retain the original ranking of the customers.
Inspired by rule-set creation advances in the post-learning phase (e.g. pedagogical rule-based
extraction techniques as employed in Martens et al. (2007)), a rule set on the training set of
fold k is produced in the post-estimation phase of the generalized additive models to obtain a
function fk’, a non-decreasing monotone function. This ensures that the initial ranking of Pkborg
is maintained in the corresponding predictions Pkbreal of fold k. Practically, the training set is
sorted by Pkborg. Afterwards the rule-based algorithm detects all non-decreasing monotonic
inconsistencies on the prediction values fk(Pkborg) on the training set. For instance, suppose
that the prediction value for bin X+1 is lower than the prediction value for bin X; the rule-based algorithm then adds a rule to the rule-base to change the prediction value of bin X+1 to the
larger prediction value of bin X. In the end, the generalized additive model and the rule-base
describe a non-decreasing monotone generalized additive model based function fk' with the following characteristic (Denlinger, 2010):

    if PkborgX ≤ PkborgX+1 then fk'(PkborgX) ≤ fk'(PkborgX+1)        (7)

with PkborgX and PkborgX+1 the original non-calibrated posterior probabilities for bins X and X+1 in the training data set, and fk'(PkborgX) and fk'(PkborgX+1) the calibrated posterior probabilities in fold k for bins X and X+1.
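The rule-based correction amounts to a running maximum over the bins sorted by Pkborg, as in the sketch below (names chosen for this illustration):

    # Illustrative sketch of the monotonicity correction of Section 3.2.4; names are ours.
    import numpy as np

    def enforce_monotone(p_org_bin, p_fit_bin):
        """Non-decreasing monotonicity rule (sketch).

        After sorting the bins by Pkborg, any fitted value that drops below the
        previous bin's value is raised to that value, i.e. a running maximum.
        """
        order = np.argsort(p_org_bin)
        corrected = p_fit_bin.copy()
        corrected[order] = np.maximum.accumulate(p_fit_bin[order])
        return corrected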
4. Empirical validation
The calibration methods are employed on a test bed of 9 real-life direct marketing datasets
provided by a large European financial institution. Each of these datasets corresponds to a
typical financial product. Table 1 shows the characteristics of the response datasets.
[INSERT TABLE 1 OVER HERE]
With the aim of methodologically comparing the different algorithms, a 10-fold cross-validation is applied. Furthermore, the classifier C which links TRAINM and TESTN and outputs Porg is a logistic regression with forward variable selection, as it is a robust and well-known classification technique in the marketing environment (Neslin et al., 2006). Moreover,
the calibration approaches based on generalized additive models use different levels of
degrees of freedom (df) representing the non-linearity of the model. The higher the df, the
higher the non-linearity. On the one hand, the df are set manually by the researcher (user-specified), while on the other hand the df are simultaneously estimated in correspondence
with the shape of the response function (automatic). This study opts to manually set the df
equal to {3,4,5} (resulting in GAMdf and GAMdf MONO). This df range is inspired by the
recommendation and the applications in Hastie & Tibshirani (1990) and Hastie et al. (2001)
that use a relatively small number of df to account for different levels of non-linearity.
Additionally, the generalized cross-validation procedure (GCV) is employed to automatically
select the ideal number of df, resulting in GAMgcv and GAMgcv MONO (Gu & Wahba,
1991; Wood, 2000; 2004). The number of bins b for TEST kN and REAL kN is set to 200.
Furthermore, Porg, the non-calibrated posterior probabilities of TEST N , are used as a
benchmark (ORIGINAL). The different algorithms are compared on an individual customer
level using the log-likelihood (LL) by
N
LL
p( x i ) y i 1 p( x i )
ln(
i 1
1 yi
N
)
y i ln p( x i )
(1 y i ) ln 1 p( x i )
(8)
i 1
with N the number of customers, p( x i ) equal to Pknew, the calibrated posterior response
probability, and yi as the real response variable with y i
0,1 . The LL is a well-known
metric in (direct) marketing to evaluate the performance of an algorithm (e.g. Baumgartner &
Hruschka, 2005). The higher the LL, the better the calibration of the posterior probabilities to
the true response distribution is. Moreover, the non-parametric Friedman test (Demšar, 2006;
Friedman, 1937, 1940) with the Bonferroni-Dunn post-hoc test (Dunn, 1961) is used to assess whether the different approaches differ significantly from the best performing algorithm.
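For completeness, equation (8) and the comparison across datasets can be sketched as follows (Python; the per-algorithm arrays are hypothetical placeholders):

    # Illustrative sketch of the evaluation step; array names are hypothetical.
    import numpy as np
    from scipy.stats import friedmanchisquare

    def log_likelihood(p_new, y_real, eps=1e-12):
        """Log-likelihood of equation (8), computed on the individual customer level."""
        p = np.clip(p_new, eps, 1.0 - eps)     # guard against log(0)
        return np.sum(y_real * np.log(p) + (1.0 - y_real) * np.log(1.0 - p))

    # Friedman test over the 9 datasets, one LL value per dataset and algorithm;
    # ll_original, ll_saerens, ll_gam are hypothetical arrays of length 9.
    # stat, p_value = friedmanchisquare(ll_original, ll_saerens, ll_gam)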
5. Results
Table 2 represents the 10-fold cross-validated log-likelihood values for the different datasets
and the different algorithms. Three panels (a,b,c) are included representing the various levels
of the user-selected degrees of freedom for the generalized additive model mappings. For
each dataset, the best performing algorithm in terms of log-likelihood is put in italics.
Moreover, the average ranking (AR) per algorithm over the different datasets is given. The
lower the ranking, the better the algorithm is shown to be. The best performing algorithm is
underlined and set in bold, while the algorithms that are not significantly different to the best
one at a 5% significance level are only set in bold.
[INSERT TABLE 2 OVER HERE]
The algorithms are split into 4 categories; the original, non-calibrated posterior probabilities
(ORIGINAL), the rescaling methodology (SAERENS), the linear probability-mapping
approach (GLM) and the non-linear probability-mapping approaches (LOG, GAMdf, GAMdf
MONO, GAMgcv and GAMgcv MONO).
Table 2 reveals that calibrating the posterior probabilities has a beneficial impact when a
discrepancy exists between the true prior probabilities of the training set and the test set:
ORIGINAL always performs worse than the other calibration approaches.
Comparing the rescaling approach (SAERENS) with the best performing calibration
approaches, one concludes that SAERENS always performs significantly worse than the non-linear probability-mapping approaches, while SAERENS performs better than the linear probability-mapping approach (GLM). These results show that the analyst is better off shifting towards a non-linear probability-mapping approach, even though SAERENS is an easy and workable solution to the calibration problem.
Contrasting the various probability-mapping approaches, Table 2 discloses that the non-linear
calibration approaches (LOG, GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO) are
always amongst the best performing algorithms. The linear mapping approach (GLM) is never
significantly competitive with one of its non-linear counterparts. However, the generalized
linear model with log-transformation (LOG) is competitive to the more advanced GAM
approaches (GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO). Within the non-linear calibration setting, one concludes that GAMgcv MONO always performs best, followed
by the other non-linear calibration approaches.
Table 3 contains the performance measures for all generalized additive models approaches
(GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO), for all the levels of degrees of
freedom. On a dataset level, the best performing algorithm is put in italics. Furthermore, the
average ranking (AR) for each algorithm is given and the best performing algorithm (i.e. the
one with the lowest ranking) is underlined and set in bold, while the ones that are not
significantly different to the best at a 5% significance level are simply put in bold.
[INSERT TABLE 3 OVER HERE]
Table 3 reveals that GAM5 MONO is the best performing algorithm amongst the GAM and
GAM MONO approaches, closely followed by GAMgcv MONO. Table 3 shows a better performance trend for the GAM approaches when the number of df is increased: GAM3 performs less well than GAM4, while GAM4 performs less well than GAM5.
Furthermore, it is clear that including the monotonicity constraint has a beneficial impact on
the calibration performance of the GAM approaches. The average ranking of the GAM
approaches including the monotonicity constraint is always better than their original GAM
counterparts (i.e. GAMdf versus GAMdf MONO and GAMgcv versus GAMgcv MONO).
Moreover, the automatic smoothness parameter selection procedure proves its beneficial
impact. For the non-monotonicity models, GAMgcv always has a better ranking than the GAMdf approaches. For the monotonicity models, GAMgcv MONO always performs better than GAM3 MONO and GAM4 MONO, while GAMgcv MONO is very competitive with
GAM5 MONO.
6. Discussion
The results suggest that marketing analysts should calibrate the posterior probabilities when
the training set does not represent the true prior distribution. In general, calibrating the
posterior probabilities is more beneficial than using the non-calibrated posterior probabilities.
Moreover, it is shown that a 'simple' rescaling algorithm (SAERENS) that takes into account the ratio of the old and the new priors is not an adequate first solution to the calibration problem: SAERENS always performs significantly worse than the more complex non-linear probability-mapping approaches. Furthermore, marketing researchers are advised not to apply the linear probability-mapping approach in this specific setting. Indeed, amongst the different probability-mapping approaches, it has been shown that
non-linear approaches are preferable over the linear mappings. The LOG approach is
competitive to the more complex GAM-based calibration approaches, and because it is based
on the common generalized linear model framework, LOG could be seen as a first and
workable approach. However, if one is interested in optimizing the calibration performance, the GAM-based approaches are preferable. Moreover, one concludes that using the automatic smoothing parameter selection procedure and imposing a monotonicity constraint on the GAM method are the preferred options for optimizing the calibration performance of GAM models.
7. Conclusion
Direct marketing receives considerable attention these days in academia as well as in business
due to a serious drop in the cost of IT equipment and the ever increasing usage of response
models in a variety of business settings. In a direct marketing context, a discrepancy
sometimes exists between the prior distributions on the training set and scoring set which is
problematic. This may happen due to the fact that the training set consists entirely of
customers previously selected by a response model, and thus this dataset consists of a higher
percentage of responders. Applying a classification model built on this training set to the
complete set of customers will harm the estimation of the response probabilities. Thoroughly
adjusting the posterior probabilities to the real response probability distribution will improve
the classification performance. This study reveals that the non-linear probability-mapping
approaches are amongst the best performing algorithms and their usage is highly
recommended in a day-to-day business setting for the following reasons. Firstly, the non-linear probability-mapping approaches deliver a better performance compared to the other calibration algorithms included in this research paper. As a result, the calibrated probabilities better reflect the true probabilities of response. Secondly, there is the possibility to visualize the relationship between Pkborg and Pkbreal. This gives managers a better, visual understanding of the calibration process for a particular setting. For instance, the further the calibration curve is away from the 45 degree line (i.e. the line where Pkborg = Pkbreal and no calibration is necessary), the higher the added value of sending a leaflet, because the incidence
in TRAIN M is higher than in REAL N . Finally, the underlying techniques like generalized
linear models and generalized additive models are easily implementable in today’s business
environment due to the availability of the classifiers in traditional software packages like SAS
and R.
Whilst we are confident that our study adds significant value to the literature, valuable
directions for future research are identified. Besides the probability-mapping approaches, which map Pkborg onto Pkbreal, an extensive research project could be dedicated to investigating
the impact of ‘integrated’ calibration approaches, i.e. methods that integrate the calibration
process into the initial training phase of classifier C in order to come up with a new classifier
C’ which directly outputs calibrated probabilities. For instance, a workable ‘integrated’
calibration approach could be represented by a two-stage Bayesian logistic regression
approach that directly outputs calibrated posterior probabilities. In order to obtain this
‘integrated’ Bayesian calibration model, the following procedure is proposed. Under the
assumption that the commonly-used prior distribution for β_ki is multivariate Gaussian, i.e. p(β_ki) ~ N(β_0, Σ_0), the empirical Bayes approach could be used to specify the values of
β_0 and Σ_0 by fitting a Bayesian logistic regression to TRAINkM using non-informative priors. Consequently, the resulting posterior mean vector and variance-covariance matrix of this initial model could then be used as the values of β_0 and Σ_0 for the second Bayesian logistic regression on REALkN. The resulting 'integrated' Bayesian logistic regression approach C'
will directly output adapted, calibrated posterior probabilities¹. Furthermore, the probability-mapping approaches are validated in a direct marketing setting, whereas future research efforts could be spent investigating their external validity in other operational research settings.
Acknowledgements
The authors would like to thank the anonymous company for freely distributing the datasets.
We would like to thank our friendly reviewers and the journal reviewers for their fruitful comments on
earlier versions of this paper and the editor, Jesus Artalejo, for guiding this paper through the
reviewing process.
¹ Nevertheless, this approach is not tested in the current version of the paper for confidentiality reasons.
References
Allenby, G. M., Leone, R. P., & Jen, L. C. (1999). A dynamic model of purchase timing with
application to direct marketing. Journal of the American Statistical Association, 94, 365-374.
Baesens, B., Viaene, S., Van den Poel, D., Vanthienen, J., & Dedene, G. (2002). Bayesian
neural network learning for repeat purchase modeling in direct marketing. European Journal
of Operational Research, 138, 191-211.
Baumgartner, B., & Hruschka, H. (2005). Allocation of catalogs to collective customers based
on semiparametric response models. European Journal of Operational Research, 162, 839-849.
Bose, I., & Chen, X. (2009). Quantitative models for direct marketing: A review from systems
perspective. European Journal of Operational Research, 195, 1-16.
Bosio, S., & Righini, G. (2007). Computational approaches to a combinatorial optimization
problem arising from text classification. Computers & Operations Research, 34, 1910-1928.
Bult, J. R. (1993). Semiparametric Versus Parametric Classification Models: An Application
to Direct Marketing. Journal of Marketing Research, 30, 380-390.
Conforti, D., & Guido, R. (2010). Kernel based support vector machine via semidefinite
programming: Application to medical diagnosis. Computers & Operations Research, 37,
1389-1394.
Crone, S. F., Lessmann, S., & Stahlbock, R. (2006). The impact of preprocessing on data
mining: An evaluation of classifier sensitivity in direct marketing. European Journal of
Operational Research, 173, 781-800.
Deichmann, J., Eshghi, A., Haughton, D., Sayek, S., & Teebagy, N. (2002). Application of
multiple adaptive regression splines (MARS) in direct response modeling. Journal of
Interactive Marketing, 16, 15-27.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of
Machine Learning Research, 7, 1-30.
Denlinger, C.G. (2010). Elements of real analysis. Jones and Bartlett Publishers.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical
Association, 56, 52-64.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the
analysis of variance. Journal of the American Statistical Association, 32, 675-701.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m
rankings. The Annals of Mathematical Statistics, 11, 86-92.
Green, P.J. & Silverman, B.W. (1994). Nonparametric regression and generalized linear
models. Chapman and Hall/CRC Press.
Gu, C., & Wahba, G. (1991). Minimizing GCV/GML scores with multiple smoothing
parameters via the Newton method. SIAM Journal on Scientific and Statistical Computing, 12,
383-398.
Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1, 297-318.
Hastie, T., & Tibshirani, R. (1987). Generalized Additive Models: Some applications. Journal
of the American Statistical Association, 82, 371-386.
Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. London: Chapman and
Hall.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data
Mining, Inference and Prediction. New York: Springer-Verlag.
Haughton, D., & Oulabi, S. (1997). Direct marketing modeling with CART and CHAID.
Journal of Direct Marketing, 11, 42-52.
Hruschka, H. (2010). Considering endogeneity for optimal catalog allocation in direct
marketing. European Journal of Operational Research, 206, 239-247.
Kim, H.S., & Sohn, S.Y. (2010). Support vector machines for default prediction of SMEs
based on technology credit. European Journal of Operational Research, 201, 838-846.
Lamb, C. W., Hair, J. F., & McDaniel, C. (1994). Principles of marketing (second ed.).
Cincinnati: South-Western Publishing Co.
Lee, H. J., Shin, H., Hwang, S. S., Cho, S., & MacLachlan, D. (2010). Semi-Supervised
Response Modeling. Journal of Interactive Marketing, 24, 42-54.
Martens, D., Baesens, B., Van Gestel, T., & Vanthienen, J. (2007). Comprehensible credit
scoring models using rule extraction from support vector machines. European Journal of
Operational Research, 183, 1466-1476.
Martens, D., Van Gestel, T., De Backer, M., Haesen, R., Vanthienen, J., & Baesens, B.
(2010). Credit rating prediction using Ant Colony Optimization. Journal of the Operational
Research Society, 61, 561-573.
Morales, D. R., & Wang, J. B. (2010). Forecasting cancellation rates for services booking
revenue management using data mining. European Journal of Operational Research, 202,
554-562.
Naik, P. A., Hagerty, M. R., & Tsai, C. L. (2000). A new dimension reduction approach for
data-rich marketing environments: Sliced inverse regression. Journal of Marketing Research,
37, 88-101.
Neslin, S. A., Gupta, S., Kamakura, W., Lu, J. X., & Mason, C. H. (2006). Defection
detection: Measuring and understanding the predictive accuracy of customer churn models.
Journal of Marketing Research, 43, 204-211.
Paleologo, G., Elisseeff, A., & Antonini, G. (2010). Subagging for credit scoring models.
European Journal of Operational Research, 201, 490-499.
Piersma, N., & Jonker, J.J. (2004). Determining the optimal direct mailing frequency.
European Journal of Operational Research, 158, 173-182.
Saerens, M., Latinne, P., & Decaestecker, C. (2002). Adjusting the outputs of a classifier to
new a priori probabilities: A simple procedure. Neural Computation, 14, 21-41.
Tabachnick, B. G. & Fidell, L. S. (1996). Using multivariate statistics. HarperCollins
Publishers, New York.
Wahba, G. (1990). Spline models for observational data. Society for Industrial and Applied
Mathematics (SIAM) Capital City Press, Montpelier (Vermont).
Wood, S.N. (2000). Modelling and Smoothing Parameter Estimation with Multiple Quadratic
Penalties. Journal of the Royal Statistical Society B, 62, 413-428.
Wood, S.N. (2004). Stable and efficient multiple smoothing parameter estimation for
generalized additive models. Journal of the American Statistical Association, 99, 673-686.
Wood, S.N. (2008). Fast stable direct fitting and smoothness selection for generalized additive
models. Journal of the Royal Statistical Society B, 70, 495-518.
Zahavi, J., & Levin, N. (1997). Applying neural computing to target marketing. Journal of
Direct Marketing, 11, 76-93.
[Figure 1 (flow diagram): the classifier C is trained on TRAINM and applied to TESTN; for k = 1 to 10, the calibration methods (ORIGINAL | SAERENS and LIN | LOG | GAM | GAM MONO) map the binned folds TESTkN, together with REALkN, onto the calibrated folds NEWkN, which are combined into NEWN.]
Figure 1: Methodological framework.
Dataset ID   TRAINM                        TESTN                          # variables used by C
             # customers   % responders    # customers    % responders
1            70,463        1.29%           119,329        0.18%           10
2            56,301        2.40%           119,104        0.44%           16
3            23,328        7.57%           117,7433       0.14%           19
4            9,027         11.94%          305,567        0.57%           12
5            14,946        17.11%          1,073,346      0.18%           22
6            14,586        5.04%           1,223,703      0.05%           11
7            25,660        3.10%           748,602        0.18%           14
8            12,603        0.56%           127,651        0.24%           10
9            19,190        0.95%           113,496        0.23%           18

Table 1: Dataset characteristics.
Panel a

DATASET   ORIGINAL    SAERENS     GLM         LOG         GAM3        GAM3 MONO   GAMgcv      GAMgcv MONO
1         -242.91     -179.07     -202.76     -178.02     -177.88     -177.58     -180.22     -177.40
2         -479.55     -306.81     -323.14     -304.73     -306.83     -307.03     -304.23     -303.60
3         -998.78     -1280.32    -1064.30    -980.99     -982.94     -981.08     -980.74     -979.81
4         -223.14     -206.79     -223.00     -206.29     -206.69     -206.56     -207.16     -206.61
5         -243.69     -140.90     -246.53     -140.36     -142.71     -140.35     -145.56     -139.96
6         -9884.39    -1192.41    -1189.09    -1173.78    -1165.40    -1165.18    -1163.68    -1163.46
7         -3823.20    -1032.46    -1025.02    -1016.21    -1017.24    -1016.53    -1016.52    -1016.04
8         -17802.90   -1290.20    -1297.41    -1294.45    -1291.47    -1292.57    -1290.74    -1291.86
9         -5493.35    -510.03     -525.81     -506.11     -523.01     -507.20     -515.27     -505.91
AR        7.67        5.00        6.89        3.22        4.44        3.44        3.78        1.56

Panel b

DATASET   ORIGINAL    SAERENS     GLM         LOG         GAM4        GAM4 MONO   GAMgcv      GAMgcv MONO
1         -242.91     -179.07     -202.76     -178.02     -178.06     -177.40     -180.22     -177.40
2         -479.55     -306.81     -323.14     -304.73     -305.68     -305.68     -304.23     -303.60
3         -998.78     -1280.32    -1064.30    -980.99     -983.81     -980.17     -980.74     -979.81
4         -223.14     -206.79     -223.00     -206.29     -206.48     -206.37     -207.16     -206.61
5         -243.69     -140.90     -246.53     -140.36     -146.48     -140.02     -145.56     -139.96
6         -9884.39    -1192.41    -1189.09    -1173.78    -1164.36    -1164.06    -1163.68    -1163.46
7         -3823.20    -1032.46    -1025.02    -1016.21    -1016.80    -1016.27    -1016.52    -1016.04
8         -17802.90   -1290.20    -1297.41    -1294.45    -1290.92    -1292.02    -1290.74    -1291.86
9         -5493.35    -510.03     -525.81     -506.11     -522.89     -507.05     -515.27     -505.91
AR        7.66        5.22        6.88        3.22        4.55        2.77        3.88        1.77

Panel c

DATASET   ORIGINAL    SAERENS     GLM         LOG         GAM5        GAM5 MONO   GAMgcv      GAMgcv MONO
1         -242.91     -179.07     -202.76     -178.02     -178.70     -177.38     -180.22     -177.40
2         -479.55     -306.81     -323.14     -304.73     -305.46     -304.86     -304.23     -303.60
3         -998.78     -1280.32    -1064.30    -980.99     -982.74     -979.81     -980.74     -979.81
4         -223.14     -206.79     -223.00     -206.29     -206.52     -206.33     -207.16     -206.61
5         -243.69     -140.90     -246.53     -140.36     -149.81     -139.94     -145.56     -139.96
6         -9884.39    -1192.41    -1189.09    -1173.78    -1163.91    -1163.62    -1163.68    -1163.46
7         -3823.20    -1032.46    -1025.02    -1016.21    -1016.59    -1016.11    -1016.52    -1016.04
8         -17802.90   -1290.20    -1297.41    -1294.45    -1290.75    -1291.86    -1290.74    -1291.86
9         -5493.35    -510.03     -525.81     -506.11     -522.79     -506.63     -515.27     -505.91
AR        7.66        5.11        6.88        3.33        4.66        2.33        4.00        1.88

* 10-fold CV LL values, AR = average ranking. SAERENS = rescaling; GLM = linear probability-mapping; LOG, GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO = non-linear probability-mapping.
Table 2: The 10-fold cross-validated log-likelihood values.
Panel a: overview with GAM3 & GAM3 MONO.
Panel b: overview with GAM4 & GAM4 MONO.
Panel c: overview with GAM5 & GAM5 MONO.
DATASET   GAM3        GAM3 MONO   GAM4        GAM4 MONO   GAM5        GAM5 MONO   GAMgcv      GAMgcv MONO
1         -177.88     -177.58     -178.06     -177.40     -178.70     -177.38     -180.22     -177.40
2         -306.83     -307.03     -305.68     -305.68     -305.46     -304.86     -304.23     -303.60
3         -982.94     -981.08     -983.81     -980.17     -982.74     -979.81     -980.74     -979.81
4         -206.69     -206.56     -206.48     -206.37     -206.52     -206.33     -207.16     -206.61
5         -142.71     -140.35     -146.48     -140.02     -149.81     -139.94     -145.56     -139.96
6         -1165.40    -1165.18    -1164.36    -1164.06    -1163.91    -1163.62    -1163.68    -1163.46
7         -1017.24    -1016.53    -1016.80    -1016.27    -1016.59    -1016.11    -1016.52    -1016.04
8         -1291.47    -1292.57    -1290.92    -1292.02    -1290.75    -1291.86    -1290.74    -1291.86
9         -523.01     -507.20     -522.89     -507.05     -522.79     -506.63     -515.27     -505.91
AR        6.55        5.55        5.88        3.66        5.22        2.11        4.55        2.33

* 10-fold CV LL values, AR = average ranking
Table 3: The 10-fold cross-validated log-likelihood values for GAM and GAM MONO calibration models.