PREDICTIVE MODELING USING SEGMENTATION Nissan Levin

PREDICTIVE MODELING USING SEGMENTATION
Nissan Levin
Faculty of Management
Tel Aviv University
Jacob Zahavi
The Alberto Vitale Visiting
Professor of Electronic Commerce
The Wharton School
On leave from Tel Aviv University
December 1999
JZ-Predictive.doc/mp
Abstract
We discuss the use of segmentation as a predictive model for supporting targeting
decisions in database marketing. We compare the performance of judgmentally -based RFM
and FRAC methods to automatic tree classifiers involving the well-known CHAID algorithm, a
variation of the AID algorithm and a newly-developed method based on genetic algorithm (GA).
We use the logistic regression model as a benchmark for the comparative analysis. The results
indicate that automatic segmentation methods may very well substitute the judgmentally-based
segmentation methods for response analysis, and come only short of the logistic regression
results. The implications of the results for decision making are also discussed.
1.
Introduction
Segmentation is key to marketing. Since customers are not homogenous and differ with
respect to one another in their preferences, wants and needs, the idea is to partition the market
into groups, or segments, of “like” people, with similar needs and characteristics, which are
likely to exhibit similar purchasing behavior. Then, one may offer each segment with the
products/services which are keen to the members of the segment. Ideally, the segmentation
should reflect customers’ attitude towards the product/service involved. Since this is often not
known in advance, the best proxy is to use data that reflect customers’ purchasing habits and
behavior. Weinstein discusses several dimensions to segmentation (1994, p.4) :
* Geography - classifying markets on the basis of geographical
considerations.
* Socioeconomic - segmentation is based on factors reflecting customers’
socioeconomic status, such as income level, education.
* Psychographic - differentiating markets by appealing to customers’ needs,
personality traits and lifestyle.
* Product usage - partitioning the market based on consumption level of various
users group.
* Benefits - splitting the market based upon the benefit obtained from a
product/service, such as price, service, special features and factors.
Hence, the segmentation process is information-based, the more information is available, the
more refined and focused are the resulting segments.
Perhaps more than in any other marketing channel, segmentation is especially powerful
in database marketing (DBM) where one can use the already available wealth of information on
2
customers’ purchase history and demographic and lifestyle characteristics to partition the
customer list to segments. The segmentation process, often referred to as profiling, is used to
distinguish between customers and noncustomers, where “customers” here are extended to
include buyers, payers, loyal customers, etc., and to understand their composition and
characteristics - who they are? what do they look like? what are their attributes? where do they
reside?, etc. This analysis supports a whole array of decisions, ranging from targeting decisions
to determining efficient and cost effective marketing strategies, even evaluating market
competition.
In this paper we discuss the use of segmentation as a predictive model for supporting
targeting decisions in database marketing. We compare the performance of judgmentally-based
RFM and FRAC methods, to several automatic tree-structured segmentation methods (decision
trees). To assess how good is the performance of the segmentation-based models, we compare
them against the results of a logistic regression model, which is undoubtedly one of the most
advanced response models in database marketing and a one which is certainly hard to “beat”.
Logistic regression is widely discussed in the literature, and will not be reviewed here. See for
example, Ben-Akiva (1987), Long (1997) and others.
Several studies have been conducted so far on the use of segmentation methods for
supporting targeting decisions. Haughton and Oulabi (1997) compare the performance of
response models built with CART and CHAID on a case study that contains some 10,000
cases, with about 30 explanatory variables and about the same proportion of responders and
non-responders. Bult and Wansbeek (1995) devise a profit-maximization approach to selecting
customers for promotion, comparing the performance of CHAID against several “parametric”
models (e.g., logistic regression) using a sample of about 14,000 households with only 6
3
explanatory variables.
Novak et al. (1992) devise a “richness” curve for evaluating
segmentation results, defined as the running average of the proportion of individuals in a segment
which are “consumers”, where segments are added in decreasing rank order of their response.
Morwitz and Schmittlein (1992) investigate whether the use of segmentation can improve the
accuracy of sales forecasts based on stated purchase intent involving CART, discriminant
analysis and K-means clustering algorithm. Other attempts have been made to improve the
targeting decisions with segmentation by using prior information. Of these we mention the
parametric approach of Morwitz and Schmittlein (1998), and the non-parametric approach of
Levin and Zahavi (1996).
This paper provides a hard empirical evidence of the relative merits of various
segmentation methods, focussing on several issues:
- How well automatic tree classifiers are capable of discriminating between buying
and non-buying segments?
- How well automatic segmentation perform as compared to manually-based RFM
and FRAC segmentation?
- How the various automatic tree classifiers compare against logistic regression
results?
- What practical implications one needs to look into when using automatic
segmentation?
On the theoretical front, we offer a unified framework to formulate decision trees for
segmenting an audience based on a choice variable, and expand the existing tree classifiers in
several directions.
4
The development and the evaluation of the decision trees were conducted by the
authors’ own computer programs. We note that since tree classifiers are heuristic methods, the
resulting tree is as good as the algorithm used to create the tree. Hence, all results in this paper
reflect the performance of our computer algorithms, which may not extend to other algorithms.
The various methods are demonstrated and evaluated using realistic data from the
collectible industry. For confidentiality reasons, all results are presented in percentage terms.
Also discussed are the implications of the results for decision making.
2.
Segmentation Methods
2.1
Judgmentally-Based Methods
Judgmentally-based or “manual” segmentation methods are still most commonly used in
DBM to partitoin a customers’ list into “homogenous” segments. Typical segmentation criteria
include previous purchase behavior, demographics, geographics and psychographics. Previous
purchase behavior is often considered to be the most powerful criterion in predicting likelihood
of future response. This criterion is operationalized for the segmentation process by means of
Recency, Frequency, Monetary (RFM) variables (Shepard, 1995). Recency corresponds to
the number of weeks (or months) since the most recent purchase, or the number of mailings
since last purchase; frequency to the number of previous purchases or the proportion of mailings
to which the customer responded; and monetary to the total amount of money spent on all
purchases (or purchases within a product category), or the average amount of money per
purchase. The general convention in the DBM industry is that the more recently the customer
has placed the last order, the more items he/she bought from the company in the past, and the
more money he/she spent on the company’s products, the higher is his/her likelihood of
5
purchasing the next offering and the better target he/she is. This simple rule allows one to
arrange the segments in decreasing likelihood of purchase.
The more sophisticated manual methods also make use of product/attribute proximity
considerations in segmenting a file. By and large, the more similar the products bought in the
past are to the current product offering, or the more related are the attributes (e.g., themes), the
higher the likelihood of purchase. For example, in a book club application, customers are
segmented based upon the proximity of the theme/content of the current book to those of
previously-purchased books.
Say, the currently promoted book is “The Art History of
Florence”, then book club members who previously bought Italian art books are the most likely
candidates to buy the new book, and are therefore placed at the top of the segmentation list,
then people who purchased general art books, followed by people who purchased geographical
books, and so on. In cases where males and females may react differently to the product
offering, gender may also be used to partition customers into groups. By and large, the list is
first partitioned by product/attribute type, then by RFM and then by gender (i.e., the
segmentation process is hierarchical). This segmentation scheme is also known as FRAC Frequency, Recency, Amount (of money) and Category (of product) (Kestnbaum, 1998).
Manually-based RFM and FRAC methods are subject to judgmental and subjective
considerations. Also, the basic assumption behind the RFM method may not always hold. For
example, in durable products, such as cars or refrigerators, recency may work in a reverse way
- the longer the time since last purchase, the higher the likelihood of purchase. Finally, to meet
segment size constraints, it may be necessary to run the RFM/FRAC iteratively, each time
combining small segments and splitting up large segments, until a satisfactory solution is obtained.
This may increase computation time significantly.
6
2.2
Decision Trees
Several “automatic” methods have been devised in the literature to take away the
judgmental and subjective considerations inherent in the manual segmentation process. By and
large, these methods map data items (customers, in our case) into one of several predefined
classes. In the most simple case, the purpose is to segment customers into one of two classes,
based on some type of a binary response, such as buy/no buy, loyal/non-loyal, pay/no-pay, etc.
Thus, tree classifiers are choice-based. Without loss of generality we refer to the choice
variable throughout this paper as purchase/no -purchase, thus classifying the customers into
segments of “buyers” and “nonbuyers”.
Several automatic tree classifiers were discussed in the literature, among them AID Automatic Interaction Detection (Sonquist, Baker and Morgan, 1971); CHAID - Chi square
AID (Kass, 1983), CART - Classification and Regression Trees (Breiman, Friedman, Olshen
and Stone, 1984), ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), and others. A comprehensive
survey of automatic construction of decision trees from data was compiled recently by Murthy
(1998).
Basically, all automatic tree classifiers share the same structure. Starting from a “root”
node (the whole population), tree classifiers employ a systematic approach to grow a tree into
“branches” and “leaves”. In each stage, the algorithm looks for the “best” way to split a “father”
node into several “children” nodes, based on some splitting criteria. Then, using a set of
predefined termination rules, some nodes are declared as “undetermined” and become the father
nodes in the next stages of the tree development process, some others are declared as
7
“terminal” nodes. The process proceeds in this way until no more node is left in the tree which is
worth splitting any further. The terminal nodes define the resulting segments. If each node in a
tree is split into two children only, one of which is a terminal node, the tree is said to be
“hierarchical”.
Three main considerations are involved in developing automatic trees:
- Growing the tree
- Determining the best split
- Termination rules
Growing the Tree
One grows the tree by successively partitioning nodes based on the data. A node may
be partitioned based on several variables at a time, or even a function of variables (e.g., a linear
combination of variables). With so many variables involved, there are practically infinite number
of ways to split a node. Take as an example just a single continuous variable. This variable
alone can be partitioned in infinite number of ways, let alone when several variables are involved.
In addition, each node may be partitioned into several descendants (or splits), each
becomes a “father” node to be partitioned in the next stage of the tree development process.
Thus, the larger the number of splits per node, the larger the tree and the more prohibitive the
calculations.
Indeed, several methods have been applied in practice to reduce the number of possible
partitions of a node:
- All continuous variables are categorized prior to the tree development
process into small number of ranges (“binning”). A similar procedure
8
applies for integer variables which assume many values (such as the
frequency of purchase).
- Nodes are partitioned only on one variable at a time (“univariate” algorithm).
- The number of splits per each “father” node is often restricted to two (“binary”
trees).
- Splits are based on a “greedy” algorithm in which splitting decisions are made
sequentially looking only on the impact of the split in the current stage, but never
beyond (i.e., there is no “looking ahead”).
Determining the Best Split
With so many possible partitions per node, the question is what is the best split? There
is no unique answer to this question as one may use a variety of splitting criteria, each may result
in a different “best” split. We can classify the splitting criteria into two “families”: node-value
based criteria and partition-value based criteria.
- Node-value based criteria: seeking the split that yields the best improvement in
the node value.
- Partition-value based criteria: seeking the split that separates the node into
groups which are as different from each other as possible
We discuss the splitting criteria at more length in Appendix A.
Termination Rules
Theoretically, one can grow a tree indefinitely, until all terminal nodes contain very few
customers, as low as one customer per segment. The resulting tree in this case is unbounded
9
and unintelligible, having the effect of “can’t see the forest because of too many trees”. It misses
the whole point of tree classifiers whose purpose is to divide the population into buckets of
“like” people, where each bucket contains a meaningful number of people for statistical
significance. Also, the larger the tree, the larger the risk of overfitting. Hence it is necessary to
control the size of a tree by means of termination rules that determine when to stop growing the
tree. These termination rules should be set to ensure statistical validity of the results and avoid
overfitting.
We discuss three tree classifiers in this paper: a variation of AID which we refer to as
Standard Tree Algorithm (STA), CHAID and a new tree classifier based on Genetic Algorithm
(GA). These algorithms are further described in Appendix B.
3.
A Case Study
We use a real case study and actual results to demonstrate and evaluate the
performance of decision trees vis-a-vis manually-based trees. The case study involves a solo
mailing of a collectible item that was live-tested in the market and then rolled out. The data for
the analysis consists of the test audience with appended orders, containing 59,970 customers,
which we randomly split into two mutually exclusive samples - a training (calibration) sample,
consisting of 60% of the observations, to build the model (tree) with, and a holdout sample,
containing the rest of the customers, to validate the model with.
As alluded to earlier, only binary predictors may take part in the partitioning process.
Hence, all continuous and multi-valued integer variables were categorized, prior to invoking the
tree algorithm, into ranges, each is represented by a binary variable assuming the value of 1 if the
variable falls in the interval, 0-otherwise.
This process is also referred to as “binning”.
10
Depending upon the tree, the resulting categories may be either overlapping (i.e., X ? a i ,
i ? 1, 2, ? , where X is a predictor, a i a given breakpoint) or non-overlapping (i.e.,
ai ? 1 ? X ? ai , i ? 1, 2,? ).
The trees were evaluated using goodness-of-fit criteria which express how well the
profiling analysis is capable of discriminating between the buyers and the nonbuyers. A common
measure is in terms of the percentage of buyers “captured” per the percentage of audience
mailed, the higher the percentage of buyers, the “better” the model. For example, a
segmentation scheme that captures 80% of the buyers for 50% of the audience is better than a
segmentation scheme that captures only 70% of the buyers for the same audience.
Below we discuss several considerations in setting up the study:
Feature Selection
In DBM applications, the number of potential predictors could be very large, often in the
order of magnitude of several hundreds, even more, predictors. Thus, the crux of the modelbuilding process in DBM is to pick the subset of predictors, usually only a handful of which, that
explain the customers choice decision. This problem, referred to as the feature selection (or
specification) problem, is a tough combinatorial problem and is definitely the most complex issue
in building large scale multivariate models.
Decision trees possess an advantage over statistical regression models in that they have
an inherent built-in mechanism to pick the predictors affecting the terminal segments. They do it
by going over all possible combinations to grow a tree (subject to the computational constraints
discussed in the previous section) and selecting the best split for each node using one of many
splitting criteria (discussed in Appendix A).
11
Logistic regression, on the other hand do not enjoy such a benefit, and one needs to
incorporate a specification procedure into the process (e.g., a stepwise regression approach).
Hence, using logistic regression models in DBM applications is not easy, and is definitely not as
straightforward as building decision trees, and may require an extensive expertise in statistics. In
our case, we use a rule-based expert system to weed out the “bad” from the “good” predictors,
using rules that reflect statistical theory and practice. Examples are rules that set up the level of
significance to include a variable in the model, or rules that set up a threshold on the allowed
degree of multicollinearity between predictors, and the like. These rules were calibrated using
an extensive experimentation process.
Number of Predictors for a Split
The number of predictors to split a node by is constrained by the tree classifier:
?? STA - Our AID-like algorithm, was expanded to allow for splitting a none based on
two predictors at a time. This enabled us to also account for the interacton (or
secondary) effect on the decision tree.
?? CHAID - in contrast to STA and GA, CHAID considers all predictors resulting from
a categorical representation of a variable as one group in the partitioning process. For
example, suppose MARITAL denote the marital status of a customer with four values
(single, married, widow, divorce), then CHAID seeks the best way to split a node
from among all possible combinations to group these four predictors (see Appendix B
for more details).
?? Finally, the main benefit of GA is that it can use any number of predictors to split a
node by. However, due to computational constraints, we have limited the number of
12
variables in our study to only three and four predictors at a time.
The Predictors Set
The mix of predictors is a major factor affecting the resulting tree structure, the larger the
number of potential predictors to split a tree, the larger is the number of segments and the
smaller is the size of each terminal segment. To determine the impact of the mix and the number
of predictors on the performance of the tree classifiers, we have used four sets of predictors in
our study.
Set 1
- affinity: the product attributes corresponding to the current product,
grouped into major categories based on similarly measures.
Set 2 -
affinity, recency: product attributes plus recency (number of months
since last purchase) categorized into predefined ranges (0-6 months,
7-12 months, etc)
Set 3 -
affinity, recency, frequency: product attributes, plus recency variables
plus frequency measures broken down by product lines.
Set 4 -
all predictors which exist in the customer file.
Min/Max Segment Size
Size constraints are most crucial in segmentation analysis. To minimize the error
incurred in case wrong decisions are made (e.g., because of sampling errors), segments are
required to be “not-too-small” and “not-too-big”. If a segment is too small, the probability of
making an error increases due to the lack of statistical significance. If the segment is too big,
then if a “good” segment is somehow eliminated from the mailing (Type I error) - large foregone
13
profits are incurred, and if a “bad” segment makes it to the mailing (Type II error) - large out-ofpocket costs are incurred.
Consequently, we have built a mechanism in all our automatic tree algorithms to account
for minimum and maximum constraints on the resulting segment size. In our study, we used two
sets of min/max constraints on the segment size, 150/3000 and 300/6000, respectively.
Splitting Criteria
Finally, as discussed in Appendix A, one may define a variety of splitting criteria which
belong to the node-value and partition-value families. We have used four different criteria in
our study, all of them seek to maximize a statistical measure Y as follows:
Criterion 1 (CHAID): Y is the statistic for the chi-square test of independence:
Y?
?
?Observed ? Expected ?2
splits
Expected
The larger the value of Y, the larger the difference between the response rates of the
resulting child nodes, and the “better” the split. This statistic is distributed as chi-square
with (k - 1) degrees of freedom, where k is the number splits for the node. Then, if the
resulting P_value is less than or equal to the level of significance, we conclude that Y is big
“enough” and that the resulting split is “good”. Since CHAID uses a sequence of tests
(each possible split constitutes a test), an adjusted P_value measure is often used to
determine the “best” split.
Criterion 2 (CHAID): Y is the total entropy associated with a given partition (into M splits). It is
a measure of the information content of the split, the larger the entropy, the better
Criterion 3 (STA): Y is the number of standard deviations that the response rate (RR) of the
smaller child node (the one with the fewer number of customers) is away from the overall
14
response rate of the training audience (TRR). Large values of Y (e.g., Y ? 2 ) mean that
the true (but unknown) response rate of the resulting segment is significantly different from
the TRR, indicating a “good” split.
Criterion 4 (STA, GA): Y is the larger response rate of the two children nodes (in a binary
split). This criterion seeks the split which maximizes the difference in the response rates of
the two descendant nodes.
All these criteria are further discussed in Appendix A.
4.
Results and Analysis
The combination of several tree classifiers, predictor sets and splitting criteria give rise to
numerous profiling algorithms. We provide only selective results in this paper. We evaluate all
trees based on goodness-of-fit criteria. As a reference point, we compare the automatic
segmentation and the manually-based segmentation to logistic regression.
To allow for a “fair” comparison between the models, each model was optimized to
yield the best results: the manually-based segmentation by using domain experts to determine
the FRAC segmentation criteria; the automatic decision trees by finding the best tree for the
given splitting criterion and constraints; and the logistic regression by finding the “best” set of
predictors that explain customer choice.
Goodness-of-Fit Results
Goodness-of-fit exhibits how well a model is capable of discriminating between buyers
and nonbuyers. In a binary yes/no model, it is measured by means of the actual response rate
(the ratio of the number of buyers “captured” to the size of the audience mailed), or the “lift” in
15
the actual response attained by the model over a randomly-selected mailing.
Goodness-of-fit results are typically presented by means of gains tables or gains charts.
In a segmentation model, the gains table exhibits the performance measures by segments, in
decreasing order of the segments’ actual response rates. To evaluate the goodness-of-fit
results, one needs to look on the segments of the holdout sample, where the segments are
arranged in descending order of the response rates of the segments in the training sample.
In a logistic regression model, the gains results are exhibited by decreasing probability
groups, most often by deciles.
Table 1 presents the gains table for RFM-based segmentation. Table 2 - for FRACbased segmentation.
Out of the many tree classifiers that we analyzed, we present two gains tables for STA,
with 1-variable split, predictor set 2 and splitting criterion 4 - one which corresponds to min/max
constraints on the resulting segment size of 150 and 3000, respectively (Table 3), and the other
for min/max constraints of 300 and 6000, respectively (Table 4).
Finally, Table 5 exhibits the logistic regression results by deciles.
The goodness-of-fit results may be assessed by means of several measures:
-
The behavior of the response rates of the holdout audience across segments, which in
a “good” model should exhibits a nicely declining pattern.
-
The difference between the response rates of the top and the bottom segments, the
larger the difference, the better the fit.
Observing Table 1 (RFM segmentation), other than the top two segments, the response
rates across segments in the list are pretty flat and somewhat fluctuating - both are indicative of
relatively poor fit.
16
By comparison, in the FRAC segmentation, the top segments perform significantly better
than the bottom segments, with the top segment yielding a response rate of 12.99% versus an
average response rate of only .69% for the entire holdout audience.
And the automatic segmentation methods are not lagging behind in terms of
discriminating between the buying and the nonbuying segments, with the top segment
outperforming the bottom segments by a wide margin.
Tree Performance
To evaluate and compare the automatic segmentation to the judgmentally-based
segmentation and the logistic regression model, we look on the percentage of buyers captured at
several representative mailing audiences. The reference point consists of the top 30% of the
customers in the list of segments, arranged in descending order of the response rate of the
segments in the training sample. Note that in a tree analysis, the response probability of a
customer is determined by the response rate of his/her peers, i.e. the response rate of the
segment that the customer belongs to. Since segments’ size are discrete, we use interpolation to
exhibit the performance results at exactly 30% of the audience. Of course, no interpolation is
required for logistic regression, since here the response probability is calculated individually for
each customer in the list. As additional reference points, we also present the performance
results for the top 10% and 50% of the audience.
Table 6 presents the performance results at these audience levels for several tree
classifiers, as well as the results of RFM, FRAC and logistic regression. Note that all tree
classifiers were ran using all four sets of predictors; RFM, FRAC and logistic regression - only
using set 4.
17
Comparing the results, we conclude:
-
The logistic regression model outperform all other models - the judgmentally based,
as well as the automatic models.
-
The RFM-based models are the worst.
-
The automatic tree classifiers perform extremely well, being comparable to the
FRAC-based model and getting pretty close to the logistic regression model.
-
Most of the information is captured by the affinity considerations (Set 1). The
additional variables of Set 2 and Set 3 do not seem to add much to improve
performance. This phenomenon may be very typical in the collectible industry, where
a customer either likes a given range of products (e.g., dolls) or not, but it may not
extends to other industries.
-
By comparison, Set 4, which contains all variables in the data set appears to perform
the worst of all sets. This could be a reflection of the overfitting phenomenon, the risk
of which is usually higher, the larger the number of variables.
-
Increasing the minimum size constraint usually increases the variance of the fit results
across all segments of a tree. This is a manifestation of the fact that larger segments
are less “homogenous” and thus exhibit larger variation. Indeed, smaller segments are
more stable, but increase the risk of overfitting. So one needs to trade off segment
size to find the most suitable one for the occasion.
Finally, it would be interesting to compare the various tree classifiers to one another to
find out which one performs the best. But this requires extensive experimentation, running the
automatic segmentation models on many more data sets and more applications, which was
18
beyond the scope of this paper.
6.
Conclusions
In this paper we evaluated the performance of automatic tree classifiers versus the
judgmentally-based RFM and FRAC methods and logistic regression. The methods were
evaluated based on goodness-of-fit measures, using real data from the collectible industry.
Three tree classifiers participated in our study - a modified version of AID that we
termed STA (Standard Tree Algorithm), the commonly used CHAID and a newly-developed
method based on genetic algorithms (GA). The AID, STA and CHAID are combinatorial
algorithms in the sense that they go over all possible combinations of variables to partition a
node. Consequently these algorithms are computationally intensive and therefore limited to splits
which are based on one variable, or at best two variables at a time. In contrast, GA is a noncombinatorial algorithm in the sense that the candidate solutions (splits) are generated by a
random, yet systematic, search method, involving mutations and crossovers. This opens up the
possibility to consider partitions which are based on more than 2 variables at a time, hopefully
yielding more “homogenous” segments.
The evaluation process, which involves several predictor sets, several splitting criteria
and several constraints on the minimum and maximum size of the terminal segments, shows that
the automatic tree classifiers outperform the RFM and FRAC methods, and come only short of
the logistic regression results. The practical implication of these results is that automatic trees
may be used as a substitute to judgmentally-based methods, even to logistic regression models,
for response modeling.
While experience shows that decision trees are outperformed by logistic regression,
19
decision trees have clear benefits from the point of view of the users. Trees are easy to
understand and interpret, if, of course, properly controlled to avoid unbounded growth. The
output can be presented by means of rules which are clearly related to the problem. Unlike
traditional statistical methods, no extensive background in statistics is required to build trees (as
the feature selection process is built in the tree algorithm). No close familiarity with the
application domain is required either. These benefits, and others, have rendered tree analysis
very popular as a data analysis model. Thus, the availability to generate trees automatically and
inexpensively, opens up new frontiers for using tree classifiers to rapidly analyze and understand
the relationship between variables in a data set, in database marketing as well as in other
applications.
20
Table 1: Gains Table for RFM-Based Segmentation
Segments with at Least 100 Customers in the Holdout Sample
Results by Descending Response Rates of Segments in the Calibration Sample.
SEG
144
244
322
434
423
223
143
333
433
134
344
334
131
133
123
233
122
412
323
132
213
234
312
212
211
411
112
444
311
432
413
111
121
313
All
CLB
RR
%
2.38
1.90
1.24
1.10
0.96
0.90
0.90
0.87
0.86
0.84
0.78
0.75
0.72
0.71
0.67
0.62
0.58
0.58
0.57
0.46
0.45
0.44
0.39
0.38
0.34
0.28
0.28
0.27
0.26
0.26
0.23
0.20
0.17
0.07
HLD
RR
%
2.62
2.38
0.58
0.86
0.00
1.02
0.94
1.43
1.02
0.62
1.23
0.56
0.00
0.88
1.02
1.82
0.38
0.67
0.33
0.37
0.00
0.00
1.02
0.24
0.27
0.37
0.40
0.42
0.27
0.85
0.00
0.46
0.00
0.21
CUM
CLB
RR
%
2.38
2.27
2.21
2.11
1.98
1.91
1.83
1.72
1.63
1.58
1.42
1.36
1.35
1.27
1.24
1.22
1.17
1.14
1.11
1.07
1.06
1.05
1.01
0.95
0.86
0.81
0.79
0.76
0.74
0.74
0.72
0.71
0.70
0.67
0.61
CUM
HLD
RR
%
2.50
2.47
2.34
2.21
1.99
1.93
1.84
1.79
1.71
1.63
1.55
1.47
1.44
1.36
1.35
1.36
1.27
1.25
1.19
1.15
1.14
1.12
1.11
1.03
0.92
0.87
0.85
0.83
0.81
0.81
0.79
0.78
0.76
0.74
0.69
CUM
CLB
BUY
%
29.22
36.07
37.44
39.27
41.55
42.92
44.75
47.49
50.23
52.05
58.45
61.19
62.10
66.67
68.49
69.41
72.60
74.43
76.71
78.54
79.00
79.45
81.28
84.47
89.50
92.69
93.61
95.43
96.80
97.26
98.17
99.09
99.54
100.00
100.00
CUM
CLB
AUD
%
7.47
9.66
10.33
11.34
12.75
13.67
14.91
16.83
18.76
20.09
25.08
27.29
28.06
31.97
33.63
34.52
37.88
39.81
42.23
44.65
45.27
45.90
48.76
53.90
62.98
69.81
71.81
75.93
79.11
80.18
82.63
85.40
87.02
90.94
100.00
CUM
HLD
BUY
%
27.88
34.55
35.15
36.36
36.36
37.58
39.39
43.64
46.67
47.88
56.97
58.79
58.79
64.24
66.67
69.09
70.91
72.73
73.94
75.15
75.15
75.15
79.39
81.21
84.85
88.48
89.70
92.12
93.33
94.55
94.55
96.36
96.36
97.58
100.00
CLB = Calibration sample
HLD = Holdout sample
CUM = Cumulative
SEG = RFM segment number:
1 st digit-Recency: 1-most recent ?
4-least recent
nd
2 digit-Frequency: 1-few purchases ? 4-many purchases
3 rd digit Monetary: 1-least spending ? 4 most spending
CUM
HLD
AUD
%
7.68
9.61
10.32
11.29
12.60
13.42
14.75
16.78
18.82
20.16
25.23
27.47
28.13
32.41
34.04
34.96
38.29
40.17
42.69
44.95
45.54
46.15
49.00
54.28
63.44
70.19
72.29
76.26
79.31
80.29
82.54
85.28
86.79
90.73
100.00
%CLB AUD/
%HLD AUD
0.98
1.14
0.94
1.04
1.05
1.12
0.93
0.94
0.95
0.99
0.98
0.99
1.16
0.91
1.01
0.97
1.01
1.03
0.96
1.07
1.03
1.05
1.00
0.97
0.99
1.01
0.95
1.04
1.04
1.09
1.09
1.01
1.08
0.99
0.98
21
Table 2: Gains Table for FRAC-Based Segmentation
Segments with at Least 100 Customers in the Holdout Sample
Results by Descending Response Rates of Segments in the Calibration Sample.
SEG
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
All
CLB
RR
%
12.36
6.42
5.07
3.73
3.51
2.33
2.33
1.78
1.52
1.08
0.86
0.84
0.80
0.46
0.45
0.44
0.42
0.36
0.36
0.35
0.26
0.19
0.19
0.16
0.13
0.12
0.11
0.09
0.09
0.07
0.07
HLD
RR
%
12.99
6.53
4.92
3.79
3.41
4.42
1.53
3.02
2.22
0.90
1.87
0.00
0.00
0.70
0.22
0.34
0.00
0.27
0.27
0.78
0.16
0.37
0.15
0.48
0.13
0.25
0.00
0.05
0.20
0.65
0.12
CUM
CLB
RR
%
12.36
8.92
7.59
6.75
6.05
5.33
5.01
4.51
4.24
3.84
3.54
3.28
3.06
2.79
2.38
2.27
2.16
2.05
1.87
1.71
1.43
1.30
1.23
1.19
1.06
0.99
0.95
0.84
0.78
0.74
0.71
0.61
CLB = Calibration sample
HLD = Holdout sample
CUM = Cumulative
SEG = Sequential segment number
CUM
HLD
RR
%
12.99
8.67
7.37
6.29
5.64
5.39
4.98
4.60
4.35
3.91
3.70
3.36
3.08
2.83
2.36
2.25
2.13
2.02
1.83
1.73
1.42
1.31
1.24
1.21
1.08
1.02
0.97
0.86
0.80
0.80
0.77
0.69
CUM
CLB
BUY
%
14.61
31.96
41.55
47.95
54.79
59.82
63.01
67.12
69.41
72.15
73.97
75.80
77.63
79.00
81.74
82.65
83.56
84.47
86.30
88.13
91.32
92.69
93.61
94.06
95.43
96.35
96.80
98.17
99.09
99.54
100.00
100.00
CUM
CLB
AUD
%
0.72
2.18
3.33
4.32
5.51
6.83
7.66
9.07
9.95
11.43
12.72
14.05
15.43
17.25
20.93
22.19
23.52
25.06
28.15
31.33
38.95
43.43
46.43
48.20
54.73
59.37
61.95
71.27
77.60
81.39
85.20
100.00
CUM
HLD
BUY
%
13.94
26.06
33.94
38.79
44.85
53.94
55.76
61.82
64.24
66.67
70.30
70.30
70.30
72.12
73.33
73.94
73.94
74.55
75.76
79.39
81.21
83.64
84.24
85.45
86.67
88.48
88.48
89.09
90.91
94.55
95.15
100.00
CUM
HLD
AUD
%
0.74
2.07
3.17
4.24
5.47
6.88
7.70
9.24
10.15
11.72
13.06
14.38
15.71
17.50
21.33
22.57
23.82
25.34
28.46
31.65
39.44
43.91
46.74
48.47
54.96
59.96
62.67
71.65
77.76
81.60
85.20
100.00
%CLB AUD/
% HLD AUD
0.98
1.10
1.05
0.93
0.97
0.93
1.02
0.90
0.97
0.92
0.96
1.01
1.04
1.02
0.96
1.02
1.06
1.02
0.99
1.00
0.98
1.00
1.06
1.02
1.01
0.93
0.95
1.04
1.04
0.99
1.06
1.00
22
Table 3: Gains Table for STA, 1-variable Split, Predictor Set 2
Splitting Criterion 4, and Min/Max Constraint 150/3000
Segments with at Least 25 customers in the Holdout Sample
Results by Descending Response Rates of Segments in the Calibration Sample.
SEG
9
1
8
10
2
15
13
4
24
16
18
22
12
14
34
3
20
23
30
7
5
11
32
17
19
25
60
38
40
41
48
31
26
59
47
42
39
21
37
49
50
54
64
63
33
69
6
35
53
57
65
43
All
CLB
RR
%
8.12
5.65
4.55
4.05
4.01
3.49
3.37
3.31
3.11
2.41
1.84
1.75
1.51
1.39
1.21
1.09
1.05
0.96
0.96
0.90
0.75
0.70
0.68
0.65
0.63
0.53
0.49
0.48
0.47
0.46
0.46
0.38
0.37
0.36
0.32
0.31
0.30
0.30
0.30
0.29
0.28
0.27
0.27
0.27
0.24
0.24
0.23
0.15
0.13
0.13
0.13
0.13
HLD
RR
%
10.08
1.85
6.50
4.20
3.67
3.46
3.85
4.67
0.93
0.00
1.81
2.87
1.14
2.97
0.00
0.39
0.00
1.00
0.96
0.90
0.61
0.00
0.47
0.00
0.40
0.00
0.78
0.00
0.00
0.00
0.00
0.00
1.04
0.00
0.65
0.00
0.46
1.32
0.85
0.00
0.00
0.28
0.68
0.21
0.19
0.36
0.33
0.11
0.19
0.21
0.29
0.00
CLB = Calibration sample
HLD = Holdout sample
CUM = Cumulative
SEG = STA segment number
CUM
CLB
RR
%
8.12
7.36
6.42
6.03
5.64
5.25
4.99
4.89
4.78
4.64
4.27
4.11
3.92
3.74
3.60
3.40
3.31
3.13
3.00
2.89
2.79
2.70
2.61
2.57
2.48
2.43
2.38
2.33
2.28
2.24
2.19
2.14
2.08
2.03
1.87
1.83
1.78
1.74
1.69
1.65
1.61
1.50
1.47
1.41
1.34
1.31
1.28
1.20
1.11
1.04
0.92
0.90
0.61
CUM
HLD
RR
%
10.08
7.51
7.19
6.72
6.13
5.67
5.43
5.38
5.11
4.85
4.46
4.32
4.10
4.01
3.80
3.52
3.40
3.20
3.07
2.95
2.86
2.74
2.63
2.57
2.46
2.41
2.37
2.31
2.25
2.20
2.14
2.08
2.05
1.99
1.87
1.81
1.77
1.76
1.73
1.68
1.63
1.53
1.50
1.44
1.37
1.34
1.32
1.22
1.14
1.08
0.97
0.94
0.69
CUM
CLB
BUY
%
19.18
25.11
32.88
36.99
42.92
48.86
53.88
56.16
58.45
60.27
63.93
65.75
67.58
69.41
70.78
72.60
73.52
75.34
76.71
78.08
79.00
79.91
80.82
81.28
82.19
82.65
83.11
83.56
84.02
84.47
84.93
85.39
85.84
86.30
87.67
88.13
88.58
89.04
89.50
89.95
90.41
91.78
92.24
93.15
94.06
94.52
94.89
95.89
96.80
97.72
99.54
100.00
100.00
CUM
CLB
AUD
%
1.44
2.08
3.12
3.73
4.63
5.67
6.57
6.99
7.44
7.90
9.11
9.74
10.48
11.28
11.97
12.99
13.52
14.67
15.54
16.47
17.21
18.01
18.83
19.26
20.14
20.66
21.23
21.81
22.39
22.99
23.60
24.34
25.10
25.87
28.48
29.37
30.28
31.21
32.15
33.11
34.10
37.18
38.22
40.30
42.62
43.79
44.99
48.72
52.88
57.15
65.67
67.84
100.00
CUM
HLD
BUY
%
21.82
23.64
33.33
36.97
41.82
46.67
51.52
54.55
55.15
55.15
58.18
61.82
63.03
66.67
66.67
67.27
67.27
69.09
70.30
71.52
72.12
72.12
72.73
72.73
73.33
73.33
73.94
73.94
73.94
73.94
73.94
73.94
75.15
75.15
77.58
77.58
78.18
80.00
81.21
81.21
81.21
82.42
83.64
84.24
84.85
85.45
86.06
86.67
87.88
89.09
92.73
92.73
100.00
CUM
HLD
AUD
%
1.49
2.16
3.19
3.79
4.69
5.66
6.52
6.97
7.42
7.83
8.98
9.85
10.58
11.43
12.06
13.13
13.61
14.86
15.73
16.66
17.34
18.12
19.01
19.45
20.49
20.96
21.50
21.98
22.58
23.13
23.72
24.47
25.27
25.95
28.53
29.46
30.37
31.32
32.30
33.30
34.18
37.11
38.34
40.37
42.60
43.74
45.00
48.76
53.09
56.99
65.48
67.64
100.00
%CLB AUD/
%HLD AUD
0.97
0.95
1.01
1.03
0.99
1.07
1.04
0.94
0.99
1.13
1.05
0.73
1.00
0.95
1.09
0.95
1.10
0.92
1.00
1.00
1.09
1.02
0.92
0.98
0.85
1.09
1.05
1.21
0.97
1.09
1.03
0.99
0.94
1.12
1.01
0.96
1.00
0.97
0.96
0.96
1.12
1.05
0.84
1.03
1.04
1.02
0.96
0.99
0.96
1.09
1.00
1.00
0.99
23
24
Table 4: Gains Table for STA, 1-variable Split, Predictor Set 2
Splitting Criterion 4, and Min/Max Constraint 300/6000
Segments with at Least 25 customers in the Holdout Sample
Results by Descending Response Rates of Segments in the Calibration Sample.
SEG
1
7
3
2
6
8
13
10
5
11
9
12
16
25
15
4
21
30
43
19
24
20
46
33
42
17
23
22
36
38
18
26
27
31
40
47
28
All
CLB
RR
%
6.62
5.08
4.01
2.78
2.60
2.04
1.79
1.37
1.09
0.91
0.90
0.63
0.62
0.56
0.52
0.43
0.38
0.32
0.32
0.32
0.31
0.30
0.29
0.28
0.26
0.24
0.22
0.21
0.20
0.19
0.19
0.18
0.15
0.13
0.13
0.13
0.12
HLD
RR
%
8.62
3.94
3.67
3.17
2.24
1.72
1.11
2.43
0.39
0.58
0.90
0.40
0.44
0.00
1.06
0.43
0.28
0.25
0.50
0.00
0.62
0.44
0.71
0.00
0.19
0.18
0.00
0.65
0.00
0.29
0.09
0.26
0.00
0.21
0.00
0.29
0.18
CLB = Calibration sample
HLD = Holdout sample
CUM = Cumulative
SEG = STA segment number
CUM
CLB
RR
%
6.62
6.15
5.72
4.78
4.35
3.99
3.77
3.46
3.28
2.95
2.83
2.72
2.62
2.51
2.36
2.20
2.08
1.87
1.78
1.69
1.58
1.55
1.52
1.45
1.39
1.32
1.29
1.26
1.23
1.20
1.11
1.08
1.05
0.99
0.96
0.85
0.81
0.61
CUM
HLD
RR
%
8.62
7.14
6.45
5.42
4.84
4.37
4.03
3.79
3.53
3.13
3.01
2.85
2.73
2.61
2.49
2.32
2.19
1.97
1.88
1.79
1.70
1.66
1.63
1.55
1.48
1.41
1.37
1.35
1.30
1.27
1.17
1.15
1.11
1.05
1.01
0.93
0.88
0.69
CUM
CLB
BUY
%
26.94
36.07
42.01
51.60
58.45
63.47
66.67
70.32
72.15
75.34
76.71
77.63
78.54
79.45
80.82
82.19
83.11
84.93
85.84
86.76
88.13
88.58
89.04
89.95
90.87
91.78
92.24
92.69
93.15
93.61
94.98
95.43
95.89
96.80
97.26
99.09
100.00
100.00
CUM
CLB
AUD
%
2.48
3.57
4.47
6.57
8.17
9.67
10.76
12.39
13.40
15.55
16.47
17.36
18.25
19.25
20.84
22.79
24.26
27.70
29.43
31.18
33.89
34.82
35.77
37.72
39.89
42.25
43.50
44.83
46.22
47.66
52.10
53.61
55.52
59.78
61.93
70.59
75.10
100.00
CUM
HLD
BUY
%
31.52
38.18
43.03
52.73
57.58
61.21
63.03
69.70
70.30
72.12
73.33
73.94
74.55
74.55
76.97
78.18
78.79
80.00
81.21
81.21
83.64
84.24
85.45
85.45
86.06
86.67
86.67
87.88
87.88
88.48
89.09
89.70
89.70
90.91
90.91
94.55
95.76
100.00
CUM
HLD
AUD
%
2.51
3.68
4.59
6.69
8.17
9.63
10.76
12.64
13.72
15.86
16.78
17.83
18.78
19.68
21.24
23.18
24.69
27.98
29.64
31.22
33.90
34.84
36.01
37.89
40.03
42.31
43.55
44.83
46.34
47.79
52.23
53.81
55.74
59.63
61.65
70.29
75.01
100.00
%CLB AUD/
%HLD AUD
0.99
0.94
0.99
1.00
1.08
1.03
0.97
0.86
0.95
1.00
1.00
0.85
0.93
1.11
1.02
1.01
0.98
1.05
1.04
1.11
1.01
0.99
0.80
1.04
1.01
1.04
1.01
1.04
0.93
0.99
1.00
0.95
0.99
1.09
1.07
1.00
0.96
1.00
25
Table 5: Gains Table for Logistic Regression by Decils - Holdout Sample
%
PROSPECTS
10.00
20.00
30.00
40.00
50.00
60.00
70.00
80.00
90.00
100.00
%
RESPONSE
62.42
77.58
83.03
86.67
89.70
92.12
96.36
98.18
100.00
100.00
ACTUAL
RESP RATE
%
4.29
2.67
1.90
1.49
1.23
1.06
0.95
0.84
0.76
0.69
% RESPONSE/
%PROSPECTS
6.24
3.88
2.77
2.17
1.79
1.53
1.38
1.23
1.11
1.00
PRED RESP
RATE
%
4.26
2.57
1.83
1.43
1.18
1.00
0.87
0.78
0.70
0.63
26
Table 6: Summary of Performance Results - Holdout Sample
LOGIT
FRAC
RFM
CHAI
D
CHAI
D
CHAI
D
CHAI
D
GA-3
GA-3
GA-4
GA-4
STA-1
STA-1
STA-1
STA-1
STA-2
STA-2
STA-2
STA-2
LOGIT=
FRAC
RFM
CHAID
STA-1
STA-2
GA-3
GA-4
CRITE
RION
MIN
SEG
SIZE
SET
1
10%
SET
2
10%
1
150
55.1
59.5
1
300
55.1
2
150
2
4
4
4
4
3
3
4
4
3
3
4
4
SET
3
10%
57.0
SET
4
10%
62.4
63.8
34.9
42.4
73.9
SET
4
30%
83.0
77.5
61.2
68.3
86.0
86.0
86.4
SET
4
50%
89.7
86.9
79.7
79.8
78.4
79.2
59.3
49.8
45.8
78.4
79.2
73.0
68.2
86.0
86.0
86.4
75.9
56.1
59.8
58.9
59.1
78.7
78.7
81.3
80.7
86.8
86.8
87.7
87.1
300
54.3
61.5
59.2
61.0
78.7
78.7
81.3
81.3
86.8
86.8
87.7
87.7
150
300
150
300
150
300
150
300
150
300
150
300
56.6
51.9
57.0
54.5
44.8
50.9
55.4
55.1
52.7
45.5
53.7
50.3
60.6
57.5
59.8
62.9
50.3
50.9
62.1
61.8
55.8
45.5
62.4
59.5
61.9
59.4
58.2
59.8
50.4
50.9
63.5
58.5
57.0
45.5
59.1
60.3
60.0
60.7
59.1
60.0
54.5
50.9
61.1
60.4
57.5
50.7
.
64.1
80.8
78.5
80.6
79.2
77.0
79.8
78.5
81.5
80.0
77.8
79.4
79.4
77.6
76.3
76.5
78.1
79.4
79.8
77.9
81.2
77.6
77.8
79.4
79.3
73.9
73.6
72.6
75.7
79.8
81.6
76.4
78.2
79.3
78.5
71.9
77.0
73.1
72.1
69.0
77.0
82.4
82.4
75.8
76.4
80.0
81.8
.
82.4
86.9
85.6
86.8
85.0
84.3
87.4
86.3
86.8
86.2
85.1
87.2
86.4
85.1
84.3
83.6
84.7
85.8
88.7
87.0
88.8
84.2
84.5
85.5
86.1
83.2
82.3
80.8
83.4
86.1
86.7
86.1
85.5
84.2
86.8
80.2
84.2
81.7
79.9
77.8
85.0
87.4
88.3
83.4
84.3
86.5
90.3
.
88.3
Logistic regression
=
FRAC-based segmentation
=
RFM-based segmentation
=
CHAID tree
=
STA tree with 1-variable split
=
STA tree with 2-variable split
=
GA tree with 3-variable split
=
GA tree with 4-variable split
SET
1
30%
SET
2
30%
SET
3
30%
SET
1
50%
SET
2
50%
SET
3
50%
27
Appendix A: Splitting Criteria
Splitting criteria are used to determine the best split, out of the many possible ways to
partition a node. By and large, we can classify the splitting criteria into two families, one which
is based on the value of the node, the other on the value of the partition.
Node-Value Based Criteria
These criteria are based on the value of the node. The objective is to maximize the
improvement in the node value which results by splitting a node into two or more splits.
The value of a node t is a function of the response rate (RR) of the node, RR(t), (i.e.,
the proportion of buyers). The “ideal” split is the one which partitions a father node into two
children nodes, one which contains only buyers, i.e., RR ?t ? ? 1 , and the other only nonbuyers,
RR ?t ? ? 0 . Clearly, in applications where the number of “responders” is more or less equal to
the number of “non-responders”, the “worst” split is the one which results in two children each
having about the same proportion of buyers and nonbuyers, i.e. RR ?t ? ~ 1/ 2 .
We define the value (or quality) of a node t as a function Q(RR) which satisfy the
following conditions:
?? Max Q ?RR ? ? Q ?0? ? Q ?1?
?? Min Q ?RR ? ? Q ?1 / 2 ?
?? Q ?RR ? is a concave function of RR (i.e., the second derivative is positive).
?? Q ?RR ? is symmetric, i.e. Q ?RR ? ? Q ?1 ? RR ?
The first two conditions stem from our definition of the “best” and the “worst” splits; the
28
concavity condition follows from the first two conditions; and the symmetry condition from the
fact that the reference point is ½.
Clearly, there are many functions that satisfy these requirements. While the analysis here
extends to any number of splits per node, we focus here on two-way splits.
Examples:
a.
The (piecewise) linear function, e.g.,
?1 ? RR
Q ?RR ? ? ?
? RR
b.
RR ? 1 / 2
RR ? 1 / 2
(A.1)
The quadratic function
Q ?RR ? ? a ? b ?RR ? c ?RR 2
where a, b and c are parameters.
One can show that the only quadratic function that satisfy all these conditions (up to a
constant) is:
Q ?RR ? ? ? RR ?1 ? RR ? ? RR 2 ? RR
For example, the variance used by AID to define the value of a node, in the binary yes/no case,
is a quadratic node value function. To show this, we evaluate the variance of the choice variable
Y. Denoting by:
Yi
-
the choice value of observation i
Y
-
the mean value over all observations
B
-
the number of buyers
N
-
the number of nonbuyers
Var ?Y ? ?
? ?Y
i
?Y ?
2
i
?B ? N ?? ?
i
Yi 2 ?B ? N ? ? Y 2
29
But since Yi is a binary variable (1-buy, 0-no buy),
?
i
Y ?
Yi 2 ? ? Yi ? B
i
?
Y i ?B ? N ? ? RR
i
and we obtain:
Var ?Y ? ? RR ? RR 2 ? .25 ? ?RR ? .5 ?
2
which is equivalent to the quadratic function (since the variance is not affected by shifting):
Q ?RR ? ? ?RR ? 1/ 2?
2
c.
(A.2)
The entropy function (Michalski, et al., 1993)
Q ?RR ? ? ? ?RR log ?RR ? ? ?1 ? RR ? log ?1 ? RR ??
(A.3)
The entropy is a measure of the information content, the larger the entropy, the better.
Hence the best split is the one with the largest entropy of all possible splits of a node.
Figure A.1 exhibits all the three functions graphically.
Now, the node value resulting by partitioning a node into two children nodes is obtained
as the average of the node value of the two descendant nodes, weighted by the proportion of
customers, i.e.:
N1
N
Q ?RR1 ? ? 2 Q ?RR2 ?
N
N
(A.4)
Q(RR)
30
1
Piecewise Linear
0.8
0.6
0.4
Quadratic
0.2
-0.2
Entropy
-0.4
Figure A.1 - Node Value Functions
0.99
0.92
0.85
0.78
0.71
0.64
0.57
0.5
0.43
0.36
0.29
0.22
0.15
0.08
RR
0.01
0
31
where:
N
-
the number of customers in the father node
N 1, N 2
-
the number of customers in the descendant left node (denoted
by the index 1) and the right node (denoted by the index 2),
respectively. In the following we always assume N1 ? N 2 (i.e.,
the left node is the smaller one).
RR1 , RR 2
-
Q ?RR1 ?, Q ?RR 2 ?
the response rates of the left and the right nodes, respectively.
the corresponding node value functions.
Thus, the improvement in the node value resulting by the split is given as the difference:
N1
N
Q ?RR1 ? ? 2 Q ?RR 2 ? ? Q ?RR ?
N
N
(A.5)
And we seek the split that yields the maximal improvement in the node value. But since, for a
given father node, Q ?RR ? is the same for all splits, (A.5) is equivalent to maximizing the node
value (A.4).
Clearly in DBM applications, where the number of buyers are largely outnumbered by
the number of nonbuyers, the reference point of ½ may not be appropriate. A more suitable
reference point to define the node value is TRR, where TRR is the overall response rate of the
training audience. Another alternative is the cutoff response rate, CRR, calculated based on
economic criteria. The resulting node value functions in this case satisfy all conditions above,
except that they are not symmetrical.
Now depending upon the value function Q ?RR ? , this yields several heuristic criteria for
determining the best split. For example, for the piecewise linear function and an hierarchical
tree, a possible criterion is choosing the split which maximizes the response rate of the smaller
32
child node, Max ?RR1 , 1 ? RR1 ? ; In a binary tree, the split which yields the most difference in
the response rates of the two descendant nodes, Max ?RR1 ,
RR 2 ? . In the quadratic case, a
reasonable function is ?RR ? TRR ?2 . Or one can use the entropy function (A.3).
Finally, we note that with a concave node value function Q ?RR ? basically any split will
result in a positive value improvement, however small. Thus, when using the node-value based
criteria for determining the best split, it is necessary to impose a threshold level on the minimum
segment size and/or the minimum improvement in the node value, or otherwise the algorithm will
keep partitioning the tree until each node contains exactly one customer.
Partition-Value Based Criteria
Instead of evaluating nodes, one can evaluate partitions. A partition is considered as a
“good” one if the resulting response rate of the children nodes are significantly different than one
another. This can be casted in terms of test of hypothesis. For example, in a two-way split
case:
H0 :
p1 ? p2
H1 :
p1 ? p 2
where p1 and p2 are the true (but unknown) response rates of the left child node and the right
child node, respectively.
A common way to test the hypothesis is by calculating the P_value, defined as the
probability to reject the null hypothesis, for the given sample statistics, if it is true. Then, if the
resulting P_value is less than or equal to a predetermined level of significance (often 5%), the
hypothesis is rejected; otherwise, the hypothesis is accepted.
33
(a)
The normal test
The hypothesis testing procedure draws on the probability laws underlying the process.
In the case of a two-way split, as above, the hypothesis can be tested using the normal
distribution. One can find the Z-value corresponding to the P_value, denoted Z 1? ? / 2 , using:
Z1? ? / 2 ?
Abs ?RR1 ? RR2 ?
RR1 ? ?1 ? RR1 ? ?B1 ? N 1 ? ? RR2 ? ?1 ? RR 2 ? ?B2 ? N 2 ?
(A.6)
where:
RR1 , RR 2
-
the response rates of the left child node (denoted by the
index 1) and the right child node (denoted by the index 2)
respectively
B1 , B2
-
the corresponding number of buyers
N 1, N 2
-
the corresponding number of nonbuyers
and then extract the P_value from a normal distribution table.
(b)
The chi-square test
In the case of a multiple split, the test of hypothesis is conducted by means of the chi
square test of independence.
The statistic for conducting this test, denoted by Y, is given by:
Y?
?
splits
?Observed - Expected ?2
Expected
(A.7)
Table A.1 exhibits the calculation of the components of Y for a 3-way split, extending the
notation above to the case of 3 child nodes. This table can easily be extended to more than
three splits per node.
34
Table A.1: Components of Y
1
Split
Observed
Expected
Buyers
Nonbuyers
Total
T1 ? B1 ? N 1
B1
T1 ?B T
B2
T2 ?B T
N1
T1 ?N T
N2
T2 ?N T
3. Observed
Expected
B3
T3 ?B T
N3
T3 ?N T
T3 ? B3 ? N 3
Total
B ? B1 ? B2 ? B3
N ? N1 ? N 2 ? N3
T ? T1 ? T 2 ? T3
2
Observed
Expected
T2 ? B2 ? N 2
Y is distributed according to the chi-square distribution with ?k ? 1?degrees of freedom, where
k is the number of splits for the node. One can then extracts the P_value for the resulting value
of Y from the chi square distribution. The best split is the one with the smallest P_value.
(c)
The smallest child test
Finally, this criterion is based on testing the hypothesis:
H0 :
p ? TRR
H1 :
p ? TRR
where p here stands for the true (but unknown) response rate of the smaller child node, and
TRR is the observed response rate for the training audience.
To test this hypothesis we define a statistic Y denoting the number of standard deviations
(“sigmas”) that the smaller segment is away from TRR, i.e.:
Y?
RR ? TRR
RR ?1 ? RR ? N
where RR is the observed response rate of the smaller child node, and N the number of
observations.
Large values of Y mean that p is significantly different than TRR, indicating a “good” split.
For example, one may reject the null hypothesis, concluding that the split is a good one, if Y is
larger than 2 “sigmas”.
35
Appendix B: Tree Classifiers
We discuss below the three tree classifiers that were involved in our study – STA,
CHAID and GA.
STA – Standard Tree Algorithm
STA is an AID–like algorithm. The basic AID algorithm is a univariate binary tree. In
each iteration of the process, each undetermined node is partitioned based on one variable at a
time into two descendant nodes. The objective is to partition the audience into two groups that
exhibit substantially less variation than the father node. AID uses the sum of squared deviations
of the response variable from the mean as the measure of the node value, which, in the binary
yes/no case, reduces to the minimum variance criterion, ?RR ? 0.5?2 , where RR is the response
rate (the ratio of the number of responders to the total number of customers) for the node (see
also Appendix A).
In each stage, the algorithm searches over all remaining predictors, net of all predictors
that had already been used in previous stages to split father nodes, to find the partition that yields
the maximal reduction in the variance.
In this work we have expanded the AID algorithm in two directions:
- Splitting a node based on two predictors at a time to allow one to also account
for the interaction terms to affect the tree structure.
- Using different reference points in the minimum variance criterion that are more
appropriate for splitting populations with marked differences
between responders and non-responders, such as DBM applications. Possible
36
candidates are the overall response rate of the training audience, or even the
cutoff response rate separating between targets and nontargets.
We therefore refer to our algorithm as STA (Standard Tree Algorithm) to distinguish it
from the conventional AID algorithm.
CHAID
CHAID (Chi-Square AID) is the most common of all tree classifiers. Unlike AID,
CHAID is not a binary tree as it may partition a node into more than two branches. CHAID
categorizes all independent continuous and multi-valued integer variables by “similarity”
measures, and considers the resulting categories for a variable as a whole unit (group) for
splitting purposes. Take for example the variable MONEY (money spent) that is categorized
into four ranges, each is represented by a dummy 0/1 variable which assumes the value of 1 if
the variable value falls in the corresponding range, 0-otherwise. Denote the resulting four
categorical variables as variables A, B, C and D, respectively. Since MONEY is an ordinal
variable (order is important) there are 3 possibilities to split this variable into two adjacent
categories: (A, BCD), (AB, CD), (ABC, D); 3 possibilities to split the variable into 3 adjacent
categories: (A, B, CD) (AB, C, D), (A, BC, D); and one way to split the variable into four
adjacent categories (A, B, C, D). Now, CHAID considers each of these partitions as a
possible split, and seeks the best combination to split the node from among all possible
combinations. As a result, a node in CHAID may be partitioned into more than two splits, as
many as four splits in this particular example. The best split is based on a chi-square test, which
is what gave this method its name.
Clearly, there are many ways to partition a variable with K values into M categories
(children nodes). To avoid choosing a combination that randomly yields a “good” split, some
versions of CHAID use an adjusted P_value criterion to compare candidate splits.
Let L denote the number of possible combinations for combining a variable with K values
into M categories.
Let α denote the Type-I error (also known as the level of significance) of the chi-square
test for independence; α is the probability of rejecting the null hypothesis that there is no
significant difference in the response rates of the resulting child nodes, when the null hypothesis is
true.
Now, the probability of accepting the null hypothesis for one combination is (1 - α), and for
L successive combinations (assuming the tests of hypotheses are independent) it is (1 - α)^L.
Hence the probability of making a Type-I error in at least one combination is 1 - (1 - α)^L,
which is greater than α.
To yield a “fair” comparison of the various combinations, α is replaced by the resulting
P_value. In most cases the P_value is very small, so we can use the approximation

    1 - (1 - P_value)^L ≈ L · P_value

The resulting quantity, L · P_value, is the adjusted P_value, and L is referred to as the
Bonferroni multiplier. Each combination yields a different adjusted P_value. The “best”
combination to partition the node by is the one that yields the smallest adjusted P_value.
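For illustration only, the adjusted P_value comparison might be sketched as follows; the response counts are invented, and the use of scipy's chi-square test of independence is our choice for the sketch rather than the paper's implementation.

    from scipy.stats import chi2_contingency

    def adjusted_p_value(child_counts, bonferroni_multiplier):
        """child_counts holds one [responders, non-responders] row per child node
        of a candidate split. Returns L * P_value, the Bonferroni-adjusted
        P_value of the chi-square test of independence."""
        chi2, p_value, dof, expected = chi2_contingency(child_counts)
        return bonferroni_multiplier * p_value

    # Two candidate ways to group the MONEY categories (invented counts):
    split_two_groups   = [[30, 970], [70, 930]]               # (A, BCD);   L = 3
    split_three_groups = [[25, 475], [35, 465], [40, 960]]    # (A, B, CD); L = 3
    best = min(adjusted_p_value(split_two_groups, 3),
               adjusted_p_value(split_three_groups, 3))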
The number of possibilities for combining a variable with K values into M categories
depends on the type of the variable involved. In our algorithm, we distinguish between three
cases:
- Ordinal variables where one may combine only adjacent categories (as in the
case of the variable MONEY above).
- Ordinal variables with a missing value that may be combined with any of the
other categories.
- Nominal variables where one may combine any two (or more) values, including
the missing value (e.g., the variable MARITAL with four nominal values: M - married,
S - single, W - widowed, D - divorced).
Table B.1 exhibits the number of combinations for several representative values of K
and M.
Table B.1: Number of Possible Combinations

  K    M    Ordinal    Ord+Miss    Nominal
  2    2        1          1           1
  3    2        2          3           3
  4    2        3          5           7
  4    3        3          5           6
  5    2        4          7          15
  5    3        6         12          25
  5    4        4          7          10
  6    2        5          9          31
  6    3       10         22          90
  6    4       10         22          65
  6    5        5          9          15
  7    2        6         11          63
  7    3       15         35         301
  7    4       20         50         350
  7    6        6         11          21
  8    2        7         13         127
  8    4       35         95        1701
  8    6       21         51         266
  8    7        7         13          28
  9    2        8         15         255
  9    4       56        161        7770
  9    6       56        161        2646
  9    8        8         15          36
 10    2        9         17         511
 10    4       84        252       34105
 10    6      126        406       22827
 10    8       36         92         750
 10    9        9         17          45
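The Ordinal and Nominal columns of Table B.1 correspond to standard combinatorial counts: the number of ways to group K adjacent values into M groups is the binomial coefficient C(K-1, M-1), and the number of ways to partition K nominal values into M non-empty groups is the Stirling number of the second kind. These closed forms are not stated in the text but reproduce the table entries; the Ord+Miss column, which depends on how the missing value is handled, is not reproduced here.

    from functools import lru_cache
    from math import comb

    def ordinal_count(k: int, m: int) -> int:
        """Ways to split k ordered values into m groups of adjacent values."""
        return comb(k - 1, m - 1)

    @lru_cache(maxsize=None)
    def nominal_count(k: int, m: int) -> int:
        """Stirling number of the second kind: ways to partition k nominal
        values into m non-empty groups."""
        if m < 1 or m > k:
            return 0
        if m == 1 or m == k:
            return 1
        return m * nominal_count(k - 1, m) + nominal_count(k - 1, m - 1)

    assert ordinal_count(10, 4) == 84       # matches Table B.1
    assert nominal_count(10, 4) == 34105    # matches Table B.1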
Genetic Algorithm (GA)
All tree algorithms described above are combinatorial, in the sense that in each stage
they go over all possible combinations to partition a node. This number gets excessively large
even for one-variable splits and becomes computationally prohibitive with multi-variable splits.
Consequently, all tree algorithms are univariate (AID, CHAID) or at best bi-variate (STA).
Yet, it is conceivable that splits based on several variables at a time (more than 2) may be more
“homogenous” and therefore better from the standpoint of profiling.
Thus, by confining oneself to univariate or even bi-variate tree algorithms, one may miss
out on better splits that could otherwise have been obtained with multivariate algorithms.
To resolve this issue, we developed a Genetic Algorithm (GA) tree for profiling which,
unlike the other trees, is a non-combinatorial algorithm in the sense that it employs a random, yet
systematic, search approach to grow a tree, rather than going over all possible combinations.
This significantly reduces the number of combinations to consider in partitioning a node, thus
allowing one to increase the number of variables used to split a node beyond two. In fact,
with this approach one can theoretically use any number of variables to split a node, but for
computational reasons we have confined the number of simultaneous variables to the range 3-7.
Genetic Algorithm (GA) is a general purpose search procedure, based on the biological
principle of “the survival of the fittest”, according to which the strongest and the fittest have a
higher likelihood of reproduction than the weak and the unfit. Thus, the succeeding
descendants, having inherited the better properties of their parents, tend to be even stronger and
healthier than their predecessors and therefore get improved over time with each additional
generation (Davis, 1991).
This idea has been applied to find heuristic solutions for large scale combinatorial
optimization problems. Starting with the “better” solutions in each generation (according to
some “fitness” measure), GA creates successive offspring solutions that are likely to result in a
better value for the objective function as one goes from one generation to the other, thus finally
converging to a local, if not a global optimum (Holland, 1975). These solutions are created by
means of a “reproduction” process which involves two basic operations: “mutations” and
“crossovers”.
- Mutation - randomly changing some of the genes of the parent solution.
- Crossover - crossing over the genes of two parent solutions; some of the genes are taken
from the “mother” solution, the rest from the “father” solution.
In the context of our profiling problem, GA is used as an algorithm to grow the tree and
generate candidate splits for a node. A solution in our case is a collection of splitting rules
specifying whether a customer belongs to the left segment or to the right segment. For example,
if X3 = 1, X4 = 0 and X7 = 1, the customer belongs to the left segment; otherwise he/she
belongs to the right segment.
One may use several ways to represent splits in GA. One possibility is by means of a
vector, the dimension of which is equal to the number of potential predictors, one entry for each
predictor. The value of each entry denotes how the corresponding predictor affects the split, e.g.:
  0 - Xi does not affect the current split
 -1 - Xi = 0 in the current split
  1 - Xi = 1 in the current split
In the above example (assuming there are only 10 potential predictors denoted as
X1, ..., X10), the corresponding vector is given by:
(0, 0, 1, -1, 0, 0, 1, 0, 0, 0)
Using the terminology of GA, each such solution is a chromosome, each variable is a gene, and
the value of each gene is an allele.
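A minimal sketch of this encoding in Python follows; the array layout, the 0-based indexing of X1, ..., X10, and the helper name goes_left are our illustration choices.

    import numpy as np

    # One entry per potential predictor:
    #   0 -> the predictor does not affect the current split
    #  -1 -> the rule requires Xi = 0
    #   1 -> the rule requires Xi = 1
    # Index i-1 holds the gene for Xi, so this chromosome encodes
    # X3 = 1, X4 = 0, X7 = 1 (the example above).
    chromosome = np.array([0, 0, 1, -1, 0, 0, 1, 0, 0, 0])

    def goes_left(customer: np.ndarray, chrom: np.ndarray) -> bool:
        """True if the customer satisfies every active rule (left segment),
        False otherwise (right segment)."""
        active = chrom != 0
        required = (chrom[active] == 1).astype(int)   # -1 -> 0, 1 -> 1
        return bool(np.all(customer[active] == required))

    customer = np.array([1, 0, 1, 0, 1, 0, 1, 1, 0, 0])   # values of X1..X10
    print(goes_left(customer, chromosome))                # True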
Now, the crux of the GA method is to define the descendant solutions from one
generation to the next. These are created in our algorithm using mutation and crossover
operations, as follows:
Mutation:
- Choose a predictor from the set of predictors in the solution by drawing a random
number from the uniform distribution over the range (1, V), where V is the number of
potential predictors. Say the predictor selected is X3.
- Determine the allele of X3 as follows:
  -1 with probability p0
   1 with probability p1
   0 otherwise
For example, suppose the allele selected is -1; then the descendant solution becomes
(0, 0, -1, -1, 0, 0, 1, 0, 0, 0)
and the new split is defined by X3 = 0, X4 = 0 and X7 = 1.
The values of p0 and p1 are parameters of the algorithm and are set in advance by
the user. The mutation operator is applied simultaneously to g genes at a time, where the g
genes are also determined at random.
Crossover:
- Pick two solutions at random (a “father” and a “mother”).
- Select g consecutive genes at random, say X3 and X4.
- Create two descendant solutions, a “daughter” and a “son”, by swapping the selected
genes: the “daughter” solution gets her mother’s genes, except for X3 and X4, which
are inherited from the father; the “son” solution gets his father’s genes, except for X3
and X4, which are inherited from the mother.
The process starts with a pool of solutions (population), often created in a random
manner. The various reproduction methods are applied to create the descendant solutions of the
next generation. The resulting solutions are then evaluated based on the partitioning criteria.
The best solutions are retained, and control is handed over to the next generation, and so on,
until all termination conditions are met.
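Putting the pieces together, the reproduction loop might look roughly as follows. This is a sketch under assumptions: the parameter defaults, the fitness callable (standing in for the node-partitioning criterion), and the rule for retaining the best solutions are ours, not the paper's exact algorithm.

    import random

    def mutate(chrom, g, p0, p1):
        """Mutation: redraw the alleles of g randomly chosen genes:
        -1 with probability p0, 1 with probability p1, 0 otherwise."""
        child = chrom[:]
        for i in random.sample(range(len(chrom)), g):
            u = random.random()
            child[i] = -1 if u < p0 else (1 if u < p0 + p1 else 0)
        return child

    def crossover(mother, father, g):
        """Crossover: swap g consecutive genes between two parent solutions,
        producing a 'daughter' and a 'son'."""
        start = random.randrange(len(mother) - g + 1)
        daughter, son = mother[:], father[:]
        daughter[start:start + g] = father[start:start + g]
        son[start:start + g] = mother[start:start + g]
        return daughter, son

    def evolve(population, fitness, generations, g=2, p0=0.2, p1=0.2, keep=20):
        """Generational loop: create offspring by mutation and crossover, score
        them with the partitioning criterion, and retain the best solutions."""
        for _ in range(generations):
            offspring = [mutate(c, g, p0, p1) for c in population]
            for mother, father in zip(population[::2], population[1::2]):
                offspring.extend(crossover(mother, father, g))
            population = sorted(population + offspring,
                                key=fitness, reverse=True)[:keep]
        return max(population, key=fitness)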
Finally, we note that in our GA tree we define the smaller child node (i.e., the one with
the smaller number of customers) as a terminal node. This is based on the plausible assumption
that with multivariate-based splits, the resulting smaller split is homogenous “enough” to make it
a terminal node. Hence, the resulting GA tree is hierarchical.
References
Ben-Akiva, M. and Lerman, S.R. (1987), Discrete Choice Analysis, Cambridge, MA, The
MIT Press.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984), Classification and Regression
Trees, Belmont, CA., Wadsworth.
Bult, J.R. and Wansbeek, T. (1995), Optimal Selection for Direct Mail, Marketing Science, 14,
pp. 378-394.
Davis, L., editor (1991), Handbook of Genetic Algorithms, New York, Van Nostrand
Reinhold.
Haughton, D. and Oulabi, S. (1997), Direct Marketing Modeling with CART and CHAID,
Journal of Direct Marketing, 11, pp. 42-52.
Holland, J.H. (1975), Adaptation in Natural and Artificial Systems, Ann Arbor, University of
Michigan Press.
Kass, G. (1983), An Exploratory Technique for Investigating Large Quantities of Categorical
Data, Applied Statistics, 29.
Kestnbaum, R.D., Kestnbaum & Company, Chicago, Private Communication.
Levin, N. and Zahavi, J. (1996), Segmentation Analysis with Managerial Judgment, Journal of
Direct Marketing, 10, pp. 28-47.
Long, J.S. (1997), Regression Models for Categorical and Limited Dependent Variables,
Thousand Oaks, CA, Sage Publications.
Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (1983), Machine Learning - An Artificial
Intelligence Approach, Palo Alto, CA., Tioga Publishing Company.
Morwitz, G.V. and Schmittlein, D. (1992), Using Segmentation to Improve Sales Forecasts
Based on Purchase Intent: Which “Intenders” Actually Buy?, Journal of Marketing
Research, 29, pp. 391-405.
Morwitz, G.V. and Schmittlein, D. (1998), Testing New Direct Marketing Offerings: The
Interplay of Management Judgment and Statistical Models, Management Science, 44,
pp. 610-628.
Murthy, K.S. (1998), Automatic Construction of Decision Trees from Data: A Multi-Disciplinary
Survey, Data Mining and Knowledge Discovery, 2, pp. 345-389.
Novak, P.T., de Leeuw, J. and MacEvoy, B. (1992), Richness Curves for Evaluating Market
Segmentation, Journal of Marketing Research, 29, pp. 254-267.
Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, 1, pp. 81-106.
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, San Mateo, CA, Morgan Kaufmann Publishers.
Shepard, D. editor (1995), The New Direct Marketing, New York, Irwin Professional
Publishing.
Sonquist, J., Baker, E. and Morgan, J.N. (1971), Searching for Structure, Ann Arbor,
University of Michigan, Survey Research Center.
Weinstein, A. (1994), Market Segmentation, New York, Irwin Professional
Publishing.