PREDICTIVE MODELING USING SEGMENTATION

Nissan Levin
Faculty of Management, Tel Aviv University

Jacob Zahavi
The Alberto Vitale Visiting Professor of Electronic Commerce, The Wharton School
On leave from Tel Aviv University

December 1999

Abstract

We discuss the use of segmentation as a predictive model for supporting targeting decisions in database marketing. We compare the performance of the judgmentally-based RFM and FRAC methods to automatic tree classifiers involving the well-known CHAID algorithm, a variation of the AID algorithm, and a newly-developed method based on genetic algorithms (GA). We use the logistic regression model as a benchmark for the comparative analysis. The results indicate that automatic segmentation methods may very well substitute for judgmentally-based segmentation methods in response analysis, and fall only slightly short of the logistic regression results. The implications of the results for decision making are also discussed.

1. Introduction

Segmentation is key to marketing. Since customers are not homogeneous and differ from one another in their preferences, wants and needs, the idea is to partition the market into groups, or segments, of "like" people, with similar needs and characteristics, who are likely to exhibit similar purchasing behavior. One may then offer each segment the products/services that appeal to the members of that segment. Ideally, the segmentation should reflect customers' attitude towards the product/service involved. Since this is often not known in advance, the best proxy is to use data that reflect customers' purchasing habits and behavior. Weinstein (1994, p. 4) discusses several dimensions of segmentation:
* Geography - classifying markets on the basis of geographical considerations.
* Socioeconomic - segmentation based on factors reflecting customers' socioeconomic status, such as income level and education.
* Psychographic - differentiating markets by appealing to customers' needs, personality traits and lifestyle.
* Product usage - partitioning the market based on the consumption level of various user groups.
* Benefits - splitting the market based upon the benefit obtained from a product/service, such as price, service and special features.
Hence, the segmentation process is information-based: the more information is available, the more refined and focused are the resulting segments.

Perhaps more than in any other marketing channel, segmentation is especially powerful in database marketing (DBM), where one can use the already available wealth of information on customers' purchase history and demographic and lifestyle characteristics to partition the customer list into segments. The segmentation process, often referred to as profiling, is used to distinguish between customers and noncustomers, where "customers" here are extended to include buyers, payers, loyal customers, etc., and to understand their composition and characteristics - who they are, what they look like, what their attributes are, where they reside, and so on. This analysis supports a whole array of decisions, ranging from targeting decisions to determining efficient and cost-effective marketing strategies, even evaluating market competition.

In this paper we discuss the use of segmentation as a predictive model for supporting targeting decisions in database marketing. We compare the performance of the judgmentally-based RFM and FRAC methods to several automatic tree-structured segmentation methods (decision trees).
To assess how good the performance of the segmentation-based models is, we compare them against the results of a logistic regression model, which is undoubtedly one of the most advanced response models in database marketing and one which is certainly hard to "beat". Logistic regression is widely discussed in the literature and will not be reviewed here; see, for example, Ben-Akiva (1987), Long (1997) and others.

Several studies have been conducted so far on the use of segmentation methods for supporting targeting decisions. Haughton and Oulabi (1997) compare the performance of response models built with CART and CHAID on a case study that contains some 10,000 cases, with about 30 explanatory variables and about the same proportion of responders and non-responders. Bult and Wansbeek (1995) devise a profit-maximization approach to selecting customers for promotion, comparing the performance of CHAID against several "parametric" models (e.g., logistic regression) using a sample of about 14,000 households with only 6 explanatory variables. Novak et al. (1992) devise a "richness" curve for evaluating segmentation results, defined as the running average of the proportion of individuals in a segment who are "consumers", where segments are added in decreasing rank order of their response. Morwitz and Schmittlein (1992) investigate whether the use of segmentation can improve the accuracy of sales forecasts based on stated purchase intent, involving CART, discriminant analysis and the K-means clustering algorithm. Other attempts have been made to improve targeting decisions with segmentation by using prior information. Of these we mention the parametric approach of Morwitz and Schmittlein (1998) and the non-parametric approach of Levin and Zahavi (1996).

This paper provides hard empirical evidence on the relative merits of various segmentation methods, focusing on several issues:
- How well are automatic tree classifiers capable of discriminating between buying and non-buying segments?
- How well does automatic segmentation perform as compared to manually-based RFM and FRAC segmentation?
- How do the various automatic tree classifiers compare against logistic regression results?
- What practical implications does one need to look into when using automatic segmentation?
On the theoretical front, we offer a unified framework for formulating decision trees to segment an audience based on a choice variable, and expand the existing tree classifiers in several directions.

The development and the evaluation of the decision trees were carried out with the authors' own computer programs. We note that since tree classifiers are heuristic methods, the resulting tree is only as good as the algorithm used to create it. Hence, all results in this paper reflect the performance of our computer algorithms, which may not extend to other algorithms. The various methods are demonstrated and evaluated using realistic data from the collectible industry. For confidentiality reasons, all results are presented in percentage terms. Also discussed are the implications of the results for decision making.

2. Segmentation Methods

2.1 Judgmentally-Based Methods

Judgmentally-based or "manual" segmentation methods are still the most commonly used in DBM to partition a customer list into "homogenous" segments. Typical segmentation criteria include previous purchase behavior, demographics, geographics and psychographics.
Previous purchase behavior is often considered to be the most powerful criterion in predicting the likelihood of future response. This criterion is operationalized for the segmentation process by means of Recency, Frequency, Monetary (RFM) variables (Shepard, 1995). Recency corresponds to the number of weeks (or months) since the most recent purchase, or the number of mailings since the last purchase; frequency to the number of previous purchases or the proportion of mailings to which the customer responded; and monetary to the total amount of money spent on all purchases (or purchases within a product category), or the average amount of money per purchase. The general convention in the DBM industry is that the more recently the customer placed the last order, the more items he/she bought from the company in the past, and the more money he/she spent on the company's products, the higher is his/her likelihood of purchasing the next offering and the better target he/she is. This simple rule allows one to arrange the segments in decreasing likelihood of purchase (a code sketch illustrating such an RFM coding appears below).

The more sophisticated manual methods also make use of product/attribute proximity considerations in segmenting a file. By and large, the more similar the products bought in the past are to the current product offering, or the more related the attributes (e.g., themes), the higher the likelihood of purchase. For example, in a book club application, customers are segmented based upon the proximity of the theme/content of the current book to those of previously-purchased books. Say the currently promoted book is "The Art History of Florence"; then book club members who previously bought Italian art books are the most likely candidates to buy the new book, and are therefore placed at the top of the segmentation list, followed by people who purchased general art books, then people who purchased geographical books, and so on. In cases where males and females may react differently to the product offering, gender may also be used to partition customers into groups. By and large, the list is first partitioned by product/attribute type, then by RFM and then by gender (i.e., the segmentation process is hierarchical). This segmentation scheme is also known as FRAC - Frequency, Recency, Amount (of money) and Category (of product) (Kestnbaum, 1998).

Manually-based RFM and FRAC methods are subject to judgmental and subjective considerations. Also, the basic assumption behind the RFM method may not always hold. For example, for durable products, such as cars or refrigerators, recency may work in the reverse way - the longer the time since the last purchase, the higher the likelihood of purchase. Finally, to meet segment size constraints, it may be necessary to run the RFM/FRAC process iteratively, each time combining small segments and splitting up large segments, until a satisfactory solution is obtained. This may increase computation time significantly.

2.2 Decision Trees

Several "automatic" methods have been devised in the literature to take away the judgmental and subjective considerations inherent in the manual segmentation process. By and large, these methods map data items (customers, in our case) into one of several predefined classes. In the simplest case, the purpose is to segment customers into one of two classes, based on some type of binary response, such as buy/no-buy, loyal/non-loyal, pay/no-pay, etc. Thus, tree classifiers are choice-based.
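As a concrete point of reference before turning to the automatic classifiers, the judgmental RFM coding of Section 2.1 can be sketched in a few lines. The example below assigns each customer a three-digit RFM cell (e.g., "144") from quartile codes and then ranks the cells by their observed response rate; the field names, the quartile coding and the synthetic data are illustrative assumptions, not the case-study file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
# Hypothetical customer file fields (illustrative only).
customers = pd.DataFrame({
    "weeks_since_last": rng.integers(1, 105, n),      # recency
    "num_purchases": rng.integers(1, 25, n),          # frequency
    "total_spent": rng.gamma(2.0, 60.0, n),           # monetary
    "responded": (rng.random(n) < 0.05).astype(int),  # 1 = bought the offer
})

def quartile_code(series):
    """Code a variable into quartiles 1..4; ranks are used to break ties."""
    ranks = series.rank(method="first")
    return pd.qcut(ranks, 4, labels=[1, 2, 3, 4]).astype(int)

# Segment digits follow the convention of Table 1:
# recency 1 = most recent, frequency 4 = many purchases, monetary 4 = most spending.
customers["R"] = quartile_code(customers["weeks_since_last"])   # small value -> code 1
customers["F"] = quartile_code(customers["num_purchases"])      # many purchases -> code 4
customers["M"] = quartile_code(customers["total_spent"])        # high spending -> code 4
customers["segment"] = (customers["R"].astype(str)
                        + customers["F"].astype(str)
                        + customers["M"].astype(str))

# Rank the RFM cells by observed response rate, as in the gains tables of Section 4.
gains = (customers.groupby("segment")["responded"]
         .agg(response_rate="mean", size="size")
         .sort_values("response_rate", ascending=False))
print(gains.head(10))
```

In a real application the response column would come from the appended orders of a test mailing, and the quartile cutoffs would typically be set judgmentally rather than from the data at hand.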
Without loss of generality, we refer to the choice variable throughout this paper as purchase/no-purchase, thus classifying the customers into segments of "buyers" and "nonbuyers".

Several automatic tree classifiers have been discussed in the literature, among them AID - Automatic Interaction Detection (Sonquist, Baker and Morgan, 1971); CHAID - Chi-square AID (Kass, 1983); CART - Classification and Regression Trees (Breiman, Friedman, Olshen and Stone, 1984); ID3 (Quinlan, 1986); C4.5 (Quinlan, 1993); and others. A comprehensive survey of the automatic construction of decision trees from data was compiled recently by Murthy (1998).

Basically, all automatic tree classifiers share the same structure. Starting from a "root" node (the whole population), tree classifiers employ a systematic approach to grow a tree into "branches" and "leaves". In each stage, the algorithm looks for the "best" way to split a "father" node into several "children" nodes, based on some splitting criteria. Then, using a set of predefined termination rules, some nodes are declared "undetermined" and become the father nodes in the next stages of the tree development process, while others are declared "terminal" nodes. The process proceeds in this way until no node is left in the tree that is worth splitting any further. The terminal nodes define the resulting segments. If each node in a tree is split into two children only, one of which is a terminal node, the tree is said to be "hierarchical".

Three main considerations are involved in developing automatic trees:
- Growing the tree
- Determining the best split
- Termination rules

Growing the Tree

One grows the tree by successively partitioning nodes based on the data. A node may be partitioned based on several variables at a time, or even on a function of variables (e.g., a linear combination of variables). With so many variables involved, there is a practically infinite number of ways to split a node. Take as an example just a single continuous variable. This variable alone can be partitioned in an infinite number of ways, let alone when several variables are involved. In addition, each node may be partitioned into several descendants (or splits), each of which becomes a "father" node to be partitioned in the next stage of the tree development process. Thus, the larger the number of splits per node, the larger the tree and the more prohibitive the calculations. Indeed, several methods have been applied in practice to reduce the number of possible partitions of a node:
- All continuous variables are categorized prior to the tree development process into a small number of ranges ("binning"). A similar procedure applies to integer variables which assume many values (such as the frequency of purchase).
- Nodes are partitioned on only one variable at a time ("univariate" algorithms).
- The number of splits per "father" node is often restricted to two ("binary" trees).
- Splits are based on a "greedy" algorithm in which splitting decisions are made sequentially, looking only at the impact of the split in the current stage, but never beyond (i.e., there is no "looking ahead").

Determining the Best Split

With so many possible partitions per node, the question is: what is the best split? There is no unique answer to this question, as one may use a variety of splitting criteria, each of which may result in a different "best" split. We can classify the splitting criteria into two "families": node-value based criteria and partition-value based criteria.
- Node-value based criteria: seeking the split that yields the best improvement in the node value.
- Partition-value based criteria: seeking the split that separates the node into groups which are as different from each other as possible.
We discuss the splitting criteria at more length in Appendix A.

Termination Rules

Theoretically, one can grow a tree indefinitely, until all terminal nodes contain very few customers, as few as one customer per segment. The resulting tree in this case is unbounded and unintelligible, having the effect of "not seeing the forest for the trees". It misses the whole point of tree classifiers, whose purpose is to divide the population into buckets of "like" people, where each bucket contains a meaningful number of people for statistical significance. Also, the larger the tree, the larger the risk of overfitting. Hence it is necessary to control the size of a tree by means of termination rules that determine when to stop growing the tree. These termination rules should be set to ensure statistical validity of the results and avoid overfitting.

We discuss three tree classifiers in this paper: a variation of AID which we refer to as the Standard Tree Algorithm (STA), CHAID, and a new tree classifier based on a Genetic Algorithm (GA). These algorithms are further described in Appendix B.

3. A Case Study

We use a real case study and actual results to demonstrate and evaluate the performance of decision trees vis-a-vis manually-based trees. The case study involves a solo mailing of a collectible item that was live-tested in the market and then rolled out. The data for the analysis consists of the test audience with appended orders, containing 59,970 customers, which we randomly split into two mutually exclusive samples - a training (calibration) sample, consisting of 60% of the observations, to build the model (tree) with, and a holdout sample, containing the rest of the customers, to validate the model with.

As alluded to earlier, only binary predictors may take part in the partitioning process. Hence, all continuous and multi-valued integer variables were categorized, prior to invoking the tree algorithm, into ranges, each represented by a binary variable assuming the value of 1 if the variable falls in the interval, 0 otherwise. This process is also referred to as "binning". Depending upon the tree, the resulting categories may be either overlapping (i.e., X ≤ a_i, i = 1, 2, ..., where X is a predictor and a_i a given breakpoint) or non-overlapping (i.e., a_{i-1} < X ≤ a_i, i = 1, 2, ...).

The trees were evaluated using goodness-of-fit criteria which express how well the profiling analysis is capable of discriminating between the buyers and the nonbuyers. A common measure is the percentage of buyers "captured" per the percentage of audience mailed; the higher the percentage of buyers, the "better" the model. For example, a segmentation scheme that captures 80% of the buyers for 50% of the audience is better than a segmentation scheme that captures only 70% of the buyers for the same audience.

Below we discuss several considerations in setting up the study:

Feature Selection

In DBM applications, the number of potential predictors can be very large, often on the order of several hundred predictors, or even more. Thus, the crux of the model-building process in DBM is to pick the subset of predictors, usually only a handful, that explain the customers' choice decisions.
This problem, referred to as the feature selection (or specification) problem, is a tough combinatorial problem and is definitely the most complex issue in building large-scale multivariate models. Decision trees possess an advantage over statistical regression models in that they have an inherent, built-in mechanism to pick the predictors affecting the terminal segments. They do so by going over all possible combinations to grow a tree (subject to the computational constraints discussed in the previous section) and selecting the best split for each node using one of many splitting criteria (discussed in Appendix A). Logistic regression, on the other hand, does not enjoy such a benefit, and one needs to incorporate a specification procedure into the process (e.g., a stepwise regression approach). Hence, using logistic regression models in DBM applications is not easy, is definitely not as straightforward as building decision trees, and may require extensive expertise in statistics. In our case, we use a rule-based expert system to weed out the "bad" predictors from the "good" ones, using rules that reflect statistical theory and practice. Examples are rules that set the level of significance required to include a variable in the model, or rules that set a threshold on the allowed degree of multicollinearity between predictors, and the like. These rules were calibrated using an extensive experimentation process.

Number of Predictors for a Split

The number of predictors to split a node by is constrained by the tree classifier:
- STA - our AID-like algorithm was expanded to allow for splitting a node based on two predictors at a time. This enabled us to also account for the interaction (or secondary) effect on the decision tree.
- CHAID - in contrast to STA and GA, CHAID considers all predictors resulting from the categorical representation of a variable as one group in the partitioning process. For example, suppose MARITAL denotes the marital status of a customer, with four values (single, married, widowed, divorced); then CHAID seeks the best way to split a node from among all possible combinations for grouping these four predictors (see Appendix B for more details).
- Finally, the main benefit of GA is that it can use any number of predictors to split a node by. However, due to computational constraints, we have limited the number of variables in our study to only three or four predictors at a time.

The Predictors Set

The mix of predictors is a major factor affecting the resulting tree structure: the larger the number of potential predictors available to split a tree, the larger the number of segments and the smaller the size of each terminal segment. To determine the impact of the mix and the number of predictors on the performance of the tree classifiers, we used four sets of predictors in our study:
Set 1 - affinity: the product attributes corresponding to the current product, grouped into major categories based on similarity measures.
Set 2 - affinity, recency: product attributes plus recency (number of months since last purchase) categorized into predefined ranges (0-6 months, 7-12 months, etc.).
Set 3 - affinity, recency, frequency: product attributes, plus recency variables, plus frequency measures broken down by product lines.
Set 4 - all predictors which exist in the customer file.

Min/Max Segment Size

Size constraints are most crucial in segmentation analysis.
To minimize the losses incurred when wrong decisions are made (e.g., because of sampling errors), segments are required to be "not too small" and "not too big". If a segment is too small, the probability of making an error increases due to the lack of statistical significance. If the segment is too big, then if a "good" segment is somehow eliminated from the mailing (Type I error), large foregone profits are incurred, and if a "bad" segment makes it into the mailing (Type II error), large out-of-pocket costs are incurred. Consequently, we have built a mechanism into all our automatic tree algorithms to account for minimum and maximum constraints on the resulting segment size. In our study, we used two sets of min/max constraints on the segment size, 150/3000 and 300/6000, respectively.

Splitting Criteria

Finally, as discussed in Appendix A, one may define a variety of splitting criteria belonging to the node-value and partition-value families. We used four different criteria in our study, all of which seek to maximize a statistical measure Y, as follows:

Criterion 1 (CHAID): Y is the statistic for the chi-square test of independence:

   Y = Σ over splits of (Observed - Expected)^2 / Expected

The larger the value of Y, the larger the difference between the response rates of the resulting child nodes, and the "better" the split. This statistic is distributed as chi-square with (k - 1) degrees of freedom, where k is the number of splits for the node. Then, if the resulting P_value is less than or equal to the level of significance, we conclude that Y is big "enough" and that the resulting split is "good". Since CHAID uses a sequence of tests (each possible split constitutes a test), an adjusted P_value measure is often used to determine the "best" split.

Criterion 2 (CHAID): Y is the entropy-based measure associated with a given partition (into M splits). It reflects the information content of the split; the larger the information gained (i.e., the reduction in entropy) from the split, the better.

Criterion 3 (STA): Y is the number of standard deviations that the response rate (RR) of the smaller child node (the one with the fewer customers) is away from the overall response rate of the training audience (TRR). Large values of Y (e.g., Y ≥ 2) mean that the true (but unknown) response rate of the resulting segment is significantly different from the TRR, indicating a "good" split.

Criterion 4 (STA, GA): Y is the larger response rate of the two children nodes (in a binary split). This criterion seeks the split which maximizes the contrast between the response rates of the two descendant nodes.

All these criteria are further discussed in Appendix A.

4. Results and Analysis

The combination of several tree classifiers, predictor sets and splitting criteria gives rise to numerous profiling algorithms. We provide only selected results in this paper. We evaluate all trees based on goodness-of-fit criteria. As a reference point, we compare the automatic segmentation and the manually-based segmentation to logistic regression. To allow for a "fair" comparison between the models, each model was optimized to yield its best results: the manually-based segmentation by using domain experts to determine the FRAC segmentation criteria; the automatic decision trees by finding the best tree for the given splitting criterion and constraints; and the logistic regression by finding the "best" set of predictors that explain customer choice.

Goodness-of-Fit Results

Goodness-of-fit exhibits how well a model is capable of discriminating between buyers and nonbuyers.
In a binary yes/no model, it is measured by means of the actual response rate (the ratio of the number of buyers "captured" to the size of the audience mailed), or the "lift" in the actual response attained by the model over a randomly-selected mailing. Goodness-of-fit results are typically presented by means of gains tables or gains charts. In a segmentation model, the gains table exhibits the performance measures by segment, in decreasing order of the segments' actual response rates. To evaluate the goodness-of-fit results, one needs to look at the segments of the holdout sample, where the segments are arranged in descending order of the response rates of the segments in the training sample. In a logistic regression model, the gains results are exhibited by decreasing probability groups, most often by deciles.

Table 1 presents the gains table for the RFM-based segmentation, and Table 2 the gains table for the FRAC-based segmentation. Out of the many tree classifiers that we analyzed, we present two gains tables for STA with a 1-variable split, predictor set 2 and splitting criterion 4 - one corresponding to min/max constraints on the resulting segment size of 150 and 3000, respectively (Table 3), and the other to min/max constraints of 300 and 6000, respectively (Table 4). Finally, Table 5 exhibits the logistic regression results by deciles.

The goodness-of-fit results may be assessed by means of several measures:
- The behavior of the response rates of the holdout audience across segments, which in a "good" model should exhibit a nicely declining pattern.
- The difference between the response rates of the top and the bottom segments; the larger the difference, the better the fit.

Observing Table 1 (RFM segmentation), other than the top two segments, the response rates across segments in the list are pretty flat and somewhat fluctuating - both indicative of a relatively poor fit. By comparison, in the FRAC segmentation, the top segments perform significantly better than the bottom segments, with the top segment yielding a response rate of 12.99% versus an average response rate of only 0.69% for the entire holdout audience. And the automatic segmentation methods are not lagging behind in terms of discriminating between the buying and the nonbuying segments, with the top segment outperforming the bottom segments by a wide margin.

Tree Performance

To evaluate and compare the automatic segmentation to the judgmentally-based segmentation and the logistic regression model, we look at the percentage of buyers captured at several representative mailing audiences. The reference point consists of the top 30% of the customers in the list of segments, arranged in descending order of the response rate of the segments in the training sample. Note that in a tree analysis, the response probability of a customer is determined by the response rate of his/her peers, i.e., the response rate of the segment that the customer belongs to. Since segment sizes are discrete, we use interpolation to exhibit the performance results at exactly 30% of the audience (see the sketch below). Of course, no interpolation is required for logistic regression, since here the response probability is calculated individually for each customer in the list. As additional reference points, we also present the performance results for the top 10% and 50% of the audience. Table 6 presents the performance results at these audience levels for several tree classifiers, as well as the results of RFM, FRAC and logistic regression.
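The interpolation just described can be sketched as follows: segments are ordered by their training response rate, the holdout buyers and audience are accumulated, and the percentage of buyers captured at an exact audience depth (10%, 30%, 50%) is obtained by linear interpolation between whole segments. This is a minimal sketch with made-up segment counts, not the case-study data.

```python
import numpy as np
import pandas as pd

# Segment-level summary; the counts below are made up for illustration.
segments = pd.DataFrame({
    "train_rr":       [0.081, 0.046, 0.020, 0.009, 0.005, 0.002],
    "holdout_size":   [900, 1400, 2600, 5200, 6100, 7800],
    "holdout_buyers": [88, 51, 40, 29, 17, 8],
}).sort_values("train_rr", ascending=False)

cum_aud = (segments["holdout_size"].cumsum() / segments["holdout_size"].sum()).to_numpy()
cum_buy = (segments["holdout_buyers"].cumsum() / segments["holdout_buyers"].sum()).to_numpy()

def pct_buyers_captured(depth):
    """Percentage of holdout buyers captured at an exact audience depth,
    interpolating linearly between whole segments."""
    return np.interp(depth, np.concatenate(([0.0], cum_aud)),
                            np.concatenate(([0.0], cum_buy)))

for depth in (0.10, 0.30, 0.50):
    print(f"top {depth:.0%} of the audience captures "
          f"{pct_buyers_captured(depth):.1%} of the buyers")
```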
Note that all tree classifiers were run using all four sets of predictors; RFM, FRAC and logistic regression were run only using Set 4.

Comparing the results, we conclude:
- The logistic regression model outperforms all other models - the judgmentally-based as well as the automatic models.
- The RFM-based models are the worst.
- The automatic tree classifiers perform extremely well, being comparable to the FRAC-based model and getting pretty close to the logistic regression model.
- Most of the information is captured by the affinity considerations (Set 1). The additional variables of Set 2 and Set 3 do not seem to add much to improve performance. This phenomenon may be very typical of the collectible industry, where a customer either likes a given range of products (e.g., dolls) or not, but it may not extend to other industries.
- By comparison, Set 4, which contains all variables in the data set, appears to perform the worst of all sets. This could be a reflection of the overfitting phenomenon, the risk of which is usually higher the larger the number of variables.
- Increasing the minimum size constraint usually increases the variance of the fit results across the segments of a tree. This is a manifestation of the fact that larger segments are less "homogenous" and thus exhibit larger variation. Indeed, smaller segments are more stable, but increase the risk of overfitting. So one needs to trade off segment size to find the most suitable one for the occasion.

Finally, it would be interesting to compare the various tree classifiers to one another to find out which one performs best. But this requires extensive experimentation, running the automatic segmentation models on many more data sets and more applications, which was beyond the scope of this paper.

5. Conclusions

In this paper we evaluated the performance of automatic tree classifiers versus the judgmentally-based RFM and FRAC methods and logistic regression. The methods were evaluated based on goodness-of-fit measures, using real data from the collectible industry. Three tree classifiers participated in our study - a modified version of AID that we termed STA (Standard Tree Algorithm), the commonly used CHAID, and a newly-developed method based on genetic algorithms (GA). AID, STA and CHAID are combinatorial algorithms in the sense that they go over all possible combinations of variables to partition a node. Consequently, these algorithms are computationally intensive and therefore limited to splits based on one variable, or at best two variables, at a time. In contrast, GA is a noncombinatorial algorithm in the sense that the candidate solutions (splits) are generated by a random, yet systematic, search method involving mutations and crossovers. This opens up the possibility of considering partitions based on more than two variables at a time, hopefully yielding more "homogenous" segments.

The evaluation process, which involves several predictor sets, several splitting criteria and several constraints on the minimum and maximum size of the terminal segments, shows that the automatic tree classifiers outperform the RFM and FRAC methods, and fall only slightly short of the logistic regression results. The practical implication of these results is that automatic trees may be used as a substitute for judgmentally-based methods, and even for logistic regression models, in response modeling.
While experience shows that decision trees are outperformed by logistic regression, 19 decision trees have clear benefits from the point of view of the users. Trees are easy to understand and interpret, if, of course, properly controlled to avoid unbounded growth. The output can be presented by means of rules which are clearly related to the problem. Unlike traditional statistical methods, no extensive background in statistics is required to build trees (as the feature selection process is built in the tree algorithm). No close familiarity with the application domain is required either. These benefits, and others, have rendered tree analysis very popular as a data analysis model. Thus, the availability to generate trees automatically and inexpensively, opens up new frontiers for using tree classifiers to rapidly analyze and understand the relationship between variables in a data set, in database marketing as well as in other applications. 20 Table 1: Gains Table for RFM-Based Segmentation Segments with at Least 100 Customers in the Holdout Sample Results by Descending Response Rates of Segments in the Calibration Sample. SEG 144 244 322 434 423 223 143 333 433 134 344 334 131 133 123 233 122 412 323 132 213 234 312 212 211 411 112 444 311 432 413 111 121 313 All CLB RR % 2.38 1.90 1.24 1.10 0.96 0.90 0.90 0.87 0.86 0.84 0.78 0.75 0.72 0.71 0.67 0.62 0.58 0.58 0.57 0.46 0.45 0.44 0.39 0.38 0.34 0.28 0.28 0.27 0.26 0.26 0.23 0.20 0.17 0.07 HLD RR % 2.62 2.38 0.58 0.86 0.00 1.02 0.94 1.43 1.02 0.62 1.23 0.56 0.00 0.88 1.02 1.82 0.38 0.67 0.33 0.37 0.00 0.00 1.02 0.24 0.27 0.37 0.40 0.42 0.27 0.85 0.00 0.46 0.00 0.21 CUM CLB RR % 2.38 2.27 2.21 2.11 1.98 1.91 1.83 1.72 1.63 1.58 1.42 1.36 1.35 1.27 1.24 1.22 1.17 1.14 1.11 1.07 1.06 1.05 1.01 0.95 0.86 0.81 0.79 0.76 0.74 0.74 0.72 0.71 0.70 0.67 0.61 CUM HLD RR % 2.50 2.47 2.34 2.21 1.99 1.93 1.84 1.79 1.71 1.63 1.55 1.47 1.44 1.36 1.35 1.36 1.27 1.25 1.19 1.15 1.14 1.12 1.11 1.03 0.92 0.87 0.85 0.83 0.81 0.81 0.79 0.78 0.76 0.74 0.69 CUM CLB BUY % 29.22 36.07 37.44 39.27 41.55 42.92 44.75 47.49 50.23 52.05 58.45 61.19 62.10 66.67 68.49 69.41 72.60 74.43 76.71 78.54 79.00 79.45 81.28 84.47 89.50 92.69 93.61 95.43 96.80 97.26 98.17 99.09 99.54 100.00 100.00 CUM CLB AUD % 7.47 9.66 10.33 11.34 12.75 13.67 14.91 16.83 18.76 20.09 25.08 27.29 28.06 31.97 33.63 34.52 37.88 39.81 42.23 44.65 45.27 45.90 48.76 53.90 62.98 69.81 71.81 75.93 79.11 80.18 82.63 85.40 87.02 90.94 100.00 CUM HLD BUY % 27.88 34.55 35.15 36.36 36.36 37.58 39.39 43.64 46.67 47.88 56.97 58.79 58.79 64.24 66.67 69.09 70.91 72.73 73.94 75.15 75.15 75.15 79.39 81.21 84.85 88.48 89.70 92.12 93.33 94.55 94.55 96.36 96.36 97.58 100.00 CLB = Calibration sample HLD = Holdout sample CUM = Cumulative SEG = RFM segment number: 1 st digit-Recency: 1-most recent ? 4-least recent nd 2 digit-Frequency: 1-few purchases ? 4-many purchases 3 rd digit Monetary: 1-least spending ? 4 most spending CUM HLD AUD % 7.68 9.61 10.32 11.29 12.60 13.42 14.75 16.78 18.82 20.16 25.23 27.47 28.13 32.41 34.04 34.96 38.29 40.17 42.69 44.95 45.54 46.15 49.00 54.28 63.44 70.19 72.29 76.26 79.31 80.29 82.54 85.28 86.79 90.73 100.00 %CLB AUD/ %HLD AUD 0.98 1.14 0.94 1.04 1.05 1.12 0.93 0.94 0.95 0.99 0.98 0.99 1.16 0.91 1.01 0.97 1.01 1.03 0.96 1.07 1.03 1.05 1.00 0.97 0.99 1.01 0.95 1.04 1.04 1.09 1.09 1.01 1.08 0.99 0.98 21 Table 2: Gains Table for FRAC-Based Segmentation Segments with at Least 100 Customers in the Holdout Sample Results by Descending Response Rates of Segments in the Calibration Sample. 
SEG 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 All CLB RR % 12.36 6.42 5.07 3.73 3.51 2.33 2.33 1.78 1.52 1.08 0.86 0.84 0.80 0.46 0.45 0.44 0.42 0.36 0.36 0.35 0.26 0.19 0.19 0.16 0.13 0.12 0.11 0.09 0.09 0.07 0.07 HLD RR % 12.99 6.53 4.92 3.79 3.41 4.42 1.53 3.02 2.22 0.90 1.87 0.00 0.00 0.70 0.22 0.34 0.00 0.27 0.27 0.78 0.16 0.37 0.15 0.48 0.13 0.25 0.00 0.05 0.20 0.65 0.12 CUM CLB RR % 12.36 8.92 7.59 6.75 6.05 5.33 5.01 4.51 4.24 3.84 3.54 3.28 3.06 2.79 2.38 2.27 2.16 2.05 1.87 1.71 1.43 1.30 1.23 1.19 1.06 0.99 0.95 0.84 0.78 0.74 0.71 0.61 CLB = Calibration sample HLD = Holdout sample CUM = Cumulative SEG = Sequential segment number CUM HLD RR % 12.99 8.67 7.37 6.29 5.64 5.39 4.98 4.60 4.35 3.91 3.70 3.36 3.08 2.83 2.36 2.25 2.13 2.02 1.83 1.73 1.42 1.31 1.24 1.21 1.08 1.02 0.97 0.86 0.80 0.80 0.77 0.69 CUM CLB BUY % 14.61 31.96 41.55 47.95 54.79 59.82 63.01 67.12 69.41 72.15 73.97 75.80 77.63 79.00 81.74 82.65 83.56 84.47 86.30 88.13 91.32 92.69 93.61 94.06 95.43 96.35 96.80 98.17 99.09 99.54 100.00 100.00 CUM CLB AUD % 0.72 2.18 3.33 4.32 5.51 6.83 7.66 9.07 9.95 11.43 12.72 14.05 15.43 17.25 20.93 22.19 23.52 25.06 28.15 31.33 38.95 43.43 46.43 48.20 54.73 59.37 61.95 71.27 77.60 81.39 85.20 100.00 CUM HLD BUY % 13.94 26.06 33.94 38.79 44.85 53.94 55.76 61.82 64.24 66.67 70.30 70.30 70.30 72.12 73.33 73.94 73.94 74.55 75.76 79.39 81.21 83.64 84.24 85.45 86.67 88.48 88.48 89.09 90.91 94.55 95.15 100.00 CUM HLD AUD % 0.74 2.07 3.17 4.24 5.47 6.88 7.70 9.24 10.15 11.72 13.06 14.38 15.71 17.50 21.33 22.57 23.82 25.34 28.46 31.65 39.44 43.91 46.74 48.47 54.96 59.96 62.67 71.65 77.76 81.60 85.20 100.00 %CLB AUD/ % HLD AUD 0.98 1.10 1.05 0.93 0.97 0.93 1.02 0.90 0.97 0.92 0.96 1.01 1.04 1.02 0.96 1.02 1.06 1.02 0.99 1.00 0.98 1.00 1.06 1.02 1.01 0.93 0.95 1.04 1.04 0.99 1.06 1.00 22 Table 3: Gains Table for STA, 1-variable Split, Predictor Set 2 Splitting Criterion 4, and Min/Max Constraint 150/3000 Segments with at Least 25 customers in the Holdout Sample Results by Descending Response Rates of Segments in the Calibration Sample. 
SEG 9 1 8 10 2 15 13 4 24 16 18 22 12 14 34 3 20 23 30 7 5 11 32 17 19 25 60 38 40 41 48 31 26 59 47 42 39 21 37 49 50 54 64 63 33 69 6 35 53 57 65 43 All CLB RR % 8.12 5.65 4.55 4.05 4.01 3.49 3.37 3.31 3.11 2.41 1.84 1.75 1.51 1.39 1.21 1.09 1.05 0.96 0.96 0.90 0.75 0.70 0.68 0.65 0.63 0.53 0.49 0.48 0.47 0.46 0.46 0.38 0.37 0.36 0.32 0.31 0.30 0.30 0.30 0.29 0.28 0.27 0.27 0.27 0.24 0.24 0.23 0.15 0.13 0.13 0.13 0.13 HLD RR % 10.08 1.85 6.50 4.20 3.67 3.46 3.85 4.67 0.93 0.00 1.81 2.87 1.14 2.97 0.00 0.39 0.00 1.00 0.96 0.90 0.61 0.00 0.47 0.00 0.40 0.00 0.78 0.00 0.00 0.00 0.00 0.00 1.04 0.00 0.65 0.00 0.46 1.32 0.85 0.00 0.00 0.28 0.68 0.21 0.19 0.36 0.33 0.11 0.19 0.21 0.29 0.00 CLB = Calibration sample HLD = Holdout sample CUM = Cumulative SEG = STA segment number CUM CLB RR % 8.12 7.36 6.42 6.03 5.64 5.25 4.99 4.89 4.78 4.64 4.27 4.11 3.92 3.74 3.60 3.40 3.31 3.13 3.00 2.89 2.79 2.70 2.61 2.57 2.48 2.43 2.38 2.33 2.28 2.24 2.19 2.14 2.08 2.03 1.87 1.83 1.78 1.74 1.69 1.65 1.61 1.50 1.47 1.41 1.34 1.31 1.28 1.20 1.11 1.04 0.92 0.90 0.61 CUM HLD RR % 10.08 7.51 7.19 6.72 6.13 5.67 5.43 5.38 5.11 4.85 4.46 4.32 4.10 4.01 3.80 3.52 3.40 3.20 3.07 2.95 2.86 2.74 2.63 2.57 2.46 2.41 2.37 2.31 2.25 2.20 2.14 2.08 2.05 1.99 1.87 1.81 1.77 1.76 1.73 1.68 1.63 1.53 1.50 1.44 1.37 1.34 1.32 1.22 1.14 1.08 0.97 0.94 0.69 CUM CLB BUY % 19.18 25.11 32.88 36.99 42.92 48.86 53.88 56.16 58.45 60.27 63.93 65.75 67.58 69.41 70.78 72.60 73.52 75.34 76.71 78.08 79.00 79.91 80.82 81.28 82.19 82.65 83.11 83.56 84.02 84.47 84.93 85.39 85.84 86.30 87.67 88.13 88.58 89.04 89.50 89.95 90.41 91.78 92.24 93.15 94.06 94.52 94.89 95.89 96.80 97.72 99.54 100.00 100.00 CUM CLB AUD % 1.44 2.08 3.12 3.73 4.63 5.67 6.57 6.99 7.44 7.90 9.11 9.74 10.48 11.28 11.97 12.99 13.52 14.67 15.54 16.47 17.21 18.01 18.83 19.26 20.14 20.66 21.23 21.81 22.39 22.99 23.60 24.34 25.10 25.87 28.48 29.37 30.28 31.21 32.15 33.11 34.10 37.18 38.22 40.30 42.62 43.79 44.99 48.72 52.88 57.15 65.67 67.84 100.00 CUM HLD BUY % 21.82 23.64 33.33 36.97 41.82 46.67 51.52 54.55 55.15 55.15 58.18 61.82 63.03 66.67 66.67 67.27 67.27 69.09 70.30 71.52 72.12 72.12 72.73 72.73 73.33 73.33 73.94 73.94 73.94 73.94 73.94 73.94 75.15 75.15 77.58 77.58 78.18 80.00 81.21 81.21 81.21 82.42 83.64 84.24 84.85 85.45 86.06 86.67 87.88 89.09 92.73 92.73 100.00 CUM HLD AUD % 1.49 2.16 3.19 3.79 4.69 5.66 6.52 6.97 7.42 7.83 8.98 9.85 10.58 11.43 12.06 13.13 13.61 14.86 15.73 16.66 17.34 18.12 19.01 19.45 20.49 20.96 21.50 21.98 22.58 23.13 23.72 24.47 25.27 25.95 28.53 29.46 30.37 31.32 32.30 33.30 34.18 37.11 38.34 40.37 42.60 43.74 45.00 48.76 53.09 56.99 65.48 67.64 100.00 %CLB AUD/ %HLD AUD 0.97 0.95 1.01 1.03 0.99 1.07 1.04 0.94 0.99 1.13 1.05 0.73 1.00 0.95 1.09 0.95 1.10 0.92 1.00 1.00 1.09 1.02 0.92 0.98 0.85 1.09 1.05 1.21 0.97 1.09 1.03 0.99 0.94 1.12 1.01 0.96 1.00 0.97 0.96 0.96 1.12 1.05 0.84 1.03 1.04 1.02 0.96 0.99 0.96 1.09 1.00 1.00 0.99 23 24 Table 4: Gains Table for STA, 1-variable Split, Predictor Set 2 Splitting Criterion 4, and Min/Max Constraint 300/6000 Segments with at Least 25 customers in the Holdout Sample Results by Descending Response Rates of Segments in the Calibration Sample. 
SEG 1 7 3 2 6 8 13 10 5 11 9 12 16 25 15 4 21 30 43 19 24 20 46 33 42 17 23 22 36 38 18 26 27 31 40 47 28 All CLB RR % 6.62 5.08 4.01 2.78 2.60 2.04 1.79 1.37 1.09 0.91 0.90 0.63 0.62 0.56 0.52 0.43 0.38 0.32 0.32 0.32 0.31 0.30 0.29 0.28 0.26 0.24 0.22 0.21 0.20 0.19 0.19 0.18 0.15 0.13 0.13 0.13 0.12 HLD RR % 8.62 3.94 3.67 3.17 2.24 1.72 1.11 2.43 0.39 0.58 0.90 0.40 0.44 0.00 1.06 0.43 0.28 0.25 0.50 0.00 0.62 0.44 0.71 0.00 0.19 0.18 0.00 0.65 0.00 0.29 0.09 0.26 0.00 0.21 0.00 0.29 0.18 CLB = Calibration sample HLD = Holdout sample CUM = Cumulative SEG = STA segment number CUM CLB RR % 6.62 6.15 5.72 4.78 4.35 3.99 3.77 3.46 3.28 2.95 2.83 2.72 2.62 2.51 2.36 2.20 2.08 1.87 1.78 1.69 1.58 1.55 1.52 1.45 1.39 1.32 1.29 1.26 1.23 1.20 1.11 1.08 1.05 0.99 0.96 0.85 0.81 0.61 CUM HLD RR % 8.62 7.14 6.45 5.42 4.84 4.37 4.03 3.79 3.53 3.13 3.01 2.85 2.73 2.61 2.49 2.32 2.19 1.97 1.88 1.79 1.70 1.66 1.63 1.55 1.48 1.41 1.37 1.35 1.30 1.27 1.17 1.15 1.11 1.05 1.01 0.93 0.88 0.69 CUM CLB BUY % 26.94 36.07 42.01 51.60 58.45 63.47 66.67 70.32 72.15 75.34 76.71 77.63 78.54 79.45 80.82 82.19 83.11 84.93 85.84 86.76 88.13 88.58 89.04 89.95 90.87 91.78 92.24 92.69 93.15 93.61 94.98 95.43 95.89 96.80 97.26 99.09 100.00 100.00 CUM CLB AUD % 2.48 3.57 4.47 6.57 8.17 9.67 10.76 12.39 13.40 15.55 16.47 17.36 18.25 19.25 20.84 22.79 24.26 27.70 29.43 31.18 33.89 34.82 35.77 37.72 39.89 42.25 43.50 44.83 46.22 47.66 52.10 53.61 55.52 59.78 61.93 70.59 75.10 100.00 CUM HLD BUY % 31.52 38.18 43.03 52.73 57.58 61.21 63.03 69.70 70.30 72.12 73.33 73.94 74.55 74.55 76.97 78.18 78.79 80.00 81.21 81.21 83.64 84.24 85.45 85.45 86.06 86.67 86.67 87.88 87.88 88.48 89.09 89.70 89.70 90.91 90.91 94.55 95.76 100.00 CUM HLD AUD % 2.51 3.68 4.59 6.69 8.17 9.63 10.76 12.64 13.72 15.86 16.78 17.83 18.78 19.68 21.24 23.18 24.69 27.98 29.64 31.22 33.90 34.84 36.01 37.89 40.03 42.31 43.55 44.83 46.34 47.79 52.23 53.81 55.74 59.63 61.65 70.29 75.01 100.00 %CLB AUD/ %HLD AUD 0.99 0.94 0.99 1.00 1.08 1.03 0.97 0.86 0.95 1.00 1.00 0.85 0.93 1.11 1.02 1.01 0.98 1.05 1.04 1.11 1.01 0.99 0.80 1.04 1.01 1.04 1.01 1.04 0.93 0.99 1.00 0.95 0.99 1.09 1.07 1.00 0.96 1.00 25 Table 5: Gains Table for Logistic Regression by Decils - Holdout Sample % PROSPECTS 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00 100.00 % RESPONSE 62.42 77.58 83.03 86.67 89.70 92.12 96.36 98.18 100.00 100.00 ACTUAL RESP RATE % 4.29 2.67 1.90 1.49 1.23 1.06 0.95 0.84 0.76 0.69 % RESPONSE/ %PROSPECTS 6.24 3.88 2.77 2.17 1.79 1.53 1.38 1.23 1.11 1.00 PRED RESP RATE % 4.26 2.57 1.83 1.43 1.18 1.00 0.87 0.78 0.70 0.63 26 Table 6: Summary of Performance Results - Holdout Sample LOGIT FRAC RFM CHAI D CHAI D CHAI D CHAI D GA-3 GA-3 GA-4 GA-4 STA-1 STA-1 STA-1 STA-1 STA-2 STA-2 STA-2 STA-2 LOGIT= FRAC RFM CHAID STA-1 STA-2 GA-3 GA-4 CRITE RION MIN SEG SIZE SET 1 10% SET 2 10% 1 150 55.1 59.5 1 300 55.1 2 150 2 4 4 4 4 3 3 4 4 3 3 4 4 SET 3 10% 57.0 SET 4 10% 62.4 63.8 34.9 42.4 73.9 SET 4 30% 83.0 77.5 61.2 68.3 86.0 86.0 86.4 SET 4 50% 89.7 86.9 79.7 79.8 78.4 79.2 59.3 49.8 45.8 78.4 79.2 73.0 68.2 86.0 86.0 86.4 75.9 56.1 59.8 58.9 59.1 78.7 78.7 81.3 80.7 86.8 86.8 87.7 87.1 300 54.3 61.5 59.2 61.0 78.7 78.7 81.3 81.3 86.8 86.8 87.7 87.7 150 300 150 300 150 300 150 300 150 300 150 300 56.6 51.9 57.0 54.5 44.8 50.9 55.4 55.1 52.7 45.5 53.7 50.3 60.6 57.5 59.8 62.9 50.3 50.9 62.1 61.8 55.8 45.5 62.4 59.5 61.9 59.4 58.2 59.8 50.4 50.9 63.5 58.5 57.0 45.5 59.1 60.3 60.0 60.7 59.1 60.0 54.5 50.9 61.1 60.4 57.5 50.7 . 
64.1 80.8 78.5 80.6 79.2 77.0 79.8 78.5 81.5 80.0 77.8 79.4 79.4 77.6 76.3 76.5 78.1 79.4 79.8 77.9 81.2 77.6 77.8 79.4 79.3 73.9 73.6 72.6 75.7 79.8 81.6 76.4 78.2 79.3 78.5 71.9 77.0 73.1 72.1 69.0 77.0 82.4 82.4 75.8 76.4 80.0 81.8 . 82.4 86.9 85.6 86.8 85.0 84.3 87.4 86.3 86.8 86.2 85.1 87.2 86.4 85.1 84.3 83.6 84.7 85.8 88.7 87.0 88.8 84.2 84.5 85.5 86.1 83.2 82.3 80.8 83.4 86.1 86.7 86.1 85.5 84.2 86.8 80.2 84.2 81.7 79.9 77.8 85.0 87.4 88.3 83.4 84.3 86.5 90.3 . 88.3
SET 1 30%  SET 2 30%  SET 3 30%  SET 1 50%  SET 2 50%  SET 3 50%

LOGIT = Logistic regression
FRAC = FRAC-based segmentation
RFM = RFM-based segmentation
CHAID = CHAID tree
STA-1 = STA tree with 1-variable split
STA-2 = STA tree with 2-variable split
GA-3 = GA tree with 3-variable split
GA-4 = GA tree with 4-variable split

Appendix A: Splitting Criteria

Splitting criteria are used to determine the best split out of the many possible ways to partition a node. By and large, we can classify the splitting criteria into two families, one based on the value of the node, the other on the value of the partition.

Node-Value Based Criteria

These criteria are based on the value of the node. The objective is to maximize the improvement in the node value which results from splitting a node into two or more splits. The value of a node t is a function of the response rate (RR) of the node, RR(t) (i.e., the proportion of buyers). The "ideal" split is the one which partitions a father node into two children nodes, one containing only buyers, i.e., RR(t) = 1, and the other only nonbuyers, RR(t) = 0. Clearly, in applications where the number of "responders" is more or less equal to the number of "non-responders", the "worst" split is the one which results in two children each having about the same proportion of buyers and nonbuyers, i.e., RR(t) ~ 1/2. We define the value (or quality) of a node t as a function Q(RR) which satisfies the following conditions:
- Max Q(RR) = Q(0) = Q(1)
- Min Q(RR) = Q(1/2)
- Q(RR) is a convex function of RR (i.e., the second derivative is positive).
- Q(RR) is symmetric, i.e., Q(RR) = Q(1 - RR).
The first two conditions stem from our definition of the "best" and the "worst" splits; the convexity condition follows from the first two conditions; and the symmetry condition from the fact that the reference point is 1/2. Clearly, there are many functions that satisfy these requirements. While the analysis extends to any number of splits per node, we focus here on two-way splits. Examples:

a. The (piecewise) linear function, e.g.,

   Q(RR) = 1 - RR   if RR ≤ 1/2
   Q(RR) = RR       if RR > 1/2                                   (A.1)

b. The quadratic function Q(RR) = a + b·RR + c·RR^2, where a, b and c are parameters. One can show that the only quadratic function that satisfies all these conditions (up to a constant) is:

   Q(RR) = -RR(1 - RR) = RR^2 - RR

For example, the variance used by AID to define the value of a node is, in the binary yes/no case, a quadratic node value function. To show this, we evaluate the variance of the choice variable Y. Denoting by:
   Yi - the choice value of observation i
   Ȳ - the mean value over all observations
   B - the number of buyers
   N - the number of nonbuyers

   Var(Y) = Σi (Yi - Ȳ)^2 / (B + N) = Σi Yi^2 / (B + N) - Ȳ^2

But since Yi is a binary variable (1 - buy, 0 - no buy),

   Σi Yi^2 = Σi Yi = B   and   Ȳ = B / (B + N) = RR

and we obtain:

   Var(Y) = RR - RR^2 = 0.25 - (RR - 0.5)^2

Since an additive constant (and a change of sign) does not affect which split is best, minimizing the variance is equivalent to maximizing the quadratic node value function:

   Q(RR) = (RR - 1/2)^2                                           (A.2)
c. The entropy-based function (Michalski et al., 1993)

   Q(RR) = RR log(RR) + (1 - RR) log(1 - RR)                      (A.3)

This is the negative of the node's entropy; the entropy measures the information content of the node, and the lower the entropy of the resulting children (i.e., the larger Q), the better. Hence the best split is the one that yields the largest information gain of all possible splits of a node.

Figure A.1 exhibits all three functions graphically.

Figure A.1 - Node Value Functions (the piecewise linear, quadratic and entropy-based functions Q(RR), plotted against RR)

Now, the node value resulting from partitioning a node into two children nodes is obtained as the average of the node values of the two descendant nodes, weighted by the proportion of customers, i.e.:

   (N1/N) Q(RR1) + (N2/N) Q(RR2)                                  (A.4)

where:
   N - the number of customers in the father node
   N1, N2 - the number of customers in the descendant left node (denoted by the index 1) and the right node (denoted by the index 2), respectively. In the following we always assume N1 ≤ N2 (i.e., the left node is the smaller one).
   RR1, RR2 - the response rates of the left and the right nodes, respectively.
   Q(RR1), Q(RR2) - the corresponding node value functions.

Thus, the improvement in the node value resulting from the split is given by the difference:

   (N1/N) Q(RR1) + (N2/N) Q(RR2) - Q(RR)                          (A.5)

and we seek the split that yields the maximal improvement in the node value. But since, for a given father node, Q(RR) is the same for all splits, (A.5) is equivalent to maximizing the node value (A.4).

Clearly, in DBM applications, where the buyers are largely outnumbered by the nonbuyers, the reference point of 1/2 may not be appropriate. A more suitable reference point for defining the node value is TRR, the overall response rate of the training audience. Another alternative is the cutoff response rate, CRR, calculated based on economic criteria. The resulting node value functions in this case satisfy all the conditions above, except that they are not symmetrical.

Now, depending upon the value function Q(RR), this yields several heuristic criteria for determining the best split. For example, for the piecewise linear function and a hierarchical tree, a possible criterion is choosing the split which maximizes the value of the smaller child node, max(RR1, 1 - RR1); in a binary tree, the split which yields the greatest contrast between the response rates of the two descendant nodes, max(RR1, RR2). In the quadratic case, a reasonable function is (RR - TRR)^2. Or one can use the entropy-based function (A.3). Finally, we note that with a convex node value function Q(RR), basically any split will result in a positive value improvement, however small. Thus, when using the node-value based criteria for determining the best split, it is necessary to impose a threshold on the minimum segment size and/or the minimum improvement in the node value; otherwise the algorithm will keep partitioning the tree until each node contains exactly one customer.

Partition-Value Based Criteria

Instead of evaluating nodes, one can evaluate partitions. A partition is considered a "good" one if the resulting response rates of the children nodes are significantly different from one another. This can be cast in terms of a test of hypothesis. For example, in the two-way split case:

   H0: p1 = p2
   H1: p1 ≠ p2

where p1 and p2 are the true (but unknown) response rates of the left child node and the right child node, respectively.
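Before turning to the tests of this hypothesis, the node-value criteria above can be collected into a short routine. The sketch below implements the three node value functions (A.1)-(A.3) and the improvement (A.5), and picks the candidate split with the largest improvement; the function names and the toy candidate splits are illustrative assumptions, not the authors' code.

```python
import numpy as np

def q_linear(rr):
    """Piecewise linear node value, eq. (A.1): max(RR, 1 - RR)."""
    return np.maximum(rr, 1.0 - rr)

def q_quadratic(rr):
    """Quadratic node value, eq. (A.2)."""
    return (rr - 0.5) ** 2

def q_entropy(rr):
    """Negative entropy, eq. (A.3): convex, minimal at RR = 1/2."""
    rr = np.clip(rr, 1e-12, 1.0 - 1e-12)
    return rr * np.log(rr) + (1.0 - rr) * np.log(1.0 - rr)

def improvement(n1, b1, n2, b2, q=q_quadratic):
    """Improvement (A.5) from splitting a father node into two children
    holding (n1, b1) and (n2, b2) customers and buyers, respectively."""
    n, b = n1 + n2, b1 + b2
    rr1, rr2, rr = b1 / n1, b2 / n2, b / n
    # weighted node value (A.4) minus the father's node value
    return (n1 / n) * q(rr1) + (n2 / n) * q(rr2) - q(rr)

# Toy candidate splits of a father node with 10,000 customers and 70 buyers;
# a minimum segment size (and/or minimum improvement) must still be imposed.
candidates = [(1200, 30, 8800, 40), (4000, 55, 6000, 15), (300, 20, 9700, 50)]
best = max(candidates, key=lambda s: improvement(*s))
print(best, round(improvement(*best), 6))
```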
A common way to test the hypothesis is by calculating the P_value, defined as the probability, computed under the null hypothesis, of obtaining a sample statistic at least as extreme as the one observed. Then, if the resulting P_value is less than or equal to a predetermined level of significance (often 5%), the hypothesis is rejected; otherwise, the hypothesis is accepted.

(a) The normal test

The hypothesis testing procedure draws on the probability laws underlying the process. In the case of a two-way split, as above, the hypothesis can be tested using the normal distribution. One can find the Z-value corresponding to the P_value, denoted Z(1-α/2), using:

   Z(1-α/2) = |RR1 - RR2| / sqrt( RR1(1 - RR1)/(B1 + N1) + RR2(1 - RR2)/(B2 + N2) )      (A.6)

where:
   RR1, RR2 - the response rates of the left child node (denoted by the index 1) and the right child node (denoted by the index 2), respectively
   B1, B2 - the corresponding numbers of buyers
   N1, N2 - the corresponding numbers of nonbuyers

and then extract the P_value from a normal distribution table.

(b) The chi-square test

In the case of a multiple split, the test of hypothesis is conducted by means of the chi-square test of independence. The statistic for conducting this test, denoted by Y, is given by:

   Y = Σ over splits of (Observed - Expected)^2 / Expected                               (A.7)

Table A.1 exhibits the calculation of the components of Y for a 3-way split, extending the notation above to the case of 3 child nodes. This table can easily be extended to more than three splits per node.

Table A.1: Components of Y

                        Buyers          Nonbuyers       Total
   Split 1  Observed    B1              N1              T1 = B1 + N1
            Expected    T1·B/T          T1·N/T
   Split 2  Observed    B2              N2              T2 = B2 + N2
            Expected    T2·B/T          T2·N/T
   Split 3  Observed    B3              N3              T3 = B3 + N3
            Expected    T3·B/T          T3·N/T
   Total                B = B1+B2+B3    N = N1+N2+N3    T = T1+T2+T3

Y is distributed according to the chi-square distribution with (k - 1) degrees of freedom, where k is the number of splits for the node. One can then extract the P_value for the resulting value of Y from the chi-square distribution. The best split is the one with the smallest P_value.

(c) The smallest child test

Finally, this criterion is based on testing the hypothesis:

   H0: p = TRR
   H1: p ≠ TRR

where p here stands for the true (but unknown) response rate of the smaller child node, and TRR is the observed response rate of the training audience. To test this hypothesis we define a statistic Y denoting the number of standard deviations ("sigmas") that the smaller segment is away from TRR, i.e.:

   Y = (RR - TRR) / sqrt( RR(1 - RR) / N )

where RR is the observed response rate of the smaller child node, and N the number of observations in it. Large values of Y (in absolute value) mean that p is significantly different from TRR, indicating a "good" split. For example, one may reject the null hypothesis, concluding that the split is a good one, if Y is larger than 2 "sigmas".

Appendix B: Tree Classifiers

We discuss below the three tree classifiers that were involved in our study - STA, CHAID and GA.

STA - Standard Tree Algorithm

STA is an AID-like algorithm. The basic AID algorithm is a univariate binary tree. In each iteration of the process, each undetermined node is partitioned, based on one variable at a time, into two descendant nodes. The objective is to partition the audience into two groups that exhibit substantially less variation than the father node. AID uses the sum of squared deviations of the response variable from the mean as the measure of the node value, which, in the binary yes/no case, reduces to the minimum variance criterion,
(RR - 0.5)^2, where RR is the response rate (the ratio of the number of responders to the total number of customers) for the node (see also Appendix A). In each stage, the algorithm searches over all remaining predictors, net of all predictors that have already been used in previous stages to split father nodes, to find the partition that yields the maximal reduction in the variance. In this work we have expanded the AID algorithm in two directions:
- Splitting a node based on two predictors at a time, to allow the interaction terms to affect the tree structure as well.
- Using different reference points in the minimum variance criterion that are more appropriate for splitting populations with marked differences between responders and non-responders, such as DBM applications. Possible candidates are the overall response rate of the training audience, or even the cutoff response rate separating targets from nontargets.
We therefore refer to our algorithm as STA (Standard Tree Algorithm) to distinguish it from the conventional AID algorithm.

CHAID

CHAID (Chi-square AID) is the most common of all tree classifiers. Unlike AID, CHAID is not a binary tree, as it may partition a node into more than two branches. CHAID categorizes all independent continuous and multi-valued integer variables by "similarity" measures, and considers the resulting categories of a variable as a whole unit (group) for splitting purposes. Take for example the variable MONEY (money spent), categorized into four ranges, each represented by a dummy 0/1 variable which assumes the value of 1 if the variable value falls in the corresponding range, 0 otherwise. Denote the resulting four categorical variables as A, B, C and D, respectively. Since MONEY is an ordinal variable (order is important), there are 3 possibilities to split this variable into two adjacent categories: (A, BCD), (AB, CD), (ABC, D); 3 possibilities to split the variable into 3 adjacent categories: (A, B, CD), (AB, C, D), (A, BC, D); and one way to split the variable into four adjacent categories: (A, B, C, D). Now, CHAID considers each of these partitions as a possible split, and seeks the best combination to split the node from among all possible combinations. As a result, a node in CHAID may be partitioned into more than two splits, as many as four splits in this particular example. The best split is based on a chi-square test, which is what gave this method its name.

Clearly, there are many ways to partition a variable with K values into M categories (children nodes). To avoid choosing a combination that randomly yields a "good" split, some versions of CHAID use an adjusted P_value criterion to compare candidate splits. Let L denote the number of possible combinations for combining a variable with K values into M categories. Let α denote the Type-I error (also known as the level of significance) in the chi-square test of independence. α is the probability of rejecting the null hypothesis that there is no significant difference in the response rates of the resulting child nodes, when the null hypothesis is true. Now, the probability of accepting the null hypothesis in one combination is (1 - α), and in L successive combinations (assuming the tests of hypotheses are independent) it is (1 - α)^L. Hence the probability of making a Type-I error in at least one combination is 1 - (1 - α)^L, which is greater than α. To yield a "fair" comparison of the various combinations, α is replaced by the resulting P_value.
The number of possibilities for combining a variable with K values into M categories depends on the type of the variable involved. In our algorithm, we distinguish between three cases:
- Ordinal variables, where one may combine only adjacent categories (as in the case of the variable MONEY above).
- Ordinal variables with a missing value that may be combined with any of the other categories.
- Nominal variables, where one may combine any two (or more) values, including the missing value (e.g., the variable MARITAL with four nominal values: M - married, S - single, W - widow, D - divorce).

Table B.1 exhibits the number of combinations for several representative values of K and M.

Table B.1: Number of Possible Combinations

 K    M    Ordinal    Ord+Miss    Nominal
 2    2        1          1           1
 3    2        2          3           3
 4    2        3          5           7
 4    3        3          5           6
 5    2        4          7          15
 5    3        6         12          25
 5    4        4          7          10
 6    2        5          9          31
 6    3       10         22          90
 6    4       10         22          65
 6    5        5          9          15
 7    2        6         11          63
 7    3       15         35         301
 7    4       20         50         350
 7    6        6         11          21
 8    2        7         13         127
 8    4       35         95        1701
 8    6       21         51         266
 8    7        7         13          28
 9    2        8         15         255
 9    4       56        161        7770
 9    6       56        161        2646
 9    8        8         15          36
10    2        9         17         511
10    4       84        252       34105
10    6      126        406       22827
10    8       36         92         750
10    9        9         17          45
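The counts in Table B.1 can be reproduced with standard combinatorics: the Ordinal column appears to equal the binomial coefficient C(K-1, M-1) (choosing M-1 cut points among the K-1 gaps between adjacent values), and the Nominal column appears to equal the Stirling numbers of the second kind S(K, M). The short sketch below is our own illustration of this observation, not part of the authors' algorithm.

```python
# Reproducing the Ordinal and Nominal columns of Table B.1 (illustrative sketch).
from math import comb
from functools import lru_cache

def ordinal_count(k, m):
    """Adjacent groupings of k ordered values into m categories: choose m-1 cut points."""
    return comb(k - 1, m - 1)

@lru_cache(maxsize=None)
def nominal_count(k, m):
    """Unordered groupings of k values into m non-empty categories:
    Stirling numbers of the second kind, S(k, m) = m*S(k-1, m) + S(k-1, m-1)."""
    if m == 0:
        return 1 if k == 0 else 0
    if k == 0 or m > k:
        return 0
    return m * nominal_count(k - 1, m) + nominal_count(k - 1, m - 1)

# A few checks against Table B.1:
assert ordinal_count(7, 4) == 20 and nominal_count(7, 4) == 350
assert ordinal_count(10, 6) == 126 and nominal_count(10, 6) == 22827
```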
Genetic Algorithm (GA)

All the tree algorithms described above are combinatorial, in the sense that in each stage they go over all possible combinations to partition a node. This number becomes excessively large even for one-variable splits and computationally prohibitive with multi-variable splits. Consequently, all tree algorithms are univariate (AID, CHAID) or at best bi-variate (STA). Yet, it is conceivable that splits based on several variables at a time (more than two) may be more "homogenous" and therefore better from the standpoint of profiling. Thus, by confining oneself to univariate or even bi-variate tree algorithms, one may miss the better splits that could have been obtained with multivariate algorithms. To resolve this issue, we developed a Genetic Algorithm (GA) tree for profiling which, unlike the other trees, is non-combinatorial, in the sense that it employs a random, yet systematic, search approach to grow a tree rather than going over all possible combinations. This significantly reduces the number of combinations to consider in partitioning a node, thus allowing one to increase the number of variables used to split a node beyond two. In fact, with this approach one can theoretically use any number of variables to split a node, but for computational reasons we have confined the number of simultaneous variables to the range 3-7.

Genetic Algorithm (GA) is a general-purpose search procedure based on the biological principle of "the survival of the fittest", according to which the strongest and fittest have a higher likelihood of reproduction than the weak and the unfit. Thus the succeeding descendants, having inherited the better properties of their parents, tend to be even stronger and healthier than their predecessors, and therefore improve over time with each additional generation (Davis, 1991). This idea has been applied to find heuristic solutions for large-scale combinatorial optimization problems. Starting with the "better" solutions in each generation (according to some "fitness" measure), GA creates successive offspring solutions that are likely to result in a better value of the objective function as one goes from one generation to the next, thus finally converging to a local, if not a global, optimum (Holland, 1975). These solutions are created by means of a "reproduction" process which involves two basic operations, mutations and crossovers:
- Mutation - randomly changing some of the genes of the parent solution.
- Crossover - crossing over the genes of two parent solutions; some of the genes are taken from the "mother" solution, the rest from the "father" solution.

In the context of our profiling problem, GA is used as an algorithm to grow the tree and generate candidate splits for a node. A solution in our case is a collection of splitting rules specifying whether a customer belongs to the left segment or to the right segment. For example, if X_3 = 1, X_4 = 0 and X_7 = 1, the customer belongs to the left segment; otherwise he/she belongs to the right segment. One may use several ways to represent splits in GA. One possibility is by means of a vector whose dimension equals the number of potential predictors, one entry for each predictor. The value of each entry denotes how the corresponding predictor affects the split, e.g.:

 0 - X_i does not affect the current split
-1 - X_i = 0 in the current split
 1 - X_i = 1 in the current split

In the above example (assuming there are only 10 potential predictors, denoted X_1, ..., X_10), the corresponding vector is given by:

(0, 0, 1, -1, 0, 0, 1, 0, 0, 0)

Using the terminology of GA, each such solution is a chromosome, each variable is a gene, and the value of each gene is an allele. Now, the crux of the GA method is to define the descendant solutions from one generation to the next. These are created in our algorithm using mutation and crossover operations, as follows:

Mutation:
- Choose a predictor from the set of predictors in the solution by drawing a random number from the uniform distribution over the range (1, ..., V), where V is the number of potential predictors. Say the predictor selected is X_3.
- Determine the allele of X_3 as follows: -1 with probability p_0; 1 with probability p_1; 0 otherwise. For example, suppose the allele selected is -1; then the descendant solution becomes (0, 0, -1, -1, 0, 0, 1, 0, 0, 0) and the new split is defined by X_3 = 0, X_4 = 0 and X_7 = 1.

The values of p_0 and p_1 are parameters of the algorithm and are set in advance by the user. The mutation operator is applied simultaneously to g genes at a time, where the g genes are also determined at random.

Crossover:
- Pick two solutions at random (a "father" and a "mother").
- Select g consecutive genes at random, say X_3 and X_4.
- Create two descendant solutions, a "daughter" and a "son", by swapping the selected genes: the "daughter" solution gets her mother's genes, except for X_3 and X_4, which are inherited from the father; the "son" solution gets his father's genes, except for X_3 and X_4, which are inherited from the mother.

The process starts with a pool of solutions (population), often created in a random manner. The various reproduction methods are applied to create the descendant solutions of the next generation. The resulting solutions are then evaluated based on the partitioning criteria. The best solutions are retained, and control is handed over to the next generation, and so on, until all termination conditions are met.
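A minimal sketch of the two reproduction operators on the split-vector representation described above follows. It adopts the text's encoding (alleles in {-1, 0, 1}) and parameter names (p_0, p_1, g), but it is our own illustration rather than the authors' implementation; the fitness evaluation against the partitioning criteria is left as a plug-in.

```python
# Illustrative sketch of the GA reproduction operators on the split-vector
# representation (alleles in {-1, 0, 1}); not the authors' implementation.
import random

def mutate(parent, p0=0.3, p1=0.3, g=2):
    """Randomly reassign the alleles of g randomly chosen genes."""
    child = list(parent)
    for i in random.sample(range(len(child)), g):
        r = random.random()
        child[i] = -1 if r < p0 else (1 if r < p0 + p1 else 0)
    return child

def crossover(father, mother, g=2):
    """Swap a run of g consecutive genes between two parent solutions."""
    start = random.randrange(len(father) - g + 1)
    son, daughter = list(father), list(mother)
    son[start:start + g] = mother[start:start + g]
    daughter[start:start + g] = father[start:start + g]
    return son, daughter

def split_rule(chromosome):
    """A customer goes to the left segment iff every non-zero gene is matched:
    X_i = 1 where the allele is 1, and X_i = 0 where the allele is -1."""
    def belongs_left(x):          # x is a 0/1 vector of predictor values
        return all((a != 1 or xi == 1) and (a != -1 or xi == 0)
                   for a, xi in zip(chromosome, x))
    return belongs_left

# Example with 10 potential predictors, as in the text:
parent = [0, 0, 1, -1, 0, 0, 1, 0, 0, 0]
print(mutate(parent))
print(crossover(parent, [0, 1, 0, 0, -1, 0, 0, 0, 1, 0]))
```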
Finally, we note that in our GA tree we define the smaller child node (i.e., the one with the fewer customers) as a terminal node. This is based on the plausible assumption that with multivariate splits, the resulting smaller segment is homogenous "enough" to be made a terminal node. Hence, the resulting GA tree is hierarchical.

References

Ben-Akiva, M. and Lerman, S.R. (1987), Discrete Choice Analysis, Cambridge, MA, The MIT Press.

Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984), Classification and Regression Trees, Belmont, CA, Wadsworth.

Bult, J.R. and Wansbeek, T. (1995), Optimal Selection for Direct Mail, Marketing Science, 14, pp. 378-394.

Davis, L., editor (1991), Handbook of Genetic Algorithms, New York, Van Nostrand Reinhold.

Haughton, D. and Oulabi, S. (1997), Direct Marketing Modeling with CART and CHAID, Journal of Direct Marketing, 11, pp. 42-52.

Holland, J.H. (1975), Adaptation in Natural and Artificial Systems, Ann Arbor, University of Michigan Press.

Kass, G. (1983), An Exploratory Technique for Investigating Large Quantities of Categorical Data, Applied Statistics, 29.

Kestnbaum, R.D., Kestnbaum & Company, Chicago, Private Communication.

Levin, N. and Zahavi, J. (1996), Segmentation Analysis with Managerial Judgment, Journal of Direct Marketing, 10, pp. 28-47.

Long, J.S. (1997), Regression Models for Categorical and Limited Dependent Variables, Thousand Oaks, CA, Sage Publications.

Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (1983), Machine Learning - An Artificial Intelligence Approach, Palo Alto, CA, Tioga Publishing Company.

Morwitz, G.V. and Schmittlein, D. (1992), Using Segmentation to Improve Sales Forecasts Based on Purchase Intent: Which "Intenders" Actually Buy?, Journal of Marketing Research, 29, pp. 391-405.

Morwitz, G.V. and Schmittlein, D. (1998), Testing New Direct Marketing Offerings: The Interplay of Management Judgment and Statistical Models, Management Science, 44, pp. 610-628.

Murthy, K.S. (1998), Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey, Data Mining and Knowledge Discovery, 2, pp. 345-389.

Novak, P.T., de Leeuw, J. and MacEvoy, B. (1992), Richness Curves for Evaluating Market Segmentation, Journal of Marketing Research, 29, pp. 254-267.

Quinlan, J.R. (1986), Induction of Decision Trees, Machine Learning, 1, pp. 81-106.

Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, San Mateo, CA, Morgan Kaufmann.

Shepard, D., editor (1995), The New Direct Marketing, New York, Irwin Professional Publishing.

Sonquist, J., Baker, E. and Morgan, J.N. (1971), Searching for Structure, Ann Arbor, University of Michigan, Survey Research Center.

Weinstein, A. (1994), Market Segmentation, New York, Irwin Professional Publishing.