Developing Customer Churn Models for Customer Relationship Management

Stephen Rodriguez1
Computer Science and Information Systems Departments, School of Arts and Science, Iona College, New Rochelle, NY 10801, USA
E-mail: [email protected]

Heechang Shin
Information Systems Department, Hagan School of Business, Iona College, New Rochelle, NY 10801, USA
E-mail: [email protected]

Abstract: Customer churn is the term used to describe customers who terminate their relationship with a company. Since churn means a loss of revenue to a company, it is important to identify customers who are likely to churn and provide them with incentives to stay with the company. This paper designs methodologies for the customer churn prediction problem in the wireless telecommunications industry. Since the number of features in a customer churn dataset is rather large, the performance of decision trees is significantly degraded if inappropriate features are selected and used for building the trees. This paper identifies features that have a high effect on customer churn, and designs methodologies that cope with the high dimensionality of the dataset and the customer churn prediction problem. It provides experimental results of a set of data mining schemes and feature selection methods on customer churn datasets.

Keywords: customer churn, classification model, data mining, decision tree

Biographical notes: Stephen Rodriguez is enrolled in a 5-year dual degree program for a BA/MS in Computer Science with a minor in Information Systems. His research interests are data mining algorithms and natural language processing: the use of data mining for solving business problems and natural language processing for small children. He specializes in front-end and back-end development of desktop and mobile software applications using programming languages such as Java, C++, and HTML/CSS/JS/XML. Heechang Shin received his PhD in Information Technology from Rutgers University.
He is currently an Assistant Professor of Information Systems at Iona College. His research interests are information privacy and security and data mining: information security and privacy issues raised by data mining techniques, efficient enforcement of privacy and security via secured indexing and access control, and the use of data mining for enhancing security and mobile databases.

1. Introduction

Customer churn is the term used to describe customers who terminate their relationship with a company. Since churn means a loss of revenue, it is important to identify the customer group that is most likely to leave the company and provide incentives to retain them as customers. Managing customer churn is of great concern to global telecommunications service companies as the market matures (Ahn et al., 2006). The annual churn rate ranges from 20% to 40% in most of the global mobile telecommunications service companies (Berson, Smith, & Therling, 1999; Madden, Savage, & Coble-Neal, 1999; Parks Associates, 2003; Kim & Jeong, 2004). In a highly competitive and maturing mobile telecommunications service market, instead of attempting to entice new customers or to lure subscribers away from competitors, Fornell and Wernerfelt (1987) suggest reducing customer exit and brand switching. Reichheld (1996) estimated that, with an increase in customer retention rates of just 5%, the average net present value of a customer increases by 35% for software companies and 95% for advertising agencies. Therefore, in order to be successful in the maturing market, the strategic focus of a company ought to shift from acquiring customers to retaining customers by reducing customer churn (Ahn et al., 2006).

1 Corresponding Author

Data mining methodology makes a tremendous contribution by allowing researchers to extract the hidden knowledge and information inherent in the data (Hosseini et al., 2010). Neslin et al.
(2006) indicate that numerous steps in the customer churn prediction process have an impact on its success. As a result, academic literature on the optimization of churn prediction algorithms has exploded, covering support vector machines (Xia and Jin, 2008; Zhao et al., 2005), logistic regression (Smith, Willis, and Brooks, 2000), Naïve Bayes classifiers (Buckinx et al., 2002), artificial neural networks (Pendharkar, 2009; Tsai and Lu, 2009), and K-nearest neighbor classifiers (Ruta et al., 2006). Decision tree classification is popular among data mining methodologies due to its simplicity and transparency (Duda, Hart, and Stork, 2001). For instance, Murthy (1998) found more than 300 academic references that use decision trees in a variety of settings. Although decision tree classification has already been used to build churn models by other researchers such as Lemmens and Croux (2006), they focus on the application of decision trees to churn datasets and discuss their results. However, since the number of features in a customer churn dataset is rather large (more than 15 in general), the performance of decision trees is significantly degraded if inappropriate features are selected and used for building the trees. None of the previous work on customer churn classification focuses on the performance of decision trees under different feature selection methods; instead, all the features provided by the dataset are simply used. This paper aims to evaluate various feature selection methods that have a high impact on customer churn, and to design methodologies that cope with the high dimensionality of the dataset and the customer churn prediction problem. The research contributions of this paper are summarized as follows:

1. Feature selection methods and their impact on prediction performance are analyzed.
Since the number of features in a customer churn dataset is rather large and some of them are not relevant to prediction, several methodologies have been applied to the customer churn datasets to identify appropriate features for prediction.

2. Data mining methodologies often require finding the optimum parameter setting, and decision trees are no exception. Various parameter settings have been tested and their results are presented.

3. This paper provides a comprehensive evaluation of a set of other data mining schemes, and their results are compared with the decision tree algorithms.

The organization of this paper is as follows. Section 2 describes the decision tree methodologies and the datasets used in the paper. Experimental results of various feature selection methods and data mining results on the customer churn datasets are discussed in Section 3. Summary and future work are discussed in Section 4.

2. Methodology

2.1 Evaluation Dataset

The following two churn datasets are used in this paper:

Larose (2013) Dataset: This churn dataset deals with telecommunications customers and the data pertinent to the telephone calls they make. The dataset contains 19 variables and 3,333 samples. Each record in the dataset is a wireless telecommunications service subscriber, labeled as either a churner (i.e., a customer who left the company) or not. The attributes (features) of the dataset include continuous, discrete, and symbolic forms, and symbolic-valued attributes are mapped to integer values ranging from 0 to N − 1, where N is the number of symbols. The variables of the dataset are listed in Appendix I.

Fuqua (2013) Dataset: This dataset includes data about 40,000 wireless customers. The normal churn rate for the company was about 2% per month, but the dataset contains roughly 50% churners to make it easier to identify the factors influencing customer churn. Appendix II describes the list of attributes in the dataset.
2.2 Decision Tree Classification

Classification is the task of learning a model that maps each attribute set to one of the predefined class labels, given a collection of records (called the training set) where each record is composed of an attribute set and a class label. Table 1 describes examples of classification tasks (Pang-Ning et al., 2006). A classification model requires two different types of dataset: a training dataset and a test dataset. The training dataset is used to learn the model, and the test dataset is classified based on the model developed from the training set.

Table 1. Example of Classification Tasks

Task                          Attribute set                                              Class label
Categorizing email messages   Features extracted from email message header and content   Spam or non-spam
Categorizing customers        Features extracted from the customer data                  Churn or non-churn
Identifying tumor cells       Features extracted from MRI scans                          Malignant or benign cells

The decision tree classifier is a widely used classification technique and has the following advantages (Pang-Ning et al., 2006; Bramer, 2007; Decision Tree, 2013):

- Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
- Requires little data preparation. Other techniques often require data normalization, dummy variables to be created, and blank values to be removed.
- Able to handle both numerical and categorical data. Other techniques are usually specialized for datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables, while neural networks can be used only with numerical variables.)
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed by Boolean logic. (An example of a black box model is an artificial neural network, since the explanation for its results is difficult to understand.)
- Possible to validate a model using statistical tests.
This makes it possible to account for the reliability of the model.
- Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.

The decision tree classifier is constructed in a top-down, recursive, divide-and-conquer manner: at the start, all the training data is at the root, and the data is partitioned recursively based on selected attributes, chosen by a heuristic or statistical measure such as information gain (Pang-Ning et al., 2006). In this section, the decision tree induction algorithm by Pang-Ning et al. (2006), called TreeGrowth, is introduced; it is shown in Algorithm 1. The algorithm works by recursively selecting the best attribute to split the data (step 7) and expanding the leaf nodes of the tree (steps 11 and 12) until the stopping criterion is met (step 1) (Pang-Ning et al., 2006).

Algorithm 1. Decision Tree Induction Algorithm

TreeGrowth(E, F)
input: E = training records, F = attribute set
output: root of the decision tree classifier
1:  if stopping_cond(E, F) = true then
2:    leaf = createNode()
3:    leaf.label = Classify(E)
4:    return leaf
5:  else
6:    root = createNode()
7:    root.test_cond = find_best_split(E, F)
8:    let V = {v | v is a possible outcome of root.test_cond}
9:    for each v ∈ V do
10:     Ev = {e | root.test_cond(e) = v and e ∈ E}
11:     child = TreeGrowth(Ev, F)
12:     add child as a descendant of root and label the edge (root → child) as v
13:   end for
14: end if
15: return root

In Algorithm 1, the find_best_split() function determines which attribute should be selected as the test condition for splitting the training records. The choice of test condition depends on which impurity measure is used to measure the goodness of a split.
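The recursion in Algorithm 1 can be sketched in Python. This is an illustrative toy implementation, not the tool used in the paper's experiments: records are (attribute-dict, label) pairs, and stopping_cond, find_best_split, and Classify are simplified stand-ins for the roles described above, with find_best_split minimizing the weighted Gini index.

```python
from collections import Counter

def classify(records):
    # Majority class label among the records at this node.
    return Counter(label for _, label in records).most_common(1)[0][0]

def stopping_cond(records, attrs, n_min=2):
    # Stop when the node is pure, no attributes remain, or too few records.
    labels = {label for _, label in records}
    return len(labels) <= 1 or not attrs or len(records) < n_min

def find_best_split(records, attrs):
    # Pick the attribute whose split minimizes the weighted Gini index.
    def gini(recs):
        total = len(recs)
        return 1.0 - sum((c / total) ** 2
                         for c in Counter(l for _, l in recs).values())
    def weighted_gini(attr):
        parts = {}
        for rec, label in records:
            parts.setdefault(rec[attr], []).append((rec, label))
        return sum(len(p) / len(records) * gini(p) for p in parts.values())
    return min(attrs, key=weighted_gini)

def tree_growth(records, attrs, n_min=2):
    if stopping_cond(records, attrs, n_min):
        return {"label": classify(records)}          # leaf node (steps 2-4)
    attr = find_best_split(records, attrs)           # step 7
    node = {"test": attr, "children": {}}
    partitions = {}
    for rec, label in records:                       # steps 9-10
        partitions.setdefault(rec[attr], []).append((rec, label))
    for v, part in partitions.items():               # steps 11-12
        node["children"][v] = tree_growth(
            part, [a for a in attrs if a != attr], n_min)
    return node
```

The n_min argument plays the role of the minimum-records threshold n used throughout the experiments.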
Commonly used measures include entropy and the Gini index, shown below (Pang-Ning et al., 2006):

Entropy(t) = − ∑_{i=0}^{c−1} p(i|t) log₂ p(i|t), with the convention 0 log₂ 0 = 0

Gini(t) = 1 − ∑_{i=0}^{c−1} p(i|t)²

where c is the number of classes and p(i|t) is the fraction of records belonging to class i at a given node t. The behaviors of the above measures are consistent (Pang-Ning et al., 2006), i.e., the impurity value is maximized when the classes in a node are uniformly distributed and minimized when all the records belong to the same class. For example, suppose that there are only two classes, C0 and C1. If a node N1 has 0 samples of class C0 and 6 samples of class C1,

Entropy(N1) = −(0/6) log₂(0/6) − (6/6) log₂(6/6) = 0
Gini(N1) = 1 − (0/6)² − (6/6)² = 0

Also, if a node N2 has 3 samples of class C0 and 3 samples of class C1,

Entropy(N2) = −(3/6) log₂(3/6) − (3/6) log₂(3/6) = 1
Gini(N2) = 1 − (3/6)² − (3/6)² = 0.5

Finally, if a node N3 has 1 sample of class C0 and 5 samples of class C1,

Entropy(N3) = −(1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.650
Gini(N3) = 1 − (1/6)² − (5/6)² = 0.278

To determine which attribute to split on, the degree of impurity of the parent node (before splitting) is compared with the degree of impurity of the child nodes (after splitting); the larger the difference, the better the split (Pang-Ning et al., 2006). The gain is a criterion that can be used to determine the goodness of a split:

Gain = I(parent) − ∑_{j=1}^{k} [N(v_j)/N] I(v_j)

where I(parent) is the impurity measure of the parent node, N is the total number of records at the parent, k is the number of attribute values, and N(v_j) is the number of records associated with the child node v_j. The decision tree classifier illustrated in Algorithm 1 chooses the attribute that maximizes the gain. Since I(parent) is the same for all attributes, maximizing the gain is equivalent to minimizing the weighted average impurity of the child nodes.
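The worked examples above can be checked with a few lines of Python (standard library only). Note that Gini(N3) works out to 1 − (1/6)² − (5/6)² = 10/36 ≈ 0.278, and note the 0 log₂ 0 = 0 convention in the entropy function.

```python
import math

def entropy(counts):
    """Entropy of a node given per-class record counts; 0*log2(0) taken as 0."""
    total = sum(counts)
    # "+ 0.0" normalizes the -0.0 that arises for a pure node.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0) + 0.0

def gini(counts):
    """Gini index of a node given per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gain(parent, children, impurity=entropy):
    """Gain of a split: parent impurity minus weighted child impurity."""
    n = sum(parent)
    return impurity(parent) - sum(sum(c) / n * impurity(c) for c in children)

print(entropy([0, 6]), gini([0, 6]))                      # N1: 0.0 0.0
print(entropy([3, 3]), gini([3, 3]))                      # N2: 1.0 0.5
print(round(entropy([1, 5]), 3), round(gini([1, 5]), 3))  # N3: 0.65 0.278
```

A perfectly separating split of a balanced parent, e.g. gain([6, 6], [[0, 6], [6, 0]]), yields the maximum entropy gain of 1.0.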
The stopping_cond() function in Algorithm 1 is used to terminate the tree-growing process by testing whether the number of records has fallen below a minimum threshold level (Pang-Ning et al., 2006). In this paper, we denote by n the minimum threshold level, in other words, the minimum number of records for each node. The details of constructing a decision tree classifier can be found in Pang-Ning et al. (2006).

2.3 Evaluation of Classification Results

The confusion matrix summarizes the number of instances predicted correctly or incorrectly by a classification model for a binary classification problem such as the customer churn model, as illustrated in Table 2 (Pang-Ning et al., 2006). For the customer churn model, class = + implies customer churn, and class = − implies no customer churn.

Table 2. Confusion Matrix

                             PREDICTED CLASS
                             Class = +         Class = −
ACTUAL CLASS   Class = +     True Positive     False Negative
               Class = −     False Positive    True Negative

where each term describes the following (Pang-Ning et al., 2006):

- True Positive (TP): the number of positive examples correctly predicted by the model.
- True Negative (TN): the number of negative examples correctly predicted by the model.
- False Positive (FP): the number of negative examples wrongly predicted as positive by the model.
- False Negative (FN): the number of positive examples wrongly predicted as negative by the model.

Here, the accuracy rate is defined as (TP + TN) / (TP + TN + FP + FN), the percentage of records that are correctly classified. Similarly, the error rate is defined as (FN + FP) / (TP + TN + FP + FN), the percentage of records that are not correctly classified. It is easy to show that accuracy rate = 1 − error rate. Obviously, a model that produces a higher accuracy rate (equivalently, a lower error rate) is preferred to one with a lower accuracy rate (equivalently, a higher error rate).
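With the counts later reported for the CART run on the Larose dataset (TP = 82, TN = 544, FP = 9, FN = 32, taking churn as the positive class), the accuracy rate can be computed directly:

```python
def accuracy_rate(tp, tn, fp, fn):
    # Fraction of records classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    # Fraction of records classified incorrectly; equals 1 - accuracy_rate.
    return (fp + fn) / (tp + tn + fp + fn)

# Counts from the CART confusion matrix on the Larose dataset (churn = positive).
acc = accuracy_rate(tp=82, tn=544, fp=9, fn=32)
print(round(acc * 100, 2))   # 93.85, matching the reported test accuracy
```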
There are two types of errors committed by a classification model: training error and test error. Training error is the error rate when a classification method is applied to the training data, while test error is the error rate of the method applied to previously unobserved data. A good classification model must have a low training error as well as a low test error (Shin, 2011). The performance of classification methods is usually measured by the test error, because we want to know how the methods perform on unobserved data (Shin, 2011). Also, if there are outliers or errors in the training data, a method that performs well on the training dataset may not perform well on the test data (Shin, 2011). Therefore, the test error is used for the comparison of results in this paper.

3. Experimental Study

3.1 Environmental Setting

To address the over-fitting problems that can occur during classification, a randomly selected proportion of the data (i.e., 80% of the dataset) is used to train a model. Once this proportion of the data has made a complete pass through the algorithm, the rest (i.e., 20% of the dataset) is reserved as a test set to evaluate the performance. In Section 2.2, decision tree algorithms using impurity measures such as the Gini index and entropy were introduced, and the performance of these measures is compared in the following sections for each dataset. Weka (a software tool that incorporates standard data mining algorithms) is used in the experiments. The following classifiers in Weka (2013) have been used to test the impact of the impurity measures on the classification tasks:

- CART algorithm (find_best_split() in Algorithm 1 based on the Gini index)
- C4.5 algorithm (find_best_split() in Algorithm 1 based on entropy)

CART (Breiman, 1993) is a decision tree learner that splits nodes based on the Gini index, and C4.5 (Quinlan, 1993) is an algorithm that generates a decision tree using the entropy measure.
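The 80/20 holdout evaluation described above can be sketched with the standard library (the experiments themselves ran inside Weka, so this is only an illustration; the seed value is arbitrary):

```python
import random

def holdout_split(records, train_fraction=0.8, seed=7):
    """Shuffle the records and split them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)    # randomly selected proportion
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]    # 80% train, 20% test

train, test = holdout_split(range(3333))     # Larose dataset has 3,333 samples
print(len(train), len(test))                 # 2666 667
```

The 667 held-out records match the totals in the confusion matrices reported for the Larose dataset below.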
Weka's implementation of CART (or C4.5) is called Simple CART (or J48).

3.2 Results of Larose (2013) Dataset

3.2.1 Comparison of Feature Selection Methods

Since there are 18 possible predictors in the dataset (19 variables including the dependent variable), the first step is to identify the relative importance of the input variables and use those in the data analysis. Principal Components Analysis (PCA) (Jolliffe, 2005), one of the most commonly used approaches for dimensionality reduction, is used. The goal of PCA is to find a new set of features (attributes) that better captures the variability of the data (Pang-Ning et al., 2006). More specifically, the first dimension captures as much of the variability as possible; the second dimension is orthogonal to the first and, subject to that constraint, captures as much of the remaining variability as possible (Pang-Ning et al., 2006). One of the most appealing advantages of PCA is that dimensionality reduction can result in relatively low-dimensional data, to which it may be possible to apply techniques that do not work well with high-dimensional data (Pang-Ning et al., 2006). After PCA has been applied to the dataset, the following 13 attributes are selected: Account Length, Area Code, International Plan, Voice Mail Plan, Number of Voice Mail Messages, Total Day Minutes, Total Day Calls, Total Day Charge, Total Evening Minutes, Total Evening Calls, Total Evening Charge, Total Night Minutes, and Total Night Calls.

Figure 1. Comparison of the Attributes Selected by PCA and the Original Attributes (accuracy rate vs. minimum number of samples in a node, for CART and C4.5 with all 19 attributes and with the PCA-selected attributes)

Figure 1 illustrates the results of the CART and C4.5 algorithms using all the features vs. the ones selected by PCA. The features selected by PCA outperform the corresponding algorithms using all the features at every parameter level.
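The idea behind PCA can be sketched with NumPy (an assumption here; the experiments themselves used Weka's attribute selection): center the data, take the covariance matrix's leading eigenvectors, and project onto them.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project X onto its top n_components principal components."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort by variance explained
    return Xc @ eigvecs[:, order[:n_components]]

# Toy data whose variability lies almost entirely along the first axis.
X = np.array([[2.0, 0.1], [0.0, -0.1], [-2.0, 0.1], [0.0, -0.1]])
Z = pca_transform(X, 1)
print(Z.shape)   # (4, 1): 2-D data reduced to the single dominant direction
```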
These results show the importance of selecting appropriate attributes in the preprocessing steps before constructing the decision trees.

3.2.2 Comparison of Impurity Measures

Figure 2 illustrates the experimental results of the decision tree classifiers using the Gini index and entropy with different parameter values, using the features selected by PCA (which perform better on this dataset). Entropy performs better in terms of test accuracy for various values of the parameter. In the case of the Gini index (the CART algorithm), the best test accuracy is 93.85% (i.e., the test error is 6.15%) when the minimum number of samples allowed in a node, denoted as n, is 6; the corresponding confusion matrix is presented in Table 3. Figure 2 illustrates the typical over-fitting problem when n is 2: although the training error is smaller when n = 2, the test error is larger than in the other cases. The test error improves when n is 4, but beyond n = 4 the test accuracy stops improving, indicating that the classifiers are not sensitive to the parameter in that range.

Figure 2. Accuracy Rate of Decision Tree Algorithms (CART vs. C4.5, accuracy rate vs. minimum number of samples in a node)

For the entropy impurity measure (the C4.5 algorithm), the best test accuracy is 94.45% (i.e., the test error is 5.55%) when n = 6, as illustrated in Table 4. Figure 2 illustrates that the test accuracy improves as n is increased up to 6; however, the test accuracy rate becomes lower for n greater than 6.

Table 3. Confusion Matrix for CART Algorithm

                                     PREDICTED CLASS
                                     Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn     544                 9
               Class = Churn         32                  82

Table 4. Confusion Matrix for C4.5 Algorithm

                                     PREDICTED CLASS
                                     Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn     546                 7
               Class = Churn         34                  84

3.2.3.
Comparison with Other Prediction Models

As discussed in the Introduction, various data mining techniques have already been applied to customer churn datasets, including SVMs (Xia and Jin, 2008; Zhao et al., 2005), logistic regression (Smith, Willis, and Brooks, 2000), Naïve Bayes classifiers (Buckinx et al., 2002), artificial neural networks (Pendharkar, 2009; Tsai and Lu, 2009), and K-nearest neighbor classifiers (Ruta et al., 2006). To compare our results with previously proposed approaches, the following classifiers have been used alongside the decision trees (Weka's implementation of each algorithm is listed in parentheses):

- SVM (SMO)
- Logistic Analysis (Simple Logistic)
- Naïve Bayes (standard probabilistic Naïve Bayes classifier)
- Artificial Neural Network (Multilayer Perceptron)
- K-nearest Neighbor Classifier (IBk)

Their test accuracy results are listed below:

- SVM: 82.90%
- Logistic Analysis: 83.96%
- Naïve Bayes: 84.40%
- Artificial Neural Network: 93.25%
- K-nearest Neighbor Classifier: 85.60%

All the data mining algorithms in the experimental results except the artificial neural network showed considerably lower performance. The performance of distance-based classification algorithms such as SVM and the k-nearest neighbor algorithm is relatively lower than the others. This can be explained by the curse of dimensionality (Rust, 1997), which arises when analyzing and organizing data in high-dimensional spaces: as the dimensionality increases, the volume of the space grows so fast that the available data becomes sparse (Rust, 1997). The best of these algorithms is the artificial neural network, with a test accuracy of 93.25%. In future work, we want to study why the artificial neural network performs better than the other algorithms.
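The sparsity argument can be illustrated with a small simulation (not from the paper): as the dimension grows, the gap between the nearest and farthest neighbor of a random query point shrinks relative to the nearest distance, which is what hurts distance-based methods such as SVM and k-NN.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from a query to random points."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

# Distances concentrate as dimensionality grows, so the contrast collapses.
print(relative_contrast(2) > relative_contrast(500))   # True
```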
However, the artificial neural network still performs slightly below the C4.5 algorithm's test accuracy (94.45%).

3.3 Results of Fuqua (2013) Dataset

3.3.1 Comparison of Feature Selection Methods

In order to identify input variables that can be used to predict customer churn, the following methods have been used in addition to PCA (discussed in Section 3.2):

- Greedy Stepwise Approach: It searches greedily through the space of attribute subsets; it does not backtrack but terminates as soon as adding or deleting the best remaining attribute decreases the evaluation metric (Witten et al., 2011).
- Features used in the Larose (2013) dataset: Since the classification results on the Larose (2013) dataset are quite good, the features compatible with the Larose (2013) dataset have been identified and used for prediction.
- Features used in Ahn et al. (2006): Ahn et al. (2006) conducted a behavioral study of specific customer churn determinants and identified major constructs that affect customer churn, as well as the mediation effects of customer status that indirectly affect customer churn. All the features compatible with their study have been used for prediction.

The selected features (input variables) for each method described above are listed in Appendix III.

Figure 3. Accuracy Rate of J48 (accuracy rate vs. minimum number of samples in a node, for all features, PCA, greedy stepwise, Larose, and Ahn et al. feature sets)

Figure 4. Accuracy Rate of CART (accuracy rate vs. minimum number of samples in a node, for all features, PCA, greedy stepwise, Larose, and Ahn et al. feature sets)

Figures 3 and 4 illustrate the results of the C4.5 and CART algorithms for each feature selection method. Unexpectedly, the best performance is obtained when using all features in both algorithms.
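Greedy stepwise (forward) selection as described by Witten et al. (2011) can be sketched as follows; `evaluate` stands in for whatever subset-evaluation metric the search uses and is a hypothetical callback here, as are the attribute names in the usage example.

```python
def greedy_stepwise(attributes, evaluate):
    """Forward selection: keep adding the best attribute until no addition helps."""
    selected = []
    best_score = evaluate(selected)
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: evaluate(selected + [a]))
        score = evaluate(selected + [candidate])
        if score <= best_score:   # metric no longer improves: stop, no backtracking
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected

# Hypothetical metric rewarding two informative attributes and penalizing noise.
informative = {"day_minutes", "intl_plan"}
score_fn = lambda subset: sum(1.0 if a in informative else -0.1 for a in subset)
print(greedy_stepwise(["day_minutes", "intl_plan", "area_code"], score_fn))
# ['day_minutes', 'intl_plan']
```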
PCA performs better on the Larose (2013) dataset, but its performance on the Fuqua (2013) dataset is lower than the other feature selection methods in most cases. The greedy stepwise approach performs slightly worse than using all the features. This can be explained by the fact that decision tree algorithms also choose the splitting attributes greedily based on the selected impurity measure, and therefore the two show similar performance. The features compatible with the Larose dataset show promising results in the experiments. A future study will evaluate other feature selection methods to identify determinants of customer churn. As in the experiments in Section 3.2, these results show the importance of selecting appropriate attributes in the preprocessing steps before constructing the decision trees.

3.3.2 Comparison of Impurity Measures

Figure 5. Accuracy Rate of CART and C4.5 (accuracy rate vs. minimum number of samples in a node)

Figure 5 illustrates the experimental results of the decision tree classifiers using the Gini index and entropy with different parameter values, using all the attributes of the Fuqua (2013) dataset. In most cases, the Gini index (the CART algorithm) performs better in terms of test accuracy. The best test accuracy of the Gini index is 60.28% (i.e., the test error is 39.72%) when n is 1, as described in Table 5. Figure 5 illustrates that the test accuracy rate decreases steadily as n increases for the Gini index (the CART algorithm). For the entropy measure (the C4.5 algorithm), the test accuracy rate peaks (described in Table 6) at n = 50.

Table 5. Confusion Matrix for CART Algorithm

                                     PREDICTED CLASS
                                     Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn     1985                1455
               Class = Churn         1335                2246

Table 6.
Confusion Matrix for C4.5 Algorithm

                                     PREDICTED CLASS
                                     Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn     1996                1444
               Class = Churn         1371                2210

3.3.3 Comparison with Other Prediction Models

The test accuracy results of the other prediction models described in Section 3.2.3 are listed below:

- SVM: 57.23%
- Logistic Analysis: 57.57%
- Naïve Bayes: 52.81%
- Artificial Neural Network: 52.51%
- K-nearest Neighbor Classifier: 52.47%

All these data mining algorithms showed considerably lower performance than the decision tree classifiers, whose best test accuracy is 60.28% (CART algorithm with n = 1). Compared to the Larose (2013) dataset, the test accuracy is relatively low for all the data mining algorithms used in this paper. It is assumed that this relatively low test accuracy stems from the data quality of the dataset. As discussed in Section 2.1, the dataset was modified to contain roughly 50% churners to make it easier to identify the factors influencing customer churn. However, this modification may have impacted the quality of the data for identifying customer churners, resulting in relatively low test accuracy.

4. Conclusion and Future Work

This paper focuses on predicting customer churn through the decision tree classification method. Two customer churn datasets have been used to study various feature selection methods, and their impacts on prediction performance have been analyzed. Since the number of features in each customer churn dataset is rather large and some of them are not relevant to prediction, several methodologies have been applied to identify appropriate features for prediction. Among them, feature selection by PCA performs well, but it is also worthwhile to construct a decision tree using all the provided features and to compare both results. In addition, in order to overcome the over-fitting problem, the optimum parameter setting, such as the minimum number of samples in a node (denoted as n), has been analyzed.
Although this process requires an extensive experimental study, the test accuracy rate generally shows a concave function with respect to n, which makes it easier to find the optimal parameter setting. Finally, our results are compared with a set of other data mining schemes that have been applied to the customer churn datasets. On both datasets, the decision tree classifiers show a better accuracy rate. Our future study includes the investigation of various methodologies that may generate a better accuracy rate. Although the decision-tree-based results show relatively better performance on each dataset, the result on the Fuqua (2013) dataset is still low (about 60% accuracy) and the false positive rate is quite high. For this purpose, various ensemble methods will be evaluated rather than applying a single classifier. In addition, the identification of additional predictors for customer churn will be studied.

References

Ahn, J. H., Han, S. P., & Lee, Y. S. (2006). Customer churn analysis: Churn determinants and mediation effects of partial defection in the Korean mobile telecommunications service industry. Telecommunications Policy, 30(10), 552-568.

Bayraktar, E., Tatoglu, E., Turkyilmaz, A., Delen, D., & Zaim, S. (2012). Measuring the efficiency of customer satisfaction and loyalty for mobile phone brands with DEA. Expert Systems with Applications, 39(1), 99-106.

Berson, A., Smith, S., & Therling, K. (1999). Building data mining applications for CRM. New York: McGraw-Hill.

Bramer, M. (2007). Principles of data mining. Springer.

Breiman, L. (Ed.). (1993). Classification and regression trees. CRC Press.

Buckinx, W., Baesens, B., Van den Poel, D., Van Kenhove, P., & Vanthienen, J. (2002). Using machine learning techniques to predict defection of top clients. In A. Zanasi, A. Brebbia, N. F. F. Ebecken & P. Melli (Eds.), Data Mining III (Vol. 6, pp. 509-517).

Decision tree learning (2013).
(online) http://en.wikipedia.org/wiki/Decision_tree_learning (accessed October 6, 2013).

Duke (2003). "Can You Keep a Customer? Predicting Churn Rates at Fuqua's Teradata Center." (online) http://www.fuqua.duke.edu/admin/extaff/news/embanews/1103/embanws1103_teradata.html (accessed on November 6, 2013).

Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York, NY: John Wiley & Sons.

Fornell, C., & Wernerfelt, B. (1987). Defensive marketing strategy by customer complaint management: A theoretical analysis. Journal of Marketing Research, 24(4), 337-346.

Fuqua Data Center (2013). http://www.fuqua.duke.edu/centers/ccrm/ (online) (accessed on November 6, 2013).

Hearst, M. A., Dumais, S. T., Osman, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. IEEE Intelligent Systems and their Applications, 13(4), 18-28.

Hosseini, S. M. S., Maleki, A., & Gholamian, M. R. (2010). Cluster analysis using data mining approach to develop CRM methodology to assess the customer loyalty. Expert Systems with Applications, 37(7), 5259-5264.

Jolliffe, I. (2005). Principal component analysis. John Wiley & Sons, Ltd.

Kim, M. K., & Jeong, D. H. (2004). The effects of customer satisfaction and switching barriers on customer loyalty in Korean mobile telecommunication services. Telecommunications Policy, 28(2), 145-159.

Larose, D. T. (2013). http://dataminingconsultant.com/DKD.htm (online) (accessed on November 10, 2013).

Lemmens, A., & Croux, C. (2006). Bagging and boosting classification trees to predict churn. Journal of Marketing Research, 43(2), 276-286.

Madden, G., Savage, S. J., & Coble-Neal, G. (1999). Subscriber churn in the Australian ISP market. Information Economics and Policy, 11(2), 195-207.

Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2, 345-389.

Pang-Ning, T., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston: Pearson Addison-Wesley.
Parks Associates. (2003). US Mobile Market Intelligence. August, 2003.

Pendharkar, P. C. (2009). Genetic algorithm based neural network approaches for predicting churn in cellular wireless network services. Expert Systems with Applications, 36(3), 6714–6720.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann Publishers.

Reichheld, F. F. (1996). The loyalty effect: The hidden force behind growth, profits and lasting value. Harvard Business School Press.

Ruta, D., Nauck, D., & Azvine, B. (2006). K nearest sequence method and its application to churn prediction. Paper presented at the 7th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2006), Burgos, Spain.

Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica, 487–516.

Shim, B., Choi, K., & Suh, Y. (2012). CRM strategies for a small-sized online shopping mall based on association rules and sequential patterns. Expert Systems with Applications, 39(9), 7736–7742.

Shin, H. (2011). Identifying network intrusion with defensive forecasting. International Journal of Business Continuity and Risk Management, 2(2), 91–104.

Smith, K. A., Willis, R. J., & Brooks, M. (2000). An analysis of customer retention and insurance claim patterns using data mining: A case study. Journal of the Operational Research Society, 51(5), 532–541.

Tsai, C. F., & Lu, Y. H. (2009). Customer churn prediction by hybrid neural networks. Expert Systems with Applications, 36(10), 12547–12553.

Weka (2013). http://www.cs.waikato.ac.nz/ml/weka/downloading.html (online) (accessed on November 6, 2013).

Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques. Elsevier.

Xia, G. E., & Jin, W. D. (2008). Model of customer churn prediction on support vector machine. Systems Engineering-Theory & Practice, 28(1), 71–77.
Zhao, Y., Li, B., Li, X., Liu, W., & Ren, S. (2005). Customer churn prediction using improved one-class support vector machine. In Advanced data mining and applications (pp. 300–306). Springer Berlin Heidelberg.

Appendix I. Variables for Larose (2013) Dataset

| Variable Name | Variable Type | Description |
| Account Length | Integer | How long the account has been active |
| Area Code | Categorical | Area code of the subscriber |
| International Plan | Dichotomous | "Yes" or "No" |
| VMail Message | Integer | Number of voice mail messages |
| Voice Mail Plan | Dichotomous | "Yes" or "No" |
| Total Day Minutes | Continuous | Minutes of service use during the day |
| Total Day Charge | Continuous | Charge for service use during the day |
| Total Day Calls | Integer | Number of calls during the day |
| Total Evening Minutes | Continuous | Minutes of service use during the evening |
| Total Evening Charge | Continuous | Charge for service use during the evening |
| Total Evening Calls (Eve Calls) | Integer | Number of calls during the evening |
| Total Night Minutes (Night Mins) | Continuous | Minutes of service use during the night |
| Total Night Charge | Continuous | Charge for service use during the night |
| Total Night Calls (Night Calls) | Integer | Number of calls during the night |
| Total International Minutes (Intl Mins) | Continuous | Minutes of international calls |
| Total International Calls (Intl Calls) | Integer | Number of international calls |
| Total International Charge | Continuous | Charge for international service use |
| Number of Calls to Customer Service | Integer | Number of calls by the user to customer service |
| Dependent Variable | Dichotomous | Churn (1) or not (0) |

Appendix II.
Variables for Fuqua (2013) Dataset

| Variable Name | Variable Type | Description |
| revenue | Continuous | Average monthly revenue from the customer's account |
| mou | Continuous | Customer's average minutes of use per month |
| recchrge | Continuous | Customer's average recurring charge per month |
| directas | Continuous | Customer's average number of directory-assisted calls per month |
| overage | Continuous | Customer's average overage minutes of use per month |
| roam | Continuous | Customer's average number of monthly roaming calls |
| changem | Continuous | Average percentage change in monthly minutes of use |
| changer | Continuous | Average percentage change in monthly revenue |
| dropvce | Continuous | Average number of dropped voice calls per month |
| blckvce | Continuous | Average number of blocked voice calls per month |
| unansvce | Continuous | Average number of unanswered voice calls per month |
| custcare | Continuous | Average number of customer care calls per month |
| threeway | Continuous | Average number of three-way calls per month |
| mourec | Continuous | Average number of received voice calls per month |
| outcalls | Continuous | Average number of outbound voice calls per month |
| incalls | Continuous | Average number of inbound voice calls per month |
| peakvce | Continuous | Average number of inbound and outbound peak voice calls per month |
| opeakvce | Continuous | Average number of inbound and outbound off-peak voice calls per month |
| dropblk | Continuous | Average number of dropped or blocked calls per month |
| callfwdv | Continuous | Average number of call-forwarding calls per month |
| callwait | Continuous | Average number of call-waiting calls per month |
| churn | Dichotomous | Churn (1) or no churn (0) between 31 and 60 days after the observation date |
| months | Integer | Number of months the customer has subscribed to the service |
| uniqsubs | Integer | Number of unique subscribers on the customer account |
| actvsubs | Integer | Number of active subscribers on the customer account |
| csa | String | Communications service area |
| phones | Integer | Number of handsets issued to the customer |
| models | Integer | Number of models issued to the customer |
| eqpdays | Integer | Number of days the customer has used the current equipment |
| customer | Integer | Customer identification number |
| age1 | Integer | Age of the first member on the customer account |
| age2 | Integer | Age of the second member on the customer account |
| children | Dichotomous | Child (1) or no child (0) on the customer account |
| credita | Dichotomous | Customer has the credit rating "A" (1) or not (0) |
| creditaa | Dichotomous | Customer has the credit rating "AA" (1) or not (0) |
| creditb | Dichotomous | Customer has the credit rating "B" (1) or not (0) |
| creditc | Dichotomous | Customer has the credit rating "C" (1) or not (0) |
| creditde | Dichotomous | Customer has the credit rating "E" (1) or not (0) |
| creditgy | Dichotomous | Customer has the credit rating "GY" (1) or not (0) |
| creditz | Dichotomous | Customer has the credit rating "Z" (1) or not (0) |
| prizmrur | Dichotomous | Rural area (1) or not (0) |
| prizmub | Dichotomous | Suburban area (1) or not (0) |
| prizmtwn | Dichotomous | Town area (1) or not (0) |
| refurb | Dichotomous | Handset is refurbished (1) or not (0) |
| webcap | Dichotomous | Handset is web-capable (1) or not (0) |
| truck | Dichotomous | Subscriber owns a truck (1) or not (0) |
| rv | Dichotomous | Subscriber owns a recreational vehicle (1) or not (0) |
| occprof | Dichotomous | Subscriber's occupation is professional (1) or not (0) |
| occcler | Dichotomous | Subscriber's occupation is clerical (1) or not (0) |
| occcrft | Dichotomous | Subscriber's occupation is in craft (1) or not (0) |
| occstud | Dichotomous | Subscriber is a student (1) or not (0) |
| occhmkr | Dichotomous | Subscriber is a homemaker (1) or not (0) |
| occret | Dichotomous | Subscriber is retired (1) or not (0) |
| occself | Dichotomous | Subscriber is self-employed (1) or not (0) |
| ownrent | Dichotomous | Subscriber is renting (1) or not (0) |
| marryun | Dichotomous | Marital status of the subscriber is unknown (1) or not (0) |
| marryyes | Dichotomous | Subscriber is married (1) or not (0) |
| marryno | Dichotomous | Subscriber is not married (1) or married (0) |
| mailord | Dichotomous | Subscriber purchases handsets via mail order (1) or not (0) |
| mailres | Dichotomous | Subscriber responds to mail offers (1) or not (0) |
| mailflag | Dichotomous | Subscriber has chosen to be solicited by mail (1) or not (0) |
| travel | Dichotomous | Subscriber has traveled to a non-US country (1) or not (0) |
| pcown | Dichotomous | Subscriber owns a personal computer (1) or not (0) |
| creditcd | Dichotomous | Subscriber possesses a credit card (1) or not (0) |
| retcalls | Integer | Total number of calls previously made to the retention team |
| retaccpt | Integer | Total number of previous retention offers accepted by the customer |
| newcelly | Dichotomous | Subscriber is known to be a new cell phone user (1) or not (0) |
| newcelln | Dichotomous | Subscriber is known not to be a new cell phone user (1) or not (0) |
| refer | Integer | Total number of referrals made by the subscriber |
| incmiss | Dichotomous | Income data is missing (1) or not (0) |
| income | Integer | Monthly income of the subscriber |
| mcycle | Dichotomous | Subscriber owns a motorcycle (1) or not (0) |
| creditad | Dichotomous | Higher adjustments were made to the customer's credit rating (1) or not (0) |
| setprcm | Dichotomous | Handset price data is missing (1) or not (0) |
| setprc | Dichotomous | Handset price is known (1) or not (0) |
| retcall | Dichotomous | Customer has made a call to the retention team (1) or not (0) |
| calibrat | Dichotomous | Record belongs to the calibration sample (1) or not (0) |
| Dependent Variable | Dichotomous | Churn (1) or not churn (0) |

Appendix III.
List of Attributes for Feature Selection Methods for Fuqua (2013) Dataset

[Table: for each Fuqua (2013) attribute from revenue through retcalls (revenue, mou, recchrge, directas, overage, roam, changem, changer, dropvce, blckvce, unansvce, custcare, threeway, mourec, outcalls, incalls, peakvce, opeakvce, dropblk, callfwdv, callwait, churn, months, uniqsubs, actvsubs, csa, phones, models, eqpdays, customer, age1, age2, children, credita, creditaa, creditb, creditc, creditde, creditgy, creditz, prizmrur, prizmub, prizmtwn, refurb, webcap, truck, rv, occprof, occcler, occcrft, retcalls), an "O" marks whether the attribute was selected by each of the compared methods: PCA, GreedyStepwise, Larose (2013), and Ahn et al. (2006).]
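As a hedged illustration of the feature selection step compared in Appendix III, the sketch below applies two of the methods, PCA and greedy forward (stepwise) selection, before fitting a decision tree. It uses scikit-learn on synthetic data as a stand-in for the Fuqua (2013) dataset; the sample sizes, tree depth, and number of selected features are illustrative assumptions, not the paper's actual experimental setup (which used Weka).

```python
# Illustrative sketch (not the paper's actual pipeline): dimensionality
# reduction via PCA and greedy forward (stepwise) feature selection before
# growing a decision tree, as compared in Appendix III.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data: 30 numeric predictors, binary label.
X, y = make_classification(n_samples=600, n_features=30, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# (a) PCA: keep the principal components explaining 95% of the variance.
pca_tree = make_pipeline(PCA(n_components=0.95),
                         DecisionTreeClassifier(max_depth=5, random_state=0))
pca_tree.fit(X_tr, y_tr)

# (b) Greedy forward (stepwise) selection of 5 of the original features,
# scored by cross-validated decision tree accuracy.
sfs = SequentialFeatureSelector(
    DecisionTreeClassifier(max_depth=5, random_state=0),
    n_features_to_select=5, direction="forward", cv=3)
sfs_tree = make_pipeline(sfs,
                         DecisionTreeClassifier(max_depth=5, random_state=0))
sfs_tree.fit(X_tr, y_tr)

print("PCA + tree test accuracy:      %.3f" % pca_tree.score(X_te, y_te))
print("Stepwise + tree test accuracy: %.3f" % sfs_tree.score(X_te, y_te))
```

On the real datasets, the same pattern would be applied to the variables listed in Appendices I and II, reducing the large feature space before the tree is grown.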