
Developing Customer Churn Models for Customer Relationship Management
Stephen Rodriguez1
Computer Science and Information Systems Departments
School of Arts and Science
Iona College
New Rochelle, NY 10801, USA
E-mail: [email protected]
Heechang Shin
Information Systems Department
Hagan School of Business
Iona College
New Rochelle, NY 10801, USA
E-mail: [email protected]
Abstract
Customer churn is the term used to describe customers who terminate their relationship with a company. Since churn
means a loss of revenue, it is important to identify customers who are likely to churn and to provide them with
incentives to stay with the company. This paper aims to design methodologies for the customer churn prediction
problem in the wireless telecommunications industry. Since the number of features in a customer churn dataset is
rather large, the performance of decision trees is significantly degraded if inappropriate features are selected and
used for building the trees. This paper identifies features that strongly affect customer churn and designs
methodologies that cope with the high dimensionality of the dataset and the customer churn prediction problem.
Experimental results of a set of data mining schemes and feature selection methods on customer churn datasets are
provided.
Keywords: customer churn, classification model, data mining, and decision tree
Biographical notes: Stephen Rodriguez is enrolled in a 5-year dual degree program for a BA/MS in Computer
Science with minor in Information Systems. His research interests are data mining algorithms and natural language
processing: use of data mining for solving business problems and natural language processing for small children. He
specializes in front-end and back-end development of desktop and mobile software applications using programming
languages such as Java, C++, and HTML/CSS/JS/XML.
Heechang Shin received his PhD in Information Technology from Rutgers University. He is currently an Assistant
Professor of Information Systems at Iona College. His research interests are information privacy and security and
data mining: information security and privacy issues raised by data mining techniques, efficient enforcement of
privacy and security via secured indexing and access control, use of data mining for enhancing security and mobile
databases.
1. Introduction
Customer churn is the term used to describe customers who terminate their relationship with a company. Since churn
means a loss of revenue, it is important to identify the customer group most likely to leave the company and to
provide incentives in order to retain them as customers. Managing customer churn is of great concern to
global telecommunications service companies as the market matures (Ahn et al., 2006). The annual churn rate ranges
from 20% to 40% in most of the global mobile telecommunications service companies (Berson, Smith, & Therling,
1999; Madden, Savage, & Coble-Neal, 1999; Parks Associates, 2003; Kim & Jeong, 2004). In a highly competitive
and maturing mobile telecommunications service market, instead of attempting to entice new customers or to lure
subscribers away from competitors, Fornell and Wernerfelt (1987) suggest reducing customer exit and brand
switching. Reichheld (1996) estimated that, with an increase in customer retention rates of just 5%, the average net
present value of a customer increases by 35% for software companies and 95% for advertising agencies. Therefore,
in order to be successful in the maturing market, the strategic focus of a company ought to shift from acquiring
customers to retaining customers by reducing customer churn (Ahn et al., 2006).
1 Corresponding Author
Data mining methodology makes a tremendous contribution by allowing researchers to extract the hidden knowledge
and information inherent in the data (Hosseini et al., 2010). Neslin et al. (2006) indicate that numerous steps in the
customer churn prediction process have an impact on its success. As a result, academic literature on the optimization
of churn prediction algorithms has exploded, covering techniques such as support vector machines (Xia and Jin, 2008;
Zhao et al., 2005), logistic regression (Smith, Willis, and Brooks, 2000), Naïve Bayes classifiers (Buckinx et al.,
2002), artificial neural networks (Pendharkar, 2009; Tsai and Lu, 2009), and K-nearest neighbor classifiers (Ruta et
al., 2006).
Decision tree classification is popular among data mining methodologies due to its simplicity and transparency (Duda,
Hart, and Stork, 2001). For instance, Murthy (1998) found more than 300 academic references that use decision trees
in a variety of settings. Although decision tree classification has already been used to predict churn by researchers
such as Lemmens and Croux (2006), they focus on applying decision trees to a churn dataset and discussing the
results. However, since the number of features in a customer churn dataset is rather large (more than 15 in general),
the performance of decision trees is significantly degraded if inappropriate features are selected and used for building
the trees. None of the previous work on customer churn classification focuses on the performance of decision trees
under different feature selection methods; instead, all the features provided by the dataset are simply used.
This paper aims to evaluate various feature selection methods that have high impact on customer churn, and to design
methodologies that will cope with high dimensionality of dataset and the customer churn prediction problem.
Research contributions of this paper are summarized as follows:
1. Feature selection methods and their impact on prediction performance are analyzed. Since the number of features
in a customer churn dataset is rather large and some of them are not relevant to prediction, several methodologies
have been applied to the customer churn datasets to identify appropriate features for prediction.
2. Data mining methodologies often require finding the optimum parameter setting, and decision trees are no
exception. Various parameter settings have been tested and their results are presented.
3. This paper provides a comprehensive evaluation of a set of other data mining schemes, and their results are
compared with decision tree algorithms.
The organization of this paper is as follows. Section 2 describes the decision tree methodologies and datasets used in
the paper. Experimental results of various feature selection methods and data mining results on the customer churn
datasets are discussed in Section 3. Summary and future work are discussed in Section 4.
2. Methodology
2.1 Evaluation Dataset
The following two churn datasets are used in this paper:
- Larose (2013) Dataset: This churn dataset deals with telecommunications customers and the data pertinent to the
telephone calls they make. The dataset contains 19 variables and 3,333 samples. Each record is a wireless
telecommunications service subscriber, labeled according to whether the customer is a churner (i.e., left the
company) or not. The attributes (features) of the dataset include continuous, discrete, and symbolic forms;
symbolic-valued attributes are mapped to integer values ranging from 0 to N − 1, where N is the number of
symbols. The variables of the dataset are listed in Appendix I.
- Fuqua (2013) Dataset: The dataset includes data about 40,000 wireless customers. The normal churn rate for the
company was about 2% per month, but the dataset contains roughly 50% churners to make it easier to identify the
factors influencing customer churn. Appendix II describes the list of attributes in the dataset.
2.2 Decision Tree Classification
Classification is the task of learning a model that maps each attribute set to one of the predefined class labels, given a
collection of records (called the training set) in which each record consists of an attribute set and a class label. Table
1 describes examples of classification tasks (Pang-Ning et al., 2006). A classification model requires two different
types of dataset: a training dataset and a test dataset. The training dataset is used to learn the model, and the test
dataset is classified based on the model developed from the training set.
Table 1. Example of Classification Tasks

Task                         Attribute set                                              Class label
Categorizing email messages  Features extracted from email message header and content   Spam or non-spam
Categorizing customers       Features extracted from the customer data                  Churn or non-churn
Identifying tumor cells      Features extracted from MRI scans                          Malignant or benign cells
The decision tree classifier is a widely used classification technique and has the following advantages (Pang-Ning et
al., 2006; Bramer, 2007; Decision tree learning, 2013):
- Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
- Requires little data preparation. Other techniques often require data normalization, creation of dummy variables,
and removal of blank values.
- Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets
that have only one type of variable. (For example, association rules can be used only with nominal variables,
while neural networks can be used only with numerical variables.)
- Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily
given by Boolean logic. (An example of a black box model is an artificial neural network, since the explanation
for the results is difficult to understand.)
- Possible to validate a model using statistical tests, which makes it possible to account for the reliability of the
model.
- Robust. Performs well even if its assumptions are somewhat violated by the true model from which the data were
generated.
A decision tree classifier is constructed in a top-down, recursive, divide-and-conquer manner: at the start, all the
training data is at the root, and the data is partitioned recursively based on selected attributes, which are chosen using
a heuristic or statistical measure such as information gain (Pang-Ning et al., 2006). In this section, the decision tree
induction algorithm of Pang-Ning et al. (2006), called TreeGrowth, is introduced; it is shown in Algorithm 1. The
algorithm works by recursively selecting the best attribute to split the data (step 7) and expanding the leaf nodes of
the tree (steps 11 and 12) until the stopping criterion is met (step 1) (Pang-Ning et al., 2006).
Algorithm 1. Decision Tree Induction Algorithm
TreeGrowth(E, F)
input: E = training records, F = attribute set
output: root of the decision tree classifier
1:  if stopping_cond(E, F) = true then
2:    leaf = createNode()
3:    leaf.label = Classify(E)
4:    return leaf
5:  else
6:    root = createNode()
7:    root.test_cond = find_best_split(E, F)
8:    let V = {v | v is a possible outcome of root.test_cond}
9:    for each v ∈ V do
10:     E_v = {e | root.test_cond(e) = v and e ∈ E}
11:     child = TreeGrowth(E_v, F)
12:     add child as descendant of root and label the edge (root → child) as v
13:   end for
14: end if
15: return root
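Algorithm 1 can be sketched as runnable Python. This is a minimal illustration, not the paper's implementation: it assumes categorical attributes, uses a majority vote for Classify(), the Gini index for find_best_split(), and a purity/size test for stopping_cond(); the record format and attribute names in the toy example are hypothetical.

```python
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels: 1 - sum of squared class fractions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def find_best_split(E, F):
    # Choose the attribute whose split minimizes the weighted impurity of the children
    best_attr, best_score = None, float("inf")
    for f in F:
        groups = {}
        for x, y in E:
            groups.setdefault(x[f], []).append(y)
        score = sum(len(g) / len(E) * gini(g) for g in groups.values())
        if score < best_score:
            best_attr, best_score = f, score
    return best_attr

def stopping_cond(E, F, n_min):
    # Stop when no attributes remain, the node is too small, or the node is pure
    labels = {y for _, y in E}
    return not F or len(E) <= n_min or len(labels) == 1

def classify(E):
    # Majority class label among the records at a leaf
    return Counter(y for _, y in E).most_common(1)[0][0]

def tree_growth(E, F, n_min=1):
    if stopping_cond(E, F, n_min):
        return {"label": classify(E)}
    attr = find_best_split(E, F)
    node = {"test": attr, "children": {}}
    for v in {x[attr] for x, _ in E}:
        Ev = [(x, y) for x, y in E if x[attr] == v]
        node["children"][v] = tree_growth(Ev, [f for f in F if f != attr], n_min)
    return node

def predict(node, x):
    while "label" not in node:
        node = node["children"][x[node["test"]]]
    return node["label"]

# Toy usage with hypothetical records: (attribute dict, class label)
data = [
    ({"plan": "intl", "vm": "yes"}, "churn"),
    ({"plan": "intl", "vm": "no"}, "churn"),
    ({"plan": "none", "vm": "yes"}, "stay"),
    ({"plan": "none", "vm": "no"}, "stay"),
]
tree = tree_growth(data, ["plan", "vm"])
```

On this toy data the root test is the "plan" attribute, since splitting on it yields pure children.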
In Algorithm 1, the find_best_split() function determines which attribute should be selected as the test
condition for splitting the training records. The choice of test condition depends on which impurity measure is used
to measure the goodness of a split. Commonly used measures include entropy and the Gini index shown below
(Pang-Ning et al., 2006):
- Entropy(t) = − ∑_{i=0}^{c−1} p(i|t) log₂ p(i|t), with the convention that 0 · log₂ 0 = 0
- Gini(t) = 1 − ∑_{i=0}^{c−1} p(i|t)²
where c is the number of classes and p(i|t) is the fraction of records belonging to class i at a given node t.
The behaviors of the above measures are consistent (Pang-Ning et al., 2006): the impurity value is maximized when
the classes are uniformly distributed in a node and minimized when all the records belong to the same class. For
example, suppose that there are only two classes, C0 and C1. If a node N1 has 0 samples for class C0 and 6 samples
for class C1,
- Entropy(N1) = − (0/6) · 0 − (6/6) log₂(6/6) = 0
- Gini(N1) = 1 − (0/6)² − (6/6)² = 0
Also, if a node N2 has 3 samples for class C0 and 3 samples for class C1,
- Entropy(N2) = − (3/6) log₂(3/6) − (3/6) log₂(3/6) = 1
- Gini(N2) = 1 − (3/6)² − (3/6)² = 0.5
Finally, if a node N3 has 1 sample for class C0 and 5 samples for class C1,
- Entropy(N3) = − (1/6) log₂(1/6) − (5/6) log₂(5/6) = 0.650
- Gini(N3) = 1 − (1/6)² − (5/6)² = 0.278
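The node calculations above can be checked numerically. A small self-contained sketch (class counts are passed in directly; the function names are ours, not the paper's):

```python
from math import log2

def entropy(counts):
    # Entropy(t) = -sum p(i|t) * log2 p(i|t), taking 0 * log2(0) = 0
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    # Gini(t) = 1 - sum p(i|t)^2
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Nodes N1 = (0, 6), N2 = (3, 3), N3 = (1, 5) from the worked examples
for node in [(0, 6), (3, 3), (1, 5)]:
    print(node, round(entropy(node), 3), round(gini(node), 3))
```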
To determine which attribute to split on, the degree of impurity of the parent node (before splitting) is compared with
the degree of impurity of the child nodes (after splitting); the larger the difference, the better the split (Pang-Ning et
al., 2006). The gain is a criterion that can be used to determine the goodness of a split:
Gain = I(parent) − ∑_{j=1}^{k} [N(v_j)/N] · I(v_j),
where I(parent) is the impurity measure of the parent node, N is the total number of records at the parent, k is the
number of attribute values, and N(v_j) is the number of records associated with the child node v_j. The decision tree
classifier illustrated in Algorithm 1 chooses the attribute that maximizes the gain. Since I(parent) is the same for all
candidate attributes, maximizing the gain is equivalent to minimizing the weighted average impurity of the child
nodes.
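The gain computation can be sketched in a few lines (hypothetical class counts; the impurity function is pluggable):

```python
def gini(counts):
    # Gini impurity from class counts at a node
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gain(parent_counts, child_counts, impurity=gini):
    # Gain = I(parent) - sum_j N(v_j)/N * I(v_j)
    n = sum(parent_counts)
    weighted = sum(sum(cc) / n * impurity(cc) for cc in child_counts)
    return impurity(parent_counts) - weighted

# A perfect two-way split of a (4, 4) parent yields the maximum gain
print(gain((4, 4), [(4, 0), (0, 4)]))   # parent impurity 0.5, children pure
print(gain((4, 4), [(2, 2), (2, 2)]))   # uninformative split, gain 0
```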
The stopping_cond() function in Algorithm 1 terminates the tree-growing process by testing whether the number
of records has fallen below a minimum threshold level (Pang-Ning et al., 2006). In this paper, we denote by n the
minimum threshold level, in other words, the minimum number of records for each node. Further details on
constructing a decision tree classifier can be found in Pang-Ning et al. (2006).
2.3 Evaluation of Classification Results
A confusion matrix summarizes the number of instances predicted correctly or incorrectly by a classification model
for a binary classification problem such as the customer churn model, as illustrated in Table 2 (Pang-Ning et al.,
2006). For the customer churn model, class = + indicates customer churn, and class = − indicates no customer churn.
Table 2. Confusion Matrix

                                    PREDICTED CLASS
                                    Class = +         Class = -
ACTUAL CLASS   Class = +            True Positive     False Negative
               Class = -            False Positive    True Negative
where each term is defined as follows (Pang-Ning et al., 2006):
- True Positive (TP): the number of positive examples correctly predicted by the model.
- True Negative (TN): the number of negative examples correctly predicted by the model.
- False Positive (FP): the number of negative examples wrongly predicted as positive by the model.
- False Negative (FN): the number of positive examples wrongly predicted as negative by the model.
Here, the accuracy rate is defined as (TP + TN) / (TP + TN + FP + FN), the percentage of records that are correctly
classified. Similarly, the error rate is defined as (FP + FN) / (TP + TN + FP + FN), the percentage of records that are
incorrectly classified. It is easy to see that accuracy rate = 1 − error rate. Obviously, the model that produces the
higher accuracy rate (equivalently, the lower error rate) is preferred over one with a lower accuracy rate
(equivalently, a higher error rate).
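These definitions translate directly into code. A minimal sketch; the example counts are taken from the CART confusion matrix reported later in Table 3, with churn as the positive class:

```python
def accuracy(tp, tn, fp, fn):
    # Fraction of records classified correctly
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    # Fraction of records classified incorrectly; equals 1 - accuracy
    return (fp + fn) / (tp + tn + fp + fn)

# TP = 82, TN = 544, FP = 9, FN = 32 gives accuracy 626/667, about 93.85%
print(accuracy(82, 544, 9, 32), error_rate(82, 544, 9, 32))
```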
There are two types of errors committed by a classification model: training error and test error. Training error is the
error rate when a classification method is applied to the training data, while test error is the error rate of the method
applied to previously unobserved data. A good classification model must have low training error as well as low test
error (Shin, 2011). The performance of classification methods is usually measured by the test error because we want
to know how the methods perform on unobserved data (Shin, 2011). Also, if there are outliers or errors in the training
data, a classification method that performs well on the training dataset may not perform well on the test data (Shin,
2011). Therefore, test error is used for comparing results in this paper.
3. Experimental Study
3.1 Environmental Setting
To address over-fitting problems that can occur during classification, a randomly selected proportion of the data
(80% of the dataset) is used to train a model. Once this portion of the data has made a complete pass through the
algorithm, the rest (20% of the dataset) is reserved as a test set to evaluate performance. Section 2.2 introduced
decision tree algorithms using impurity measures such as the Gini index and entropy, and the performance of these
measures is compared in the following sections for each dataset. Weka, a software tool that implements standard
data mining algorithms, is used in the experiments. The following classifiers in Weka (2013) have been used to test
the impact of the impurity measures on the classification tasks:
- CART algorithm (find_best_split() in Algorithm 1 based on the Gini index)
- C4.5 algorithm (find_best_split() in Algorithm 1 based on entropy)
CART (Breiman, 1993) is a decision tree learner that splits nodes based on the Gini index, and C4.5 (Quinlan, 1993)
is an algorithm that generates a decision tree using the entropy measure. Weka's implementation of CART is called
SimpleCART, and its implementation of C4.5 is called J48.
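Weka's SimpleCART and J48 have no direct Python equivalents, but the same comparison can be sketched with scikit-learn's DecisionTreeClassifier, whose criterion parameter switches between the Gini index and entropy (an approximation of C4.5, not the full algorithm). The synthetic data below stands in for a churn dataset, and min_samples_leaf plays the role of the minimum-node-size parameter n:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a 19-attribute churn dataset with 3,333 records
X, y = make_classification(n_samples=3333, n_features=19, n_informative=8,
                           random_state=0)
# 80% / 20% train-test split, as in the experimental setting
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for name, criterion in [("CART-like", "gini"), ("C4.5-like", "entropy")]:
    clf = DecisionTreeClassifier(criterion=criterion, min_samples_leaf=6,
                                 random_state=0)
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)   # test accuracy
print(scores)
```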
3.2 Results of Larose (2013) Dataset
3.2.1 Comparison of Feature Selection Methods
Since there are 18 possible predictors in the dataset (19 variables including the dependent variable), the first step is
to identify the relative importance of the input variables and use the important ones in the data analysis. Principal
Components Analysis (PCA) (Jolliffe, 2005), one of the most commonly used approaches for dimensionality
reduction, is used. The goal of PCA is to find a new set of features (attributes) that better captures the variability of
the data (Pang-Ning et al., 2006). More specifically, the first dimension captures as much of the variability as
possible; the second dimension is orthogonal to the first and, subject to that constraint, captures as much of the
remaining variability as possible (Pang-Ning et al., 2006). One of the most appealing advantages of PCA is that the
resulting relatively low-dimensional data may permit techniques that do not work well with high-dimensional data
(Pang-Ning et al., 2006). After PCA has been applied to the dataset, the following 13 attributes are selected: Account
Length, Area Code, International Plan, Voice Mail Plan, Number of Voice Mail Messages, Total Day Minutes, Total
Day Calls, Total Day Charge, Total Evening Minutes, Total Evening Calls, Total Evening Charge, Total Night
Minutes, and Total Night Calls.
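The PCA step can be sketched on synthetic stand-in data (scikit-learn rather than Weka; note that PCA strictly produces linear combinations of attributes, i.e. components, whereas Weka's PCA-based attribute selection maps the result back to ranked original attributes):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(3333, 19))      # stand-in for the 19-variable dataset

pca = PCA(n_components=13)           # retain 13 dimensions, as in the paper
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```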
Figure 1. Comparison of the Attributes by PCA and the Original Attributes (accuracy rate, 87%–95%, vs. minimum
number of samples in a node, 2–12, for CART and C4.5 using all 19 attributes and using the PCA-selected attributes)
Figure 1 illustrates the results of the CART and C4.5 algorithms using all the features vs. the ones selected by PCA.
The features selected by PCA outperform the full feature set at every parameter level for the corresponding
algorithms. These results show the importance of selecting appropriate attributes in the preprocessing step before
constructing the decision trees.
3.2.2 Comparison of Impurity Measures
Figure 2 illustrates the experimental results of the decision tree classifiers using the Gini index and entropy for
different parameter values, using the features selected by PCA (which perform better on this dataset). Entropy
performs better in terms of test accuracy for various parameter values. For the Gini index (the CART algorithm), the
best test accuracy is 93.85% (test error 6.15%) when the minimum number of samples allowed in a node, denoted n,
is 6, as presented in Table 3. Figure 2 illustrates the typical over-fitting problem when n is 2: although the training
error is smaller when n = 2, the test error is larger than in the other cases. The test error improves when n is 4, but
beyond n = 4 the test accuracy stops improving, indicating that the classifiers are not sensitive to the parameter in
that range.
Figure 2. Accuracy Rate of Decision Tree Algorithms (accuracy rate, 91%–95%, vs. minimum number of samples in
a node, 2–12, for CART and C4.5)
For the impurity measure of entropy (the C4.5 algorithm), the best test accuracy is 94.45% (test error 5.55%) when
n = 6, as illustrated in Table 4. Figure 2 illustrates that test accuracy improves as n is increased up to 6; however, the
test accuracy rate becomes lower for n greater than 6.
Table 3. Confusion Matrix for CART Algorithm

                                       PREDICTED CLASS
                                       Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn       544                 9
               Class = Churn           32                  82
Table 4. Confusion Matrix for C4.5 Algorithm

                                       PREDICTED CLASS
                                       Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn       546                 7
               Class = Churn           34                  84
3.2.3. Comparison with Other Prediction Models
As we discussed in the Introduction section, various data mining techniques have already been applied to customer
churn dataset including SVM (Xia and Jin, 2008; Zhao et al., 2005), logistic regression (Smith, Willis, and Brooks,
2000), Naïve Bayes classifiers (Buckinxet al., 2002), artificial neural networks (Pendharkar, 2009; Tsai and Lu,
2009), and K-nearest neighbor classifiers (Ruta et al., 2006). To compare our results with previously proposed
approaches, the following classifiers have been used to compare with the results from the decision tree:

SVM (SMO)




Logistic Analysis (Simple Logistic)
Naïve Bayes (Standard Probabilistic Naïve Bayes classifier)
Artificial Neural network (Multilayer Perceptron)
K-nearest Neighbor Classifier (IBk)
Weka's implementation of each algorithm is listed in parentheses. Their test accuracy results are as follows:
- SVM: 82.90%
- Logistic Analysis: 83.96%
- Naïve Bayes: 84.40%
- Artificial Neural Network: 93.25%
- K-nearest Neighbor Classifier: 85.60%
All the data mining algorithms in the experiments except the artificial neural network showed considerably lower
performance than the decision tree classifiers. The performance of distance-based classification algorithms such as
SVM and the k-nearest neighbor algorithm is relatively lower than the others. This can be explained by the curse of
dimensionality (Rust, 1997), which arises when analyzing and organizing data in high-dimensional spaces: as the
dimensionality increases, the volume of the space grows so fast that the available data becomes sparse (Rust, 1997).
The best among these algorithms is the artificial neural network, with a test accuracy of 93.25%; in future work, we
want to study why the artificial neural network performs better than the other algorithms. However, it still performs
slightly below the C4.5 algorithm's test accuracy (94.45%).
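The comparison can be sketched with scikit-learn stand-ins for Weka's SMO, SimpleLogistic, NaiveBayes, MultilayerPerceptron, and IBk (synthetic data; the accuracies produced here are illustrative, not the paper's results):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=19, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# scikit-learn counterparts of the Weka classifiers used in the paper
models = {
    "SVM": SVC(),
    "Logistic": LogisticRegression(max_iter=1000),
    "NaiveBayes": GaussianNB(),
    "NeuralNet": MLPClassifier(max_iter=300, random_state=0),
    "kNN": KNeighborsClassifier(),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```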
3.3 Results of Fuqua (2013) Dataset
3.3.1 Comparison of Feature Selection Methods
In order to identify input variables that can be used to predict customer churn, the following methods have been used
in addition to PCA (discussed in Section 3.2):
- Greedy Stepwise Approach: searches greedily through the space of attribute subsets; it does not backtrack but
terminates as soon as adding or deleting the best remaining attribute decreases the evaluation metric (Witten et
al., 2011).
- Features used in the Larose (2013) dataset: since the classification results on the Larose (2013) dataset are quite
good, the features compatible with that dataset have been identified and used for prediction.
- Features used in Ahn et al. (2006): Ahn et al. (2006) conducted a behavioral study of specific customer churn
determinants and identified major constructs that affect customer churn, as well as mediation effects of customer
status that affect churn indirectly. All the features compatible with their study have been used for prediction.
The results of selected features (or input variables) for each method described above are listed in Appendix III.
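Weka's GreedyStepwise has a close analogue in scikit-learn's SequentialFeatureSelector, which greedily adds (or removes) one attribute at a time. A hedged sketch on synthetic data; unlike Weka, it stops at a fixed target subset size rather than on a decrease in merit:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=12, n_informative=4,
                           random_state=0)
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=0),
    n_features_to_select=5,      # hypothetical subset size
    direction="forward",         # greedy forward selection
    cv=3,
)
selector.fit(X, y)
X_selected = selector.transform(X)
print(selector.get_support())    # boolean mask of the chosen attributes
```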
Figure 3. Accuracy Rate of J48 (accuracy rate, 50%–61%, vs. minimum number of samples in a node, 25–400, for
all features, PCA, greedy stepwise, Larose, and Ahn et al. feature sets)
Figure 4. Accuracy Rate of CART (accuracy rate, 51%–61%, vs. minimum number of samples in a node, 25–400, for
all features, PCA, greedy stepwise, Larose, and Ahn et al. feature sets)
Figure 3 and Figure 4 illustrate the results of the C4.5 and CART algorithms for each feature selection method.
Unexpectedly, the best performance in both algorithms is obtained when using all features. PCA performs better on
the Larose (2013) dataset, but its performance on the Fuqua (2013) dataset is lower than the other feature selection
methods in most cases. The greedy stepwise approach performs slightly below using all the features; this can be
explained by the fact that decision tree algorithms also choose splitting attributes greedily based on the selected
impurity measure, and therefore show similar performance. The features compatible with the Larose dataset show
promising results in the experiments. A future study will evaluate other feature selection methods to identify
determinants of customer churn. As in the experiments in Section 3.2, these results show the importance of selecting
appropriate attributes in the preprocessing step before constructing the decision trees.
3.3.2 Comparison of Impurity Measures
Figure 5. Accuracy Rate of C4.5 and CART (accuracy rate, 53%–61%, vs. minimum number of samples in a node,
5–400)
Figure 5 illustrates the experimental results of the decision tree classifiers using the Gini index and entropy for
different parameter values, using all the attributes of the Fuqua (2013) dataset. In most cases, the Gini index (the
CART algorithm) performs better in terms of test accuracy. The best test accuracy for the Gini index is 60.28% (test
error 39.72%) when n is 1, as described in Table 5. Figure 5 illustrates that the test accuracy rate decreases steadily
as n increases for the Gini index (CART algorithm). For the entropy measure (the C4.5 algorithm), the test accuracy
rate peaks at n = 50 (described in Table 6).
Table 5. Confusion Matrix for CART Algorithm

                                       PREDICTED CLASS
                                       Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn       1985                1455
               Class = Churn           1335                2246
Table 6. Confusion Matrix for C4.5 Algorithm

                                       PREDICTED CLASS
                                       Class = Not Churn   Class = Churn
ACTUAL CLASS   Class = Not Churn       1996                1444
               Class = Churn           1371                2210
3.3.3. Comparison with Other Prediction Models
Test accuracy results of the other prediction models described in Section 3.2.3 are listed below:
- SVM: 57.23%
- Logistic Analysis: 57.57%
- Naïve Bayes: 52.81%
- Artificial Neural Network: 52.51%
- K-nearest Neighbor Classifier: 52.47%
All the data mining algorithms showed considerably lower performance than the decision tree classifiers, whose best
test accuracy is 60.28% (CART algorithm with n = 1). Compared to the Larose (2013) dataset, test accuracy is
relatively low for all the data mining algorithms used in this paper. We assume this relatively low test accuracy stems
from the quality of the dataset: as discussed in Section 2.1, the dataset was modified to contain roughly 50% churners
to make it easier to identify the factors influencing customer churn, and this modification may affect the quality of
the data for identifying churners, resulting in relatively low test accuracy.
4. Conclusion and Future Work
This paper focuses on predicting customer churn through the decision tree classification method. Two customer
churn datasets have been used to study various feature selection methods, and their impacts on prediction
performance have been analyzed. Since the number of features in each customer churn dataset is rather large and
some of them are not relevant to prediction, several methodologies have been applied to identify appropriate features
for prediction. Among them, feature selection by PCA performs well, but it is also worthwhile to construct a decision
tree using all the provided features and to compare both results. In addition, in order to overcome the over-fitting
problem, the optimum parameter setting, such as the minimum number of samples in a node, denoted n, has been
analyzed. Although this process requires extensive experimental study, the test accuracy rate generally shows a
concave function with respect to n, which makes it easier to find the optimal parameter setting. Finally, our results
are compared with a set of other data mining schemes that have been applied to the customer churn datasets. On both
datasets, the decision tree classifiers show better accuracy rates.
Our future study includes investigating various methodologies that may generate better accuracy. Although the
decision-tree-based results show relatively better performance on each dataset, the result on the Fuqua (2013) dataset
is still low (about 60% accuracy rate) and the false positive rate is quite high. For this purpose, various ensemble
methods will be evaluated rather than applying a single classifier. In addition, identification of additional predictors
for customer churn will be studied.
References
Ahn, J. H., Han, S. P., & Lee, Y. S. (2006). Customer churn analysis: Churn determinants and mediation effects of
partial defection in the Korean mobile telecommunications service industry. Telecommunications Policy, 30(10),
552-568.
Bayraktar, E., Tatoglu, E., Turkyilmaz, A., Delen, D., & Zaim, S. (2012). Measuring the efficiency of customer
satisfaction and loyalty for mobile phone brands with DEA. Expert Systems with Applications, 39(1), 99-106.
Berson, A., Smith, S., & Therling, K. (1999). Building data mining applications for CRM. New York: McGraw-Hill.
Bramer, M. (2007). Principles of data mining. Springer.
Breiman, L. (Ed.). (1993). Classification and regression trees. CRC Press.
Buckinx, W., Baesens, B., Van den Poel, D., Van Kenhove, P., & Vanthienen, J. (2002). Using machine learning
techniques to predict defection of top clients. In A. Zanasi, C. A. Brebbia, N. F. F. Ebecken, & P. Melli (Eds.), Data
Mining III (Vol. 6, pp. 509-517).
Decision tree learning (2013). (online) http://en.wikipedia.org/wiki/Decision_tree_learning (accessed 10/6/2013)
Duke (2003). “Can You Keep a Customer? Predicting Churn Rates at Fuqua's Teradata Center” (online)
http://www.fuqua.duke.edu/admin/extaff/news/embanews/1103/embanws1103_teradata.html (accessed on November
6 2013).
Duda, R. O., Hart, E., & Stork, D. G. (2001). Pattern classification. New York, NY: John Wiley & Sons.
Fornell, C., & Wernerfelt, B. (1987). Defensive marketing strategy by customer complaint management: A
theoretical analysis. Journal of Marketing Research, 24(4), 337–346.
Fuqua Data Center (2013). http://www.fuqua.duke.edu/centers/ccrm/ (online) (accessed on November 6, 2013).
Hearst, M. A., Dumais, S. T., Osman, E., Platt, J., & Scholkopf, B. (1998). Support vector machines. Intelligent
Systems and their Applications, IEEE, 13(4), 18-28.
Hosseini, S. M. S., Maleki, A., & Gholamian, M. R. (2010). Cluster analysis using data mining approach to develop
CRM methodology to assess the customer loyalty. Expert Systems with Applications, 37(7), 5259-5264.
Jolliffe, I. (2005). Principal component analysis. John Wiley & Sons, Ltd.
Kim, M. K., & Jeong, D. H. (2004). The effects of customer satisfaction and switching barriers on customer loyalty
in Korean mobile telecommunication services. Telecommunications Policy, 28(2), 145–159.
Larose, D. T. (2013). http://dataminingconsultant.com/DKD.htm (online) (accessed on November 10, 2013).
Lemmens, A., & Croux, C. (2006). Bagging and boosting classification trees to predict churn. Journal of Marketing
Research, 43(2), 276–286.
Madden, G., Savage, S. J., & Coble-Neal, G. (1999). Subscriber churn in the Australian ISP market. Information
Economics and Policy, 11(2), 195–207.
Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining
and Knowledge Discovery, 2, 345–389.
Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Pearson Addison Wesley.
Parks Associates. (2003). US Mobile Market Intelligence. August, 2003.
Pendharkar, P. C. (2009). Genetic algorithm based neural network approaches for predicting churn in cellular wireless network services. Expert Systems with Applications, 36(3), 6714-6720.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann Publishers.
Reichheld, F. F. (1996). The loyalty effect: The hidden force behind growth, profits and lasting value. Harvard
Business School Press.
Ruta, D., Nauck, D., & Azvine, B. (2006). K nearest sequence method and its application to churn prediction. Paper
presented at the 7th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL
2006), Burgos, Spain.
Rust, J. (1997). Using randomization to break the curse of dimensionality. Econometrica: Journal of the Econometric
Society, 487-516.
Shim, B., Choi, K., & Suh, Y. (2012). CRM strategies for a small-sized online shopping mall based on association
rules and sequential patterns. Expert Systems with Applications, 39(9), 7736-7742.
Shin, H. (2011). Identifying network intrusion with defensive forecasting. International Journal of Business
Continuity and Risk Management, 2(2), 91-104.
Smith, K. A., Willis, R. J., & Brooks, M. (2000). An analysis of customer retention and insurance claim patterns
using data mining: a case study. Journal of the Operational Research Society, 51(5), 532-541.
Tsai, C. F., & Lu, Y. H. (2009). Customer churn prediction by hybrid neural networks. Expert Systems with
Applications, 36(10), 12547-12553.
Weka (2013). http://www.cs.waikato.ac.nz/ml/weka/downloading.html (online) (accessed on November 6, 2013).
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques. Elsevier.
Xia, G. E., & Jin, W. D. (2008). Model of customer churn prediction on support vector machine. Systems
Engineering-Theory & Practice, 28(1), 71-77.
Zhao, Y., Li, B., Li, X., Liu, W., & Ren, S. (2005). Customer churn prediction using improved one-class support
vector machine. In Advanced data mining and applications (pp. 300-306). Springer Berlin Heidelberg.
Appendix I. Variables for Larose (2013) Dataset
Variable Name | Variable Type | Description
Account Length | Integer | How long the account has been active
Area Code | Categorical | Area code of the subscriber
International Plan | Dichotomous | "Yes" or "No"
VMail Message | Integer | Number of voice mail messages
Voice Mail Plan | Dichotomous | "Yes" or "No"
Total Day Minutes | Continuous | Minutes the customer has used the service during the day
Total Day Charge | Continuous | Charge for usage of the service during the day
Total Day Calls | Integer | Number of calls the customer has made during the day
Total Evening Minutes | Continuous | Minutes the customer has used the service during the evening
Total Evening Charge | Continuous | Charge for usage of the service during the evening
Total Evening Calls (Eve Calls) | Integer | Number of calls the customer has made during the evening
Total Night Minutes (Night Mins) | Continuous | Minutes the customer has used the service during the night
Total Night Charge | Continuous | Charge for usage of the service during the night
Total Night Calls (Night Calls) | Integer | Number of calls the customer has made during the night
Total International Minutes (Intl Mins) | Continuous | Minutes the customer has used the service for international calls
Total International Calls (Intl Calls) | Integer | Number of international calls the customer has made
Total International Charge | Continuous | Charge for usage of the international service
Number of Calls to Customer Service | Integer | Number of calls made by the user to customer service
Dependent Variable | Dichotomous | Churn (1) or not (0)
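Before tree induction, the dichotomous fields in the table above ("Yes"/"No" plans and the churn label) must be coded numerically. A minimal sketch in Python, assuming records are plain dictionaries keyed by the variable names in the table (the helper name `encode_record` and the sample values are hypothetical):

```python
def encode_record(record):
    """Map the "Yes"/"No" dichotomous fields of a Larose-style record
    to 1/0; all other fields pass through unchanged."""
    yes_no_fields = {"International Plan", "Voice Mail Plan"}
    encoded = {}
    for name, value in record.items():
        if name in yes_no_fields:
            encoded[name] = 1 if value == "Yes" else 0
        else:
            encoded[name] = value
    return encoded

sample = {"Account Length": 128, "International Plan": "No",
          "Voice Mail Plan": "Yes", "Total Day Minutes": 265.1}
print(encode_record(sample))
# → {'Account Length': 128, 'International Plan': 0, 'Voice Mail Plan': 1, 'Total Day Minutes': 265.1}
```

Once coded this way, the records can be fed to any decision tree learner that expects numeric inputs.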
Appendix II. Variables for Fuqua (2013) Dataset
Variable Name | Variable Type | Description
revenue | Continuous | Average monthly revenue from the customer's account
mou | Continuous | Customer's average minutes of use per month
recchrge | Continuous | Customer's average recurring charge per month
directas | Continuous | Customer's average number of directory-assisted calls per month
overage | Continuous | Customer's average overage minutes of use per month
roam | Continuous | Customer's average number of monthly roaming calls
changem | Continuous | Average percentage change in monthly minutes of use
changer | Continuous | Average percentage change in monthly revenue
dropvce | Continuous | Average number of dropped voice calls per month
blckvce | Continuous | Average number of blocked voice calls per month
unansvce | Continuous | Average number of unanswered voice calls per month
custcare | Continuous | Average number of customer care calls per month
threeway | Continuous | Average number of three-way calls per month
mourec | Continuous | Average number of received voice calls per month
outcalls | Continuous | Average number of outbound voice calls per month
incalls | Continuous | Average number of inbound voice calls per month
peakvce | Continuous | Average number of in and out peak voice calls per month
opeakvce | Continuous | Average number of in and out off-peak voice calls per month
dropblk | Continuous | Average number of dropped or blocked calls per month
callfwdv | Continuous | Average number of call forwarding calls per month
callwait | Continuous | Average number of call waiting calls per month
churn | Dichotomous | Churn (1) or no churn (0) between 31-60 days after the observation date
months | Integer | Number of months the customer has subscribed to the service
uniqsubs | Integer | Number of unique subscribers for the customer account
actvsubs | Integer | Number of active subscribers for the customer account
csa | String | Communications service area
phones | Integer | Number of handsets issued to the customer
models | Integer | Number of models issued to the customer
eqpdays | Integer | Number of days the current equipment has been in use
customer | Integer | Customer identification number
age1 | Integer | Age of the first member in the customer account
age2 | Integer | Age of the second member in the customer account
children | Dichotomous | Child present (1) or no child (0) in the customer account
credita | Dichotomous | Whether the customer has a credit rating of "A" (1) or not (0)
creditaa | Dichotomous | Whether the customer has a credit rating of "AA" (1) or not (0)
creditb | Dichotomous | Whether the customer has a credit rating of "B" (1) or not (0)
creditc | Dichotomous | Whether the customer has a credit rating of "C" (1) or not (0)
creditde | Dichotomous | Whether the customer has a credit rating of "DE" (1) or not (0)
creditgy | Dichotomous | Whether the customer has a credit rating of "GY" (1) or not (0)
creditz | Dichotomous | Whether the customer has a credit rating of "Z" (1) or not (0)
prizmrur | Dichotomous | Rural area (1) or not (0)
prizmub | Dichotomous | Suburban area (1) or not (0)
prizmtwn | Dichotomous | Town area (1) or not (0)
refurb | Dichotomous | Whether the handset is refurbished (1) or not (0)
webcap | Dichotomous | Whether the handset is web capable (1) or not (0)
truck | Dichotomous | Whether the subscriber owns a truck (1) or not (0)
rv | Dichotomous | Whether the subscriber owns a recreational vehicle (1) or not (0)
occprof | Dichotomous | Whether the subscriber's occupation is professional (1) or not (0)
occcler | Dichotomous | Whether the subscriber's occupation is clerical (1) or not (0)
occcrft | Dichotomous | Whether the subscriber's occupation is in a craft (1) or not (0)
occstud | Dichotomous | Whether the subscriber is a student (1) or not (0)
occhmkr | Dichotomous | Whether the subscriber is a homemaker (1) or not (0)
occret | Dichotomous | Whether the subscriber is retired (1) or not (0)
occself | Dichotomous | Whether the subscriber is self-employed (1) or not (0)
ownrent | Dichotomous | Whether the subscriber is renting (1) or not (0)
marryun | Dichotomous | Whether the subscriber's marital status is unknown (1) or not (0)
marryyes | Dichotomous | Whether the subscriber is married (1) or not (0)
marryno | Dichotomous | Whether the subscriber is not married (1) or married (0)
mailord | Dichotomous | Whether the subscriber purchases handset(s) via mail order (1) or not (0)
mailres | Dichotomous | Whether the subscriber responds to mail offers (1) or not (0)
mailflag | Dichotomous | Whether the subscriber has chosen to be solicited by mail (1) or not (0)
travel | Dichotomous | Whether the subscriber has traveled to a non-US country (1) or not (0)
pcown | Dichotomous | Whether the subscriber owns a personal computer (1) or not (0)
creditcd | Dichotomous | Whether the subscriber possesses a credit card (1) or not (0)
retcalls | Integer | Total number of calls previously made to the retention team
retaccpt | Integer | Total number of previous retention offers accepted by the customer
newcelly | Dichotomous | Whether the subscriber is known to be a new cell phone user (1) or not (0)
newcelln | Dichotomous | Whether the subscriber is known not to be a new cell phone user (1) or not (0)
refer | Integer | Total number of referrals made by the subscriber
incmiss | Dichotomous | Whether income data is missing (1) or not (0)
income | Integer | Monthly income of the subscriber
mcycle | Dichotomous | Whether the subscriber owns a motorcycle (1) or not (0)
creditad | Dichotomous | Whether higher adjustments were made to the customer's credit rating (1) or not (0)
setprcm | Dichotomous | Whether handset price data is missing (1) or not (0)
setprc | Dichotomous | Whether the handset price is known (1) or not (0)
retcall | Dichotomous | Whether the customer has made a call to the retention team (1) or not (0)
calibrat | Dichotomous | Whether the record belongs to the calibration sample (1) or not (0)
Dependent Variable | Dichotomous | Churn (1) or not churn (0)
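The seven credit-rating fields (credita through creditz) are mutually exclusive dummy variables for a single underlying rating. A sketch in Python of how one rating code expands into these flags (the function name, and the assumption that each field suffix corresponds directly to a rating code, are illustrative):

```python
# Suffixes of the credit dummy fields listed in the table above.
CREDIT_SUFFIXES = ["a", "aa", "b", "c", "de", "gy", "z"]

def credit_dummies(rating):
    """Expand a single credit-rating code (e.g. "AA") into the
    credita..creditz dummy fields; exactly one flag is set to 1."""
    rating = rating.lower()
    return {"credit" + s: (1 if s == rating else 0) for s in CREDIT_SUFFIXES}

print(credit_dummies("AA"))
# → {'credita': 0, 'creditaa': 1, 'creditb': 0, 'creditc': 0, 'creditde': 0, 'creditgy': 0, 'creditz': 0}
```

Because exactly one dummy is set per customer, these fields carry redundant information, which is one reason feature selection (Appendix III) tends to keep only a few of them.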
Appendix III. List of Attributes for Feature Selection Methods for Fuqua (2013) Dataset
Variable Names | PCA | GreedyStepwise | Larose
revenue: O
mou: O O
recchrge: O O
directas: O
overage: O
roam: O
changem: O
changer: O
dropvce: O
blckvce: O
unansvce: O
custcare: O
threeway: O
Ahn et al. (2006): O O O O
mourec: O
outcalls: O O
incalls: O O
peakvce: O O
opeakvce: O O
dropblk: O O
callfwdv: O O
callwait: O
churn: O
months: O
uniqsubs: O
actvsubs: O
csa: O
phones: O
models: O
eqpdays: O
customer: O
age1: O
age2: O
children: O
credita: O
creditaa: O
creditb: O
creditc: O
creditde: O
creditgy: O
creditz: O
prizmrur: O O
prizmub: O O
prizmtwn: O O
refurb: O
webcap: O
truck: O
rv: O
occprof: O
occcler: O
occcrft: O
retcalls: O O O O O O O
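The GreedyStepwise column above refers to Weka's greedy stepwise search, which adds one attribute at a time as long as the subset evaluation improves. Its forward variant can be sketched in plain Python; the `score` callable standing in for a cross-validated subset evaluator is a hypothetical placeholder, as is the toy scoring function below:

```python
def greedy_stepwise(features, score):
    """Forward greedy selection: repeatedly add the single feature that
    most improves score(subset); stop when no addition helps."""
    selected = []
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for f in features:
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best:
                best, best_f, improved = s, f, True
        if improved:
            selected.append(best_f)
    return selected

# Toy evaluator: rewards "mou" and "retcalls", penalizes subset size.
toy = lambda subset: len(set(subset) & {"mou", "retcalls"}) - 0.1 * len(subset)
print(greedy_stepwise(["revenue", "mou", "retcalls", "roam"], toy))
# → ['mou', 'retcalls']
```

In Weka itself this corresponds to the GreedyStepwise search method paired with a subset evaluator such as CfsSubsetEval.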