
FACULTEIT ECONOMIE EN BEDRIJFSKUNDE
HOVENIERSBERG 24, B-9000 GENT
Tel.: 32-(0)9-264.34.61 / Fax: 32-(0)9-264.35.92
WORKING PAPER
Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM
Anita Prinzie 1
Prof. Dr. Dirk Van den Poel 2
February 2005
2005/292
1 Corresponding Author: Department of Marketing, Tel: +32 9 264 35 20, fax: +32 9 264 42 79, email: [email protected].
2 Email: [email protected]
D/2005/7012/10
Abstract. The inability to capture sequential patterns is a typical drawback of predictive classification methods. This caveat might be overcome by modeling sequential independent variables with sequence-analysis methods. Combining classification methods with sequence-analysis methods enables classification models to incorporate non-time varying as well as sequential independent variables. In this paper, we precede a classification model by an element/position-sensitive Sequence-Alignment Method (SAM) followed by the asymmetric, disjoint Taylor-Butina clustering algorithm, with the aim of distinguishing clusters with respect to the sequential dimension. We illustrate this procedure on a customer-attrition model serving as a decision-support system for customer retention at an International Financial-Services Provider (IFSP). The binary customer-churn classification model following the new approach significantly outperforms an attrition model that incorporates the sequential information directly into the classification method.
Keywords: sequence analysis, binary classification methods, Sequence-Alignment Method, asymmetric clustering, customer-relationship management, churn analysis
1 Introduction
In the past, traditional classification models like logistic regression have been applied successfully to the
prediction of a dependent variable by a series of non-time varying independent variables [5]. In case
there are time-varying independent variables, these are typically included in the model by transforming
them into non-time varying variables [3]. Unfortunately, this practice results in information loss as the
sequential patterns of the data are neglected. Hence, although traditional classification models are highly
valid and robust for modeling non-time varying data, they are unable to capture sequential patterns in
data. This caveat might be overcome by modeling time-varying independent variables with sequence-analysis methods. Unlike traditional classification methods, sequence-analysis methods were designed
for modeling sequential information. These methods take sequences of data, i.e., ordered arrays, as their
input rather than individual data points. Although rarely applied in marketing, sequence analysis is commonly used in disciplines such as archeology [32], biology [38], computer science [39], economics [19], history [2], linguistics [24], psychology [7] and sociology [1]. Sequence-analysis methods can be
categorized depending on whether the sequences are treated as a whole or step by step [2]. Step-by-step
methods examine relationships among elements or states in the sequences. Time-series methods are used
to study the dependence of an interval-measured sequence on its own past. When the variable of interest
is categorical, Markov methods are appropriate. The latter methods calculate transition probabilities
based on the transition between two events [37]. Transitions from one prior category can be modeled by
using event-history methods, also known as duration methods, hazard methods, failure analysis, and
reliability analysis. The central research question studied is time until transition. Whole-sequence
methods use the entire sequence as unit of analysis to discover similarities between sequences resulting
in typologies. The central issue addressed is whether there are patterns in the sequences, either over the
whole sequences or within parts of them. There are two approaches to this pattern question. In the
algebraic approach, each sequence is reduced to some simplest form and sequences with similar
‘simplest forms’ are gathered under one heading. In the metric approach, a similarity measure between
the sequences is calculated which is then subsequently processed by clustering, scaling and other
categorization methods to extract typical sequential patterns. Methods like optimal matching or optimal
alignment are commonly applied within this metric approach. In an intermediate situation, local
similarities are analyzed to find out the role of key subsequences embedded in longer sequences [49].
Given that traditional classification models are designed for modeling non-time varying independent
variables and that sequence-analysis methods are well-suited to model dynamic information, it follows
that a combination of both methods unites the best of both worlds and allows for building predictive
classification models incorporating non-time varying as well as time-varying independent variables.
One possible approach, among others, is to precede the traditional classification method by a
sequence-analysis method to model the dynamic exogenous variables (cf. serial instead of parallel
combination of classifiers, [26]). In this paper, we precede a logistic regression, as a traditional
classification method, by a Sequence-Alignment Method (i.e., SAM), as a whole sequence method using
the metric approach. The SAM analysis is used to model a time-varying independent variable. We
identify how similar the customers are on the dynamic independent variable by calculating a similarity
measure between each pair of customers, the SAM distances. These distances are further processed by a
clustering algorithm to produce groups of customers which are relatively homogeneous with respect to
the dynamic independent variable. As we cluster on a dimension influencing the dependent variable,
the clusters are not only homogeneous in terms of the time-varying independent variable, but should
also be homogeneous with respect to the dependent variable. This way, we make the implicit link
between clustering and classification explicit. After all, clustering is in theory a special problem of
classification associated with an equivalence relation defined over a set [36]. Including the cluster-membership information as dummies in the classification model not only allows for modeling the dynamic independent variable in an appropriate way, it should also improve the predictive performance.
In this paper, we illustrate the new procedure, which combines a sequence-analysis method with a
traditional classification method, by estimating a customer-attrition model for a large International
Financial-Services Provider (from now on referred to as IFSP). This attrition model feeds the
managerial decision process and helps refine the retention strategy by elucidating the profile of
customers with a high defection risk. A traditional logistic regression is applied to predict whether a
customer will churn or not. This logistic regression is preceded by an element and position-sensitive
Sequence-Alignment Method to incorporate a time-varying covariate. We will calculate the distance
between each customer on a sequential dimension, i.e., the evolution in relative account-balance total of
the customer at the IFSP, and use these distances as input for a subsequent cluster analysis. The cluster-membership information is incorporated in the logistic regression by means of dummies. We hypothesize that the
logistic-regression model with the time-varying independent variable included as cluster dummy
variables will outperform the traditional logistic regression where the same sequential dimension is
incorporated by creating as many non-time varying independent variables as there are time points on
which the dimension is measured.
The remainder of this paper is structured as follows. In Section 2 we describe the different methods
used. We discuss the basic principles of SAM and underline how the cost allocation influences the
mathematical features of the resulting SAM distance measures determining whether a symmetric or
asymmetric clustering algorithm is appropriate. We outline how a modification of Taylor’s cluster-sampling algorithm [42] and Butina’s clustering algorithm based on exclusion spheres [6] allows clustering
on asymmetric SAM distances. In Section 3 we outline how the new procedure proposed in this paper is
applied within a financial-services context to improve prediction of churn behavior. Section 4
investigates whether the results confirm our hypothesis on improved predictive performance. We
conclude with a discussion of the main findings and introduce some avenues for further research.
2 Methodology
2.1 Sequence-Alignment Method (SAM)
The Sequence-Alignment Method (SAM) was developed in computer sciences (text editing and voice
recognition) and molecular biology (protein and nucleic acid analysis). A common application in
computer sciences is string correction or string editing [47]. The main use of sequence comparison in
molecular biology is to detect the homology between macromolecules. If the distance between two
macromolecules is small enough, one may conclude that they have a common evolutionary ancestor.
Applications of sequence alignment in molecular biology use comparatively simple alphabets (the four
nucleotide molecules or the twenty amino acids) but tend to have very long sequences [49]. Conversely,
in marketing applications, sequences will mostly be shorter but with a very large alphabet. Besides
SAM applications in computer sciences and molecular biology, there are applications in social science
[1], transportation research [21] and speech processing [34]. Recently, SAM has been applied in
marketing to discover visiting patterns of websites [18].
Sankoff & Kruskal [40], Waterman [48] and Gribskov & Devereux [14] are good references on the Sequence-Alignment Method. SAM handles variable-length sequences and incorporates sequential information, i.e., the order in which the elements appear in a sequence, into its distance measure (unlike conventional position-based distance measures such as the Euclidean, Minkowski, city-block and Hamming distances). The original sequence-alignment method can be summarized as follows. Suppose we compare sequence a, called the source, having i elements, a = [a1, ..., ai], with sequence b, i.e., the target, having j elements, b = [b1, ..., bj]. In general, the distance or similarity between sequences a and b is
expressed by the number of operations (i.e., total amount of effort) necessary to convert sequence a into
b. The SAM distance is represented by a score. The higher the score, the more effort it takes to equalize
the sequences and the less similar they are. The elementary operations are insertions, deletions and
substitutions or replacements. Deletion and insertion operations, often referred to as indel, are applied to
elements of the source (first) sequence in order to change the source into the target (second) sequence.
Substitution operations indicate deletion + insertion. Some advanced research involves other operations
like swaps or transpositions (i.e., the interchange of adjacent elements in the sequence), compression (of
two or more elements into one element) and expansion (of one element into two or more elements).
Every elementary operation is given a weight (i.e., cost) greater than or equal to zero. It is common
practice to make assumptions on the weights in order to achieve the metric axioms (nonnegative
property, zero property, triangle inequality and symmetry) of mathematical distance (e.g., equal weights
for deletions and insertions to preserve the symmetry axiom) [40]. Weights may be tailored to reflect the
importance of operations, the similarity of particular elements (cf. element sensitive), the position of
elements in the sequence (cf. position sensitive), or the number/type of neighboring elements or gaps
[49]. A different weight for insertion than for deletion, as well as position-sensitive weights, results in SAM distances that are no longer symmetric: |ab| ≠ |ba|. The latter has implications for the clustering algorithm that can be used (cf. infra). Different meanings can be given to the word ‘distance’ in sequence comparison. In this paper, we express the relatedness (similarity or distance) between customers on their evolution in relative account-balance total at the IFSP by calculating the weighted-Levenshtein [29] distance between each possible pair of customers (i.e., pairwise-sequence analysis).
The weighted-Levenshtein distance defines dissimilarity as the smallest sum of operation-weighting
values required to change sequence a into b. This way, a distance matrix is constructed and subsequently used as input for a cluster analysis.
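To make this computation concrete, the following Python sketch illustrates the standard weighted-Levenshtein dynamic program (a minimal sketch, not the exact implementation used in this paper; the weight values are placeholders):

    def weighted_levenshtein(a, b, w_del=1.0, w_ins=0.9, w_sub=2.0):
        """Smallest total cost of deletions, insertions and substitutions
        needed to convert source sequence a into target sequence b."""
        n, m = len(a), len(b)
        # d[i][j] = minimal cost of converting a[:i] into b[:j]
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + w_del              # delete a[i-1]
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + w_ins              # insert b[j-1]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0.0 if a[i - 1] == b[j - 1] else w_sub
                d[i][j] = min(d[i - 1][j] + w_del,     # deletion
                              d[i][j - 1] + w_ins,     # insertion
                              d[i - 1][j - 1] + sub)   # match / substitution
        return d[n][m]

    # With unequal deletion and insertion weights the measure is no longer
    # symmetric, e.g. for sequences of unequal length:
    print(weighted_levenshtein([0, 1, 5, 0], [5, 1]))  # 3.9
    print(weighted_levenshtein([5, 1], [0, 1, 5, 0]))  # 3.7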
2.2 Cluster Analysis of weighted-Levenshtein SAM Distances
We cluster the customers on the weighted-Levenshtein SAM distances, expressing how dissimilar they
are on the sequential dimension. The cluster-membership information resulting from this cluster
analysis is translated into cluster dummies, which represent the sequential dimension in a subsequent
classification model. We hypothesize that a classification model including cluster indicators
(operationalized as dummy variables) based on SAM distances will outperform a similar model where
the same sequential dimension is incorporated by as many non-time varying independent variables as
time points on which the dimension is measured. After all, these dummies are good indicators of what
type of behavior the customer exhibits towards the sequential dimension (i.e., time-varying independent
variable), as well as towards the dependent variable (cf. explicit typology of customers on time-varying
covariate results in implicit typology on the dependent variable).
A distance matrix holding the pairwise weighted-Levenshtein distances between customer sequences, is
used as a distance measure for clustering. As discussed earlier, depending on how the weights (i.e.,
costs) for SAM are set, the distances in the matrix are symmetric or asymmetric. Most common
clustering methods employ symmetric, hierarchical algorithms such as Ward’s, single-, complete-, average- or centroid linkage [15, 20, 25], non-hierarchical algorithms such as Jarvis-Patrick [21], or partitional algorithms such as k-means or hill-climbing. Such methods require symmetric measures, e.g. Tanimoto, Euclidean, Hamann or Ochiai, as their inputs. One drawback of these methods is that they
cannot capture important asymmetric relationships. Nevertheless, there exist many practical scenarios
where the underlying relation is asymmetric. Asymmetric relationships are common in transportation
research (cf. the different distances between two cities A and B (|AB| ≠ |BA|) due to different routes (e.g. the Vehicle Routing Problem [43])), in text mining (cf. word associations, e.g. most people will relate ‘data’
to ‘mining’ more strongly than conversely [44]), in sociometric ratings (cf. a person i could express a
higher like or dislike rating to person j than vice versa), in chemoinformatics (cf. compound A may fit
into compound B while the reverse is not necessarily true) and to a lesser extent in marketing research
(cf. brand-switching counts [10], ‘first choice’-‘second choice’ connections [45] and the asymmetric
price effects between competing brands [41]). A good overview of models for asymmetric proximities is
given by Zielman and Heiser [50]. Although there are a lot of research settings involving asymmetric
proximities, only a few clustering algorithms can handle asymmetric data. Most of these are based on a
nearest-neighbor table (NNT). Krishna et al. [27] provide a clustering algorithm for asymmetric data
(i.e., CAARD algorithm which closely resembles the Leader Clustering Algorithm (LCA) [16]) with
applications to text mining. Ozawa [36] defines a hierarchical asymmetric clustering algorithm called
Classic, and applies it on the detection of gestalt clusters. His algorithm is based on an iteratively
defined nested sequence of NNRs (i.e., Nearest-Neighbor Relations). MacCuish et al. [31] converted the Taylor-Butina exclusion-region grouping algorithms [6, 42] into a full clustering algorithm, which can be used for disjoint or non-disjoint (overlapping), and either symmetric or asymmetric, clustering.
Although this algorithm is designed for clustering compounds (i.e., the chemo-informatics field with
applications like compound acquisition and lead optimization in high-throughput screening), in this
paper it is employed to cluster customers on marketing-related information. More specifically, we apply
the asymmetric, disjoint version of the algorithm to the asymmetric SAM distances obtained earlier.
The asymmetric, disjoint Taylor-Butina algorithm is a five-step procedure [30] (a code sketch follows Fig. 1 below):
1. Create the threshold nearest-neighbor table using similarities in both directions.
2. Find true singletons, i.e., data points (in our case customers) with an empty nearest-neighbor
list. Those elements do not fall into any cluster.
3. Find the data point with the largest nearest-neighbor list. This point tends to be in the center of
the k-th (cf. k clusters) most densely occupied region of the data space. The data point, together with all its neighbors within its exclusion region, constitutes a cluster. The data point itself becomes the representative data point for the cluster. Remove all elements in the cluster from all nearest-neighbor lists. This process can be seen as putting an ‘exclusion sphere’ around the
newly formed cluster [6].
4. Repeat step 3 until no data points exist with a non-empty nearest-neighbor list.
5. Assign each remaining data point to the group that contains its most similar nearest neighbor, but identify it as a ‘false singleton’. These elements have neighbors at the given similarity-threshold criterion (e.g. all elements with a dissimilarity measure smaller than 0.3 are deemed similar), but a ‘stronger’ cluster representative, i.e., one with more neighbors in its list, excludes those neighbors (cf. cluster criterion).
Fig. 1. Asymmetric Taylor-Butina schematic (MacCuish et al., 2003). The schematic shows exclusion regions whose diameter is set by the threshold value (here 0.15), with dissimilarity measured in both directions, and marks the representative compounds, true singletons and false singletons.
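For illustration, the five steps can be sketched in Python (a simplified reading of the procedure above, not the Mesa Suite implementation; dist is assumed to be a precomputed, possibly asymmetric distance matrix):

    import numpy as np

    def taylor_butina(dist, threshold):
        """Disjoint Taylor-Butina clustering; dist[i, j] is the (possibly
        asymmetric) distance from item i to item j."""
        n = dist.shape[0]
        # Step 1: threshold nearest-neighbor table, using both directions.
        nn = [set(j for j in range(n) if j != i
                  and dist[i, j] < threshold and dist[j, i] < threshold)
              for i in range(n)]
        labels = -np.ones(n, dtype=int)        # -1 = not in any cluster
        # Step 2: true singletons have an empty nearest-neighbor list.
        true_singletons = [i for i in range(n) if not nn[i]]
        unassigned = set(range(n)) - set(true_singletons)
        representatives, k = [], 0
        # Steps 3-4: repeatedly peel off the densest 'exclusion sphere'.
        while any(nn[i] for i in unassigned):
            rep = max(unassigned, key=lambda i: len(nn[i]))
            members = (nn[rep] & unassigned) | {rep}
            labels[list(members)] = k
            representatives.append(rep)
            unassigned -= members
            for i in range(n):                 # remove cluster members from
                nn[i] -= members               # all nearest-neighbor lists
            k += 1
        # Step 5: false singletons join the cluster of their most similar
        # already-clustered neighbor.
        clustered = np.where(labels >= 0)[0]
        for i in unassigned:
            labels[i] = labels[clustered[np.argmin(dist[i, clustered])]]
        return labels, representatives, true_singletons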
2.3 Incorporating Cluster Membership Information in the Classification Model
After having applied SAM and cluster analysis using the asymmetric Taylor-Butina algorithm, we build
a classification model to predict a binary target variable, in our application ‘churn’. As a classification
method, we use binary logistic regression. We build two churn models using the logistic-regression
method. One model includes the sequential dimension as cluster dummies resulting from clustering the
SAM distances (from now on referred to as LogSeq). The second model incorporates the sequential
dimension in a traditional way by as many non-time varying regressors as there are time points on which the dimension is measured (from now on referred to as LogNonseq). Both models are estimated
on a training sample and subsequently validated on a hold-out sample, containing customers not
belonging to the training sample. We compare the predictive performance of the LogSeq model with
that of the LogNonseq model.
In order to test the performance of the LogSeq model on the hold-out sample, we need to define a
procedure to assign the hold-out customers to the clusters identified on the training sample. We define
five sequences per cluster in the training sample as representatives. By default the grouping module of
the Mesa Suite software package, which implements the Taylor-Butina clustering algorithm [30],
returns only one representative for each identified cluster. We prefer to have more than one
representative customer for each cluster in order to improve the quality of allocation of customers in the
hold-out sample to the clusters identified on the training sample. Therefore, once we have found a good k-cluster solution on the training sample, we apply the Taylor-Butina algorithm to the cluster-specific SAM distances in order to obtain a five-cluster solution delivering five representatives for the given cluster. This way, each cluster has five representatives. Next, we calculate the SAM distances of the hold-out sequences towards these groups of five cluster representatives, and vice versa. Each hold-out sequence is assigned to the cluster to which it has the smallest average distance (i.e., the smallest average distance towards the five cluster representatives). This cluster-membership information is transformed into cluster dummy variables.
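A minimal sketch of this allocation rule (hypothetical names: sam_distance is a pairwise SAM distance function and representatives maps each cluster label to its five representative sequences):

    def assign_to_cluster(sequence, representatives, sam_distance):
        """Assign a hold-out sequence to the cluster whose five
        representatives are, on average, closest to it."""
        def avg_dist(reps):
            # average over both directions, then over the representatives
            return sum((sam_distance(sequence, r) + sam_distance(r, sequence)) / 2.0
                       for r in reps) / len(reps)
        return min(representatives, key=lambda c: avg_dist(representatives[c]))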
The predictive performance of the classification models (in this case: logistic regression) is assessed by the Area Under the receiver operating characteristic Curve (AUC). Unlike the Percentage Correctly Classified (i.e., PCC), this performance measure is independent of the chosen cut-off. The Receiver Operating Characteristic curve plots the hit percentage (events predicted to be events) on the vertical axis versus the percentage of false alarms (non-events predicted to be events) on the horizontal axis for all possible cut-off values [13]. The predictive accuracy of the logistic-regression models is expressed by the area under the ROC curve (AUC). The AUC statistic ranges from a lower limit of 0.5 for chance (null-model) performance to an upper limit of 1.0 for perfect performance [13]. We compare the predictive performance of the LogSeq model with the predictive accuracy of the LogNonseq model. We hypothesize that the LogSeq model will outperform the LogNonseq model.
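As an illustration of this comparison (a sketch using scikit-learn, which the paper does not mention; the data here are random placeholders standing in for the real covariates and churn indicators):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    # Placeholder data: rows are customers, columns are the covariates
    # (non-time varying variables plus either the raw relbalance variables
    # for LogNonseq or the cluster dummies for LogSeq).
    X_train, y_train = rng.normal(size=(500, 9)), rng.integers(0, 2, 500)
    X_holdout, y_holdout = rng.normal(size=(500, 9)), rng.integers(0, 2, 500)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    churn_prob = model.predict_proba(X_holdout)[:, 1]   # P(churn) per customer
    print(roc_auc_score(y_holdout, churn_prob))         # cut-off-independent AUC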
3. A Financial-Services Application
We illustrate our new procedure, which combines sequence analysis with a traditional classification
method, on a churn-prediction case to support the customer-retention decision system of a major
Financial-Services Provider (i.e., IFSP). Over the past two decades, the financial markets have become
more competitive due to the mature nature of the sector on the one hand and deregulation on the other,
resulting in diminishing profit margins and blurring distinctions between banks, insurers and brokerage
firms (i.e., universal banking). Hence, nowadays a small number of large institutions offering a wider
set of services dominate the financial-services industry. These developments stimulated bank assurance
companies to implement Customer Relationship Management (CRM). Under this intensive competitive
pressure, companies realize the importance of retaining their current customers. The substantive
relevance of attrition modeling comes from the fact that an increase in retention rate of just one
percentage point may result in substantial profit increases [46]. Successful customer retention allows
organizations to focus more on the needs of their existing customers, thereby increasing the managerial
insights into these customers’ needs and hence decreasing the servicing costs. Moreover, long-term
customers buy more [12] and if satisfied, might provide new referrals through positive word-of-mouth
for the company. These customers tend to be less sensitive to competitive marketing actions. Finally,
losing customers leads to opportunity costs due to lost sales, and attracting new customers is five to six times more expensive than customer retention [4, 8]. For an overview of the literature on attrition analysis we refer to Van den Poel and Larivière [46]. Combining several techniques (as in this paper) to achieve improved attrition models has already been shown to be highly effective [28].
3.1 Customer Selection
In this paper, we define a ‘churned’ customer as someone who closed all his accounts at the IFSP. We predict whether customers who were still customers on December 31st, 2002 will churn on all their accounts in the next year (i.e., 2003) or not. Several selection criteria are used to decide which customers to include in our analysis. Firstly, we only selected customers who became customers from January 1st, 1992 onwards, because the information in the data warehouse before this date is less detailed. Secondly, we only select customers having at least three distinct purchase moments before January 2003. This constraint is imposed because we wish to focus the attrition analysis on the more valuable customers. Given the fact that most customers at the IFSP possess only one financial service, the selected customers clearly belong to the more valuable clients of the IFSP. Thirdly, we only keep customers who were still customers on December 31st, 2002 (cf. prediction of the churn event in 2003). This eventually leaves 16,254 customers, among which 399 (2.45%) closed all their accounts in 2003. We randomly created a training and a hold-out sample of 8,127 customers each, containing 200 (2.46%) and 199 (2.45%) churners respectively. There is no overlap between the training and hold-out samples.
3.2 Construction of the Sequential Dimension
As discussed earlier, we want to include a sequential covariate in a traditional classification model. One
such sequential dimension likely to influence the churn probability is the customers’ evolution in
account-balance total at the IFSP. We define the latter variable as a sum of the customers’ total assets
(i.e., total outstanding balance on short- and long-term credit accounts + total debit on current account)
and total liabilities (i.e., total amount on savings and investment products + credit on current account +
sum of monthly insurance fees). Although this account-balance total is a continuous dimension, it is
registered in the data warehouse at discrete moments in time; at the end of the month for bank accounts
and on a yearly basis for insurance products. We have reliable data for account-balance total from
January 1st, 2002 onwards. We build sequences of the relative difference in account-balance total (i.e., relbalance) rather than sequences of the absolute account-balance total, with the aim of facilitating the capture of overall trends in account-balance total. Each sequence contains four elements (see Table 1): relbalanceJanMar, relbalanceMarJul, relbalanceJulOct, relbalanceOctDec.
Table 1
Four elements of the relative account-balance total dimension

Dimension relbalance   Definition
relbalanceJanMar       (account-balance total March 2002 - account-balance total January 2002) / account-balance total January 2002
relbalanceMarJul       (account-balance total July 2002 - account-balance total March 2002) / account-balance total March 2002
relbalanceJulOct       (account-balance total October 2002 - account-balance total July 2002) / account-balance total July 2002
relbalanceOctDec       (account-balance total December 2002 - account-balance total October 2002) / account-balance total October 2002
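A sketch of this construction (hypothetical names; balances is assumed to hold the five snapshots in chronological order: January, March, July, October and December 2002):

    def relbalance_sequence(balances):
        """Relative change in account-balance total per period."""
        return [(b1 - b0) / b0 for b0, b1 in zip(balances, balances[1:])]

    # e.g. relbalance_sequence([1000.0, 1100.0, 990.0, 1200.0, 900.0])
    # -> [0.10, -0.10, 0.2121..., -0.25]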
Besides being observed at discrete moments in time, the ratio-scaled relative account-balance total sequence was converted into a categorical dimension. The latter is crucial to ensure that the SAM analysis can find similarities between the customers’ sequences. Based on an investigation of the distribution of the relative account-balance total, nine categories are distinguished, each representing approximately an equal number of customers (cf. to enhance the discovery of similarities between customers):
Table 2
Values for the categorical relative account-balance total dimension and element-based costs

Element   Values of relbalance          Deletion/insertion cost of element
0         relbalance = 0                0.2
1         -0.5 < relbalance < 0         0.4
2         -2.5 < relbalance <= -0.5     0.6
3         -10 < relbalance <= -2.5      0.8
4         relbalance <= -10             1
5         0 < relbalance < 0.05         0.4
6         0.05 <= relbalance < 0.5      0.6
7         0.5 <= relbalance < 2.5       0.8
8         relbalance >= 2.5             1
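A direct transcription of Table 2 into code (a sketch; the category boundaries and costs are exactly those of the table):

    def relbalance_element(x):
        """Map a relative balance change x to its Table 2 category (0-8)."""
        if x == 0:            return 0
        if -0.5 < x < 0:      return 1
        if -2.5 < x <= -0.5:  return 2
        if -10 < x <= -2.5:   return 3
        if x <= -10:          return 4
        if 0 < x < 0.05:      return 5
        if 0.05 <= x < 0.5:   return 6
        if 0.5 <= x < 2.5:    return 7
        return 8              # x >= 2.5

    # element-based deletion/insertion cost per category (Table 2)
    ELEMENT_COST = {0: 0.2, 1: 0.4, 2: 0.6, 3: 0.8, 4: 1.0,
                    5: 0.4, 6: 0.6, 7: 0.8, 8: 1.0}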
For the LogSeq model, customers are clustered, using the SAM and Taylor-Butina algorithms, on their
evolution in relative account-balance total as expressed by a sequence of four categorical relative
account-balance total variables. In the LogNonseq model, the sequential dimension is included by the
four relative account-balance total variables (i.e., relbalanceJanMar, relbalanceMarJul, relbalanceJulOct
and relbalanceOctDec) measured at the ratio scale level.
3.3 Non-time Varying Independent Variables
Besides the sequential dimension, several non-time varying covariates are created (see Table 3). Two blocks of independent variables can be distinguished. A first block captures behavioral/transactional-related information. Some of these variables are related to the number of accounts open(ed)/closed, while others consider the account-balance total of the customer over a certain period of time. We also include some exogenous variables expressing when and in what service category the next expiration will occur. Finally, we tried to incorporate some regressors expressing how active the customer is: cf. his recency or the number of months since being titular of at least one account. The second block of variables involves non-transactional data, i.e., socio-demographic information like gender, age and cohort.
Table 3
Non-time varying independent variables to predict churn behavior

st_days_until_next_exp: Number of days until the first next expiration date (from January 1st, 2003 on) for a service still in possession on December 31st, 2002. Standardized.
st_days_since_last_exp: Number of days between December 31st, 2002 and the last expiration date of a service before January 1st, 2003. Standardized.
st_days_since_last_intentend: Number of days between December 31st, 2002 and the date (before January 1st, 2003) on which the customer intentionally closed a service. Remark: mostly the expiration date is equal to the closing date.
dummy_cat1_next_exp ... dummy_cat14_next_exp: Dummies indicating in what service category the next coming expiration date in 2003 falls.
nbr_purchevent_bef2003: Number of distinct purchase moments the customer has before January 1st, 2003. The minimum value is 3 due to the customer selection.
st_nbr_serv_opened_bef2003: Number of accounts a customer ever opened before January 1st, 2003. Standardized.
nbr_serv_still_open_bef2003: Number of services the customer still possesses on December 31st, 2002.
nbr_serv_open_cat1_bef2003 ... nbr_serv_open_cat14_bef2003: Number of accounts still open in service category 1 (respectively 2, ..., 14) on December 31st, 2002.
dummy_lastcat1_opened_bef2003 ... dummy_lastcat14_opened_bef2003: Dummies indicating in what service category the customer last opened an account before January 1st, 2003.
nbr_serv_closed_bef2003: Number of accounts the customer has closed (intentionally closed or due to expiration) before January 1st, 2003.
nbr_serv_intentend_bef2003: Number of services the customer intentionally closed before the expiration date, before January 1st, 2003.
nbr_serv_closed_cat1_bef2003 ... nbr_serv_closed_cat14_bef2003: Number of accounts expired or intentionally closed in service category 1 (respectively 2, ..., 14) before January 1st, 2003.
dummy_lastcat1_closed_bef2003 ... dummy_lastcat14_closed_bef2003: Dummies indicating in what service category the customer last closed (intentionally or due to expiration) a service before January 1st, 2003.
dummy_last_cat1_intentend_bef2003 ... dummy_last_cat14_intentend_bef2003: Dummies indicating in what service category the customer last intentionally closed an account before January 1st, 2003.
ratio_closed_open_bef2003: (Number of accounts closed before 2003 / number of accounts opened before 2003) * 100.
ratio_stillo_open_bef2003: (Number of services still open on December 31st, 2002 / number of services ever opened before January 1st, 2003) * 100.
st_recency: Time between the last purchase moment and December 31st, 2002. Standardized.
st_avg_balance_total3: The average account-balance total of the customer over the last three months of 2002: (account-balance total October 2002 + account-balance total November 2002 + account-balance total December 2002) / 3. Standardized.
st_avg_balance_total6: The average account-balance total of the customer over the last six months of 2002. Standardized.
st_ratio_tot_3_6: Ratio of the total account-balance total of the customer over the last three months of 2002 to the total account-balance total of the customer over the last six months before 2003. Standardized.
st_avg_diff_balance_total2: ((account-balance total December 2002 - account-balance total November 2002) + (account-balance total November 2002 - account-balance total October 2002)) / 2. Standardized.
st_avg_diff_balance_total3: ((account-balance total December 2002 - account-balance total November 2002) + (account-balance total November 2002 - account-balance total October 2002) + (account-balance total October 2002 - account-balance total September 2002)) / 3. Standardized.
st_ratio_curr_avgtotal3: Account-balance total December 2002 / average account-balance total calculated over October, November and December 2002.
st_ratio_curr_avgtotal6: Account-balance total December 2002 / average account-balance total calculated over the last six months of 2002.
st_avg_balance_min4to2: (account-balance total November 2002 + account-balance total October 2002 + account-balance total September 2002) / 3. Standardized.
st_ratio_curr_avgtotalmin4to2: Turnover in December 2002 / avg_account-balance total_min4to2. Standardized.
last_avg_reinvest_time: Latest average reinvest time before 2003. The average reinvest time indicates how long the customer waits before investing money that is suddenly available again.
st_last_use_homebanking: Number of days since the last use of home banking. Deduced from the last logon date or, if missing, from the last transaction date or, if missing, from the first logon date or, if missing, from the home-banking start date. Standardized.
months_last_titu_nozero: Number of months since the customer was titular of at least one account with a non-zero balance.
dummy_contentieux: Dummy indicating whether the customer is in contentieux (bad debt) on at least one account on December 31st, 2002.
lor: Length of relationship, expressed in years.
age_31_Dec_2002: Age of the customer on December 31st, 2002.
age_becoming_customer: Age of the customer when becoming a customer at the IFSP.
gender: Gender of the customer.
dummy_cohort_G1 ... dummy_cohort_G5: Dummies indicating whether the customer belongs to cohort group 1, 2, 3, 4, 5 or 6. Cohort 1: 1900 <= birth date <= 1924 (i.e., early baby boomers). Cohort 2: 1925 <= birth date <= 1945 (i.e., GI generation). Cohort 3: 1946 <= birth date <= 1955 (i.e., Silent Generation). Cohort 4: 1956 <= birth date <= 1964 (i.e., late baby boomers). Cohort 5: 1965 <= birth date <= 1980 (i.e., X generation). Cohort 6: birth date > 1980 (i.e., Y generation).
4. Results
4.1 An Element- and Position-sensitive SAM Analysis
4.1.1 Operation Costs
We calculated the distance between each pair of customers, in both directions, on the relative categorical account-balance total dimension using the weighted-Levenshtein distance. All sequences have length four, i.e., there are four elements in each sequence (see Section 3.2). Typically, the operation weights for deletion and insertion are set to 1. In this paper, we do not follow this approach. In order to ensure that different trajectories result in different SAM distances, the operation weights for deletion and insertion are not set equal. We define wdel = 1 and wins just below 1: wins = 0.9. Similarly, to favor different SAM distances, the weight of reordering is not set, as in many research studies, to the sum of the costs of one deletion and one insertion. We arbitrarily set the reordering weight to 2.3. We intentionally define the reordering weight to be larger than the sum of the deletion and insertion weights to simplify and speed up the search for the optimal alignment (cf. calculating an optimal alignment is equivalent to finding the longest common subsequence of the two compared sequences when the substitution weight is at least equal to the sum of the deletion and insertion weights).
4.1.2 Element-Sensitive Costs
We adapted this rather conventional SAM-distance calculation to an element-sensitive SAM analysis to
better reflect the research context. Besides the operation weights, we charge an element-based cost depending on the element in the source being deleted or inserted. As can be seen from the values assigned to the categorical relative account-balance total variable (cf. Table 2), some values are more closely related to each other than others. For instance, values 4 and 8 are more divergent than values 0 and 1. Therefore, a different extra cost is added depending on the element being inserted in or deleted from the source. The
latter increases the variance in the final SAM distances. We set the element-based costs for deletion and
insertion equal. The element-cost setting does not distinguish between the categories of relbalance reflecting a positive evolution in relbalance and categories expressing a negative evolution (e.g. equal element costs for element values 1 and 5). See Table 2 (cf. supra).
4.1.3 Absolute position-sensitive Costs
Besides incorporating an element-based cost in the calculation of the conventional SAM distances, we
convert the SAM-distance measure into a position-sensitive measure. Normally, SAM describes
differences between sequences only in terms of the difference in sequential order of the elements, by
changing the order of the common elements of the source sequence if it differs from that of the common
elements in the target (i.e., reordering), and in terms of the difference in element composition by
deleting the unique elements from the source and inserting the unique elements of the target sequence in
the source. Hence, conventional SAM is not sensitive to the positions at which elements are inserted,
deleted or reordered. After all, in bio-informatics studies, this position-sensitivity is not useful, as the
elements in DNA strings are relatively independent from each other. Consecutive DNA elements are not
likely to affect one another. However, one can think of many other sequences where the elements are
influencing each other. Whenever the elements in the sequence are measured at consecutive time points,
we can assume that previous values influence subsequent values in the sequence. For instance, in
activity-sequence analysis, sequential relationships between activities are a primary concern. Likewise, we
assume in our application that the elements in the sequence consisting of the evolution in relative
account-balance total of the customer over 2002, are correlated. Although there are many applications
where the elements in the sequence influence each other, there is, to our knowledge, only one research study that incorporates the positional component into the original SAM distance concept. Joh et al. [23] developed a position-sensitive sequence-alignment analysis for activity analysis. Position-sensitivity is taken into account by considering the distance by which the sequential order of the source element is changed. The reordering distance h is measured as h = |i - j|, where i and j are the positions of
the reordered elements in the source and target sequences. The position-sensitive SAM distance is
defined as follows:

    d(a, b) = \min\left( w_d D + w_i I + \eta \sum_{r=1}^{R} h_r \right)    (1)

where
    w_d    weight for deletion
    w_i    weight for insertion
    \eta   weight for reordering
    D      number of deletions
    I      number of insertions
    R      number of reorderings
    h_r    distance of reordering the r-th common element
The authors show that for larger values of the reordering weight, there is a significant difference
between the clustering solution found using the traditional SAM measure and the one resulting from the
position-sensitive SAM analysis. Whereas Joh et al. [23] developed a relative position-sensitive SAM
analysis, i.e., a SAM analysis that is sensitive to the difference in positions of common elements that
need to be reordered, we wish to develop an absolute position-sensitive SAM analysis which does not
only consider the positions of elements reordered, but also the positions at which elements are deleted
from the source as well as the positions in the source at which elements are inserted. We prefer an
absolute position-sensitive measure to a relative measure because we wish to distinguish between
operations applied in the beginning of the sequence and operations performed at the end of the
sequence. The rationale for this comes from the fact that we assume recent evolutions in relative
account-balance total to influence the customers’ churn probability more intensively than the customers’
relative account-balance total in the beginning of the sequence, e.g. the relative account-balance total of
the customer in the period October-December 2002 probably influences the churn probability more than
the relative account-balance total of the customer between January and March 2002. Consider Examples 1 and 2 below. Let sequence a be the source and sequence b the target. In both examples, we need to change the order of elements 6 and 1: whereas in the target element 6 precedes element 1, in the source this order is reversed. Using a relative position-sensitive distance measure like that of Joh et al. [23], the SAM reordering cost for Examples 1 and 2 would be the same. Supposing we reorder element 6 and the reordering weight is 1, we obtain a reordering cost for Example 1 of |2-1| = 1 and for Example 2 of |4-3| = 1. From these examples it follows that a relative position-sensitive SAM measure does not distinguish between sequences where the reordering is applied over the same number of positions, but at distinct positions in the source. Whereas a relative position-sensitive SAM measure keeps the distances symmetric (cf. reordering cost defined using the difference in position of the element to reorder in source and target), an
absolute position-sensitive SAM method results in asymmetric distances (cf. reordering cost defined
using only position of reordered element in the source).
                Example 1            Example 2
    position    1  2  3  4           1  2  3  4
    b (target)  6  1  7  7           6  1  7  7
    a (source)  1  6  7  7           7  7  1  6
The absolute position-sensitive reordering cost multiplies the position of the reordered element in the
source with the reordering weight. As mentioned earlier, we also convert the deletion and insertion costs
into position-sensitive costs. The absolute position-sensitive deletion cost considers the position in the
source where the element is deleted. Likewise, the insertion cost is made position-sensitive by
incorporating the position of the element in the target to be inserted in the source. We use the position of
the element in the target as a proxy for the position in the source where the target element is inserted.
Next, we describe how the final element- and absolute position-sensitive SAM distances are calculated.
4.1.4 Hay’s pairwise-Sequence Alignment Algorithm
A major concern in sequence comparison and SAM analysis is the algorithm used to calculate the
distances within a reasonable time window. To address this computational complexity problem [40] we
apply an algorithm by Hay [17] that structures the equalizing process in a fast and easy way. It has not yet been proven to always lead to an optimal alignment, i.e., the trajectory (the sequence of operations necessary to equalize the source with the target) resulting in the smallest possible distance. Yet, in practice the algorithm usually does.
- Step 1: Identify the longest common substrings respecting the sequential order of the elements. It is well known that if the substitution/reordering weight is at least the sum of the deletion and insertion weights, calculating an optimal alignment is equivalent to finding the longest common subsequences of the two compared sequences. These longest common substrings represent the structural integrity of the two sequences, or the structural skeleton [33]. In case there is more than one possible longest common substring, we opt for the substring for which the absolute sum of the differences in positions between all common elements in source and target is smallest, as this prefers matches between source and target at less remote positions over matches at more distant source-target positions. In the example we compare a customer having a rather small evolution in relative account-balance total (e.g. 0 1 5 0) with another customer starting with rather small evolutions in relative account-balance total but ending with a decreasing account-balance total (e.g. 5 1 2 3).
    Example of sequence pair      Longest common substring
    a (source): 0 1 5 0           5 or 1. We opt for identification of the
    b (target): 5 1 2 3           common element 1.
- Step 2: Identify elements that are not included in the substring but appear in both the source and the target. Count one reordering for each such identified element.
    Example of sequence pair      Common elements not appearing in the longest common substring
    a (source): 0 1 5 0           5
    b (target): 5 1 2 3
At the end of this step, the order of the substituted elements has been changed. In the above example, the order of element 5 is changed so that it precedes 1 rather than succeeds it. The total reordering cost is the sum, over all reordered elements, of the product of the reordering weight and the position of the reordered element in the source.
    Cost_reordering = \sum_{r=1}^{R} \eta \cdot pos_{r\_reorel}    (2)

where
    R               number of reorderings
    \eta            reordering weight
    pos_{r_reorel}  absolute position of the r-th reordered element in the source
- Step 3: Identify elements not included in the substring that appear in only one of the compared sequences. Count one deletion operation for each unique element in the source, and one insertion operation for each unique element in the target.
    Example of sequence pair      Elements unique to the source or target
    a (source): 0 1 5 0           The two zero elements are deleted from the source;
    b (target): 5 1 2 3           elements 2 and 3 are inserted in the source.
The costs for deletions and insertions are, besides position-sensitive, also element-sensitive:

    Cost_deletion = \sum_{d=1}^{D} w_d (c_{d\_e} + pos_{d\_del})    (3)

    Cost_insertion = \sum_{i=1}^{I} w_i (c_{i\_e} + pos_{i\_ins})    (4)

where
    w_d          weight for deletion
    c_{d_e}      cost for deleting the d-th element with a certain value (i.e., element cost)
    pos_{d_del}  cost for deleting the d-th element at a given position in the source (i.e., position cost)
    w_i          weight for insertion
    c_{i_e}      cost for inserting the i-th element with a certain value (i.e., element cost)
    pos_{i_ins}  cost for inserting the i-th element from a given position in the target (i.e., position cost)
Applying this algorithm, using the operation weights, the element-based costs and the position costs, the
total SAM distance is calculated as follows:
    SAMdist = \min\left( \sum_{r=1}^{R} \eta \cdot pos_{r\_reorel}
                       + \sum_{d=1}^{D} w_d (c_{d\_e} + pos_{d\_del})
                       + \sum_{i=1}^{I} w_i (c_{i\_e} + pos_{i\_ins}) \right)    (5)

The total SAM distance for our example is:

    SAMdist = (2.3 * 3) + [(1 * (0.2 + 1)) + (1 * (0.2 + 4))]
            + [(0.9 * (0.6 + 3)) + (0.9 * (0.8 + 4))] = 19.86    (6)
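This calculation can be reproduced with a short sketch (the operations for the example a = 0 1 5 0 into b = 5 1 2 3 are hard-coded here as identified in Steps 1-3 above; the weights and element costs come from Sections 4.1.1-4.1.2, and positions are 1-based as in the text):

    W_DEL, W_INS, ETA = 1.0, 0.9, 2.3     # operation weights (Section 4.1.1)

    def sam_distance(reorderings, deletions, insertions):
        """Element- and absolute position-sensitive SAM distance, Eq. (5).
        reorderings: source positions of reordered elements
        deletions:   (element cost, source position) per deleted element
        insertions:  (element cost, target position) per inserted element"""
        cost_reo = sum(ETA * pos for pos in reorderings)            # Eq. (2)
        cost_del = sum(W_DEL * (c + pos) for c, pos in deletions)   # Eq. (3)
        cost_ins = sum(W_INS * (c + pos) for c, pos in insertions)  # Eq. (4)
        return cost_reo + cost_del + cost_ins

    # Worked example: reorder element 5 (source position 3); delete the two
    # 0-elements (element cost 0.2, source positions 1 and 4); insert
    # elements 2 and 3 (costs 0.6 and 0.8, target positions 3 and 4).
    print(sam_distance([3], [(0.2, 1), (0.2, 4)], [(0.6, 3), (0.8, 4)]))  # 19.86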
In this paper, we calculate the element- and position-sensitive SAM distance between each pair of customers in the training sample, in both directions, on the sequential dimension relative account-balance total. These
distances are inserted into a distance matrix used as input for the asymmetric Taylor-Butina clustering
method.
4.2 Asymmetric Clustering Using Taylor-Butina Algorithm
The asymmetric SAM distances calculated between the training sequences on evolution in relative
account-balance total are used as input for the asymmetric, disjoint Taylor-Butina algorithm with the
aim to distinguish clusters with respect to the sequential dimension and indirectly also with respect to
the dependent variable, i.e., churn. Depending on the threshold (in range [0,1]) used, the clusters
obtained by the algorithm are more or less homogeneous. Lower thresholds result in smaller, but more
homogeneous clusters, whereas higher threshold values result in larger, but less homogeneous clusters.
In our application, however, our primary objective is not finding the optimal clustering solution in terms
of homogeneity but to keep the number of clusters limited as the cluster membership is incorporated by
means of dummies into the final classification model. For each cluster, a certain minimum number of
customers is needed, to enhance the possibility that the cluster dummies would significantly influence
the dependent variable. As the Taylor-Butina algorithm iteratively identifies the sequence with the
highest number of neighbors, it follows that the first cluster defined is the largest one and that
subsequent clusters have fewer and fewer members. Therefore, it might be that small thresholds are not
optimal with respect to our predictive goal. Our ambition is to distinguish a reasonable number of
clusters with 1) a certain minimum number of customers and 2) with the highest possible homogeneity.
We experimented with several levels of the similarity threshold. It seems that we need a rather high
threshold to keep the number of clusters limited. For instance, for a threshold of 0.80, we obtain 130
clusters, of which the biggest contains only 758 customers (i.e., 9.32%) and of which clusters 89 to 130 hold fewer than five customers. We investigated for which threshold the first cluster keeps a high
enough number of customers. Employing a threshold of 0.9999 resulted in a 23-cluster solution of
which the first cluster counts 3,281 customers (i.e., 40.37%). As 23 clusters is still quite a high number, which would result in 22 cluster dummies in the classification model, and as this cluster solution still creates some rather sparsely populated clusters (e.g., from cluster 10 onwards the clusters have fewer than 100 members), we decided to group some clusters together. Therefore, we performed a ‘second-order
clustering’ by using the 23 representative sequences (i.e., centrotypes) as input to a subsequent
clustering exercise, and investigated which cluster representatives are taken together into new clusters.
This resulted in a final five-cluster solution (see Table 4).
Table 4
Final five-cluster solution on training sample

Cluster   Old clusters   Frequency N   Percent N   Frequency churners   Percent churners
1         1              3281          40.37       82                   2.50
2         2              1629          20.04       42                   2.58
3         3              1040          12.79       26                   2.50
4         4-5            1101          13.55       20                   1.82
5         6-23           1076          13.24       30                   2.79
For each of these five clusters, we defined five representatives. We prefer to have more than one
representative per cluster to enhance the quality of cluster allocation of the hold-out sequences. By
default, the grouping module of the Mesa Suite software package version 2.1 returns only one representative per cluster. Hence, for clusters 1 to 3 we already have one representative each, for cluster 4 we have two centrotypes and for cluster 5 we have 18 centrotypes. For clusters 1 to 4 we
need to find additional representatives, whereas for cluster 5 we need to limit the number of
representatives to five. Therefore, for each cluster defined, we cluster on the cluster-specific SAM
distances until we get a five-cluster solution providing exactly five representatives for that cluster.
In a next step, the hold-out sequences are assigned to the five clusters identified on the training sample.
We calculated the distances between the hold-out sequences and the groups of five representative cluster sequences (5*5), in both directions. As a proxy for the asymmetric distance between a hold-out sequence and a representative, we take the sum of the distance from the hold-out sequence to the centrotype and the distance from the centrotype to the hold-out sequence, divided by two. The latter approach is common practice in studies performing calculations on asymmetric proximities. After all, it has been proven [34] that each asymmetric distance matrix decomposes into a symmetric matrix S of averages s_ij = (q_ij + q_ji)/2 and a skew-symmetric matrix A with elements a_ij = (q_ij - q_ji)/2. Using these proxies, each hold-out sequence is assigned to the cluster to which it has the smallest average distance (i.e., the smallest average distance towards the five cluster representatives). Table 5 gives an overview of the cluster distribution in the hold-out sample.
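The decomposition invoked here is straightforward to verify (a generic numpy illustration, not tied to the actual distance matrix):

    import numpy as np

    Q = np.array([[0.0, 1.2],
                  [0.8, 0.0]])       # asymmetric proximities q_ij
    S = (Q + Q.T) / 2                # symmetric part: s_ij = (q_ij + q_ji)/2
    A = (Q - Q.T) / 2                # skew-symmetric: a_ij = (q_ij - q_ji)/2
    assert np.allclose(Q, S + A)     # Q decomposes exactly into S + A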
Table 5
Allocation of hold-out sequences to the five clusters identified on the training sample

Cluster   Frequency N   Percent N   Frequency churners   Percent churners
1         4514          55.60       127                  2.81
2         601           7.40        32                   5.32
3         916           11.28       4                    0.44
4         842           10.36       14                   1.66
5         1254          15.45       22                   1.75
4.3 Defining the Best Subset of non-time varying Independent Variables
Before we compare the predictive performance of the LogSeq model with that of the LogNonseq model, we first define a best subset of non-time varying independent variables to include in the logistic-regression models besides the sequential dimension relbalance. Employing the leaps-and-bounds algorithm [11] on the non-time varying independent variables in Table 3, we compared the best subsets of sizes 1 to 20 on their sums of squares. As expected, the gain in the performance criterion diminishes with each independent variable added. From Figure 2, we decide that a subset containing the best five variables represents a good balance between the number of independent variables included and the variance explained by the model.
Fig. 2. Number of variables in best subsets.
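The leaps-and-bounds algorithm [11] finds such subsets without enumerating all of them; purely as an illustration of the quantity being optimized, an exhaustive sketch (hypothetical names; X is assumed to be a pandas DataFrame of candidate regressors, y the target) could look as follows:

    from itertools import combinations
    from sklearn.linear_model import LinearRegression

    def best_subset(X, y, size):
        """Exhaustively score all regressor subsets of a given size by R^2.
        (Infeasible for large pools; that is what leaps-and-bounds avoids.)"""
        best_r2, best_vars = -1.0, None
        for subset in combinations(X.columns, size):
            cols = list(subset)
            r2 = LinearRegression().fit(X[cols], y).score(X[cols], y)
            if r2 > best_r2:
                best_r2, best_vars = r2, subset
        return best_vars, best_r2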
The independent variables in the best subset of size five are:
Table 6
Best subset selection of size 5

st_days_until_next_exp: Number of days until the first next expiration date (from January 1st, 2003 on) for a service still in possession on December 31st, 2002. Standardized.
st_ratio_curr_avgtotal3: Account-balance total December 2002 / average account-balance total calculated over October, November and December 2002.
st_ratio_stillo_open_bef2003: (Number of services still open before January 1st, 2003 / number of services ever opened before January 1st, 2003) * 100. Standardized.
st_months_last_titu_nozero: Number of months since the customer was titular of at least one account with a non-zero balance.
st_days_since_last_exp: Number of days between December 31st, 2002 and the last expiration date of a service before January 1st, 2003. Standardized.
4.4 Comparing Churn Predictive Performance of LogSeq and LogNonseq Models
We compare the predictive performance, measured by AUC on the hold-out sample, of the LogNonseq model with that of the LogSeq model. Both logistic-regression models include the best five non-time
varying variables from Table 6 as well as the sequential dimension relbalance. However, in the
LogNonseq model, the sequential dimension is incorporated by means of four non-time varying
independent variables (i.e., st_relbalanceJanMar, st_relbalanceMarJul, st_relbalanceJulOct and
st_relbalanceOctDec), whereas in the LogSeq model the sequential dimension is operationalized by four
cluster dummies. We hypothesize that the churn predictive performance of the LogSeq model will be
significantly higher than that of the LogNonseq model because operationalizing a sequential dimension
by non-time varying independent variables neglects the sequential information of the dimension.
Table 7
Churn predictive performance of the LogSeq and LogNonseq models on the hold-out sample

Performance measure   LogNonseq model   LogSeq model
AUC                   0.906             0.964
Table 7 shows that our hypothesis is confirmed. There is a significant difference (χ² = 17.69, p = 0.0000259) in predictive performance between the binary logistic-regression model including the best subset of five independent variables and the sequential dimension operationalized by non-time varying variables (LogNonseq), and the logistic-regression model with the same set of independent variables but with the sequential dimension expressed by cluster dummies derived from the sequence-alignment analysis (LogSeq). Table 8 shows the parameter estimates and significance levels of the regressors for both models. Where possible, the standardized estimates are given.
Table 8
Parameter estimates for the LogNonseq and LogSeq models

                                             Estimate              Pr > Chi-Square
Variable                                 LogNonseq   LogSeq     LogNonseq   LogSeq
Best Subset
  Intercept                              -16.80      -8.23      <.0001      <.0001
  st_days_until_next_exp                 -1.25       -1.26      <.0001      <.0001
  st_ratio_curr_avgtotal3                0.28        -0.09      0.0187      0.0109
  st_ratio_stillo_open_bef2003           -1.25       -1.28      <.0001      <.0001
  st_months_last_titu_nozero             0.12        0.02       0.0048      0.4392
  st_days_since_last_exp                 0.50        0.49       <.0001      <.0001
Sequential Dimension
  st_relbalanceJanMar / cl1              -21.79      0.29       0.2169      0.0401
  st_relbalanceMarJul / cl2              -8.86       0.25       0.9299      0.1034
  st_relbalanceJulOct / cl3              -38.25      0.23       0.0890      0.1744
  st_relbalanceOctDec / cl4              -240.45     0.35       0.0011      0.0447
Although not all cluster dummies are significant at the α=0.05 level, the LogSeq model significantly
outperforms the LogNonseq model. The insignificance of cluster dummy 3 (p=0.1744) might stem from
the sharp drop in the percentage of churners in this cluster from the training sample (2.58%) to the
hold-out sample (0.44%), i.e., to below 1%. Looking at the estimates for the relative account-balance-
total dimension in the LogNonseq model, we find that only the relative evolution in account-balance
total over the last six months seems to have a significant effect on the churn probability in 2003 (cf.
st_relbalanceJulOct and st_relbalanceOctDec). The larger the positive change in account-balance total,
the less likely the customer is to churn. Considering the non-sequential regressors, all five are
significant for the LogNonseq model, while for the LogSeq model all but one
(st_months_last_titu_nozero) are significant. All effects have the expected sign. The smaller the
number of days until the next expiration date from January 1st, 2003 onwards, the higher the churn
probability (cf. st_days_until_next_exp). The effect of st_ratio_curr_avgtotal3 is rather small, so the
difference in its sign between the LogNonseq and LogSeq models is of little concern.
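To make this reading concrete: since both models are logits, a coefficient on a standardized covariate translates into a multiplicative effect on the churn odds per one-standard-deviation change. A minimal illustration using the LogSeq estimate for st_days_until_next_exp from Table 8:

    # Odds-ratio reading of a standardized logit coefficient (Table 8, LogSeq column).
    import math

    beta = -1.26                 # st_days_until_next_exp, LogSeq estimate
    odds_ratio = math.exp(beta)  # ~0.28: each extra SD of days-to-expiry cuts the churn odds by ~72%
    print(f"odds ratio per 1-SD increase: {odds_ratio:.2f}")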
We may conclude that our new procedure, which combines sequence analysis with a traditional
classification model, is a viable strategy for overcoming the inability of traditional classification
models, designed for non-time varying independent variables, to capture sequential patterns. In this
paper, we have modeled a time-varying independent variable by preceding a traditional binary
logistic-regression model by an element/position-sensitive sequence alignment on the sequential
dimension. This yielded cluster-membership information in terms of the sequential dimension as well
as, implicitly, with respect to the dependent variable. We decided to include this cluster-membership
information by means of dummies in the final classification model. Alternatively, the cluster
information could be used to build cluster-specific classification models. In our application, however,
this approach is unrealistic because one of the five clusters identified on the test sample contains less
than 1% churners (cf. cluster 3). So when the predicted event is rather rare, building cluster-specific
classification models might be impossible because too few people in some of the identified clusters
experience the event. A further drawback of cluster-specific models lies in the practical difficulty of
including several sequential dimensions in the classification model. Whereas it is easy to add another
set of cluster dummies to the classification model for each sequential dimension (see the sketch
below), building cluster-specific classification models on more than one sequential dimension implies
simultaneously clustering on several sequential dimensions using multidimensional SAM analysis.
However, as computational complexity is already a concern for unidimensional SAM, it becomes even
more pressing in multidimensional SAM analysis.
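To illustrate how cheaply additional sequential dimensions enter the dummy approach, the sketch below appends a second, invented block of cluster dummies to the design matrix; the dimension, cluster counts and labels are purely hypothetical.

    # Each sequential dimension contributes its own block of cluster dummies.
    import numpy as np

    relbal_cluster = np.array([1, 0, 3, 2, 4, 1, 0, 2])  # clusters on the relbalance dimension
    usage_cluster = np.array([0, 2, 1, 0, 2, 1, 1, 0])   # clusters on a second, hypothetical dimension
    d1 = np.eye(5)[relbal_cluster][:, 1:]  # cl1..cl4, cluster 0 as reference
    d2 = np.eye(3)[usage_cluster][:, 1:]   # two dummies, cluster 0 as reference
    X_seq = np.hstack([d1, d2])            # appended to the best-subset covariates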
5. Conclusion
In this paper, we provide a new procedure that overcomes the inability of traditional classification
models to incorporate sequential exogenous variables. Instead of transforming the sequential
dimension into non-time varying variables, thereby discarding its sequential information, a better
practice is to model the time-varying independent variable with a sequence-analysis method and to
subsequently incorporate this information into the traditional classification method, which is designed
for non-time varying covariates. This way the best of both methods is combined. One possible strategy
is to cluster the customers on the sequential dimension using SAM (in this paper an element/position-
sensitive SAM) and to incorporate this cluster information into the classification model by means of
dummy variables. This approach is promising, as the results of the attrition models at the IFSP confirm
our hypothesis of improved predictive performance when the sequential dimension is modeled by
sequence-analysis methods instead of being operationalized as non-time varying variables. Besides this
approach of preceding a traditional classification model, such as binary logistic regression, by a
sequence-analysis method, other procedures for combining sequence-analysis methods with traditional
classification methods might exist, and it would be worthwhile to elaborate on them. Another avenue
for further research is to explore how sequence-analysis methods other than sequence alignment could
enhance the modeling of sequential covariates in classification models. In this paper, we included only
one sequential dimension; further studies should incorporate several sequential covariates. Finally, we
wish to bring the parameter-setting issue to the attention of researchers who consider applying or
elaborating on our new procedure. The highly tuned edit distances used for our churn application
might not be valid in other applications: the researcher should adapt the operational weights, element
costs and position costs of the sequence alignment to fit the application at hand (see the sketch below).
Similarly, the threshold used for the asymmetric clustering will need fine-tuning to obtain a good
clustering solution in other applications.
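To make this parameter-setting issue concrete, the sketch below implements a generic cost-parameterized edit distance in the spirit of an element/position-sensitive SAM; the operation weights and the element- and position-cost functions are placeholders to be tuned per application, and the code is our illustration rather than the implementation used in this paper. The Taylor-Butina threshold would be tuned analogously, e.g., by inspecting the resulting cluster-size distribution across candidate thresholds.

    # Cost-parameterized edit distance (dynamic programming) with element- and
    # position-dependent costs; all cost settings below are placeholders.
    def sam_distance(s, t, w_ins=1.0, w_del=1.0, w_sub=1.0,
                     element_cost=lambda a, b: 0.0 if a == b else 1.0,
                     position_cost=lambda i, length: 1.0):
        m, n = len(s), len(t)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]  # d[i][j]: cost of aligning s[:i], t[:j]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + w_del * position_cost(i, m)
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + w_ins * position_cost(j, n)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = d[i - 1][j - 1] + w_sub * element_cost(s[i - 1], t[j - 1]) * position_cost(i, m)
                dele = d[i - 1][j] + w_del * position_cost(i, m)
                ins = d[i][j - 1] + w_ins * position_cost(j, n)
                d[i][j] = min(sub, dele, ins)
        return d[m][n]

    # e.g. make mismatches in recent periods (sequence end) costlier:
    recent_heavy = lambda i, length: 0.5 + i / length
    print(sam_distance("ABAB", "ABBA", position_cost=recent_heavy))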
Acknowledgements
The authors would like to thank the anonymous financial-services company for providing the data.
Next, we extend our thanks to John D. MacCuish and Norah E. MacCuish for providing a free academic
license of the software package Mesa Suite Version 1.2, Grouping Module (www.mesaac.com) and for
their kind assistance. Moreover, we would like to thank Bart Larivière, PhD candidate at Ghent
University, for sharing his knowledge of the data warehouse of the company. Finally, we express our
thanks to 1) Ghent University for funding the PhD project of Anita Prinzie (BOF Grant no. B00141), and
2) the Flemish Research Fund (FWO Vlaanderen) for funding the computing equipment used to
complete this project (Grant no. G0055.01).