FACULTEIT ECONOMIE EN BEDRIJFSKUNDE
HOVENIERSBERG 24, B-9000 GENT
Tel.: +32 (0)9 264.34.61
Fax: +32 (0)9 264.35.92

WORKING PAPER

Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM

Anita Prinzie (1)
Prof. Dr. Dirk Van den Poel (2)

February 2005
2005/292

(1) Corresponding Author: Department of Marketing, Tel: +32 9 264 35 20, fax: +32 9 264 42 79, email: [email protected].
(2) Email: [email protected]

D/2005/7012/10

Abstract. The inability to capture sequential patterns is a typical drawback of predictive classification methods. This caveat might be overcome by modeling sequential independent variables with sequence-analysis methods. Combining classification methods with sequence-analysis methods enables classification models to incorporate non-time varying as well as sequential independent variables. In this paper, we precede a classification model by an element/position-sensitive Sequence-Alignment Method (SAM) followed by the asymmetric, disjoint Taylor-Butina clustering algorithm, with the aim of distinguishing clusters with respect to the sequential dimension. We illustrate this procedure on a customer-attrition model serving as a decision-support system for customer retention at an International Financial-Services Provider (IFSP). The binary customer-churn classification model following the new approach significantly outperforms an attrition model which incorporates the sequential information directly into the classification method.

Keywords: sequence analysis, binary classification methods, Sequence-Alignment Method, asymmetric clustering, customer-relationship management, churn analysis

1 Introduction

In the past, traditional classification models like logistic regression have been applied successfully to the prediction of a dependent variable by a series of non-time varying independent variables [5]. In case there are time-varying independent variables, these are typically included in the model by transforming them into non-time varying variables [3]. Unfortunately, this practice results in information loss, as the sequential patterns in the data are neglected. Hence, although traditional classification models are highly valid and robust for modeling non-time varying data, they are unable to capture sequential patterns in data. This caveat might be overcome by modeling time-varying independent variables with sequence-analysis methods. Unlike traditional classification methods, sequence-analysis methods were designed for modeling sequential information. These methods take sequences of data, i.e., ordered arrays, as their input rather than individual data points. Although rarely used in marketing, sequence analysis is commonly applied in disciplines like archeology [32], biology [38], computer science [39], economics [19], history [2], linguistics [24], psychology [7] and sociology [1].

Sequence-analysis methods can be categorized depending on whether the sequences are treated as a whole or step by step [2]. Step-by-step methods examine relationships among elements or states in the sequences. Time-series methods are used to study the dependence of an interval-measured sequence on its own past. When the variable of interest is categorical, Markov methods are appropriate. The latter methods calculate transition probabilities based on the transition between two events [37].
Transitions out of one prior category can be modeled by using event-history methods, also known as duration methods, hazard methods, failure analysis, and reliability analysis. The central research question studied is the time until transition. Whole-sequence methods use the entire sequence as the unit of analysis to discover similarities between sequences, resulting in typologies. The central issue addressed is whether there are patterns in the sequences, either over the whole sequences or within parts of them. There are two approaches to this pattern question. In the algebraic approach, each sequence is reduced to some simplest form and sequences with similar 'simplest forms' are gathered under one heading. In the metric approach, a similarity measure between the sequences is calculated, which is subsequently processed by clustering, scaling and other categorization methods to extract typical sequential patterns. Methods like optimal matching or optimal alignment are commonly applied within this metric approach. In an intermediate situation, local similarities are analyzed to find out the role of key subsequences embedded in longer sequences [49].

Given that traditional classification models are designed for modeling non-time varying independent variables and that sequence-analysis methods are well-suited to model dynamic information, it follows that a combination of both methods unites the best of both worlds and allows for building predictive classification models incorporating non-time varying as well as time-varying independent variables. One possible approach among others consists of preceding the traditional classification method by a sequence-analysis method to model the dynamic exogenous variables (cf. serial instead of parallel combination of classifiers, [26]). In this paper, we precede a logistic regression, as a traditional classification method, by a Sequence-Alignment Method (SAM), as a whole-sequence method using the metric approach. The SAM analysis is used to model a time-varying independent variable. We identify how similar the customers are on the dynamic independent variable by calculating a similarity measure between each pair of customers, the SAM distances. These distances are further processed by a clustering algorithm to produce groups of customers which are relatively homogeneous with respect to the dynamic independent variable. As we cluster on a dimension influencing the dependent variable, the clusters are not only homogeneous in terms of the time-varying independent variable, but should also be homogeneous with respect to the dependent variable. This way, we make the implicit link between clustering and classification explicit. After all, clustering is in theory a special problem of classification associated with an equivalence relation defined over a set [36]. Including the cluster-membership information as dummies in the classification model not only allows for modeling the dynamic independent variable in an appropriate way; it should even improve the predictive performance.

In this paper, we illustrate the new procedure, which combines a sequence-analysis method with a traditional classification method, by estimating a customer-attrition model for a large International Financial-Services Provider (from now on referred to as IFSP). This attrition model feeds the managerial decision process and helps refine the retention strategy by elucidating the profile of customers with a high defection risk.
A traditional logistic regression is applied to predict whether a customer will churn or not. This logistic regression is preceded by an element- and position-sensitive Sequence-Alignment Method to incorporate a time-varying covariate. We calculate the distance between each pair of customers on a sequential dimension, i.e., the evolution in relative account-balance total of the customer at the IFSP, and use these distances as input for a subsequent cluster analysis. The cluster-membership information is incorporated in the logistic regression by dummies. We hypothesize that the logistic-regression model with the time-varying independent variable included as cluster dummy variables will outperform the traditional logistic regression where the same sequential dimension is incorporated by creating as many non-time varying independent variables as there are time points at which the dimension is measured.

The remainder of this paper is structured as follows. In Section 2 we describe the different methods used. We discuss the basic principles of SAM and underline how the cost allocation influences the mathematical features of the resulting SAM distance measures, determining whether a symmetric or an asymmetric clustering algorithm is appropriate. We outline how a modification of Taylor's cluster-sampling algorithm [42] and Butina's clustering algorithm based on exclusion spheres [6] allows clustering on asymmetric SAM distances. In Section 3 we outline how the new procedure proposed in this paper is applied within a financial-services context to improve the prediction of churn behavior. Section 4 investigates whether the results confirm our hypothesis of improved predictive performance. We conclude with a discussion of the main findings and introduce some avenues for further research.

2 Methodology

2.1 Sequence-Alignment Method (SAM)

The Sequence-Alignment Method (SAM) was developed in computer science (text editing and voice recognition) and molecular biology (protein and nucleic-acid analysis). A common application in computer science is string correction or string editing [47]. The main use of sequence comparison in molecular biology is to detect the homology between macromolecules. If the distance between two macromolecules is small enough, one may conclude that they have a common evolutionary ancestor. Applications of sequence alignment in molecular biology use comparatively simple alphabets (the four nucleotide molecules or the twenty amino acids) but tend to have very long sequences [49]. Conversely, in marketing applications, sequences will mostly be shorter but with a very large alphabet. Besides SAM applications in computer science and molecular biology, there are applications in social science [1], transportation research [21] and speech processing [34]. Recently, SAM has been applied in marketing to discover visiting patterns of websites [18]. Sankoff & Kruskall [40], Waterman [48] and Gribskov & Devereux [14] are good references on the Sequence-Alignment Method.

SAM handles variable-length sequences and incorporates sequential information, i.e., the order in which the elements appear in a sequence, into its distance measure (unlike conventional position-based distance measures, like the Euclidean, Minkowski, city-block and Hamming distances). The original sequence-alignment method can be summarized as follows. Suppose we compare sequence a, called the source, having i elements, a = [a1, ..., ai], with sequence b, i.e., the target, having j elements, b = [b1, ..., bj].
In general, the distance or similarity between sequences a and b is expressed by the number of operations (i.e., the total amount of effort) necessary to convert sequence a into b. The SAM distance is represented by a score: the higher the score, the more effort it takes to equalize the sequences and the less similar they are. The elementary operations are insertions, deletions and substitutions or replacements. Deletion and insertion operations, often referred to as indel, are applied to elements of the source (first) sequence in order to change the source into the target (second) sequence. A substitution operation amounts to a deletion plus an insertion. Some advanced research involves other operations like swaps or transpositions (i.e., the interchange of adjacent elements in the sequence), compression (of two or more elements into one element) and expansion (of one element into two or more elements). Every elementary operation is given a weight (i.e., cost) greater than or equal to zero. It is common practice to make assumptions on the weights in order to achieve the metric axioms (non-negativity, zero property, triangle inequality and symmetry) of a mathematical distance (e.g., equal weights for deletions and insertions to preserve the symmetry axiom) [40]. Weights may be tailored to reflect the importance of operations, the similarity of particular elements (cf. element-sensitive), the position of elements in the sequence (cf. position-sensitive), or the number/type of neighboring elements or gaps [49]. Different weights for insertion and deletion, as well as position-sensitive weights, result in SAM distances which are no longer symmetric: |ab| ≠ |ba|. The latter has implications for the clustering algorithm that can be used (cf. infra). Different meanings can be given to the word 'distance' in sequence comparison. In this paper, we express the relatedness (similarity or distance) between customers on their evolution in relative account-balance total at the IFSP by calculating the weighted-Levenshtein [29] distance between each possible pair of customers (i.e., pairwise-sequence analysis). The weighted-Levenshtein distance defines dissimilarity as the smallest sum of operation-weighting values required to change sequence a into b. This way a distance matrix is constructed and subsequently used as input for a cluster analysis.
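To make the mechanics concrete, the following is a minimal dynamic-programming sketch of a weighted-Levenshtein computation. It is an illustration, not the implementation used in this paper: the weights shown (w_del = 1, w_ins = 0.9, substitution 2.3) are the operation weights later introduced in Section 4.1.1, the toy sequences are hypothetical, and the element- and position-sensitive refinements of Section 4.1 are omitted here.

```python
import numpy as np

def weighted_levenshtein(source, target, w_del=1.0, w_ins=0.9, w_sub=2.3):
    """Smallest total operation weight needed to turn `source` into `target`.

    With w_del != w_ins the distance is asymmetric: d(a, b) can differ from
    d(b, a). With w_sub >= w_del + w_ins a substitution is never cheaper
    than a deletion plus an insertion.
    """
    n, m = len(source), len(target)
    d = np.zeros((n + 1, m + 1))
    d[1:, 0] = np.cumsum([w_del] * n)            # delete all source elements
    d[0, 1:] = np.cumsum([w_ins] * m)            # insert all target elements
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else w_sub
            d[i, j] = min(d[i - 1, j] + w_del,   # deletion
                          d[i, j - 1] + w_ins,   # insertion
                          d[i - 1, j - 1] + sub) # match / substitution
    return d[n, m]

# Pairwise (possibly asymmetric) distance matrix over toy customer sequences.
sequences = [[0, 1, 5, 0], [5, 1, 2, 3], [4, 4, 4, 4]]
dist = np.array([[weighted_levenshtein(a, b) for b in sequences] for a in sequences])
print(np.round(dist, 2))
```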
2.2 Cluster Analysis of Weighted-Levenshtein SAM Distances

We cluster the customers on the weighted-Levenshtein SAM distances, which express how dissimilar they are on the sequential dimension. The cluster-membership information resulting from this cluster analysis is translated into cluster dummies, which represent the sequential dimension in a subsequent classification model. We hypothesize that a classification model including cluster indicators (operationalized as dummy variables) based on SAM distances will outperform a similar model where the same sequential dimension is incorporated by as many non-time varying independent variables as there are time points at which the dimension is measured. After all, these dummies are good indicators of the type of behavior the customer exhibits on the sequential dimension (i.e., the time-varying independent variable), as well as on the dependent variable (cf. an explicit typology of customers on the time-varying covariate results in an implicit typology on the dependent variable). A distance matrix holding the pairwise weighted-Levenshtein distances between customer sequences is used as the distance measure for clustering.

As discussed earlier, depending on how the weights (i.e., costs) for SAM are set, the distances in the matrix are symmetric or asymmetric. Most common clustering methods employ symmetric, hierarchical algorithms such as Ward's, single-, complete-, average- or centroid-linkage [15, 20, 25], non-hierarchical algorithms such as Jarvis-Patrick [21], or partitional algorithms such as k-means or hill-climbing. Such methods require symmetric measures, e.g. Tanimoto, Euclidean, Hamman or Ochiai, as their inputs. One drawback of these methods is that they cannot capture important asymmetric relationships. Nevertheless, there exist many practical scenarios where the underlying relation is asymmetric. Asymmetric relationships are common in transportation research (cf. a different distance between two cities A and B (|AB| ≠ |BA|) due to different routes, e.g. the Vehicle Routing Problem [43]), in text mining (cf. word associations, e.g. most people will relate 'data' to 'mining' more strongly than conversely [44]), in sociometric ratings (cf. a person i could express a higher like or dislike rating towards person j than vice versa), in chemoinformatics (cf. compound A may fit into compound B while the reverse is not necessarily true) and, to a lesser extent, in marketing research (cf. brand-switching counts [10], 'first choice'-'second choice' connections [45] and the asymmetric price effects between competing brands [41]). A good overview of models for asymmetric proximities is given by Zielman and Heiser [50].

Although there are many research settings involving asymmetric proximities, only a few clustering algorithms can handle asymmetric data. Most of these are based on a nearest-neighbor table (NNT). Krishna et al. [27] provide a clustering algorithm for asymmetric data (i.e., the CAARD algorithm, which closely resembles the Leader Clustering Algorithm (LCA) [16]) with applications to text mining. Ozawa [36] defines a hierarchical asymmetric clustering algorithm called Classic and applies it to the detection of gestalt clusters. His algorithm is based on an iteratively defined nested sequence of NNRs (i.e., Nearest-Neighbor Relations). MacCuish et al. [31] converted the Taylor-Butina exclusion-region grouping algorithms [6, 42] into a real clustering algorithm, which can be used for disjoint as well as non-disjoint (overlapping), and either symmetric or asymmetric, clustering. Although this algorithm was designed for clustering compounds (i.e., the chemoinformatics field, with applications like compound acquisition and lead optimization in high-throughput screening), in this paper it is employed to cluster customers on marketing-related information. More specifically, we apply the asymmetric, disjoint version of the algorithm to the asymmetric SAM distances obtained earlier. The asymmetric, disjoint Taylor-Butina algorithm is a five-step procedure [30]:

1. Create the threshold nearest-neighbor table using similarities in both directions.
2. Find true singletons, i.e., data points (in our case customers) with an empty nearest-neighbor list. Those elements do not fall into any cluster.
3. Find the data point with the largest nearest-neighbor list. This point tends to lie at the center of the most densely occupied remaining region of the data space (the k-th such region for the k-th cluster). The data point, together with all its neighbors within its exclusion region, constitutes a cluster. The data point itself becomes the representative data point for the cluster. Remove all elements in the cluster from all nearest-neighbor lists. This process can be seen as putting an 'exclusion sphere' around the newly formed cluster [6].
4. Repeat step 3 until no data points with a non-empty nearest-neighbor list remain.
5. Assign the remaining data points, i.e., false singletons, to the group that contains their most similar nearest neighbor, but identify them as 'false singletons'. These elements have neighbors at the given similarity-threshold criterion (e.g. all elements with a dissimilarity measure smaller than 0.3 are deemed similar), but a 'stronger' cluster representative, i.e., one with more neighbors in its list, has excluded those neighbors (cf. cluster criterion).

Fig. 1. Asymmetric Taylor-Butina schematic (MacCuish et al., 2003): exclusion regions whose diameter is set by the threshold value (here 0.15), with dissimilarity measured in both directions; the figure marks cluster representatives, a true singleton and a false singleton.
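The sketch below condenses these five steps into code. It is our schematic reading of the procedure as described above, not the Mesa Suite implementation: the function name and the convention of testing the threshold on a dissimilarity matrix in both directions are assumptions.

```python
import numpy as np

def taylor_butina_asymmetric(dist, threshold):
    """Disjoint Taylor-Butina sketch on an (n x n) dissimilarity matrix.

    Two points are neighbors only if their dissimilarity falls within the
    threshold in BOTH directions (step 1). Returns cluster labels (-1 marks
    true singletons) and the representative index of each cluster.
    """
    n = dist.shape[0]
    within = (dist <= threshold) & (dist.T <= threshold)      # step 1
    np.fill_diagonal(within, False)
    neighbors = [set(np.flatnonzero(within[i])) for i in range(n)]
    labels = np.full(n, -1)
    representatives = []
    unassigned = set(range(n))
    while True:
        # steps 3-4: point with the largest remaining nearest-neighbor list
        sizes = {i: len(neighbors[i] & unassigned) for i in unassigned}
        if not sizes or max(sizes.values()) == 0:
            break
        center = max(sizes, key=sizes.get)
        members = (neighbors[center] & unassigned) | {center}  # exclusion sphere
        for m in members:
            labels[m] = len(representatives)
        representatives.append(center)
        unassigned -= members
    # step 5: false singletons join their most similar clustered neighbor;
    # points whose neighbor list was empty from the start stay -1 (step 2).
    for i in list(unassigned):
        clustered = [j for j in neighbors[i] if labels[j] >= 0]
        if clustered:
            labels[i] = labels[min(clustered, key=lambda j: dist[i, j])]
    return labels, representatives

# Toy usage on a random asymmetric dissimilarity matrix.
rng = np.random.default_rng(0)
D = rng.uniform(0, 1, (10, 10))
np.fill_diagonal(D, 0.0)
print(taylor_butina_asymmetric(D, threshold=0.3))
```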
2.3 Incorporating Cluster-Membership Information in the Classification Model

After having applied SAM and a cluster analysis using the asymmetric Taylor-Butina algorithm, we build a classification model to predict a binary target variable, in our application 'churn'. As the classification method, we use binary logistic regression. We build two churn models using the logistic-regression method. One model includes the sequential dimension as cluster dummies resulting from clustering the SAM distances (from now on referred to as LogSeq). The second model incorporates the sequential dimension in a traditional way, by as many non-time varying regressors as there are time points at which the dimension is measured (from now on referred to as LogNonseq). Both models are estimated on a training sample and subsequently validated on a hold-out sample containing customers not belonging to the training sample. We compare the predictive performance of the LogSeq model with that of the LogNonseq model.

In order to test the performance of the LogSeq model on the hold-out sample, we need to define a procedure to assign the hold-out customers to the clusters identified on the training sample. We define five sequences per cluster in the training sample as representatives. By default, the grouping module of the Mesa Suite software package, which implements the Taylor-Butina clustering algorithm [30], returns only one representative for each identified cluster. We prefer to have more than one representative customer for each cluster in order to improve the quality of the allocation of customers in the hold-out sample to the clusters identified on the training sample. Therefore, once we have found a good k-cluster solution on the training sample, we apply the Taylor-Butina algorithm to the cluster-specific SAM distances in order to obtain a five-cluster solution delivering five representatives for the given cluster. This way, each cluster has five representatives. Next, we calculate the SAM distances of the hold-out sequences towards these groups of five cluster representatives and vice versa. Each hold-out sequence is assigned to the cluster to which it has the smallest average distance (i.e., the smallest average distance towards the five cluster representatives). This cluster-membership information is transformed into cluster dummy variables.

The predictive performance of the classification models (in this case: logistic regression) is assessed by the Area Under the receiver operating characteristic Curve (AUC). Unlike the Percentage Correctly Classified (PCC), this performance measure is independent of the chosen cut-off. The Receiver Operating Characteristic curve plots the hit percentage (events predicted to be events) on the vertical axis versus the percentage of false alarms (non-events predicted to be events) on the horizontal axis for all possible cut-off values [13]. The predictive accuracy of the logistic-regression models is expressed by the area under the ROC curve (AUC). The AUC statistic ranges from a lower limit of 0.5 for chance (null-model) performance to an upper limit of 1.0 for perfect performance [13]. We compare the predictive performance of the LogSeq model with the predictive accuracy of the LogNonseq model. We hypothesize that the LogSeq model will outperform the LogNonseq model.
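As an illustration of this evaluation step, the snippet below fits a logistic regression and scores it by AUC on a hold-out sample. The data are random stand-ins for designs prepared as described above, and scikit-learn is our choice here, not necessarily the software used by the authors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Toy stand-ins for the real design matrices: X would hold the best-subset
# variables plus either the four ratio-scaled relbalance variables
# (LogNonseq) or the four cluster dummies (LogSeq).
rng = np.random.default_rng(0)
X_train, X_holdout = rng.normal(size=(8127, 9)), rng.normal(size=(8127, 9))
y_train = rng.binomial(1, 0.025, 8127)    # ~2.5% churners, as in the data
y_holdout = rng.binomial(1, 0.025, 8127)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
print(f"hold-out AUC: {auc:.3f}")         # 0.5 = chance, 1.0 = perfect;
                                          # random toy data stays near 0.5
```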
3. A Financial-Services Application

We illustrate our new procedure, which combines sequence analysis with a traditional classification method, on a churn-prediction case to support the customer-retention decision system of a major Financial-Services Provider (i.e., the IFSP). Over the past two decades, the financial markets have become more competitive due to the mature nature of the sector on the one hand and deregulation on the other, resulting in diminishing profit margins and blurring distinctions between banks, insurers and brokerage firms (i.e., universal banking). Hence, nowadays a small number of large institutions offering a wider set of services dominate the financial-services industry. These developments stimulated bancassurance companies to implement Customer Relationship Management (CRM). Under this intense competitive pressure, companies realize the importance of retaining their current customers. The substantive relevance of attrition modeling comes from the fact that an increase in retention rate of just one percentage point may result in substantial profit increases [46]. Successful customer retention allows organizations to focus more on the needs of their existing customers, thereby increasing the managerial insight into these customers' needs and hence decreasing the servicing costs. Moreover, long-term customers buy more [12] and, if satisfied, might provide new referrals through positive word-of-mouth for the company. These customers also tend to be less sensitive to competitive marketing actions. Finally, losing customers leads to opportunity costs due to lost sales, and attracting new customers is five to six times more expensive than retaining existing ones [4, 8]. For an overview of the literature on attrition analysis we refer to Van den Poel and Larivière [46]. Combining several techniques (as in this paper) to achieve improved attrition models has already been shown to be highly effective [28].

3.1 Customer Selection

In this paper, we define a 'churned' customer as someone who closed all his accounts at the IFSP. We predict whether customers still being customers on December 31st, 2002 will churn on all their accounts in the next year (i.e., 2003) or not. Several selection criteria are used to decide which customers to include in our analysis. Firstly, we only selected customers who became customers from January 1st, 1992 onwards, because the information in the data warehouse before this date is less detailed. Secondly, we only select customers having at least three distinct purchase moments before January 2003. This constraint is imposed because we wish to focus the attrition analysis on the more valuable customers. Given the fact that most customers at the IFSP possess only one financial service, the selected customers clearly belong to the more precious clients of the IFSP. Thirdly, we only keep customers still being customers on December 31st, 2002 (cf. prediction of the churn event in 2003). This eventually leaves 16,254 customers, among which 399 customers (2.45%) closed all their accounts in 2003. We randomly created a training and a hold-out sample of 8,127 customers each, containing 200 (2.46%) and 199 (2.45%) churners respectively. There is no overlap between the training and hold-out samples.

3.2 Construction of the Sequential Dimension

As discussed earlier, we want to include a sequential covariate in a traditional classification model. One such sequential dimension likely to influence the churn probability is the customers' evolution in account-balance total at the IFSP. We define the latter variable as the sum of the customers' total assets (i.e., total outstanding balance on short- and long-term credit accounts + total debit on current account) and total liabilities (i.e., total amount on savings and investment products + credit on current account + sum of monthly insurance fees). Although this account-balance total is a continuous dimension, it is registered in the data warehouse at discrete moments in time: at the end of the month for bank accounts and on a yearly basis for insurance products. We have reliable data on account-balance total from January 1st, 2002 onwards. We build sequences of relative differences in account-balance total (i.e., relbalance) rather than sequences of absolute account-balance totals, with the aim of facilitating the capturing of overall trends in account-balance total. Each sequence contains four elements (see Table 1): relbalanceJanMar, relbalanceMarJul, relbalanceJulOct and relbalanceOctDec.

Table 1. Four elements of the relative account-balance total dimension (relbalance)

relbalanceJanMar: (account-balance total March 2002 - account-balance total January 2002) / account-balance total January 2002
relbalanceMarJul: (account-balance total July 2002 - account-balance total March 2002) / account-balance total March 2002
relbalanceJulOct: (account-balance total October 2002 - account-balance total July 2002) / account-balance total July 2002
relbalanceOctDec: (account-balance total December 2002 - account-balance total October 2002) / account-balance total October 2002

Besides observing the account-balance total at discrete moments in time, we converted the ratio-scaled relative account-balance total sequence into a categorical dimension. The latter is crucial to ensure that the SAM analysis will find any similarities between the customers' sequences. Based on an investigation of the distribution of the relative account-balance total, nine categories are distinguished, each representing approximately an equal number of customers (to enhance the discovery of similarities between customers):

Table 2. Values for the categorical relative account-balance total dimension and element-based costs

Element | Values of relbalance        | Deletion/insertion cost of element
0       | relbalance = 0              | 0.2
1       | -0.5 < relbalance < 0       | 0.4
2       | -2.5 < relbalance <= -0.5   | 0.6
3       | -10 < relbalance <= -2.5    | 0.8
4       | relbalance <= -10           | 1
5       | 0 < relbalance < 0.05       | 0.4
6       | 0.05 <= relbalance < 0.5    | 0.6
7       | 0.5 <= relbalance < 2.5     | 0.8
8       | relbalance >= 2.5           | 1

For the LogSeq model, customers are clustered, using SAM and the Taylor-Butina algorithm, on their evolution in relative account-balance total as expressed by a sequence of four categorical relative account-balance total variables. In the LogNonseq model, the sequential dimension is included by the four relative account-balance total variables (i.e., relbalanceJanMar, relbalanceMarJul, relbalanceJulOct and relbalanceOctDec) measured at the ratio scale level.
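A small sketch of this preprocessing step follows, assuming a toy pandas frame with one balance column per observation month (the column names are hypothetical, not the IFSP warehouse schema); the categorize function encodes the nine bins of Table 2.

```python
import pandas as pd

# Toy account-balance totals at the five observation months.
bal = pd.DataFrame(
    {"jan": [1000.0, 200.0], "mar": [1100.0, 150.0],
     "jul": [900.0, 160.0], "oct": [950.0, 40.0], "dec": [1900.0, 35.0]})

def categorize(r):
    """Map a relative difference onto the nine categories of Table 2."""
    if r == 0:            return 0
    if -0.5 < r < 0:      return 1
    if -2.5 < r <= -0.5:  return 2
    if -10 < r <= -2.5:   return 3
    if r <= -10:          return 4
    if 0 < r < 0.05:      return 5
    if 0.05 <= r < 0.5:   return 6
    if 0.5 <= r < 2.5:    return 7
    return 8              # r >= 2.5

pairs = [("jan", "mar"), ("mar", "jul"), ("jul", "oct"), ("oct", "dec")]
seq = pd.DataFrame({f"relbalance{a.capitalize()}{b.capitalize()}":
                    ((bal[b] - bal[a]) / bal[a]).map(categorize)
                    for a, b in pairs})
print(seq)   # one four-element categorical sequence per customer
```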
3.3 Non-time Varying Independent Variables

Besides the sequential dimension, several non-time varying covariates are created (see Table 3). Two blocks of independent variables can be distinguished. A first block captures behavioral/transactional information. Some of these variables are related to the number of accounts open(ed) or closed, while others consider the account-balance total of the customer over a certain period of time. We also include some exogenous variables expressing when and in what service category the next expiration will occur. Finally, we incorporate some regressors expressing how active the customer is: cf. his recency, or the number of months since being titular of at least one account. The second block of variables involves non-transactional data, i.e., socio-demographic information like gender, age and cohort.

Table 3. Non-time varying independent variables to predict churn behavior

st_days_until_next_exp: Number of days until the first next expiration date (from January 1st, 2003 on) for a service still in possession on December 31st, 2002. Standardized.
st_days_since_last_exp: Number of days between December 31st, 2002 and the last expiration date of a service before January 1st, 2003. Standardized.
st_days_since_last_intentend: Number of days between December 31st, 2002 and the date (before January 1st, 2003) on which the customer intentionally closed a service. Remark: mostly the expiration date is equal to the closing date.
dummy_cat1_next_exp ... dummy_cat14_next_exp: Dummies indicating in what service category the next coming expiration date in 2003 falls.
nbr_purchevent_bef2003: Number of distinct purchase moments the customer has before January 1st, 2003. The minimum value is 3 due to the customer selection.
st_nbr_serv_opened_bef2003: Number of accounts a customer ever opened before January 1st, 2003. Standardized.
nbr_serv_still_open_bef2003: Number of services the customer still possesses on December 31st, 2002.
nbr_serv_open_cat1_bef2003 ... nbr_serv_open_cat14_bef2003: Number of accounts still open in service category 1 (respectively 2, ..., 14) on December 31st, 2002.
dummy_lastcat1_opened_bef2003 ... dummy_lastcat14_opened_bef2003: Dummies indicating in what service category the customer last opened an account before January 1st, 2003.
nbr_serv_closed_bef2003: Number of accounts the customer has closed (intentionally closed or due to expiration) before January 1st, 2003.
nbr_serv_intentend_bef2003: Number of services the customer intentionally closed before the expiration date, before January 1st, 2003.
nbr_serv_closed_cat1_bef2003 ... nbr_serv_closed_cat14_bef2003: Number of accounts expired or intentionally closed in service category 1 (respectively 2, ..., 14) before January 1st, 2003.
dummy_lastcat1_closed_bef2003 ... dummy_lastcat14_closed_bef2003: Dummies indicating in what service category the customer last closed (intentionally or due to expiration) a service before January 1st, 2003.
dummy_last_cat1_intentend_bef2003 ... dummy_last_cat14_intentend_bef2003: Dummies indicating in what service category the customer last intentionally closed an account before January 1st, 2003.
ratio_closed_open_bef2003: (Number of accounts closed before 2003 / number of accounts opened before 2003) * 100.
ratio_stillo_open_bef2003: (Number of services still open on December 31st, 2002 / number of services ever opened before January 1st, 2003) * 100.
st_recency: Time between the last purchase moment and December 31st, 2002. Standardized.
st_avg_balance_total3: The average account-balance total of the customer over the last three months of 2002: (account-balance total October 2002 + account-balance total November 2002 + account-balance total December 2002) / 3. Standardized.
st_avg_balance_total6: The average account-balance total of the customer over the last six months of 2002. Standardized.
st_ratio_tot_3_6: Ratio of the total account-balance total of the customer over the last three months of 2002 to the total account-balance total of the customer over the last six months before 2003. Standardized.
st_avg_diff_balance_total2: ((account-balance total December 2002 - account-balance total November 2002) + (account-balance total November 2002 - account-balance total October 2002)) / 2. Standardized.
st_avg_diff_balance_total3: ((account-balance total December 2002 - account-balance total November 2002) + (account-balance total November 2002 - account-balance total October 2002) + (account-balance total October 2002 - account-balance total September 2002)) / 3. Standardized.
st_ratio_curr_avgtotal3: Account-balance total December 2002 / average account-balance total calculated over October, November and December 2002.
st_ratio_curr_avgtotal6: Account-balance total December 2002 / average account-balance total calculated over the last six months of 2002.
st_avg_balance_min4to2: (account-balance total November 2002 + account-balance total October 2002 + account-balance total September 2002) / 3. Standardized.
st_ratio_curr_avgtotalmin4to2: Turnover in December 2002 / st_avg_balance_min4to2. Standardized.
last_avg_reinvest_time: Latest average reinvest time before 2003. The average reinvest time indicates how long the customer waits before reinvesting money that suddenly becomes available again.
st_last_use_homebanking: Number of days since the last use of home banking. Deduced from the last logon date or, if missing, from the last transaction date or, if missing, from the first logon date or, if missing, from the home-banking start date. Standardized.
months_last_titu_nozero: Number of months since the customer was last titular of at least one account with a non-zero balance.
dummy_contentieux: Dummy indicating whether the customer is in contentieux (bad debt) on at least one account on December 31st, 2002.
lor: Length of relationship, expressed in years.
age_31_Dec_2002: Age of the customer on December 31st, 2002.
age_becoming_customer: Age of the customer when becoming a customer at the IFSP.
gender: Gender of the customer.
dummy_cohort_G1 ... dummy_cohort_G5: Dummies indicating whether the customer belongs to cohort group 1, 2, 3, 4, 5 or 6. Cohort 1: 1900 <= birth date <= 1924 (i.e., GI generation). Cohort 2: 1925 <= birth date <= 1945 (i.e., Silent Generation). Cohort 3: 1946 <= birth date <= 1955 (i.e., early baby boomers). Cohort 4: 1956 <= birth date <= 1964 (i.e., late baby boomers). Cohort 5: 1965 <= birth date <= 1980 (i.e., generation X). Cohort 6: birth date > 1980 (i.e., generation Y).

4. Results

4.1 An Element- and Position-sensitive SAM Analysis

4.1.1 Operation Costs

We calculated the distance between each pair of customers, in both directions, on the categorical relative account-balance total dimension, essentially using the weighted-Levenshtein distance. All sequences have length four, i.e., there are four elements in each sequence (see Section 3.2). Typically, the operation weights for deletion and insertion are set to 1. In this paper, we do not follow this approach. In order to ensure that different trajectories result in different SAM distances, the operation weights for deletion and insertion are not set equal: we define w_del = 1 and w_ins just below 1, w_ins = 0.9. Similarly, to favor different SAM distances, the weight for reordering is not set, as in many research studies, to the sum of the costs of one deletion and one insertion. We arbitrarily set the reordering weight to 2.3. We intentionally define the reordering weight to be larger than the sum of the deletion and insertion weights to simplify and speed up the search for the optimal alignment (cf. calculating an optimal alignment is equivalent to finding the longest common subsequence of the two sequences compared when the substitution weight is at least equal to the sum of the deletion and insertion weights).

4.1.2 Element-Sensitive Costs

We adapted this rather conventional SAM-distance calculation to an element-sensitive SAM analysis to better reflect the research context. Besides the operation weights, we charge an element-based cost depending on the element in the source being deleted or inserted. As can be seen from the values assigned to the categorical relative account-balance total variable (cf. Table 2), some values are more related to each other than others. For instance, values 4 and 8 are more divergent than values 0 and 1. Therefore, a different extra cost is added depending on the element being inserted in or deleted from the source. The latter increases the variance in the final SAM distances. We set the element-based costs for deletion and insertion equal. The element-cost setting does not distinguish between the categories of relbalance reflecting positive evolutions in relbalance and the categories expressing negative evolutions (e.g. equal element costs for element values 1 and 5); see Table 2 (cf. supra).
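In code, the element-based cost of Table 2 is just a lookup table indexed by the relbalance category; how it combines with the operation weights and position costs is shown in the full sketch after Section 4.1.4.

```python
# Deletion/insertion cost table from Table 2 (index = relbalance category).
# Identical for deletion and insertion, and blind to the sign of the
# evolution (categories 1 and 5 both cost 0.4, and so on).
ELEMENT_COST = [0.2, 0.4, 0.6, 0.8, 1.0, 0.4, 0.6, 0.8, 1.0]

def element_cost(category):
    """Extra charge for deleting or inserting a given relbalance category."""
    return ELEMENT_COST[category]
```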
4.1.3 Absolute Position-sensitive Costs

Besides incorporating an element-based cost in the calculation of the conventional SAM distances, we convert the SAM distance measure into a position-sensitive measure. Normally, SAM describes differences between sequences only in terms of the difference in the sequential order of the elements, by changing the order of the common elements of the source sequence if it differs from that of the common elements in the target (i.e., reordering), and in terms of the difference in element composition, by deleting the unique elements from the source and inserting the unique elements of the target sequence into the source. Hence, conventional SAM is not sensitive to the positions at which elements are inserted, deleted or reordered. After all, in bio-informatics studies this position sensitivity is not useful, as the elements in DNA strings are relatively independent of each other: consecutive DNA elements are not likely to affect one another. However, one can think of many other sequences where the elements do influence each other. Whenever the elements in the sequence are measured at consecutive time points, we can assume that previous values influence subsequent values in the sequence. For instance, in activity-sequence analysis, sequential relationships between activities are a primary concern. Likewise, we assume in our application that the elements in the sequence describing the evolution in relative account-balance total of the customer over 2002 are correlated. Although there are many applications where the elements in the sequence influence each other, there is, to our knowledge, only one research study which incorporates the positional component into the original SAM distance concept. Joh et al. [23] developed a position-sensitive sequence-alignment analysis for activity analysis. Position sensitivity is taken into account by considering the distance over which the sequential order of the source element is changed. The reordering distance h is measured as h = |i - j|, where i and j are the positions of the reordered elements in the source and target sequences. The position-sensitive SAM distance is defined as follows:

d(a, b) = min ( w_d D + w_i I + η Σ_{r=1..R} h_r )    (1)

where w_d is the weight for deletion, D the number of deletions, w_i the weight for insertion, I the number of insertions, η the weight for reordering, R the number of reorderings, and h_r the distance of reordering the r-th common element.

The authors show that for larger values of the reordering weight, there is a significant difference between the clustering solution found using the traditional SAM measure and the one resulting from the position-sensitive SAM analysis. Whereas Joh et al. [23] developed a relative position-sensitive SAM analysis, i.e., a SAM analysis that is sensitive to the difference in positions of common elements that need to be reordered, we wish to develop an absolute position-sensitive SAM analysis, which considers not only the positions of reordered elements, but also the positions at which elements are deleted from the source as well as the positions in the source at which elements are inserted. We prefer an absolute position-sensitive measure to a relative measure because we wish to distinguish between operations applied at the beginning of the sequence and operations performed at the end of the sequence. The rationale for this comes from the fact that we assume recent evolutions in relative account-balance total to influence the customer's churn probability more strongly than the customer's relative account-balance total at the beginning of the sequence; e.g. the relative account-balance total of the customer in the period October-December 2002 probably influences the churn probability more than that between January-March 2002. Consider examples 1 and 2. Let sequence a be the source and sequence b the target. In both examples, we need to change the order of elements 6 and 1: whereas in the target element 6 precedes element 1, in the source this is reversed. Using a relative position-sensitive distance measure like that of Joh et al. [23], the SAM cost from reordering would be the same for examples 1 and 2.
Supposing we reorder element 6 and the reordering weight is 1, we obtain a reordering cost of |2-1| = 1 for example 1 and |4-3| = 1 for example 2. From these examples it follows that a relative position-sensitive SAM measure does not distinguish between sequences where the reordering is applied over the same number of positions, but at distinct positions in the source. Whereas a relative position-sensitive SAM measure keeps the distances symmetric (cf. the reordering cost is defined using the difference in position of the element to reorder in source and target), an absolute position-sensitive SAM method results in asymmetric distances (cf. the reordering cost is defined using only the position of the reordered element in the source).

            Example 1    Example 2
position    1 2 3 4      1 2 3 4
target b    6 1 7 7      6 1 7 7
source a    1 6 7 7      7 7 1 6

The absolute position-sensitive reordering cost multiplies the position of the reordered element in the source by the reordering weight. As mentioned earlier, we also convert the deletion and insertion costs into position-sensitive costs. The absolute position-sensitive deletion cost considers the position in the source at which the element is deleted. Likewise, the insertion cost is made position-sensitive by incorporating the position in the target of the element to be inserted into the source. We use the position of the element in the target as a proxy for the position in the source at which the target element is inserted. Next, we describe how the final element- and absolute position-sensitive SAM distances are calculated.

4.1.4 Hay's Pairwise Sequence-Alignment Algorithm

A major concern in sequence comparison and SAM analysis is the algorithm used to calculate the distances within a reasonable time window. To address this computational-complexity problem [40], we apply an algorithm by Hay [17] that structures the equalizing process in a fast and easy way. It has not yet been proven always to lead to an optimal alignment, i.e., the trajectory (the sequence of operations necessary to equalize the source with the target) resulting in the smallest possible distance; yet, in most cases it does.

- Step 1: Identify the longest common substrings respecting the sequential order of the elements. It is well known that if the substitution/reordering weight is at least the sum of the deletion and insertion weights, calculating an optimal alignment is equivalent to finding the longest common subsequences of the two sequences compared. These longest common substrings represent the structural integrity of the two sequences, or the structural skeleton [33]. In case there is more than one possible longest common substring, we opt for the longest substring for which the absolute sum of the differences in positions between all common elements in source and target is smallest, as the latter prefers matches between source and target at less remote positions over matches at more distant source-target positions. In example 3, we compare a customer having rather small evolutions in relative account-balance total (0 1 5 0) with another customer starting with rather small evolutions in relative account-balance total but ending with a decreasing account-balance total (5 1 2 3).

Example 3
target b: 5 1 2 3
source a: 0 1 5 0
Longest common substring: 5 or 1. We opt for identification of the common element 1, as matching 1 (position 2 in both sequences) involves a smaller difference in positions than matching 5 (position 3 in the source, position 1 in the target).

- Step 2: Identify elements which are not included in the substring and appear in both the source and the target. Count one reordering for each such identified element.
target b: 5 1 2 3
source a: 0 1 5 0
Common element not appearing in the longest common substring: 5.

At the end of this step, the order of the substituted elements has been changed. In the above example, the order of element 5 is changed so that it precedes 1 rather than succeeds it. The total reordering cost is the sum, over all reordered elements, of the product of the reordering weight and the position of the reordered element in the source:

Cost_reordering = Σ_{r=1..R} η · pos_r_reorel    (2)

where R is the number of reorderings, η the reordering weight, and pos_r_reorel the absolute position of the r-th reordered element in the source.

- Step 3: Identify elements not included in the substring which appear in only one of the compared sequences. Count one deletion operation for each element unique to the source and one insertion operation for each element unique to the target.

target b: 5 1 2 3
source a: 0 1 5 0
The two zero elements are deleted from the source; elements 2 and 3 are inserted into the source.

The costs for deletions and insertions are, besides position-sensitive, also element-sensitive:

Cost_deletion = Σ_{d=1..D} w_d · (c_d_e + pos_d_del)    (3)

Cost_insertion = Σ_{i=1..I} w_i · (c_i_e + pos_i_ins)    (4)

where w_d (w_i) is the weight for deletion (insertion), c_d_e (c_i_e) the element cost for deleting (inserting) the d-th (i-th) element with a certain value, pos_d_del the position cost for deleting the d-th element at a given position in the source, and pos_i_ins the position cost for inserting the i-th element from a given position in the target.

Applying this algorithm, using the operation weights, the element-based costs and the position costs, the total SAM distance is calculated as follows:

SAMdist = min [ Σ_{r=1..R} η · pos_r_reorel + Σ_{d=1..D} w_d · (c_d_e + pos_d_del) + Σ_{i=1..I} w_i · (c_i_e + pos_i_ins) ]    (5)

The total SAM distance for our example is:

SAMdist = (2.3 · 3) + [(1 · (0.2 + 1)) + (1 · (0.2 + 4))] + [(0.9 · (0.6 + 3)) + (0.9 · (0.8 + 4))] = 19.86    (6)

In this paper, we calculate the element- and position-sensitive SAM distance between each pair of customers in the training sample, in both directions, on the sequential dimension relative account-balance total. These distances are inserted into a distance matrix used as input for the asymmetric Taylor-Butina clustering method.
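The following is our sketch of Hay's heuristic as described in steps 1-3, wired together with the operation weights of Section 4.1.1 and the element costs of Table 2; it reproduces the worked example of eq. (6). The brute-force search for the longest common subsequence is acceptable for the length-four sequences used here but would not scale to long sequences, and the function names are ours, not Hay's.

```python
from collections import Counter
from itertools import combinations

ETA, W_DEL, W_INS = 2.3, 1.0, 0.9                 # operation weights (4.1.1)
ELEMENT_COST = [0.2, 0.4, 0.6, 0.8, 1.0, 0.4, 0.6, 0.8, 1.0]   # Table 2

def skeleton(a, b):
    """Longest common subsequence (step 1); ties broken by the smallest
    total |source position - target position| over the matched elements."""
    for k in range(min(len(a), len(b)), 0, -1):
        matches = [(ia, ib)
                   for ia in combinations(range(len(a)), k)
                   for ib in combinations(range(len(b)), k)
                   if all(a[i] == b[j] for i, j in zip(ia, ib))]
        if matches:
            return min(matches, key=lambda m: sum(abs(i - j) for i, j in zip(*m)))
    return (), ()

def sam_distance(a, b):
    """Element/position-sensitive SAM distance from source a to target b,
    following steps 1-3 and eq. (5); positions are counted from 1."""
    ia, ib = skeleton(a, b)
    free_a = [i for i in range(len(a)) if i not in ia]
    free_b = [j for j in range(len(b)) if j not in ib]
    # Step 2: common elements outside the skeleton are reordered,
    # charged eta times their absolute position in the source.
    spare_b = Counter(b[j] for j in free_b)
    reorderings, deletions = [], []
    for i in free_a:
        if spare_b[a[i]] > 0:
            spare_b[a[i]] -= 1
            reorderings.append(i)
        else:
            deletions.append(i)
    cost = sum(ETA * (i + 1) for i in reorderings)
    # Step 3: elements unique to the source are deleted ...
    cost += sum(W_DEL * (ELEMENT_COST[a[i]] + (i + 1)) for i in deletions)
    # ... and elements unique to the target are inserted, using their
    # target position as a proxy for the insertion point in the source.
    reordered_vals = Counter(a[i] for i in reorderings)
    for j in free_b:
        if reordered_vals[b[j]] > 0:
            reordered_vals[b[j]] -= 1     # already handled as a reordering
        else:
            cost += W_INS * (ELEMENT_COST[b[j]] + (j + 1))
    return cost

print(round(sam_distance([0, 1, 5, 0], [5, 1, 2, 3]), 2))   # 19.86, as in eq. (6)
```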
4.2 Asymmetric Clustering Using the Taylor-Butina Algorithm

The asymmetric SAM distances calculated between the training sequences on the evolution in relative account-balance total are used as input for the asymmetric, disjoint Taylor-Butina algorithm, with the aim of distinguishing clusters with respect to the sequential dimension and, indirectly, also with respect to the dependent variable, i.e., churn. Depending on the threshold (in the range [0, 1]) used, the clusters obtained by the algorithm are more or less homogeneous. Lower thresholds result in smaller but more homogeneous clusters, whereas higher threshold values result in larger but less homogeneous clusters. In our application, however, our primary objective is not finding the optimal clustering solution in terms of homogeneity, but keeping the number of clusters limited, as the cluster membership is incorporated by means of dummies into the final classification model. For each cluster, a certain minimum number of customers is needed to enhance the possibility that the cluster dummies significantly influence the dependent variable. As the Taylor-Butina algorithm iteratively identifies the sequence with the highest number of neighbors, it follows that the first cluster defined is the largest one and that subsequent clusters have fewer and fewer members. Therefore, small thresholds might not be optimal with respect to our predictive goal. Our ambition is to distinguish a reasonable number of clusters with 1) a certain minimum number of customers and 2) the highest possible homogeneity.

We experimented with several levels of the similarity threshold. It appears that we need a rather high threshold to keep the number of clusters limited. For instance, for a threshold of 0.80 we obtain 130 clusters, of which the biggest cluster contains only 758 customers (i.e., 9.32%) and of which clusters 89 to 130 hold fewer than five customers. We investigated for which threshold the first cluster keeps a high enough number of customers. Employing a threshold of 0.9999 resulted in a 23-cluster solution of which the first cluster counts 3,281 customers (i.e., 40.37%). As 23 clusters is still quite a high number, which would result in 22 cluster dummies in the classification model, and as this cluster solution still creates some rather sparsely populated clusters (e.g., from cluster 10 onwards the clusters have fewer than 100 members), we decided to group some clusters together. Therefore, we performed a 'second-order clustering' by using the 23 representative sequences (i.e., centrotypes) as input to a subsequent clustering exercise, and investigated which cluster representatives are taken together into new clusters. This resulted in a final five-cluster solution (see Table 4).

Table 4. Final five-cluster solution on the training sample

Cluster | Old clusters | N    | % N   | Churners | % churners
1       | 1            | 3281 | 40.37 | 82       | 2.50
2       | 2            | 1629 | 20.04 | 42       | 2.58
3       | 3            | 1040 | 12.79 | 26       | 2.50
4       | 4-5          | 1101 | 13.55 | 20       | 1.82
5       | 6-23         | 1076 | 13.24 | 30       | 2.79

For each of these five clusters, we defined five representatives. We prefer to have more than one representative per cluster to enhance the quality of the cluster allocation of the hold-out sequences. By default, the grouping module of the Mesa Suite software package version 2.1 returns only one representative per cluster. Hence, for clusters 1 to 3 we already have one representative each, for cluster 4 we have two centrotypes and for cluster 5 we have 18 centrotypes. For clusters 1 to 4 we need to find additional representatives, whereas for cluster 5 we need to limit the number of representatives to five. Therefore, for each cluster defined, we cluster on the cluster-specific SAM distances until we obtain a five-cluster solution providing exactly five representatives for that cluster.

In a next step, the hold-out sequences are assigned to the five clusters identified on the training sample. We calculated the distances between the hold-out sequences and the groups of five representative cluster sequences (5 x 5) in both directions. As a proxy for the asymmetric distance between a hold-out sequence and a representative, we take the sum of the distance from the hold-out sequence to the centrotype and the distance from the centrotype to the hold-out sequence, and divide this sum by two. The latter approach is common practice in studies performing calculations on asymmetric proximities. After all, it has been proven [34] that each asymmetric distance matrix Q decomposes into a symmetric matrix S of averages s_ij = (q_ij + q_ji)/2 and a skew-symmetric matrix A with elements a_ij = (q_ij - q_ji)/2.
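In code, this symmetrization is a one-liner; the toy matrix below is illustrative only.

```python
import numpy as np

Q = np.array([[0.0, 1.2], [0.8, 0.0]])   # toy asymmetric distance matrix
S = (Q + Q.T) / 2                         # symmetric part: the averaged proxies
A = (Q - Q.T) / 2                         # skew-symmetric remainder
assert np.allclose(Q, S + A)              # the decomposition is exact
```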
Using these proxies, each hold-out sequence is assigned to the cluster to which it has the smallest average distance (i.e., the smallest average distance towards the five cluster representatives). Table 5 gives an overview of the cluster distribution in the hold-out sample.

Table 5. Allocation of hold-out sequences to the five clusters identified on the training sample

Cluster | N    | % N   | Churners | % churners
1       | 4514 | 55.60 | 127      | 2.81
2       | 601  | 7.40  | 32       | 5.32
3       | 916  | 11.28 | 4        | 0.44
4       | 842  | 10.36 | 14       | 1.66
5       | 1254 | 15.45 | 22       | 1.75

4.3 Defining the Best Subset of Non-time Varying Independent Variables

Before we compare the predictive performance of the LogSeq model with that of the LogNonseq model, we first define a best subset of non-time varying independent variables to include in the logistic-regression models besides the sequential dimension relbalance. Employing the leaps-and-bounds algorithm [11] on the non-time varying independent variables in Table 3, we compared the best subsets of sizes 1 to 20 on their sums of squares. As expected, the marginal gain in the performance criterion decreases as more independent variables are added. From Figure 2, we decide that a subset containing the best five variables represents a good balance between the number of independent variables included and the variance explained by the model.

Fig. 2. Number of variables in best subsets.

The independent variables in the best subset of size five are:

Table 6. Best subset of size 5

st_days_until_next_exp: Number of days until the first next expiration date (from January 1st, 2003 on) for a service still in possession on December 31st, 2002. Standardized.
st_ratio_curr_avgtotal3: Account-balance total December 2002 / average account-balance total calculated over October, November and December 2002.
st_ratio_stillo_open_bef2003: (Number of services still open before January 1st, 2003 / number of services ever opened before January 1st, 2003) * 100. Standardized.
st_months_last_titu_nozero: Number of months since the customer was last titular of at least one account with a non-zero balance.
st_days_since_last_exp: Number of days between December 31st, 2002 and the last expiration date of a service before January 1st, 2003. Standardized.

4.4 Comparing the Churn Predictive Performance of the LogSeq and LogNonseq Models

We compare the predictive performance, measured by the AUC on the hold-out sample, of the LogNonseq model with that of the LogSeq model. Both logistic-regression models include the best five non-time varying variables from Table 6 as well as the sequential dimension relbalance. However, in the LogNonseq model the sequential dimension is incorporated by means of four non-time varying independent variables (i.e., st_relbalanceJanMar, st_relbalanceMarJul, st_relbalanceJulOct and st_relbalanceOctDec), whereas in the LogSeq model the sequential dimension is operationalized by four cluster dummies. We hypothesize that the churn predictive performance of the LogSeq model will be significantly higher than that of the LogNonseq model, because operationalizing a sequential dimension by non-time varying independent variables neglects the sequential information in the dimension.

Table 7. Churn predictive performance of the LogSeq and LogNonseq models on the hold-out sample

Performance measure (AUC) | LogNonseq model | LogSeq model
                          | 0.906           | 0.964

Table 7 shows that our hypothesis is confirmed.
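Readers wishing to check such an AUC gap on their own data could, for instance, bootstrap the paired AUC difference. The sketch below is an assumption-light alternative to the chi-square test on the AUC difference reported next, not the authors' procedure; its arguments are numpy arrays of hold-out labels and the two models' predicted churn probabilities.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_diff_ci(y, p_seq, p_nonseq, n_boot=2000, seed=1):
    """95% paired-bootstrap CI for AUC(LogSeq) - AUC(LogNonseq).

    Resamples customers with replacement, recomputing both AUCs on the
    same resample so the comparison stays paired.
    """
    rng = np.random.default_rng(seed)
    n, diffs = len(y), []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)
        if y[idx].min() == y[idx].max():    # resample needs both classes
            continue
        diffs.append(roc_auc_score(y[idx], p_seq[idx])
                     - roc_auc_score(y[idx], p_nonseq[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```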
There is a significant difference (χ² = 17.69, p = 0.0000259) in predictive performance between the binary logistic-regression model including the best subset of five independent variables and the sequential dimension operationalized by non-time varying variables, and the logistic-regression model with the same set of independent variables but with the sequential dimension expressed by cluster dummies deduced from the sequence-alignment analysis. Table 8 shows the parameter estimates and significance levels of the regressors for both models. Where possible, the standardized estimates are given. In the rows for the sequential dimension, the LogNonseq columns refer to the st_relbalance variables and the LogSeq columns to the corresponding cluster dummies (cl1 to cl4).

Table 8. Parameter estimates for the LogNonseq and LogSeq models

Variable                       | Estimate LogNonseq | Estimate LogSeq | Pr > ChiSq LogNonseq | Pr > ChiSq LogSeq
Intercept                      | -16.80  | -8.23 | <.0001 | <.0001
st_days_until_next_exp         | -1.25   | -1.26 | <.0001 | <.0001
st_ratio_curr_avgtotal3        | 0.28    | -0.09 | 0.0187 | 0.0109
st_ratio_stillo_open_bef2003   | -1.25   | -1.28 | <.0001 | <.0001
st_months_last_titu_nozero     | 0.12    | 0.02  | 0.0048 | 0.4392
st_days_since_last_exp         | 0.50    | 0.49  | <.0001 | <.0001
st_relbalanceJanMar / cl1      | -21.79  | 0.29  | 0.2169 | 0.0401
st_relbalanceMarJul / cl2      | -8.86   | 0.25  | 0.9299 | 0.1034
st_relbalanceJulOct / cl3      | -38.25  | 0.23  | 0.0890 | 0.1744
st_relbalanceOctDec / cl4      | -240.45 | 0.35  | 0.0011 | 0.0447

Although not all cluster dummies are significant at the α = 0.05 level, the LogSeq model significantly outperforms the LogNonseq model. The insignificance of cluster dummy 3 (p = 0.1744) might stem from the serious drop in the percentage of churners in cluster 3 from the training sample (2.50%) to the hold-out sample (0.44%), i.e., to below 1% churners. Looking at the estimates for the relative account-balance total dimension in the LogNonseq model, we find that only the relative evolution in account-balance total over the last six months seems to have a significant effect on the churn probability in 2003 (cf. st_relbalanceJulOct and st_relbalanceOctDec): the bigger the positive difference in account-balance total, the less likely the customer is to churn. Considering the non-sequential regressors, all five regressors are significant in the LogNonseq model, while in the LogSeq model all but one (st_months_last_titu_nozero) are. All effects have the expected sign. The smaller the number of days until the next expiration date from January 1st, 2003 onwards, the higher the churn probability (cf. st_days_until_next_exp). The effect of st_ratio_curr_avgtotal3 seems to be rather small, so we should not worry too much about the difference in the sign of its effect between the LogNonseq and the LogSeq model.

We may conclude that our new procedure, which combines sequence analysis with a traditional classification model, is a possible strategy to follow in an attempt to overcome the caveat that traditional classification models are designed for modeling non-time varying independent variables. In this paper, we have modeled a time-varying independent variable by preceding a traditional binary logistic-regression model by an element/position-sensitive sequence alignment on the sequential dimension. This resulted in cluster-membership information in terms of the sequential dimension as well as, implicitly, with respect to the dependent variable. We decided to include this cluster-membership information by means of dummies in the final classification model. Alternatively, we could use the cluster information to build cluster-specific classification models. However, in our application this approach is unrealistic, because one of the five clusters has less than 1% churners in the hold-out sample (cf. cluster 3).
Instead of including the cluster information as dummies, we could alternatively use it to build cluster-specific classification models. However, in our application this approach is unrealistic, because one of the five identified clusters contains less than 1% churners in the hold-out sample (cf. cluster 3). Hence, when the predicted event is rare, building cluster-specific classification models may be infeasible because too few cases experience the event in some of the identified clusters. Another drawback of building cluster-specific models lies in the practical problems of including several sequential dimensions in the classification model. Whereas it is easy to add another set of cluster dummies to the classification model for each sequential dimension, building cluster-specific classification models on more than one sequential dimension implies simultaneously clustering on several sequential dimensions by means of multidimensional SAM analysis. As computational complexity is already a concern for unidimensional SAM, it becomes even more pressing for multidimensional SAM analysis.

5 Conclusion

In this paper, we provide a new procedure that overcomes the inability of traditional classification models to incorporate sequential exogenous variables. Instead of transforming the sequential dimension into non-time varying variables, thereby discarding the sequential information, a better practice is to employ a sequence-analysis method for modeling the time-varying independent variable and, subsequently, to incorporate this information in the traditional classification method, which is designed for modeling non-time varying covariates. This way, the best of both methods is combined. One possible strategy is to cluster the customers on the sequential dimension using SAM (in this paper, an element/position-sensitive SAM) and to incorporate this cluster information in the classification model by means of dummy variables. This approach is promising, as the results from the attrition models at the IFSP confirm our hypothesis of improved predictive performance when the sequential dimension is modeled by sequence-analysis methods instead of being operationalized as non-time varying variables. Besides this approach, in which a traditional classification model such as binary logistic regression is preceded by a sequence-analysis method that models the sequential dimension, other approaches might exist. It might be worthwhile to elaborate on other procedures for combining sequence-analysis methods, designed to model sequential information, with traditional classification methods, suited to model non-time varying independent variables. Another avenue for further research is to explore how sequence-analysis methods other than sequence alignment could enhance the modeling of sequential covariates in classification models. In this paper, we included only one sequential dimension; further research should incorporate several sequential covariates. Finally, we wish to bring the parameter-setting issue to the attention of researchers considering applying or elaborating on our new procedure. The highly tuned edit distances used for our churn application might not be valid in other applications. The researcher should adapt the operational weights, element costs and position costs of the sequence alignment to fit the application at hand. Similarly, the threshold used for the asymmetric clustering will need fine-tuning in order to obtain a good clustering solution for other applications.
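To give an idea of the parameters involved, the sketch below shows a generic dynamic-programming edit distance with pluggable substitution costs, insertion/deletion costs and position weights. It merely illustrates the kind of operational weights, element costs and position costs referred to above; it is not the SAM implementation used in this paper, and the concrete cost functions in the example are arbitrary assumptions.

```python
# Generic weighted edit distance with element-sensitive costs and a
# position-sensitive weight; an illustrative stand-in, not the paper's SAM code.
def weighted_edit_distance(s, t, sub_cost, indel_cost, pos_weight):
    """s, t: sequences of states (e.g., relbalance states per period).
    sub_cost(a, b): element-sensitive substitution cost.
    indel_cost(a): element-sensitive insertion/deletion cost.
    pos_weight(i): position-sensitive multiplier for an edit near position i."""
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + pos_weight(i) * indel_cost(s[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + pos_weight(j) * indel_cost(t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = pos_weight(max(i, j))
            same = s[i - 1] == t[j - 1]
            D[i][j] = min(
                D[i - 1][j] + w * indel_cost(s[i - 1]),                  # deletion
                D[i][j - 1] + w * indel_cost(t[j - 1]),                  # insertion
                D[i - 1][j - 1] + (0.0 if same else w * sub_cost(s[i - 1], t[j - 1])),
            )
    return D[n][m]

# Example: substitutions cost the absolute state difference, and edits at later
# positions weigh more heavily (both arbitrary, illustrative choices).
d = weighted_edit_distance((1, 2, 2, 3), (1, 2, 3, 3),
                           sub_cost=lambda a, b: abs(a - b),
                           indel_cost=lambda a: 1.0,
                           pos_weight=lambda i: 1.0 + 0.1 * i)
print(d)  # one substitution at position 3 -> 1.3
```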
Acknowledgements

The authors would like to thank the anonymous financial-services company for providing the data. Next, we extend our thanks to John D. MacCuish and Norah E. MacCuish for providing a free academic license of the software package Mesa Suite Version 1.2, Grouping Module (www.mesaac.com) and for their kind assistance. Moreover, we would like to thank Bart Larivière, PhD candidate at Ghent University, for sharing his knowledge of the data warehouse of the company. Finally, we express our thanks to 1) Ghent University for funding the PhD project of Anita Prinzie (BOF Grant no. B00141), and 2) the Flemish Research Fund (FWO Vlaanderen) for providing the funding for the computing equipment used to complete this project (Grant no. G0055.01).

References

[1] A. Abbott and A. Hrycak, Measuring Resemblance in Sequence Data: An Optimal Matching Analysis of Musicians’ Careers, American Journal of Sociology 96, No. 1 (1990), 144-185.
[2] A. Abbott, Sequence analysis: new methods for old ideas, Annual Review of Sociology 21, (1995), 93-113.
[3] B. Baesens, G. Verstraeten and D. Van den Poel, Bayesian Network Classifiers for Identifying the Slope of the Customer Lifecycle of Long-Life Customers, European Journal of Operational Research 156, No. 2 (2004), 508-523.
[4] C.B. Bhattacharya, When customers are members: Customer retention in paid membership contexts, Journal of the Academy of Marketing Science 26, No. 1 (1998), 31-44.
[5] W. Buckinx, E. Moons, D. Van den Poel and G. Wets, Customer-Adapted Coupon Targeting Using Feature Selection, Expert Systems with Applications 26, No. 4 (2004), 509-518.
[6] D. Butina, Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way to Cluster Small and Large Data Sets, Journal of Chemical Information and Computer Sciences 39, No. 4 (1999), 747-750.
[7] A. Cohen, R.I. Ivry and S.W. Keele, Attention and structure in sequence learning, Journal of Experimental Psychology: Learning, Memory and Cognition 16, No. 1 (1990), 17-30.
[8] M.R. Colgate and P.J. Danaher, Implementing a customer relationship strategy: The asymmetric impact of poor versus excellent execution, Journal of the Academy of Marketing Science 28, No. 3 (2000), 375-387.
[9] M.C. Cooper and G.W. Milligan, The effect of error on determining the number of clusters, Proceedings International Workshop on Data Analysis, Decision Support and Expert Knowledge Representation in Marketing and Related Areas of Research, (1988), 319-328.
[10] W.S. DeSarbo and G. De Soete, On the use of hierarchical clustering for the analysis of nonsymmetric proximities, Journal of Consumer Research 11, No. 1 (1984), 601-610.
[11] G.M. Furnival and R.W. Wilson, Regressions by Leaps and Bounds, Technometrics 16, No. 4 (1974), 499-511.
[12] J. Ganesh, M.J. Arnold and K.E. Reynolds, Understanding the customer base of service providers: An examination of the differences between switchers and stayers, Journal of Marketing 64, No. 3 (2000), 65-87.
[13] D.M. Green and J.A. Swets, Signal detection theory and psychophysics, John Wiley & Sons, New York, USA, 1966.
[14] M. Gribskov and J. Devereux (eds.), Sequence Analysis Primer, Oxford University Press, New York, USA, 1992.
[15] J. Hair, R. Andersen, R. Tatham and W. Black, Multivariate Data Analysis, Prentice Hall, 1998.
[16] J. Hartigan, Clustering Algorithms, Wiley, New York, USA, 1975.
[17] B. Hay, Sequence Alignment Methods in Web Usage Mining, Doctoral Dissertation, LUC, Belgium, 2003.
[18] B. Hay, G. Wets and K. Vanhoof, Web Usage Mining by means of Multidimensional Sequence Alignment Methods,
WEBKDD 2002 – Mining web data for discovering usage patterns and profiles, Lecture Notes in Artificial Intelligence 2703, (2003), 50-65.
[19] W.J. Hopp, A Sequential Model of R&D Investment over an Unbounded Time Horizon, Management Science 33, No. 4 (1987), 500-508.
[20] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall Advanced Reference Series, Englewood Cliffs, NJ, 1988.
[21] R.A. Jarvis and E.A. Patrick, Clustering using a similarity measure based on shared nearest neighbors, IEEE Transactions on Computers 22, No. 11 (1973), 1025-1034.
[22] C.H. Joh, T.A. Arentze and H.J.P. Timmermans, Multidimensional Sequence Alignment Methods for Activity-Travel Pattern Analysis: A Comparison of Dynamic Programming and Genetic Algorithms, Geographical Analysis 33, No. 3 (2001), 247-270.
[23] C.H. Joh, T.A. Arentze and H.J.P. Timmermans, A position-sensitive sequence-alignment method illustrated for space-time activity-diary data, Environment and Planning A 33, No. 2 (2001), 313-338.
[24] J. Jonz, Textual sequence and 2nd language comprehension, Language Learning 39, No. 2 (1989), 207-249.
[25] L. Kaufman and P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis, John Wiley & Sons, 1990.
[26] E. Kim, W. Kim and Y. Lee, Combination of multiple classifiers for the customer’s purchase behavior prediction, Decision Support Systems 34, No. 2 (2003), 167-175.
[27] K. Krishna and R. Krishnapuram, A Clustering Algorithm for Asymmetrical Related Data with Applications to Text Mining, Proceedings of the Tenth International Conference on Information and Knowledge Management, 2001, 571-573.
[28] B. Larivière and D. Van den Poel, Investigating the role of product features in preventing customer churn by using survival analysis and choice modeling: The case of financial services, Expert Systems with Applications 27, No. 2 (2004).
[29] V.I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, Cybernetics and Control Theory 10, No. 8 (1965), 707-710.
[30] J. MacCuish and N.E. MacCuish, Mesa Suite Version 1.2 Grouping Module, Mesa Analytics & Computing, LLC, www.mesaac.com, 2003.
[31] J. MacCuish, C. Nicolaou and N.E. MacCuish, Ties in Proximity and Clustering Compounds, Journal of Chemical Information and Computer Sciences 41, No. 1 (2001), 134-146.
[32] S. McBrearty, The Sangoan-Lupemban and Middle Stone Age sequence at the Muguruk site, World Archaeology 19, No. 3 (1988), 388-420.
[33] M.A. McClure, T.K. Vasi and W.M. Fitch, Comparative analysis of multiple protein-sequence alignment methods, Molecular Biology and Evolution 11, No. 4 (1994), 571-592.
[34] C.S. Myers and L.R. Rabiner, A level building dynamic time warping algorithm for connected word recognition, IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-29, No. 2 (1981), 284-297.
[35] B. Noble and J.W. Daniel, Applied Linear Algebra, Prentice-Hall, New Jersey, 1988, 20.
[36] K. Ozawa, CLASSIC: a hierarchical clustering algorithm based on asymmetric similarities, Pattern Recognition 16, No. 2 (1983), 201-211.
[37] A. Prinzie and D. Van den Poel, Investigating Purchasing-Sequence Patterns for Financial Services using Markov, MTD and MTDg Models, European Journal of Operational Research, 2005, forthcoming.
[38] A.E. Raftery and S. Tavaré, Estimation and Modelling Repeated Patterns in High Order Markov Chains with the Mixture Transition Distribution Model, Applied Statistics 43, No. 1 (1994), 179-199.
[39] R. Sabherwal and D. Robey, Reconciling variance and process strategies for studying information system development, Information Systems Research 6, (1995), 303-327.
[40] D. Sankoff and J. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley Pub., Advanced Book Program, Mass., 1983.
[41] R. Sethuraman, V. Srinivasan and K. Doyle, Asymmetric and Neighborhood Cross-Price Effects: Some Empirical Generalizations, Marketing Science 18, No. 1 (1999), 23-41.
[42] R. Taylor, Simulation Analysis of Experimental Design Strategies for Screening Random Compounds as Potential New Drugs and Agrochemicals, Journal of Chemical Information and Computer Sciences 35, No. 1 (1995), 59-67.
[43] P. Toth and D. Vigo, A heuristic algorithm for the symmetric and asymmetric vehicle routing problems with backhauls, European Journal of Operational Research 113, No. 3 (1999), 528-543.
[44] A. Tversky and J.W. Hutchinson, Nearest neighbor analysis of psychological spaces, Psychological Review 93, No. 1 (1986), 3-22.
[45] G.L. Urban, P.L. Johnson and J.R. Hauser, Testing competitive market structures, Marketing Science 3, (1984), 83-112.
[46] D. Van den Poel and B. Larivière, Customer attrition analysis for financial services using proportional hazard models, European Journal of Operational Research 157, No. 1 (2004), 196-217.
[47] R.A. Wagner and M.J. Fischer, The string-to-string correction problem, Journal of the Association for Computing Machinery 21, No. 1 (1974), 168-173.
[48] M.S. Waterman, Introduction to computational biology: Maps, sequences and genomes, Chapman and Hall, USA, 1995.
[49] W.C. Wilson, Activity pattern analysis by means of sequence-alignment methods, Environment and Planning A 30, No. 6 (1998), 1017-1038.
[50] B. Zielman and W.J. Heiser, Models for asymmetric proximities, British Journal of Mathematical and Statistical Psychology 49, (1996), 127-146.