Title: Text Mining and Its Place in Predictive Modeling

Student: Corey Travis

Question: Is text mining useful in identifying loyal customers?

Background: Businesses often consider only numerical data when creating predictive models. This generally makes sense. However, textual data can also be incorporated into predictive modeling. In this project, two data sets from a small business are used to identify the business's most loyal customers. With an accurate model, the business can identify current loyal customers and predict which new customers are likely to become loyal.

Data Mining Approach: In summary, the project called for text mining. One of the data sources records responses to a survey exactly as written by customers. That data was parsed. Since the business is looking for its most loyal customers, the parsed text then had to be clustered. Essentially, SAS was able to cluster customers who used positive language when describing why the business has the best loyalty card. As the user, I was able to instruct SAS to ignore particular word forms, such as abbreviations and conjunctions, to keep the data useful and uncluttered. The text results were then combined with the numerical data, which scores the same customers on 14 different attributes. From that point, models could be created. Using an 80/20 split between training and validation data, the models were fitted and then compared.

In this report, I create an example of a data mining project to show the impact text mining has on predictive modeling. Typically, data mining, or data analytics, uses numerical data to create models, clusters, and more. It seems safe to say that the more information, or data, SAS has to work with, the more accurate the results will be. Some businesses, however, do not use text data because of the common misconception that it cannot be combined with numerical data. In the example I created, the results help show that text mining has its place in predictive modeling.

In this project, I was given two sets of data. The first is the actual text customers gave in answer to four questions: ‘Why do we have the best loyalty card?’, ‘Why do we have the worst loyalty card?’, ‘Why is this your least favorite stop?’, and ‘Why is this your favorite stop?’ For each of these questions, customers responded in free text, often in broken sentences such as ‘The points on card’ or ‘No points, horrible customer service’. The second data set is numeric in nature: numbers were assigned to customers based on the clusters in which they fit.

To begin the project, I opened my data (already converted into SAS table files) in SAS and began building a diagram. It is important to add your start code, which lets SAS connect to your library of data so it can run its algorithms appropriately. After the two data files are imported, I add them to the diagram separately. I began working with the text data first. The first step is adding four data parsing nodes, each of which will cluster the customers into categories based on a specific variable.
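The project itself was built with SAS diagram nodes rather than written code, but the parsing step can be illustrated with a short sketch. The Python/scikit-learn example below is only an assumed equivalent of the parsing node, not the project's actual implementation; the file name survey_responses.csv and the column name why_best_loyalty_card are hypothetical placeholders for the survey data.

```python
# Illustrative Python/scikit-learn stand-in for the data parsing node (not the
# project's actual SAS workflow). File and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One free-text column per survey question; here we parse only the
# 'Why do we have the best loyalty card?' responses.
survey = pd.read_csv("survey_responses.csv")

vectorizer = TfidfVectorizer(
    lowercase=True,          # normalize case before counting terms
    stop_words="english",    # drop clutter words such as 'the', 'and', 'to'
    min_df=2,                # ignore terms appearing in fewer than 2 responses
)
term_matrix = vectorizer.fit_transform(survey["why_best_loyalty_card"].fillna(""))

print(term_matrix.shape)                        # (customers, retained terms)
print(vectorizer.get_feature_names_out()[:20])  # sample of the parsed vocabulary
```

In the SAS parsing node the equivalent choices are made through its properties, where whole word types (abbreviations, conjunctions, and so on) can be ignored; a stop-word list is the closest stand-in in this sketch.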
To set the parsing nodes up correctly, each node's variables must be edited so that it parses the data based only on the variable in its title: the node ignores the three variables not associated with its title and focuses on the single variable we want. Another important step is telling the node which words to ignore. In the General properties section, each node has an option for choosing which types of words should be ignored. Because SAS includes an English dictionary, the program can categorize each word in a text file as a noun, pronoun, verb, and so on, and the user can then tell SAS which of these word types to ignore when analyzing the data. This helps SAS gather the necessary information quickly, without the clutter of unnecessary words such as ‘the’, ‘and’, and ‘to’.

Figure 1: Each parsing node is instructed to use and report one particular variable while ignoring the remaining three.
Figure 2: The four data parsing nodes, each clustering the data based on customers' textual responses to one of the questions.
Figure 3: The General properties window of a parsing node, where the user can instruct SAS which types of words to ignore during analysis.

Next, we want to cluster the data analyzed by the parsing node. The node chosen was the one titled (after its question) ‘Why Best Loyalty Card’. Since the goal is to identify the customers who are most loyal to this company, the people clustered under this question are the most likely to fit the concept of a loyal customer. After analyzing the clustering results, we can see how the observations in the text file can be grouped to help us make a prediction about loyal customers. Figure 4 shows the clustering node applied to the data analyzed by the parsing node: customers using similar language can be grouped, and the frequency of each group of words is also presented.

After this step, I began working with the numerical data. First, I added an Impute node to resolve the problem of missing values in the file. Regression models tend to ignore observations with missing data, leaving less training data to work with, and imputation helps prevent that from happening. Next, we partitioned the data into a training set and a validation set. A testing set can also be drawn from the partition, but that was not incorporated in this project; instead, the node partitioned 80 percent of the data into training and the remaining 20 percent into validation. The training data is used for preliminary model fitting, and the validation data is important for comparing multiple models. Figure 5 shows how the data is separated into bins in both the training and validation sets. According to the observations, about 55 percent of customers consider this business to have the best loyalty card; the other 45 percent do not.

Next, we add regression models to the diagram. One node models a regression based on the numerical data only; the other models a regression using both the numerical and textual data. We can then compare the two models to determine which one is the better identifier of loyal customers. Since the most loyal customers would consider this business to have the best loyalty card, these are the people we want to target.
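Before turning to the comparison itself, here is a hedged sketch of how the remaining steps, clustering the parsed text, imputing and partitioning the numeric attributes, and fitting the two regressions, might look in Python/scikit-learn. Again, this is an assumed stand-in for the SAS nodes described above, not the project's actual code; the file names, column names, and the choice of five text clusters are hypothetical, and the two files are assumed to be row-aligned by customer.

```python
# Assumed Python/scikit-learn stand-in for the clustering, impute, data partition,
# regression, and model comparison steps. Names are hypothetical placeholders.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, roc_auc_score

# Parse the 'best loyalty card' responses (same step as the earlier sketch).
survey = pd.read_csv("survey_responses.csv")
term_matrix = TfidfVectorizer(stop_words="english", min_df=2).fit_transform(
    survey["why_best_loyalty_card"].fillna(""))

# Cluster customers who use similar language, then one-hot encode the cluster labels.
clusters = KMeans(n_clusters=5, random_state=0).fit_predict(term_matrix)
text_features = pd.get_dummies(pd.Series(clusters), prefix="text_cluster").to_numpy()

# Numeric file: 14 attribute scores plus a 0/1 target for 'has the best loyalty card'.
numeric = pd.read_csv("customer_scores.csv")
target = numeric.pop("best_loyalty_card")

# Impute missing values so the regression does not silently drop observations.
numeric_features = SimpleImputer(strategy="mean").fit_transform(numeric)
combined_features = np.hstack([numeric_features, text_features])

# 80/20 split into training and validation, as in the partition step above.
(num_tr, num_val, all_tr, all_val, y_tr, y_val) = train_test_split(
    numeric_features, combined_features, target, test_size=0.20, random_state=0)

# Fit both regressions on the training data and score them on the validation data.
for name, X_tr, X_val in [("numeric only", num_tr, num_val),
                          ("numeric + text", all_tr, all_val)]:
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    prob = model.predict_proba(X_val)[:, 1]
    print(f"{name:15s} ASE={mean_squared_error(y_val, prob):.4f} "
          f"AUC={roc_auc_score(y_val, prob):.4f}")
```

The validation-set average squared error and ROC area printed here play the same role as the fit statistics reported by the model comparison node discussed next.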
Using the training data, we fit each model and determine whether it can identify our targeted customers. Figure 6 shows the two nodes used to create regressions from different combinations of data sources: the top node uses only the numerical data, while the bottom node uses both the numerical and textual data. The two models can then be compared.

Finally, we compare the two models to identify which regression is the better predictor of loyal customers. We add a node to the end of the diagram chain that compares the two models, and Figure 7 shows the results. Based on these results, the regression using both the numerical and textual data was the better model. The final interpretation is always left to the user, but SAS offers its own suggestion based on several fit statistics, and that suggestion is often the same choice the user would arrive at independently. In Figure 7, the fit statistics show that the regression built on both data sets is better. The validation data set aside earlier was used to make this determination after the models were fit on the training data. According to the average squared error, the better model had a smaller error than the alternative, which is our strongest indicator that the best model for identifying loyal customers includes both data sets. To support this further, Figure 7 also includes the ROC chart generated by the model comparison node. As the chart shows, the model using both numerical and textual data has a larger area under the curve and sits consistently higher on the sensitivity axis, indicating that it is the better predictor.

Clearly, this quick example illustrates the value of text mining when developing predictive models. Textual data is information, too. By clustering this data and using it alongside the numerical data, SAS was able to create a more accurate model. Over the next several months, this project will be expanded to create more models, including decision trees and other regressions with different properties, in order to build the most accurate model for identifying a business's most loyal customers. With such a model, businesses can quickly identify loyal customers, giving them a target group for promotions, customer retention, and more. Businesses can also use it to decide which customers to ask for advice: the textual data offered opinions on the loyalty card and on the business as a whole, and while all customers offer suggestions, the most loyal customers can be trusted to offer the best ones. One could even flip the situation and look for customers using negative words in order to identify problems that should be addressed. The applications of data mining are huge, and its ability to support business decisions only increases when text mining is incorporated.

Conclusion: One model featured data from the numerical source only; the second featured data from both the textual and numerical files. According to the results, the model using both data sets generated more accurate predictions, a larger area under the ROC curve, and a lower average squared error. Essentially, it was the better identifier and predictor of loyal customers, which exemplifies how text mining can strengthen model accuracy when it is used.
This same concept can be applied to many scenarios in the business world, indicating that text mining can be a very useful tool when combined with numerical data mining and predictive modeling.

Literature Cited: No literature is cited directly in the body of the report, but I used the SAS help tool on the SAS website for a couple of definitions, a book from class, and two publications that discuss text mining and apply it to business examples.

Author contact information: Corey Travis; 270-589-8009; [email protected]

Works Cited:

Boire, Richard. "Data Mining for Customer Loyalty." Data Mining for Managers (2014): 1-4. Boire Filler Group, 15 Mar. 2009. Web. 18 Nov. 2015. <http://www.boirefillergroup.com/pdf/DataMiningforCustLoyalty.pdf>.

Delen, Dursun. Real-World Data Mining: Applied Business Analytics and Decision Making. Upper Saddle River, New Jersey: Pearson Education, 2014. Print.

Nareddy, Maheshwar, and Goutam Chakraborty. "Customer Loyalty Program through Text Mining of Customers' Comments." SAS Global Forum (2011): 1-8. SAS Support. SAS, 2011. Web. 25 Nov. 2015. <http://support.sas.com/resources/papers/proceedings11/223-2011.pdf>.