Text Mining

Title: Text Mining and Its Place in Predictive Modeling
Student: Corey Travis
Question: Is text mining useful in identifying loyal customers?
Background:
Businesses often consider only numerical data when creating predictive models. This generally makes sense. However, textual data can also be incorporated into predictive modeling. In this project, two data sets from a small business are used to identify its most loyal customers. By developing an accurate model, the business can identify current loyal customers and predict which new customers are likely to become loyal.
Data Mining Approach: In summary, the project called for text mining. In one of the data sources, responses to a survey are recorded exactly as written by customers. That data was parsed and, since the business is looking for its most loyal customers, clustered: SAS grouped together customers who used positive language when describing why the business has the best loyalty card. As the user, I was able to instruct SAS to ignore particular word forms, such as abbreviations and conjunctions, to keep the data useful and uncluttered. The text results were then combined with the numerical data, which scores the same customers on 14 different attributes. From this point, models can be created. Using an 80/20 split between training and validation data, the models were fitted and then compared.
In this report, I walk through an example data mining project to show the impact text mining has on predictive modeling. Typically, data mining, or data analytics, uses numerical data to create models, clusters, and more. It seems safe to say that the more information SAS has to work with, the more accurate the results will be. Some businesses, however, do not use text data because of the common misconception that it cannot be combined with numerical data. In the example I created, the results help demonstrate that text mining has a place in predictive modeling.
In this project, I was given two sets of data. The first contains the actual text given in answer to four questions asked of customers: ‘Why do we have the best loyalty card?’, ‘Why do we have the worst loyalty card?’, ‘Why is this your least favorite stop?’, and ‘Why is this your favorite stop?’ Customers responded to each question in free text, often in sentence fragments such as ‘The points on card’ or ‘No points, horrible customer service’. The second set of data is numeric in nature: essentially, numbers were assigned to customers based on the clusters in which they fit.
To begin the project, I opened my data (already converted into SAS table files) in SAS and began building a diagram. It is important to add your start code: it allows SAS to connect to your library of data so it can run its algorithms against the correct files. After the two data files were imported, I added each of them to the diagram as a separate data source.
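As an illustration of the start code idea, a minimal sketch is shown below; the libref loyalty and the file path are hypothetical and would need to match wherever the SAS table files actually live.

    /* Start code: assign a library so SAS can find the project data sets. */
    /* The libref loyalty and the path are placeholders for this example.  */
    libname loyalty 'C:\Projects\LoyaltyStudy\Data';

    /* Quick check that both data sets are visible in the new library. */
    proc contents data=loyalty._all_ nods;
    run;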
I began working with the text data first. The first step is adding four data parsing nodes. Each
will serve the purpose of clustering the customers into categories based on a specific variable.
Figure 2 shows the four parsing nodes added to the diagram, which cluster the data based on the customers' textual responses to the questions.
Figure 1 shows that the parsing nodes are instructed to use and report a particular variable while ignoring the remaining three.
Figure 3 indicates that, in the general properties window of the parsing node, the user can instruct SAS which types of words to ignore when analyzing.
To achieve this, each node must have its variables edited so that it parses the data based only on the variable in its title: the node ignores the three variables not associated with its title and focuses on the single variable we want. Another important part of this step is telling the node which words to ignore. In the general properties section, each node lets the user choose which types of words should be dropped. Because SAS includes an English dictionary, the program can categorize each word in a text file as a noun, pronoun, verb, and so on, and the user can then tell SAS which word types to ignore when analyzing the data. This helps SAS gather the necessary information quickly, without the clutter of unnecessary words such as ‘the’, ‘and’, and ‘to’.
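To illustrate what the parsing step is doing conceptually, the sketch below (which assumes a hypothetical data set loyalty.survey_text with a character response variable why_best) breaks each response into words, drops a short list of stop words, and counts the remaining terms. It only approximates the parsing node, which works from a full English dictionary rather than a hand-picked stop list.

    /* Sketch of the parsing idea: tokenize responses and drop common stop words. */
    /* loyalty.survey_text and the variable why_best are hypothetical names.      */
    data work.terms;
       set loyalty.survey_text;
       length term $40;
       i = 1;
       term = lowcase(scan(why_best, i, ' .,!?'));
       do while (term ne '');
          if term not in ('the', 'and', 'to', 'a', 'of') then output;
          i + 1;
          term = lowcase(scan(why_best, i, ' .,!?'));
       end;
       keep term;
    run;

    /* Frequency of the remaining terms, similar to the parsing node's term table. */
    proc freq data=work.terms order=freq;
       tables term;
    run;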
Next, we want to cluster the data analyzed by the parsing node. The node chosen was the one titled (after its question) ‘Why Best Loyalty Card’. Since the goal is to identify the company's most loyal customers, the people clustered under this question are the most likely to fit the profile of a loyal customer. After reviewing the clustering results, we can see how the observations in the text file can be grouped to help us make a prediction about loyal customers.
Figure 4 shows the clustering node using the data analyzed by the parsing node. The results show that customers using similar language can be grouped; the frequency of each group of words is also presented.
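The clustering node performs this grouping automatically. Purely as a hedged illustration of the underlying idea, the sketch below assumes a hypothetical data set loyalty.term_freq in which each customer row already carries term-frequency variables produced by the parsing step (the tf_ names and customer_id are invented for the example) and groups the customers with a simple k-means clustering.

    /* Illustrative k-means clustering on assumed term-frequency variables.    */
    /* loyalty.term_freq, customer_id, and the tf_ variables are hypothetical. */
    proc fastclus data=loyalty.term_freq maxclusters=4 out=work.text_clusters;
       var tf_points tf_service tf_staff;
       id customer_id;
    run;

    /* How many customers fall into each language-based cluster. */
    proc freq data=work.text_clusters;
       tables cluster;
    run;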
After this step, I began working with the numerical data. First, I added an Impute node to resolve the issue of missing data in the file. Regression models tend to drop observations with missing values, leaving less training data to work with, and imputation helps prevent that from happening. Next, we partitioned the data into a training set and a validation set. A separate test set can also be drawn, but that was not incorporated in this project. Instead, I had the node place 80 percent of the data into training and the remaining 20 percent into validation. The training data is used for preliminary model fitting; the validation data is important for comparing multiple models.
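Outside of the Impute and Data Partition nodes, the same two steps could be sketched roughly as follows; the data set loyalty.scores, the attribute names attr1-attr14, and the seed are assumptions made for this example.

    /* Impute: replace missing values in the numeric attributes with their means. */
    proc stdize data=loyalty.scores out=work.imputed method=mean reponly;
       var attr1-attr14;
    run;

    /* Partition: roughly 80 percent training, 20 percent validation. */
    proc surveyselect data=work.imputed out=work.split
                      samprate=0.8 outall seed=12345;
    run;

    data work.train work.valid;
       set work.split;
       if selected then output work.train;   /* 80 percent for model fitting    */
       else output work.valid;               /* 20 percent for model comparison */
    run;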
Figure 5 shows how the data is separated into bins in both the training and validation sets. According to the observations, about 55 percent of customers consider this business to have the best loyalty card; the other 45 percent did not agree that the business had the best loyalty card.
Next, we add regression models to the diagram. One node fits a regression based on the numerical data only; the other fits a regression using both the numerical and textual data. We can then compare the two models to determine which is the better identifier of loyal customers. Since the most loyal customers would consider this business to have the best loyalty card, these are the people we want to target. Using the training data, we can fit each model and check whether it identifies our targeted customers.
Figure 6 shows the two nodes used to create regressions from different combinations of data sources. The top node uses only the numerical data; the bottom node uses both numerical and textual data. The two models can then be compared.
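In code form, the two regressions could be sketched roughly as below. The binary target best_card, the numeric attributes attr1-attr14, and the text-cluster membership variable text_cluster carried over from the clustering step are all assumed names; logistic regression is used here because the target is binary.

    /* Regression 1: numerical attributes only (assumed names).             */
    proc logistic data=work.train;
       model best_card(event='1') = attr1-attr14;
       score data=work.valid out=work.valid_num;
    run;

    /* Regression 2: numerical attributes plus the text-cluster membership. */
    proc logistic data=work.train;
       class text_cluster;
       model best_card(event='1') = attr1-attr14 text_cluster;
       score data=work.valid out=work.valid_both;
    run;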
Finally, we can compare our two different models. In doing so, we hope to identify which
regression is a better predictor of loyal customers. We add a node to the end of the diagram chain that
compares the two models. Figure 7 shows the results.
Based on our results, the regression utilizing both the numerical and textual data was the better model. The final interpretation is always left to the user, but SAS offers its own suggestion based on several fit statistics, and that suggestion is often the same choice the user would arrive at on their own. In Figure 7, the fit statistics show that the regression with both data sets is better. The validation data we set aside earlier was used to determine this after the models were fitted with the training data: according to the average squared error, the better model had a lower error than the alternative. This is our strongest indicator that the better model for identifying loyal customers is the one that includes both data sets.
To further support this, Figure 7 also includes the ROC chart generated by the model comparison node. As the chart indicates, the model using both numerical and textual data has a larger area under the curve and is consistently higher on the sensitivity axis, indicating that it is the better predictor.
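As a rough sketch of what the model comparison node reports, the validation average squared error for each model can be computed from the scored data sets in the previous sketch; P_1 is the predicted probability that PROC LOGISTIC's SCORE statement writes out, and the other names remain the earlier assumptions.

    /* Validation average squared error for each model (P_1 is the scored */
    /* probability that best_card = 1 from the SCORE statements above).   */
    data work.ase;
       set work.valid_num(in=numonly) work.valid_both;
       length model $20;
       if numonly then model = 'numeric only';
       else model = 'numeric + text';
       sq_error = (best_card - P_1)**2;
    run;

    proc means data=work.ase mean;
       class model;
       var sq_error;
    run;

The model with the smaller mean squared error on the validation data is the one the comparison would favor on that statistic.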
This quick example illustrates the importance of text mining when developing predictive models. Textual data is information, too. By clustering this data and using it alongside the numerical data, SAS was able to create a more accurate model. Over the next several months, this project will be expanded to create additional models, including decision trees and other regressions with different properties, to find the most accurate model for identifying a business's most loyal customers. With such a model, businesses can quickly identify loyal customers, giving them a target group for promotions, customer retention efforts, and more.
Businesses can also use this approach to identify which customers to ask for advice. The textual data offered opinions on the business's loyalty card and on the business as a whole. While all customers offer suggestions, the most loyal customers can be trusted to offer the best ones. One could even flip the situation and look for customers using negative words in order to identify problems that should be addressed. The applications of data mining are broad, and its ability to assist in business decisions only increases when text mining is incorporated.
Conclusion:
One model used data from the numerical source only; the second used data from both the textual and numerical files. According to the results, the model utilizing both data sets generated more accurate predictions, a larger area under the ROC curve, and a lower average squared error. Essentially, this model was the better identifier and predictor of loyal customers, exemplifying how text mining can strengthen model accuracy. The same concept applies to many scenarios in the business world, indicating that text mining can be a very useful tool when combined with numerical data mining and predictive modeling.
Literature Cited: No literature is cited directly in the report, but I used the SAS help tool on the SAS website for a couple of definitions, a book from class, and two publications that discuss text mining and apply it to business examples.
Author contact information: Corey Travis; 270-589-8009; [email protected]
Works Cited:
Boire, Richard. "Data Mining for Customer Loyalty." Data Mining for Managers (2014):
1-4. Data Mining for Customer Loyalty. Boire Filler Group, 15 Mar. 2009. Web. 18
Nov. 2015. <http://www.boirefillergroup.com/pdf/DataMiningforCustLoyalty.pdf>.
Delen, Dursun. Real-World Data Mining: Applied Business Analytics and Decision
Making. Upper Saddle River, New Jersey: Pearson Education, 2014. Print.
Nareddy, Maheshwar, and Goutam Chakraborty. "Customer Loyalty Program through Text
Mining of Customers’ Comments." SAS Global Forum (2011): 1-8. SAS Support.
SAS, 2011. Web. 25 Nov. 2015.
<http://support.sas.com/resources/papers/proceedings11/223-2011.pdf>.