Mike_Miller_Eugene_Gross_Assignment_5

Michael Miller
Eugene Gross
XLS-CS-553-WS 2014S
Assignment 5
The Text Topic node allows for text analysis using topics, which are combinations of two or more individual terms. Using topics greatly increases the accuracy of data analysis because topics look for patterns of words that are representative of data labels. Whereas searching the population data with single terms in a dictionary often returns a tremendous number of results, utilizing topics allows for a more granular search of the population, greatly narrowing down the amount of text to be analyzed. Text topics are often searched against a training data set for learning purposes; based on the training set results, we are then able to make predictions on new or unknown sets of data.
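To make the idea of multi-term topics concrete, here is a minimal sketch in Python using scikit-learn's NMF on a small made-up corpus. The SAS Text Topic node uses its own algorithm, so this is only a rough analogue; the documents, the three-topic count, and the printed term lists are illustrative assumptions, not our actual results.

```python
# Illustrative sketch only: a rough Python analogue of deriving multi-term
# topics from documents. The corpus and topic count below are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

documents = [
    "patient reported fever and soreness after vaccination",
    "hospital admitted patient with seizure the same day",
    "mild soreness at injection site resolved in two days",
    "seizure reported one day after hospital discharge",
    "fever lasted one day following the injection",
    "patient state stable after brief hospital observation",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Derive a small number of topics (our actual nodes produced 25 and 50).
nmf = NMF(n_components=3, random_state=0)
doc_topic = nmf.fit_transform(X)            # document-by-topic weights
terms = vectorizer.get_feature_names_out()

# The top-weighted terms of each component play the role of a topic's term list.
for topic_id, component in enumerate(nmf.components_):
    top_terms = [terms[i] for i in component.argsort()[-4:][::-1]]
    print(f"Topic {topic_id}: +" + " +".join(top_terms))
```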
Image 2 and Image 3 show the topics list for the Text Topic and Text Topic CST nodes respectively. This topics list contains information about each topic such as the Topic ID, Document Cutoff, Term Cutoff, Topic Category, Number of Terms, and Number of Documents, giving a statistical summary view of each topic. In our analysis the Text Topic node created 25 topics and the Text Topic CST node created 50 topics. Image 4 and Image 5 graph the total number of documents broken down by topic; each bar on the graph represents a topic. We can see from these charts that Topic ID 24 was contained in the fewest documents in the Text Topic node and Topic ID 38 was contained in the fewest documents in the Text Topic CST node. Image 6 and Image 7 graph the total number of terms in the documents broken down by topic. Topic ID 12 and Topic ID 11 contained the most terms per topic in the Text Topic node and the Text Topic CST node respectively. Images 8 and 9 list summary information for the Text Topic nodes' input variables.
The Decision Tree node applies a set of rules, each based on one input, to a data population, resulting in a number of branching data segments. The Decision Tree is made up of three different parts. The first part is the root, which is the original data segment containing all of the data. The second part is the branches, which hold the resulting data after rules are applied to the data contained within the root. The final part is the leaves, which are created by applying rules to each branch segment of data. The rules applied to the data at each stage of the Decision Tree are made up of input words. This is an important aspect for text analysis because it means the rules can be interpreted and are thus meaningful to humans. All nodes in the Decision Tree contain a mutually exclusive set of rules, meaning that each child record from the parent data set is found in only a single node. The importance of this aspect is that once the rules have been applied to the data segments, it is possible to learn from them and predict node values for future, unknown data populations.
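As a minimal sketch of these properties (not the SAS Decision Tree node itself), the example below trains scikit-learn's DecisionTreeClassifier on a handful of hypothetical topic-occurrence features and prints its rules with export_text; the feature names, values, and labels are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical topic-occurrence features: how strongly each document
# expresses two text topics (columns), one row per document.
X = [[0.2, 1.9], [1.8, 0.1], [0.3, 2.4], [2.1, 0.0], [0.1, 1.7], [1.6, 0.2]]
y = [0, 1, 0, 1, 0, 1]  # hypothetical target labels

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each branch in the printout is a rule on an input feature; every case
# falls into exactly one leaf, so the leaves are mutually exclusive segments.
print(export_text(tree, feature_names=["topic_seizure", "topic_fever"]))
```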
In Images 10 and 11 the Fit Statistics pane of our Decision Tree nodes displays a number of different statistics for each of our three data sets (Train, Validation, and Test). We can see that our CST model had very similar Misclassification Rates to our other model for both the Train and Validation data sets but had a slightly lower Misclassification Rate when looking solely at the Test data set. The Root Average Squared Error is very similar for both models on the Train and Validation data sets as well, but is a bit lower for the CST model on the Test set. Both of these statistics are indicators that the CST model performs a bit better in terms of predictive accuracy. Overall, the Fit Statistics pane gives a great at-a-glance view for comparing our two models and will often show telltale signs of which model performs best.
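For reference, both statistics can be computed by hand; the sketch below does so on hypothetical targets and predictions rather than on the actual values from our Fit Statistics panes.

```python
import numpy as np

actual = np.array([1, 0, 1, 1, 0, 0, 1, 0])             # hypothetical targets
predicted_class = np.array([1, 0, 0, 1, 0, 1, 1, 0])    # hypothetical class predictions
predicted_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])

# Misclassification rate: share of cases where the predicted class is wrong.
misclassification_rate = np.mean(actual != predicted_class)

# Root Average Squared Error: square root of the mean squared difference
# between the actual outcome and the predicted probability.
rase = np.sqrt(np.mean((actual - predicted_prob) ** 2))

print(f"Misclassification rate: {misclassification_rate:.3f}")
print(f"Root average squared error: {rase:.3f}")
```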
The Output pane for our Decision Tree nodes in Images 10 and 11 contains a summary report of all output data generated by running the nodes. This pane yields a plethora of important information such as a variable summary, model events, variable importance, and a tree leaf report. Essentially, all of the information that went into constructing the tree is summarized in this pane.
The Score Ranking Overlay pane in Images 10 and 11 graphs assessment statistics for both of our models
across deciles. The following assessment characteristics can be plotted on the y-axis of the Score Ranking Overlay graph:

- cumulative lift
- lift
- gain
- % response
- cumulative % response
- % captured response
- cumulative % captured response
We have the cumulative lift for both of our Decision Tree nodes plotted. When analyzing the cumulative lift of the models we are looking at the total area under the curve: more area under the curve indicates a model that outperforms one with less area under the curve. It is evident that both of our Decision Tree nodes performed similarly. The curves for the Train and Validation data sets were nearly identical for both nodes, indicating strong predictive performance.
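A rough sketch of how cumulative lift by decile is computed follows; the scores and targets are randomly generated stand-ins, not the data behind our Score Ranking Overlay.

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.rand(1000)                                        # hypothetical model scores
targets = (scores + rng.rand(1000) * 0.5 > 0.75).astype(int)   # hypothetical outcomes

order = np.argsort(-scores)            # rank cases from highest score down
sorted_targets = targets[order]
overall_rate = targets.mean()

# Cumulative lift = response rate within the top n deciles / overall response rate.
for decile in range(1, 11):
    top_n = int(len(targets) * decile / 10)
    cumulative_rate = sorted_targets[:top_n].mean()
    print(f"Decile {decile}: cumulative lift = {cumulative_rate / overall_rate:.2f}")
```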
The Leaf Statistics pane shown in Images 10 and 11 uses a bar chart to represent each leaf in the Decision Tree. Each leaf has two bars, one for the Training data and one for the Validation data. The height of each bar equals the percentage of target events in that leaf for each data set. From this chart we can see that the Decision Tree CST node generated 37 leaves while the Decision Tree node generated 29. When viewing this chart we are looking for height consistency between the two data sets in each leaf. Consistency between the data sets represents stronger predictive performance and a low rate of error. Looking at our results, the Decision Tree node was more consistent than the CST node. Leaves 9, 21, and 33 in the CST node showed a large difference between the two data sets. Leaves 22 and 28 in the Decision Tree node showed the greatest differences for that model, although the severity was much less dramatic than that of the three CST leaves. Overall the Decision Tree node showed more consistency between the data sets and thus performed better at grouping data into leaves than did the CST model.
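The consistency check described here can be expressed directly; the sketch below compares the per-leaf event percentage between hypothetical Training and Validation assignments (the leaf numbers and targets are made up).

```python
import numpy as np

# Hypothetical leaf assignments and targets for the Training and Validation data.
train_leaf = np.array([1, 1, 2, 2, 2, 3, 3, 3, 3])
train_target = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0])
valid_leaf = np.array([1, 1, 2, 2, 3, 3, 3])
valid_target = np.array([1, 1, 1, 0, 0, 1, 0])

for leaf in sorted(set(train_leaf)):
    train_pct = train_target[train_leaf == leaf].mean() * 100
    valid_pct = valid_target[valid_leaf == leaf].mean() * 100
    # Large gaps between the two bars of a leaf suggest the split does not
    # generalize well beyond the Training data.
    print(f"Leaf {leaf}: train {train_pct:.0f}% vs validation {valid_pct:.0f}%")
```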
The Tree Map pane in Images 10 and 11 uses a graphical representation to show all nodes in the Decision Tree. The width of each Tree Map node is proportional to the number of Training cases contained in that node. The Tree Map gives a compact view of what are often very large and complicated Decision Trees. We can see in our Tree Maps that the CST node has more branches and leaves than the Decision Tree node. The CST node also has a number of branches and leaves that contain very few Training cases, which can be seen in the relatively small width of several of the nodes.
The Decision Tree panes in Images 12 and 13 display a graphical representation of our data broken down into hierarchical nodes of branches and leaves. We can see that our CST model contained a more complex tree with more branches and leaves. The Decision Tree pane allows us to visually inspect the rules that were applied at each node in the tree. For example, in the Decision Tree node a rule was applied to the root data looking for the occurrence of the topic +seizure +hospital +patient +day +state. If the rate of occurrence in a document was less than 1.6785, or missing entirely, then the document was placed into one branch. Conversely, if the rate of occurrence in a document drawn from the root data was greater than or equal to 1.6785, then it was moved into the adjacent branch. Successive rules were then applied to each of these branch nodes to further break the data down into leaf nodes. Each node in the tree brings us a step closer to drawing a conclusion through predictive analysis of a large set of population data. For example, based on our Decision Tree we can see that, of the patients that received a vaccination, it was an extremely rare occurrence for a patient to be sore or run a fever post-injection. It was rarer still, when a patient did experience these symptoms, for the duration of such adverse reactions to exceed 2.5 days. The conclusions drawn from the rules applied in our Decision Tree can then be applied to new sets of unknown data. For example, if 50 million vaccinations were to take place in 2015, we could predict within a certain margin of error the rate of adverse reactions of soreness or fever post-injection.
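A minimal sketch of that single split rule applied to new data follows; the cutoff of 1.6785 comes from the rule above, while the document names and occurrence values are hypothetical.

```python
# The 1.6785 cutoff comes from the split described above; the documents and
# their topic-occurrence values below are made up for illustration.
def assign_branch(topic_occurrence):
    """Route a document to a branch based on its topic occurrence rate."""
    if topic_occurrence is None or topic_occurrence < 1.6785:
        return "branch_low_or_missing"
    return "branch_high"

new_documents = {"doc_a": 0.42, "doc_b": 2.10, "doc_c": None}
for name, occurrence in new_documents.items():
    print(name, "->", assign_branch(occurrence))
```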
The Model Comparison node allows for the comparison of different models to better understand which model performs best. To do this, the population data is divided into two different sets: a training set and a test set. The training set is where learning takes place and the test set is where performance is calculated using a number of different statistical methods. Sometimes a third set of data, called a validation set, is used to calculate error after the training has been completed. The metrics drawn from the test set are used to draw conclusions regarding which models perform optimally given the type of data being analyzed.
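As a sketch of this three-way partition (not the SAS data partition defaults), the example below uses scikit-learn's train_test_split twice on a hypothetical data set to produce Train, Validation, and Test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)              # hypothetical feature matrix
y = (np.arange(100) % 3 == 0).astype(int)      # hypothetical binary target

# Carve off the Test set first, then split the remainder into Train and Validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_valid), len(X_test))  # 60 / 20 / 20 cases
```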
The Fit Statistics chart in Image 14 lists a number of different statistics for both of our models. Each data role has a number of columns listing a statistical calculation that gives us an understanding of goodness-of-fit. We can see that both models have nearly identical Misclassification Rates. This statistic is a good indicator of the accuracy of a model's discriminatory power, and both of our models have low rates of false positives. Statistically speaking the two models are very similar, with the exception of the Gain and Lift across all data roles. The CST model has a higher lift and gain across all three data roles, which indicates that, on the surface, the CST model performs more optimally than our other model.
Image 15 displays a Receiver Operating Characteristic (ROC) graph that compares the performance of our data analysis model against our CST model. We can see that all three data roles (Train, Validate, and Test) performed relatively consistently. This is an indicator that our training data role is a sufficient predictor for use with unknown sets of data. Both curves in the Train and Validate data roles have very similar apexes, indicating a similar probability that an event will be correctly predicted as an event. A closer look at the curves shows that each peak pushes upwards and to the left of the plane, an indicator that our models have discriminatory power and use it well. Near the lower left region of the plane the curves rise steeply, an indicator that our models do not have a high propensity for false positives. In all, the ROC graph for our models shows adequacy in cutting off false positives while keeping true positives. When both models were applied to the Test data we can see that the CST model has a higher peak than our other model, which tells us that on the Test data set the CST model showed slightly better performance than our other model.
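To show how the curve and its area are derived for a single model, the sketch below computes an ROC curve with scikit-learn on hypothetical labels and scores; it is not drawn from our actual models.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

labels = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])                        # hypothetical outcomes
scores = np.array([0.1, 0.35, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.55, 0.3])  # hypothetical scores

fpr, tpr, thresholds = roc_curve(labels, scores)
auc = roc_auc_score(labels, scores)

# A curve that pushes toward the upper-left of the plane (high true positive
# rate at a low false positive rate) indicates strong discriminatory power.
for f, t in zip(fpr, tpr):
    print(f"FPR {f:.2f} -> TPR {t:.2f}")
print(f"Area under the ROC curve: {auc:.2f}")
```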
In Image 16 we have the cumulative lift for the three data roles of both of our models plotted. When analyzing the cumulative lift of the models we are looking at the total area under the curve. In all three data roles the CST model has more area under the curve than our other model, indicating that the CST model has the greater predictive capability of the two models.
The Model Comparison Output window in Image 17 contains a variable summary listing the variables, their roles, measurement levels, and frequency counts. The output window also lists all data contained within the Fit Statistics window for each of the three data roles. Data is grouped by target variable and then by role so it is easier to read the overall performance of each role. For comparing models and data roles, the spreadsheet view provided in the Fit Statistics window allows for a quick view of model performance metrics from which conclusions can be drawn.
Image 1 – Node Diagram:
Image 2 - Text Topic:
Image 3 - Text Topic – CST:
Image 4 - Text Topic:
Image 5 - Text Topic – CST:
Image 6 - Text Topic:
Image 7 - Text Topic – CST:
Image 8 - Text Topic:
Image 9 - Text Topic – CST:
Image 10 - Decision Tree:
Image 11 - Decision Tree – CST:
Image 12 - Decision Tree:
*Reference included PDF (Assignment_5_DecisionTree.pdf)
Image 13 - Decision Tree – CST:
*Reference included PDF (Assignment_5_DecisionTreeCST.pdf)
Image 14 – Model Comparison Fit Statistics:
Image 15 – Model Comparison ROC Chart:
Image 16 – Model Comparison Score Ranking Overlay (Cumulative Lift):
Image 17 – Model Comparison Output:
This assignment was completed as a group.