Michael Miller
Eugene Gross
XLS-CS-553-WS 2014S
Assignment 5

This assignment was completed as a group.

The Text Topic node allows for text analysis using topics, which are combinations of two or more individual terms. Using topics greatly increases the accuracy of the analysis because topics capture patterns of words that are representative of data labels. Whereas searching the population data with single dictionary terms often returns a tremendous number of results, topics allow for a more granular search of the population, greatly narrowing the amount of text that must be analyzed. Text topics are typically scored against a training data set for learning purposes; based on the training results, predictions can then be made on new or unknown data.

Image 2 and Image 3 show the topics list for the Text Topic and Text Topic CST nodes. The list records information about each topic, such as the Topic ID, Document Cutoff, Term Cutoff, Topic Category, Number of Terms, and Number of Documents, giving a statistical summary of each topic. In our analysis the Text Topic node created 25 topics and the Text Topic CST node created 50 topics.

Image 4 and Image 5 graph the total number of documents broken down by topic, with each bar representing one topic. From these charts we can see that Topic ID 24 appeared in the fewest documents in the Text Topic node and Topic ID 38 appeared in the fewest documents in the Text Topic CST node. Image 6 and Image 7 graph the total number of terms in the documents broken down by topic; Topic ID 12 and Topic ID 11 contained the most terms per topic in the Text Topic and Text Topic CST nodes respectively. Images 8 and 9 list summary information for the Text Topic nodes' input variables.

The Decision Tree node applies a set of rules, each based on one input, to a data population, producing a number of branching data segments. The Decision Tree is made up of three parts. The first part is the root, the original segment containing all of the data. The second part is the branches, which hold the data that results from applying rules to the data contained in the root. The final part is the leaves, which are created by applying rules to each branch segment. The rules applied at each stage of the Decision Tree are built from input words, which is important for text analysis because it means the rules can be interpreted and are therefore meaningful to humans. All nodes in the Decision Tree contain mutually exclusive sets of rules, meaning that each record from the parent data set falls into only a single child node. The importance of this property is that once the rules have been learned from one set of data, they can be used to predict node membership for future, unknown data populations.

In Images 10 and 11 the Fit Statistics pane of our Decision Tree nodes displays a number of different statistics for each of our three data sets (Train, Validation, and Test). Our CST model had Misclassification Rates very similar to our other model on the Train and Validation data sets but a slightly lower Misclassification Rate on the Test data set. The Root Average Squared Error is likewise very similar for both models on the Train and Validation data sets but is a bit lower for the CST model on the Test set. Both of these statistics indicate that the CST model performs a bit better in terms of predictive accuracy. Overall the Fit Statistics pane gives a useful at-a-glance view for comparing our two models and will often show telltale signs of which model performs best.
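The Misclassification Rate and Root Average Squared Error reported in the Fit Statistics pane can also be computed directly. The sketch below is a minimal illustration in Python with scikit-learn rather than SAS Enterprise Miner; the topic-score features, target definition, and split sizes are all hypothetical stand-ins for our actual data.

```python
# Minimal sketch (Python/scikit-learn, not SAS Enterprise Miner): fit a small
# decision tree on document-by-topic scores and compute the two fit statistics
# discussed above. All data and names here are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 25))                 # 1000 documents x 25 topic scores (made up)
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # hypothetical binary target

# Split into Train / Validation / Test roles, as Enterprise Miner does.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_train, y_train)

def fit_statistics(model, X, y):
    """Misclassification Rate and Root Average Squared Error for one data role."""
    misclassification = np.mean(model.predict(X) != y)
    p_event = model.predict_proba(X)[:, 1]       # predicted probability of the event
    rase = np.sqrt(np.mean((y - p_event) ** 2))  # root average squared error
    return misclassification, rase

for role, (Xr, yr) in {"Train": (X_train, y_train),
                       "Validation": (X_valid, y_valid),
                       "Test": (X_test, y_test)}.items():
    mc, rase = fit_statistics(tree, Xr, yr)
    print(f"{role:>10}: misclassification={mc:.3f}  RASE={rase:.3f}")
```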
The Output pane for our Decision Tree nodes in Images 10 and 11 contains a summary report of all output data generated by running the nodes. This pane yields a great deal of important information, such as a variable summary, model events, variable importance, and a tree leaf report; essentially all of the information that went into constructing the tree is summarized here.

The Score Ranking Overlay pane in Images 10 and 11 graphs assessment statistics for both of our models across deciles. The following assessment characteristics can be plotted on the y-axis of the Score Ranking Overlay graph: cumulative lift, lift, gain, % response, cumulative % response, % captured response, and cumulative % captured response. We have the cumulative lift plotted for both of our Decision Tree nodes. When analyzing the cumulative lift of the models we look at the total area under the curve: more area under the curve indicates a model that outperforms one with less area under the curve. It is evident that both of our Decision Tree nodes performed similarly, and the curves for the Train and Validation data sets were nearly identical for both nodes, showing strong predictive performance.

The Leaf Statistics pane shown in Images 10 and 11 uses a bar chart to represent each leaf in the Decision Tree. Each leaf has two bars, one for the Training data and one for the Validation data, and the height of each bar equals the percentage of donors in that leaf for each data set. From this chart we can see that the Decision Tree CST node generated 37 leaves while the Decision Tree node generated 29. When viewing this chart we look for height consistency in each leaf across both data sets, since consistency between the data sets represents stronger predictive performance and a low rate of error. In our results the Decision Tree node was more consistent than the CST node: leaves 9, 21, and 33 in the CST node showed a large difference between the two data sets, while leaves 22 and 28 in the Decision Tree node showed the greatest differences for that model, though far less dramatic than the three CST leaves. Overall the Decision Tree node showed more consistency between the data sets, so it performed better at grouping data into leaves than the CST model.

The Tree Map pane in Images 10 and 11 uses a graphical representation to show all nodes in the Decision Tree, with the width of each Tree Map node proportional to the number of Training cases it contains. The Tree Map gives a compact view of what are often very large and complicated Decision Trees. We can see in our Tree Maps that the CST node has more branches and leaves than the Decision Tree node. The CST node also has a number of branches and leaves that contain very few Training cases, which can be seen from the relatively small width of several of its nodes.

The Decision Tree panes in Images 12 and 13 display a graphical representation of our data broken down into hierarchical nodes of branches and leaves. We can see that our CST model produced a more complex tree with more branches and leaves.
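Because every split in such a tree is a simple threshold rule on a single input, a fitted tree can also be rendered as readable text rather than a diagram. The sketch below (again Python/scikit-learn with made-up topic names and data, not our actual SAS nodes) prints the root, branch, and leaf rules of a small tree in the same spirit as the rule display discussed next.

```python
# Minimal sketch: print the split rules of a small decision tree as readable
# text, analogous to reading the rules off the Decision Tree / Tree Map panes.
# Feature names and data are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
topic_names = [f"topic_{i:02d}" for i in range(5)]  # stand-ins for topics like "+seizure +hospital ..."
X = rng.random((500, 5))                            # 500 documents x 5 topic scores
y = (X[:, 2] > 0.6).astype(int)                     # hypothetical binary target

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text walks the tree from the root: each level of indentation is one
# branch, and lines ending in "class: ..." are leaves.
print(export_text(tree, feature_names=topic_names, decimals=4))
```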
The Decision Tree pane also lets us see the rules that were applied at each node in the tree. For example, in the Decision Tree node a rule was applied to the root data looking for the occurrence of the topic "+seizure +hospital +patient +day +state". If the rate of occurrence in a document was less than 1.6785, or missing entirely, the document was placed into one branch; conversely, if the rate of occurrence was greater than or equal to 1.6785, the document was moved into the adjacent branch. Successive rules were then applied to each of these branch nodes to further break the data down into leaf nodes. Each node in the tree brings us a step closer to drawing a conclusion through predictive analysis of a large population of data. For example, based on our Decision Tree we can see that among patients who received a vaccination it was extremely rare for a patient to be sore or run a fever post-injection, and rarer still, when a patient did experience these symptoms, for the adverse reaction to last more than 2.5 days. The conclusions drawn from the rules applied in our Decision Tree can then be applied to new sets of unknown data. For example, if 50 million vaccinations were to take place in 2015, we could predict within a certain margin of error the rate of the adverse reactions of soreness or fever post-injection.

The Model Comparison node allows different models to be compared so that we can better understand which ones perform best. To do this the population data is divided into two sets: a training set, where learning takes place, and a test set, where performance is calculated using a number of different statistical methods. Sometimes a third set of data, called a validation set, is used to calculate error after training has been completed. The metrics drawn from the test set are used to draw conclusions about which models perform optimally for the type of data being analyzed.

The Fit Statistics chart in Image 14 lists a number of different statistics for both of our models. Each data role has a number of columns listing statistical calculations that give us an understanding of goodness of fit. We can see that both models have nearly identical Misclassification Rates. This statistic is a good indicator that the models' discriminative power is accurate, and both of our models have low rates of false positives. Statistically the two models are very similar, with the exception of Gain and Lift across all data roles: the CST model has higher lift and gain in all three data roles. This indicates that, on the surface, the CST model performs more optimally than our other model.

Image 15 displays a Receiver Operating Characteristic (ROC) graph that compares the performance of our original model against our CST model. We can see that all three data roles (Train, Validate, and Test) performed relatively consistently, which indicates that our training data role is a sufficient predictor for use with unknown sets of data. The curves for the Train and Validate data roles have very similar apexes, indicating a similar probability that an event will be correctly predicted as an event. A closer look shows that each curve pushes upward and toward the left of the plane, an indicator that our models have discriminatory power and use it well. Our curves also shift toward the lower left region of the plane, an indicator that our data models do not have a high propensity for false positives. In all, the ROC graph for our models shows that they are adequate at cutting off false positives while keeping true positives. When both models are applied to the Test data the CST model has a higher peak than our other model, telling us that on the Test data set the CST model performed slightly better.
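An ROC curve of this kind plots the true positive rate against the false positive rate at every probability cutoff, and the area under it summarizes discriminatory power. The sketch below is a hypothetical Python/scikit-learn illustration of how such curves are computed for two competing models; the two trees of different depths are only stand-ins for our Decision Tree and Decision Tree CST models.

```python
# Minimal sketch: ROC curves and areas under them for two competing classifiers,
# analogous to the comparison in Image 15. Models and data are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
X = rng.random((2000, 25))
y = (X[:, 0] + X[:, 3] + 0.2 * rng.standard_normal(2000) > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "baseline tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "CST-style tree": DecisionTreeClassifier(max_depth=8, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    p_event = model.predict_proba(X_test)[:, 1]   # event probabilities on the Test role
    fpr, tpr, _ = roc_curve(y_test, p_event)      # points along the ROC curve
    auc = roc_auc_score(y_test, p_event)          # area under the curve
    print(f"{name}: Test AUC = {auc:.3f} ({len(fpr)} ROC points)")
```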
In Image 16 we have the cumulative lift plotted for the three data roles of both of our models. When analyzing the cumulative lift of the models we look at the total area under the curve. In all three data roles the CST model has more area under the curve than our other model, indicating that the CST model has the greater predictive capability of the two.

The Model Comparison Output window in Image 17 contains a variable summary listing the variables, their roles, measurement levels, and frequency counts. The Output window also lists all of the data contained within the Fit Statistics window for each of the three data roles. The data are grouped by target variable and then by role, which makes the overall performance of each role easier to read. For comparing models and data roles, the spreadsheet view provided in the Fit Statistics window allows for a quick look at the model performance metrics from which conclusions can be drawn.
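A comparable side-by-side summary can be assembled by hand. The sketch below (Python/scikit-learn, with hypothetical models, data, and split sizes rather than our actual Enterprise Miner output) tabulates misclassification rate, average squared error, and cumulative lift in the top decile for two models across the Train and Test roles, mirroring the kind of spreadsheet view described above.

```python
# Minimal sketch: assemble a small model-comparison table of the kind shown in
# the Fit Statistics / Output windows. Models, data, and split are hypothetical.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def top_decile_cumulative_lift(y_true, p_event):
    """Cumulative lift at the first decile: the event rate among the 10% of cases
    with the highest predicted probability, divided by the overall event rate."""
    order = np.argsort(p_event)[::-1]
    top = order[: max(1, len(order) // 10)]
    return y_true[top].mean() / y_true.mean()

rng = np.random.default_rng(3)
X = rng.random((2000, 25))
y = (X[:, 0] + X[:, 3] + 0.2 * rng.standard_normal(2000) > 1.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Decision Tree CST": DecisionTreeClassifier(max_depth=8, random_state=0),
}

print(f"{'Model':<18}{'Role':<8}{'Misclass.':>10}{'ASE':>8}{'Lift@10%':>10}")
for name, model in models.items():
    model.fit(X_train, y_train)
    for role, Xr, yr in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
        p = model.predict_proba(Xr)[:, 1]
        misclass = np.mean(model.predict(Xr) != yr)
        ase = np.mean((yr - p) ** 2)
        lift = top_decile_cumulative_lift(yr, p)
        print(f"{name:<18}{role:<8}{misclass:>10.3f}{ase:>8.3f}{lift:>10.2f}")
```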
Image 1 - Node Diagram
Image 2 - Text Topic
Image 3 - Text Topic - CST
Image 4 - Text Topic
Image 5 - Text Topic - CST
Image 6 - Text Topic
Image 7 - Text Topic - CST
Image 8 - Text Topic
Image 9 - Text Topic - CST
Image 10 - Decision Tree
Image 11 - Decision Tree - CST
Image 12 - Decision Tree (see included PDF: Assignment_5_DecisionTree.pdf)
Image 13 - Decision Tree - CST (see included PDF: Assignment_5_DecisionTreeCST.pdf)
Image 14 - Model Comparison Fit Statistics
Image 15 - Model Comparison ROC Chart
Image 16 - Model Comparison Score Ranking Overlay (Cumulative Lift)
Image 17 - Model Comparison Output