Interpreting Black-Box Classifiers Using Instance-Level Visual Explanations

Paolo Tamagnini, Sapienza University of Rome, Rome, RM 00185, Italy
Josua Krause, New York University Tandon School of Engineering, Brooklyn, NY 11201, USA
Aritra Dasgupta, Pacific Northwest National Laboratory, Richland, WA 99354, USA
Enrico Bertini, New York University Tandon School of Engineering, Brooklyn, NY 11201, USA

ABSTRACT
To realize the full potential of machine learning in diverse real-world domains, it is necessary for model predictions to be readily interpretable and actionable for the human in the loop. Analysts, who are the users but not the developers of machine learning models, often do not trust a model because of the lack of transparency in associating predictions with the underlying data space. To address this problem, we propose Rivelo, a visual analytics interface that enables analysts to understand the causes behind predictions of binary classifiers by interactively exploring a set of instance-level explanations. These explanations are model-agnostic, treating a model as a black box, and they help analysts in interactively probing the high-dimensional binary data space for detecting features relevant to predictions. We demonstrate the utility of the interface with a case study analyzing a random forest model on the sentiment of Yelp reviews about doctors.

KEYWORDS
machine learning, classification, explanation, visual analytics

ACM Reference format:
Paolo Tamagnini, Josua Krause, Aritra Dasgupta, and Enrico Bertini. 2017. Interpreting Black-Box Classifiers Using Instance-Level Visual Explanations. In Proceedings of HILDA'17, Chicago, IL, USA, May 14, 2017, 6 pages. DOI: http://dx.doi.org/10.1145/3077257.3077260

1 INTRODUCTION
In this paper we present a workflow and a visual interface to help domain experts and machine learning developers explore and understand binary classifiers. The main motivation is the need to develop methods that permit people to inspect what decisions a model makes after it has been trained.

While solid statistical methods exist to verify the performance of a model in an aggregated fashion, typically in terms of accuracy over hold-out data sets, there is a lack of established methods to help analysts interpret the model over specific sets of instances. This kind of activity is crucial in situations in which assessing the semantic validity of a model is a strong requirement. In some domains where human trust in the model is an important aspect (e.g., healthcare, justice, security), verifying the model exclusively through the lens of statistical accuracy is often not sufficient [4, 6, 8, 15, 16, 24].
This need is exemplified by the following quote from the recent DARPA XAI program: "the effectiveness of these systems is limited by the machine's current inability to explain their decisions and actions to human users [...] it is essential to understand, appropriately trust, and effectively manage an emerging generation of artificially intelligent machine partners" [9].

Furthermore, being able to inspect a model and observe its decisions over a data set has the potential to help analysts better understand the data and ultimately the phenomenon it describes. Unfortunately, the existing statistical procedures used for model validation do not communicate this type of information, and no established methodology exists. In practice, this problem is often addressed using one or more of the following strategies: (1) build a more interpretable model in the first place, even if it reduces performance (typically decision trees or logistic regression); (2) calculate the importance of the features used by a model to get a sense of how it makes decisions; (3) verify how the model behaves (that is, what output it generates) when fed with a known set of relevant cases one wants to test.

All of these solutions, however, have major shortcomings. Building more interpretable models is often not possible unless one is ready to accept relevant reductions in model performance. Furthermore, and probably less obvious, many models that are considered interpretable can still generate complicated structures that are by no means easy to inspect and interpret by a human being (e.g., decision trees with a large set of nodes) [8]. Methods that rely on the calculation of feature importance, e.g., weights of a linear classifier, report only on the global importance of the features and do not tell much about how the classifier makes decisions in particular cases. Finally, manual inspection of specific cases works only with a very small set of data items and does not assure a more holistic analysis of the model.

To address all of these issues we propose a solution that provides the following benefits:
(1) It requires only the ability to observe the input/output behavior of a model, and as such it can be applied to any existing model without access to its internal structure.
(2) It captures decisions and feature importance at a local level, that is, at the level of single instances, while enabling the user to obtain a holistic view of the model.
(3) It can be used by domain experts with little knowledge of machine learning.

The solution we propose leverages instance-level explanations: techniques that compute feature importance locally, for a single data item at a time. For instance, for a text classifier such techniques produce the set of words the classifier uses to make a decision for a specific document. The explanations are then processed and aggregated to generate an interactive workflow that enables the inspection and understanding of the model both locally and globally. In Section 4, we describe in detail the workflow (Figure 2), which consists of the following steps. The system generates one explanation for each data item contained in the data set and creates a list of features ranked according to how frequently they appear in the explanations.
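As a minimal illustration of this ranking step (a sketch that assumes each explanation is available as a set of feature names; it is not the actual Rivelo implementation), the ranked list can be built by counting how many explanations each feature appears in:

```python
from collections import Counter

def rank_features(explanations):
    """Rank features by how often they appear across instance-level explanations.

    `explanations` is assumed to be an iterable of sets (or lists) of feature
    names, one per explained instance -- a simplification of Rivelo's format.
    """
    counts = Counter()
    for explanation in explanations:
        counts.update(set(explanation))  # count each feature once per explanation
    return counts.most_common()          # [(feature, n_explanations), ...], most frequent first

# Toy example with three explanations from a document classifier
print(rank_features([{"friendly", "recommend"}, {"friendly"}, {"call"}]))
# e.g. [('friendly', 2), ('recommend', 1), ('call', 1)] (order of ties may vary)
```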
Once the explanations and the ranked list are generated, the user can interact with the results as follows:
(1) the user selects one or more features to focus on specific decisions made with them;
(2) the system displays the data items explained by the selected features, together with information about their labels and correctness;
(3) the user inspects them and selects specific instances to compare in more detail;
(4) the system provides a visual representation of the descriptors/vectors that represent the selected data items (e.g., words used as descriptors in a document collection) and permits the user to visually compare them;
(5) the user can select one or more of the descriptors/vectors to get access to the raw data if necessary (e.g., the actual text of a document).

In the following sections, we describe this process in more detail. We first provide background information on related work; we then describe the methods used to generate the explanations, followed by a more detailed description of the visual user interface and the interactions developed to realize the workflow. Finally, we provide a small use case to show how the tool works in practice and conclude with information on the existing limitations and how we intend to extend the work in the future.

2 RELATED WORK
Different visual analytics techniques and tools have been developed to inspect the decision-making process of machine learning models, and several authors have advocated for the need to make models more interpretable [12]. In this section, we describe the previous work that is most relevant to our proposed model explanation process.

Many of the explanation methods investigated by researchers employ a white-box approach, that is, they aim at visualizing the model by creating representations of its internal structures. For instance, logistic regression is often used to create a transparent weighting of the features, and visualization systems have been developed to visualize decision trees [22] and neural networks [14, 19]. These methods, however, can only be applied to specific models and, as such, suffer from limited flexibility.

Figure 1: Illustrating the process of explanation computation. x is an observed instance vector. We treat the ML model as a function f that maps a vector to a prediction score. The explanation algorithm tries to find the shortest vector e for which f(x − e) is below the threshold. This vector e serves as an explanation for the prediction.

Another option is to treat the model as a black box and try to extract useful information out of it. One solution developed in the past is the idea of training a more interpretable model out of an existing complex model, e.g., inferring rules from a neural network [7]. A more recent solution is the idea of generating local explanations that are able to explain how a model makes a decision for single instances of a data set [11, 13, 20]. These works are excellent solutions to the problem of investigating single instances, but there are no established methods to go from single instances back to a global view of the model. This is precisely what our work tries to achieve by using instance-level explanations and embedding them in an interactive system that enables their navigation.

Another related approach is to visualize sets of alternative models to see how they compare in terms of the predictions they make. For instance, ModelTracker [1] visualizes predictions and their correctness and how these change when some model parameters are updated.
Similarly, MLCube Explorer [10] helps users compare model outcomes over various subsets and across multiple models with a data-cube style of analysis. One main difference between these methods and the one we propose is that we do not base our analysis exclusively on model output, but also use the intermediary representation provided by explanations. This enables us to derive more specific information about how a model makes some decisions and to go beyond exploring what decisions it makes.

Somewhat related to our work are interactive machine learning solutions in which the output of the model is visualized to allow the user to give feedback to the model and improve its decisions [2, 3, 21]. The goal of our work, however, is to introduce methods and tools that can be easily integrated in existing settings and workflows adopted by domain experts, and as such it does not rely on the complex modifications necessary to include explicit user feedback in the process.

3 INSTANCE-LEVEL EXPLANATIONS
An instance-level explanation consists of a set of features that are considered the most responsible for the prediction of an instance, i.e., the smallest set of features that have to be changed in the instance's binary vector to alter the predicted label. We used a variant of the instance-level explanation algorithm designed by Martens and Provost [17] for document classification.

Figure 2: The Rivelo workflow involves the following user interactions: selection of features, selection of the explanation, inspection of vectors, and exploration of the raw data. By switching back and forth between those steps, the user can freely change the selections in a smooth and animated visualization. This explanation-driven workflow can extract global behaviors of the model from patterns of local anomalies through human interaction.

The overall process of explanation generation is illustrated in Figure 1. Rivelo computes an explanation for each instance using the input-output relationships of a black-box model. The technique works by creating artificial instances derived from observed values in order to examine the influence of the features on the output. It assumes binary feature vectors. Starting with the original binary vector x and its predicted label, the algorithm "removes" features from the vector, creating an artificial instance x − e. Features are added to the vector of "removed" features e until the predicted label of x − e is different from that of x. The set E = {k | e_k = 1} of "removed" features is then called an explanation of the original vector x. Removing, in this context, means changing a feature from being present to not being present. We chose the term "removing" because it is particularly intuitive for sparse binary data.

The technique works only for sparse binary data, where prediction responsibility can be assigned to the features that are present in an instance. This is the case for bag-of-words representations in document classifiers, but also for any kind of data where instances represent a set of items, like medications in the medical domain or products bought or liked in market basket analysis. The machine learning model is assumed to deterministically compute a prediction score f(x) between 0 and 1 for a given input vector x. The predicted label is then computed by comparing this score to a given optimal threshold.
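For concreteness, the following is a minimal sketch of this removal loop, using the greedy choice described below (always removing the present feature whose removal moves the score furthest towards the other side of the threshold). The scoring function f, the NumPy-based representation, and the length cap of five features are assumptions for illustration; this is not the exact implementation used by Rivelo or by Martens and Provost.

```python
import numpy as np

def explain_instance(x, f, threshold, max_len=5):
    """Greedy instance-level explanation for a binary vector `x` (illustrative sketch).

    `f` maps a binary feature vector to a prediction score in [0, 1]; the
    predicted label is `f(x) >= threshold`. Features are "removed" (set from
    1 to 0) one at a time, always picking the present feature whose removal
    moves the score most towards the other side of the threshold, until the
    predicted label flips. Returns the set of removed feature indices, or
    None if no explanation of length <= max_len is found.
    """
    x = np.asarray(x).copy()
    original_label = f(x) >= threshold
    removed = []
    while len(removed) < max_len:
        present = np.flatnonzero(x)                 # features still present
        if present.size == 0:
            return None
        best_k, best_score = None, None
        for k in present:
            x[k] = 0
            score = f(x)
            x[k] = 1
            if best_score is None or (score < best_score if original_label else score > best_score):
                best_k, best_score = k, score
        x[best_k] = 0                               # remove the most impactful feature
        removed.append(best_k)
        if (f(x) >= threshold) != original_label:   # label flipped: explanation found
            return set(removed)
    return None                                     # grew too long; discarded
```

The relaxation described next (removing randomly chosen zero-impact features and compacting the result in a post-processing step) is omitted from this sketch.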
This optimal threshold is computed on the training data by minimizing the number of misclassifications.

The process of removing features from a vector is the core of the explanation algorithm. The original algorithm by Martens and Provost [17] consists of successively removing the feature with the highest impact on the prediction score towards the threshold, until the label changes. This results in the most compact explanation for a given instance. If the prediction score cannot be moved towards the threshold by removing a feature, the instance is discarded without outputting an explanation. To increase the number of explained instances, and thereby provide a fuller picture of the model's decisions in our visualization, we relax this restriction by removing random features with no impact on the prediction score. Oftentimes, after some features are removed at random, the prediction score starts changing again, leading to an explanation for an otherwise discarded instance. However, removing features with no impact on the prediction score violates the compactness property of the resulting explanation. In order to restore this property, we add a post-processing step that re-adds features that do not contribute to a favorable change of the prediction score. This process is time-consuming, requiring us to pre-compute explanations offline.

Another difference between our approach and the algorithm by Martens and Provost [17] is the handling of explanations that grow too long. Long explanations are not intuitively interpretable by a user, so they are discarded in the original algorithm. As we add randomly chosen features to the explanation, in some cases it might grow too long before the compacting step. In order not to discard explanations that are only temporarily too long, we perform the length check after this step.

4 RIVELO: THE EXPLANATION INTERFACE
We implemented the workflow we briefly described in the introduction in an interactive visual interface we call Rivelo (source code available at https://github.com/nyuvis/rivelo). The application in its current implementation works exclusively with binary classifiers and binary features, and it expects as input a data set and a trained classifier. The data set must also contain ground truth information, that is, for each data item, the correct label the classifier is supposed to predict. Once the system is launched, it automatically computes the following information: one explanation for each data item; information about whether the prediction made by the classifier is correct or incorrect (including false positives and false negatives); and the list of features, ranked according to how frequently they appear in the explanations. For each feature, it also computes the ratio between positive and negative labels, that is, whether the feature tends to predict more positive or negative outcomes, and the number of errors the classifier makes when predicting items whose explanations contain that feature.

The user interface is made of the following main panels, which reflect the steps of the workflow (Figure 3): a feature list panel (1, 2) on the left, to show the list of ranked features; the explanations panel (3, 4, 5) next to it, to show the explanations and data items containing the selected features; the descriptors panel (6), containing a visual representation of the vectors/descriptors of the data items selected in the explanations panel; and the raw data panel (7, 8), containing the raw data of selected vectors. We now describe each panel and the interactions with it in more detail.
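Before turning to the individual panels, here is a minimal sketch of how the per-feature aggregates mentioned above (explanation frequency, positive ratio, and error count) could be computed from the explanation set. The record format, with one tuple of explanation features, predicted label, and true label per explained instance, is an assumption for illustration and not Rivelo's actual data model.

```python
from collections import defaultdict

def feature_statistics(explained_instances):
    """Aggregate per-feature statistics over explained instances (illustrative sketch).

    `explained_instances` is assumed to be a list of tuples
    (explanation_features, predicted_label, true_label) with boolean labels.
    """
    stats = defaultdict(lambda: {"explanations": 0, "positives": 0, "errors": 0})
    for features, predicted, truth in explained_instances:
        for feat in set(features):
            s = stats[feat]
            s["explanations"] += 1                  # how many explanations use this feature
            s["positives"] += int(predicted)        # how often it explains a positive prediction
            s["errors"] += int(predicted != truth)  # misclassified items it helps explain
    for s in stats.values():
        s["positive_ratio"] = s["positives"] / s["explanations"]
    return dict(stats)
```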
Figure 3: The user interface of Rivelo. (1) is a selected feature and (2) shows its indicators for average prediction, errors, and frequency within the explanation set. Some of the features in this list are grayed out because they would not return any explanation if added to the query. (3) is the selected explanation, with the number of explained documents and the number of used features. (4) are the cells representing each explained instance, with color encoding the prediction value. Among them, (5) marks some false positive instances. (6) is the descriptor visually representing one of the explained instances. (7) is the raw data related to the same descriptor. (8) shows in bold how the explanation feature is used in the text data.

The feature list panel displays the features computed at startup and lists the following information for each: the name of the feature, the ratio of positive labels, the number of errors, and the number of explanations it belongs to. The ratio is displayed as a colored dot, with shades that go from blue to red to depict ratios of positive outcomes between 0% and 100%. The number of errors and the number of explanations are depicted using horizontal bar charts. Users can sort the list according to three main parameters: (1) number of explanations, which enables them to focus on importance, that is, how many decisions the classifier actually makes using that feature; (2) ratio of positive labels, which enables them to focus on specific sets of decisions; and (3) number of errors, which enables them to focus on correctness, that is, which features and instances create most of the problems.

Once one or more features are selected, the explanations panel displays all explanations containing the selected features. The explanations are sorted according to the number of explained instances. Each explanation has a group of colored cells next to it ((4) in Figure 3) that represent one instance each. The color of the cell encodes the prediction (blue for positive and red for negative), with different shades depending on the prediction value. When the cell represents an instance that is classified incorrectly, it is marked with a small cross to indicate the error ((5) in Figure 3).

By selecting an explanation or a single cell, the user can display the explained instances in the descriptors panel. The panel displays a list of visual "descriptors", each depicting the vector of values used to represent an instance in the data set ((6) in Figure 3). The descriptor is designed as follows. Each rectangle is split into as many (thin) cells as the number of features in the data set. A cell is colored with a dark shade (small vertical dark segments in the image) when a feature is present in the instance and left empty when it is not present. A cell is colored in green when it represents a feature contained in the explanation. The background color represents the predicted label. The main use of descriptors is to help the user visually assess the similarity between the selected instances according to which (binary) features they contain. When looking at the list of descriptors, one can at a glance learn how many features are contained in an instance (that is, how many dark cells are in it) and how similar the distribution of features is.
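As an illustration of this encoding only (not Rivelo's actual rendering code, which draws these cells in the browser), the state of each descriptor cell can be derived as follows; the string labels stand in for the colors described above.

```python
def descriptor_cells(instance_vector, explanation, predicted_label):
    """Derive the cell states of a descriptor for one instance (illustrative sketch).

    instance_vector: binary sequence with one entry per feature in the data set.
    explanation:     set of feature indices contained in the instance's explanation.
    predicted_label: bool; drives the background color of the descriptor.
    """
    cells = []
    for k, present in enumerate(instance_vector):
        if k in explanation:
            cells.append("green")   # feature is part of the explanation
        elif present:
            cells.append("dark")    # feature present in the instance
        else:
            cells.append("empty")   # feature absent
    background = "blue" if predicted_label else "red"
    return background, cells
```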
Descriptors are particularly useful in the case of instances with sparse features, that is, when the number of features present in a given instance is much smaller than the number of those that are not present.

Next to the descriptors panel, the raw data panel displays the actual raw data corresponding to the selected descriptors. This is useful to create a "semantic bridge" between the abstract representation used by the classifier and the descriptor representation, and the original data. In our example, the data shown is text from a document collection, but different representations can be used in this panel to connect the abstract vector to the actual original data contained in the data set (e.g., images). In this case we also highlight, in each text snippet, the words that correspond to the features contained in the vector. Similar solutions can be imagined for other data types.

One additional panel of the user interface is accessible on demand to obtain aggregate statistics about the explanation process and the classifier (not shown in the figure). The panel provides several pieces of information, including: the percentage of explained instances over the total number of instances, the number of instances predicted correctly and incorrectly for each label, the number of features, and the number of data instances, as well as the prediction threshold used by the classifier to discriminate between positive and negative cases.

5 USE CASE: DOCTOR REVIEWS ON YELP
In this section we present a use case based on the analysis of a text classifier used to predict the rating a user will give to a doctor on Yelp, the popular review aggregator. To generate the classifier, we used a collection of 5000 reviews submitted by Yelp users. We first processed the collection using stemming and stop-word removal and then created a binary feature for each extracted term, for a total of 2965 binary features. We also created a label out of the rating field, grouping together reviews with 1 or 2 stars as negative reviews and those with 4 or 5 stars as positive reviews. This way we created 4356 binary vectors, one for each review, of which 3334 (about 76%) represent positive reviews and the remaining (about 24%) negative reviews.

Then, we trained a random forest model [5] on the data and labels just described and obtained a classifier with an area under the ROC curve of AUC = 0.875. We exported the scorer function of the model, the predicted label, and the ground truth label for each review of the test data to generate the input for Rivelo. We also computed the optimal threshold for the scorer function, which is 0.6. The value is closer to 1 to compensate for the high proportion of positive reviews in the data set.

After the first computation of explanations, which took around 1.5 hours to complete, we explained 64% of the reviews with explanations of length up to 5 words. Through post-processing, which took about 1.5 hours as well, we were able to reduce the size of 34.5% of the 1563 explanations that were longer than 5 features. This way we increased the number of explained instances by 542, explaining 76.52% of the instances in our test set. By compacting long explanations in the post-processing, we are able to explain 12% more instances than the original algorithm by Martens and Provost [17]. Figure 3 shows the results obtained by the entire procedure.
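A minimal sketch of this kind of pipeline with scikit-learn is shown below, assuming the reviews are available as raw texts with star ratings. The function name, the train/test split, the number of trees, and the omission of stemming are illustrative assumptions; they are not the exact settings used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def build_rivelo_inputs(texts, stars):
    """Sketch of the preprocessing and training described in the use case.

    Stemming is omitted here for brevity; the paper also applies it.
    """
    keep = [i for i, s in enumerate(stars) if s != 3]       # drop neutral (3-star) reviews
    docs = [texts[i] for i in keep]
    labels = np.array([stars[i] >= 4 for i in keep])        # 4-5 stars -> positive label

    vectorizer = CountVectorizer(binary=True, stop_words="english")
    X = vectorizer.fit_transform(docs)                      # sparse binary bag of words

    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

    # Optimal threshold: the score cut minimizing misclassifications on training data
    train_scores = model.predict_proba(X_train)[:, 1]
    candidates = np.unique(train_scores)
    errors = [np.sum((train_scores >= t) != y_train) for t in candidates]
    threshold = candidates[int(np.argmin(errors))]
    return model, vectorizer, threshold
```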
Looking at the set of features sorted by frequency, one can readily see that most of the model's decisions are made to predict the positive label and that many of the words used capture adjectives expressing positive sentiment, such as "friendly", "love", and "recommend", which have been correctly treated as positive words. Taking a closer look, we also see less obvious words that do not seem to have a straightforward role in the classification, even though they are used more than others in explanations. For example, the word "Dr." is used frequently to explain positive reviews. Using the error bar indicator next to "Dr." ((2) in Figure 3), we can also see that the feature has a very low false positive rate.

When we select this feature, the interface shows all the explanations and instances containing it. We can then see that the large majority of cases is predicted by the word "Dr." alone or by a combination of this word with some other positive term such as "great" and "recommend". We can also see that some of the instances are classified incorrectly, especially those in which the explanation contains exclusively the word "Dr.". To better understand this trend, we inspect several raw text documents associated with these instances and find that reviewers tend to use the name of the doctor preceded by the word "Dr." whenever they have something positive to say, but they tend to refer to the practitioner as "the doctor", using a more generic terminology, when writing a negative review. Through this inspection, we can also better understand how the few false positives explained by "Dr." happen. When we look at the raw text of selected false positive cases, we see that most of them represent rare cases where the patient uses the word "Dr." without the doctor's name (e.g., "Dr. is very nice but the staff is rude."). The model therefore tends to be confused by outlier cases in which a contradiction of the rule is present.

Another interesting word is "call", which, as shown in the figure, tends to predict negative reviews, albeit with a somewhat high error rate. While it is not immediately obvious why this word leads to negative reviews, we find, through the visual inspection enabled by our application, that reviewers tend to mention cases in which they called the doctor's office and received poor assistance. A similar case is the feature "dentist" (not visible in the figure), which also tends to predict negative reviews. The feature, however, also has a somewhat higher error rate, which means many positive reviews are misclassified as negative. Through closer inspection, we realize that the classifier is not able to disambiguate cases in which the word "dentist" is used in a positive context, such as "awesome dentist".

The most common word in explanations, with hundreds of associated documents, is "great". The majority of those documents are true positives, but we can still find and select the few false positives the model generates. Selecting all the reviews misclassified as positive that contain "great", we spot an interesting problem: some reviewers sometimes use the word "great" in a sarcastic way, making the detection of a negative connotation too hard for the classifier. Similarly, we also notice that the word "great" is sometimes used in conjunction with a negation, that is, "not great", making it once again too hard for the classifier to make the correct prediction with its current configuration.
An interesting aspect of this last case is that the manual inspection of misclassified instances can lead to ideas on how the classifier could be improved. For instance, equipping the classifier with means to detect negation may lead to improved performance.

6 DISCUSSION
Our use case shows how the proposed solution can help figure out major decisions made by the classifier, spot potential issues, and possibly also help derive insights on how problems can be solved. The system, however, has in its current implementation a number of relevant limitations, which we discuss below.

First, Rivelo works exclusively with binary classification and binary feature sets. While this specific configuration covers a large set of relevant cases (e.g., we tested the system with a medical data set describing drugs administered to patients in emergency rooms to predict admissions), many other relevant cases are not covered; notably, cases in which the features or the predicted outcome are not binary. To solve this problem we will need to develop explanations and design visual representations able to handle the more general case of non-binary features and multiclass outcomes. While an extension of the technique to multiclass classification is not trivial, extending it to numerical input data could be achieved by adopting techniques like those proposed in Prospector [13] and LIME [20].

Second, the descriptors we use to compare selected instances in terms of their similarity do not lead to an optimal solution. Ideally, we would like to more directly explain what makes instances with similar configurations have differing outcomes and vice versa. One potential solution we will investigate is to train local models to rank the features in the descriptors in ways that highlight the differences that lead to different outcomes (e.g., instances with the same explanation but different outcomes). Another possible extension is to provide a scatter-plot visualization using a projection of the selected instances, similar to the technique used by Paiva et al. [18]. To better compare a large number of instances, we could use a projection based on dimensionality reduction, such as t-SNE [23]. This way we could visually correlate instance similarities and outcomes. Multidimensional projections, however, often suffer from distortion effects that reduce the trustworthiness of the visualization. In addition, they do not provide a direct relationship between the original data space and the trends observed in the projections, making it harder to understand the root causes of observed issues.

Another important limitation is our focus on understanding a single model at a time. Often, important insights can be generated by comparing multiple models. We plan to explore how our technique can be extended to the comparison of multiple models and see what advantages it may provide.

7 CONCLUSION AND FUTURE WORK
In this paper, we have shown how a visual explanation workflow can be used to help people make sense of and assess a classifier using a black-box approach based on instance-level explanations and interactive visual interfaces. Our system, called Rivelo, aggregates instances using explanations and provides a set of interactions and visual means to navigate and interpret them. The work has not been validated with a user study.
Further investigation is needed to fully understand the effectiveness of our approach, its potential limitations, and the extent to which it helps analysts reach useful and accurate conclusions. For this reason, we intend to evaluate the system in the near future through a series of user studies involving analysts and domain experts with a range of expertise in machine learning and in the application domain.

ACKNOWLEDGMENTS
We thank Prof. Foster Provost for his help in understanding and using his instance-level explanation technique. We also thank Yelp for allowing us to use its invaluable data sets. The research described in this paper is part of the Analysis in Motion Initiative at Pacific Northwest National Laboratory (PNNL). It was conducted under the Laboratory Directed Research and Development Program at PNNL, a multi-program national laboratory operated by Battelle. Battelle operates PNNL for the U.S. Department of Energy (DOE) under contract DE-AC05-76RLO01830. The work has also been partially funded by the Google Faculty Research Award "Interactive Visual Explanation of Classification Models".

REFERENCES
[1] Saleema Amershi, Max Chickering, Steven M. Drucker, Bongshin Lee, Patrice Simard, and Jina Suh. 2015. ModelTracker: Redesigning Performance Analysis Tools for Machine Learning. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15). ACM, 337–346.
[2] Saleema Amershi, James Fogarty, Ashish Kapoor, and Desney Tan. 2011. Effective End-user Interaction with Machine Learning. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI '11). AAAI Press, 1529–1532.
[3] Saleema Amershi, James Fogarty, and Daniel Weld. 2012. ReGroup: Interactive Machine Learning for On-demand Group Creation in Social Networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, 21–30.
[4] Bart Baesens, Rudy Setiono, Christophe Mues, and Jan Vanthienen. 2003. Using Neural Network Rule Extraction and Decision Tables for Credit-Risk Evaluation. Management Science 49, 3 (March 2003), 312–329.
[5] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (Oct. 2001), 5–32.
[6] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, 1721–1730.
[7] Mark W. Craven and Jude W. Shavlik. 1998. Using Neural Networks for Data Mining.
[8] Alex A. Freitas. 2014. Comprehensible Classification Models: A Position Paper. ACM SIGKDD Explorations Newsletter 15, 1 (2014), 1–10.
[9] David Gunning. 2017. Explainable Artificial Intelligence (XAI). http://www.darpa.mil/program/explainable-artificial-intelligence. [Online; accessed 2017-April-7].
[10] Minsuk Kahng, Dezhi Fang, and Duen Horng (Polo) Chau. 2016. Visual Exploration of Machine Learning Results Using Data Cube Analysis. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA '16). ACM.
[11] Josua Krause, Adam Perer, and Enrico Bertini. 2014. INFUSE: Interactive Feature Selection for Predictive Modeling of High Dimensional Data. IEEE Transactions on Visualization and Computer Graphics 20, 12 (Dec. 2014), 1614–1623.
[12] Josua Krause, Adam Perer, and Enrico Bertini. 2016. Using Visual Analytics to Interpret Predictive Machine Learning Models. ArXiv e-prints (June 2016). arXiv:1606.05685
[13] Josua Krause, Adam Perer, and Kenney Ng. 2016. Interacting with Predictions: Visual Inspection of Black-box Machine Learning Models. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). ACM, 5686–5697.
[14] Mengchen Liu, Jiaxin Shi, Zhen Li, Chongxuan Li, Jun Zhu, and Shixia Liu. 2017. Towards Better Analysis of Deep Convolutional Neural Networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 91–100.
[15] Yin Lou, Rich Caruana, and Johannes Gehrke. 2012. Intelligible Models for Classification and Regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12). ACM, 150–158.
[16] David Martens, Bart Baesens, Tony Van Gestel, and Jan Vanthienen. 2007. Comprehensible Credit Scoring Models Using Rule Extraction from Support Vector Machines. European Journal of Operational Research 183, 3 (2007), 1466–1476.
[17] David Martens and Foster Provost. 2014. Explaining Data-driven Document Classifications. MIS Quarterly 38, 1 (March 2014), 73–100.
[18] Jose Gustavo S. Paiva, Sao Carlos, and Rosane Minghim. 2015. An Approach to Supporting Incremental Visual Data Classification. IEEE Transactions on Visualization and Computer Graphics 21, 1 (Jan. 2015), 4–17.
[19] Paulo E. Rauber, Samuel G. Fadel, Alexandre X. Falcao, and Alexandru C. Telea. 2017. Visualizing the Hidden Activity of Artificial Neural Networks. IEEE Transactions on Visualization and Computer Graphics 23, 1 (Jan. 2017), 101–110.
[20] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1135–1144.
[21] Gary K. L. Tam, Vivek Kothari, and Min Chen. 2017. An Analysis of Machine- and Human-Analytics in Classification. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017), 71–80.
[22] Stef van den Elzen and Jarke J. van Wijk. 2011. BaobabView: Interactive Construction and Analysis of Decision Trees. In IEEE Conference on Visual Analytics Science and Technology (VAST). IEEE Computer Society, 151–160.
[23] Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
[24] Alfredo Vellido, José D. Martín-Guerrero, and Paulo J. G. Lisboa. 2012. Making Machine Learning Models Interpretable. In Proc. European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Vol. 12. 163–172.