Understanding ROC Curves and Model Performance

Kristen Hunter, Data Scientist
May 2014

Introduction

Civitas Learning data scientists produce models that make predictions about future outcomes, such as student success. Civitas works across many platforms and collects data from SIS, LMS, CRM, and other data systems from higher education institutions, including public two- and four-year, private for- and non-profit, R1, and system institutions. Our models take hundreds of inputs per student, such as GPA, online activity, and more, and output predictions that can help a school target which students may be at risk of negative outcomes. With these insights, academic leaders, faculty, and students can take action and make better decisions, big and small, to improve learning and ensure student success.
Testing Model Utility

An important step in the data science process is testing which models have the best performance and produce the most helpful results for our partner institutions. To test the utility of a model, we compare the model-produced predicted probability with the actual outcome for a student. For example, if the model predicts a student has a 5 percent chance of not completing a course, and the student drops out of the course, the model produced a poor prediction. If the model predicts a student has a 70 percent chance of not completing a course and the student drops out, the model produced a more useful prediction.

The Problem with Accuracy

One simple, but potentially misleading, way to evaluate a model's performance is to look at accuracy. For each student, the model predicts a probability of not completing a course, and we can use these probabilities to categorize students as at risk or not at risk of dropping out. For example, if a student has a greater than 50 percent predicted chance of dropping out, we predict the student will not complete the course; otherwise, we predict completion. We then calculate simple accuracy, which is the percentage of students for whom we predicted the correct outcome.

Accuracy is a good starting point, but it can make very useful models appear to have poor performance. Let's assume we made a simple prediction that every student would complete a course (100 percent), and in reality, 70 percent of students completed the course. In this case, we would already be 70 percent accurate with a very naïve model, so a more complex model with 70 percent accuracy may not seem to provide any improvement. However, the complex model could provide much more value than its accuracy might suggest. The problem with using accuracy as a key metric for evaluating model performance is that it does not take into account resource constraints.

Predicting Student Success

Let's say I am an instructor with 100 students in one of my courses, and on average 30 of them do not complete the course each semester. I would like to know which students are at risk and would benefit most from targeted interventions to help steer them in the right direction. A predictive model can help me rank my students based on their predicted probability of not completing the course, and thus help me target those students who are at greatest risk and who would benefit the most from interventions. ROC curves help quantify and visualize the true power of a predictive model in a resource-constrained world.
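To make this concrete before turning to the ROC curve itself, the short sketch below is a hypothetical simulation of such a course (the simulated risk scores and the 50 percent cutoff are illustrative assumptions, not Civitas model output). It shows how accuracy understates the difference between a naïve everyone-completes rule and a model whose real value lies in its ranking.

    # Illustrative sketch: why accuracy alone can hide the value of a risk model.
    # Hypothetical simulation of a 100-student course in which 30 do not complete.
    import random

    random.seed(0)

    # 1 = the student does not complete the course, 0 = the student completes.
    outcomes = [1] * 30 + [0] * 70
    random.shuffle(outcomes)

    # Hypothetical model scores: noisy predicted probabilities of not completing,
    # higher on average for the students who truly do not complete.
    predicted_risk = [min(1.0, max(0.0, 0.25 * y + random.gauss(0.18, 0.10)))
                      for y in outcomes]

    # Naive rule: predict that every student completes the course.
    naive_accuracy = sum(1 for y in outcomes if y == 0) / len(outcomes)

    # Thresholded model: flag a student if the predicted risk exceeds 50 percent.
    model_accuracy = sum(1 for y, p in zip(outcomes, predicted_risk)
                         if (p > 0.5) == (y == 1)) / len(outcomes)

    print(f"naive accuracy: {naive_accuracy:.0%}")   # 70 percent by construction
    print(f"model accuracy: {model_accuracy:.0%}")

    # The model's real value is the ordering: rank students by predicted risk and
    # count how many true non-completers appear among the 20 highest-risk students.
    ranked = sorted(zip(predicted_risk, outcomes), reverse=True)
    top_20_detections = sum(y for _, y in ranked[:20])
    print(f"non-completers among the 20 highest-risk students: {top_20_detections} of 30")

The exact numbers depend on the simulated scores; the point is that the gap in accuracy is small relative to the gap in usefulness, which only shows up once the students are ranked.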
Finding Your Way Around a ROC Curve: Quantifying Improvement from Random Ordering to Predictive Model Ordering

The horizontal axis of a ROC curve measures the rate of incorrect predictions: how many students the model predicts are at risk of not completing the course, but in fact complete the course. This is the false alarm rate, or false positive rate. The vertical axis measures the rate of correct predictions: how many students the model predicts are at risk of not completing, and in fact do not complete the course. This is the detection rate, or true positive rate.

Imagine that you take the 100 students from the case study above and plot their predictions, as shown on the graph. Begin at the bottom left-hand corner of the graph, where both the false alarm rate and detection rate are zero. This point displays what happens if you target no students: you have no incorrect predictions, but no correct predictions either. As you move up and to the right, the graph shows what happens to these two rates as you target increasingly large numbers of students. At the top right-hand corner of the graph, you have decided to target all 100 students. In this case, if you intervene with all 100 students and have a completion rate of 70 percent, you expect 70 false alarms and 30 detections, i.e., 70 students who complete and 30 who do not.

The straight black line assumes that you ordered the students completely randomly. If you randomly target 50 of your 100 students, then on average 70 percent of them will complete and 30 percent will not. You will expect to have 50 * 0.3 = 15 detections, or students who will not complete the course, and 50 * 0.7 = 35 false alarms, or students who will complete the course. You have targeted half of your students, so your false alarm rate and detection rate are both 50 percent: you have captured half of the total number of detections and false alarms. If you select students randomly, your expected detection and false alarm rates stay equal to each other as you select more students. For example, if you target 75 students, your false alarm and detection rates will both be 75 percent.

The curved red line represents ordering the student interventions based on a predictive model. For each student, you predict the probability that he or she will not complete the course, and then you target the students in order, from highest to lowest risk of not completing. You can see that at every false alarm rate, the detection rate is higher than if you ordered the students randomly. This means that at every point, you are intervening with more true positives and fewer false alarms: you are intervening with those who are in fact least likely to complete the course and most likely to benefit from an intervention. Specific examples follow to further explain this concept.
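To make the construction of these two lines concrete, here is a minimal sketch that reuses the same hypothetical simulated course as the earlier sketch (the risk scores are illustrative assumptions, not output from a Civitas model). It orders students from highest to lowest predicted risk and records the false alarm rate and detection rate after each additional student is targeted.

    # Illustrative sketch: tracing the detection and false alarm rates that make
    # up the curve, using the same hypothetical simulated course as above.
    import random

    random.seed(0)
    outcomes = [1] * 30 + [0] * 70          # 1 = does not complete, 0 = completes
    random.shuffle(outcomes)
    predicted_risk = [min(1.0, max(0.0, 0.25 * y + random.gauss(0.18, 0.10)))
                      for y in outcomes]     # hypothetical model scores

    total_positives = sum(outcomes)                     # 30 students who do not complete
    total_negatives = len(outcomes) - total_positives   # 70 students who complete

    # Model ordering: target students from highest to lowest predicted risk and
    # record the (false alarm rate, detection rate) after each additional student.
    ranked_outcomes = [y for _, y in sorted(zip(predicted_risk, outcomes), reverse=True)]
    detections = false_alarms = 0
    model_curve = [(0.0, 0.0)]
    for y in ranked_outcomes:
        detections += (y == 1)              # targeted a student who does not complete
        false_alarms += (y == 0)            # targeted a student who completes
        model_curve.append((false_alarms / total_negatives,
                            detections / total_positives))

    # Random ordering: on average, both rates simply equal the fraction of
    # students targeted, which is the straight diagonal line on the plot.
    fa_rate, det_rate = model_curve[20]     # after targeting the 20 highest-risk students
    print(f"model ordering, 20 targeted:  detection rate {det_rate:.0%}, "
          f"false alarm rate {fa_rate:.0%}")
    print("random ordering, 20 targeted: detection rate 20%, false alarm rate 20%")

Plotting the points in model_curve, with the diagonal where the two rates are equal as the reference line, reproduces the red-versus-black comparison described above.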
Resource Tradeoffs

Let's say you start by targeting 20 students out of the 100. The diagonal line from the random order to the red curve shows the improvement from the predictive model, and is known as the equal error rate. With random ordering, you expect to have 14 true positives and 6 false positives. With the predictive model ordering, you expect to have 17 true positives and 3 false positives.

Instead, let's assume you are able to target more students, as long as the extra students are likely to be true detections. The vertical line is the fixed false alarm rate, because you maintain the same number of false alarms. Instead of expecting 14 true detections and 6 false alarms, you expect 18 true positives and 6 false positives: you target 4 extra students, and they are all expected to be true positives.

Finally, you may have the case where you would like to target fewer students, as long as the students that you do target are the ones most likely to benefit from an intervention. This horizontal line is the fixed true positive rate, because you maintain the same number of true positives. Instead of expecting 14 true positives and 6 false alarms, you expect 14 true positives and 2 false alarms: you target 4 fewer students, and the students you drop are all expected to be false alarms.

Model Performance Measure    Students Targeted    Detections    False Alarms
Random Ordering              20                   14            6
Equal Error Rate             20                   17            3
Fixed False Alarm Rate       24                   18            6
Fixed Detection Rate         16                   14            2
With a predictive model, if you have fixed resources, you can expect more detections (students who will not complete) and fewer false alarms (students who will complete); if you increase your resources, you can expect to gain only true positives; and if you decrease your resources, you can expect to lose only false alarms.
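The same ranked list can be read off at different operating points, mirroring the three choices summarized in the table: fix the number of interventions, fix the number of false alarms you are willing to tolerate, or fix the number of detections you need. The sketch below reuses the hypothetical simulation from the earlier sketches, so its counts are illustrative and will not match the table; the budget of 6 false alarms and the goal of 14 detections are borrowed from the table purely as example inputs.

    # Illustrative sketch: three ways to choose an operating point on the curve,
    # using the same hypothetical simulated course as the earlier sketches.
    import random
    from itertools import accumulate

    random.seed(0)
    outcomes = [1] * 30 + [0] * 70          # 1 = does not complete, 0 = completes
    random.shuffle(outcomes)
    predicted_risk = [min(1.0, max(0.0, 0.25 * y + random.gauss(0.18, 0.10)))
                      for y in outcomes]     # hypothetical model scores

    # Rank students from highest to lowest predicted risk, then tabulate the
    # cumulative detections and false alarms after targeting the top k students.
    ranked = [y for _, y in sorted(zip(predicted_risk, outcomes), reverse=True)]
    cum_detections = list(accumulate(ranked))
    cum_false_alarms = [k - d for k, d in enumerate(cum_detections, start=1)]

    # 1. Fixed resources: you can intervene with exactly 20 students.
    k = 20
    print(f"target {k} students: {cum_detections[k - 1]} detections, "
          f"{cum_false_alarms[k - 1]} false alarms")

    # 2. Fixed false alarm budget: tolerate at most 6 interventions on students
    #    who would have completed anyway.
    budget = 6
    k = max(i + 1 for i, fa in enumerate(cum_false_alarms) if fa <= budget)
    print(f"at most {budget} false alarms: can target {k} students, "
          f"{cum_detections[k - 1]} detections")

    # 3. Fixed detection goal: reach at least 14 of the 30 non-completers.
    goal = 14
    k = min(i + 1 for i, d in enumerate(cum_detections) if d >= goal)
    print(f"at least {goal} detections: only {k} students need to be targeted, "
          f"{cum_false_alarms[k - 1]} false alarms")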