Events Unrelated to Crime Predict Criminal Sentence Length

Nora Barry, Laura Buchanan, Evelina Bakhturina, Daniel L. Chen1

Abstract

In United States District Courts for federal criminal cases, prison sentence length guidelines are established by the severity of the crime and the criminal history of the defendant. In this paper, we investigate the sentence length determined by the trial judge relative to this sentencing guideline. Our goal is to build a prediction model of sentence length that includes events unrelated to crime, namely weather and sports outcomes, in order to determine whether these unrelated events are predictive of sentencing decisions and to evaluate their importance weights in explaining rulings. We find that while several appropriate features predict sentence length, such as details of the crime committed, other seemingly unrelated features, including daily temperature, baseball game scores, and location of trial, are predictive as well. Unrelated events were, surprisingly, more predictive than race, which did not predict sentence length relative to the guidelines. This is consistent with recent research on racial disparities in sentencing that highlights the role of prosecutors in bringing charges that influence the maximum and minimum recommended sentence. Finally, we attribute the predictive importance of date to the 2005 U.S. Supreme Court case United States v. Booker, after which sentence length more frequently fell near the guideline minimum and the range of minimum and maximum sentences became more extreme.

1 Barry, Buchanan, Bakhturina: NYU Center for Data Science; Chen: Toulouse Institute for Advanced Study

Introduction

I. United States District Court

The United States District Courts (USDC) are the judicial backbone for hearing and sentencing federal crimes in the United States.2 Federal crimes include illegal activity committed on federal land, crimes committed by or against federal employees in particular roles, matters involving federal government regulations (e.g., illegal immigration, federal tax fraud, counterfeiting), and crimes against the U.S. that occur outside of the United States, such as terrorism.3 Among federal crimes, the most frequently heard cases involve immigration, drug trafficking, firearms, and fraud. Most frequently, the defendant in a case enters a plea agreement with the prosecutor, which is then approved or denied by the judge.4 Otherwise, a sentencing trial is held and the judge determines the sentence for the criminal to serve: probation, federal prison, or both. In either situation, the judge has final say on the criminal sentence.

There are 94 district courts in the United States. At least one district court is located in each state or U.S. territory. States that are large or have a large population have sub-state regional courts instead.

The United States Sentencing Commission (USSC)5,6 produces the sentencing guidelines for federal judges to use when they make their sentencing decisions. The judges are given a guideline range for the criminal sentence that is based upon the severity of the crime and the defendant's criminal history. Due to these guidelines, the largest factor determining sentence range is the criminal charges brought to the judge by the prosecutor. For this paper, we use federal sentencing data made available by the USDC and previously curated by one of the authors.

2 "Court Role and Structure." United States Courts. N.p., n.d. Web. 10 May 2016.
3 "Types of Cases." United States Courts. N.p., n.d. Web. 10 May 2016.
4 "Plea Bargain." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.
5 "United States Sentencing Commission." United States Sentencing Commission. N.p., n.d. Web. 13 May 2016.
6 "United States Sentencing Commission." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.

II. Role of the Prosecutor

The primary factors determining criminal sentence in USDC cases have been found to be, understandably, the criminal charges presented by the prosecutor and the criminal history of the defendant. However, research suggests that features unrelated to the case bias sentence length. In one study, the research team found that “blacks receive sentences that are almost 10 percent longer than those of comparable whites arrested for the same crimes.”7 This disparity can be primarily explained by the criminal charges the prosecutors present to the court. Specifically, when the defendant is black, the prosecutor is more likely to present a charge that carries a mandatory minimum sentence. Many other examples of disparities based on race, sex, education, income, etc., exist in the literature.8

7 Rehavi, M. Marit, and Sonja B. Starr. "Racial Disparity in Federal Criminal Sentences." Journal of Political Economy 122.6 (2014): 1320-354. Web.
8 Mustard, David B. "Racial, Ethnic and Gender Disparities in Sentencing: Evidence from the US Federal Courts." SSRN Electronic Journal (n.d.): n. pag. Web.

III. Within Sentencing Range

Discrepancies across choice of criminal charges do not fully explain these disparities. Judges are also known to, for example, give females a sentence nearer the guideline minimum, or prescribe criminal sentences outside of the guideline range for males. This motivates our decision to focus on sentence length relative to the recommended guideline range.

For the USDC, the United States Sentencing Commission writes recommended sentence minimum and maximum terms to help ensure that convicts who committed similar crimes receive similar sentences.
As can be seen in the lookup table in Figure 1, the severity of the crime and the criminal history of the convict are used to determine the appropriate sentence range. The judge then determines or approves a sentence length that is frequently, but not necessarily, within this range. In this paper, we take the recommended sentencing range as given and predict where the sentence length falls relative to this range.

Knowing that discrepancies in sentence length exist, and that the sources of these discrepancies have not been fully uncovered, we investigate a new set of features that may bias sentence length. In particular, we explore whether characteristics of events that co-occur with sentencing decisions predict, and potentially bias, those outcomes. The event types we chose to examine are weather and sports.

First, we consider the difference in conditions when sentence length falls below or above the midpoint of the sentence guideline range, to assess the feasibility of prediction in this setting. We then move to investigating sentence length percentile relative to the sentence guideline range. This standardization allows us to look at where within a guideline range a sentence falls. The interpretation of this percentile measure is described in Table 1 below. We perform regression to predict this percentile.

Figure 1. Federal Sentencing Lookup Table

Range percentile    Interpretation
< 0%                sentence length below guideline minimum (rare)
0% - 50%            sentence length between guideline minimum and guideline midpoint
50% - 100%          sentence length between guideline midpoint and guideline maximum
> 100%              sentence length above guideline maximum (rare)

Table 1. Interpretation of Range Percentile Measure

Data Description

I. United States District Court Data

The United States District Court Federal Sentencing data was made available by the Office of Research and Data in the United States Sentencing Commission. This data spans federal court cases from 1992–2013. There are 35 features in this data, characterizing the defendant and crime.
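The range percentile measure summarized in Table 1 can be illustrated with a minimal sketch. It assumes the percentile is a min-max normalization of sentence length against the guideline minimum and maximum, expressed in percent, which is what the rows of Table 1 imply. The function names, the handling of the shared 50% boundary, and the 57 to 71 month range in the usage lines are illustrative assumptions and are not taken from the dataset.

```python
from typing import Optional


def range_percentile(sentence: float, guideline_min: float,
                     guideline_max: float) -> Optional[float]:
    """Position of a sentence relative to its guideline range, in percent.

    0 means the sentence equals the guideline minimum, 100 means it equals
    the guideline maximum. Returns None when the range has no width.
    """
    width = guideline_max - guideline_min
    if width <= 0:
        return None  # assumption: such cases have no usable range
    return 100.0 * (sentence - guideline_min) / width


def interpret(percentile: float) -> str:
    """Map a percentile to the rows of Table 1 (50% assigned to the lower bin)."""
    if percentile < 0:
        return "below guideline minimum"
    if percentile <= 50:
        return "between guideline minimum and midpoint"
    if percentile <= 100:
        return "between guideline midpoint and maximum"
    return "above guideline maximum"


# Hypothetical guideline range of 57 to 71 months.
at_midpoint = range_percentile(64, 57, 71)
below_range = range_percentile(50, 57, 71)
```

Under these assumptions a sentence of 64 months sits exactly at the midpoint of the range, and a sentence of 50 months falls below the guideline minimum and therefore receives a negative percentile.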
We keep 15 of these features due to their interpretability. Dummy variables were created as needed for categorical features including race, location and citizenship. This brought us to a total of 253 features.

• Date: continuous time variable
• Sentencing Month
• Sentencing Year
• Trial: 1 if a trial occurred
• Sex: 1 if the defendant is male
• Citizen: categorical variable denoting U.S. citizenship status
• Drug Crime: 1 if the crime involved drugs
• Crime Type: categorical variable denoting the crime type
• Race: categorical variable denoting the race of the defendant
• State: categorical variable denoting the state in which the crime occurred
• District: categorical variable denoting the district in which the crime occurred
• Probation Office: categorical variable denoting the probation office in which the pre-sentencing report is prepared
• NumCounts: the number of counts of conviction
• Education: categorical variable denoting the education level of the defendant
• Pre/Post Booker: 1 if the trial occurs before the United States v. Booker decision (explained further in the Discussion section)

We also utilized some of the court-defined sentencing features, such as Guideline Minimum and Maximum sentence length, Court Recommendation, Gun Min I & II, Gun Max I & II, Normed Range, Court Minimum and Court Maximum, in our baseline model, but ultimately dropped them when building our final model. These features are explained further in the Appendix.

In our baseline model, our target variable was a binary indicator in {-1,1} that we computed to denote whether a sentence length fell below or above the midpoint of the sentence range. Afterwards, our target variable was the sentence length percentile relative to the range. We compute this value using standard (min-max) normalization against the guideline range. Because our target variable is defined in terms of the minimum and maximum of the sentence range, we dropped the minimum and maximum sentence range features when fitting our model to prevent data leakage.

II. Weather Data

In order to properly account for the weather in each district on a given day, we used a dataset originating from the NOAA (National Oceanic and Atmospheric Administration) database. This dataset consists of daily weather for 96 cities from 1992–2013. It includes over 90 features that depict various aspects of the weather conditions for each day. However, many of these features contain missing values or are merely transformations of other features. For this reason, we chose to include only the features below. We also feel these features capture the aspects of weather that are necessary in exploring a potential relation between weather and judge decisions.

Table 2. Weather Data Features

III. Sports Data

Sports data available to us included data from MLB (Major League Baseball), NBA (National Basketball Association), NFL (National Football League), NHL (National Hockey League), and college football (CFB) for the years in which we had U.S. District Court data.

For the four professional leagues (MLB/NBA/NFL/NHL), there was an instance of each team in each game played (i.e., each game had two instances). While the features available were not identical across sports, they were generally similar, and included information such as team name, field played on, score, and betting over/under.

For the CFB data, there was one instance per game. The CFB data is not as complete as the professional sports data. This is understandable due to the organization of college football competitions. Teams typically play schools of the same size, budget, and quality of facilities.9 Due to this, some games played by smaller schools are not recorded. However, the games played by the Division I schools, the schools with the most developed football programs and likely the greatest regional following, are well represented. This data included team name, field played on, score, and so on.

For each of the five sports datasets, we created two dataframes.
The first dataframe includes information about each team per game on the same day as the trial. We assumed that the judge would not know the result of the game before the end of the workday. For this dataframe, we included the date, team name, whether a game occurred, and whether the game would be played at the home stadium or away. District was also needed, but was added later.

Date    Team    Game Y/N    Home/Away    District

Table 3. Features in Day-Of Sports Dataframe

The second dataframe represents the results of each game. The assumption is that if the outcome of a game were to affect the outcome of a trial, the trial would only be influenced by games played the day before.10 Because we were concerned with the game the day before, our second dataframe listed the date one day after the date that the game was played. We then included the team name, whether the game occurred, whether the game was played at the home stadium or away, the points scored by the team, the points scored by the opposing team, the score margin (difference between the teams' scores), and whether the team won or lost. Again, district was included later.

Date + 1    Team    Game Y/N    Home/Away    Points Scored    Points Allowed    Score Margin    Win/Loss

Table 4. Features in Day-Prior Sports Dataframe

9 "College Football." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.
10 A more sophisticated approach would be to taper off the weighting of game importance for several days following each game. We did not take this approach here.

There were several challenges when pre-processing the sports data so that they could be organized into these dataframes. For example, the score margin was not included in all data, and was calculated in these cases. The CFB data was organized differently from the professional sports data, so each game instance had to be split into one result row per team.
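The construction of the day-prior rows in Table 4 can be sketched as follows. This is a minimal illustration in plain Python rather than the authors' code: the function name, the dictionary keys, the team names, and the treatment of a tie as a non-win are all assumptions made for the example.

```python
from datetime import date, timedelta


def day_prior_rows(game_date, home_team, away_team, home_points, away_points):
    """Split one game into two per-team rows dated one day after the game.

    Dating the row at game_date + 1 lets it be joined to sentencing
    decisions made the day after the game was played.
    """
    effective_date = game_date + timedelta(days=1)
    rows = []
    for team, venue, scored, allowed in (
        (home_team, "home", home_points, away_points),
        (away_team, "away", away_points, home_points),
    ):
        margin = scored - allowed  # computed here because not all sources include it
        rows.append({
            "date": effective_date,
            "team": team,
            "game": 1,
            "home_away": venue,
            "points_scored": scored,
            "points_allowed": allowed,
            "score_margin": margin,
            "win": 1 if margin > 0 else 0,  # assumption: a tie counts as 0
        })
    return rows


# Hypothetical game: Team A (home) beats Team B (away) 5 to 3 on 30 June 2005.
rows = day_prior_rows(date(2005, 6, 30), "Team A", "Team B", 5, 3)
```

The same splitting step applies to the CFB data, where each source record describes a whole game rather than one team's view of it.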
A lookup table between the team names and the district that would presumably be interested in that team was curated manually. For the lookup table to remain useful, several simplifying assumptions had to be made. In the first pass, each team was paired with the district where its home stadium was located. This meant that major cities such as Los Angeles, which is located in the Central California District Court district, were represented several times in the lookup table.11 New York City was challenging in that Brooklyn falls under the Eastern New York District Court, and the rest of New York City falls under the Southern New York District Court. In the majority of cases, New York City teams were represented by both districts, unless Brooklyn had its own team.

After each team was paired with its “hometown” district, we induced spatial spread in the professional sports data. First, in states that have several districts but only one team, the team was paired with all districts in that state. If a state had several districts and several teams, fandom maps based on Facebook likes12 were used to determine the more popular team in the ambiguous districts in that state. Finally, Boston teams were assigned to all New England districts, assuming homogeneity of fandom. If a district had no team and no obvious way to induce spread, it was not assigned any team (e.g., Guam, Puerto Rico, Montana, etc.).

11 Consider MLB (Los Angeles Dodgers/Los Angeles Angels of Anaheim), NBA (Los Angeles Clippers/Los Angeles Lakers), and NHL (Los Angeles Kings/Anaheim Ducks).

Due to the number of CFB teams, and assumptions about college football fan followings, we did not feel that spreading data outside of the district in which the school is located was appropriate or desired.
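The first two steps of the lookup-table construction described above (pairing each team with its home district, then spreading a state's only team to every district in that state) can be sketched as below. The state, district, and team names are placeholders, and the function is hypothetical. The manual steps, namely the fandom-map decisions for multi-team states and the New England assignment, are deliberately left out.

```python
def assign_teams(home_district, state_districts):
    """Return a mapping from district to the set of teams assigned to it.

    home_district: team -> (state, district of the home stadium)
    state_districts: state -> list of all districts in that state
    """
    teams_by_state = {}
    for team, (state, _district) in home_district.items():
        teams_by_state.setdefault(state, []).append(team)

    assignment = {}
    for team, (state, district) in home_district.items():
        # First pass: every team is paired with its home-stadium district.
        assignment.setdefault(district, set()).add(team)
        # Spread: a state's only team is paired with all of its districts.
        if len(teams_by_state[state]) == 1:
            for other in state_districts[state]:
                assignment.setdefault(other, set()).add(team)
    return assignment


# Placeholder data: state X has one team, state Y has two.
home = {
    "Team X": ("X", "X-North"),
    "Team Y1": ("Y", "Y-East"),
    "Team Y2": ("Y", "Y-West"),
}
districts = {"X": ["X-North", "X-South"], "Y": ["Y-East", "Y-West"]}
assignment = assign_teams(home, districts)
```

In this sketch, districts in multi-team states keep only their home-stadium teams, and a district with no team simply does not appear in the result, mirroring the treatment of districts that were not assigned any team.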
We chose not to use the betting over/under information included in the professional sports data, though that would be an interesting area of research worth pursuing.13 In the college football data, we chose not to include team ranking or whether the game was a special championship. An interesting future research aim would be to give a heavier weight to championship games and bowls, presuming that the lead-up to and results of these games would have a greater impact on the community of fans invested in the game. Similarly, this information could be incorporated into the professional sports data.

12 e.g., Person, and Barry Petchesky. "Here's Facebook's 2015 MLB Fandom Map." Regressing. N.p., 01 Apr. 2015. Web. 13 May 2016.
13 In principle, three types of fans can be envisioned. An extremely avid fan follows the odds and experiences anticipatory utility even before a game. An avid fan follows the odds and experiences utility only in the event of a surprise. A normal fan does not follow the odds and merely responds to the actual sports outcomes. Distinguishing these types of fans would be interesting, and we proceed here with the assumption that judges are normal fans.

IV. Data Merge

i. Weather

To combine the weather data with the district courts data, we merge on date and location. First, we create a datetime feature from the “year”, “month”, and “day” features in the weather data. Second, we alter the location features of both datasets to prepare for the merge. The features “city” and “courthouse” correspond to the location in the weather and district courts datasets, respectively. However, we found that the city names differ between the USDC and weather datasets; that is, there were many courthouses for which there was no corresponding weather data. To avoid dropping criminal cases that do not have corresponding weather data, we created our own metadata to link courthouses in the district data to the nearest city in the weather data.
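The merge on date and location described above can be illustrated with a small sketch in plain Python. The “year”, “month”, “day”, “city”, and “courthouse” fields follow the text. The courthouse and city names, the weather field names, the case records, and the `merge_weather` function are hypothetical, and the real linking metadata was curated by hand.

```python
from datetime import date

# Hypothetical metadata linking each courthouse to its nearest weather city.
COURTHOUSE_TO_CITY = {"Courthouse A": "City 1", "Courthouse B": "City 1"}


def merge_weather(cases, weather_rows, courthouse_to_city):
    """Left-join weather onto cases by (date, nearest weather city)."""
    weather_index = {}
    for row in weather_rows:
        # Build the datetime key from the separate year/month/day features.
        key = (date(row["year"], row["month"], row["day"]), row["city"])
        weather_index[key] = {
            name: value for name, value in row.items()
            if name not in ("year", "month", "day", "city")
        }

    merged = []
    for case in cases:
        city = courthouse_to_city[case["courthouse"]]
        weather = weather_index.get((case["date"], city), {})
        merged.append({**case, **weather})  # every case is kept, matched or not
    return merged


cases = [
    {"case_id": 1, "courthouse": "Courthouse A", "date": date(2005, 7, 1)},
    {"case_id": 2, "courthouse": "Courthouse B", "date": date(2005, 7, 2)},
]
weather = [
    {"year": 2005, "month": 7, "day": 1, "city": "City 1",
     "temp_max": 95.0, "temp_min": 71.0},
]
merged = merge_weather(cases, weather, COURTHOUSE_TO_CITY)
```

The join is written as a left join so that a case with no matching weather record is still retained, which matches the stated aim of not dropping criminal cases.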
With this courthouse-to-city metadata, we were able to merge the two datasets without loss of information. The schema of this merge includes all district court features, along with weather features 0-4 in the weather table above.

ii. Sports

To merge the sports data with the previously merged district court and weather data, we first dropped team name; we were interested in whether home-team games affected the judges' sentences, rather than in particular teams. For each of the sports dataframes described above, we merge on date and district. Each sport is represented separately. If no sports data was available for a day-district combination, the sports data fields were filled with zeros.

Methods & Results

I. Binary Baseline Model

To explore whether sentence length prediction is feasible, we started with a binary classification problem in place of regression. We created our target variable using the “rangePT” feature, which depicts where within the sentencing guidelines the judge's final sentencing decision falls. For example, rangePT = 1 encodes that the judge's sentence length is equal to the guideline minimum, while rangePT = 2 encodes that the judge's sentence falls in the lower half of the range, and so on. The table below depicts how we transformed this feature in detail.

Table 5. Baseline Target Variable Summary

In order to apply models such as support vector machine (SVM) and Logistic Regression for our baseline model, we project the rangePT variable into the set {-1,1} based on where the sentencing decision falls (below or above the midpoint of the range). For consistency, we drop the cases where either the sentence length is equal to the midpoint (rangePT = 3) or there is no guideline range to begin with (rangePT = 6).14 We fit both an SVM and a Logistic Regression model for this classification.
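The projection into {-1,1} can be sketched as follows. The paper derives the label from the coded rangePT feature, whose full coding is given in Table 5. This illustration instead works from raw sentence lengths and guideline bounds, which is an assumption made so that the example is self-contained. The function name and the example cases are hypothetical.

```python
from typing import Optional


def midpoint_label(sentence: float, guideline_min: float,
                   guideline_max: float) -> Optional[int]:
    """Return -1 below the guideline midpoint, +1 above it, None if dropped."""
    if guideline_max <= guideline_min:
        return None  # assumption: stands in for "no guideline range"
    midpoint = (guideline_min + guideline_max) / 2
    if sentence == midpoint:
        return None  # exactly at the midpoint: neither below nor above
    return 1 if sentence > midpoint else -1


# Hypothetical cases as (sentence, guideline minimum, guideline maximum), in months.
cases = [(60, 57, 71), (70, 57, 71), (64, 57, 71), (12, 12, 12)]
labels = [midpoint_label(*case) for case in cases]
kept = [label for label in labels if label is not None]
```

Only the first two cases survive the filtering step here, since the third sits exactly at the midpoint and the fourth has no range.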
Because of the large number of features in our dataset, SVM failed to run on standard computers. As a result, we began to consider other options for larger-scale model building, such as Spark.15 Spark was able to run an SVM baseline model. However, our Logistic Regression model enabled us both to extract the important features (an attribute unsupported by Spark) and to see the accuracy we are able to achieve in this setting. The figures below show the performance of our model and the top 10% most important features learned by the model.

14 These are dropped because there is no sense of being above or below the midpoint for these cases.
15 "Apache Spark™ - Lightning-Fast Cluster Computing." Apache Spark™ - Lightning-Fast Cluster Computing. N.p., n.d. Web. 13 May 2016.

Figure 2. ROC/AUC Represent Accuracy of Logistic Regression

As we can see in the ROC graph, our logistic regression model achieves a high AUC (0.93). However, this classification problem is very different from our goal prediction problem, regression on sentence length percentile within range. Regardless, it is reassuring to see that accurate predictions are possible in this setting.

Figure 3. Top 10% Most Important Features in Logistic Regression Model

The most valuable information that we gain from this model is the analysis of the top 10% most important features learned by the logistic regression model. All features listed are from the USDC dataset, and they encode different aspects of the sentencing decision process. For example, the top two features are “Guideline Min” and “Guideline Max,” which define the sentencing range. This is to be expected. The higher the guideline minimum, the more likely the sentence was above the median recommendation. The higher the guideline maximum, the more likely the sentence was below the median.
In this setting, the features that are defined by the severity of the crime will undoubtedly be more important in predicting sentence length than the weather and sports features. While obvious in retrospect, this initial analysis helped us determine that sentence length percentile within range can be an appropriate target variable.

II. Features of Interest & Optimized Model

For our final model (predicting where sentence length falls relative to the guideline range), we compared Random Forest, Linear Regression and Gradient Boosting models. For Random Forest and Gradient Boosting, we used 100 estimators, whose outputs were combined to produce a final prediction. Although we would have liked to use more estimators, we were limited by the extensive runtime for fitting these models. The figure below shows the mean squared error (MSE) for each of the three models attempted with default parameter values and 100 estimators:

Figure 4. Comparison of Models in Mean Squared Error.

Random Forest was the most accurate of the three in terms of MSE, so we used parameter tuning to choose the best model from this hypothesis space. We tuned the model hyperparameters and increased the number of estimators to 250 to further improve accuracy. The hyperparameters we tuned include min_samples_leaf (the minimum number of samples in newly created leaves) and max_features (the number of features to consider when looking for the best split), which both help control overfitting. The optimal hyperparameters we found were min_samples_leaf = 9 and max_features = 0.6 (60% of features considered at each node split).

Once we fit our final, optimized model, we were able to extract the most important features shown below.
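For reference, the mean squared error used to compare the models above can be written out explicitly. The notation is introduced here for illustration and does not appear in the paper: $y_i$ is the range percentile of case $i$, $\hat{y}_i$ is the model's prediction for that case, and $n$ is the number of evaluated cases.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

If the percentile is measured in percentage points, as in Table 1, then the MSE of 2,941 reported in the Discussion would correspond to a root mean squared error of roughly $\sqrt{2941} \approx 54$ percentage points. This reading depends on that assumption about units.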
Unlike with our logistic regression baseline model, with Random Forest we cannot tell whether these features have a negative or positive importance in our model, i.e., whether they are negatively or positively correlated with our sentencing percentile target variable. Therefore, we compute the correlation between these features and our target variable, and color the bar graph accordingly to aid the reader in interpretation.

Figure 5. Top 10% Most Important Features in Random Forest Model

Discussion

I. Model Performance

Our baseline model predicted whether a criminal sentence falls above or below the sentencing guideline midpoint. It performed well, with a high area under the ROC curve (AUC). Unsurprisingly, we saw that the guideline minimum and maximum were the most predictive features. The other features in the top 10% most predictive were all features from the USDC dataset.

Our final model had modest ability in predicting a continuous value. After tuning hyperparameters, we have a total mean squared error of 2,941, lower than that of any other model tested.

II. Important Features

i. Court Case Information

The most important feature from the USDC data found to predict the percentile within the range of the sentence guideline was the number of previous criminal charges. The sentence length and number of counts are positively correlated, as we would expect. This indicates that the judge is taking particular information about the crime into consideration when determining the sentence length. The number of prior convictions could suggest the likelihood that the defendant is a repeat offender, and therefore the presumption would be that society would benefit from that defendant being imprisoned longer.

A set of additional predictive features was whether the crime involved firearms, arson, or drugs. Again, this is a reassuring sign that the judge is using case-specific information in their decision.

ii. Unrelated Defendant Information

We did not see that the race of the defendant was an important feature in our predictive model, as suggested in previous research. However, we do find that characteristics of the defendant that should not influence sentence length, such as sex and education level, did enter the top 10% most predictive features. It is important to further investigate whether these features truly influence the judge, as it would be unjust if they led to bias.

iii. Time as an Important Feature

The most predictive feature in our Random Forest model was the date of the sentencing decision. To the best of our knowledge, this can be linked to the 2005 United States Supreme Court decision referred to as United States v. Booker.16 This court decision determined that only prior convictions, facts admitted by the defendant, and facts proved to the jury beyond a reasonable doubt could be used to extend the criminal sentence beyond the mandatory maximum. In other words, it introduced situations in which a judge could prescribe a sentence outside the sentencing range. We believe that this formal decision on opportunities to vary sentence length encouraged judges to change the way they made this determination.

Interestingly, while the U.S. v. Booker case questioned the judge's right to increase the sentence length past the maximum guideline sentence, we saw an overall decrease in the length of the sentence term relative to the guideline range. Additionally, the range of minimum and maximum sentences becomes more extreme.

16 "United States v Booker." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.

Figure 6. Trends in Sentencing Pre and Post US v. Booker

Figure 7. Trends in Sentencing (Yang, 2013)

iv. Sports as an Important Feature

In our final model, we found that several sports features, all related to the final scores of Major League Baseball games, do in fact predict criminal sentence length.
In future work, it would be important to determine whether sporting events significantly bias a judge's decision. If these sports features are truly predictive of the judges' sentencing decisions, it is worth noting that they relate to games that happened the prior day, not games that are going to happen.

v. Weather & Location as Important Features

We found that many weather features appear in our top 10% most predictive features. Temperature minimum and maximum were our 2nd and 3rd most predictive features, and were positively correlated with sentence length. Additionally, we found that the location of the courthouse (in particular, courthouses in Arizona, California, and Texas) was predictive of sentence length. Because these are all states in the southern part of the United States, we believe determining why weather and location co-appear among the top predictive features would be an interesting area of future research.

Moreover, we found that USDC cases in the three states mentioned above account for a significant proportion of all USDC cases in the US, as displayed in Figure 8 below. In the Appendix, we show the crime breakdowns of these states. Because these states share a border with Mexico, cases arising in them may be more likely to be heard in a federal criminal court.

Figure 8. Proportion of USDC Cases in Arizona, California, and Texas

Conclusions

A justice system reasonably aspires to be consistent in the application of law across cases and to account for the particulars of a case. Our goal was to create a prediction model of criminal sentence lengths that includes non-judicial factors such as weather and sports events in the feature set. The feature weights offer a natural metric to evaluate the importance of these features unrelated to crime relative to case-specific factors. Using a Random Forest, we found several expected crime-related features appearing within the top 10% most important features.
However, we also found that defendant characteristics (unrelated to the crime), sports game outcomes, weather, and location features were all predictive of sentence length as well, and these features were, surprisingly, more predictive than the defendant's race. Further investigating this predictive ability would be of interest to those studying the criminal justice system.

Finally, date appears as the most predictive feature in determining sentence length. We suspect that judges revised their method of determining sentence length after United States v. Booker. Following this case, sentence length more frequently falls near the guideline minimum, while the range of minimum and maximum sentences becomes more extreme.

References

"Apache Spark™ - Lightning-Fast Cluster Computing." Apache Spark™ - Lightning-Fast Cluster Computing. N.p., n.d. Web. 13 May 2016.

Chen, Daniel L., Tobias J. Moskowitz, and Kelly Shue. "Decision-Making Under the Gambler's Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires." SSRN Electronic Journal (n.d.): n. pag. Web.

"College Football." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.

"Court Role and Structure." United States Courts. N.p., n.d. Web. 10 May 2016.

Mustard, David B. "Racial, Ethnic and Gender Disparities in Sentencing: Evidence from the US Federal Courts." SSRN Electronic Journal (n.d.): n. pag. Web.

Person, and Barry Petchesky. "Here's Facebook's 2015 MLB Fandom Map." Regressing. N.p., 01 Apr. 2015. Web. 13 May 2016.

"Plea Bargain." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.

Rehavi, M. Marit, and Sonja B. Starr. "Racial Disparity in Federal Criminal Sentences." Journal of Political Economy 122.6 (2014): 1320-354. Web.

Starr, Sonja B., and M. Marit Rehavi. "Racial Disparity in the Criminal Justice Process: Prosecutors, Judges, and the Effects of United States v. Booker." SSRN Electronic Journal (n.d.): n. pag. Web.

"Types of Cases." United States Courts. N.p., n.d. Web. 10 May 2016.

"United States Sentencing Commission." United States Sentencing Commission. N.p., n.d. Web. 13 May 2016.

"United States Sentencing Commission." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.

"United States v Booker." Wikipedia. Wikimedia Foundation, n.d. Web. 13 May 2016.

Yang, Crystal S. "Have Inter-Judge Sentencing Disparities Increased in an Advisory Guidelines Regime? Evidence from Booker." SSRN Electronic Journal (n.d.): n. pag. Web.

Appendix

Feature Description:

• Gun Min I & Gun Min II: two features representing the mandatory minimum sentence (according to different calculations)
• Gun Max I & Gun Max II: two features representing the mandatory maximum sentence (according to different calculations)
• Guideline Minimum and Guideline Maximum: two features representing the guideline minimum and maximum sentence length
• Court Recommendation: the recommended sentence length according to a court-defined formula
• Court Minimum and Court Maximum: the court-defined minimum and maximum sentence length (used in the formula for Court Recommendation)
• Normed Range: normalized sentencing range

Crime Breakdown by Location:

The figures below display the breakdown of crime types in the U.S. as a whole and in Arizona, Texas and California. Through these graphs we hoped to determine whether the distribution of crime types in Arizona, Texas and California caused these states to become important features in our final model. Unfortunately, it is difficult to draw a definite conclusion.