+ Kernel Methods in Data Analytics: Predicting College Football Outcomes by Logistic Regression Theodore Trafalis, School of Industrial and Systems Engineering , University of Oklahoma Panel in Sports Analytics, International Conference in Sports Management Athens, Greece May 9, 2016 + Agenda Background Existing Approaches Model Development Parameters Used Offensive Defensive Interactions Other Results and Interpretation Validation Conclusions and Recommendations Works Cited + Background College Football Post Season 30-40 Bowls College Football Playoffs Huge Viewership 6 Million per game in 2014 33 Million watched Championship Game Millions bet on outcomes of Bowl Games Uncommon matchups Lots of Uncertainty + Existing Approaches Las Vegas Oddsmakers Massey Ratings ELO Ratings Developed by Arpad Elo to rank Chess players Adapted to sports, videogames, programming, etc. S&P+ Ratings Conglomerate of Computer Polls History of Computer Polls in BCS era Derived from Play by Play data Efficiency Explosiveness Field Position Finishing Drives Turnovers Football Power Index Developed by ESPN Predictive and Simulation based + Model Development Binomial Logistic Regression Fits dependent variable into one of two non-overlapping sets Can utilize many different types of input variable Nominal Ordinal Interval Without Interactions With Interactions + Model Development Parameter Fitting 36 Games from 2014 used as training data Method of Least Squares Program Used Excel’s Solver GRG package + Parameters Used Offensive Metrics 8 Team Statistics Consider both teams Defensive Metrics 8 Team Statistics Consider both teams Quarterback Rating Conversion Metrics Disruptive Metrics Penalty Metrics Outside Rankings Conference Interaction Effects + Offensive Metrics Yardage Metrics Total Yards Total Yards per Game Passing Yards Passing Yards per Game Rushing Yards Rushing Yards per Game Scoring Metrics Total Points Scored Points Scored per Game + Defensive Metrics Yardage Metrics Total Yards Allowed Total Yards Allowed per Game Passing Yards Allowed Passing Yards Allowed per Game Rushing Yards Allowed Rushing Yards Allowed per Game Scoring Metrics Total Points Allowed Points Allowed per Game + Quarterback Rating Abbreviated QBR Measure of Quarterback Quality Completion % Yards per Attempt Touchdown % Interception % + Conversion and Conference Metrics Conversion Metrics Measure of a team’s ability to maintain possession of the ball. Number of First Downs 3rd Down Conversion Rate 4th Down Conversion Rate Conference Metric Power 5 vs. Great 5 + Disruptive Metrics Defensive Disruption Measure of Defense’s ability to disrupt Offense Sacks Interceptions Fumbles Offensive Disruption Measure of team discipline and ability to keep offensive drives on track Total Penalty Yards Assessed + Outside Rankings Massey Ranking Conglomeration of Computer Polls Las Vegas Spread Made by Las Vegas Book Keepers Goal is to insure equal betting on both teams Negative spread means team is favored + Interaction Effects Offense vs. Defense Total Yards vs. Total Yards Allowed Points Scored per Game vs. Points Allowed per Game Rushing Yards per Game vs. Rushing Yards Allowed per Game Passing Yards per Game vs. Passing Yards Allowed per Game QBR vs. Defense Disruptive Sack Ratio Interception Ratio Fumble Ratio Conversion 1st Down Ratio 3rd Down Conversion Ratio 4th Down Conversion Ratio Penalties Penalty Yard Ratio Ranking Massey Ranking Ratio Vegas Spread + Model Results + Model Results + Model Results—Unexpected + Model Fit Able to correctly categorize 73/79 (92.4%) CFB bowl outcomes from 2014. Vegas 58.3% Massey 61.1% ELO 66.7% S&P+ 63.9% FPI 58.3% + Cross Validation—2013 Ran model for 2013 bowl games Model Accuracy dropped to 57.1% Existing methods also dropped in accuracy Vegas 62.9% Massey 60% ELO 60% S&P+ 54.3% FPI 51.4% + Cross Validation—2012 Ran model for 2013 bowl games Model Accuracy dropped to 55.7% Existing methods also dropped in accuracy Vegas 60% Massey 60% ELO 51% S&P+ 48.2% FPI 45.2% + Conclusions and Recommendations Conclusions Model Works for Categorizing Games a posteriori Not great for a priori predictions Model is competitive with other predictive measures Recommendations Use ANOVA to filter out unwanted parameters Incorporate other predictive measures Path Forward Compare Binomial Multiple Logistic Regression to SVM + Works Cited [1] Rishe, P. (2015, January 15). Reviewing The 2014-15 Bowl Season: Highest Bowl Game Prices, Attendances, And TV Ratings. Retrieved November 30, 2015, from http://www.forbes.com/sites/prishe/2015/01/15/reviewing-the-2014-15-bowl-seasonhighest-bowl-game-prices-attendances-and-tv-ratings/ [2] Purdum, D. (2015, January 30). Wagers, bettor losses set record. Retrieved November 30, 2015, from http://espn.go.com/chalk/story/_/id/12253876/nevada-sports-bettorswagered-lost-more-ever-2014 [3] World Football Elo Ratings: Rating System. (n.d.). Retrieved November 30, 2015, from http://www.eloratings.net/system.html [4] Football Outsiders. (n.d.). Retrieved November 30, 2015, from http://www.footballoutsiders.com/stats/ncaa [5] Hosmer, D., & Lemeshow, S. (1989). Introduction to the Logistic Regression Model. In Applied logistic regression (pp. 1-30). New York: Wiley. [6] Diaz, A., Tomba, E., Lennarson, R., Richard, R., Bagajewicz, M., & Harrison, R. (2010). Prediction of protein solubility in Escherichia coli using logistic regression. Biotechnol. Bioeng. Biotechnology and Bioengineering, 374-383. + Predicting Major League Baseball Championship Winners through SVMs Table 1 Summary of attributes evaluated in this study Attribute Category Total Runs Scored (R) Offensive Stolen Bases (SB) Offensive Batting Average (AVG) Offensive On Base Percentage (OBP) Offensive Slugging Percentage (SLG) Offensive Team Wins Record Team Losses Record Earned Run Average (ERA) Pitching Save Percentage Pitching Strikeouts per nine innings (K/9) Pitching Opponent Batting Average (AVG) Pitching Walks plus hits per inning pitched (WHIP) Pitching Fielding Independent Pitching (FIP) Pitching Double Plays turned (DP) Defensive Fielding Percentage (FP) Defensive Wins Above Replacement (WAR) Baserunning, Hitting, and Fielding + SVM The SVM algorithm, developed by Vapnik (1998), is frequently applied in machine learning. The SVM algorithm for binary classification problems constructs a hyperplane that separates a set of training vectors into two classes (e.g., tornadoes vs. non-tornadoes). The objectives of SVMs for the primal problem are to maximize the margin of separation and to minimize the misclassification error. We utilize the probabilistic outputs for SVMs proposed by Platt (1999). + Illustration of SVM + American League Pennant Model Selection Standard SVM Imbalanced data , bias towards L (majority class) 63 100% 0 0% 20 0 100% 0% + Different RBF Classifiers Gaussian, γ=50 Gaussian, γ=30,000 Gaussian, γ=300 Gaussian, γ=300,000 + Comparison of the accuracy of different Gaussian RBF classifiers Model Cost for Majority (L) Cost for Minority (W) gamma (γ) Accuracy (%) Case 1 1 3 50 80.7 Case 2 1 3 300 71.8 Case 3 1 3 30,000 86.7 Case 4 1 3 300,000 98.8 + Sweet Spot + Classifier selected to perform prediction on the American League pennant race 47 74.6% 16 25.4% 8 12 40.0% 60.0% National League Pennant Model + Selection Sweet Spot Sweet Spot Sweet Spot Sweet Spot + The classifier selected to perform prediction on the National League pennant race 49 14 77.8% 22.2% 5 15 25.0% 75.0% SVM Model Cost for Majority (L) Cost for Minority (W) gamma (γ) Accuracy (%) Gaussian RBF 1 3 30,000 77.1 + 27 15 64.3% 35.7% 21 21 50.0% 50.0% Figure 7 shows the best classifier acquired from the Linear Kernel SVM algorithm 33 9 78.6% 21.4% 24 18 57.1% 42.9% Figure 8 shows the best classifier acquired from the Quadratic Kernel SVM algorithm 36 85.7% 6 14.3% 21 21 50.0% 50.0% Figure 9 shows the best classifier acquired from the Cubic Kernel SVM algorithm 28 14 66.7% 33.3% 12 30 28.6% 71.4% Figure 10 shows the best classifier acquired from the Gaussian Kernel RBFSVM algorithm Table 4 compares the accuracy values of the best classifier for each SVM algorithm. Machine Learning Algorithm Model Accuracy Attribute (1) Attribute (2) Linear Kernel SVM Quadratic Kernel SVM Cubic Kernel SVM Gaussian Kernel RBF 57.1% 60.7% 67.9% 69% Fielding Percentage WHIP Double Plays turned SLG Batting Average ERA Wins Double plays turned American League National League World Series Figure 11 shows the playoff results of the 2015 Major League Baseball season Figure 12 plots the teams competing in the 2015 American League pennant race on the best classifier developed previously Table 5 Prediction results of the 2015 World Series for several classifiers Machine Learning Algorithm Model Accuracy Actual Result - Linear Kernel SVM 57.1% Quadratic Kernel SVM 60.7% Cubic Kernel SVM 67.9% Gaussian RBF Kernel SVM 69% World Series Winner (W) World Series Loser (L) + End of Presentation Contact: [email protected]
© Copyright 2026 Paperzz