The Probability of Winning: Fact, Fantasy and Other Mysteries

Clay Graham – Advantage Analytics, LLC
Sola Talabi – Pittsburgh Technical, Inc.
Risk Conference, November 1-2, 2016, New Orleans

“It's tough to make predictions, especially about the future.” Yogi Berra

A Case Study in Baseball
- Sniff and Kick
- Perspective on Data
- The Game
- Models
- Sniff and Kick

Probability of Winning
- Fundamental to decision making
- Based upon economics
- Rooted in operational characteristics
- Dependent on the context
  - Winning a game
  - Winning a bet
  - Winning a contract

Some Methodologies
- Pythagorean
- Logistic regression
- Monte Carlo simulation
- Bayesian
- Casual empiricism
- SWAG

Perspective on Data

Characteristics of Distributions
- Comparable data
  - Runs / out vs. Runs / game
- Create distributions effectively
- Continuous and discrete

Baseball’s Gaming Winning Margin: Distributed Normally

Runs / Game at Home: Distributed as a Negative Normal

Runs/Out at Home: Distributed as a Gamma Function

Runs/Out on Road: Distributed Approximately as a Gamma Function

The Game

Production Function
- Dependent Variable: Runs/out (not run)
- Independent Variables: get on base or drive those on base around
  - % Singles
  - % Doubles
  - % Triples
  - % Home runs
  - % Base on balls

Production Function [1]
- Forced zero (0) intercept
[1] Runs/out * 27 = expected value runs/game

Following Example Based on 10/13/2016 Game Between Dodgers vs Nationals

Create a Discrete Density Function: For Each Batter Pitcher Matchup [1]
Combine Production Functions

                                          %1    %2+3  %HR   %BB   %outs
Pitcher            Scherzer (Nationals)   .120  .050  .027  .059  .744
Batter             Toles (Dodgers)        .143  .030  .048  .038  .741
Pitcher vs Batter  Resultant              .131  .04   .038  .049  .742

[1] Example from Dodger vs Nationals 10/13 game

Monte Carlo Simulation

Aggregate Batter Pitcher Matchup: Dodgers vs. Scherzer
- Park Factor
- Batters’ characteristics
- Matchup characteristics
- Pitcher’s characteristics

Aggregate Batter Pitcher Matchup: Nationals vs.
Hill

Dodgers vs Nationals 10/13
Expected Number of Runs: 3.796

Dodgers vs Nationals 10/13
Expected Number of Runs: 3.78

Dodgers vs Nationals 10/13
Overlay Comparison

Player Sensitivity: Nationals vs Hill (expected value runs = 3.7796)

Player Sensitivity: Dodgers vs Scherzer (expected value runs = 3.7958)

Monte Carlo Simulation
- P(W Nationals) = 51.1%

Expected Value: Total Runs
- P(>6.5) = 56%

Aggregate Team Matchups
- Last 3 innings
- To Pythagorean Model
- To Logistic Regression

Pythagorean

Bill James’ “Pythagorean” Probability of Winning

$P(W) = \dfrac{R_{\text{scored}}^{2}}{R_{\text{scored}}^{2} + R_{\text{allowed}}^{2}}$

where $R_{\text{scored}}$ is runs scored and $R_{\text{allowed}}$ is runs allowed.

Pythagorean: A More Generalized Approach
- Runs:
  - Scored by home team ($R_h$)
  - Allowed by home team ($R_r$), i.e. scored by road team
- $P(\text{Win}_h) = \dfrac{R_h^{\beta}}{R_h^{\beta} + R_r^{\beta}}$
- $\beta$ will vary by sport, magnitude of measurement, type of measurement

Dodgers vs. Nationals 10/13 Pythagorean
- Runs: $R_h = 3.79$, $R_r = 3.80$ (scored by road team)
- $P(\text{Win}_h) = \dfrac{3.79^{1.8}}{3.79^{1.8} + 3.80^{1.8}} = .50 = 50\%$

Pythagorean (from team matchup): Dodgers vs.
Nationals
- Runs: $R_h = 5.175$, $R_r = 5.689$ (scored by road team)
- $P(\text{Win}_h) = \dfrac{5.175^{1.8}}{5.175^{1.8} + 5.689^{1.8}} = .457 = 45.7\%$

Logistic Regression

Logistic Regression Equation

StatTools Report
Analysis: Logistic Regression
P(W)home = f(K/BB, OB+S; Rd, Hm)
Performed By: graham
Date: Saturday, March 07, 2015
Updating: Static
Variable: Win_Hm

Logistic Regression for Win_Hm

Summary Measures
Null Deviance    8681.88924
Model Deviance   4166.555074
Improvement      4515.334166
p-Value          < 0.0001

Regression Coefficients
            Coefficient  Standard Error  Wald Value  p-Value  Lower Limit  Upper Limit  Exp(Coef)
Constant    -0.0007      0.2166          -0.0033     0.9974   -0.4252      0.4238       0.9993
K/BB_Rd     0.1066       0.0157          6.8116      0.0000   0.0759       0.1373       1.1125
OB+S_Rd     -10.8169     0.3178          -34.0350    0.0000   -11.4398     -10.1940     0.0000
K/BB_Hm     -0.1419      0.0174          -8.1469     0.0000   -0.1761      -0.1078      0.8677
OB+S_Hm     10.9187      0.3150          34.6631     0.0000   10.3013      11.5361      55197

Classification Matrix
            1 Predicted  0 Predicted  Percent Correct
1 - actual  2850         444          86.52%
0 - actual  498          2482         83.29%

Summary Classification
Percent Correct  84.99%
Base             52.50%
Improvement      68.39%

Logistic Regression

Variable  Coefficient  Value
Constant  -0.0007
K/BB Rd   0.1066       2.182
OPS Rd    -10.8169     .808
K/BB Hm   -0.1419      2.397
OPS Hm    10.9187      .631

Probability of Winning Home Team = 12%
Probability of Winning Road Team = 88%

Bayesian – an integrated approach

Bayesian Statistics
- Dynamic method of calculating the probability of winning
- Modify probabilities predicated upon new and relevant information
- Responsive to additional information (daisy chain)

Bayesian: An Integrated Approach
- Change probabilities as a result of new information
- Weight information over the period covered, i.e., number of innings

Bayesian Formula

$p(A_1 \mid B) = \dfrac{p(A_1)\,p(B \mid A_1)}{p(A_1)\,p(B \mid A_1) + p(A_2)\,p(B \mid A_2)}$

Bayesian – Classical View
- Event A = probability of home team winning as a result of Monte Carlo Simulation
  - p(A1) = .51
  - p(A2) = (1-.51) = .49
- Event B = probability of home team winning as a result of regression matchups of expected value of runs
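The two-hypothesis update can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `bayes_update` and its parameters are hypothetical, the prior is the Monte Carlo estimate (.51), the evidence is the team-matchup Pythagorean estimate (45.7%, rounded to .46), and the inning weights (6 of 9 and 3 of 9) are assumed to match the .667/.333 split of the first daisy-chain step.

```python
def bayes_update(prior_a1, evidence_b1, w_prior=1.0, w_evidence=1.0):
    """Two-hypothesis Bayes update; A1 = home team wins, A2 = home team loses.

    Assumption: evidence_b1 stands in for p(B|A1) and 1 - evidence_b1
    for p(B|A2), which is how the daisy-chain slides appear to use it.
    """
    prior_a2 = 1.0 - prior_a1
    evidence_b2 = 1.0 - evidence_b1
    # Weighted terms arranged as on the daisy-chain slide.
    win = (w_prior * prior_a1) * (w_evidence * evidence_b1)
    loss = (w_evidence * prior_a2) * (w_prior * evidence_b2)
    return win / (win + loss)


# Monte Carlo prior (.51, 6 innings) updated with Pythagorean evidence (.46, 3 innings).
posterior = bayes_update(0.51, 0.46, w_prior=6 / 9, w_evidence=3 / 9)
print(f"P(A1|B) = {posterior:.2f}")
```

In this multiplicative form the inning weights appear once in every term of the numerator and denominator, so they cancel; the sketch keeps them only to mirror the layout of the slide.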
Bayesian Part 1: Daisy Chain
- p(A1) = .51
- p(A2) = (1-.51) = .49
- p(B1) = .46
- p(B2) = (1-.46) = .54
- Monte Carlo: 6 innings
- Team Aggregates Pythagorean: 3 innings

$P(A_1 \mid B) = \dfrac{(.667 \times .51)(.333 \times .46)}{(.667 \times .51)(.333 \times .46) + (.333 \times .49)(.667 \times .54)} = .47 = 47\%$

Bayesian Part 2: Daisy Chain
- p(A1) = .47
- p(A2) = (1-.47) = .53
- p(B1) = .88
- p(B2) = (1-.88) = .12
- Daisy Chain 1: for 9 innings
- Logistic Regression: for 3 innings

$P(A_1 \mid B) = \dfrac{(.75 \times .47)(.25 \times .88)}{(.75 \times .47)(.25 \times .88) + (.25 \times .53)(.75 \times .12)} = .47 = 47\%$

Results

Bayesian
- Dodgers 4, Nationals 3
- Probability of a home (Nationals) team win: 47%
- Probability of a road (Dodgers) team win: 53%

Total Runs
- Expected value: 7.5
- Line at 6.5
- P(over 6.5) = 56.2%

Filters to winning
- If P(W) >= 53%, win 65% of games
- If P(W) over (or under) > 54%, win 70%

NOW WHAT?

Time to Invest in
- Players
- Gambling
  - Money Line
  - Over Under

Players

What do we now know for each player
- Expected run production vs right and left handed pitchers
- Historic matchup
- Volatility

Gambling

What we now know
- Probability of winning game

What we can derive
- From betting lines
  - Implied probability of winning
- Betting edge
- Expected ROI

Summary
- Accurate measurement of batter and pitcher effectiveness in meaningful units
- Monte Carlo model of game
- Improved Pythagorean formula (empirically calculated exponent)
- Logistic regression calculated probabilities using significant variables (OPS & K/BB)
- Integrated and weighted Bayesian probability of winning
- Measurement and ranking of batter pitcher matchups

The Curse of The Billy Goat

Prediction 11/2/2016 World Series – Game 7

Total Runs (line at 6.50)
- 40.0% below 6.50, 60.0% above
- P(Over 6.5) = 60%
- Minimum 4.296
- Maximum 20.627
- Mean 7.429
- Std Dev 2.171
- Values 5000

P(W) = 54% Cubs!

Note added 11/3: Cubs won 8 - 7
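The Pythagorean and logistic-regression figures quoted in the deck can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names `pythagorean` and `logistic_home_win` are hypothetical, the exponent 1.8 is the one used in the worked Pythagorean slides, and the regression coefficients are assumed to be on the logit scale (which the Exp(Coef) column of the StatTools report suggests).

```python
import math


def pythagorean(r_home, r_road, beta=1.8):
    """Generalized Pythagorean win probability for the home team."""
    return r_home**beta / (r_home**beta + r_road**beta)


def logistic_home_win(coefs, values):
    """Standard logistic form: 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = coefs["Constant"] + sum(coefs[name] * x for name, x in values.items())
    return 1.0 / (1.0 + math.exp(-z))


# Coefficients from the StatTools report; inputs from the game slide.
COEFS = {
    "Constant": -0.0007,
    "K/BB_Rd": 0.1066,
    "OB+S_Rd": -10.8169,
    "K/BB_Hm": -0.1419,
    "OB+S_Hm": 10.9187,
}
VALUES = {"K/BB_Rd": 2.182, "OB+S_Rd": 0.808, "K/BB_Hm": 2.397, "OB+S_Hm": 0.631}

p_matchup = pythagorean(3.79, 3.80)   # batter-pitcher matchup runs
p_team = pythagorean(5.175, 5.689)    # team matchup runs
p_logit = logistic_home_win(COEFS, VALUES)

print(f"Pythagorean (matchup): {p_matchup:.3f}")
print(f"Pythagorean (team):    {p_team:.3f}")
print(f"Logistic (home win):   {p_logit:.3f}")
```

Under these assumptions the three values round to the 50%, 45.7% and 12% shown on the slides.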