The Probability of Winning:
Fact, Fantasy and Other Mysteries
Clay Graham – Advantage Analytics, LLC
Sola Talabi – Pittsburgh Technical, Inc.
Risk Conference, November 1-2, 2016 New Orleans
“It's tough to make predictions,
especially about the future.”
Yogi Berra
A Case Study in Baseball
 Perspective on Data
 The Game
 Models
 Sniff and Kick
Probability of Winning
 Fundamental to decision making
 Based upon economics
 Rooted in operational characteristics
 Dependent on the context
 Winning a game
 Winning a bet
 Winning a contract
Some Methodologies
 Pythagorean
 Logistic regression
 Monte Carlo simulation
 Bayesian
 Casual empiricism
 SWAG
Perspective on Data
Characteristics of Distributions
 Comparable data
 Runs / out vs. Runs / game
 Create distributions effectively
 Continuous and discrete
Baseball’s Gaming Winning Margin Distributed Normally
Runs / Game at Home Distributed as a Negative Normal
Runs/Out at Home Distributed as a Gamma Function
Runs/Out on Road Distributed Approximately as a Gamma Function
The Game
Production Function
 Dependent Variable
 Runs/out (not run)
 Independent Variables
 Get on base or drive those on base around
 % Singles
 % Doubles
 % Triples
 % Home runs
 % Base on balls
Production Function¹
Forced zero (0) intercept
¹ Runs/out * 27 = expected value of runs/game
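As a minimal sketch of how a zero-intercept production function turns outcome rates into expected runs per game: the coefficient values and the example rates below are hypothetical placeholders (the fitted coefficients are not given in the slides), and the function names are illustrative only.

```python
# Hypothetical coefficients, for illustration only: the fitted values
# from the regression are not reproduced in the slides.
COEF = {
    "single": 0.30,
    "double": 0.50,
    "triple": 0.70,
    "home_run": 1.00,
    "walk": 0.20,
}


def runs_per_out(rates, coef=COEF):
    """Linear production function with the intercept forced to zero.

    With no constant term, a player who produces nothing is credited
    with exactly zero runs per out.
    """
    return sum(coef[name] * rates[name] for name in coef)


def expected_runs_per_game(rpo, outs_per_game=27):
    """Scale runs/out to a nine-inning game (27 outs), as in the footnote."""
    return rpo * outs_per_game


# Made-up outcome rates for one batter (assumption, not slide data).
example_rates = {"single": 0.15, "double": 0.05, "triple": 0.005,
                 "home_run": 0.03, "walk": 0.08}
example_rpo = runs_per_out(example_rates)
example_runs = expected_runs_per_game(example_rpo)
```

Working in runs/out rather than runs/game keeps batters and pitchers on a comparable scale; multiplying by 27 only happens at the end.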
Following Example Based on the 10/13/2016 Game Between the Dodgers and Nationals
Create a Discrete Density Function:
For Each Batter Pitcher Matchup¹
 Combine Production Functions
 Batter
 Pitcher

Pitcher vs Batter       %1     %2+3   %HR    %BB    %outs
Scherzer (Nationals)    .120   .050   .027   .059   .744
Toles (Dodgers)         .143   .030   .048   .038   .741
Resultant               .131   .04    .038   .049   .742

¹ Example from Dodgers vs Nationals 10/13 game
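The slide does not state how the batter and pitcher lines are combined. The Resultant figures (.131, .04, .038, .049, .742) agree with a simple average of the Scherzer and Toles lines to within rounding, so the sketch below assumes that rule. The dictionary keys and the `combine` helper are illustrative names, not the authors'.

```python
# Outcome probabilities from the matchup slide.
# Keys: single (%1), extra_base (%2+3), home_run (%HR), walk (%BB), out (%outs).
SCHERZER = {"single": 0.120, "extra_base": 0.050, "home_run": 0.027,
            "walk": 0.059, "out": 0.744}
TOLES = {"single": 0.143, "extra_base": 0.030, "home_run": 0.048,
         "walk": 0.038, "out": 0.741}


def combine(batter, pitcher):
    """Average two outcome distributions, then renormalise to sum to 1.

    Assumption: equal weighting of batter and pitcher. The slides do not
    confirm the rule; this one merely matches the Resultant row.
    """
    averaged = {k: (batter[k] + pitcher[k]) / 2 for k in batter}
    total = sum(averaged.values())
    return {k: v / total for k, v in averaged.items()}


resultant = combine(TOLES, SCHERZER)
```

The resulting dictionary is a discrete density over plate-appearance outcomes, which is the form a Monte Carlo simulation can sample from directly.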
Monte Carlo Simulation
Aggregate Batter Pitcher Matchup: Dodgers vs. Scherzer
[Diagram inputs: park factor, batters’ characteristics, matchup characteristics, pitcher’s characteristics]
Aggregate Batter Pitcher Matchup: Nationals vs. Hill
Dodgers vs Nationals 10/13
Expected Number of Runs, Dodgers: 3.796
Dodgers vs Nationals 10/13
Expected Number of Runs, Nationals: 3.78
Dodgers vs Nationals 10/13
Overlay Comparison
Player Sensitivity:
Nationals vs Hill (expected value runs = 3.7796)
Player Sensitivity:
Dodgers vs Scherzer
(expected value runs = 3.7958)
Monte Carlo Simulation
$P(W_{Nationals}) = 51.1\%$
Expected Value: Total Runs
P(>6.5) = 56%
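The slides do not describe the internals of the simulation, so the following is only a rough sketch of how win and over/under probabilities can be estimated by Monte Carlo. It assumes every plate appearance for both teams uses the single Resultant line from the Toles vs. Scherzer matchup, uses deliberately crude base-running rules, and splits ties evenly; none of these choices come from the source, and the output is not expected to reproduce the 51.1% or 56% figures.

```python
import random

OUTCOMES = ["single", "extra_base", "home_run", "walk", "out"]
# Resultant line from the matchup slide, reused for every batter (assumption).
PROBS = [0.131, 0.040, 0.038, 0.049, 0.742]


def simulate_half_inning(rng):
    """Runs scored in one half-inning under simplified base-running rules."""
    bases = [False, False, False]  # first, second, third
    outs = 0
    runs = 0
    while outs < 3:
        outcome = rng.choices(OUTCOMES, weights=PROBS)[0]
        if outcome == "out":
            outs += 1
        elif outcome == "home_run":
            runs += 1 + sum(bases)
            bases = [False, False, False]
        elif outcome == "extra_base":
            # Assumption: treated as a double on which every runner scores.
            runs += sum(bases)
            bases = [False, True, False]
        elif outcome == "single":
            # Assumption: every runner advances exactly one base.
            runs += bases[2]
            bases = [True, bases[0], bases[1]]
        else:  # walk: only forced runners advance
            if all(bases):
                runs += 1
            elif bases[0] and bases[1]:
                bases[2] = True
            elif bases[0]:
                bases[1] = True
            bases[0] = True
    return runs


def simulate_game_runs(rng, innings=9):
    """Total runs for one team over a fixed number of innings."""
    return sum(simulate_half_inning(rng) for _ in range(innings))


def win_probability(runs_a, runs_b):
    """Share of paired simulations won by side A, ties split evenly (assumption)."""
    score = sum(1.0 if a > b else 0.5 if a == b else 0.0
                for a, b in zip(runs_a, runs_b))
    return score / len(runs_a)


rng = random.Random(2016)
home = [simulate_game_runs(rng) for _ in range(5000)]
road = [simulate_game_runs(rng) for _ in range(5000)]

mean_runs = sum(home) / len(home)
p_home = win_probability(home, road)
p_over = sum(h + r > 6.5 for h, r in zip(home, road)) / len(home)
```

A fuller model would draw each plate appearance from the specific batter-pitcher density and apply a park factor, as the matchup slides indicate.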
Aggregate Team Matchups
Last 3 innings
To Pythagorean Model
To Logistic Regression
Pythagorean
Bill James’ “Pythagorean” Probability of Winning
$P(W) = \frac{R_{scored}^{2}}{R_{scored}^{2} + R_{allowed}^{2}}$
where $R_{scored}$ is runs scored and $R_{allowed}$ is runs allowed
Pythagorean:
A More Generalized Approach
 Runs:
 Scored by home team ($R_h$)
 Allowed by home team ($R_r$) = scored by road team
 $P(Win_h) = \frac{R_h^{\beta}}{R_h^{\beta} + R_r^{\beta}}$
 $\beta$ will vary by sport, magnitude of measurement, type of measurement
Dodgers vs. Nationals 10/13
Pythagorean
 Runs:
 $R_h$ = 3.79
 $R_r$ = 3.80 (scored by road team)
 $P(Win_h) = \frac{3.79^{1.8}}{3.79^{1.8} + 3.80^{1.8}} = .50 = 50\%$
Pythagorean (from team matchup):
Dodgers vs. Nationals
 Runs:
 $R_h$ = 5.175
 $R_r$ = 5.689 (scored by road team)
 $P(Win_h) = \frac{5.175^{1.8}}{5.175^{1.8} + 5.689^{1.8}} = .457 = 45.7\%$
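The two Pythagorean calculations above can be checked with a few lines of Python. This is a minimal sketch; the function name is illustrative, and the default exponent of 1.8 is the value used in the worked examples.

```python
def pythagorean_win_probability(r_home, r_road, beta=1.8):
    """Generalised Pythagorean estimate of the home team's win probability.

    r_home: runs scored by the home team (R_h)
    r_road: runs scored by the road team (R_r)
    beta:   exponent; 2 gives Bill James' original form
    """
    return r_home ** beta / (r_home ** beta + r_road ** beta)


# Monte Carlo run expectations (R_h = 3.79, R_r = 3.80)
p_monte_carlo_runs = pythagorean_win_probability(3.79, 3.80)

# Team matchup run expectations (R_h = 5.175, R_r = 5.689)
p_team_matchup = pythagorean_win_probability(5.175, 5.689)
```

Because only the ratio of the two run totals matters, near-equal expectations give a probability close to 50%, while the wider team-matchup gap moves the home team to roughly 45.7%.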
Logistic Regression
Logistic Regression Equation
StatTools Report
Analysis: Logistic Regression P(W)home = f(K/BB,OB+S;Rd Hm)
Performed By: graham
Date: Saturday, March 07, 2015
Updating: Static
Variable: Win_Hm

Logistic Regression for Win_Hm

Summary Measure
Null Deviance      8681.88924
Model Deviance     4166.555074
Improvement        4515.334166
p-Value            < 0.0001

Regression Coefficients
            Coefficient   Standard Error   Wald Value   p-Value   Lower Limit   Upper Limit   Exp(Coef)
Constant    -0.0007       0.2166           -0.0033      0.9974    -0.4252       0.4238        0.9993
K/BB_Rd     0.1066        0.0157           6.8116       0.0000    0.0759        0.1373        1.1125
OB+S_Rd     -10.8169      0.3178           -34.0350     0.0000    -11.4398      -10.1940      0.0000
K/BB_Hm     -0.1419       0.0174           -8.1469      0.0000    -0.1761       -0.1078       0.8677
OB+S_Hm     10.9187       0.3150           34.6631      0.0000    10.3013       11.5361       55197

Classification Matrix
              1 Predicted   0 Predicted   Percent Correct
1 - actual    2850          444           86.52%
0 - actual    498           2482          83.29%

Summary Classification
Percent Correct    84.99%
Base               52.50%
Improvement        68.39%
Logistic Regression

Variable    Coefficient   Value
Constant    -0.0007
K/BB Rd     0.1066        2.182
OPS Rd      -10.8169      .808
K/BB Hm     -0.1419       2.397
OPS Hm      10.9187       .631

Probability of Winning Home Team = 12%
Probability of Winning Road Team = 88%
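The 12% figure follows from plugging the listed values into the fitted equation and applying the logistic function. A minimal sketch, using the constant of -0.0007 from the StatTools report; the dictionary keys and function name are illustrative.

```python
import math

# Coefficients from the StatTools report.
COEF = {
    "const": -0.0007,
    "kbb_rd": 0.1066,     # K/BB, road team
    "ops_rd": -10.8169,   # OPS (OB+S), road team
    "kbb_hm": -0.1419,    # K/BB, home team
    "ops_hm": 10.9187,    # OPS (OB+S), home team
}


def p_home_win(kbb_rd, ops_rd, kbb_hm, ops_hm, coef=COEF):
    """Logistic-regression probability that the home team wins."""
    z = (coef["const"]
         + coef["kbb_rd"] * kbb_rd
         + coef["ops_rd"] * ops_rd
         + coef["kbb_hm"] * kbb_hm
         + coef["ops_hm"] * ops_hm)
    return 1.0 / (1.0 + math.exp(-z))


# Values from the worked example on the slide.
p_home = p_home_win(kbb_rd=2.182, ops_rd=0.808, kbb_hm=2.397, ops_hm=0.631)
p_road = 1.0 - p_home
```

The linear predictor is about -1.96, which gives a home win probability of roughly 12% and a road win probability of roughly 88%. The two OPS coefficients dominate, so the .808 vs .631 gap drives nearly all of the result.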
Bayesian – an integrated approach
Bayesian Statistics
 Dynamic method of calculating the
probability of winning
 Modify probabilities predicated upon new
and relevant information
 Responsive to additional information
(daisy chain)
Bayesian:
An Integrated Approach
 Change probabilities as a result of
new information
 Weight information over the period
covered, i.e., number of innings
Bayesian Formula
 $p(A_1|B) = \frac{p(A_1)\,p(B|A_1)}{p(A_1)\,p(B|A_1) + p(A_2)\,p(B|A_2)}$
Bayesian – Classical View
 Event A = probability of home team
winning as a result of Monte Carlo
Simulation
 p(A1) = .51
 P(A2) = (1-.51) = .49
 Event B = Probability of home team
winning as a result of regression
matchups of expected value of runs
Bayesian Part 1: Daisy Chain
 p(A1) = .51 (Monte Carlo, 6 innings)
 p(A2) = (1-.51) = .49
 p(B1) = .46 (Team Aggregates, Pythagorean, 3 innings)
 p(B2) = (1-.46) = .54
 $P(A_1|B) = \frac{(.667 \times .51)(.333 \times .46)}{(.667 \times .51)(.333 \times .46) + (.333 \times .49)(.667 \times .54)} = .47 = 47\%$
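The Part 1 arithmetic can be reproduced with a short function. This is a sketch of the expression exactly as written on the slide, with the inning weights taken as 6/9 and 3/9; the function and parameter names are illustrative.

```python
def weighted_bayes_update(p_a1, p_b1, w_a, w_b):
    """Daisy-chain update as written on the slide.

    p_a1: prior probability of a home win (source A)
    p_b1: probability of a home win from the new information (source B)
    w_a, w_b: inning weights attached to sources A and B
    """
    p_a2 = 1.0 - p_a1
    p_b2 = 1.0 - p_b1
    numerator = (w_a * p_a1) * (w_b * p_b1)
    denominator = numerator + (w_b * p_a2) * (w_a * p_b2)
    return numerator / denominator


# Part 1: Monte Carlo (6 innings) updated with Pythagorean (3 innings)
part1 = weighted_bayes_update(p_a1=0.51, p_b1=0.46, w_a=6 / 9, w_b=3 / 9)
```

`part1` comes out at about .47, matching the slide. In this form both terms of the denominator carry the same factor `w_a * w_b`, so the weights cancel and the result depends only on `p_a1` and `p_b1`; a scheme in which the inning weights change the answer would need a different functional form.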
Bayesian Part 2: Daisy Chain
 p(A1) = .47 (Daisy Chain 1, for 9 innings)
 p(A2) = (1-.47) = .53
 p(B1) = .88 (Logistic Regression, for 3 innings)
 p(B2) = (1-.88) = .12
 $P(A_1|B) = \frac{(.75 \times .47)(.25 \times .88)}{(.75 \times .47)(.25 \times .88) + (.25 \times .53)(.75 \times .12)} = .47 = 47\%$
Results
 Dodgers 4, Nationals 3
 Bayesian
 Probability of a home (Nationals) team win: 47%
 Probability of a road (Dodgers) team win: 53%
 Total Runs
 Expected value 7.5
 Line at 6.5
 P(over 6.5) = 56.2%
Filters to winning
 If P(W) >= 53%, win 65% of games
 If P(over or under) > 54%, win 70% of bets
NOW WHAT?
Time to Invest in
 Players
 Gambling
 Money Line
 Over Under
Players
 What do we now know for each player
 Expected run production
 vs right and left handed pitchers
 Historic matchup
 Volatility
Gambling
 What we now know
 Probability of winning game
 What we can derive
 From betting lines
 Implied probability of winning
 Betting edge
 Expected ROI
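The derived quantities in this list can be sketched as follows. The slides do not give the actual betting line, so the +110 money line is a hypothetical value, and American-style odds are an assumed format; the 53% model probability is the Bayesian road (Dodgers) estimate from the Results slide. Function names are illustrative.

```python
def implied_probability(american_odds):
    """Break-even win probability implied by an American money line."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


def profit_per_unit(american_odds):
    """Profit on a winning one-unit stake."""
    if american_odds < 0:
        return 100 / -american_odds
    return american_odds / 100


def betting_edge(model_p, american_odds):
    """Model probability minus the probability implied by the line."""
    return model_p - implied_probability(american_odds)


def expected_roi(model_p, american_odds):
    """Expected profit per unit staked, given the model's win probability."""
    return model_p * profit_per_unit(american_odds) - (1.0 - model_p)


line = 110       # hypothetical money line on the Dodgers (not from the slides)
model_p = 0.53   # Bayesian probability of a road (Dodgers) win

implied = implied_probability(line)
edge = betting_edge(model_p, line)
roi = expected_roi(model_p, line)
```

The implied probability from a single line includes the bookmaker's margin, so the two sides of a market will usually sum to more than 100%; a stricter comparison would normalise both sides before computing the edge.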
Summary
 Accurate measurement of batter and pitcher
effectiveness in meaningful units
 Monte Carlo model of game
 Improved Pythagorean formula (empirically
calculated exponent)
 Logistic regression calculated probabilities
using significant variables (OPS & K/BB)
 Integrated and weighted Bayesian probability
of winning
 Measurement and ranking of batter pitcher
matchups
The Curse of The Billy Goat
Prediction 11/2/2016
World Series – Game 7
[Figure: distribution of Total Runs with the line at 6.50; 40.0% below, 60.0% above. Minimum 4.296, Maximum 20.627, Mean 7.429, Std Dev 2.171, Values 5000]
P(Over 6.5) = 60%
P(W) = 54%
Cubs!
Note added 11/3: Cubs won 8 - 7