Modelling the distribution of first innings runs in T20 Cricket

James Kirkby
The joy of smoothing
Introduction
Cricket for the uninitiated
Figure: Muralitharan to Gilchrist
Introduction
Motivation
Why might we be interested in cricket data?
Because we love cricket?
Well, some of us do.
Because it’s not the Iris or the Old Faithful data.
There is lots of cricket data. The discrete nature of the game means that large
quantities of data are available. Statistics are already an important aspect of the
game.
Gambling.
Standing on the shoulders of giants. Working out the odds of dice and card games is
what inspired the first interest in statistics and probability.
Data
Scope of the Data
There are a vast number of matches played worldwide each year for which data is publicly
available. We are going to restrict attention to the following types of matches:
T20 cricket, i.e. 20 overs per team.
Only 'Top Tier' competitions: T20 internationals, English County T20s, IPL, Big
Bash, South African T20.
We are going to be modelling the number of runs teams score in an innings, and so we
further restrict the data to:
First innings (only data for the team that bats first).
Innings where the full allocation of overs was available, i.e. not weather affected.
These restrictions lead to a sample of 1138 matches.
Data
Data Description
We observe the progression of runs that a team scores through the innings. At the
beginning of each over we have the following information:
The number of runs scored in the remainder of the innings.
The number of wickets down / number of batsmen remaining.
The number of overs / balls remaining.
We will focus on the run rate (runs per over) to ensure that results are comparable with
different numbers of overs remaining.
Definition
We define the random variable $Y_{W,R}$ as the subsequent run rate a team achieves given
that they are currently $W$ wickets down with $R$ overs remaining in the innings.
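As a minimal sketch of this definition, the observations for one innings could be built as below. The argument names and the list-based layout are illustrative assumptions, not the format of the original data.

```python
def subsequent_run_rates(cum_runs, cum_wickets, total_runs, overs=20):
    """Return (W, R, y) triples for one first innings.

    cum_runs[k] and cum_wickets[k] are the runs scored and wickets lost
    before over k + 1 starts, so index 0 is 0 runs and 0 wickets.
    These argument names are hypothetical.
    """
    triples = []
    for k in range(len(cum_runs)):
        overs_remaining = overs - k                  # R
        runs_remaining = total_runs - cum_runs[k]    # runs still to come
        y = runs_remaining / overs_remaining         # subsequent run rate
        triples.append((cum_wickets[k], overs_remaining, y))
    return triples


# A toy 4-over innings with 30 runs in total.
obs = subsequent_run_rates(
    cum_runs=[0, 6, 14, 20], cum_wickets=[0, 0, 1, 1], total_runs=30, overs=4
)
```

Each triple is one realisation of the run rate for the state (W, R) at the start of an over.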
Data
Our Aim
We would like to estimate the distributions of the various $Y_{W,R}$, subject to the following
requirements.
Avoid a full rank method - we don’t want to be storing the entire data set in order to
evaluate probabilities.
We want to be able to evaluate probabilities from the distribution easily.
We would like a set of consistent distributions, i.e. the probability of achieving any
given run rate should be lower if a team has fewer wickets remaining.
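The consistency requirement can be read as an ordering of survival functions. The sketch below checks that ordering on two empirical samples; the function names and toy numbers are illustrative assumptions, and the check is on raw data rather than on a fitted spline distribution.

```python
def survival(sample, y):
    """Empirical probability of achieving a run rate of at least y."""
    return sum(1 for v in sample if v >= y) / len(sample)


def is_consistent(more_wickets_in_hand, fewer_wickets_in_hand, grid):
    """True if fewer wickets in hand never gives a higher chance of reaching y."""
    return all(
        survival(fewer_wickets_in_hand, y) <= survival(more_wickets_in_hand, y)
        for y in grid
    )


# Toy run rates for the same overs remaining, one wicket apart.
rates_w2 = [6.0, 7.5, 8.0, 9.0]
rates_w3 = [5.0, 6.5, 7.0, 8.5]
grid = [5.0, 6.0, 7.0, 8.0, 9.0]
ok = is_consistent(rates_w2, rates_w3, grid)
```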
Data
Observed Data Frequency
Data
Empirical Distribution
Model
Notation
We observe many realisations of each of the $Y_{W,R}$. We will refer to the $i$th realisation of
$Y_{W,R}$, when $W = w$ and $R = r$, as $y_{w,r,i}$.
When it is clear from the context which $W$ and $R$ we are talking about, or if it doesn’t
matter, we will drop the subscripts and use $Y$ and $y_i$.
Model
Distribution Assumption
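We assume that $Y$ follows a 'spline' distribution, with pdf given by:
$$f(y) = \sum_{j=1}^{m} B_j(y)\,\alpha_j. \tag{1}$$
Provided each basis function $B_j$ is non-negative and integrates to one, sufficient conditions for a valid pdf are:
$$\alpha_j > 0 \quad \text{and} \quad \sum_{j=1}^{m} \alpha_j = 1. \tag{2}$$
We can remove the need for the first condition by re-parameterizing to:
$$f(y) = \sum_{j=1}^{m} B_j(y) \exp(a_j), \tag{3}$$
i.e. $\alpha_j = \exp(a_j)$.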
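A minimal sketch of evaluating such a density, assuming cubic B-splines on equally spaced knots, each rescaled to integrate to one. The knot range, the weights and the function names are illustrative assumptions; the slides do not state which basis was used.

```python
import numpy as np


def bspline_basis(y, knots, degree):
    """Evaluate every B-spline basis function at the points y (Cox-de Boor recursion)."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicator of each half-open knot interval [t_j, t_{j+1}).
    B = np.array(
        [(t[j] <= y) & (y < t[j + 1]) for j in range(len(t) - 1)], dtype=float
    ).T
    for k in range(1, degree + 1):
        cols = []
        for j in range(len(t) - k - 1):
            col = np.zeros_like(y)
            if t[j + k] > t[j]:
                col += (y - t[j]) / (t[j + k] - t[j]) * B[:, j]
            if t[j + k + 1] > t[j + 1]:
                col += (t[j + k + 1] - y) / (t[j + k + 1] - t[j + 1]) * B[:, j + 1]
            cols.append(col)
        B = np.column_stack(cols)
    return B


def density_basis(y, knots, degree=3):
    """Rescale each B-spline so that it integrates to one (assumed normalisation)."""
    t = np.asarray(knots, dtype=float)
    width = t[degree + 1:] - t[:-(degree + 1)]
    return bspline_basis(y, t, degree) * (degree + 1) / width


knots = np.linspace(0.0, 20.0, 9)             # illustrative run-rate range
alpha = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # positive weights summing to one
grid = np.linspace(0.0, 20.0, 2001)
f = density_basis(grid, knots) @ alpha        # equation (1) on the grid
total_mass = f.sum() * (grid[1] - grid[0])    # Riemann sum of the density
```

With nine knots and degree three there are m = 5 basis functions, and the Riemann sum gives a numerical check that the mixture integrates to approximately one.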
Model
Likelihood
The log-likelihood for our data given the spline distribution is
$$\ell(a; y) = 1_n^T \log\left(B \exp(a)\right) \tag{4}$$
where
$$B = \begin{pmatrix} B_1(y_1) & \cdots & B_m(y_1) \\ \vdots & & \vdots \\ B_1(y_n) & \cdots & B_m(y_n) \end{pmatrix} \quad \text{and} \quad a = \begin{pmatrix} a_1 \\ \vdots \\ a_m \end{pmatrix}, \tag{5}$$
and $\log$ and $\exp$ are applied elementwise.
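Equation (4) is a one-liner once the basis matrix is available. The sketch below uses a toy basis of two uniform densities (a degree-zero spline, chosen here only so the numbers can be checked by hand); the data and names are illustrative assumptions.

```python
import numpy as np


def log_likelihood(a, B):
    """Equation (4): sum over observations of log f(y_i), with f = B exp(a)."""
    return float(np.sum(np.log(B @ np.exp(a))))


# Two bins of width 2, so each basis function is a uniform density of height 0.5.
# Two observations fall in the first bin and one in the second.
B_toy = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5]])

ll_proportional = log_likelihood(np.log([2 / 3, 1 / 3]), B_toy)  # weights match the counts
ll_equal = log_likelihood(np.log([0.5, 0.5]), B_toy)             # equal weights
```

Weights proportional to the bin counts give the higher log-likelihood, as expected for a histogram.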
Model
Estimation
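Estimation of the parameters can now proceed by finding the stationary points of the Lagrangian:
$$L(a, \gamma) = 1_n^T \log(B \exp a) + \gamma \left(1_m^T \exp a - 1\right). \tag{6}$$
The gradient vectors are:
$$\frac{\partial L}{\partial a} = \left(\frac{1}{B \exp a}\right)^T \left(B \,\mathrm{diag}(\exp a)\right) + \gamma (\exp a)^T = \left(\frac{1}{B\alpha}\right)^T \left(B \,\mathrm{diag}(\alpha)\right) + \gamma \alpha^T \tag{7}$$
and
$$\frac{\partial L}{\partial \gamma} = 1_m^T \exp a - 1 = 1_m^T \alpha - 1, \tag{8}$$
where $\alpha = \exp a$ and the reciprocal is taken elementwise.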
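The gradients in (7) and (8) can be sanity-checked against finite differences. In this sketch the basis matrix is just a random positive matrix and the values of a and gamma are arbitrary; all of these are assumptions made for the check, not quantities from the talk.

```python
import numpy as np


def lagrangian(a, gamma, B):
    """Equation (6)."""
    alpha = np.exp(a)
    return float(np.sum(np.log(B @ alpha)) + gamma * (alpha.sum() - 1.0))


def gradients(a, gamma, B):
    """Equations (7) and (8), returned as a vector and a scalar."""
    alpha = np.exp(a)
    fitted = B @ alpha                        # B exp(a), one value per observation
    grad_a = (B * alpha).T @ (1.0 / fitted) + gamma * alpha
    grad_gamma = alpha.sum() - 1.0
    return grad_a, grad_gamma


rng = np.random.default_rng(0)
B_rand = rng.uniform(0.1, 1.0, size=(6, 3))   # stand-in basis, n = 6, m = 3
a0 = rng.normal(size=3)
gamma0 = -6.0

grad_a, grad_gamma = gradients(a0, gamma0, B_rand)

# Central finite differences in each coordinate of a.
eps = 1e-6
numeric = np.array([
    (lagrangian(a0 + eps * e, gamma0, B_rand) - lagrangian(a0 - eps * e, gamma0, B_rand))
    / (2 * eps)
    for e in np.eye(3)
])
max_abs_error = float(np.max(np.abs(grad_a - numeric)))
```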
Model
Estimation
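The Hessian of our objective function is
$$H_{a,\gamma} L = \begin{pmatrix} \mathrm{diag}\left(\frac{\partial L}{\partial a}\right) - V^T U^{-1} V & \exp a \\ (\exp a)^T & 0 \end{pmatrix}, \tag{9}$$
where $U = \mathrm{diag}\left((B \exp a)^2\right)$, with the square taken elementwise, and $V = B \,\mathrm{diag}(\exp a)$.
This can be combined with expressions (7) and (8) to find the maximum likelihood
estimate of the coefficients, $a$, using Newton-Raphson.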
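A minimal Newton-Raphson sketch for a single density follows. The top-left Hessian block used here is diag(dL/da) - V^T U^{-1} V, obtained by differentiating (7) directly. The starting values, the stopping rule and the toy basis are assumptions; there is no step-size control, which a practical implementation would probably need.

```python
import numpy as np


def fit_spline_density(B, n_iter=50, tol=1e-10):
    """Newton-Raphson on the Lagrangian; returns (a, gamma)."""
    n, m = B.shape
    a = np.full(m, -np.log(m))    # start from equal weights alpha_j = 1/m
    gamma = -float(n)             # assumed starting value for the multiplier
    for _ in range(n_iter):
        alpha = np.exp(a)
        fitted = B @ alpha
        V = B * alpha                                   # B diag(exp a)
        grad_a = V.T @ (1.0 / fitted) + gamma * alpha
        grad_gamma = alpha.sum() - 1.0
        H = np.zeros((m + 1, m + 1))
        # V^T U^{-1} V with U = diag((B exp a)^2)
        H[:m, :m] = np.diag(grad_a) - V.T @ (V / fitted[:, None] ** 2)
        H[:m, m] = alpha
        H[m, :m] = alpha
        step = np.linalg.solve(H, -np.append(grad_a, grad_gamma))
        a = a + step[:m]
        gamma = gamma + step[m]
        if np.max(np.abs(step)) < tol:
            break
    return a, gamma


# Toy basis: two uniform densities of height 0.5, with 2 and 1 observations.
B_toy = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5]])
a_hat, gamma_hat = fit_spline_density(B_toy)
alpha_hat = np.exp(a_hat)
```

For this toy basis the fitted weights should match the observed proportions in each bin, which gives an easy check on the iteration.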
Model
Result
Model
Further Smoothing
We would like to impose some smoothness on the distributions, so that when the numbers
of wickets remaining and overs remaining are similar we have a similar distribution.
We can achieve this by imposing a difference penalty on the parameters of the
neighbouring distributions.
In order to be able to add the penalty we first need to be able to estimate the parameters
jointly, which requires that we make a couple of tweaks to our basis and likelihood.
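As a simplified illustration of what such a penalty looks like, consider first-order differences in the wickets direction only, for a fixed number of overs remaining $r$. The notation $\alpha_{j,w,r}$ for the $j$th coefficient of the density at state $(w, r)$ and the smoothing parameter $\lambda$ are assumptions introduced for this sketch.

```latex
% One-direction, first-order difference penalty (illustrative assumption):
% coefficients of densities one wicket apart are pulled towards each other.
P_r(\alpha) = \lambda \sum_{w} \sum_{j=1}^{m}
    \bigl( \alpha_{j,w+1,r} - \alpha_{j,w,r} \bigr)^2
```

A large $\lambda$ forces neighbouring densities to share almost the same coefficients, while $\lambda = 0$ recovers the separate fits.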
Model
Multi-density Basis
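In order to model the distributions jointly, you would naively define the basis as:
$$B = \begin{pmatrix}
B_{W=0,R=20} & 0 & 0 & \cdots & 0 & 0 \\
0 & B_{W=1,R=20} & 0 & \cdots & 0 & 0 \\
0 & 0 & B_{W=2,R=20} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & B_{W=8,R=1} & 0 \\
0 & 0 & 0 & \cdots & 0 & B_{W=9,R=1}
\end{pmatrix}.$$
Parts of this basis, for example the block $B_{W=1,R=20}$, do not support any data! A team cannot already be a wicket down at the start of the innings.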
Model
Multi-density Basis
So after removing columns from the basis which support no observations, we have
something like:
$$\tilde{B} = \begin{pmatrix}
B_{W=0,R=20} & 0 & 0 & \cdots & 0 & 0 \\
0 & B_{W=0,R=19} & 0 & \cdots & 0 & 0 \\
0 & 0 & B_{W=1,R=19} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & B_{W=8,R=1} & 0 \\
0 & 0 & 0 & \cdots & 0 & B_{W=9,R=1}
\end{pmatrix}$$
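A minimal sketch of assembling the reduced block-diagonal basis, assuming the per-state basis matrices are held in a dictionary keyed by (W, R). The dictionary layout, the function names and the toy sizes are illustrative assumptions.

```python
import numpy as np


def block_diag(blocks):
    """Stack matrices along the diagonal of a zero matrix."""
    out = np.zeros((sum(b.shape[0] for b in blocks), sum(b.shape[1] for b in blocks)))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out


def joint_basis(bases_by_state):
    """Drop states with no observations, then build the joint basis.

    bases_by_state maps (W, R) to an n_{W,R} x m basis matrix.
    States are ordered by overs remaining (descending), then wickets down.
    """
    kept = [
        s for s in sorted(bases_by_state, key=lambda s: (-s[1], s[0]))
        if bases_by_state[s].shape[0] > 0
    ]
    return kept, block_diag([bases_by_state[s] for s in kept])


m = 2
bases = {
    (0, 20): np.ones((3, m)),
    (1, 20): np.zeros((0, m)),   # impossible state: no observations
    (0, 19): np.ones((2, m)),
    (1, 19): np.ones((1, m)),
}
kept_states, B_tilde = joint_basis(bases)
```

An empty state would otherwise contribute m columns with no rows, which is exactly the unsupported part of the naive basis.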
Model
Multi-density Basis
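We also need to define a summing matrix to enforce the constraints in the Lagrangian:
$$N = \begin{pmatrix}
1_m^T & 0 & 0 & \cdots & 0 & 0 \\
0 & 1_m^T & 0 & \cdots & 0 & 0 \\
0 & 0 & 1_m^T & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1_m^T & 0 \\
0 & 0 & 0 & \cdots & 0 & 1_m^T
\end{pmatrix},$$
where each block $1_m^T$ is a row of $m$ ones, so that $N \exp a$ returns the sum of each density's coefficients.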
Clearly we will need to define an analogue of B̃ for N, which we will refer to as Ñ.
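One compact way to build a summing matrix of this shape is a Kronecker product of an identity matrix with a row of ones. The sketch below assumes every kept density has the same number m of coefficients; the function name and the toy coefficients are illustrative.

```python
import numpy as np


def summing_matrix(n_densities, m):
    """One row per density, with m ones over that density's coefficients."""
    return np.kron(np.eye(n_densities), np.ones((1, m)))


# Three densities with two coefficients each, stacked into one vector.
alpha = np.array([0.25, 0.75, 0.5, 0.5, 0.9, 0.1])
N = summing_matrix(3, 2)
constraint = N @ alpha - 1.0   # zero when every density's weights sum to one
```

The vector `constraint` is what the multipliers act on in the joint Lagrangian, one entry per density.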
Model
Bring on the smoothing
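Our unpenalised target function becomes
$$L(\tilde{a}, \tilde{\gamma}) = 1^T \log\left(\tilde{B} \exp \tilde{a}\right) + \tilde{\gamma}^T \left(\tilde{N} \exp \tilde{a} - 1\right). \tag{10}$$
We can then simply add a difference penalty to impose smoothness across our
distributions:
$$L_P(\tilde{a}, \tilde{\gamma}) = L(\tilde{a}, \tilde{\gamma}) - \lambda \exp(\tilde{a})^T \tilde{D}^T \tilde{D} \exp(\tilde{a}), \tag{11}$$
where $\tilde{D}$ is a matrix that has been chopped down from some difference matrix $D$. For our
example, we will use
$$D = D_W \otimes D_R \otimes I_m.$$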
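A minimal sketch of the Kronecker-structured penalty on a toy grid. The slides do not say how the chopped-down matrix is formed, so the rule used here (drop the columns of unsupported states and every row that touched one) is an assumption, as are the first-order differences, the toy sizes and the support pattern.

```python
import numpy as np


def diff_matrix(k):
    """First-order difference matrix with k - 1 rows and k columns."""
    return np.diff(np.eye(k), axis=0)


n_w, n_r, m = 3, 4, 2   # toy sizes: wicket states, over states, basis functions
D = np.kron(np.kron(diff_matrix(n_w), diff_matrix(n_r)), np.eye(m))

# Hypothetical support pattern: at the first over state only W = 0 is possible.
supported = np.ones((n_w, n_r), dtype=bool)
supported[1:, 0] = False
keep_cols = np.repeat(supported.ravel(), m)   # W slowest, then R, then basis index

# Assumed chopping rule: drop unsupported columns and every row that touched one.
touches_dropped = (D[:, ~keep_cols] != 0).any(axis=1)
D_tilde = D[~touches_dropped][:, keep_cols]


def penalty(a_tilde, lam):
    """Penalty term of equation (11): lam * exp(a)^T D~^T D~ exp(a)."""
    d = D_tilde @ np.exp(a_tilde)
    return lam * float(d @ d)


same = np.tile(np.log([0.3, 0.7]), int(supported.sum()))   # identical densities
rough = np.random.default_rng(1).normal(size=same.size)    # unrelated densities
```

Identical neighbouring densities incur no penalty, while unrelated coefficients are penalised, which is the behaviour the smoothing relies on.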
Model
Result
Model
Further Work
Would be good to take account of the repeated measurements in the data.
Find a way to introduce a parametric component into the model.
Performance improvements - Woodbury Matrix Identity / Schur Complement
Alternative penalty structure - add a penalty to ensure the CDFs do not cross.