A Self-Adapting Intelligent Optimized Analytical Model for team
selection using player performance utility in Cricket
Bharathan S1, Sundarraj RP2, Abhijeet3, Ramakrishnan S3
1Analytics Manager at Sports Mechanics and PhD scholar in Indian Institute of Technology, Madras
2Professor, Indian Institute of Technology, Madras
3Sports Mechanics India Private Limited
Abstract
Good team selection is vital for success in all sports. Today, team selection is a highly subjective decision, made by
coaches and captains on the basis of gut feel or a player's current form. We propose a new methodology for the
objective evaluation of players for team selection. Player selection involves evaluating a player across multiple
dimensions, viz. batting and bowling, based on role in the team, context, opponents etc. In our study, we consider all
possible team-level metrics that affect the outcome (win or loss) of a match, translate them into individual player
metrics, develop a player evaluation utility and use it for team selection. The results of the model are demonstrated
for the game of cricket and can be easily extended to other sports. The model successfully identified 75% of the good
performers, and its team selection accuracy is 83%. The model can be used as an effective enabler to rate players,
select teams, define salary caps, predict winners etc.
1 Introduction
The statistics of professional sports, players and teams provide numerous opportunities for research [1,2]. The two
popular American sports, football and baseball, have always been characterized by a high degree of analytics. In
contrast, the British game of cricket has not been subjected to the same degree of analytics. Cricket, the second most
popular sport in the world after soccer with 2-3 billion fans [3], is a relatively new and upcoming research area.
Traditionally, the game is classified into Test cricket, which is played between two countries over 5 days; one day
international (ODI) cricket, which is played between two international teams with each team batting for 50 overs; and
first-class cricket, which is played either between domestic teams or between a visiting international team and a
domestic team over 3 or 4 days. During the last three decades, the game has seen several changes, all directed at
making it more popular among the masses and expanding its reach to non-cricket-playing nations. The most
significant change came in the form of Twenty20 (T20) cricket. T20 is the latest innovation in the game and is a
shorter format than ODI cricket: a T20 game lasts about 3 hours, and each team bats for 20 overs. The frenetic pace
and growing popularity of the T20 format are changing the way the other formats are played. For example, when
Duckworth and Lewis came up with the rain rule in 1999, the average score was 225 runs in 50 overs, whereas today
teams score 225 runs in 20 overs in a T20 match. Teams today require players who are effective in multiple roles.
Selecting players for a team under various constraints is therefore a complex task, which can be viewed as a
constrained multi-objective optimization and multiple-criteria decision-making problem. In forming a good and
successful cricket team, batting and bowling strength are the major factors affecting performance, and an optimum
trade-off between them needs to be attained. There are numerous variables on which a player's performance is
measured today, but it is important to know whether all of them should be considered for player evaluation or only a
subset, depending on the role of the player in the team. Currently, most team selections are made using heuristics,
past experience or, at most, some crude methodologies. Generally, in a team selection committee, each member
evaluates a player's performance individually and votes for inclusion in or exclusion from the team. Negotiations are
then conducted to reach agreement among the members as to which cricketers should finally be selected. In our
work, we employ a two-phase approach: we first identify the variables that are significant for evaluating a player, and
then evaluate the players and select them for inclusion in the team. The rest of the paper is organized as follows:
Section 2 presents an overview of the literature on player evaluation and team selection, Section 3 explains the
methodology in detail, Section 4 discusses the results, and Section 5 concludes the paper with the scope for future
work.
2 Related Work
Studies addressing research issues related to various dimensions of cricket can be found in the literature. Articles
[4,5,6,7,8,9] introduce and discuss the game of cricket, specifically considering statistical methods for determining a
player's performance. Performance measurement and the classification of players based on their performance are a
researcher's delight irrespective of the sport. The physical demands of English Football Association Premier League
soccer for three positional classifications, viz. defender, midfielder and striker, were evaluated in [10]. Elite Cuban
baseball players were classified in [11] into five categories based on the roles they played in the field: infielders,
outfielders, catchers, first basemen and pitchers. Similar studies are discussed for soccer and cricket [12], rugby [13]
and football [14]. Some recent articles [15,16,17,18,19] discuss various multi-criteria decision-making models of player
evaluation for multiplayer sports, namely baseball, basketball, cricket and football. Most studies have used the
Analytical Hierarchy Process (AHP) to assess the relative importance of each variable and the Technique for Order
Performance by Similarity to the Ideal Solution (TOPSIS) to rank players by their relative closeness to the ideal
player. However, these studies have two major shortcomings: 1) the weightages used in player evaluation might not be
accurate, as they are derived from subjective inputs, and 2) there is no relationship between the weightage given and
the outcome of the match. Selecting a cricket team under constraints such as the number of batsmen, bowlers,
all-rounders and a wicket keeper is a complex task, as coaches are required to consider a number of qualitative and
quantitative attributes. These attributes may include the players' individual skills and performance statistics, the
combination of players, physical fitness, psychological factors and injuries, among others [20]. An integer
programming model to select a squad of 15 players for a one day international cricket team was developed in [21]. A
neural network approach that predicts each cricketer's future performance from past performance and classifies
players into three categories (performer, moderate and failure) was discussed in [22]. Based on the ratings generated
and by applying heuristic rules, the authors recommended the cricketers to be included in the 2007 World Cup squad.
In [23], a method for quantifying a cricket player's performance based on his ability to score runs and take wickets
was proposed; the performance measures were then used to determine the optimal team using integer programming.
An integer programming model was used in [24] for cricket team selection based on 2009 ICC Champions Trophy
data. A Data Envelopment Analysis (DEA) formulation was proposed in [25] for evaluating and ranking cricket
players in different capabilities based on DEA scores. The ranking is then used to choose the required number of
players for a cricket team in each cricketing capability.
3 Methodology
The proposed methodology for player evaluation and team selection is shown schematically in Figure 1. It has five
stages: data identification, data preparation, data reduction, data modeling, and player evaluation and team selection.
Figure 1 Framework for player evaluation and team selection. Stages shown: Data Identification (interviews with
experts to identify variables); Data Preparation (data cleaning to remove rain-affected matches, and data
transformation); Data Reduction (Principal Component Analysis to address multicollinearity); Modelling (logistic
regression to determine the significant factors that impact the outcome of the match); Player evaluation and Team
selection (integer programming model to select players based on utility).
3.1 Data source
The study uses a Twenty20 cricket database consisting of ball-by-ball details of more than 1500 international and
domestic T20 matches played between 2008 and 2014, tracked across 80+ variables by SportsMechanics, a sports
technology and performance analytics company.
3.2 Data Identification
Alignment with users is key to the success of analytics, as users drive or at least strongly influence major decisions. If
even one member of the team is not committed, it is unlikely that analytics will receive sustained and serious focus.
As the first step, therefore, users and domain experts, including coaches, players, members of the national team
selection committee and performance analysts, were involved in identifying the list of variables that should be
considered for player evaluation and team selection. Focus group sessions and in-depth interviews were conducted,
and a list of variables that could possibly influence the outcome of a match was identified. A description of the
identified variables is given in Appendix A.
3.3 Data preparation
From the ball-by-ball details, data are aggregated to calculate performance variables such as batting strike rate,
dot-ball percentage and bowling economy at match level for all matches, using standard cricketing definitions. The
data are then cleaned to remove all drawn, shortened or rain-affected matches. Since a player's performance depends
on his role in the team, we further divided the variables by batting or bowling role. Variables that do not directly
measure a player's performance, such as venue, ground, pitch type and opponent, are used as filters to restrict the data
to the opponent and venue for which the team is being selected. Preliminary scatter plots and correlation analysis
were used to understand the distribution of the data and the relationship between the independent variables and the
outcome of the match. Based on this analysis, variables with weak correlation (coefficient < 0.3), such as partnerships
and playing order, were removed from further modeling. The final list of variables used in the study and their naming
convention is given in Tables 2 and 3 for batsmen and bowlers respectively.
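The aggregation step can be sketched as follows. This is only an illustration under stated assumptions: the input layout (a list of runs scored off each ball faced) is hypothetical and not the SportsMechanics schema, and boundary percentage and boundary frequency are assumed to mean the share of runs scored in fours and sixes and the number of balls per boundary respectively.

```python
def batting_metrics(runs_per_ball):
    """Aggregate one batsman's innings (runs off each ball faced) into match-level metrics.

    Assumed definitions (illustrative, not taken from the paper's data dictionary):
      strike rate        = runs per 100 balls faced
      dot-ball %         = share of balls faced with no run scored
      boundary %         = share of runs that came from fours and sixes
      boundary frequency = balls faced per boundary hit
    """
    balls = len(runs_per_ball)
    if balls == 0:
        return None
    runs = sum(runs_per_ball)
    dots = sum(1 for r in runs_per_ball if r == 0)
    boundaries = [r for r in runs_per_ball if r in (4, 6)]
    return {
        "runs": runs,
        "strike_rate": 100.0 * runs / balls,
        "dot_ball_pct": 100.0 * dots / balls,
        "boundary_pct": 100.0 * sum(boundaries) / runs if runs else 0.0,
        "boundary_freq": balls / len(boundaries) if boundaries else float("inf"),
    }


def bowling_economy(runs_conceded, legal_balls):
    """Economy rate: runs conceded per six-ball over."""
    return 6.0 * runs_conceded / legal_balls


# Made-up innings of six balls and a made-up four-over spell.
example_batting = batting_metrics([0, 1, 4, 6, 0, 1])
example_economy = bowling_economy(30, 24)
```

Match-level values computed this way would then be split by role (top, middle, lower middle, fast, spin) before the correlation screening described above.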
Table 2 Batting variables under study
Variable             Top         Middle      Lowermiddle
Runs                 batopnar    batmidar    Batlmar
Strike Rate          batopnsr    batmidsr    Batlmsr
Dotball %            batopndbp   batmiddbp   Batlmdbp
Boundary %           batopnbp    batmidbp    Batlmbp
Boundary frequency   batopnbf    batmidbf    Batlmbbf
RSS                  batopnrss   batmidrss   Batlmbrss
uncomfortables       batopuc     batmiduc    Batlmbuc
Table 3 Bowling variables under study
Variable                Fast       Spin
Economy                 Fsteco     spneco
Average runs conceded   fstblavg   spnblavg
Bowling Strike rate     fstblsr    spnblsr
Dotball %               fstbldb    spnbldb
Boundary %              Fsblbp     spnblbp
Boundary frequency      Fsblbf     spnblbf
3.4 Principal Component Analysis
Since most of the variables are derived from two basic quantities, namely runs and balls, we expected multicollinearity
among the independent variables. The correlation matrix of the data set was obtained to assess the pairwise
association between variables. Appendix B shows the Pearson correlation matrix with statistically significant
correlation coefficients (P<0.01) highlighted. Two interesting observations from the analysis are the absence of a
relationship between a team's batting and bowling performance, and the high correlation among variables within a
particular player role. Since the number of variables is high and there is no correlation between variables of different
roles, the correlation results are split by batting and bowling roles. From Appendix B, we find a significant amount of
correlation among the predictors, judging by the strength of the correlation coefficients between them. To remove
multicollinearity, the observed variables are reduced to a smaller number of principal components (PCs) using
Principal Component Analysis (PCA).
PCA is a special case of factor analysis that transforms the original set of inter-correlated variables into a new set of
an equal number of uncorrelated variables, or PCs, which are linear combinations of the original variables. The
principal components are ordered so that the first PC explains the largest share of the variance in the data, and each
subsequent PC accounts for the largest proportion of the variability not accounted for by its predecessors. We also
used principal component methods to select subsets of variables for the regression equation: a varimax rotation of the
principal components retains a subset of the original variables associated with each of the first few components.
Tables 3 and 4 summarize the results of the varimax rotation on the 15 principal components, together with the
amount of variance explained by each component. The higher the loading of a variable, the more that variable
contributes to the variation accounted for by the particular principal component. A principal component with an
eigenvalue greater than or equal to 1 is usually considered to be of statistical significance (the Kaiser criterion).
Principal component analysis was done using STATA to obtain orthogonal scores accounting for the variance in the
attributes. A 10-component solution was chosen based on the scree plot (Figure 2) and the interpretability of the
varimax-rotated components. The ten components accounted for 85.98% of the variance.
Figure 2 Scree plot
The ten PCs, the variables loaded on each of them, their eigenvalues and the cumulative percentage of variance
explained are given in Table 4. From Table 4, we find that although boundary percentage is derived from runs and
boundary frequency is derived from balls faced, the two variables are strongly correlated for top order batsmen.
Similarly, for fast bowlers, economy rate, which is a function of total runs, is correlated with the number of fours
conceded by the bowler. More of the variance in the data is explained by middle and lower middle order performance
than by top order performance. Also, we find that the number of uncomfortable balls faced by a batsman or bowled
by a bowler is almost insignificant.
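The component-extraction step described above can be illustrated with a short sketch. It is not the STATA procedure used in the study: it assumes NumPy, runs on synthetic data, applies only the Kaiser criterion, and omits the varimax rotation. The function name `pca_correlation` is hypothetical.

```python
import numpy as np


def pca_correlation(X):
    """PCA on the correlation matrix of X (rows = matches, columns = variables).

    Returns eigenvalues in descending order, the matching eigenvectors
    (one per column) and the cumulative share of variance explained.
    """
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumvar = np.cumsum(eigvals / eigvals.sum())
    return eigvals, eigvecs, cumvar


# Synthetic stand-in for collinear performance variables: two latent factors,
# each observed twice with a little noise (mimicking metrics derived from the
# same underlying runs and balls counts).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = np.column_stack([latent[:, 0], latent[:, 0], latent[:, 1], latent[:, 1]])
X = X + 0.1 * rng.normal(size=X.shape)

eigvals, eigvecs, cumvar = pca_correlation(X)
n_keep = int(np.sum(eigvals >= 1.0))          # Kaiser criterion: keep eigenvalue >= 1

# Uncorrelated component scores for the retained components.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ eigvecs[:, :n_keep]
```

On the synthetic data the four collinear columns collapse into two retained components, which is the same effect the study relies on when reducing the role-wise variables to ten PCs.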
Table 3 Rotated principal component loadings
Variable      PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8    PC9   PC10   PC11   PC12   PC13   PC14   PC15
batopnar     0.01   0.16   0.06  -0.77   0.01   0.03  -0.01  -0.03   0.00  -0.10  -0.02  -0.04  -0.03   0.24  -0.32
batopnsr     0.00   0.04   0.52  -0.79   0.06   0.05   0.02  -0.06   0.02  -0.04  -0.06  -0.07  -0.02  -0.05   0.08
batopndbp    0.02  -0.06   0.20   0.93  -0.04  -0.02  -0.01   0.05  -0.01   0.04   0.05   0.07   0.00   0.06  -0.07
batopnbp     0.04   0.01   0.92   0.07   0.09   0.09   0.01  -0.01   0.02   0.01  -0.03   0.00   0.03   0.00  -0.02
batopuc      0.09  -0.01   0.04   0.13   0.07   0.03   0.02  -0.01  -0.05   0.01   0.04   0.01   0.57  -0.02   0.02
batopnrss    0.03  -0.01   0.89   0.07   0.06   0.06   0.00  -0.01   0.01   0.00  -0.02  -0.03  -0.02   0.00   0.00
batopnbf    -0.01  -0.02  -0.75   0.41  -0.08  -0.05  -0.01   0.02  -0.02   0.01   0.01   0.02  -0.04   0.00  -0.03
batmidar     0.36   0.14   0.06   0.22   0.01   0.01  -0.02  -0.64   0.01  -0.11  -0.01  -0.03   0.00   0.11   0.40
batmidsr     0.65   0.11   0.02  -0.09   0.06   0.06   0.00  -0.70  -0.01  -0.07  -0.06  -0.08  -0.01   0.01  -0.03
batmiddbp    0.01  -0.10  -0.01   0.12  -0.03  -0.03   0.02   0.95   0.01   0.11   0.09   0.05   0.02   0.01   0.05
batmidbp     0.93   0.05   0.03   0.03   0.09   0.07   0.01   0.03  -0.01   0.00   0.00   0.00   0.02  -0.01   0.02
batmiduc     0.03  -0.02   0.03  -0.03   0.09   0.00   0.00   0.10   0.00   0.01   0.03   0.00   0.63   0.00  -0.03
batmidrss    0.95   0.04   0.02  -0.01   0.05   0.06   0.02  -0.03  -0.01   0.02   0.00  -0.06   0.02   0.02   0.01
batmdbf     -0.78  -0.04  -0.02  -0.02  -0.09  -0.03  -0.01   0.25  -0.01   0.04   0.03  -0.01  -0.01  -0.01  -0.01
batlmar     -0.07   0.17   0.04   0.20   0.04   0.04   0.00   0.19   0.03  -0.33  -0.05  -0.02  -0.02  -0.47  -0.03
batlmsr      0.05   0.74   0.00  -0.09  -0.03   0.02  -0.01  -0.13  -0.01  -0.59  -0.03  -0.04  -0.02   0.01  -0.02
batlmdbpr   -0.02  -0.10   0.00   0.09   0.01   0.01   0.01   0.15   0.00   0.92   0.05   0.06   0.02   0.03  -0.01
batlmbp      0.06   0.92   0.01  -0.04   0.04   0.04  -0.02  -0.03   0.00   0.07   0.01  -0.01   0.00   0.00   0.02
batlmbuc     0.03  -0.07   0.02   0.02   0.11  -0.03   0.02   0.00  -0.02   0.11  -0.02   0.00   0.53   0.03   0.05
batlmrss     0.06   0.92   0.01  -0.06  -0.03   0.03  -0.01  -0.08  -0.01   0.00   0.02   0.00  -0.03   0.00  -0.01
Batlmdbf    -0.04  -0.77   0.02   0.07  -0.04  -0.01   0.03   0.09   0.01   0.20   0.03   0.05  -0.02   0.06  -0.02
fsteco       0.11   0.03   0.09  -0.08   0.64   0.07   0.30  -0.08   0.12  -0.01  -0.07  -0.59   0.05  -0.01  -0.01
fstblavg     0.03  -0.01   0.02  -0.02   0.23   0.04   0.94   0.00   0.05   0.00  -0.03  -0.18   0.01   0.00  -0.01
fstblsr     -0.01  -0.03  -0.01   0.01   0.09   0.04   0.98   0.03   0.02   0.01  -0.03  -0.05   0.00   0.00   0.01
fstdbpr     -0.05  -0.05  -0.02   0.13  -0.09  -0.01  -0.24   0.10  -0.09   0.09   0.11   0.83   0.02   0.00  -0.01
fsblbp       0.11   0.02   0.09  -0.01   0.92   0.10   0.15  -0.02   0.07   0.03   0.00   0.09   0.03  -0.01   0.00
fsblbf      -0.08   0.02  -0.09   0.05  -0.85  -0.07  -0.17   0.04  -0.08   0.02   0.06   0.20  -0.04   0.00  -0.01
spneco       0.08   0.01   0.08  -0.07   0.09   0.67   0.07  -0.10   0.20  -0.02  -0.61  -0.05  -0.01  -0.01   0.02
spnblavg     0.01   0.00   0.03  -0.03   0.09   0.23   0.05  -0.02   0.93  -0.01  -0.21  -0.04  -0.01  -0.01   0.01
spnblsr     -0.02  -0.01   0.00   0.01   0.07   0.01   0.03   0.03   0.98   0.01  -0.02  -0.05  -0.01   0.01   0.00
spndbpr     -0.02   0.00  -0.03   0.10  -0.05  -0.07  -0.05   0.13  -0.22   0.07   0.84   0.11   0.02   0.01   0.01
spnblbp      0.09   0.05   0.09  -0.01   0.09   0.94   0.04   0.00   0.08   0.02   0.06   0.01   0.00   0.00   0.00
spnblbf     -0.07  -0.03  -0.05   0.04  -0.07  -0.82  -0.03   0.03  -0.13   0.02   0.14   0.04   0.00   0.00   0.01
Table 4 Principal components and variables loaded
PCs    Role                             Variables loaded                      Eigen value   Cumulative variance
PC1    Middle order performance         Batmidbp, batmidrss, batmdbf          3.012         11.40%
PC2    Lower middle order performance   Batlmsr, batlmbp, batlmrss, batlmbf   2.949         22.56%
PC3    Top order performance            batopnbp, batopnrss, batopnbf         2.568         32.28%
PC4    Top order performance            Batopnar, batopnsr, batopndbp         2.452         41.56%
PC5    Fast bowler performance          Fsteco, fsblbp, fsblbf                2.156         49.72%
PC6    Spin bowler performance          Spneco, spnblbp, spnblbf              2.110         57.70%
PC7    Fast bowler performance          Fstblavg, fstblsr                     2.049         65.46%
PC8    Middle order performance         Batmidar, batmidsr, batmiddbp         2.032         73.15%
PC9    Spin bowler performance          Spnblavg, spnblsr                     1.972         80.61%
PC10   Middle order performance         batlmdbpr                             1.419         85.98%
3.5 Logistic Regression
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables
determine a dichotomous outcome. The goal is to find the best-fitting model to describe the relationship between the
dichotomous characteristic of interest and a set of independent variables. Logistic regression generates the
coefficients (and their standard errors and significance levels) of a formula that predicts a logit transformation of the
probability of presence of the characteristic of interest. The simple logistic model has the form
$$\operatorname{logit}(Y) = \ln\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta x, \qquad \pi = P(Y = 1 \mid X = x)$$
The probability of the outcome of interest is given as
$$\pi = P(Y = 1 \mid X = x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
where the coefficient $\beta$ determines the direction of the relationship between the independent variable and the logit
of Y. The outcome of a cricket match is a binary variable (win or loss) and is influenced by various factors. Here, we
use the principal component scores from Section 3.4 as independent variables in a stepwise logistic regression to
determine the relationship between the PCs and the outcome of the match. The model is significant, and its pseudo
R2 is 0.725, i.e. the independent variables in the model explain about 72.5 percent of the variation in the outcome of
the match.
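As a rough illustration of the regression step, the sketch below fits a logistic model of match outcome on component scores by Newton-Raphson. It is a sketch under assumptions, not the stepwise procedure used in the study: the data are simulated, the "true" coefficients are arbitrary illustrative values, and `fit_logistic` is a hypothetical helper written with NumPy only.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit logit(p) = b0 + X @ b by Newton-Raphson; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend the intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(A @ beta)))      # predicted win probability
        W = p * (1.0 - p)                          # variance weights
        grad = A.T @ (y - p)                       # gradient of the log-likelihood
        hess = A.T @ (A * W[:, None])              # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta


# Simulated stand-in for two principal-component scores and match outcomes.
rng = np.random.default_rng(1)
pc_scores = rng.normal(size=(2000, 2))
true_beta = np.array([0.0, 0.7, -0.9])             # arbitrary illustrative values
p_win = 1.0 / (1.0 + np.exp(-(true_beta[0] + pc_scores @ true_beta[1:])))
won = (rng.random(2000) < p_win).astype(float)

beta_hat = fit_logistic(pc_scores, won)
```

The sign of each fitted coefficient indicates whether a higher score on that component raises or lowers the odds of winning, which is how the coefficients in the regression table are read.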
Table 5 Regression modeling results
PCs        Coef.   Std Coef.   Std. Err.   Prob   [95% Conf. Interval]
PC1         0.71    0.88        0.09       0.00    0.53    0.88
PC2         0.97    1.18        0.11       0.00    0.77    1.18
PC3         0.33    0.50        0.09       0.00    0.16    0.50
PC4        -0.93   -0.75        0.10       0.00   -1.12   -0.75
PC5        -1.31   -1.09        0.11       0.00   -1.53   -1.09
PC6        -0.71   -0.53        0.09       0.00   -0.89   -0.53
PC7        -0.99   -0.80        0.10       0.00   -1.18   -0.80
PC8        -1.01   -0.81        0.10       0.00   -1.21   -0.81
PC9        -0.76   -0.58        0.09       0.00   -0.94   -0.58
PC10       -0.85   -0.65        0.10       0.00   -1.05   -0.65
constant   -0.01    0.15        0.08       0.92   -0.17    0.15
3.6 Player evaluation and Team selection
One of the greatest challenges in player evaluation in cricket is choosing the performance metrics that should be
considered to evaluate a player. The question is whether one should consider all variables or only a subset, and
whether all players should be evaluated on the same set of variables or on a set that depends on the player's role.
There are various metrics to measure the performance of batsmen and bowlers. In our case, we evaluate players based
on their role and, within each role, the significant variables from Table 4 are considered for player evaluation. An
integer programming model is proposed for team selection, whereby players are selected into the team using binary
decision variables. The decision variables determine whether or not an individual is good enough for selection, based
on the optimization of a linear function for predefined criteria. Utility is used as the objective function to be
maximized. To evaluate the utility of a player, we need to scale the values of the variables so that the
multidimensional attributes are measured independently of their units and ranges. The set of performance variables
can be classified into two subsets: positive and negative ones. The values of positive variables (strike rate, batting
average etc.) need to be maximized, while the values of negative variables (economy, bowling average etc.) need to be
minimized. To normalize, the attributes are scaled in the range of 0-1
with 1 representing the best player and 0 representing the worst player, using the min-max normalization procedure
defined in equations 1 and 2. Let $x_{prv}$ be the observed value of the $p^{th}$ player for the $v^{th}$ variable (i.e. strike rate,
runs etc.) for the $r^{th}$ role (i.e. top, middle, etc.).
If a variable represents a positive dimension, then the normalized variable is
$$n_{prv} = \frac{x_{prv} - \min_{p}(x_{prv})}{\max_{p}(x_{prv}) - \min_{p}(x_{prv})} \qquad (1)$$
and if a variable represents a negative dimension, then the normalized variable is
$$n_{prv} = \frac{\max_{p}(x_{prv}) - x_{prv}}{\max_{p}(x_{prv}) - \min_{p}(x_{prv})} \qquad (2)$$
Now the utility of a player is defined using the principles of TOPSIS, whereby we select the player that is closest to
the best performance and farthest from the worst performance. So the utility of player p in role r is
$$U_{pr} = \sum_{v} w_{rv}\, n_{prv} \qquad (3)$$
where $n_{prv}$ is the normalized value from equation 1 or 2 and $w_{rv}$ is the weightage assigned to a particular performance
variable for a particular role, based on the logistic regression results described in Table 5. Standardized beta
coefficients are used to arrive at the relative weightages such that the sum of the weights is equal to 1.
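The normalization and utility calculation can be sketched as below. Two points are assumptions made for illustration rather than details stated in the text: utility is taken to be a weighted sum of the min-max normalized variables, and weights are taken as the absolute standardized coefficients rescaled to sum to 1. All function names and numbers are hypothetical.

```python
def min_max(values, positive=True):
    """Scale values to 0-1 so that 1 is the best player and 0 the worst.

    positive=True  : larger is better (e.g. strike rate)
    positive=False : smaller is better (e.g. economy)
    """
    lo, hi = min(values), max(values)
    if hi == lo:                       # all players identical on this variable
        return [0.0 for _ in values]
    if positive:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def weights_from_coefficients(std_coefs):
    """Assumed weighting rule: absolute standardized coefficients rescaled to sum to 1."""
    total = sum(abs(c) for c in std_coefs)
    return [abs(c) / total for c in std_coefs]


def utilities(columns, positive_flags, weights):
    """Weighted sum of normalized variables for each player within one role."""
    normalized = [min_max(col, pos) for col, pos in zip(columns, positive_flags)]
    n_players = len(columns[0])
    return [
        sum(w * col[p] for w, col in zip(weights, normalized))
        for p in range(n_players)
    ]


# Three hypothetical fast bowlers: economy (lower is better), dot-ball % (higher is better).
economy = [6.0, 7.5, 9.0]
dot_ball_pct = [40.0, 50.0, 30.0]
w = weights_from_coefficients([-0.6, 0.4])    # made-up standardized coefficients
u = utilities([economy, dot_ball_pct], [False, True], w)
```

In this toy example the first bowler has the best economy and the second the best dot-ball percentage; because economy carries the larger weight, the first bowler ends up with the highest utility.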
Figure 3 Weightages for performance variables by player role based on IPL 2014 data
The next step is to select a team based on performance utility. We use an integer programming model with the
following decision variables:
$$y_{pr} = \begin{cases} 1 & \text{if player } p \text{ is selected for role } r \\ 0 & \text{otherwise} \end{cases}$$
The objective of the model is to maximize the utility of the players in the team. The utility of the team is defined as
the sum of the players' utilities across all roles. Each variable within each role is assigned a particular weightage
depending on how much it influences the outcome of the match ($w_{rv}$).
$$\max Z = \sum_{p} \sum_{r} U_{pr}\, y_{pr} \qquad (4)$$
where $U_{pr} = \sum_{v} w_{rv}\, n_{prv}$ is the utility of player p in role r, i.e. the weighted sum of that player's normalized
performance variables.
The constraints of the model can be changed according to the requirements of the selection. Constraint 1
(equation 5) ensures that at least $N_r$ players are selected for a particular role r. Generally $N_r$ depends on the team
composition required and should be greater than or equal to 2 for top order batsmen, 3 for middle order batsmen, 2
for lower middle order batsmen, 2 for fast bowlers and 2 for spin bowlers. Constraint 2 (equation 6) makes sure that
the maximum number of players selected is 15. Constraint 3 (equation 7) ensures that a player is selected only once.
This is important in cricket, as players play multiple roles as both batsman and bowler. Additional constraints, such as
availability or national quota rules, subject to the requirements of the selectors, could easily be added to this model.
$$\sum_{p} y_{pr} \geq N_r \quad \forall r \qquad (5)$$
$$\sum_{p} \sum_{r} y_{pr} \leq 15 \qquad (6)$$
$$\sum_{r} y_{pr} \leq 1 \quad \forall p \qquad (7)$$
$$y_{pr} \in \{0, 1\} \qquad (8)$$
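The structure of the selection model can be illustrated on a toy instance. The sketch below does not use an integer programming solver: it enumerates every assignment exhaustively, which is feasible only for a handful of players, and all player names, utilities and limits are made up. The function `select_team` is hypothetical.

```python
from itertools import product


def select_team(utility, min_per_role, max_squad):
    """Exhaustive search over assignments for a very small selection problem.

    utility[player][role] is the utility of the player in a role he can fill.
    Each player is either left out (None) or given exactly one eligible role,
    which enforces the "selected only once" constraint by construction.
    """
    players = list(utility)
    options = [[None] + list(utility[p]) for p in players]
    best_value, best_team = float("-inf"), None
    for assignment in product(*options):
        chosen = [(p, r) for p, r in zip(players, assignment) if r is not None]
        if len(chosen) > max_squad:                 # squad-size limit
            continue
        counts = {}
        for _, r in chosen:
            counts[r] = counts.get(r, 0) + 1
        if any(counts.get(r, 0) < n for r, n in min_per_role.items()):
            continue                                # minimum players per role
        value = sum(utility[p][r] for p, r in chosen)
        if value > best_value:
            best_value, best_team = value, dict(chosen)
    return best_value, best_team


# Hypothetical utilities; player C can bat in the top order or bowl fast.
toy_utility = {
    "A": {"top": 0.9},
    "B": {"top": 0.6},
    "C": {"top": 0.5, "fast": 0.7},
    "D": {"fast": 0.8},
    "E": {"fast": 0.4},
    "F": {"top": 0.3},
}
best_value, best_team = select_team(toy_utility, {"top": 2, "fast": 2}, max_squad=4)
```

For a realistic player pool the number of assignments grows exponentially, so a dedicated solver is needed in practice; the enumeration here only makes the objective and the three constraint types concrete.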
4 Model validation and Results
The Indian Premier League (IPL) is one of the finest Twenty20 competitions in world cricket, modeled on the lines of
the English Premier League and the National Basketball Association, in which players from all cricket-playing
countries play for franchises. The tournament was inaugurated in 2008 and has taken Indian cricket to new heights. To
illustrate the proposed model, data captured from the 2013 and 2014 IPL seasons are used to rate players based on
their role. The final player ranking for 2013 from the model is shown graphically in Figure 3, with player role on the
x-axis and player utility on the y-axis. Similar results were obtained for 2014.
Figure 3 Utility of players in IPL 2013
The optimization model proposed in Section 3.6 is run on the same dataset using LINGO, an optimization software
package, to find an optimal team. The optimal team suggested by the model based on players' performance in IPL
2014 is Lendl Simmons, Robin Uthappa, David Warner (top order), Glenn Maxwell, AB de Villiers, Suresh Raina
(middle order), MS Dhoni, James Faulkner, Yusuf Pathan (lower middle order), Lasith Malinga, Bhuvneshwar Kumar,
Mitchell Starc (fast), Sunil Narine, Axar Patel and Harbhajan Singh (spin). A few interesting observations from the
output of the model are the selection of James Faulkner as a lower middle order batsman, the omission of Purple Cap
holder Mohit Sharma, and the selection of Harbhajan Singh and Yusuf Pathan, who are not part of the Indian
international team. The likely reason for the selection of James Faulkner is his all-round capability, and the likely
reason for the rejection of Mohit Sharma is his poor bowling performance on measures other than wickets taken.
To validate the optimization model, the actual performance of the selected players in matches played after IPL 2014 is
compared with the actual performance of other players. The comparison group is restricted to players in a similar role
who played in the IPL and were not recommended by the model. A two-sample t-test was performed to check
whether there is a significant difference in average runs, strike rate and boundary percent for batsmen, and in strike
rate, economy, bowling average and boundary frequency for bowlers. The two-sample t-test results are summarized in
Tables 6 and 7 for bowling and batting respectively.
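The comparison can be sketched with a two-sample t statistic computed from first principles. The text does not say whether equal variances were assumed, so the sketch uses Welch's unequal-variance form as an assumption; the strike-rate figures are made up, and `welch_t` is a hypothetical helper built on the Python standard library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    va = variance(sample_a) / na       # squared standard error of mean A
    vb = variance(sample_b) / nb       # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Made-up post-tournament strike rates: model-selected batsmen vs. other batsmen.
selected = [140.0, 150.0, 145.0, 155.0]
others = [120.0, 125.0, 130.0, 115.0]
t_stat, dof = welch_t(selected, others)
```

The statistic is then compared with the critical value of the t distribution for the computed degrees of freedom at the chosen significance level; a large positive value indicates that the selected group outperformed the comparison group on that variable.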
Table 6 Two-sample t-test results for bowling
Role   Bowl strike rate   Bowl economy   Bowling average   Boundary frequency
Fast
Spin
Table 7 Two-sample t-test results for batting
Role          Average runs   SR   Boundary percent
Top
Middle
LowerMiddle
- Significant at 0.001
From Tables 6 and 7, we infer that there is a statistically significant difference in performance between the players
identified by the model and other players. However, we also see that there is no statistically significant difference for
certain variables, such as economy for spin bowlers, boundary frequency for both fast and spin bowlers, runs scored
by middle order batsmen and SR for lower middle order batsmen. A possible reason is the close competition for slots
in the team, together with the fact that the model selects players on overall performance and not on a single
performance variable. The optimization model was run for every team selection made by various teams for all tours
and tournaments in 2014, and the results were compared with the actual selections made by the respective selection
committees. Our model's team selection accuracy was 83%, i.e. roughly 8 out of 10 players selected by the committees
were correctly predicted. Overall, we found that 75% of good performers were correctly predicted, i.e. there was a
significant difference in performance between the players selected by the model and other players.
We also validated how the model can be used to predict team rankings based on player potential. For this, the players
were first ranked on their performance over the past year using the proposed methodology. The IPL 2014 teams were
then scored using their players' scores computed in the previous step and the weightages given to each role. The
overall team scores were then compared with the teams' actual performance in the IPL 2014 season. The teams' actual
ranks closely resembled their relative scores obtained from the model.
5 Conclusion
We have proposed a new methodology for the objective evaluation of players for team selection. The proposed
integer programming model based on player utility is more realistic than traditional evaluation and selection methods.
To our knowledge, it is the first of its kind in cricket, because the model considers multiple variables across
dimensions, viz. batting and bowling, that impact the outcome of the match and distributes weightages accordingly,
based on roles, to evaluate players, making it a self-adaptive intelligent model. Our model's team selection accuracy is
83%, and 75% of the good performers were correctly identified. The proposed model can be used as an effective
enabler to rate players, select teams, define salary caps etc. The results of the model are demonstrated for cricket and
can be easily extended to other sports. Our future work includes the use of players' fitness and workload information
for player evaluation and team selection.
6 Acknowledgements
We would like to thank CKM Dhananjai and Gaurav Sundararaman for their help and support for this work.
7 References
[1] De Silva, Basil M., and Tim B. Swartz. Winning the coin toss and the home team advantage in one-day international cricket
matches. Department of Statistics and Operations Research, Royal Melbourne Institute of Technology, 1998.
[2] Durbach, Ian N., and Jani Thiart. "On a common perception of a random sequence in cricket: application." South
African Statistical Journal 41.2 (2007): 161-187.
[3] http://sporteology.com/top-10-popular-sports-world/
[4] Lemmer, Hermanus H. "The combined bowling rate as a measure of bowling performance in cricket." South
African Journal for Research in Sport, Physical Education and Recreation 24.2 (2002): p-37.
[5] Lemmer, Hermanus H. "A measure for the batting performance of cricket players." South African Journal for
Research in Sport, Physical Education and Recreation 26.1 (2004): p-55.
[6] Lemmer, Hermanus H. "A measure of the current bowling performance in cricket." South African Journal for
Research in Sport, Physical Education and Recreation 28.2 (2006): p-91.
[7] Beaudoin, David, and Tim Swartz. "The best batsmen and bowlers in one-day cricket: general." South African
Statistical Journal 37.2 (2003): 203-222.
[8] Barr, G. D. I., C. G. Holdsworth, and B. S. Kantor. "Evaluating performances at the 2007 cricket world
cup." South African Statistical Journal 42.2 (2008): 125-142.
[9] Bracewell, Paul J., and Katya Ruggiero. "A parametric control chart for monitoring individual batting
performances in cricket." Journal of Quantitative Analysis in Sports 5.3 (2009): 1-21.
[10] Bloomfield, Jonathan, Remco Polman, and Peter O'Donoghue. "Physical demands of different positions in FA
Premier League soccer." Journal of sports science & medicine 6.1 (2007): 63.
[11] Carvajal, Wiliam, et al. "Body type and performance of elite Cuban baseball players." MEDICC review 11.2
(2009): 15-20.
[12] Clarke, S. R. "Performance Modeling in Sports." Unpublished Ph.D. dissertation, submitted to the School of
Mathematical Sciences, Swinburne University of Technology (1997).
[13] Gabbett, T. J. "Physiological characteristics of junior and senior rugby league players." British Journal of Sports
Medicine 36.5 (2002): 334-339.
[14] McGee, Kimberly J., and Lee N. Burkett. "The National Football League combine: a reliable predictor of draft
status?." The Journal of Strength & Conditioning Research 17.1 (2003): 6-11.
[15] Bozbura, F. Tunç, Ahmet Beşkese, and Tuna Sorgun Kaya. "TOPSIS Method on Player Selection
in NBA."
[16] Chen, Chih-Cheng, Yung-Tan Lee, and Chung-Ming Tsai. "A Hybrid Assessment Method for Evaluating the
Performance of Starting Pitchers in a Professional Baseball Team." (2013).
[17] Dey, Pabitra Kumar, Dipendra Nath Ghosh, and Abhoy Chand Mondal. "A MCDM Approach for Evaluating
Bowlers Performance in IPL." Journal of Emerging Trends in Computing and Information Sciences 2.11 (2011).
[18] Dey, Pabitra Kumar, Dipendra Nath Ghosh, and Abhoy Chand Mondal. "Statistical Based Multi-Criteria
Decision Making Analysis for Performance Measurement of Batsmen in Indian Premier League." International Journal
of Advanced Research in Computer Science 3.4 (2012).
[19] Tavana, Madjid, et al. "A fuzzy inference system with application to player selection and team formation in
multi-player sports." Sport Management Review 16.1 (2013): 97-110.
[20] Arnason, Arni, et al. "Risk factors for injuries in football." The American Journal of Sports Medicine 32.1 suppl
(2004): 5S-16S.
[21] Gerber, Hannah, and Gary D. Sharp. "Selecting a limited overs cricket squad using an integer programming
model." South African Journal for Research in Sport, Physical Education and Recreation 28.2 (2006): p-81.
[22] Iyer, Subramanian Rama, and Ramesh Sharda. "Prediction of athletes performance using neural networks: An
application in cricket team selection." Expert Systems with Applications 36.3 (2009): 5510-5522.
[23] Sharp, G. D., et al. "Integer optimisation for the selection of a Twenty20 cricket team." Journal of the Operational
Research Society 62.9 (2011): 1688-1694.
[24] Lemmer, Hermanus Hofmeyr. "Team selection after a short cricket series." European Journal of Sport Science 13.2
(2013): 200-206.
[25] Amin, Gholam R., and Sujeet Kumar Sharma. "Cricket team selection using data envelopment
analysis." European Journal of Sport Science 14.sup1 (2014): S369-S376.
Appendix A
Variable                      Description
inningsno                     Innings number in which the player batted and/or bowled
Batting Role                  The role assigned to the player as a batsman
Bowling Role                  The role assigned to the player as a bowler
MatchResult                   Result of the match with respect to the player's team
Team Score                    Final score (runs only) of the player's team
Team Wickets                  Total wickets lost in the innings by the player's team
CompetitionName               Name of the competition in which the match took place
GroundName                    Name of the ground on which the match took place
Countryname                   Country in which the match took place
Pitchtype                     Pitch type: green top, rank turner, or flat track
PlayedFor                     Team name of the player
Opponent                      Name of the opponent team against whom the player played
BatsmanRuns                   Number of runs scored by the player as batsman
BatsmanBalls                  Number of balls faced by the player as batsman
OverallBatsmanDotBalls        Number of dot balls (no runs scored) faced by the player
Runs per Scoring Shot (RSS)   Number of runs per scoring ball
PowerPlayRuns                 Number of runs scored by the player in the powerplay overs (overs 1 to 6)
OtherOverRuns                 Number of runs scored by the player outside the powerplay overs (overs 7 to 20)
PlayingOrder                  Batting position of the player
Fours                         Number of fours hit by the player as batsman
Sixes                         Number of sixes hit by the player as batsman
SR                            Strike rate of the player as batsman, (Runs/Balls) x 100
BatsmanChances                Number of lives for the player as batsman
BowlerRuns                    Number of runs conceded by the player as bowler
Wicket                        Number of wickets taken by the player as bowler
BowlerLegalBalls              Number of legal balls bowled by the player as bowler
BowlerWides                   Number of wides conceded by the player as bowler
BowlerNoballs                 Number of no-balls bowled by the player as bowler
Bowler Strike Rate            Number of balls bowled per wicket taken by the player (Balls/Wickets)
Total_Catch                   Total catches taken by the player as fielder
Total_Stump                   Total stumpings made by the player as wicket keeper
Runs Saved                    Net number of runs saved in the field by the player
Player 30+ Partnership        Number of 30+ partnerships the player was involved in
30+ Partnership Breaker       Number of wickets taken by the player as bowler after a 30+ partnership
30+ Scores                    Number of 30+ scores by an individual player
50+ Scores                    Number of 50+ scores by an individual player
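The derived variables in Appendix A (SR, RSS and Bowler Strike Rate) follow directly from the raw counts listed there. The following is a minimal Python sketch of those calculations, not the authors' implementation. The function names are hypothetical, the sample numbers are made up for illustration, and the sketch assumes that the number of scoring balls equals balls faced minus dot balls, which the variable descriptions suggest but do not state.

```python
from typing import Optional


def batting_strike_rate(runs: int, balls: int) -> Optional[float]:
    """SR = (Runs / Balls) x 100, as described in Appendix A."""
    if balls == 0:
        return None  # undefined when the player faced no balls
    return runs / balls * 100


def runs_per_scoring_shot(runs: int, balls: int, dot_balls: int) -> Optional[float]:
    """RSS = runs per scoring ball.

    Assumption: scoring balls = balls faced - dot balls.
    """
    scoring_balls = balls - dot_balls
    if scoring_balls <= 0:
        return None  # undefined when every ball faced was a dot ball
    return runs / scoring_balls


def bowling_strike_rate(legal_balls: int, wickets: int) -> Optional[float]:
    """Bowler Strike Rate = legal balls bowled per wicket taken."""
    if wickets == 0:
        return None  # undefined when no wicket was taken
    return legal_balls / wickets


# Hypothetical single-match record keyed by Appendix A variable names.
record = {
    "BatsmanRuns": 45,
    "BatsmanBalls": 30,
    "OverallBatsmanDotBalls": 10,
    "BowlerLegalBalls": 24,
    "Wicket": 2,
}

print("SR:", batting_strike_rate(record["BatsmanRuns"], record["BatsmanBalls"]))
print("RSS:", runs_per_scoring_shot(record["BatsmanRuns"],
                                    record["BatsmanBalls"],
                                    record["OverallBatsmanDotBalls"]))
print("Bowler Strike Rate:", bowling_strike_rate(record["BowlerLegalBalls"],
                                                 record["Wicket"]))
```

Returning None for undefined ratios (no balls faced, no wickets taken) is a design choice of this sketch; the paper does not say how such cases were handled.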
Appendix B
Table A1 Bivariate correlation table for fast bowlers
Variables fsteco    fstblavg  fstblsr   fstdbpr   fsblbp    fsblbf
fsteco    1
fstblavg  0.57      1
fstblsr   0.37      0.96      1
fstdbpr   -0.68     -0.40     -0.28     1
fsblbp    0.64      0.36      0.22      -0.03     1
fsblbf    -0.73     -0.40     -0.27     0.33      -0.84     1
Table A2 Bivariate correlation table for spin bowlers
Variables spneco    spnblavg  spnblsr   spndbpr   spnblbp   spnblbf
spneco    1
spnblavg  0.51      1
spnblsr   0.20      0.93      1
spndbpr   -0.65     -0.42     -0.24     1
spnblbp   0.67      0.29      0.08      -0.01     1
spnblbf   -0.64     -0.34     -0.15     0.25      -0.80     1
Table A3 Bivariate correlation table for top order batsmen
Variables     batopnar      batopnsr      batopndbp     batopnbp      batopuc       batopnrss     batopnbdryfq
batopnar      1
batopnsr      0.6296        1
batopndbp     -0.6908       -0.6652       1
batopnbp      0.0216        0.4076        0.2547        1
batopuc       -0.1338       -0.0945       0.1379        0.0871        1
batopnrss     0.0067        0.4686        0.2708        0.8123        0.0381        1
batopnbdryfq  -0.3587       -0.6967       0.261         -0.738        -0.0079       -0.5816       1
Table A4 Bivariate correlation table for middle order batsmen
Variables     batmidar      batmidsr      batmiddbp     batmidbp      batmiduc      batmidrss     batmdbdryfq
batmidar      1
batmidsr      0.6972        1
batmiddbp     -0.5986       -0.7024       1
batmidbp      0.3479        0.5856        0.0297        1
batmiduc      -0.0955       -0.0482       0.105         0.0586        1
batmidrss     0.3797        0.7024        -0.0122       0.8686        0.04          1
batmdbdryfq   -0.4597       -0.6496       0.2764        -0.8019       -0.015        -0.6743       1
Table A5 Bivariate correlation table for lower middle order batsmen
Variables     batlmar       batlmsr       batlmdbpr     batlmbp       batlmbuc      batlmrss      batlmbdryfq
batlmar       1
batlmsr       0.2773        1
batlmdbpr     -0.3063       -0.6595       1
batlmbp       0.1204        0.6204        -0.037        1
batlmbuc      -0.0734       -0.13         0.1265        -0.0474       1
batlmrss      0.1288        0.763         -0.0928       0.8153        -0.092        1
batlmbdryfq   -0.2151       -0.6272       0.3192        -0.7958       0.0528        -0.6225       1
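The tables in Appendix B report only the lower triangle of each symmetric correlation matrix. The sketch below shows one way such a table could be produced in Python. It assumes the Pearson coefficient, which the paper does not name explicitly; the function names are hypothetical and the per-player values are invented purely for illustration (only the variable names are taken from Table A1).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def lower_triangle(columns):
    """Return {variable: [r with each earlier variable..., 1.0]}.

    `columns` maps a variable name to its list of per-player values;
    the dict's insertion order fixes the row and column order.
    """
    names = list(columns)
    table = {}
    for i, row in enumerate(names):
        table[row] = [
            1.0 if j == i else pearson(columns[row], columns[names[j]])
            for j in range(i + 1)
        ]
    return table


# Made-up values for five hypothetical fast bowlers.
data = {
    "fsteco": [7.2, 8.1, 6.5, 9.0, 7.8],
    "fstblavg": [24.0, 31.5, 19.8, 35.2, 27.1],
    "fstdbpr": [0.42, 0.35, 0.47, 0.30, 0.38],
}

for name, row in lower_triangle(data).items():
    print(f"{name:<10}" + "".join(f"{value:<10.2f}" for value in row))
```

Each printed row holds the correlations with the variables listed before it, followed by 1 on the diagonal, which mirrors the layout of Tables A1 to A5.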