Examining factors that influence English Premier Soccer Results
Using JMP® Pro 11
Huyen Nguyen, Dung Phan, and Girish Shirodkar
Oklahoma State University, Stillwater, OK 74078
Introduction
Soccer is the most popular sport in the world, with more than 250
million players in over 200 countries. The English Premier League is
broadcast in 212 territories to 643 million homes and a TV audience of
4.7 billion. It is therefore of broad interest to determine what
attributes drive English Premier League game results. Few concrete
studies have explored the factors that influence soccer game results.
This study, which is based on 10 annual seasons of English Premier
League game data, explores these factors from the perspective of the
Home Team. JMP® Pro 11 is used for data preparation, data analysis,
and predictive modeling.
Data Preparation
Fig. 1a: Forward Logistic Regression Confusion Matrix
Fig. 1b: Forward Logistic Regression Odds Ratios
Fig. 2a: Neural Network Confusion Matrix
Fig. 2b: Neural Network
The English Premier League games dataset consists of 3680
observations and 23 variables. The target variable, Home Team
Results, is derived from two variables: Full Time Home Goal and
Full Time Away Goal. It is a binary variable, with 0 meaning the Home
Team loses or draws and 1 meaning the Home Team wins. The data were
consolidated and prepared in JMP® Pro 11 before predictive modeling
was carried out. Variable selection was performed using domain
knowledge and statistical methods, and 21 key variables were selected.
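The derivation of the target variable described above can be sketched in a few lines. This is an illustration only, not the authors' JMP workflow; the function name and the sample scores are hypothetical.

```python
def home_team_result(full_time_home_goals: int, full_time_away_goals: int) -> int:
    """Return 1 if the home team wins, 0 if it loses or draws."""
    return 1 if full_time_home_goals > full_time_away_goals else 0


# Hypothetical scores as (home goals, away goals); not taken from the dataset.
matches = [(2, 1), (0, 0), (1, 3)]

# One binary target value per match: home win, draw, home loss.
targets = [home_team_result(home, away) for home, away in matches]
print(targets)
```

A draw is coded as 0 here, matching the poster's definition of the target as "loses or draws" versus "wins".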
Predictive Modeling
Four predictive models were built: a Stepwise Logistic Regression
Model, a Forward Logistic Regression Model, a Decision Tree, and a
Neural Network. The competing models were then analyzed and compared
with each other.
Fig. 3a: Decision Tree
Fig. 3b: Decision Tree Confusion Matrix
Model                  Misclassification rate  Generalized R square  AICc     BIC
Logistic Regression 1  22.15%                  48.13%                1056.5   1106.39
Logistic Regression 2  22.88%                  47.56%                1059.76  1099.7
Decision Tree          21.78%                  46.58%                N/A      N/A
Neural Network         20.92%                  53.00%                N/A      N/A
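For readers unfamiliar with the AICc and BIC columns, the sketch below computes both from a model's log-likelihood using the standard textbook definitions. The log-likelihood, parameter count, and sample size are hypothetical values chosen for illustration; they are not the values behind the table, and the exact inputs JMP used are not stated in the poster.

```python
import math


def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Corrected Akaike information criterion: AIC plus a small-sample penalty."""
    aic = -2.0 * log_likelihood + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion."""
    return -2.0 * log_likelihood + k * math.log(n)


# Hypothetical inputs: log-likelihood, number of parameters, number of observations.
example_aicc = aicc(log_likelihood=-500.0, k=10, n=1000)
example_bic = bic(log_likelihood=-500.0, k=10, n=1000)
print(round(example_aicc, 2), round(example_bic, 2))
```

For both criteria, lower values indicate a better trade-off between fit and model complexity.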
Based on the misclassification rate criterion, the Stepwise Logistic
Regression Model outperforms the other models, with a misclassification
rate of 22.15%. The Stepwise Logistic Regression Model indicates that
Half Time Home Goal, Half Time Away Goal, Home Team Red Cards, Away
Team Red Cards, Home Team Shots, and Away Team Shots are the most
important predictors of English Premier League game results. The
Stepwise Logistic Regression Model yields a sensitivity of 86.20% and
a specificity of 87.18%.
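The misclassification rate, sensitivity, and specificity quoted above are all read off a confusion matrix. The sketch below shows the standard definitions, treating a Home Team win (1) as the positive class. The counts are hypothetical and do not reproduce the poster's confusion matrices.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    return {
        # Share of all games whose outcome was predicted incorrectly.
        "misclassification_rate": (fp + fn) / total,
        # Share of actual home wins that were predicted as home wins.
        "sensitivity": tp / (tp + fn),
        # Share of actual losses or draws that were predicted as such.
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=15, tn=85, fn=20)
print(metrics)
```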
Fig. 4c: Stepwise Logistic Regression model results
Conclusion and Discussion
•The Stepwise Logistic Regression Model is selected as the final model.
Fig. 4b: Stepwise Logistic Regression model results
The effects of the influential factors on soccer game results can be
quantified. For each additional goal the Away Team scores by half time,
its chance of winning rises by 264%, whereas for each additional goal
the Home Team scores, the chance of the Home Team losing or drawing
decreases by only 79.3%. The same pattern is observed in the effects of
Red Cards on the full-time result. Each additional Red Card for the
Home Team raises its chance of losing or drawing by 122%, whereas for
the Away Team the corresponding figure is 32.3%.
•Half Time Home Goal, Half Time Away Goal, Home Team Red
Cards, Away Team Red Cards, Home Team Shots, and Away Team Shots
are the most important predictors of English Premier League game
results.
•It is feasible to predict game results with high accuracy after the
first half of the game.
Reference
•http://www.football-data.co.uk/englandm.php
The differences in how these factors drive game results can be
attributed to home-field influence. Although Home Teams have a certain
advantage from playing in their own stadium, the quantified effects
described above suggest that the Home Team is also under more pressure,
so the effects of Half Time Goals and Red Cards are diluted for the
Home Team.
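The percentage effects quoted in this discussion can be related to logistic regression odds ratios. The sketch below assumes the percentages were derived from odds ratios as (odds ratio - 1) x 100, which the poster does not state explicitly. The function names are hypothetical.

```python
import math


def odds_ratio_from_coefficient(beta: float) -> float:
    """A logistic regression coefficient maps to an odds ratio via exp(beta)."""
    return math.exp(beta)


def percent_change_in_odds(odds_ratio: float) -> float:
    """Percent change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio - 1.0) * 100.0


def odds_ratio_from_percent_change(percent_change: float) -> float:
    """Inverse of percent_change_in_odds."""
    return 1.0 + percent_change / 100.0


# Percent changes quoted in the poster (a negative value is a decrease).
quoted_changes = [264.0, -79.3, 122.0, 32.3]

# Odds ratios implied by those percentages, under the stated assumption.
implied_odds_ratios = [odds_ratio_from_percent_change(p) for p in quoted_changes]
print([round(r, 3) for r in implied_odds_ratios])
```

Under this reading, an odds ratio above 1 raises the odds of the modeled outcome and an odds ratio below 1 lowers them.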
Fig. 4a: Stepwise Logistic Regression ROC
Acknowledgements
•Dr. Goutam Chakraborty, founder of the SAS and OSU Business
Analytics Program at Oklahoma State University, for his continued
support and guidance.