
Smartersig Conference September 2012
Introduction to Betting Data Modelling with R
Dr Alun Owen
Part 1: Using R, Managing Data Files, Initial Data Exploration and a First Look at Logistic Regression
1.1 Preamble and Pre-Conference Work!
At the conference we will not actually work through the material contained on pages 1 to 15, as we will assume that you have worked through it before you arrive. Hence these pages have been supplied to you in advance of the conference: can you therefore please work through this material prior to arriving at the conference?
If there are any aspects of the material on pages 1 to 15 that you would like some
clarification on please email me (Alun Owen) at [email protected]
Instructions for downloading and installing R should already have been emailed to you or if
not can be obtained from Mark.
We will be using a data set here called aiplus2004.csv which is available on the utilities
section of the Smartersig website or again can be obtained from Mark.
When you arrive at the conference, you will be given a longer document that includes the
material here but extends this work from page 16 onwards. We will kick off the first R
session with a 5 minute resume of pages 1 to 15, but will then go almost straight into page
16 and move onwards from there.
1.2 Using R
R can actually be used in a number of ways; it can be used interactively by typing
commands in the R Console window (after the > prompt) shown below:
Alternatively, R can be used in “batch” mode, by typing a series of commands in an external
script (text) file and then running some/all the commands in that file at once. This option
opens up the real power of the software.
1.3 Using R Interactively
To submit commands to R interactively, simply type the relevant command (after the >
prompt) in the R Console window and hit the enter button.
As an example, type in the following commands (hitting the enter key after each line):
x <- c(6,7.8,7.7,10.8,6.4,5.7,6.6,12.4)
y <- c(1,2,3,4,5,6,7,8)
plot(x,y)
The <- symbol in the first two lines of commands above is used to assign values to “objects”
and can be thought of as representing “=”. In the first line above, we assigned values to a
column of data named x. These values were combined together using the c() function.
These columns of data are usually referred to as “variables”. So in the second line above,
we combined a list of values together and assigned these to another variable we named y.
The plot(x,y) command opens up an additional (graphics) window and displays a simple
scatter plot of y against x , with y on the vertical axis and x on the horizontal axis.
You can see what x and y look like simply by typing their names in R (again hitting the
enter key after each one) as follows:
x
y
If we wanted to assign a single value to a single variable (i.e. not a column of data), for
example you may have a starting betting bank of £1,000, then we could assign the value
1,000 to a variable we will call bank as follows:
bank <- 1000
If we wanted to increase the value of bank, by say £500, you could type:
bank <- bank+500
To see that the value of the variable bank has increased to £1,500, try typing:
bank
At this point it may be worth mentioning that R provides a mechanism for recalling, editing
and re-executing previous commands. The vertical arrow keys on the keyboard can usually
be used to scroll forward and backward through a command history. Once a command is
located in this way, the cursor can be moved within the command using the horizontal
arrow keys, so that you can edit the command if required and then hit the enter key to re-execute the revised command. Try this to see what x looks like again.
Now, the variables x and y and the variable bank are all what R refers to as objects. We
can take a look at what variables (objects) we have created during this R session by typing
the following command (followed by the enter key remember!):
ls()
Note this is a lower case “L” and a lower case “S” as in “lima” “sierra”!
This lists all the objects we have created in the current R session. You should see bank, x
and y listed on the screen in alphabetical order.
Next close R down so we can start again.
Do this either by clicking on File → Exit, or by clicking the usual close button (X) in the top-right of the R screen. Note that just clicking the close button (X) in the R console window will only close the console window and not close R itself.
When you try to close R you will probably see the dialogue box below:
Answer “No” to this as we do not need to save the session or any session history as such
– we will discuss this in more detail at the conference.
1.4 Using R in Batch Mode and Writing R Code
Instead of typing commands and executing these one at a time in interactive mode, we can
write a series of commands in an R script file, which is simply a text file holding our R
commands. This is useful if we want to save our work and re-use it later, or to develop
models written using R code that we can run repeatedly etc. To create a new blank R script
file, start R again and click on File → New Script as shown below:
This will open up a new window into which you can write your R commands:
In the next section we will enter our R commands into this script window and run them from
there, rather than typing them directly into the R Console window, so keep your R session
open.
1.5 Using External Data Files (and Folders) with R
So far we have entered our data by hand via the keyboard, but what about reading data in
from an external source, such as an Excel spreadsheet, comma separated (.csv) file or even
a text file?
In fact the data we entered for x and y above comes from the first few rows of the comma
separated file named aiplus2004.csv. This relates to data on horse racing and contains data
on over 22,000 horses. This data set is available on the utilities section of the Smartersig
website (or can be obtained from Mark).
This data contains six columns of data as follows:
• position3 - finishing position three races ago (1, 2, 3 or 4; 0 = anywhere else)
• position2 - finishing position two races ago (1, 2, 3 or 4; 0 = anywhere else)
• position1 - finishing position in the previous race (1, 2, 3 or 4; 0 = anywhere else)
• days - days since last race
• sireSR - win rate achieved by the horse's sire with its offspring prior to this race
• position - finishing position in this race
Each row refers to a single horse in each race.
We will look at how to read this data into R. First you need to create a new folder
somewhere in the directory structure on your own PC. You will use this to place the data file
aiplus2004.csv, and also to save your work during the conference and during any further
work you do in this pre-conference session. For example you might create a folder called
“C:\R Stuff”.
Once you have created your new folder, copy or move the file named aiplus2004.csv to this
new folder noting the directory reference of where this folder is located on your PC, such as
for example “C:\R Stuff”.
When using R, we need to direct R to point to this directory. To do this we will set the
Working Directory we want R to use. In your R script file window (i.e. NOT the R console
window we have used so far!) type the following command:
setwd("C:/R Stuff")
noting the use of the forward slash instead of the usual back slash!
Also replace C:/R Stuff with whatever directory you are using if this is different!
To then execute this command in R, highlight the command in your script file (in the usual way using your mouse) and then click on Edit → Run line or selection. Or you can use the short-cut, which is to hold down the CTRL and R keys together (with your command highlighted).
If this has been successful, all you will see in the console window is that same command in
red having been run as follows:
> setwd("C:/R Stuff")
If this was not successful, then you will probably see the following statement:
> setwd("C:/R Stuff")
> Error in setwd("C:/R Stuff") : cannot change working directory
The most likely reason is that you haven’t stated the name of the directory correctly or may
have used a back slash in the command when you need to use a forward slash.
This section is optional!
Another approach you could use instead of setting the working directory is to put the full
directory location of the source file in the actual read.csv command. For example:
horse.data<-read.csv("C:/R Stuff/aiplus2004.csv")
However, the advantage of using the setwd() command approach is that for the current R
session, R will always look in that directory for data and other files and will save your work
to this directory also, keeping this tidy for us!
There is also a third approach you could use instead of setting the working directory. In
this case we modify the properties associated with the short-cut icon on the desk-top we
use to launch R. We can modify those properties to change the default directory where R
points to. I’m not suggesting you use this approach, but for information if you wanted
to follow this approach, you need to right click, using your mouse, on the R short-cut icon
on your desktop, in order to display the properties dialogue box shown overleaf. On the
Shortcut tab type the name of the new sub-directory in the “Start-in:” field as shown below.
If you have created sub-directory with a different name and/or in a different location, use
that instead.
Click Apply and OK to save this change.
Once you are happy that the setwd() command has been run successfully, type the
following command into your script window.
horse.data<-read.csv("aiplus2004.csv")
Then run it as before by highlighting the command and then clicking on Edit → Run line or selection (or using CTRL and R).
The read.csv command actually reads data in a table-like format and from this creates what
R refers to as a data frame. We have chosen to call this data frame horse.data. You can
think of a data frame as being a data set that contains a number of columns of data. In this
case, the data frame horse.data contains the six columns of data from aiplus2004.csv.
There are other variants of the read function, should you need them. For example
read.table(), which can be used if your data has previously been saved in ASCII (flat file)
format, such as that created by windows NotePad. There is also read.csv2 for when the
data are separated by semi-colons as opposed to commas.
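For illustration only (the file names below are hypothetical, not files supplied with this course), these variants would be used along the following lines:

read.table("somefile.txt",header=TRUE)    # whitespace-delimited text file with a header row
read.csv2("somefile.csv")                 # file with values separated by semi-colons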
Okay, let’s see what our data in the horse.data data frame looks like. To do this, type the following command into your script file and then run it (Edit → Run line or selection, or use CTRL and R):
horse.data
You may see that only the first 16,666 rows (of the 22,000+ in the data frame) are
displayed! This is simply the default maximum number of rows that will be printed, which
can be altered if required.
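If you prefer not to flood the console at all, two gentler options (a sketch, not something we rely on below) are to print only the first few rows with head(), or to alter the print limit with options():

head(horse.data)            # shows just the first 6 rows
options(max.print=99999)    # the default cap on printed entries, which you can raise or lower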
To view just, say, the first 20 rows of the data frame horse.data, copy and paste the last command you entered (horse.data) onto a new line further down in your script file and edit it so that it reads as follows:
horse.data[1:20,]
(making sure you use square brackets, and include the comma!)
Run this command (Edit → Run line or selection, or use CTRL and R) and it will show rows 1 to 20, but all columns, of horse.data. You should therefore see the output shown below (note that this has been edited here to save space):
   position3 position2 position1 days sireSR position
1          1         4         2   60    6.0        1
2          3         1         3   13    7.8        2
3          0         4         2   43    7.7        3
.          .         .         .    .      .        .
.          .         .         .    .      .        .
18        NA         1         0  566   10.8       18
19         0         2         0   20    7.5       19
20         0         0         4    9    4.8       20
Note that the file aiplus2004.csv had column names in the first row and so by default R has
used these as the column names for the data frame that it created.
The first number that appears in each row is simply a row number reference printed by R and is not actually stored as a column in the data frame.
Note also that the column position3 has a value of NA in row 18. This is how R represents missing values (NA stands for Not Available). In the original data set this value would simply have been missing.
You may have noticed that the first 8 rows of sireSR and position are actually the same
as those that we typed in manually for the variables x and y previously. To see just the first
8 rows of data for sireSR and position (i.e. columns 5 and 6) copy and edit the last line in
your R script file so that it reads:
horse.data[1:8,5:6]
and then run this (Edit → Run line or selection, or use CTRL and R).
Two ways of viewing, say, the last column in the data frame are to use
horse.data[,6]
or
horse.data$position
Try adding these to your R script file and then running them.
In both cases you should see that the R console window has been filled with all the 22,000+
values in the column named position from the data frame horse.data.
See if you can now view the data in the column of our data set named position by adding
the following command to your R script file and then running it
position
You should find that R cannot find this column. This is because at the moment the columns
in the data frame horse.data are not yet visible to R as objects in their own right. They are
just simply columns of the data frame object called horse.data. We can use the
attach() function to make the columns of the data frame “visible to R”. In your script file add the following command and then run it:
attach(horse.data)
Now see if you can view the column named position by running the following command
again from your script file:
position
You should now see that this is now visible to R!
Note that we need to take care with the attach() function. Whilst this has made the
columns of the data frame horse.data visible to R, this does not create them as individual
variables or objects. Note that if we already had an object named position, the attach()
function would not overwrite that old one with the new one as you might expect it to!
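If you want to avoid these pitfalls altogether, a safe alternative (just a sketch, not required for this course) is to refer to a column via its data frame, or to undo the attach when you are finished with it:

horse.data$position    # access the column directly, no attach() needed
detach(horse.data)     # reverses a previous attach(horse.data)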
1.6 Saving and Opening R Script Files
Before we go any further, let’s save our script file. To do this, make sure your script file is the “active window” in R – you can make sure of this by clicking anywhere in the script file window. Then click on File → Save as and save it using a suitable name, but you will also need to include a .R file extension at the end of the file name so that R knows this to be an R script file.
Once you have saved your script file, close R down, again answering “No” when you see the
dialogue box below:
Now re-start R and click on File → Open script, and locate the script file you saved above to open it again.
If you can’t locate it and you are sure you are looking in the right place, you may have forgotten to add the .R file extension in the name of your script file when you saved it. Since R by default only looks for script files with a .R file extension, this may be why it isn’t listed.
You can check this by changing the “Files of type” option it is using to all files as follows:
Okay, we will now assume you have managed to locate your R script file.
1.7 Using R to Obtain Simple Data Summaries
Throughout the rest of this document I will assume that all the R commands you run will be first typed into your R script file and then run in batch mode (using Edit → Run line or selection, or CTRL and R).
Assuming you now have started a new R session and you have managed to open up your
saved R script file, run the following commands from that R script file to make sure we have
set our working directory, read in our aiplus data set as a data frame called horse.data
and have used the attach() command so that all the columns of horse.data are visible to
R:
setwd("C:/R Stuff")
(using a forward slash and replacing C:/R Stuff as required)
horse.data<-read.csv("aiplus2004.csv")
attach(horse.data)
During any data analysis or model development work, you should start with an examination
of your data. Let’s start by obtaining a frequency table on the data in the column named
position in our data set by typing the following:
table(position)
You should see the following output:
position
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
1782 1778 1780 1774 1770 1737 1675 1597 1468 1333 1178 1033  870  722  577  461
  17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32
 362  269  193  115   36   28   22   22   16   14   10    8    6    4    3    2
  33   34
   1    1
This shows us that out of the 22,000+ rows of data (i.e. horses), we have a horse finishing
1st on 1,782 occasions, 2nd on 1,778 occasions and 3rd on 1,780 occasions, etc. and
indicates that we have approximately 1,780 races in our data set.
Let’s now obtain a frequency table for another of our variables, days, by typing:
table(days)
This time the table you will see is rather large and not quite so helpful. This is because there
are so many different possible values for the number of days since a horse last ran.
Another way of summarising the data on days is to produce a histogram (i.e. a bar chart for
continuous data). To obtain this type:
hist(days)
The histogram, which will appear in the separate graphics window, should look like that shown below.

[Histogram of days: Frequency (0 to 20,000) on the vertical axis against days (0 to 1,500) on the horizontal axis.]
This chart is also not that helpful, since most horses had run within approximately 100 days of their previous race. The fact that some horses last ran more than 100 days ago, and some even up to 1,673 days ago, has significantly increased the scale on the horizontal axis and made it difficult to see the detail in the bulk of the data below 100 days. We may therefore want to focus some attention on the reduced set of horses which ran within 100 days. To do this, create a new variable by typing:
days.reduced <- days[days<=100]
This creates a new variable named days.reduced, which has been assigned with only those
values from the variable days where days is less than or equal to 100. Note the use of the
square brackets which encloses the selection criteria. Let’s now view the histogram of this
new variable by typing:
hist(days.reduced)
You should then see the following chart:

[Histogram of days.reduced: Frequency (0 to 4,000) on the vertical axis against days.reduced (0 to 100) on the horizontal axis.]
Exercise 1
Use R to produce a frequency table and histogram of the data on sireSR. (If you need help
the R commands required are shown over the page!)
You should again find that the frequency table is very large and not that helpful, since
sireSR is often given to one decimal place and so the range of possible values is much
greater. You should also find that the histogram shows that the values of sireSR vary
between 0% and 100% (of course), but that the majority of horses had a sireSR of less
than 20%.
See if you can therefore create a new variable called sireSR.reduced which contains only
the data where sireSR is less than or equal to 20%, and then produce a histogram for this
new reduced data set.
Your histogram should look as follows:

[Histogram of sireSR.reduced: Frequency (0 to 4,000) on the vertical axis against sireSR.reduced (0 to 20) on the horizontal axis.]
Note that this plot reveals an interesting feature of the data, in that there are around 885 horses that had a sireSR of zero. You should investigate why there are so many zeros, and whether a zero indicates that the horse's sire truly has never won a race, or perhaps the sire has never entered a race, or it might be because the sireSR is not known?
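If you want to check that count of zeros for yourself, a one-line sketch is:

sum(sireSR==0)    # counts the rows where sireSR is exactly zero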
R Commands for Exercise 1
table(sireSR)
hist(sireSR)
sireSR.reduced<-sireSR[sireSR<=20]
hist(sireSR.reduced)
End of Exercise 1!
With histograms, it is possible to control things like the width of the bars (class intervals),
title, axis labels etc. To get more help with how to do extra things like this, with a function
such as hist(), click on the “Help” menu and then select “R functions (text)” as shown
below:
In the dialogue box that appears as shown below, type hist and then OK as shown below to
get more help with that function:
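For example, here is a small sketch of the kind of options the help page describes (the number of breaks and the text used here are just illustrative choices):

hist(days.reduced,breaks=20,main="Days since last run (100 or fewer)",xlab="days since last run")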
Another useful part of the help facility is the document available there called “An
Introduction to R”. This can be accessed by clicking on the menu options from the Help
menu as shown below:
That document also has a useful sample R session as an Appendix.
Okay, back to our session here on Simple Data Summaries. We will now look at some
useful summary statistics of the days variable, by using the command:
summary(days)
Try that command and you should see the following output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    11.0    17.0    38.3    31.0  1673.0
Note that we have undertaken this analysis on the full set of data on days and not the
reduced set! This shows us that the number of days since each horse last raced varied from
0 (zero!) to 1,673 (4.5 years!). Again we should consider checking that all of these
represent sensible and valid values.
The output above also shows us that the average (mean) was 38.3 days. In fact the data
for days (and days.reduced) had a few very large values which will have inflated the mean
to 38.3. In these circumstances the median (“middle” value) of 17 days will more likely give
a fairer reflection of the average.
The 1st Qu. and 3rd Qu. values are what are referred to as the first and third quartiles
respectively. If all the data values are listed in order, these are the values a quarter of the
way along the data and three-quarters of the way along the data. Hence they indicate the
range of values in the middle 50% of the data, which in this case is from 11 days to 31
days. This is also known as the Inter-Quartile Range.
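If you want these quartiles on their own, the quantile() and IQR() functions give them directly; a quick sketch:

quantile(days)    # minimum, quartiles and maximum
IQR(days)         # the inter-quartile range, here 31 - 11 = 20 days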
Exercise 2
Obtain the summary statistics on the full sireSR data (i.e. not the reduced set).
End of Exercise 2!
Back again to our session on Simple Data Summaries.
So far we haven’t looked at the data in the columns named position1, position2 and
position3. Since these contain only a few possible values each (0, 1, 2, 3 or 4), we can
usefully produce frequency tables for each of these.
For example for position1 try the command:
table(position1)
Your results should be the same as those shown below:
position1
    0     1     2     3     4
13841  2529  2087  2114  2076
This shows, for example, that in our data set we had 2,529 horses who had won (i.e.
finished 1st!) in their previous race, and 2,087 horses who had finished second in their
previous race. This also shows that there are 13,841 horses in the data set which did not
finish in the top four in their previous race.
Exercise 3
Obtain frequency tables for position2 and position3.
You should obtain the results shown below.
position2
    0     1     2     3     4
13394  2539  2254  2183  2182

position3
    0     1     2     3     4
13147  2523  2249  2198  2172
This shows that there are 2,539 horses who won their race two races previously etc. This
also shows that there are 13,394 horses who did not finish in the top four in their race two
races previously etc.
End of Exercise 3
Okay, this is the end of the pre-conference work on R. Make sure you save your R script file
as we will start the conference by opening this up and running some of the commands
contained in it. If you have any questions email me (Alun Owen) at [email protected].
I look forward to seeing you at the conference.
Start of Conference Work!
1.8 Identifying Potential Information to Develop a Model of Race Outcomes
This is where we start the conference work proper and we will assume that you have
completed the pre-conference work in Sections 1.1 to 1.7, which was provided prior to the
start of the conference.
Having previously undertaken an initial examination of our data, we come to the point
where we want to look to see whether any of the information contained in our data can help
us with the development of a model of race outcomes.
A common approach to modeling any type of race outcomes, whether it’s horses, cars,
motorcycles or people etc., is to develop models that allow us to assess how various factors
impact on the chances or probability of any competitor winning. In these circumstances, we
would need historical data that indicates whether each competitor in each race won or not.
In this case we are modeling a variable which can take only two values; win/lose, and this
type of variable is often called a binary variable. A model we can use for this type of binary
outcome is the binary logistic regression model.
Binary logistic regression models are what we will focus on over the rest of the conference.
Note that a more sophisticated model that we could use instead is the multinomial logistic
regression model. This also allows us to model the probability that a horse will win, but at
the same time allows us to model the probability a horse will be placed second, third etc.
However, before we can even begin to think about understanding and implementing
multinomial logistic models we need to understand how to develop and implement the
simpler binary logistic regression model – hence this is what we will look at during this
conference. Perhaps in a follow up conference we can cover multinomial models?
Now, the information on the outcome of each race is contained in the variable named
position, which indicates the position the horse finished in each race in the data set.
Therefore, we first need to create a new variable which indicates whether the horse in each
row of our data set won that particular race or not. To do this, type the following command in your R script file and run it:
win<-ifelse(position==1,1,0)
This uses the ifelse function in R. The logic should be reasonably clear; when position is
equal to 1 (i.e. the horse won) then this assigns the value 1 to our new variable called win,
and where position takes any other value (i.e. the horse did not win) then it assigns a value
of zero to our new variable called win.
We can therefore use this new variable called win as an indicator of which horses won (i.e. when win = 1) and which horses did not win (i.e. when win = 0).
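As a quick sanity check on the new variable, you could tabulate it; from the earlier frequency table of position we would expect 1,782 winners:

table(win)    # should show 1782 ones (winners) among the 22,000+ rows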
We can use this to investigate here how the probability of winning a race is related to other variables we have historical information on, such as, for example, the horse's finishing position in each of its previous three races (position1, position2 and position3), its sire strike rate (sireSR) and the number of days since the horse last ran (days).
First we consider position1. Since this variable only covers a limited number of values (0, 1, 2, 3 or 4), we will form a two-way table (contingency table) of win against position1. This will show how often a horse won, given its finishing position in its previous race. To do this type:
table(position1,win)
This gives the two-way table of win against position1 as shown below.

         win
position1     0    1
        0 13066  775
        1  2174  355
        2  1838  249
        3  1887  227
        4  1900  176
This two-way table indicates, for example, that when a horse won its previous race (i.e. position1 = 1), the horse then went on to win its next race on 355 occasions, but didn’t go on to win its next race on 2,174 occasions.
However, we would like to have the table in a format that indicates the proportion of time the horse went on to win its next race. For example, when a horse won its previous race (i.e. position1 = 1), the proportion of time it won its next race was 355/(355+2,174) = 0.1403717 or 14%.
We can obtain this proportion easily for each of the values for the variable position1 by
typing:
position1.win<-table(position1,win)
100*prop.table(position1.win,1)
The first line above creates a new object called position1.win which is in fact the two-way
table shown above.
The second line above works out the proportions in the cells of the two-way table called position1.win. The value of 1 used after the comma at the end of the second command line means that the proportions are based on the row totals, rather than column totals (if we wanted to base the proportions on column totals this value would need to be a 2). This therefore gives us the proportion of wins for each value of position1. In this way the proportions in a row add up to 1.
We multiply by 100 in the second command line above so that the proportions are out of
100 and hence represent percentages.
The output you see should look as follows:
         win
position1         0         1
        0 94.400694  5.599306
        1 85.962831 14.037169
        2 88.068999 11.931001
        3 89.262062 10.737938
        4 91.522158  8.477842
This shows that when a horse finished first in its last race (indicated by position1 = 1) it went on to win 14.0% of the time.
When a horse finished second in its last race it went on to win 11.9% of the time.
When a horse finished third in its last race it went on to win 10.7% of the time.
When a horse finished fourth in its last race it went on to win 8.5% of the time.
When a horse was placed elsewhere in its last race it went on to win 5.6% of the time.
This suggests that the better a horse did last time out the more often it won its next race.
Hence the finishing position of a horse’s previous race does indeed provide some potentially
useful information on which to build a model for determining future race probabilities of a
win.
Note that it would be helpful to maybe round the values above to say 1 decimal place. We
can do this by modifying the line of code:
100*prop.table(position1.win,1)
to:
round(100*prop.table(position1.win,1),1)
The output in this case looks as follows:
         win
position1    0    1
        0 94.4  5.6
        1 86.0 14.0
        2 88.1 11.9
        3 89.3 10.7
        4 91.5  8.5
Next we will investigate the relationship between the probability of winning and the data on
sireSR and days. However, we have to deal with these two variables in a slightly different
way, since they both contain a wide range of possible values. One useful thing would be to
investigate whether the proportion of horses winning a race increases or decreases in some
way with increasing values of sireSR and/or days.
There is a very useful plot function in R called cdplot(). This actually produces what is called a Conditional Density plot, but we won’t worry too much about what that means. To produce this type of plot to illustrate how win proportion depends on sireSR, we could try the following commands in R (try running these!):
win.f<-as.factor(win)
cdplot(win.f~sireSR)
I’ll explain at the conference what these are doing. The main things to note here are that we
need to declare the win variable as a factor and so we create a new variable called win.f
for this. In the cdplot() command above, the ~ character (which we refer to as “twiddles”)
in the statement win.f~sireSR, is short for “depends on”. This indicates that in our plot we
want win.f to depend on sireSR.
Back to the plot this produces! First we note that these commands will probably produce a
“rubbish” plot! The reason for this is down to the fact that we are asking R to estimate how
win proportion depends on sireSR, but R is having problems for larger values of sireSR.
This is because there are not enough horses with larger values of sireSR (i.e. above 20% if
you recall the earlier work we did!) to accurately estimate things well enough! So instead
we will look at the relationship between win proportion and sireSR but only where sireSR is
less than or equal to 20%. We can do this by modifying the above commands to (try
running these):
win.reduced<-as.factor(win[sireSR<=20])
sireSR.reduced<-sireSR[sireSR<=20]
cdplot(win.reduced~sireSR.reduced)
You should see the plot shown below:
This plots the estimated proportion of horses not winning (i.e. where win.reduced=0) and
winning (i.e. where win.reduced =1) as the Sire Strike Rate increases from zero to 20%.
We can improve the plot further by plotting the win proportion underneath the lose
proportion by changing the cdplot() command to:
cdplot(win.reduced~sireSR.reduced,ylevels=2:1)
This plots the second level of our y variable (i.e. where win.reduced = 1) at the bottom of the plot and then the first level (i.e. where win.reduced = 0) above that, to give the plot as follows:
We can now add labels “Won” and “Did not Win” instead of the default 1 and 0 by modifying
the cdplot() command to:
cdplot(win.reduced~sireSR.reduced,ylevels=2:1,yaxlabels=c("Won","Did not Win"))
The revised plot is:
Finally we should amend the scale on the y axis so that it stops, say, at 25%. We can do
this by modifying the cdplot() command to:
cdplot(win.reduced~sireSR.reduced,ylevels=2:1,yaxlabels=c("Won","Did not Win"),ylim=c(0,0.25))
The revised plot is:
So what does this tell us? Well, as the sireSR increases from about 3% up to about 12%,
the proportion of horses winning increases. Where sireSR is below 3%, there is uncertainty
due to the unusual increase (spike) in win proportion. Above 12% the win proportion looks
to vary more but on average appears to be roughly leveling out? Above 15% there is also
greater uncertainty due to the sudden drop in win proportion and this variability actually
continues beyond this point! This is due to the lack of data beyond a SR of 15%.
Okay, now have a go at Exercises 4 and 5 (the first ones of the conference!).
Exercise 4 (first exercise of the conference!)
The solution and commands should you need them are on the next page!
Referring back to the work we did on pages 17 and 18, obtain a two-way table for win against position2, rounded to 1 d.p. if you can!
Note that perhaps the easiest way to do this is to copy and paste the two commands below
that we used earlier in relation to position1:
position1.win<-table(position1,win)
round(100*prop.table(position1.win,1),1)
but amend these, for example, to relate to position2.
Using your results, comment on whether you think position2 offers additional good
information in relation to predicting future race outcomes.
Then do the same for position3.
Exercise 5
The solution and commands should you need them are on the next page!
Produce a suitable plot of win proportion against the number of days since the horses last
ran (i.e. the days variable). You could start by trying the following command:
cdplot(win.f~days)
But this will also give you a rubbish plot. Do you know why? Make suitable amendments to
produce a more suitable plot.
Hint: If you think back to our earlier work on days since last run, this was below 100 for
most horses and only a few had values above this.
Solution to Exercise 4
For position2 the commands required for this exercise are:
position2.win<-table(position2,win)
round(100*prop.table(position2.win,1),1)
from which you should obtain the following results:
         win
position2    0    1
        0 93.5  6.5
        1 88.8 11.2
        2 89.1 10.9
        3 91.0  9.0
        4 91.7  8.3
Again this indicates that the data on position2 also provides potentially useful information
on which to build a model for determining future race probabilities of a win.
Repeating this for position3, the commands required are:
position3.win<-table(position3,win)
round(100*prop.table(position3.win,1),1)
and you should obtain the results shown below:
         win
position3    0    1
        0 93.1  6.9
        1 89.8 10.2
        2 90.0 10.0
        3 91.2  8.8
        4 92.3  7.7
Again this suggests that the data on position3 also provides potentially useful information
on which to build a model for determining future race probabilities of a win.
Solution to Exercise 5
R commands are:
win.reduced<-as.factor(win[days<=100])
days.reduced<-days[days<=100]
cdplot(win.reduced~days.reduced,ylevels=2:1,yaxlabels=c("Won","Did not Win"),ylim=c(0,0.15))
The resulting plot is shown overleaf (note the y axis was restricted to a maximum of 15% to
see the detail).
This plot contains some hint that the win proportion reduces as the number of days since
the horse last ran increases. However, beyond 40 days there is much more variation in the
plot and hence greater uncertainty. The variation above around 40 days is again due to the
lack of data above that point.
End of Exercises 4 and 5!
Okay to summarise so far:
We have looked at the relationship between the win percentage and the other (predictor)
variables and found that there is some suggestion of an increase in the win proportion with
values of position1, position2 and position3 nearer 1st place.
We have also seen an increase in the win proportion with higher values of sireSR but only
where the sireSR was less than or equal to 12%. Above this figure win proportion seems to
perhaps level out on average somewhat and beyond 15% there is not enough data to be
totally sure about the impact of increasing sireSR. One option is to build into any model the
notion that win probability increases with sireSR up to 12% but a sireSR above 12% has
the same impact as a sireSR=12%.
Finally we have seen that a decrease in the win proportion seems to perhaps be associated
with higher values of days, but again this applies only when the days since last run varies
from 3 to around 40. Above 40 there appears to be less of an indication of an impact and
also there is too much variation to make any clear judgments.
So, we have explored our data; now let’s fit the model!
1.9 Fitting the Binary Logistic Regression Model
A binary logistic regression model would appear to be a suitable model worth investigating here for a number of reasons:
• Firstly, our outcome (win) is a binary variable in the sense that there are only two possible values, 1 (if the horse won) or 0 (if it didn’t);
• We require as an output from our model, the probability of a horse winning a race;
• The win probability is a natural output from such a logistic model, since the reciprocal of the win probability gives us the “fair” odds of winning, which we can compare with the odds offered by bookmakers to facilitate betting decisions.
A typical binary logistic regression model can be specified algebraically as follows:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
• p is the probability of the outcome that we are interested in, which in our case is a win.
• The ratio p/(1-p) is therefore the probability of a win divided by the probability of not winning, and is therefore the odds.
• Hence log(p/(1-p)) is the log-odds (or log of the odds), which is where the name logistic regression comes from! Note the word “log” refers to the logarithm function, which we will explain in more detail shortly (see the small numerical example after this list).
• The terms x1, x2, …, xp represent the predictor variables, which in our case would be sireSR, days, position1, position2 and position3.
• Note that in our case the variables sireSR and days are continuous variables since they are measured on a continuous scale. However, the variables position1, position2 and position3 should really be treated as categorical variables, since the position a horse finished in a previous race falls into one of five categories (first, second, third, fourth or unplaced). We will discuss this issue again later and during the conference.
• The terms β1, β2, …, βp in the model represent what we refer to as parameters that we let the software estimate to give us the best fitting model. We multiply the values of the predictors (x1, x2, …, xp) by these parameters (β1, β2, …, βp) in the model.
• Note that the term β0 represents a constant parameter that accounts for when the values of all the predictor variables are zero.
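Before moving on, it may help to see these quantities computed for one illustrative probability (the value 0.2 below is just an example, not taken from our data):

p<-0.2
odds<-p/(1-p)          # the odds, here 0.25
log.odds<-log(odds)    # the log-odds, here about -1.386
exp(log.odds)/(1+exp(log.odds))    # converts back to the probability, 0.2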
So what does this model look like? Well, let’s first consider the simplest model where we have just one continuous predictor, x, say. The binary logistic regression model in this case is:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x
Now, it turns out (if we use some mathematical algebra) that this model can be re-stated as follows:

p = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}

This is the equation of a typically “s-shaped” curve, an example of which is shown below:
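If you want to draw such a curve yourself, a minimal sketch using R's curve() function is shown below; the values -3 and 0.5 are arbitrary illustrative choices for β0 and β1, not estimates from our data:

curve(exp(-3+0.5*x)/(1+exp(-3+0.5*x)),from=0,to=15,xlab="x",ylab="p")    # an s-shaped logistic curve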
So how does this relate to what we have? Consider again our cdplot of win proportions
against sireSR. Can you see how the model above does bear some resemblance to our cd
plot below?
So how can we use R to find the equation of the (logistic) curve that best fits our data?
Well, we first state our simple model as follows:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \text{sireSR.trunc}
The parameters β0 and β1 will be estimated by R using a model fitting technique called
maximum likelihood estimation (no details of maximum likelihood estimation are discussed
here).
We can fit the model by typing the following commands:
win<-ifelse(position==1,1,0)
sireSR.trunc<-ifelse(sireSR<=12,sireSR,12)
glm(win~sireSR.trunc,family=binomial)
The first line of code above has already been run but is again repeated for completeness.
The second line of code allows us to use all the data for all horses in the model, but where a
horse’s sireSR is above 12% this is truncated (reduced) down to 12. This is different to
sireSR.reduced – can you see how?
The third line above uses the glm command to fit the model since the binary logistic
regression model is actually an example of what is called a Generalized Linear Model (GLM).
In the glm() command, the ~ character (which we refer to as “twiddles”) in the statement
win~sireSR.trunc is short for “depends on”. This indicates that in our model we want win
to depend on sireSR.trunc.
The statement family=binomial tells R that we wish the output (dependent) variable (win)
to be treated as if it is binary (and so follows the binomial probability distribution).
Some of the output from running the above glm() command in R is:

Coefficients:
 (Intercept)  sireSR.trunc
    -2.84005       0.04847

In this output, the coefficient listed under “(Intercept)” is -2.84005 and this is the estimate of β0 in our model. The coefficient 0.04847 listed under sireSR.trunc is the estimate of β1 in our model, which is the parameter we multiply sireSR.trunc by.
Therefore the best fitting model for our data is:

\log\left(\frac{p}{1-p}\right) = -2.84005 + 0.04847 \times \text{sireSR.trunc}
Recall that this is the equation of a curve and so we can plot this curve on top of our data to
see how well it corresponds to our data. This plot looks as shown overleaf.
Note that the model ends where the sireSR reaches 12%. Above this level we assume the
predicted win proportion is the same as that at 12%. Hence the model would continue as a
flat horizontal line where the sireSR is above 12%.
We can produce the above plot using the following commands in R:
ourmodel<-glm(win~sireSR.trunc,family=binomial)
win.f<-as.factor(win)
cdplot(win.f~sireSR.trunc,ylevels=2:1,yaxlabels=c("Won","Did not Win"),ylim=c(0,0.25))
x<-seq(from=0,to=12,by=0.01)
pred.sireSR<-data.frame(sireSR.trunc=x)
p<-predict(ourmodel,newdata=pred.sireSR,type="response")
lines(p~x,type="l")
The first line above uses the glm command to fit the model, but this time the results of the
model fit are assigned to an object we choose to call ourmodel.
The second and third lines produce the actual cdplot of the data, but note this time it is
based on sireSR.trunc rather than sireSR.reduced.
The fourth line above creates a temporary variable which we choose to call x. This uses the
seq command to create a very long column of values from 0 to 12 which increase in
increments of 0.01. We will use these values to calculate “predicted” probabilities from our
fitted model for each value of x.
The fifth line defines a data frame which has just a single column called sireSR.trunc
which contains x which is the long sequence of values we obtained in the line above. Note
that in this data frame we must name the column sireSR.trunc, since this is the name
of the sireSR variable in our model.
The sixth line in the R code above uses the predict() command to extract predicted values
using our model, but rather than using the actual data we use new data which we declare
to be the data frame which contains the sequence of values from 0 to 12 in increments of
0.01. The type="response" requests that the predictions are returned as probabilities. These
predicted probabilities are then assigned to a new variable we call p.
The last (seventh) line in the R code above uses the lines() command to add a line to the
plot. The lines command actually adds to the current plot p plotted against all the many
values of x, such that all the points are joined by very small straight lines to give the
appearance of a curve.
1.10 Forecasting and Betting with the Binary Logistic Regression Model
So how do we then use our model for betting?
Suppose in a future race a horse running has a sireSR value of 10%. The estimated log-odds from the model are then given by:

\log\text{-odds} = \log\left(\frac{p}{1-p}\right) = -2.84005 + 0.04847 \times 10 = -2.35535
But how do we then find the odds from this?
Well, we simply calculate the exponential of this, which is e^(-2.35535) and reads as “e raised to the power of minus 2.35535”.

This might look hard to calculate, but e is simply a special number in mathematics like the number π. The number e is approximately 2.718, but most calculators, as well as R and Microsoft Excel, have the number e in their fixed memory. Hence they have the capacity to raise the number e to whatever power you require.

We can calculate e^(-2.35535) by pressing the following typical keys on your calculator:

e^x  -2.35535  =
Or alternatively you can use the following command in R:
exp(-2.35535)
Either way you should get the answer to be 0.095.
This means that fair odds of the horse winning would be 1 to 0.095, i.e. 10.5:1, or decimal
odds of 11.5 if you include the stake.
The probability of winning, which is p, can be derived from 1/odds (where the odds include
the stake). Hence the probability of winning in this case is 1/11.5 = 0.087.
Hence, we have used the model to say that a horse with a sireSR value of 10%, would win
its next race with a probability of 0.087, and fair decimal odds would be 11.5.
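As an aside, R's plogis() function converts log-odds straight into a probability, so the calculation above can be cross-checked in one line:

plogis(-2.35535)    # approximately 0.087, matching 1/11.5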
We can obtain model-predicted probability and odds for five horses in a future race, with
sireSR values of 4%, 8%, 10%, 15% and 20%, again using the predict command in R as
follows:
ourmodel<-glm(win~sireSR.trunc,family=binomial)
new.sireSR<-c(4,8,10,12,12)
newrace.sireSR<-data.frame(sireSR.trunc=new.sireSR)
prob<-predict(ourmodel,newdata=newrace.sireSR,type="response")
odds<-1/prob
odds
prob
Note that the second line above creates a new column of data which we choose to call
new.sireSR, which contains the sireSR values for the new race. Note how the last two have
both been truncated from 15% and 20% down to 12%.
Note also again that in the third line, we define a data frame we choose to call newrace.sireSR, but remember that we have to call the column in this data frame that contains the sireSR data sireSR.trunc, since that is the name of the variable we have used in our model.
in our model.
The fourth line uses the predict command to extract predicted values, this time using the
new data which we declare to be the data frame which contains the sireSR values for the
future race. The type="response" again requests that the predictions are returned as
probabilities.
The remaining lines then simply calculate the odds and ask R to display the probabilities and
odds.
The output from running these commands is shown below. Note how the third set of values
agree with the “fair” odds and win probability calculated earlier!
> odds
       1        2        3        4        5
15.10018 12.61528 11.54222 10.56829 10.56829
> prob
         1          2          3          4          5
0.06622439 0.07926898 0.08663846 0.09462266 0.09462266
1.11 Model Assessment
Well, before we use the model for predictive purposes, we need to consider whether
sireSR is a good predictor of win probability.
We can assess this using the summary command as follows:
ourmodel<-glm(win~sireSR.trunc,family=binomial)
summary(ourmodel)
Part of the output from this command is:
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.840053   0.079872 -35.558  < 2e-16 ***
sireSR.trunc  0.048467   0.009523   5.089 3.59e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
One of the first things we should look at in the above is the value of Pr(>|z|) for the sireSR.trunc predictor. We refer to this as the p-value, and if this value is < 0.05 then that predictor is considered to be a significant (good) predictor of win probability.
The p-value for the sireSR.trunc predictor in this case is 3.59e-07, which is interpreted as 3.59 × 10^-7. This is simply given in standard form (remember your O level/GCSE maths?) and is therefore actually 0.000000359. Since this is < 0.05 we can conclude that sireSR.trunc is a significant (good) predictor of win probability!
Notice the following asterisk notation in the output above:
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the output, the p-value of 3.59e-07 for sireSR.trunc is annotated with *** which is
another way of highlighting that the p-value is between 0 and 0.001.
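As an optional extra, approximate 95% confidence intervals for the estimated parameters can also be obtained (this is a sketch of a standard R facility, not part of the output shown above):

confint(ourmodel)    # profile likelihood confidence intervals for the model parameters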
Exercise 6
In this exercise you will extend the model so that it includes both sireSR and days as
predictors. Note the solution is shown overleaf!
Recall that the general form of the binary logistic regression model can be stated as:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
We could consider the following model with sireSR and days as predictors:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \text{sireSR.trunc} + \beta_2 \times \text{days}
You can fit this model in R by modifying the model fit command to:
ourmodel<-glm(win~sireSR.trunc+days,family=binomial)
Use the above command to fit the model and then use the summary() command to obtain
suitable information to assess whether days is also a significant (good) predictor of win
probability. Use the output from your model to write down the equation of your model in the
following log-odds terms (i.e. fill in the missing values indicated by a ?):

\log\text{-odds} = \log\left(\frac{p}{1-p}\right) = \;? + \;? \times \text{sireSR.trunc} + \;? \times \text{days}
Use this to calculate the estimated log-odds for a horse winning a future race if it has a
sireSR value of 10% and it has been 20 days since it last ran. Hence obtain the exponential
of this to calculate the probability of this horse winning and fair odds for a bet on this horse.
Solution to Exercise 6
R commands required are:
ourmodel<-glm(win~sireSR.trunc+days,family=binomial)
summary(ourmodel)
The output is:
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.7455900  0.0815090 -33.685  < 2e-16 ***
sireSR.trunc  0.0489606  0.0095547   5.124 2.99e-07 ***
days         -0.0029181  0.0005204  -5.607 2.05e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The p-value for days is 2.05e-08, i.e. 0.0000000205 which is < 0.05 and so days is a
significant (good) predictor of win probability. This is similarly indicated by the fact that this
is annotated in the output with ***.
The log-odds equation of the model is:

\log\text{-odds} = \log\left(\frac{p}{1-p}\right) = -2.74559 + 0.0489606 \times \text{sireSR.trunc} - 0.0029181 \times \text{days}
The estimated log-odds from the model are then given by:

\log\text{-odds} = \log\left(\frac{p}{1-p}\right) = -2.74559 + 0.0489606 \times 10 - 0.0029181 \times 20 = -2.31435
We then calculate the exponential of this using R with the following command:

exp(-2.31435)

The answer to this is 0.099, and so fair odds of the horse winning would be 1 to 0.099, or in usual terms 10.12 to 1 (obtained from 1/0.099). This gives fair decimal odds of 11.12 if you include the stake. The probability of winning can be derived from 1/odds, where the odds include the stake. Hence the probability of winning in this case is 1/11.12 = 0.0899.
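The same answer can be obtained directly from the fitted model using predict(); a minimal cross-check:

newhorse<-data.frame(sireSR.trunc=10,days=20)
predict(ourmodel,newdata=newhorse,type="response")    # approximately 0.0899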
1.12 Extending our Binary Logistic Regression Model
We now consider adding position1 as another predictor of win probability in our model.
We could consider the following model with position1 included as a predictor:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \text{sireSR.trunc} + \beta_2 \times \text{days} + \beta_3 \times \text{position1}
However, at the moment we are assuming that position1 is a continuous variable i.e. that
the values 0, 1, 2, 3 and 4 that position1 takes are meaningful on that scale.
This makes some rather strong (and probably unrealistic) assumptions: that a change in position1 going from 1 to 2 (i.e. the difference between a horse winning its last race and coming second in its last race) has the same effect on win probability as a change in position1 going from 2 to 3 (i.e. the difference between a horse coming second in its last race or finishing third in its last race), and so on. A similar assumption would also be made in the model if position2 and position3 were being included in the model.
We can avoid making these strong assumptions by treating position1 (and if relevant
position2 and position3) as a categorical variable (or factor) rather than as a continuous
variable.
If we treat position1 as a factor then our model really needs to be re-stated as follows:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \text{sireSR.trunc} + \beta_2 \times \text{days}
\qquad + \beta_{11} \times \text{pos11} + \beta_{12} \times \text{pos12} + \beta_{13} \times \text{pos13} + \beta_{14} \times \text{pos14}
This may at first look a little overwhelming, so let’s look at it line by line:
The first line is simply the constant (intercept) term β0, plus a term for sireSR.trunc multiplied by the estimated model parameter β1, followed by a term for the days variable multiplied by the estimated model parameter β2.
The second line in the model above relates to the variable position1 as follows:
• pos11 is a ‘dummy’ variable that = 1 if the value of position1 is 1, otherwise pos11 is zero;
• pos12 is a ‘dummy’ variable that = 1 if the value of position1 is 2, otherwise pos12 is zero;
• pos13 is a ‘dummy’ variable that = 1 if the value of position1 is 3, otherwise pos13 is zero;
• pos14 is a ‘dummy’ variable that = 1 if the value of position1 is 4, otherwise pos14 is zero.
So, for example, if a horse finished second in its previous race, it would have pos11 = 0, pos12 = 1, pos13 = 0 and pos14 = 0. Can you see that if pos11 = 0, pos12 = 0, pos13 = 0 and pos14 = 0, then the horse did not finish in the top four in its previous race?
Before we fit the model, we need to define the dummy variables pos11, pos12, pos13 and pos14 (for position1). This is actually what R assumes we want if we simply define position1 as a factor using the command:

pos1<-factor(position1)
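If you are curious about the dummy coding R will generate from this factor, you can inspect it with the standard model.matrix() function; a small sketch (the first column shown is the intercept):

head(model.matrix(~pos1))    # columns pos11 to pos14 are the 'dummy' variables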
We can then fit this model using
ourmodel<-glm(win~sireSR.trunc+days+pos1,binomial)
summary(ourmodel)
The output obtained is:
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.0284817  0.0848857 -35.677  < 2e-16 ***
sireSR.trunc  0.0381184  0.0095645   3.985 6.74e-05 ***
days         -0.0024011  0.0005039  -4.765 1.89e-06 ***
pos11         0.9762834  0.0685220  14.248  < 2e-16 ***
pos12         0.7907706  0.0772282  10.239  < 2e-16 ***
pos13         0.6859132  0.0795358   8.624  < 2e-16 ***
pos14         0.4240642  0.0871682   4.865 1.15e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Can you see that all terms in the model are significant? Hence position1 is also a
significant (good) predictor of win probability.
Exercise 7
Extend the model to now include position2 and position3 in the model. Remember to
define suitable factors for these predictors (position2 has been done for you in the script
file).
Obtain the usual summary output to assess whether position2 and position3 are
significant predictors of win probability.
Again the solution is shown overleaf.
Solution to Exercise 7
The model is now:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \times \text{sireSR.trunc} + \beta_2 \times \text{days}
\qquad + \beta_{11} \times \text{pos11} + \beta_{12} \times \text{pos12} + \beta_{13} \times \text{pos13} + \beta_{14} \times \text{pos14}
\qquad + \beta_{21} \times \text{pos21} + \beta_{22} \times \text{pos22} + \beta_{23} \times \text{pos23} + \beta_{24} \times \text{pos24}
\qquad + \beta_{31} \times \text{pos31} + \beta_{32} \times \text{pos32} + \beta_{33} \times \text{pos33} + \beta_{34} \times \text{pos34}
We can then fit this model by typing:
pos1<-factor(position1)
pos2<-factor(position2)
pos3<-factor(position3)
ourmodel<-glm(win~sireSR.trunc+days+pos1+pos2+pos3,binomial)
summary(ourmodel)
The output obtained is:
Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.1607119  0.0895411 -35.299  < 2e-16 ***
sireSR.trunc  0.0322118  0.0097749   3.295 0.000983 ***
days         -0.0026558  0.0005274  -5.035 4.77e-07 ***
pos11         0.8593258  0.0720222  11.931  < 2e-16 ***
pos12         0.7071949  0.0781914   9.044  < 2e-16 ***
pos13         0.6280146  0.0802878   7.822 5.20e-15 ***
pos14         0.3838465  0.0878972   4.367 1.26e-05 ***
pos21         0.4201704  0.0751272   5.593 2.23e-08 ***
pos22         0.3688256  0.0789903   4.669 3.02e-06 ***
pos23         0.2157665  0.0848243   2.544 0.010969 *
pos24         0.1723176  0.0865667   1.991 0.046528 *
pos31         0.3078373  0.0757830   4.062 4.86e-05 ***
pos32         0.2404234  0.0802307   2.997 0.002730 **
pos33         0.1557823  0.0839587   1.855 0.063530 .
pos34         0.0290673  0.0885978   0.328 0.742850
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this case the only variable that does NOT have an asterisk or dot alongside the p-value is
pos34. This would suggest that whether a horse finished fourth three races ago has little
effect on the win percentage. Also the dot alongside the p-value for pos33 suggests that
whether a horse finished third three races ago has only a borderline effect on the win
percentage. However, pos31 and pos32, which relate to the horses finishing first or
second three races ago respectively, are still important.
Overall, I would perhaps argue that the information relating to position3 seems valuable
and should probably be kept in our model. But, we do need to be wary of over-fitting. This
is when we have a model which fits our data very well, but when deployed to new data
performs badly in terms of future prediction.
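One common safeguard against over-fitting, offered here only as a sketch and not something we do in this session, is to fit the model on a random portion of the data and then check its predictions on the held-out remainder:

model.data<-data.frame(win,sireSR.trunc,days,pos1,pos2,pos3)
set.seed(1)    # for a reproducible split
train<-sample(nrow(model.data),round(0.7*nrow(model.data)))    # 70% of rows for fitting
fit<-glm(win~sireSR.trunc+days+pos1+pos2+pos3,family=binomial,data=model.data[train,])
test.prob<-predict(fit,newdata=model.data[-train,],type="response")    # probabilities for unseen rows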
Our final model can therefore be specified as:

\log\left(\frac{p}{1-p}\right) = -3.161 + 0.0322 \times \text{sireSR.trunc} - 0.00266 \times \text{days}
\qquad + 0.859 \times \text{pos11} + 0.707 \times \text{pos12} + 0.628 \times \text{pos13} + 0.384 \times \text{pos14}
\qquad + 0.420 \times \text{pos21} + 0.369 \times \text{pos22} + 0.216 \times \text{pos23} + 0.172 \times \text{pos24}
\qquad + 0.308 \times \text{pos31} + 0.240 \times \text{pos32} + 0.156 \times \text{pos33} + 0.029 \times \text{pos34}
Note that the estimate for the days variable is the only one which is negative. This indicates
that the log-odds actually decrease as the number of days since last run increases. This
means that the win probability decreases as the number of days since last run increases,
which reflects the nature of the relationship suggested earlier. Note there were a number of
issues with number of days since last run which we have ignored so far and should really
investigate before finalizing our model.
To use the model for prediction, suppose we have a horse which has a sireSR value of
10%, it last ran 5 days ago, finished 1st in its previous race, finished 3rd two races ago and
was unplaced three races ago. The predicted log-odds can then be calculated as follows:
Since the horse finished 1st in its previous race it has values of pos11=1, pos12=0,
pos13=0 and pos14=0.
Similarly, since the horse finished 3rd two races ago, it has values of pos21=0,
pos22=0, pos23=1 and pos24=0.
Finally, since the horse was unplaced three races ago, it has values of pos31=0,
pos32=0, pos33=0 and pos34=0.
Hence the model predicted log-odds are given by:
\log\!\left(\frac{p}{1-p}\right) = -3.161 + 0.0322 \times 10 - 0.00266 \times 5
    + 0.859 \times 1 + 0.707 \times 0 + 0.628 \times 0 + 0.384 \times 0
    + 0.420 \times 0 + 0.369 \times 0 + 0.216 \times 1 + 0.172 \times 0
    + 0.308 \times 0 + 0.240 \times 0 + 0.156 \times 0 + 0.029 \times 0
    = -1.777
We can therefore calculate the predicted odds in R by typing:
exp(-1.777)
The answer is 0.169. This means that fair odds of the horse winning would be 1 to 0.169,
i.e. 5.9 to 1, or decimal odds of 6.9 if you include the stake. Hence the probability of
winning is 1/6.9 = 0.145.
Again we can use R to calculate these predicted odds and probabilities as follows:
newrace.data<-data.frame(sireSR.trunc=10,days=5,pos1="1",pos2="3",pos3="0")
prob<-predict(ourmodel,newdata=newrace.data,type="response")
odds<-1/prob
odds
prob
Note in the first line above, the values for pos1, pos2 and pos3 are given in quotes. This
is to ensure that they are treated as categorical observations rather than values
measured on a scale.
If we wanted model predicted probabilities for the horses in the data set we used to fit the
model, we can obtain these and plot them in a histogram using:
prob<-predict(ourmodel,type="response")
hist(prob)
1.13
More on Model Assessment
One thing we should consider is how well our model fits our data. One way is to compare
the predicted probabilities against the actual proportion of times that horses with those
probabilities won - these should be similar if our model is appropriate. A plot for our model
is shown below:
This plot suggests the model performs quite well on average where the predicted
probabilities are lower (less than 0.15). Above this level the actual win proportions do
seem to vary from the expected values, but this is harder to judge as there are far fewer
horses with predicted win probabilities above 0.15.
The above plot can be produced using the R code shown overleaf.
ourmodel<-glm(win~sireSR.trunc+days+pos1+pos2+pos3,
family=binomial,na.action=na.exclude)
prob.model<-predict(ourmodel,type="response")
prob.cat<-cut(prob.model,breaks=seq(0,1,by=0.01))
table(prob.cat)
prob.actual<-tapply(win,prob.cat,mean)
names.prob.cat<-seq(0.005,0.995,by=0.01)
plot(names.prob.cat,prob.actual,xlim=c(0,0.3),ylim=c(0,0.3),
xlab="Predicted Probability",ylab="Actual Win Proportion")
abline(0,1,lty=2)
The first line above re-fits our model, but notice that here we have added an additional
statement na.action=na.exclude. This makes sure that where a row of our data set contains
missing data, the model still returns a fitted value for that row, recorded as missing
(i.e. NA). This keeps the predicted values aligned row-for-row with our actual data set.
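A tiny made-up illustration of the difference (nothing to do with our racing data):
d <- data.frame(y=c(0,1,0,1,1,0), x=c(1.2,NA,0.7,2.1,1.5,1.8))  # one missing x value
m1 <- glm(y~x, family=binomial, data=d, na.action=na.omit)
length(predict(m1))   # 5 - the row with the NA is simply dropped
m2 <- glm(y~x, family=binomial, data=d, na.action=na.exclude)
length(predict(m2))   # 6 - an NA is returned for the missing row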
The second line above obtains the predicted probabilities using the predict() command we
saw earlier, but here we assign these predicted probabilities to a variable we call
prob.model, since we will also have the observed probabilities or proportions.
The third line uses the cut() function to divide the probability scale from 0 to 1 into
slices of length 0.01 using what are referred to as “cut-points”. The cut-points are thus
0, 0.01, 0.02, 0.03 and so on, and this creates a categorical variable (which we call
prob.cat) identifying the intervals 0 to 0.01, then 0.01 to 0.02, 0.02 to 0.03 and so on
up to 0.99 to 1.00. This command then identifies which of these intervals each of our
model probabilities falls in.
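A tiny illustration of cut() on some made-up probabilities:
p <- c(0.004, 0.013, 0.128)
cut(p, breaks=seq(0,1,by=0.01))
# returns the intervals (0,0.01], (0.01,0.02] and (0.12,0.13]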
The fourth line above uses the table() function so we can see how many horses have a
predicted probability that falls in each cut-point interval. This command actually produces
the table in the R console window, part of which is reproduced below. Notice that (perhaps
as expected) there are no horses with predicted win probabilities above 0.24.
prob.cat
   (0,0.01] (0.01,0.02] (0.02,0.03] (0.03,0.04] (0.04,0.05] (0.05,0.06]
         13          59         296         839        4051        4214
(0.06,0.07] (0.07,0.08] (0.08,0.09]  (0.09,0.1]  (0.1,0.11] (0.11,0.12]
       2649        1953        1444        1435        1165         936
(0.12,0.13] (0.13,0.14] (0.14,0.15] (0.15,0.16] (0.16,0.17] (0.17,0.18]
        831         602         504         396         307         213
(0.18,0.19]  (0.19,0.2]  (0.2,0.21] (0.21,0.22] (0.22,0.23] (0.23,0.24]
        153          96          64          46          21           2
(0.24,0.25] (0.25,0.26] (0.26,0.27] (0.27,0.28] (0.28,0.29]  (0.29,0.3]
          0           0           0           0           0           0
...
Back now to our R code to produce the model fit plot, and the fifth line uses a very useful
function in R called tapply(). This allows us to obtain various data summaries on a
variable, but to have this broken down by another grouping variable. In this case we are
looking to obtain the mean of the variable win, which takes values of 0 or 1 so that the
mean of this gives us the observed proportion (i.e. actual probabilities) of horses that won.
Here we are using tapply to obtain this actual win proportion amongst horses that had
model probabilities in each of the intervals we defined earlier by prob.cat. We then assign
these actual win proportions to a new variable we choose to call prob.actual.
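A tiny illustration of tapply() with made-up values:
wins <- c(0,1,0,1,1,0)               # six binary outcomes
grp <- c("a","a","a","b","b","b")    # two groups of three
tapply(wins, grp, mean)              # group means: a = 0.333, b = 0.667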
The sixth line uses the seq() command to define a sequence of values that start from
0.005 and increase up to 0.995 in increments of 0.01. These reflect the mid-points of the
cut-point intervals we defined earlier. If our model is a good one then the actual win
proportion in each cut-point interval should be approximately equal to its mid-point. We
assign these mid-points to a variable we choose to call names.prob.cat.
The last few lines then plot the model probabilities against the actual win proportions. Since
these should be the same we add a line with a slope of 1 and intercept of zero to reflect
where the plotted points should lie if we have a good model. The resulting plot is shown
below.
We can also assess other aspects of our model using other parts of the output produced by
the summary(ourmodel) command. More of the results from that command are shown
below:
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7273  -0.4408  -0.3569  -0.3147   2.9718
MODEL PARAMETER ESTIMATES OMITTED HERE
Null deviance: 12240 on 22288 degrees of freedom
Residual deviance: 11854 on 22274 degrees of freedom
(358 observations deleted due to missingness)
AIC: 11884
Another part of the output we can use to decide whether our model is of any use is the
value of the Residual deviance (often simply referred to as the deviance). This is a
measure of how much error in our model remains unexplained despite having our predictors
in the model.
As a general rule of thumb where we have large amounts of data (as we do) the residual
deviance should be less than the residual degrees of freedom if our model is worth pursuing
further.
For our model the residual deviance = 11,854 and the residual degrees of freedom is
22,274 and so this provides support to suggest we have some value in our model.
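If you prefer not to read these numbers off the summary output, they can be pulled straight from the fitted model:
deviance(ourmodel)                            # residual deviance (11,854 here)
df.residual(ourmodel)                         # residual degrees of freedom (22,274 here)
deviance(ourmodel) < df.residual(ourmodel)    # TRUE supports the model having some value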
We can also use the residual deviance to compare different models, but this is a little more
complicated, so we only sketch it briefly below.
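For the curious, a minimal sketch of how such a comparison is typically made in R, using a chi-squared test on the change in deviance between a smaller model and our full one (note both models must be fitted to exactly the same rows for this to be valid):
smaller <- glm(win~sireSR.trunc+days+pos1+pos2,
               family=binomial, na.action=na.exclude)
anova(smaller, ourmodel, test="Chisq")   # a small p-value favours the larger model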
We can also use the AIC value at the bottom of the previous output (here AIC = 11,884) to
compare models. AIC stands for Akaike Information Criterion, and is a measure of the
goodness of fit of our model. It is not much use on its own, however, and is only useful
for comparing two competing models: the better model is the one with the lower AIC. If you
were to look at the AIC for all the models we fitted earlier, these all had a higher AIC
value than the one here. Hence our last model is our best.
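If you have kept earlier fitted model objects, their AIC values can be compared directly:
AIC(ourmodel)   # 11,884 for our final model
# with an earlier fit saved under a name of your choosing, say earliermodel:
# AIC(earliermodel, ourmodel)   # the model with the lower AIC is the better one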
Finally, the output above lists a summary of what are referred to as deviance residuals. A
deviance residual is calculated by R for each observation in the data set and the summary
shows the smallest (i.e. largest negative) and largest deviance residual, as well as the value
of the deviance residuals at the median and 1st and 3rd quartiles. These can be used to
indicate how much error in the model is contributed by each observation in the data set,
and hence to identify cases where our model fits badly. However, in logistic regression
models, since our outcome variable (win) is binary (0 or 1), these residuals are not that
useful and so we will not look at these any further here.
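Should you want to inspect them anyway, the deviance residuals can be extracted from the fitted model as follows:
dev.res <- residuals(ourmodel, type="deviance")   # one residual per observation used
summary(dev.res)   # reproduces the quartile summary shown above (plus the mean)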
1.14
Problems with Binary Logistic Regression Models in Race Modelling
We must not leave logistic regression without pointing out some of the problems this model
has in the context of modelling the win probabilities for horse racing. There are two key
problems as follows:
• There is nothing in the model to constrain the win probabilities so that they sum to
one across a race! (A crude renormalisation fix is sketched below.)
• The model makes no use of the information that exists in the structure of the data in
terms of finishing positions. That is, it makes no use of the fact that the 2nd placed
horse beat the 3rd placed horse, and so on.
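A crude fix for the first problem is to rescale the model probabilities so that they sum to one within each race. As a minimal sketch, assuming a hypothetical variable race.id identifying which race each row belongs to (and no missing predictions):
prob <- predict(ourmodel, type="response")
prob.norm <- prob/ave(prob, race.id, FUN=sum)   # each race's probabilities now sum to 1
tapply(prob.norm, race.id, sum)                 # check: should be 1 for every race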
Both of these can actually be improved upon by making use of multinomial logistic
regression, instead of binary logistic regression. This is actually the model that forms the
basis for papers such as those by Bolton and Chapman (1986) and also Chapman (1994)
and Benter (1994). However, we needed to understand binary logistic regression before we
could even begin to think about moving to this more complex model.
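As a flavour of where this leads (we do not fit this at the conference), one route in R is a conditional logit model, which is closely related to the multinomial logit used in those papers: each race is treated as a group and the model asks which runner in the group won. A minimal sketch using the clogit() function from the survival package, again with hypothetical mydata and race.id names:
library(survival)
cl.model <- clogit(win~sireSR.trunc+days+pos1+pos2+pos3+strata(race.id),
                   data=mydata)   # mydata and race.id are hypothetical names
summary(cl.model)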
Recommended Books on R
Introductory Statistics with R, Peter Dalgaard,
http://www.amazon.com/IntroductoryStatistics-R-Computing/dp/0387954759
A Handbook of Statistical Analyses Using R, Brian Everitt and Torsten Hothorn,
http://www.amazon.co.uk/Handbook-Statistical-Analyses-Using/dp/1584885394
Free online version available – see below.
Online Resources for R
A free online version of A Handbook of Statistical Analyses Using R, Brian Everitt and
Torsten Hothorn, is available at
http://cran.r-project.org/web/packages/HSAUR/vignettes/
The logistic regression chapter is at
http://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
Recommended papers for further reading
Bolton, R. N. and Chapman, R. G. (1986), Searching for Positive Returns at the Track: A
Multinomial Logit Model for Handicapping Horse Races, Management Science, 32 (8),
pp 1040-1060. http://www.ruthnbolton.com/Publications/Track.pdf
Chapman, R. G. (1994), Still Searching for Positive Returns at the Track: Empirical Results
from 2,000 Hong Kong Races, in Efficiency of Racetrack Betting Markets, eds. Hausch, Lo
and Ziemba, pp 173-181. Reprinted 2008.
Benter, W. (1994), Computer-based Horse Race Handicapping and Wagering Systems, in
Efficiency of Racetrack Betting Markets, eds. Hausch, Lo and Ziemba, pp 183-198.
Reprinted 2008.