Smartersig Conference September 2012
Introduction to Betting Data Modelling with R
Dr Alun Owen

Part 1: Using R, Managing Data Files, Initial Data Exploration and a First Look at Logistic Regression

1.1 Preamble and Pre-Conference Work!

At the conference we will not actually work through the material contained on pages 1 to 15, as we will assume that you have worked through it before you arrive. Hence these pages have been supplied to you in advance of the conference. Can you therefore please work through this material prior to arriving at the conference? If there are any aspects of the material on pages 1 to 15 that you would like some clarification on, please email me (Alun Owen) at [email protected].

Instructions for downloading and installing R should already have been emailed to you, or if not can be obtained from Mark. We will be using a data set here called aiplus2004.csv, which is available in the utilities section of the Smartersig website or again can be obtained from Mark.

When you arrive at the conference, you will be given a longer document that includes the material here but extends this work from page 16 onwards. We will kick off the first R session with a 5 minute résumé of pages 1 to 15, but will then go almost straight into page 16 and move onwards from there.

1.2 Using R

R can actually be used in a number of ways. It can be used interactively, by typing commands in the R Console window (after the > prompt). Alternatively, R can be used in "batch" mode, by typing a series of commands in an external script (text) file and then running some or all of the commands in that file at once. This option opens up the real power of the software.

1.3 Using R Interactively

To submit commands to R interactively, simply type the relevant command (after the > prompt) in the R Console window and hit the enter key. As an example, type in the following commands (hitting the enter key after each line):

x <- c(6,7.8,7.7,10.8,6.4,5.7,6.6,12.4)
y <- c(1,2,3,4,5,6,7,8)
plot(x,y)

The <- symbol in the first two lines above is used to assign values to "objects" and can be thought of as representing "=". In the first line above, we assigned values to a column of data named x. These values were combined together using the c() function. These columns of data are usually referred to as "variables". So in the second line above, we combined a list of values together and assigned these to another variable we named y.

The plot(x,y) command opens up an additional (graphics) window and displays a simple scatter plot of y against x, with y on the vertical axis and x on the horizontal axis.

You can see what x and y look like simply by typing their names in R (again hitting the enter key after each one) as follows:

x
y

If we wanted to assign a single value to a single variable (i.e. not a column of data), for example you may have a starting betting bank of £1,000, then we could assign the value 1,000 to a variable we will call bank as follows:

bank <- 1000

If we wanted to increase the value of bank by, say, £500, you could type:

bank <- bank+500

To see that the value of the variable bank has increased to £1,500, try typing:

bank

At this point it may be worth mentioning that R provides a mechanism for recalling, editing and re-executing previous commands. The vertical arrow keys on the keyboard can usually be used to scroll forward and backward through a command history. Once a command is located in this way, the cursor can be moved within the command using the horizontal arrow keys, so that you can edit the command if required and then hit the enter key to re-execute the revised command. Try this to see what x looks like again.
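While you are experimenting, note that arithmetic in R operates on whole vectors at once, element by element. The short sketch below is an aside (these commands are not part of the original handout) illustrating this with the x variable created above:

x + 100   # adds 100 to every value in x
x * 2     # doubles every value in x
mean(x)   # the average of all the values in x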
Now, the variables x and y and the variable bank are all what R refers to as objects. We can take a look at what variables (objects) we have created during this R session by typing the following command (followed by the enter key, remember!):

ls()

Note this is a lower case "L" and a lower case "S", as in "lima" "sierra"! This lists all the objects we have created in the current R session. You should see bank, x and y listed on the screen in alphabetical order.

Next close R down so we can start again. Do this either by clicking on File > Exit, or by clicking the usual X in the top-right of the R screen. Note that just clicking the X in the R console window will only close the console window and not close R itself. When you try to close R you will probably see a dialogue box asking whether to save the workspace image. Answer "No" to this, as we do not need to save the session or any session history as such – we will discuss this in more detail at the conference.

1.4 Using R in Batch Mode and Writing R Code

Instead of typing commands and executing these one at a time in interactive mode, we can write a series of commands in an R script file, which is simply a text file holding our R commands. This is useful if we want to save our work and re-use it later, or to develop models written using R code that we can run repeatedly. To create a new blank R script file, start R again and click on File > New Script. This will open up a new window into which you can write your R commands. In the next section we will enter our R commands into this script window and run them from there, rather than typing them directly into the R Console window, so keep your R session open.

1.5 Using External Data Files (and Folders) with R

So far we have entered our data by hand via the keyboard, but what about reading data in from an external source, such as an Excel spreadsheet, comma separated (.csv) file or even a text file? In fact the data we entered for x and y above come from the first few rows of the comma separated file named aiplus2004.csv. This relates to horse racing and contains data on over 22,000 horses. This data set is available in the utilities section of the Smartersig website (or can be obtained from Mark). It contains six columns of data as follows:

position3 - finishing position three races ago (1, 2, 3 or 4; 0 = anywhere else)
position2 - finishing position two races ago (1, 2, 3 or 4; 0 = anywhere else)
position1 - finishing position in the previous race (1, 2, 3 or 4; 0 = anywhere else)
days      - days since last race
sireSR    - win rate achieved by the horse's sire with its offspring prior to this race
position  - finishing position in this race

Each row refers to a single horse in a single race. We will look at how to read this data into R.

First you need to create a new folder somewhere in the directory structure on your own PC. You will use this to hold the data file aiplus2004.csv, and also to save your work during the conference and during any further work you do in this pre-conference session. For example you might create a folder called "C:\R Stuff". Once you have created your new folder, copy or move the file named aiplus2004.csv to this new folder, noting the directory reference of where this folder is located on your PC, for example "C:\R Stuff".
When using R, we need to direct R to point to this directory. To do this we will set the Working Directory we want R to use. In your R script file window (i.e. NOT the R console window we have used so far!) type the following command:

setwd("C:/R Stuff")

noting the use of the forward slash instead of the usual back slash! Also replace C:/R Stuff with whatever directory you are using, if this is different. To then execute this command in R, highlight the command in your script file (in the usual way using your mouse) and then click on Edit > Run line or selection. Or you can use the short-cut, which is to hold down the CTRL and R keys together (with your command highlighted). If this has been successful, all you will see in the console window is that same command in red, having been run:

> setwd("C:/R Stuff")

If this was not successful, then you will probably see the following statement:

> setwd("C:/R Stuff")
Error in setwd("C:/R Stuff") : cannot change working directory

The most likely reason is that you haven't stated the name of the directory correctly, or you may have used a back slash in the command when you need to use a forward slash.

This section is optional! Another approach you could use instead of setting the working directory is to put the full directory location of the source file in the actual read.csv command. For example:

horse.data<-read.csv("C:/R Stuff/aiplus2004.csv")

However, the advantage of using the setwd() command approach is that for the current R session, R will always look in that directory for data and other files, and will save your work to this directory also, keeping things tidy for us!

There is also a third approach you could use instead of setting the working directory. In this case we modify the properties associated with the short-cut icon on the desktop we use to launch R, to change the default directory where R points to. I'm not suggesting you use this approach, but for information: right click, using your mouse, on the R short-cut icon on your desktop to display the properties dialogue box. On the Shortcut tab, type the name of your new directory in the "Start in:" field (if you have created a directory with a different name and/or in a different location, use that instead). Click Apply and OK to save this change.

Once you are happy that the setwd() command has been run successfully, type the following command into your script window:

horse.data<-read.csv("aiplus2004.csv")

Then run it as before by highlighting the command and then clicking on Edit > Run line or selection (or using CTRL and R). The read.csv command actually reads data in a table-like format and from this creates what R refers to as a data frame. We have chosen to call this data frame horse.data. You can think of a data frame as being a data set that contains a number of columns of data. In this case, the data frame horse.data contains the six columns of data from aiplus2004.csv.

There are other variants of the read function, should you need them. For example read.table(), which can be used if your data has previously been saved in ASCII (flat file) format, such as that created by Windows NotePad. There is also read.csv2() for when the data are separated by semi-colons as opposed to commas.
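For illustration, typical usage of these variants might look as follows. This is just a sketch, and the file names here are hypothetical (they are not files supplied with the conference material):

my.data <- read.table("somefile.txt", header=TRUE)  # space/tab-separated flat file; header=TRUE uses row 1 as column names
my.data <- read.csv2("somefile.csv")                # semi-colon separated values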
Okay, let's see what our data in the horse.data data frame looks like. To do this type the following command into your script file and then run it (Edit > Run line or selection, or use CTRL and R):

horse.data

You may see that only the first 16,666 rows (of the 22,000+ in the data frame) are displayed! This is simply the default maximum number of rows that will be printed, which can be altered if required. To view just, say, the first 20 rows of the data frame horse.data, copy and paste the last command you entered (horse.data) onto a new line further down in your script file and edit it so that it reads as follows:

horse.data[1:20,]

(making sure you use square brackets, and include the comma!) Run this command (Edit > Run line or selection, or use CTRL and R) and it will show rows 1 to 20, but all columns, of horse.data. You should therefore see the output shown below (note that this has been edited here to save space):

   position3 position2 position1 days sireSR position
1          1         4         2   60    6.0        1
2          3         1         3   13    7.8        2
3          0         4         2   43    7.7        3
.          .         .         .    .      .        .
18        NA         1         0  566   10.8       18
19         0         2         0   20    7.5       19
20         0         0         4    9    4.8       20

Note that the file aiplus2004.csv had column names in the first row, and so by default R has used these as the column names for the data frame that it created. The first number that appears in each row is simply a row number reference that has been printed by R and is not actually stored as a column in the data frame. Note also that the column position3 has a value of NA in row 18. This is how R represents missing values, and indicates that the value is Not Available. In the original data set this value would have simply been missing.

You may have noticed that the first 8 rows of sireSR and position are actually the same as those that we typed in manually for the variables x and y previously. To see just the first 8 rows of data for sireSR and position (i.e. columns 5 and 6), copy and edit the last line in your R script file so that it reads:

horse.data[1:8,5:6]

and then run this (Edit > Run line or selection, or use CTRL and R).

Two ways of viewing, say, the last column in the data frame are to use:

horse.data[,6]
horse.data$position

Try adding these to your R script file and then running them. In both cases you should see that the R console window has been filled with all the 22,000+ values in the column named position from the data frame horse.data.

See if you can now view the data in the column of our data set named position by adding the following command to your R script file and then running it:

position

You should find that R cannot find this column. This is because at the moment the columns in the data frame horse.data are not yet visible to R as objects in their own right. They are simply columns of the data frame object called horse.data. We can use the attach() function to make the columns of the data frame "visible to R". In your script file add the following command and then run it:

attach(horse.data)

Now see if you can view the column named position by running the following command again from your script file:

position

You should now see that this is visible to R! Note that we need to take care with the attach() function. Whilst this has made the columns of the data frame horse.data visible to R, it does not create them as individual variables or objects. Note also that if we already had an object named position, the attach() function would not overwrite that old one with the new one as you might expect it to!
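Given those pitfalls, it is worth knowing two attach()-free ways of reaching a column. The commands below are an aside (not part of the original handout), but both use standard R:

horse.data$position                # the $ operator we met above
with(horse.data, table(position))  # with() evaluates a command "inside" the data frame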
1.6 Saving and Opening R Script Files

Before we go any further, let's save our script file. To do this, make sure your script file is the "active window" in R – you can make sure of this by clicking anywhere in the script file window. Then click on File > Save as and save it using a suitable name, but you will also need to include a .R file extension at the end of the file name so that R knows this to be an R script file. Once you have saved your script file, close R down, again answering "No" when asked whether to save the workspace image.

Now re-start R and click on File > Open script, and locate the script file you saved above to open it again. If you can't locate it and you are sure you are looking in the right place, you may have forgotten to add the .R file extension in the name of your script file when you saved it. Since R by default only looks for script files with a .R file extension, this may be why it isn't listed. You can check this by changing the "Files of type" option to all files.

Okay, we will now assume you have managed to locate your R script file.

1.7 Using R to Obtain Simple Data Summaries

Throughout the rest of this document I will assume that all the R commands you run will be first typed into your R script file and then run in batch mode (using Edit > Run line or selection, or CTRL and R). Assuming you have now started a new R session and have managed to open up your saved R script file, run the following commands from that script file to make sure we have set our working directory, read in our aiplus data set as a data frame called horse.data, and used the attach() command so that all the columns of horse.data are visible to R:

setwd("C:/R Stuff")
(using a forward slash and replacing C:/R Stuff as required)
horse.data<-read.csv("aiplus2004.csv")
attach(horse.data)

During any data analysis or model development work, you should start with an examination of your data. Let's start by obtaining a frequency table of the data in the column named position in our data set by typing the following:

table(position)

You should see the following output:

position
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
1782 1778 1780 1774 1770 1737 1675 1597 1468 1333 1178 1033  870  722  577  461
  17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32
 362  269  193  115   36   28   22   22   16   14   10    8    6    4    3    2
  33   34
   1    1

This shows us that out of the 22,000+ rows of data (i.e. horses), we have a horse finishing 1st on 1,782 occasions, 2nd on 1,778 occasions and 3rd on 1,780 occasions, etc., and indicates that we have approximately 1,780 races in our data set.

Let's now obtain a frequency table for another of our variables, days, by typing:

table(days)

This time the table you will see is rather large and not quite so helpful. This is because there are so many different possible values for the number of days since a horse last ran. Another way of summarising the data on days is to produce a histogram (i.e. a bar chart for continuous data). To obtain this type:

hist(days)

The histogram, which will appear in the separate graphics window, should look like that shown overleaf.

[Figure: "Histogram of days" – Frequency (0 to 20,000) against days (0 to 1,500+)]

This chart is also not that helpful, since most horses previously ran within approximately 100 days of their last race. The fact that some horses last ran more than 100 days ago, and some even up to 1,673 days ago, has significantly increased the scale on the horizontal axis and made it difficult to see the detail in the bulk of the data below 100 days. We may therefore want to focus some attention on the reduced set of horses which ran within 100 days.
To do this, create a new variable by typing:

days.reduced <- days[days<=100]

This creates a new variable named days.reduced, which has been assigned only those values from the variable days where days is less than or equal to 100. Note the use of the square brackets, which enclose the selection criterion. Let's now view the histogram of this new variable by typing:

hist(days.reduced)

You should then see the following chart:

[Figure: "Histogram of days.reduced" – Frequency (0 to 4,000) against days.reduced (0 to 100)]

Exercise 1

Use R to produce a frequency table and histogram of the data on sireSR. (If you need help, the R commands required are shown over the page!) You should again find that the frequency table is very large and not that helpful, since sireSR is often given to one decimal place and so the range of possible values is much greater. You should also find that the histogram shows that the values of sireSR vary between 0% and 100% (of course), but that the majority of horses had a sireSR of less than 20%. See if you can therefore create a new variable called sireSR.reduced which contains only the data where sireSR is less than or equal to 20%, and then produce a histogram for this new reduced data set. Your histogram should look as follows:

[Figure: "Histogram of sireSR.reduced" – Frequency (0 to 4,000) against sireSR.reduced (0 to 20)]

Note that this plot reveals an interesting feature of the data, in that there are around 885 horses that had a sireSR of zero. You should investigate why there are so many zeros, and whether a zero indicates that the horse's sire truly has never won a race, or perhaps that the sire has never entered a race, or whether the sireSR is simply not known.

R commands for Exercise 1:

table(sireSR)
hist(sireSR)
sireSR.reduced<-sireSR[sireSR<=20]
hist(sireSR.reduced)

End of Exercise 1!

With histograms, it is possible to control things like the width of the bars (class intervals), title, axis labels etc. To get more help with how to do extra things like this with a function such as hist(), click on the "Help" menu and then select "R functions (text)". In the dialogue box that appears, type hist and then click OK to get more help with that function. Another useful part of the help facility is the document available there called "An Introduction to R", which can be accessed from the Help menu. That document also has a useful sample R session as an Appendix.

Okay, back to our session here on Simple Data Summaries. We will now look at some useful summary statistics of the days variable, by using the command:

summary(days)

Try that command and you should see the following output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    11.0    17.0    38.3    31.0  1673.0

Note that we have undertaken this analysis on the full set of data on days, and not the reduced set! This shows us that the number of days since each horse last raced varied from 0 (zero!) to 1,673 (4.5 years!). Again we should consider checking that all of these represent sensible and valid values. The output above also shows us that the average (mean) was 38.3 days. In fact the data for days (and days.reduced) had a few very large values, which will have inflated the mean to 38.3. In these circumstances the median ("middle" value) of 17 days will more likely give a fairer reflection of the average. The 1st Qu. and 3rd Qu. values are what are referred to as the first and third quartiles respectively. If all the data values are listed in order, these are the values a quarter of the way along the data and three-quarters of the way along the data. Hence they indicate the range of values in the middle 50% of the data, which in this case is from 11 days to 31 days. This is also known as the Inter-Quartile Range.
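As an aside (these commands are not part of the original handout), the quartiles and the inter-quartile range can also be requested directly:

quantile(days, probs=c(0.25,0.5,0.75))   # 1st quartile, median and 3rd quartile
IQR(days)                                # inter-quartile range: 31 - 11 = 20 days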
Exercise 2

Obtain the summary statistics on the full sireSR data (i.e. not the reduced set).

End of Exercise 2!

Back again to our session on Simple Data Summaries. So far we haven't looked at the data in the columns named position1, position2 and position3. Since these contain only a few possible values each (0, 1, 2, 3 or 4), we can usefully produce frequency tables for each of these. For example, for position1 try the command:

table(position1)

Your results should be the same as those shown below:

position1
    0     1     2     3     4
13841  2529  2087  2114  2076

This shows, for example, that in our data set we had 2,529 horses who had won (i.e. finished 1st!) in their previous race, and 2,087 horses who had finished second in their previous race. This also shows that there are 13,841 horses in the data set which did not finish in the top four in their previous race.

Exercise 3

Obtain frequency tables for position2 and position3. You should obtain the results shown below:

position2
    0     1     2     3     4
13394  2539  2254  2183  2182

position3
    0     1     2     3     4
13147  2523  2249  2198  2172

This shows that there are 2,539 horses who won their race two races previously, etc. It also shows that there are 13,394 horses who did not finish in the top four in their race two races previously, etc.

End of Exercise 3!

Okay, this is the end of the pre-conference work on R. Make sure you save your R script file, as we will start the conference by opening this up and running some of the commands contained in it. If you have any questions email me (Alun Owen) at [email protected]. I look forward to seeing you at the conference.

Start of Conference Work!

1.8 Identifying Potential Information to Develop a Model of Race Outcomes

This is where we start the conference work proper, and we will assume that you have completed the pre-conference work in Sections 1.1 to 1.7, which was provided prior to the start of the conference.

Having previously undertaken an initial examination of our data, we come to the point where we want to see whether any of the information contained in our data can help us with the development of a model of race outcomes. A common approach to modelling any type of race outcome, whether it's horses, cars, motorcycles or people, is to develop models that allow us to assess how various factors impact on the chances, or probability, of any competitor winning. In these circumstances, we would need historical data that indicates whether each competitor in each race won or not. In this case we are modelling a variable which can take only two values, win/lose, and this type of variable is often called a binary variable. A model we can use for this type of binary outcome is the binary logistic regression model. Binary logistic regression models are what we will focus on over the rest of the conference.

Note that a more sophisticated model that we could use instead is the multinomial logistic regression model. This also allows us to model the probability that a horse will win, but at the same time allows us to model the probability a horse will be placed second, third etc.
However, before we can even begin to think about understanding and implementing multinomial logistic models, we need to understand how to develop and implement the simpler binary logistic regression model – hence this is what we will look at during this conference. Perhaps in a follow-up conference we can cover multinomial models?

Now, the information on the outcome of each race is contained in the variable named position, which indicates the position the horse finished in each race in the data set. Therefore, we first need to create a new variable which indicates whether the horse in each row of our data set won that particular race or not. To do this type the following command in your R script file and run it:

win<-ifelse(position==1,1,0)

This uses the ifelse function in R. The logic should be reasonably clear: when position is equal to 1 (i.e. the horse won), this assigns the value 1 to our new variable called win, and where position takes any other value (i.e. the horse did not win), it assigns a value of zero to win. We can therefore use this new variable win as an indicator of which horses won (win = 1) and which horses did not win (win = 0).

We can use this to investigate how the probability of winning a race is related to the other variables we have historical information on, such as the horse's finishing position in each of its previous three races (position1, position2 and position3), its sire strike rate (sireSR) and the number of days since the horse last ran (days).

First we consider position1. Since this variable only covers a limited number of values (0, 1, 2, 3 or 4), we will form a two-way table (contingency table) of win against position1. This will show how often a horse won, given its finishing position in its previous race. To do this type:

table(position1,win)

This gives the two-way table of win against position1 as shown overleaf:

         win
position1     0    1
        0 13066  775
        1  2174  355
        2  1838  249
        3  1887  227
        4  1900  176

This two-way table indicates, for example, that when a horse won its previous race (i.e. position1 = 1), the horse then went on to win its next race on 355 occasions, but didn't go on to win its next race on 2,174 occasions. However, we would like to have the table in a format that indicates the proportion of the time the horse went on to win its next race. For example, when a horse won its previous race (i.e. position1 = 1), the proportion of the time it won its next race was 355/(355+2,174) = 0.1403717, or 14%. We can obtain this proportion easily for each of the values of the variable position1 by typing:

position1.win<-table(position1,win)
100*prop.table(position1.win,1)

The first line above creates a new object called position1.win, which is in fact the two-way table shown above. The second line works out the proportions in the cells of the two-way table position1.win. The value of 1 used after the comma at the end of the second command line means that the proportions are based on the row totals, rather than the column totals (if we wanted to base the proportions on the column totals this value would need to be a 2). This therefore gives us the proportion of wins for each value of position1; in this way the proportions in a row add up to 1. We multiply by 100 in the second command line so that the proportions are out of 100 and hence represent percentages.
The output you see should look as follows:

         win
position1         0         1
        0 94.400694  5.599306
        1 85.962831 14.037169
        2 88.068999 11.931001
        3 89.262062 10.737938
        4 91.522158  8.477842

This shows that when a horse finished first in its last race (indicated by position1 = 1) it went on to win 14.0% of the time. When a horse finished second in its last race it went on to win 11.9% of the time. When a horse finished third in its last race it went on to win 10.7% of the time. When a horse finished fourth in its last race it went on to win 8.5% of the time. When a horse was placed elsewhere in its last race it went on to win 5.6% of the time. This suggests that the better a horse did last time out, the more often it won its next race. Hence the finishing position in a horse's previous race does indeed provide some potentially useful information on which to build a model for determining future race probabilities of a win.

Note that it would be helpful to round the values above to, say, 1 decimal place. We can do this by modifying the line of code:

100*prop.table(position1.win,1)

to:

round(100*prop.table(position1.win,1),1)

The output in this case looks as follows:

         win
position1    0    1
        0 94.4  5.6
        1 86.0 14.0
        2 88.1 11.9
        3 89.3 10.7
        4 91.5  8.5

Next we will investigate the relationship between the probability of winning and the data on sireSR and days. However, we have to deal with these two variables in a slightly different way, since they both contain a wide range of possible values. One useful thing would be to investigate whether the proportion of horses winning a race increases or decreases in some way with increasing values of sireSR and/or days.

There is a very useful plot function in R called cdplot(). This actually produces what is called a Conditional Density plot, but we won't worry too much about what that means. To produce this type of plot, illustrating how win proportion depends on sireSR, we could try the following commands in R (try running these!):

win.f<-as.factor(win)
cdplot(win.f~sireSR)

I'll explain at the conference what these are doing. The main things to note here are that we need to declare the win variable as a factor, and so we create a new variable called win.f for this. In the cdplot() command above, the ~ character (which we refer to as "twiddles") in the statement win.f~sireSR is short for "depends on". This indicates that in our plot we want win.f to depend on sireSR.

Back to the plot this produces! First we note that these commands will probably produce a "rubbish" plot! The reason for this is that we are asking R to estimate how win proportion depends on sireSR, but R is having problems for larger values of sireSR. This is because there are not enough horses with larger values of sireSR (i.e. above 20%, if you recall the earlier work we did!) to estimate things well enough. So instead we will look at the relationship between win proportion and sireSR, but only where sireSR is less than or equal to 20%. We can do this by modifying the above commands to (try running these):

win.reduced<-as.factor(win[sireSR<=20])
sireSR.reduced<-sireSR[sireSR<=20]
cdplot(win.reduced~sireSR.reduced)

You should see a plot which shows the estimated proportion of horses not winning (i.e. where win.reduced = 0) and winning (i.e. where win.reduced = 1) as the Sire Strike Rate increases from zero to 20%.
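A rougher way to cross-check what the conditional density plot is estimating (an aside, not part of the original handout) is to bin sireSR.reduced and look at the observed win percentage within each bin:

sire.bins <- cut(sireSR.reduced, breaks=seq(0,20,by=2), include.lowest=TRUE)  # 2%-wide bins; include.lowest keeps the zeros
round(100*tapply(win[sireSR<=20], sire.bins, mean), 1)                        # observed win % in each bin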
We can improve the plot further by plotting the win proportion underneath the lose proportion, by changing the cdplot() command to:

cdplot(win.reduced~sireSR.reduced,ylevels=2:1)

This plots the second level of our y variable (i.e. where win.reduced = 1) at the bottom of the plot, and then the first level (i.e. where win.reduced = 0) above it.

We can now add the labels "Won" and "Did not Win" instead of the default 1 and 0 by modifying the cdplot() command to:

cdplot(win.reduced~sireSR.reduced,ylevels=2:1,yaxlabels=c("Won","Did not Win"))

Finally, we should amend the scale on the y axis so that it stops, say, at 25%. We can do this by modifying the cdplot() command to:

cdplot(win.reduced~sireSR.reduced,ylevels=2:1,yaxlabels=c("Won","Did not Win"),ylim=c(0,0.25))

So what does this tell us? Well, as the sireSR increases from about 3% up to about 12%, the proportion of horses winning increases. Where sireSR is below 3%, there is uncertainty due to the unusual increase (spike) in win proportion. Above 12% the win proportion looks to vary more, but on average appears to be roughly levelling out. Above 15% there is also greater uncertainty, due to the sudden drop in win proportion, and this variability actually continues beyond that point. This is due to the lack of data beyond a strike rate of 15%.

Okay, now have a go at Exercises 4 and 5 (the first ones of the conference!).

Exercise 4 (first exercise of the conference!)

The solution and commands, should you need them, are on the next page! Referring back to the work we did earlier (pages 17 and 18), obtain a two-way table for win against position2, rounded to 1 d.p. if you can! Note that perhaps the easiest way to do this is to copy and paste the two commands below that we used earlier in relation to position1:

position1.win<-table(position1,win)
round(100*prop.table(position1.win,1),1)

but amend these to relate to position2. Using your results, comment on whether you think position2 offers additional good information in relation to predicting future race outcomes. Then do the same for position3.

Exercise 5

The solution and commands, should you need them, are on the next page! Produce a suitable plot of win proportion against the number of days since the horse last ran (i.e. the days variable). You could start by trying the following command:

cdplot(win.f~days)

But this will also give you a rubbish plot. Do you know why? Make suitable amendments to produce a more suitable plot. Hint: if you think back to our earlier work on days since last run, this was below 100 for most horses and only a few had values above this.

Solution to Exercise 4

For position2 the commands required for this exercise are:

position2.win<-table(position2,win)
round(100*prop.table(position2.win,1),1)

from which you should obtain the following results:

         win
position2    0    1
        0 93.5  6.5
        1 88.8 11.2
        2 89.1 10.9
        3 91.0  9.0
        4 91.7  8.3

Again this indicates that the data on position2 also provides potentially useful information on which to build a model for determining future race probabilities of a win.
Repeating this for position3, the commands required are:

position3.win<-table(position3,win)
round(100*prop.table(position3.win,1),1)

and you should obtain the results shown below:

         win
position3    0    1
        0 93.1  6.9
        1 89.8 10.2
        2 90.0 10.0
        3 91.2  8.8
        4 92.3  7.7

Again this suggests that the data on position3 also provides potentially useful information on which to build a model for determining future race probabilities of a win.

Solution to Exercise 5

The R commands are:

win.reduced<-as.factor(win[days<=100])
days.reduced<-days[days<=100]
cdplot(win.reduced~days.reduced,ylevels=2:1,yaxlabels=c("Won","Did not Win"),ylim=c(0,0.15))

The resulting plot is shown overleaf (note the y axis was restricted to a maximum of 15% to see the detail). This plot contains some hint that the win proportion reduces as the number of days since the horse last ran increases. However, beyond 40 days there is much more variation in the plot and hence greater uncertainty. The variation above around 40 days is again due to the lack of data above that point.

End of Exercises 4 and 5!

Okay, to summarise so far: we have looked at the relationship between the win percentage and the other (predictor) variables and found that there is some suggestion of an increase in the win proportion with values of position1, position2 and position3 nearer 1st place. We have also seen an increase in the win proportion with higher values of sireSR, but only where the sireSR was less than or equal to 12%. Above this figure the win proportion seems to perhaps level out on average somewhat, and beyond 15% there is not enough data to be totally sure about the impact of increasing sireSR. One option is to build into any model the notion that win probability increases with sireSR up to 12%, but that a sireSR above 12% has the same impact as a sireSR of 12%. Finally, we have seen that a decrease in the win proportion seems to perhaps be associated with higher values of days, but again this applies only when the days since last run varies from 3 to around 40. Above 40 there appears to be less of an indication of an impact, and also there is too much variation to make any clear judgments.

So we have explored our data – let's fit the model!

1.9 Fitting the Binary Logistic Regression Model

A binary logistic regression model would appear to be a suitable model worth investigating here for a number of reasons. Firstly, our outcome (win) is a binary variable, in the sense that there are only two possible values, 1 (if the horse won) or 0 (if it didn't). Secondly, we require as an output from our model the probability of a horse winning a race, and the win probability is a natural output from such a logistic model, since the reciprocal of the win probability gives us the "fair" odds of winning, which we can compare with the odds offered by bookmakers to facilitate betting decisions.

A typical binary logistic regression model can be specified algebraically as follows:

log(p/(1-p)) = β0 + β1x1 + β2x2 + … + βpxp

Here p is the probability of the outcome that we are interested in, which in our case is a win. The ratio p/(1-p) is therefore the probability of a win divided by the probability of not winning, and is therefore the odds. Hence log(p/(1-p)) is the log-odds (or log of the odds), which is where the name logistic regression comes from! Note the word "log" refers to the logarithm function, which we will explain in more detail shortly. The terms x1, x2, …, xp represent the predictor variables, which in our case would be sireSR, days, position1, position2 and position3.
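Before moving on, it may help to see the probability, odds and log-odds scales side by side. The following is a minimal sketch (not part of the original handout; the value p = 0.10 is purely illustrative):

p <- 0.10          # an illustrative win probability
odds <- p/(1-p)    # the odds: 0.111...
log(odds)          # the log-odds: about -2.197
plogis(log(odds))  # plogis() inverts the log-odds, recovering p = 0.10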
Note that in our case the variables sireSR and days are continuous variables, since they are measured on a continuous scale. However, the variables position1, position2 and position3 should really be treated as categorical variables, since the position a horse finished in a previous race falls into one of five categories (first, second, third, fourth or unplaced). We will discuss this issue again later and during the conference.

The terms β1, β2, …, βp in the model represent what we refer to as parameters, which we let the software estimate to give us the best fitting model. We multiply the values of the predictors (x1, x2, …, xp) by these parameters (β1, β2, …, βp) in the model. Note that the term β0 represents a constant parameter that accounts for when the values of all the predictor variables are zero.

So what does this model look like? Well, let's first consider the simplest model, where we have just one continuous predictor, x, say. The binary logistic regression model in this case is:

log(p/(1-p)) = β0 + β1x

Now, it turns out (if we use some mathematical algebra) that this model can be re-stated as follows:

p = exp(β0 + β1x) / (1 + exp(β0 + β1x))

This is the equation of a typically "s-shaped" curve, an example of which is shown below. So how does this relate to what we have? Consider again our cdplot of win proportions against sireSR. Can you see how the model above does bear some resemblance to our cdplot?

So how can we use R to find the equation of the (logistic) curve that best fits our data? Well, we first state our simple model as follows:

log(p/(1-p)) = β0 + β1 × sireSR.trunc

The parameters β0 and β1 will be estimated by R using a model fitting technique called maximum likelihood estimation (no details of maximum likelihood estimation are discussed here). We can fit the model by typing the following commands:

win<-ifelse(position==1,1,0)
sireSR.trunc<-ifelse(sireSR<=12,sireSR,12)
glm(win~sireSR.trunc,family=binomial)

The first line of code above has already been run, but is repeated here for completeness. The second line allows us to use the data for all horses in the model, but where a horse's sireSR is above 12% this is truncated (reduced) down to 12. This is different to sireSR.reduced – can you see how?

The third line uses the glm command to fit the model, since the binary logistic regression model is actually an example of what is called a Generalized Linear Model (GLM). In the glm() command, the ~ character ("twiddles") in the statement win~sireSR.trunc is again short for "depends on". This indicates that in our model we want win to depend on sireSR.trunc. The statement family=binomial tells R that we wish the output (dependent) variable (win) to be treated as binary (and so following the binomial probability distribution).

Some of the output from running the above glm() command in R is:

Coefficients:
 (Intercept)  sireSR.trunc
    -2.84005       0.04847

In this output, the coefficient listed under (Intercept) is -2.84005, and this is the estimate of β0 in our model. The coefficient 0.04847 listed under sireSR.trunc is the estimate of β1 in our model, which is the parameter we multiply sireSR.trunc by. Therefore the best fitting model for our data is:

log(p/(1-p)) = -2.84005 + 0.04847 × sireSR.trunc

Recall that this is the equation of a curve, and so we can plot this curve on top of our data to see how well it corresponds to our data. This plot looks as shown overleaf. Note that the model ends where the sireSR reaches 12%. Above this level we assume the predicted win proportion is the same as that at 12%; hence the model would continue as a flat horizontal line where the sireSR is above 12%.
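As a quick numeric aside (not part of the original handout), the level of that flat line is just the model's predicted probability at sireSR.trunc = 12, which the inverse logit function plogis() gives directly:

plogis(-2.84005 + 0.04847*12)  # predicted win probability at the truncation point: about 0.095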
We can produce the above plot using the following commands in R:

ourmodel<-glm(win~sireSR.trunc,family=binomial)
win.f<-as.factor(win)
cdplot(win.f~sireSR.trunc,ylevels=2:1,yaxlabels=c("Won","Did not Win"),ylim=c(0,0.25))
x<-seq(from=0,to=12,by=0.01)
pred.sireSR<-data.frame(sireSR.trunc=x)
p<-predict(ourmodel,newdata=pred.sireSR,type="response")
lines(p~x,type="l")

The first line above uses the glm command to fit the model, but this time the results of the model fit are assigned to an object we choose to call ourmodel. The second and third lines produce the actual cdplot of the data, but note this time it is based on sireSR.trunc rather than sireSR.reduced. The fourth line creates a temporary variable which we choose to call x. This uses the seq command to create a very long column of values from 0 to 12, increasing in increments of 0.01. We will use these values to calculate "predicted" probabilities from our fitted model for each value of x. The fifth line defines a data frame which has just a single column called sireSR.trunc, containing x, the long sequence of values we obtained in the line above. Note that in this data frame we must name the column sireSR.trunc, since this is the name of the sireSR variable in our model. The sixth line uses the predict() command to extract predicted values using our model, but rather than using the actual data we use new data, which we declare to be the data frame containing the sequence of values from 0 to 12 in increments of 0.01. The type="response" requests that the predictions are returned as probabilities. These predicted probabilities are then assigned to a new variable we call p. The last (seventh) line uses the lines() command to add a line to the plot. The lines command adds to the current plot p plotted against all the many values of x, such that all the points are joined by very small straight lines to give the appearance of a curve.

1.10 Forecasting and Betting with the Binary Logistic Regression Model

So how do we then use our model for betting? Suppose in a future race a horse running has a sireSR value of 10%. The estimated log-odds from the model are then given by:

log-odds = log(p/(1-p)) = -2.84005 + 0.04847 × 10 = -2.35535

But how do we then find the odds from this? Well, we simply calculate the exponential of this, which is e^(-2.35535) and reads as "e raised to the power of minus 2.35535". This might look hard to calculate, but e is simply a special number in mathematics, like the number π. The number e is approximately 2.718, and most calculators, as well as R and Microsoft Excel, have the number e in their fixed memory. Hence they have the capacity to raise the number e to whatever power you require. We can calculate e^(-2.35535) by pressing the following typical keys on your calculator:

e^x  -2.35535  =

Or alternatively you can use the following command in R:

exp(-2.35535)

Either way you should get the answer to be 0.095. This means that fair odds of the horse winning would be 1 to 0.095, i.e. 10.5:1, or decimal odds of 11.5 if you include the stake. The probability of winning, which is p, can be derived from 1/odds (where the odds include the stake). Hence the probability of winning in this case is 1/11.5 = 0.087.
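As a quick check (an aside, not in the original handout), R can collapse the whole calculation into one step using plogis(), which converts log-odds straight to a probability:

log.odds <- -2.84005 + 0.04847*10  # the log-odds: -2.35535
exp(log.odds)                      # the odds: about 0.095
plogis(log.odds)                   # the win probability directly: about 0.087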
Hence, we have used the model to say that a horse with a sireSR value of 10% would win its next race with a probability of 0.087, and fair decimal odds would be 11.5.

We can obtain model-predicted probabilities and odds for five horses in a future race, with sireSR values of 4%, 8%, 10%, 15% and 20%, again using the predict command in R as follows:

ourmodel<-glm(win~sireSR.trunc,family=binomial)
new.sireSR<-c(4,8,10,12,12)
newrace.sireSR<-data.frame(sireSR.trunc=new.sireSR)
prob<-predict(ourmodel,newdata=newrace.sireSR,type="response")
odds<-1/prob
odds
prob

Note that the second line above creates a new column of data which we choose to call new.sireSR, containing the sireSR values for the new race. Note how the last two values have both been truncated from 15% and 20% down to 12%. Note also that in the third line we define a data frame we choose to call newrace.sireSR, but remember that we have to call the column in this data frame that contains the sireSR data sireSR.trunc, since that is the name of the variable we have used in our model. The fourth line uses the predict command to extract predicted values, this time using the new data, which we declare to be the data frame containing the sireSR values for the future race. The type="response" again requests that the predictions are returned as probabilities. The remaining lines then simply calculate the odds and ask R to display the probabilities and odds. The output from running these commands is shown below. Note how the third set of values agrees with the "fair" odds and win probability calculated earlier!

> odds
       1        2        3        4        5
15.10018 12.61528 11.54222 10.56829 10.56829
> prob
         1          2          3          4          5
0.06622439 0.07926898 0.08663846 0.09462266 0.09462266

1.11 Model Assessment

Well, before we would use the model for predictive purposes, we need to consider whether sireSR is a good predictor of win probability. We can assess this using the summary command as follows:

ourmodel<-glm(win~sireSR.trunc,family=binomial)
summary(ourmodel)

Part of the output from this command is:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.840053   0.079872 -35.558  < 2e-16 ***
sireSR.trunc  0.048467   0.009523   5.089 3.59e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One of the first things we should look at in the above is the value of Pr(>|z|) for the sireSR.trunc predictor. We refer to this as the p-value, and if this value is < 0.05 then that predictor is considered to be a significant (good) predictor of win probability. The p-value for the sireSR.trunc predictor in this case is 3.59e-07, which is interpreted as 3.59 × 10^-7. This is simply given in standard form (remember your O level/GCSE maths?) and is therefore actually 0.000000359. Since this is < 0.05 we can conclude that sireSR.trunc is a significant (good) predictor of win probability!

Notice the asterisk notation in the output above:

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the output, the p-value of 3.59e-07 for sireSR.trunc is annotated with ***, which is another way of highlighting that the p-value is between 0 and 0.001.

Exercise 6

In this exercise you will extend the model so that it includes both sireSR and days as predictors. Note the solution is shown overleaf! Recall that the general form of the binary logistic regression model can be stated as:

log(p/(1-p)) = β0 + β1x1 + β2x2 + … + βpxp
We could consider the following model with sireSR and days as predictors:

log(p/(1-p)) = β0 + β1 × sireSR.trunc + β2 × days

You can fit this model in R by modifying the model fit command to:

ourmodel<-glm(win~sireSR.trunc+days,family=binomial)

Use the above command to fit the model, and then use the summary() command to obtain suitable information to assess whether days is also a significant (good) predictor of win probability. Use the output from your model to write down the equation of your model in the following log-odds terms (i.e. fill in the missing values indicated by a ?):

log-odds = log(p/(1-p)) = ? + ? × sireSR.trunc + ? × days

Use this to calculate the estimated log-odds for a horse winning a future race if it has a sireSR value of 10% and it has been 20 days since it last ran. Hence obtain the exponential of this to calculate the probability of this horse winning and fair odds for a bet on this horse.

Solution to Exercise 6

The R commands required are:

ourmodel<-glm(win~sireSR.trunc+days,family=binomial)
summary(ourmodel)

The output is:

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.7455900  0.0815090 -33.685  < 2e-16 ***
sireSR.trunc  0.0489606  0.0095547   5.124 2.99e-07 ***
days         -0.0029181  0.0005204  -5.607 2.05e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for days is 2.05e-08, i.e. 0.0000000205, which is < 0.05, and so days is a significant (good) predictor of win probability. This is similarly indicated by the fact that it is annotated in the output with ***. The log-odds equation of the model is:

log-odds = log(p/(1-p)) = -2.74559 + 0.0489606 × sireSR.trunc - 0.0029181 × days

The estimated log-odds for the horse in question are then given by:

log-odds = -2.74559 + 0.0489606 × 10 - 0.0029181 × 20 = -2.31435

We then calculate the exponential of this using R with the following command:

exp(-2.31435)

The answer to this is 0.099, and so fair odds of the horse winning would be 1 to 0.099, or in the usual terms 10.12 to 1 (obtained from 1/0.099), which gives fair decimal odds of 11.12 if you include the stake. The probability of winning can be derived from 1/odds, where the odds include the stake. Hence the probability of winning in this case is 1/11.12 = 0.0899.

1.12 Extending our Binary Logistic Regression Model

We now consider adding position1 as another predictor of win probability in our model. We could consider the following model with position1 included as a predictor:

log(p/(1-p)) = β0 + β1 × sireSR.trunc + β2 × days + β3 × position1

However, at the moment we are assuming that position1 is a continuous variable, i.e. that the values 0, 1, 2, 3 and 4 that position1 takes are meaningful on that scale. This makes some rather strong (and probably unrealistic) assumptions: that a change in position1 from 1 to 2 (i.e. the difference between a horse winning its last race and coming second in its last race) has the same effect on win probability as a change in position1 from 2 to 3 (i.e. the difference between a horse coming second in its last race and finishing third in its last race), and so on. A similar assumption would also be made if position2 and position3 were included in the model in this way. We can avoid making these strong assumptions by treating position1 (and, if relevant, position2 and position3) as a categorical variable (or factor) rather than as a continuous variable.
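Before re-stating the model, it may help to see concretely what "treating position1 as a factor" will mean in R. The short sketch below is an aside (not part of the original handout commands); it shows the levels R detects and the dummy (treatment) coding that glm() will build behind the scenes:

pos1 <- factor(position1)
levels(pos1)     # "0" "1" "2" "3" "4" - level "0" (unplaced) acts as the reference category
contrasts(pos1)  # the 0/1 dummy coding used for the remaining four levels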
If we treat position1 as a factor, then our model really needs to be re-stated as follows:

log(p/(1-p)) = β0 + β1 × sireSR.trunc + β2 × days
             + β11 × pos11 + β12 × pos12 + β13 × pos13 + β14 × pos14

This may at first look a little overwhelming, so let's look at it line by line. The first line is simply the constant (intercept) term β0, plus a term for sireSR.trunc multiplied by the estimated model parameter β1, followed by a term for the days variable multiplied by the estimated model parameter β2. The second line in the model above relates to the variable position1 as follows:

pos11 is a 'dummy' variable that = 1 if the value of position1 is 1; otherwise pos11 is zero;
pos12 is a 'dummy' variable that = 1 if the value of position1 is 2; otherwise pos12 is zero;
pos13 is a 'dummy' variable that = 1 if the value of position1 is 3; otherwise pos13 is zero;
pos14 is a 'dummy' variable that = 1 if the value of position1 is 4; otherwise pos14 is zero.

So, for example, if a horse finished second in its previous race, it would have pos11 = 0, pos12 = 1, pos13 = 0 and pos14 = 0. Can you see that if pos11 = 0, pos12 = 0, pos13 = 0 and pos14 = 0, then the horse did not finish in the top four in its previous race?

Before we fit the model, we need to define the dummy variables pos11, pos12, pos13 and pos14 (for position1). This is actually what R assumes we want if we simply define position1 as a factor, using the command:

pos1<-factor(position1)

We can then fit this model using:

ourmodel<-glm(win~sireSR.trunc+days+pos1,binomial)
summary(ourmodel)

The output obtained is:

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.0284817  0.0848857 -35.677  < 2e-16 ***
sireSR.trunc  0.0381184  0.0095645   3.985 6.74e-05 ***
days         -0.0024011  0.0005039  -4.765 1.89e-06 ***
pos11         0.9762834  0.0685220  14.248  < 2e-16 ***
pos12         0.7907706  0.0772282  10.239  < 2e-16 ***
pos13         0.6859132  0.0795358   8.624  < 2e-16 ***
pos14         0.4240642  0.0871682   4.865 1.15e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can you see that all terms in the model are significant? Hence position1 is also a significant (good) predictor of win probability.

Exercise 7

Extend the model to now include position2 and position3. Remember to define suitable factors for these predictors (position2 has been done for you in the script file). Obtain the usual summary output to assess whether position2 and position3 are significant predictors of win probability. Again the solution is shown overleaf.

Solution to Exercise 7

The model is now:

log(p/(1-p)) = β0 + β1 × sireSR.trunc + β2 × days
             + β11 × pos11 + β12 × pos12 + β13 × pos13 + β14 × pos14
             + β21 × pos21 + β22 × pos22 + β23 × pos23 + β24 × pos24
             + β31 × pos31 + β32 × pos32 + β33 × pos33 + β34 × pos34

We can then fit this model by typing:

pos1<-factor(position1)
pos2<-factor(position2)
pos3<-factor(position3)
ourmodel<-glm(win~sireSR.trunc+days+pos1+pos2+pos3,binomial)
summary(ourmodel)

The output obtained is:

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)  -3.1607119  0.0895411 -35.299  < 2e-16 ***
sireSR.trunc  0.0322118  0.0097749   3.295 0.000983 ***
days         -0.0026558  0.0005274  -5.035 4.77e-07 ***
pos11         0.8593258  0.0720222  11.931  < 2e-16 ***
pos12         0.7071949  0.0781914   9.044  < 2e-16 ***
pos13         0.6280146  0.0802878   7.822 5.20e-15 ***
pos14         0.3838465  0.0878972   4.367 1.26e-05 ***
pos21         0.4201704  0.0751272   5.593 2.23e-08 ***
pos22         0.3688256  0.0789903   4.669 3.02e-06 ***
pos23         0.2157665  0.0848243   2.544 0.010969 *
pos24         0.1723176  0.0865667   1.991 0.046528 *
pos31         0.3078373  0.0757830   4.062 4.86e-05 ***
pos32         0.2404234  0.0802307   2.997 0.002730 **
pos33         0.1557823  0.0839587   1.855 0.063530 .
pos34         0.0290673  0.0885978   0.328 0.742850
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case the only variable that does NOT have an asterisk or dot alongside its p-value is pos34. This would suggest that whether a horse finished fourth three races ago has little effect on the win percentage. Also, the dot alongside the p-value for pos33 suggests that whether a horse finished third three races ago has only a borderline effect on the win percentage. However, pos31 and pos32, which relate to the horse finishing first or second three races ago respectively, are still important. Overall, I would perhaps argue that the information relating to position3 seems valuable and should probably be kept in our model. But we do need to be wary of over-fitting. This is when we have a model which fits our data very well, but when deployed on new data performs badly in terms of future prediction.

Our final model can therefore be specified as:

log(p/(1-p)) = -3.161 + 0.0322 × sireSR.trunc - 0.00266 × days
             + 0.859 × pos11 + 0.707 × pos12 + 0.628 × pos13 + 0.384 × pos14
             + 0.420 × pos21 + 0.369 × pos22 + 0.216 × pos23 + 0.172 × pos24
             + 0.308 × pos31 + 0.240 × pos32 + 0.156 × pos33 + 0.029 × pos34

Note that the estimate for the days variable is the only one which is negative. This indicates that the log-odds actually decrease as the number of days since last run increases. This means that the win probability decreases as the number of days since last run increases, which reflects the nature of the relationship suggested earlier. Note there were a number of issues with the number of days since last run which we have ignored so far and should really investigate before finalising our model.

To use the model for prediction, suppose we have a horse which has a sireSR value of 10%, last ran 5 days ago, finished 1st in its previous race, finished 3rd two races ago and was unplaced three races ago. The predicted log-odds can then be calculated as follows. Since the horse finished 1st in its previous race, it has values of pos11 = 1, pos12 = 0, pos13 = 0 and pos14 = 0. Similarly, since the horse finished 3rd two races ago, it has values of pos21 = 0, pos22 = 0, pos23 = 1 and pos24 = 0. Finally, since the horse was unplaced three races ago, it has values of pos31 = 0, pos32 = 0, pos33 = 0 and pos34 = 0. Hence the model predicted log-odds are given by:

log(p/(1-p)) = -3.161 + 0.0322 × 10 - 0.00266 × 5
             + 0.859 × 1 + 0.707 × 0 + 0.628 × 0 + 0.384 × 0
             + 0.420 × 0 + 0.369 × 0 + 0.216 × 1 + 0.172 × 0
             + 0.308 × 0 + 0.240 × 0 + 0.156 × 0 + 0.029 × 0
             = -1.777

We can therefore calculate the predicted odds in R by typing:

exp(-1.777)

The answer is 0.169. This means that fair odds of the horse winning would be 1 to 0.169, i.e. 5.9 to 1, or decimal odds of 6.9 if you include the stake. Hence the probability of winning is 1/6.9 = 0.145.
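As a one-line sanity check (an aside, not in the original handout), plogis() turns the log-odds of -1.777 straight into the win probability:

plogis(-1.777)  # about 0.145, matching the hand calculation above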
Again we can use R to calculate these predicted odds and probabilities as follows:

newrace.data<-data.frame(sireSR.trunc=10,days=5,pos1="1",pos2="3",pos3="0")
prob<-predict(ourmodel,newdata=newrace.data,type="response")
odds<-1/prob
odds
prob

Note that in the first line above, the values for pos1, pos2 and pos3 are given in quotes. This is to ensure that they are treated as categorical observations rather than values measured on a scale.

If we wanted model predicted probabilities for the horses in the data set we used to fit the model, we can obtain these and plot them in a histogram using:

prob<-predict(ourmodel,type="response")
hist(prob)
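If you find yourself pricing up many horses this way, the prediction steps above could be wrapped in a small convenience function. The function fair.odds below is purely my own sketch (it is not part of the script file), and it assumes ourmodel has been fitted as above:

fair.odds<-function(model,sire,days.since,p1,p2,p3){
  # build a one-row data frame matching the model's predictors; the
  # position values should be passed as quoted strings (e.g. "1") so
  # they are treated as factor levels
  newrace<-data.frame(sireSR.trunc=sire,days=days.since,pos1=p1,pos2=p2,pos3=p3)
  prob<-as.numeric(predict(model,newdata=newrace,type="response"))
  c(prob=prob,decimal.odds=1/prob)
}
fair.odds(ourmodel,10,5,"1","3","0")   # should reproduce the values above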
1.13 More on Model Assessment

One thing we should consider is how well our model fits our data. One way is to compare the predicted probabilities against the actual proportion of the time that horses with those probabilities won; these should be similar if our model is appropriate. A plot for our model is shown below:

[Figure: predicted probability against actual win proportion, with a dashed 45-degree reference line.]

This plot suggests the model performs quite well on average where the predicted probabilities are lower (less than 0.15). Above this level things do seem to vary more around the expected levels, but this is to some degree more difficult to judge as there are smaller numbers of horses with predicted win probabilities above 0.15. The above plot can be produced using the following R code:

ourmodel<-glm(win~sireSR.trunc+days+pos1+pos2+pos3,
              family=binomial,na.action=na.exclude)
prob.model<-predict(ourmodel,type="response")
prob.cat<-cut(prob.model,breaks=seq(0,1,by=0.01))
table(prob.cat)
prob.actual<-tapply(win,prob.cat,mean)
names.prob.cat<-seq(0.005,0.995,by=0.01)
plot(names.prob.cat,prob.actual,xlim=c(0,0.3),ylim=c(0,0.3),
     xlab="Predicted Probability",ylab="Actual Win Proportion")
abline(0,1,lty=2)

The first line above re-fits our model, but notice that here we have added an additional argument na.action=na.exclude. This makes sure that where a row of our data set has missing data, the model returns its fitted value as a missing value (which R records as NA). This ensures there is a predicted value, even if it is recorded as missing, for every row of our actual data set.

The second line obtains the predicted probabilities using the predict() command we saw earlier, but here we assign them to a variable we call prob.model, since we will also be working with the observed probabilities (proportions).

The third line uses the cut() function to divide the probability scale from 0 to 1 into slices of length 0.01, using what are referred to as "cut-points". The cut-points are thus 0, 0.01, 0.02, 0.03 and so on, and this creates a categorical variable (which we call prob.cat) that identifies the intervals 0 to 0.01, 0.01 to 0.02, 0.02 to 0.03 and so on up to 0.99 to 1.00. The command then identifies which of these intervals each of our model probabilities falls in.

The fourth line uses the table() function so we can see how many horses have a predicted probability falling in each cut-point interval. This command produces the table in the R Console window, part of which is reproduced below. Notice that (perhaps as expected) there are no horses with predicted win probabilities above 0.24.

prob.cat
   (0,0.01] (0.01,0.02] (0.02,0.03] (0.03,0.04] (0.04,0.05] (0.05,0.06]
         13          59         296         839        4051        4214
(0.06,0.07] (0.07,0.08] (0.08,0.09]  (0.09,0.1]  (0.1,0.11] (0.11,0.12]
       2649        1953        1444        1435        1165         936
(0.12,0.13] (0.13,0.14] (0.14,0.15] (0.15,0.16] (0.16,0.17] (0.17,0.18]
        831         602         504         396         307         213
(0.18,0.19]  (0.19,0.2]  (0.2,0.21] (0.21,0.22] (0.22,0.23] (0.23,0.24]
        153          96          64          46          21           2
(0.24,0.25] (0.25,0.26] (0.26,0.27] (0.27,0.28] (0.28,0.29]  (0.29,0.3]
          0           0           0           0           0           0
...

Back now to our R code to produce the model fit plot. The fifth line uses a very useful R function called tapply(), which allows us to obtain various data summaries on one variable, broken down by another grouping variable. In this case we want the mean of the variable win, which takes values of 0 or 1, so that its mean gives us the observed proportion (i.e. actual probability) of horses that won. Here we use tapply() to obtain this actual win proportion amongst horses that had model probabilities in each of the intervals defined earlier by prob.cat. We assign these actual win proportions to a new variable we choose to call prob.actual.

The sixth line uses the seq() command to define a sequence of values that starts at 0.005 and increases up to 0.995 in increments of 0.01. These reflect the mid-points of the cut-point intervals we defined earlier. If our model is a good one, the actual win proportion in each cut-point interval should be approximately equal to its mid-point. We assign these mid-points to a variable we choose to call names.prob.cat.

The last few lines then plot the actual win proportions against these mid-points (the predicted probabilities). Since the two should be roughly equal, we add a line with a slope of 1 and an intercept of zero to show where the plotted points should lie if we have a good model. The resulting plot is the one shown at the start of this section.

We can also assess other aspects of our model using other parts of the output produced by the summary(ourmodel) command. More of the results from that command are shown below:

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.7273  -0.4408  -0.3569  -0.3147   2.9718

MODEL PARAMETER ESTIMATES OMITTED HERE

    Null deviance: 12240  on 22288  degrees of freedom
Residual deviance: 11854  on 22274  degrees of freedom
  (358 observations deleted due to missingness)
AIC: 11884

Another part of the output we can use to decide whether our model is of any use is the value of the Residual deviance (often simply referred to as the deviance). This is a measure of how much error in our model remains unexplained despite having our predictors in the model. As a general rule of thumb, where we have large amounts of data (as we do here), the residual deviance should be less than the residual degrees of freedom if our model is worth pursuing further. For our model the residual deviance is 11,854 and the residual degrees of freedom is 22,274, so this provides some support to suggest our model has value. We can also use the residual deviance to compare different models, but this is a little more complicated and so we overlook it for now.

We can also use the AIC value at the bottom of the output above (here AIC = 11,884) to compare models. AIC stands for Akaike Information Criterion and is a measure of the goodness of fit of our model. However, it is not much use on its own and is only useful for comparing two competing models, where the better model is the one with the lower AIC. If you were to look at the AIC for all the models we fitted earlier, they all had a higher AIC value than the one here. Hence our last model is our best.
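As an illustration, two competing models can be re-fitted and their AIC values listed side by side using R's built-in AIC() function. This is only a sketch, assuming the variables and factors are set up as earlier; note also that AIC comparisons are strictly valid only when both models are fitted to the same rows of data:

model.a<-glm(win~sireSR.trunc+days+pos1,family=binomial)
model.b<-glm(win~sireSR.trunc+days+pos1+pos2+pos3,family=binomial)
AIC(model.a,model.b)   # the better model is the one with the lower AIC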
Finally, the summary output shown earlier also lists a summary of what are referred to as deviance residuals. A deviance residual is calculated by R for each observation in the data set, and the summary shows the smallest (i.e. largest negative) and largest deviance residuals, as well as the values at the median and the 1st and 3rd quartiles. These can be used to indicate how much of the error in the model is contributed by each observation, and hence to identify cases where our model fits badly. However, in logistic regression models, since our outcome variable (win) is binary (0 or 1), these residuals are not that useful and so we will not look at them any further here.

1.14 Problems with Binary Logistic Regression Models in Race Modelling

We must not leave logistic regression without pointing out some of the problems this model has in the context of modelling the win probabilities for horse racing. There are two key problems:

- There is nothing in the model to constrain the win probabilities so that they sum to one across a race.
- The model makes no use of the information that exists in the structure of the data in terms of finishing positions. That is, it makes no use of the fact that the 2nd placed horse beat the 3rd placed horse, and so on.

Both of these can be improved upon by making use of multinomial logistic regression instead of binary logistic regression. This is actually the model that forms the basis for papers such as those by Bolton and Chapman (1986), Chapman (1994) and Benter (1994). However, we needed to understand binary logistic regression before we could even begin to think about moving to this more complex model.

Recommended Books on R

Introductory Statistics with R, Peter Dalgaard. http://www.amazon.com/Introductory-Statistics-R-Computing/dp/0387954759

A Handbook of Statistical Analyses Using R, Brian Everitt and Torsten Hothorn. http://www.amazon.co.uk/Handbook-Statistical-Analyses-Using/dp/1584885394 (free online version available; see below).

Online Resources for R

A free online version of A Handbook of Statistical Analyses Using R, by Brian Everitt and Torsten Hothorn, is available at http://cran.r-project.org/web/packages/HSAUR/vignettes/ The logistic regression chapter is at http://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf

Recommended Papers for Further Reading

Bolton, R. N. and Chapman, R. G. (1986), Searching for Positive Returns at the Track: A Multinomial Logit Model for Handicapping Horse Races, Management Science, 32 (8), pp. 1040-1060. http://www.ruthnbolton.com/Publications/Track.pdf

Chapman, R. G. (1994), Still Searching for Positive Returns at the Track: Empirical Results from 2,000 Hong Kong Races, in Efficiency of Racetrack Betting Markets, eds. Hausch, Lo and Ziemba, pp. 173-181. Reprinted 2008.

Benter, W. (1994), Computer-based Horse Race Handicapping and Wagering Systems, in Efficiency of Racetrack Betting Markets, eds. Hausch, Lo and Ziemba, pp. 183-198. Reprinted 2008.