Predicting Results of a Match-Play Golf Tournament with Markov Chains
Kevin R. Gue, Jeffrey Smith and Özgür Özmen*
*Department of Industrial & Systems Engineering, Auburn University, Alabama, USA, [kevin.gue, jsmith,
ozgur]@auburn.edu
Abstract. We introduce a Markov Chain model for predicting outcomes in golf match-play. The model
uses individual players’ score probability distributions for each hole to estimate the probability of winning
the match. The model is specific both to the individual participants and to the course on which the match
is played. We use six years of PGA ShotLink data to determine individual player statistics and to estimate
the required probability distributions. We compare the model's predictions with the results of the 2010 Ryder Cup
singles match-play (Day 3).
1. Introduction
Golf tournaments take on two major forms: stroke-play and match-play. In stroke-play, a player’s final score
consists of the sum of scores for each hole of the tournament; the player with the lowest score wins. In match-play,
two players compete on a hole-by-hole basis, such that the player with the lower score on a hole wins
one point. Equal scores on a hole yield one half-point for each player. The player with the most points after 18
holes wins. A match may not require a full 18 holes, if one player has a point advantage greater than the
number of remaining holes. For example, if a player is “3-up” with only two holes remaining, he or she has
won the match and play ceases. In this study, we consider only match-play competition.
Match outcome and decision support models for sports have been proposed in several studies (Scarf and
Shi, 2005; Goddard, 2005; Barnett and Clarke, 2005). Reilly and Williams (2003) summarize the effects
of applying scientific methodologies to soccer. Regarding golf, Scheid (1979) simulates the effect of
handicap allowances in golf on players’ chance of winning, and McHale (2010) conducts simulation studies
to examine the fairness of handicapping by using data from a real golf tournament. Similar to these studies,
Franks and McGarry (2003) examine the relationship between observed results and expected results using real
data.
Markov chains are widely used to model sporting events (Sokol, 2004; Kostuk et al., 2001). Berry (2011)
builds a Markov chain model to compare Tiger Woods with other golfers to find out if he has the “persona of
a winner.” Fearing et al. (2011) use PGA Tour ShotLink¹ data to develop distance-based models of putting
performance and to create a new putting performance metric.
Our study combines player statistics from six years of PGA ShotLink data and a Markov chain model to
predict the outcome of golf match-play events. In the following sections, we first describe the aggregated
data and the mathematical model. We then discuss our efforts to validate the data model, present the results
of our computational model for the 2010 Ryder Cup, and conclude with pedagogical notes and future goals.
2. Methodology
2.1 PGA ShotLink Data
PGA ShotLink data is gathered in the major PGA Tour stroke-play events by volunteers using mobile computers and laser rangefinders. We used six years of raw data (2004–2009) consisting of the scores of every player
on every hole in every tournament during those years. We aggregated this data to estimate player performance
statistics by par of hole. The structure of the data is given in Table 1.
For example, we can determine a player’s probability of scoring $i$ strokes on a Par-$j$ hole, where
$i \in \{1, 2, \ldots, 10\}$ and $j \in \{3, 4, 5\}$, as in Table 2. (Professional players almost never score more than 10 on a hole.)
¹ http://www.pgatour.com/story/9596346/
Table 1. Sample player scores on Par-5 holes

                          Number of holes completed in the given number of strokes
Player Name        Obs      1     2     3     4     5     6     7     8     9    10
Mickelson, Phil   1407      0     0    40   678   575    97    12     3     1     0
Mahan, Hunter     1790      0     1    34   733   863   138    18     3     0     0
Watson, Bubba     1079      0     0    38   482   463    82    10     3     1     0
Table 2. Sample probabilities on Par-5 holes

                          Probability of completing a hole in the given number of strokes
Player Name        Obs      1     2     3     4     5     6     7     8     9    10
Mickelson, Phil   1407      0     0  0.03  0.48  0.41  0.07  0.01     0     0     0
Mahan, Hunter     1790      0     0  0.02  0.41  0.48  0.08  0.01     0     0     0
Watson, Bubba     1079      0     0  0.04  0.45  0.43  0.08  0.01     0     0     0
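
The step from the counts in Table 1 to the probabilities in Table 2 can be sketched as below. This is a minimal illustration, not the authors' code: it assumes each probability is simply the relative frequency of a score, and the names `PAR5_COUNTS` and `score_distribution` are hypothetical.

```python
# Score counts on par-5 holes, copied from Table 1.
# Position 0 holds the count for a score of 1, position 9 for a score of 10.
PAR5_COUNTS = {
    "Mahan, Hunter": [0, 1, 34, 733, 863, 138, 18, 3, 0, 0],
    "Watson, Bubba": [0, 0, 38, 482, 463, 82, 10, 3, 1, 0],
}


def score_distribution(counts):
    """Return {score: relative frequency} for scores 1..len(counts).

    Assumption: the probability of a score is estimated by its
    relative frequency among the observed holes.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    return {score: n / total for score, n in enumerate(counts, start=1)}


par5_probs = {name: score_distribution(c) for name, c in PAR5_COUNTS.items()}
```

Rounded to two decimals, these relative frequencies agree with the Mahan and Watson rows of Table 2.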
2.2 Mathematical Model
We model a match-play match as a Markov process, in which the state of the match is the advantage one player
has over the other and the transition probabilities correspond to the probabilities that that player wins, ties, or
loses the current hole. We assume that performance on a hole does not depend on holes already played,
so the process satisfies the required memorylessness property. We also assume that the performance of a player is not
influenced by the identity of his opponent.
Let $A_j$ and $B_j$ be random variables corresponding to the scores of Player A and Player B on a par-$j$ hole.
The probability that A wins the hole is

$$
\begin{aligned}
P(A_j < B_j) &= \sum_{b=1}^{10} P(A_j < b \mid B_j = b)\, P(B_j = b) \\
             &= \sum_{b=1}^{10} P(A_j < b)\, P(B_j = b) \\
             &= \sum_{b=2}^{10} \sum_{a=1}^{b-1} P(A_j = a)\, P(B_j = b),
\end{aligned}
$$

where the second equality holds because each player's score is assumed not to depend on the opponent's.
Similarly, the probability of a tie is

$$
P(A_j = B_j) = \sum_{a=1}^{10} P(A_j = a)\, P(B_j = a),
$$

and the probability of a loss is

$$
P(A_j > B_j) = 1 - P(A_j < B_j) - P(A_j = B_j).
$$
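
The three per-hole probabilities above can be computed directly from two score distributions. The sketch below is illustrative only: the function name `hole_outcome_probs` and the dictionary representation (score mapped to probability) are assumptions, and the two players' scores are treated as independent, as in the model.

```python
def hole_outcome_probs(dist_a, dist_b):
    """Return (win, tie, loss) probabilities for Player A on one hole.

    dist_a and dist_b map a score (strokes) to its probability. They are
    assumed to be independent and to each sum to 1.
    """
    # P(A < B): sum over all score pairs in which A takes fewer strokes.
    win = sum(pa * pb
              for a, pa in dist_a.items()
              for b, pb in dist_b.items()
              if a < b)
    # P(A = B): both players take the same number of strokes.
    tie = sum(pa * dist_b.get(a, 0.0) for a, pa in dist_a.items())
    # P(A > B): the remaining probability mass.
    loss = 1.0 - win - tie
    return win, tie, loss


# Toy example (made-up distributions): both players score 4 or 5 equally often.
toy = {4: 0.5, 5: 0.5}
win, tie, loss = hole_outcome_probs(toy, toy)  # (0.25, 0.5, 0.25)
```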
With the probabilities of win, tie, and loss for each of the three pars (3, 4, 5), we can completely specify
a state transition diagram (see Figure 1). The match, which may be defined from either player’s perspective,
begins in state zero and proceeds hole-by-hole, using probabilities appropriate to the par of each hole. In an
18-hole match-play we have 21 different states. Gray nodes indicate termination states, in which one player
has won. There is also a termination state of tie after 18 holes.
The structure of the state diagram suggests a simple, recursive expression for the probability that the match
is in state $m$ after $h$ holes. Let $w_h$, $t_h$, and $\ell_h$ be the probabilities of a win, tie, and loss on hole $h$. These
probabilities will depend, of course, on the par of the hole. In general, the probability $p(m, h)$ of being in state
$m$ after $h$ holes is given by the recursion

$$
p(m, h) = p(m-1, h-1)\, w_h + p(m, h-1)\, t_h + p(m+1, h-1)\, \ell_h.
$$
If states $(m-1, h-1)$, $(m, h-1)$, or $(m+1, h-1)$ are infeasible (e.g., 3-up after two holes), we set their
respective state probabilities to zero. Similarly, if a state is feasible, but the transition is not, we modify the
recursion appropriately. For example, $p(2, 18) = p(1, 17)\, w_{18}$ because the other “feeder states” ($p(3, 17)$ and
$p(2, 17)$) are winning states, and therefore the match is over if they are reached (see Figure 1).

Figure 1. State diagram of a match. Gray circles represent terminating states, indicating the end of the match.

The probability that a player wins the match is the sum of probabilities of reaching the winning states. The
probability that the match ends in a tie is p(0, 18).
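
A minimal sketch of this recursion is given below. It is not the authors' implementation: the function name `match_probabilities` is hypothetical, and the example course layout `PARS` is invented purely for illustration. The per-par values for Stricker vs. Westwood are taken from Table 5, with the loss probability computed as one minus the win and tie probabilities.

```python
from collections import defaultdict


def match_probabilities(hole_probs):
    """Return (P(A wins), P(tie), P(A loses)) for one match.

    hole_probs holds one (w, t, l) tuple per hole: the probabilities
    that Player A wins, ties, or loses that hole.
    """
    n_holes = len(hole_probs)
    live = {0: 1.0}            # p(m, h) for states in which play continues
    p_win = p_loss = 0.0
    for h, (w, t, l) in enumerate(hole_probs, start=1):
        remaining = n_holes - h
        nxt = defaultdict(float)
        for m, p in live.items():
            for step, q in ((1, w), (0, t), (-1, l)):
                m_new = m + step
                mass = p * q
                if m_new > remaining:       # A's lead exceeds holes left: A wins
                    p_win += mass
                elif -m_new > remaining:    # B's lead exceeds holes left: A loses
                    p_loss += mass
                else:                       # match continues (or is all square at the end)
                    nxt[m_new] += mass
        live = nxt
    p_tie = live.get(0, 0.0)                # p(0, n_holes)
    return p_win, p_tie, p_loss


# Per-par (win, tie, loss) for Stricker vs. Westwood, from Table 5.
PER_PAR = {3: (0.262, 0.509, 0.229),
           4: (0.300, 0.467, 0.233),
           5: (0.313, 0.415, 0.272)}
# HYPOTHETICAL par of holes 1..18 (not the actual Ryder Cup course layout).
PARS = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 5, 4, 3, 4, 4, 5, 3, 4]

p_win, p_tie, p_loss = match_probabilities([PER_PAR[par] for par in PARS])
```

Terminal probability mass is accumulated as soon as a lead exceeds the number of holes remaining, so only states in which play continues are carried forward, which corresponds to setting the terminated feeder states to zero in the recursion.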
3. Validation and Results
We assume that the probabilities we derived from the raw data are accurate and applicable in head-to-head
matches between individual players. This is an important assumption and warrants validation. Since the
current data was gathered from stroke-play tournaments, we also wanted to collect match-play data to
provide validity evidence. To our knowledge, there are only two major events that have match-play rounds:
Ryder Cup Day-3 and the Accenture Match Play tournament. We searched the web for these
tournaments’ data² and for head-to-head records between players. Because the Accenture event is a
knockout-style tournament and there are so few match-play tournaments, we could find only a small number of
players who played against each other multiple times. Our goal was to have enough match-play observations
between two players to calculate a binomial confidence interval and compare it with the conditional probabilities
computed using our model. Because the number of observations for each par level was around 20 or below, the
95% confidence intervals were wide, and the conditional probabilities derived from ShotLink data always
fell within them. This does not give us great comfort in our validation efforts, and we are currently seeking additional
head-to-head data in order to improve the validation process.
For our second validation effort, we use only PGA Championship data in PGA ShotLink and assume
that if two players played a hole on the same day in the same round at the same event, we can use that
data as if they played against each other in match-play on that particular hole. We analyzed the data and
picked the two players (Toms and Mickelson) who played the most holes in common on the same days. Using this
data, we then calculated the probabilities of winning, losing, and halving for these players. Assuming the
normal approximation, we calculated binomial 95% confidence intervals on the respective probabilities. Table
3 shows that all of the conditional probabilities calculated by our algorithm using all ShotLink data fall within
the confidence intervals. In terms of validation, our results are still fairly weak. We hope to work with the PGA
Tour to identify and obtain additional data to support our validation effort.

² We could find scorecards for the Accenture Match Play Tournament (2011, 2010, 2009, 2008, 2007, 2006,
2005, 2004) and the Ryder Cup (2010, 2008, 2006, 2004).

Table 3. Validation results for all ShotLink data

         All ShotLink Data           PGA Championship Data       95% Confidence Interval
Hole     Toms    Tie     Mickelson   Toms    Tie     Mickelson   Toms            Tie             Mickelson
Par-3    0.252   0.506   0.242       0.219   0.563   0.219       (0.136,0.301)   (0.463,0.662)   (0.136,0.301)
Par-4    0.270   0.469   0.261       0.273   0.473   0.254       (0.219,0.326)   (0.413,0.534)   (0.201,0.306)
Par-5    0.344   0.399   0.256       0.319   0.403   0.278       (0.212,0.427)   (0.289,0.516)   (0.174,0.381)
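
The interval check described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `wald_interval` and `within` are hypothetical names, z = 1.96 is the usual normal quantile for a 95% interval, and the counts in the example (21 wins in 96 holes) are made up for illustration, since the paper does not report sample sizes.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clip to [0, 1] because the normal approximation can overshoot.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def within(interval, value):
    """True if value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high


# HYPOTHETICAL counts: 21 hole wins observed in 96 head-to-head holes.
interval = wald_interval(21, 96)
# Compare with a model-based probability, e.g. 0.252 (Table 3, Par-3, Toms).
model_is_consistent = within(interval, 0.252)
```

With samples of this size the interval is wide, which is why containment of the model probability is only weak evidence.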
3.1 Ryder Cup 2010 Day-3 Results
The Ryder Cup is a golf competition between teams from Europe and the United States, held
every two years. Each team consists of 12 members who are picked by the respective team captains. The
Ryder Cup involves various match-play competitions between players selected from the two teams.
Currently, the matches consist of eight foursomes matches, eight fourball matches and 12 singles
matches.³ The winner of each match scores a point for his team, or 1/2 point if the match ends in a draw. In
this paper we are interested only in the singles matches, which are played on Day-3 of the Ryder Cup.
The playing order of each team is announced by the team captains the night before the Day-3
session. Players who have the same position in the order play against each other. We ran our algorithm for Ryder Cup 2010
and the results are given in Table 4. Note that the actual winners are shown in bold characters.
Table 4. Results for Ryder Cup 2010 Singles Match Play

Match   US Player          EU Player              P(US Wins)   P(EU Wins)   P(Tie)
1       Stricker, Steve    Westwood, Lee          0.557        0.320        0.122
2       Cink, Stewart      McIlroy, Rory          0.481        0.391        0.128
3       Furyk, Jim         Donald, Luke           0.481        0.389        0.130
4       Johnson, Dustin    Kaymer, Martin         0.592        0.292        0.116
5       Kuchar, Matt       Poulter, Ian           0.475        0.396        0.129
6       Overton, Jeff      Fisher, Ross           0.569        0.309        0.121
7       Watson, Bubba      Jimenez, Miguel A.     0.579        0.301        0.119
8       Woods, Tiger       Molinari, Francesco    0.679        0.213        0.109
9       Fowler, Rickie     Molinari, Edoardo      0.746        0.164        0.090
10      Mickelson, Phil    Hanson, Peter          0.714        0.186        0.100
11      Johnson, Zach      Harrington, Padraig    0.457        0.415        0.128
12      Mahan, Hunter      McDowell, Graeme       0.535        0.342        0.123

In the appendix, we present the probabilities of winning and being tied from the US player’s (first player
listed) perspective. In Table 5, we show the conditional probabilities for the Ryder Cup 2010 match-ups that
are found by our algorithm using all PGA ShotLink data. In Table 6, similarly to our second validation effort,
we present data for the Ryder Cup 2010 match-play opponents drawn from PGA Championship rounds.
We list only the match-ups that have sufficient observations to make the normal (distribution) approximation.
Table 7 gives 95% confidence intervals for the PGA Championship data to compare with the probabilities found using
all ShotLink data.

³ http://en.wikipedia.org/wiki/Ryder_Cup

4. Conclusion
Assuming that the player probabilities we derived from ShotLink data are accurate for match-play, we calculate
the probabilities of winning, losing, and tying for each player against each opponent at each par level (3, 4, and
5). To further validate this assumption, we need more match-play data for comparison. With our recursive
algorithm, we can also find the probabilities of winning an 18-hole match. For the Ryder Cup,
using the same recursive logic (but without termination states and pruning), we can predict which team is more
likely to win the Day-3 singles match-play event, which consists of 12 matches. In the 2010 Ryder Cup, Team Europe
was leading 9.5 to 6.5 before Day-3 started. Our algorithm found an 80% chance of winning for
Team US. Even the chance of the US winning with a deficit of 3+ points was around 40%, which suggested that a very
exciting Day-3 was in store; at least this was an accurate prediction. As a future goal, we are
working on a team selection tool based on our probability model. The tool will assist in the team selection
process by finding “good” player assignments given the opposing team’s line-up.
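
The team-level calculation can be sketched as a convolution over the 12 singles matches, using the match probabilities from Table 4. The code below is an assumption-laden illustration, not the authors' implementation: the names `margin_distribution` and `prob_margin_exceeds` are hypothetical, the matches are treated as independent, each row is renormalized because the published values are rounded, and reading "winning with a deficit of 3+" as a singles margin greater than 3 points is an interpretation.

```python
def margin_distribution(matches):
    """Return {margin: probability}, where margin = US points - EU points.

    matches holds one (p_us, p_eu, p_tie) tuple per singles match. A win
    moves the margin by one point; a tie leaves it unchanged.
    """
    dist = {0: 1.0}
    for p_us, p_eu, p_tie in matches:
        total = p_us + p_eu + p_tie          # renormalize rounded inputs
        steps = ((1, p_us / total), (-1, p_eu / total), (0, p_tie / total))
        nxt = {}
        for margin, p in dist.items():
            for step, q in steps:
                nxt[margin + step] = nxt.get(margin + step, 0.0) + p * q
        dist = nxt
    return dist


def prob_margin_exceeds(dist, threshold):
    """Probability that the margin is strictly greater than threshold."""
    return sum(p for margin, p in dist.items() if margin > threshold)


# (P(US wins), P(EU wins), P(tie)) for the 12 singles matches in Table 4.
TABLE_4 = [
    (0.557, 0.320, 0.122), (0.481, 0.391, 0.128), (0.481, 0.389, 0.130),
    (0.592, 0.292, 0.116), (0.475, 0.396, 0.129), (0.569, 0.309, 0.121),
    (0.579, 0.301, 0.119), (0.679, 0.213, 0.109), (0.746, 0.164, 0.090),
    (0.714, 0.186, 0.100), (0.457, 0.415, 0.128), (0.535, 0.342, 0.123),
]

dist = margin_distribution(TABLE_4)
p_us_wins_session = prob_margin_exceeds(dist, 0)   # US takes more singles points
p_us_overturns_3 = prob_margin_exceeds(dist, 3)    # US gains more than 3 points
```

Unlike the single-match recursion, no state is terminal here: all 12 matches are played, so the margin is tracked through every match without pruning.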
We also assigned this model as a class project in our undergraduate applied probability course to gauge
students’ reactions to the learning experience. Our purpose was to introduce an entertaining but also
stimulating problem that would raise student interest and make the subject matter more
memorable. The feedback we received from the students was encouraging and useful for designing different
implementations of this project assignment. Our future plan is to structure the project in milestones at which
students accomplish one task at a time, such as manipulating the data, calculating conditional probabilities,
calculating match results, and calculating game results (in the Ryder Cup case). We want them to compare
their results with the real-life Ryder Cup results to gain more faith in the method.
5. Acknowledgments
We would like to thank Kin Lo of PGA Tour Headquarters and the PGA Tour for providing us with the
ShotLink data that was used in this research.
Appendix
Table 5. Ryder Cup Match-up Probabilities when All ShotLink data is used

Match-up                   Par-3 Win   Par-3 Tie   Par-4 Win   Par-4 Tie   Par-5 Win   Par-5 Tie
Stricker vs. Westwood      0.262       0.509       0.300       0.467       0.313       0.415
Cink vs. McIlroy           0.251       0.535       0.286       0.472       0.274       0.394
Furyk vs. Donald           0.240       0.530       0.263       0.498       0.292       0.436
Johnson, D vs. Kaymer      0.269       0.496       0.315       0.446       0.355       0.379
Kuchar vs. Poulter         0.260       0.516       0.263       0.492       0.287       0.417
Overton vs. Fisher         0.271       0.515       0.300       0.477       0.314       0.388
Watson vs. Jimenez         0.245       0.504       0.292       0.469       0.395       0.377
Woods vs. Molinari, F      0.197       0.563       0.313       0.504       0.436       0.357
Fowler vs. Molinari, E     0.295       0.482       0.362       0.449       0.403       0.353
Mickelson vs. Hanson       0.286       0.503       0.321       0.462       0.432       0.368
Johnson, Z vs. Harrington  0.251       0.515       0.270       0.479       0.286       0.405
Mahan vs. McDowell         0.254       0.513       0.296       0.459       0.326       0.401

Table 6. PGA Championship results

Match-up                   Par-3 Win   Par-3 Tie   Par-4 Win   Par-4 Tie   Par-5 Win   Par-5 Tie
Stricker vs. Westwood      0.229       0.542       0.280       0.470       0.306       0.500
Furyk vs. Donald           0.156       0.578       0.227       0.471       0.212       0.538
Mickelson vs. Hanson       0.304       0.482       0.299       0.463       0.375       0.531
Johnson, Z vs. Harrington  0.234       0.484       0.285       0.424       0.327       0.423
Mahan vs. McDowell         0.219       0.500       0.159       0.534       0.458       0.375

Table 7. Confidence intervals of PGA Championship results

Match-up                   Par-3 Win       Par-3 Tie       Par-4 Win       Par-4 Tie       Par-5 Win       Par-5 Tie
Stricker vs. Westwood      (0.110,0.348)   (0.401,0.683)   (0.204,0.357)   (0.385,0.555)   (0.155,0.456)   (0.337,0.663)
Furyk vs. Donald           (0.067,0.245)   (0.457,0.699)   (0.164,0.289)   (0.396,0.546)   (0.101,0.323)   (0.403,0.674)
Mickelson vs. Hanson       (0.183,0.424)   (0.351,0.613)   (0.229,0.369)   (0.387,0.540)   (0.207,0.543)   (0.358,0.704)
Johnson, Z vs. Harrington  (0.131,0.338)   (0.362,0.607)   (0.217,0.352)   (0.351,0.498)   (0.199,0.454)   (0.289,0.557)
Mahan vs. McDowell         (0.076,0.362)   (0.327,0.673)   (0.083,0.236)   (0.430,0.638)   (0.259,0.658)   (0.181,0.569)

References
Barnett T. and Clarke S. (2005) Combining player statistics to predict outcomes of tennis matches. IMA
Journal of Management Mathematics 16, 113.
Berry S. (2011) Is Tiger Woods a winner? Mathematical Association of America Distinguished Lecture Series.
Fearing D., Acimovic J. and Graves S. (2011) How to catch a Tiger: Understanding putting performance on
the PGA Tour. Journal of Quantitative Analysis in Sports 7.
Franks I. and McGarry T. (2003) The science of match analysis. Science and Soccer.
Goddard J. (2005) Regression models for forecasting goals and match results in association football.
International Journal of Forecasting 21, 331–340.
Kostuk K., Willoughby K. and Saedt A. (2001) Modelling curling as a Markov process. European Journal of
Operational Research 133, 557–565.
McHale I. (2010) Assessing the fairness of the golf handicapping system in the UK. Journal of Sports Sciences
28, 1033–1041.
Reilly T. and Williams A. (2003) Science and Soccer.
Scarf P. and Shi X. (2005) Modelling match outcomes and decision support for setting a final innings target in
test cricket. IMA Journal of Management Mathematics 16, 161.
Scheid F. (1979) Golf competition between individuals. Winter Simulation Conference: Proceedings of the
11th Conference on Winter Simulation, Volume 2: San Diego, CA, United States, 505–510.
Sokol J. (2004) An intuitive Markov chain lesson from baseball. INFORMS Transactions on Education 5, 47–55.