On VP Scales - Bridge Guys

1. Introduction
2. Victory Point (VP) scales - what are they?
3. How many imps is a VP?
4. How many matchpoints is a VP?
5. How do we create a Butler Imp VP scale?
6. Conclusion in respect of the anomalous results
7. What affects K?
8. Assigning matches
1. Introduction
This article was written as a result of some anomalous results from an on-line 15-board Butler imp Swiss
competition, played over 6 rounds with "incomplete-Barometer" scoring. Since matches were played during the
course of a week, early matches had less accurate barometers than matches played late; hence my term
"incomplete-Barometer". The recommendation contained in the EBU White Book at the time of the event was to
halve the number of boards and use the VP scale for that number. This advice proved to be flawed.
2. Victory Point (VP) scales - what are they?
The English Bridge Union (EBU) uses VP scales for Swiss events, both Swiss Teams and Swiss Pairs and also for
all-play-all teams events. With some fudging VP scales can also be used for Butler imps events. There are a number
of different sorts of VP scales, but the ones used by the EBU are designed to give an equal probability of any result
from 20-0 through 10-10 to 0-20. The statistical caveat is that this assumes the matches are between teams
(pairs) of equal strength, and that each board is independent.
VP scales can be used for any sort of event where a scoring method yields a normal distribution of results. In this
scale there are 21 different possible results, so each VP result represents 100/21 = 4.7619% of the complete range of
results. Note that this means the score 10-10 will occur 4.7619% of the time but all the other scores will appear
twice as often, because each can arise from either side of the match (eg 11-9 and 9-11 each occur as often as
10-10, so 11-9's will appear 9.5238% of the time).
If we know the std deviation of a set of results we can express the VP scale as a number of standard deviations
from the mean for each score. Below is an extract from a table of the normal distribution expressed in standard
deviations from the mean. (I've nicked a table that's in the public domain and show just a few rows so you can see
how it works.)
Table 1. Gaussian distribution table
z        .00      .01      .02      .03      .04      .05      .06      .07      .08      .09
1.0 0.15865 0.15625 0.15386 0.15150 0.14917 0.14686 0.14457 0.14231 0.14007 0.13786
0.9 0.18406 0.18141 0.17878 0.17618 0.17361 0.17105 0.16853 0.16602 0.16354 0.16109
0.8 0.21185 0.20897 0.20611 0.20327 0.20045 0.19766 0.19489 0.19215 0.18943 0.18673
0.7 0.24196 0.23885 0.23576 0.23269 0.22965 0.22663 0.22363 0.22065 0.21769 0.21476
0.6 0.27425 0.27093 0.26763 0.26434 0.26108 0.25784 0.25462 0.25143 0.24825 0.24509
0.5 0.30853 0.30502 0.30153 0.29805 0.29460 0.29116 0.28774 0.28434 0.28095 0.27759
0.4 0.34457 0.34090 0.33724 0.33359 0.32997 0.32635 0.32276 0.31917 0.31561 0.31206
0.3 0.38209 0.37828 0.37448 0.37070 0.36692 0.36317 0.35942 0.35569 0.35197 0.34826
0.2 0.42074 0.41683 0.41293 0.40904 0.40516 0.40129 0.39743 0.39358 0.38974 0.38590
0.1 0.46017 0.45620 0.45224 0.44828 0.44433 0.44038 0.43644 0.43250 0.42857 0.42465
0.0 0.50000 0.49601 0.49202 0.48803 0.48404 0.48006 0.47607 0.47209 0.46811 0.46414
At least it’s reassuring to see that 68.27% of all results fall within 1 std dev of zero. Don't believe everything you
read on the net.
All we need to do is to interpolate the number of standard deviations for cumulative intervals of multiples of
4.7619% to get the VP scale expressed in standard deviations. I’ve used straight-line interpolation for the third digit
as it’s not particularly significant in the grand scheme of things.
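The interpolation can be checked programmatically. Here is a minimal sketch, using Python's statistics.NormalDist (a modern convenience not available in 2004), that computes the z-boundaries of Table 2 directly from the inverse normal CDF rather than by table lookup:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal: mean 0, std dev 1

# Boundaries between adjacent VP scores, as cumulative %ages from the mean:
# 10-10 occupies 0 to 2.381%, and each further score is 4.7619% wide.
bounds_pct = [(2 * k + 1) * 100 / 42 for k in range(10)]  # 2.381, 7.143, ..., 45.238

# Convert each cumulative percentage to standard deviations from the mean.
z_bounds = [nd.inv_cdf(0.5 + p / 100) for p in bounds_pct]
# z_bounds runs from about 0.06 (the 10-10/11-9 boundary) up to 1.668 (the 20-0 point)
```

The exact inverse-CDF values agree with the straight-line interpolation in Table 2 to within a unit in the third decimal place.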
There is a case for awarding negative VPs to a side which has been soundly trounced. In particular it makes it
much less attractive to "shoot" for a good result when one is currently on a score of about 1 or 2 VPs. An assertion
of Mike Pomfrey's is that for negative VPs half of the range of a whitewash should be given to 20-0, and the
remaining half of the range should be equally distributed between 20-(-1) through 20-(-5). There is good
justification for this unequal split: in any head-to-head match, moving from 19-1 to 20-0 is a net
difference of 2 VPs, whereas moving from 20-0 to 20-(-1) is a net difference of only 1 VP, and so the first half of the
range of a score of 20 can appropriately be assigned to 20-0. This extends the table from 20-0 through to 20-(-5).
Table 2. The 20-0 VP scale expressed in standard deviations

Cum. %age from mean   Std devs from mean   VPs
00.0000-02.3810       0.000-0.059          10-10
02.3810-07.1429       0.059-0.180          11- 9
07.1429-11.9048       0.180-0.303          12- 8
11.9048-16.6667       0.303-0.431          13- 7
16.6667-21.4286       0.431-0.566          14- 6
21.4286-26.1905       0.566-0.712          15- 5
26.1905-30.9524       0.712-0.876          16- 4
30.9524-35.7143       0.876-1.068          17- 3
35.7143-40.4762       1.068-1.309          18- 2
40.4762-45.2381       1.309-1.668          19- 1
45.2381-50.0000       1.668+               20- 0

Table 3. Negative VPs for a 20-0 scale expressed in standard deviations

Cum. %age from mean   Std devs from mean   VPs
45.2381-47.6190       1.668-1.981          20- 0
47.6190-48.1952       1.981-2.074          20-(-1)
48.1952-48.6764       2.074-2.189          20-(-2)
48.6764-49.0476       2.189-2.344          20-(-3)
49.0476-49.6238       2.344-2.592          20-(-4)
49.6238-50.0000       2.592+               20-(-5)
3. How many imps is a VP?
Some years ago John Manning published what at that time was original research on ideal VP scales for EBU
events. To get to a usable scale for a Team-of-4 match, we need to know that the standard deviation expressed in
imps is K x sqrt(n) where K=6.5 and n is the number of boards in the match. The figure of 6.5 is the standard
deviation for a 1-board match, and was derived empirically by Manning et al after considerable study of large
numbers of match results. Indeed there is some discussion as to the exact value of K but there is confidence it lies
between 6.0 and 7.0, with higher values in this range preferred when contestants are not equally matched. Manning
produced values of 6 (for a 13 round Swiss simulation) and 6.65 for round robins, whereas McKinnon, an Aussie,
used 7, but that was a while ago when perhaps bidding was less accurate. Max Bavin for a long time used 20/3,
which is simply vulgar.
As a result of some theory by Mike Pomfrey, we can convert Teams-of-4 scales to Teams-of-8 scales, as long as
we cross-imp the Team-of-8 results (giving 4 comparisons per board). Pomfrey asserts that the relationship between
two scales varies as the square root of the product of the number of comparisons and the number of results
[root(CxR)]. In fact with a bit of juggling we note that teams of 4 has C=1, R=2 and so the generalised standard
deviation for any cross-imped match is K x sqrt(n x C x R / 2).
So let's construct our VP table for an 8-board Teams-of-4 match using K=6.5, C=1, R=2 and n=8 (Std dev=18.38).
The EBU uses this scale for 7-9 boards, but what the heck, it's actually computed for 8 boards. We might as well
construct the table for negative VP's too.
Table 4. VP table for 8 board matches

Std devs      VP score   Computed    White Book
from mean                imp range   imp range
0.000-0.059   10-10       0.0- 1.1    0- 0
0.059-0.180   11- 9       1.1- 3.3    1- 2
0.180-0.303   12- 8       3.3- 5.6    3- 4
0.303-0.431   13- 7       5.6- 7.9    5- 6
0.431-0.566   14- 6       7.9-10.4    7- 9
0.566-0.712   15- 5      10.4-13.1   10-12
0.712-0.876   16- 4      13.1-16.1   13-15
0.876-1.068   17- 3      16.1-19.6   16-18
1.068-1.309   18- 2      19.6-24.1   19-23
1.309-1.668   19- 1      24.1-30.7   24-29
1.668+        20- 0      30.7+       30+

Table 5. Extension to the table for negative VPs

Std devs      VP score   Computed    Sensible
from mean                imp range   imp range
1.668-1.981   20- 0      30.7-36.4   31-36
1.981-2.074   20-(-1)    36.4-38.1   37-38
2.074-2.189   20-(-2)    38.1-40.2   39-40
2.189-2.344   20-(-3)    40.2-43.1   41-43
2.344-2.592   20-(-4)    43.1-47.7   44-47
2.592+        20-(-5)    47.7+       48+
It is worth noting that although 10-10 is computed as a 2.2 imp spread (-1.1 through +1.1) we score it as a single imp
spread of zero to zero. This is done so that the other intervals can slowly increase up to the standard deviation
(which falls in the 17-3 range), at which point the ranges increase much more quickly. We need to be looking at
matches of about 14 boards before it is sensible to assign the 3 imp spread (0-1) to the score of 10-10. Indeed the
VP scale is at best an approximation, as one can see. Also, here we have used a K of 6.5, whereas there's a much
better fit to the White Book with K=6.25, which gives a 20-0 of 29.5; I suspect that is what was used for the White
Book tables. I wonder whether hand dealing vs computer dealt hands has an effect on K? Manning certainly used
hand-dealt data.
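The computed imp boundaries in Table 4 are just the Table 2 z-boundaries multiplied by the 8-board standard deviation; a minimal sketch of that computation:

```python
import math

K = 6.5                      # std dev per board (Manning)
n = 8                        # boards in the match
sd = K * math.sqrt(n)        # about 18.38 imps

# z-boundaries from Table 2 (standard deviations from the mean)
z = [0.000, 0.059, 0.180, 0.303, 0.431, 0.566,
     0.712, 0.876, 1.068, 1.309, 1.668]

imp_bounds = [round(b * sd, 1) for b in z]
# Consecutive pairs give the computed imp ranges of Table 4:
# 0.0-1.1 (10-10), 1.1-3.3 (11-9), ..., 30.7+ (20-0)
```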
Using the K x sqrt(n) formula we can easily see that for a match of 32 boards we "know" that the 20-0 score will
be about 60 imps, the ratio of sqrt(32) to sqrt(8) (actually 61 imps, since it's quoted as a range of boards).
Following Pomfrey we have for Teams-of-4 C=1; R=2 and for cross-imp Teams-of-8 we have C=4; R=4, and the
relationship is that of root(2) to root(16). So if you want to devise a VP scale for cross-imp teams of 8 playing 8
boards, you simply multiply the Teams-of-4 imps by 2 x sqrt(2), and you get 85 imps as a 20-0 win.
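The generalised formula makes this scaling mechanical; a sketch (the function name is mine):

```python
import math

def match_std_dev(K, n, C, R):
    # Generalised standard deviation for a cross-imped match:
    # K x sqrt(n x C x R / 2), with C comparisons and R results per board.
    return K * math.sqrt(n * C * R / 2)

t4 = match_std_dev(6.5, 8, C=1, R=2)   # Teams-of-4, 8 boards: reduces to K x sqrt(n)
t8 = match_std_dev(6.5, 8, C=4, R=4)   # cross-imped Teams-of-8, 8 boards

# The ratio t8/t4 is root(16)/root(2) = 2 x sqrt(2), so the White Book
# 20-0 of 30 imps scales to roughly 30 x 2 x sqrt(2) = 85 imps.
```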
4. How many matchpoints is a VP?
We can also use Tables 2 and 3 to create Swiss Pair VPs. This is based on the score for a 1-board match, where
100% awards a 20-0. It follows that a 4-board match would have a 20-0 of 75%, and the generalised formula for Swiss
pairs is 50+50/sqrt(n) for the 20-0. fwiw this gives a standard deviation for the 8 board match of 50/(1.668 x
sqrt(8)) = 10.598, and the mean is 50. Another way of looking at it is that 20-0 is 67.68%. What surprises me is that
the overall frequency of the different VP scores really does behave much as the statisticians predict, but there you
go; they must be clever people.
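The Swiss pairs relationship above can be sketched as follows (function names are mine):

```python
import math

def pairs_20_0(n):
    # Percentage score needed for a 20-0 win in an n-board Swiss pairs match.
    return 50 + 50 / math.sqrt(n)

def pairs_std_dev(n):
    # The 20-0 point sits 1.668 std devs above the 50% mean (see Table 2).
    return (pairs_20_0(n) - 50) / 1.668

# pairs_20_0(1) = 100%, pairs_20_0(4) = 75%, pairs_20_0(8) is about 67.68%,
# and pairs_std_dev(8) is about 10.6 percentage points.
```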
5. How do we create a Butler Imp VP scale?
We now move on to Butler imp VP scales, and to put it mildly it is yucky. Max Bavin suggests "To convert Butler
IMPs to normal IMPs, I think that you should multiply by 5/6 of root (2R/C). Assuming that t = (c+1) [everyone
plays all the boards], then the root factor (2R/C) is definitely correct. The 5/6 factor is a Bavin invention to do with
the fact that IMP scales are non-linear". This is a discussion Max and I have been having for a while, but I think
we're in agreement. Bavin also suggests, and I agree, that there is a "bigger variance in standard in On-Line bridge
than in f2f national championships".
The problem is due to using a datum and due to the non-linearity of the imp scale. If you win a board at teams of 4
by +420, +50 for 470 you get 10 imps. At Butler, if the datum is near the midpoint (ie half the field made it, and
half the field didn’t) say +180 then we score 6 imps against the datum and our opponents lose 6 imps against the
datum and so we are net +12 imps, but if the datum is close to 420 or 50 we only score 10. This perplexed all of us
- how do we find a VP scale? We concluded, again from inspection of a large number of imp results, where the
board has been played a large number of times, that Butler imps overstate the number of imps compared with
teams of 4 by a factor of between 1.18 and 1.2; let's call it 6/5. This conclusion has been reached relatively
recently, and is empirical.
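The datum arithmetic in the example above can be made concrete with the standard WBF IMP scale; a sketch (the lookup function is mine):

```python
# Lower bounds (in aggregate points) of each band of the WBF IMP scale;
# the index of a band is its IMP value (0-10 pts = 0 imps, 20-40 = 1, ...).
IMP_BANDS = [0, 20, 50, 90, 130, 170, 220, 270, 320, 370, 430, 500,
             600, 750, 900, 1100, 1300, 1500, 1750, 2000, 2250, 2500,
             3000, 3500, 4000]

def imps(diff):
    # Convert an aggregate-point difference into IMPs, keeping the sign.
    d, sign = abs(diff), (1 if diff >= 0 else -1)
    return sign * max(i for i, lo in enumerate(IMP_BANDS) if d >= lo)

# Teams-of-4: +420 at one table and +50 at the other -> imps(470) = 10 imps.
# Butler with a +180 datum: imps(420 - 180) + imps(50 - (-180)) = 6 + 6 = 12 imps.
```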
As to the higher variance in online bridge one can reasonably say that the Brighton field is much more uniform than
that playing on-line and so we should adjust our K upwards. But there’s another problem with on-line games where
we use barometer scoring. You know with 1 board to play what your score is and, say you’re trailing 19-1, then the
very shape of the VP scale makes it worthwhile to "shoot" as your maximum loss is 1VP but your gain could be
several VPs. This means that each board is not independent of all others, as your result on the last board is affected
by your results on the others. And further, board 15 of such a match contains more wild swings than usual, and the
datum is all over the shop. Also, because of the shooting effect there is a case for pushing the std dev
up, but this is a 1 board effect and I've ascribed 1 imp to it (I have no justification but I think it's a reasonable idea).
What we need to do is convert normal f2f imps to Butler online imps so we can establish a VP scale. By inversion
the formula will be 6/5 x sqrt(C / (2 x R)) and we can plug that into our extant formula for "normal teams" (ie
Teams-of-4) where we know R=2 and C=1, giving B x K x sqrt(n x C / (2 x R)) + 1 as the std dev of an online
Butler game, where B is the Bavin or Butler factor, C is No.Tables - 1, R is the number of results per board, and K
is now 7.
So let us consider a 15 board match, "normal" Teams-of-4: we know the std dev is 6.5 x sqrt(15) = 25.17 and we
can multiply by 1.668 to get the 20-0 score = 42.00. For the equivalent online Butler game with 15 tables in play
we get a std dev of 1.2 x 7 x sqrt(15 x 14 / 30) + 1 = 23.22 and a 20-0 score of 38.73. The nearest published VP
scale is for 10-13 boards with a 20-0 of 36 imps, but we can devise our own - I won't bore you with the math.
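Putting the numbers together for the match in question; a sketch (variable names are mine):

```python
import math

B = 6 / 5         # Butler overstatement factor
K = 7             # per-board std dev, raised for the less uniform online field
n = 15            # boards in the match
tables = 15
C = tables - 1    # comparisons: No.Tables - 1
R = tables        # results per board

# Std dev of the online Butler game, with 1 imp added for the shooting effect.
sd = B * K * math.sqrt(n * C / (2 * R)) + 1

# The 20-0 score sits 1.668 std devs above the mean.
win_20_0 = 1.668 * sd
```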
There are two things we can do about the "shooting". Firstly we institute negative VPs to cut down the instances of
shooting in which case we can remove my 1 imp adjustment, and secondly we can start each table on a different
board number, to maximise the chances of an "honest" datum.
6. Conclusion in respect of the anomalous results
So there we have it. For the game in question a 15 table, 15 board online Butler:
1) 20-0 should have been 39 imps and not 30.
2) We should have used negative VPs.
3) Each table should have started with a different board.
... and inspection of the results with a 39 imp 20-0 shows good correlation with the requirement to equalise all
likely scores.
7. What affects K?
Some of the things in this list are known to have an effect; others are just surmise.
1) Variance in the strength of the field
2) Barometer scoring
3) Online play
4) Computer dealing?
5) Homogeneity of bidding system?
Max Bavin has promised me the Brighton Swiss Teams 2004 cards, so perhaps I can answer a few more questions
later.
8. Assigning matches
We need also to consider what is the "best" way of assigning the matches. I think it's reasonable to use random
draw for the first round, and indeed this is what is normally done by the EBU. Seeded first rounds seem to get bad
press. I believe some ABF games are seeded with top half teams drawn against bottom half teams. This round is
known as the 'bloodbath'. It certainly gainsays the requirement that teams should be of equal strength as mentioned
in section 2.
We have quite a choice of methods: random draw, raw score difference, raw score quotient, capped raw score,
Swiss count-back, to name but a few. Let's look at these methods in some detail, taking an 8-board Swiss teams as
the basis of discussion.
8.1 Raw score difference (goal difference). This looks quite attractive until one considers the team who took 70
imps out of a team of bunnies on the first round for their 20-0. The side effect is that they will be saddled with
playing the next strongest team on their score for the rest of the competition.
8.2 Raw score quotient (goal average). Let's consider the wild but strong team who are imp generators, who win
their match 70-40 for a score of 20-0, and the tight and strong team who win their match 35-5. The first team has an
average of 1.75 and the second 7.0 - yet they have the same score. Should the tight team be saddled with other tight
teams and the Frenzied Four have to play yet another Oxbridge 1st team?
8.3 Swiss count-back (strength of previous opponents). Since there is nothing to count back after round 1, for the
2nd round we'd better have another random draw. On the third round we will find that the teams who started slowly
will have played teams who are more likely
be ok in a sense, but you could just as well rank the teams in order of strength of previous opponents and get a
totally different assignment list with vast numbers of mis-matches, as compared with their actual VP scores.
8.4 Capped raw score instead of VPs. We'll set the cap at the 20-0 score. This looks ok too, until you think of the
team who takes 3 x 30 imp wins for +90. There won't be a team close to them, and the competition is over for the
rest of the contestants. This is why we use Swiss, so that no one team can run away, and to make the competition
more attractive for the bulk of the contestants.
8.5 Random draw. I feel strongly that this must be the correct method. A Swiss is designed to find a winner. If the
20-0 scale is fine enough to find a winner then it's done its job. If you don't like having a random draw then use a 30-0 or 40-0 scale to decrease the size of the groups on the same score. But why even bother with this, as it makes no
difference in doing its job of finding the winner. The important point of a Swiss is that one divides the field into
equal parts, and if you achieve a given score then, for that event, for that field, that is your measure of merit and you
have no merit more or less than any other team on that score.
In conclusion, if you're going to Swiss then use a fine enough scale by all means and pick any of the above
methods, all of which are flawed, or recognise that a Swiss is designed to find a winner and random draw is equally
unfair to everyone.
This article is still under construction
John Probst, February 2004; revised April 2004
this article is in the public domain subject to acknowledgement