On VP Scales

1. Introduction
2. Victory Point (VP) scales - what are they?
3. How many imps is a VP?
4. How many matchpoints is a VP?
5. How do we create a Butler imp VP scale?
6. Conclusion in respect of the anomalous results
7. What affects K?
8. Assigning matches

1. Introduction

This article is written as a result of some anomalous results from an on-line 15-board Butler imp Swiss competition, played over 6 rounds with "incomplete-Barometer" scoring. Since matches were played during the course of a week, early matches had less accurate barometers than late-played matches; hence my term "incomplete-Barometer". The recommendation contained in the EBU White Book at the time of the event was to halve the number of boards and use the VP scale for that number. This advice proved to be flawed.

2. Victory Point (VP) scales - what are they?

The English Bridge Union (EBU) uses VP scales for Swiss events, both Swiss Teams and Swiss Pairs, and also for all-play-all teams events. With some fudging, VP scales can also be used for Butler imp events. There are a number of different sorts of VP scale, but the ones used by the EBU are designed to give an equal probability of any result from 20-0 through 10-10 to 0-20. The statistical caveat is that the matches themselves must be between teams (or pairs) of equal strength, and that each board must be independent of the others. VP scales can be used for any sort of event where the scoring method yields a normal distribution of results.

In this scale there are 21 different possible results, so each VP result represents 100/21 = 4.7619% of the complete range of results. Note that this means the score 10-10 will occur 4.7619% of the time, but every other split will appear twice as often: 11-9 occurs as often as 9-11, each as often as 10-10, so an 11-9/9-11 split (counting both orientations) appears 9.5238% of the time.

If we know the standard deviation of a set of results we can express the VP scale as a number of standard deviations from the mean for each score. Below is an extract from a table of the normal distribution expressed in standard deviations from the mean. (I've nicked a public-domain table and show just a few rows so you can see how it works.)

Table 1. Gaussian distribution table

 z    .00      .01      .02      .03      .04      .05      .06      .07      .08      .09
1.0  0.15865  0.15625  0.15386  0.15150  0.14917  0.14686  0.14457  0.14231  0.14007  0.13786
0.9  0.18406  0.18141  0.17878  0.17618  0.17361  0.17105  0.16853  0.16602  0.16354  0.16109
0.8  0.21185  0.20897  0.20611  0.20327  0.20045  0.19766  0.19489  0.19215  0.18943  0.18673
0.7  0.24196  0.23885  0.23576  0.23269  0.22965  0.22663  0.22363  0.22065  0.21769  0.21476
0.6  0.27425  0.27093  0.26763  0.26434  0.26108  0.25784  0.25462  0.25143  0.24825  0.24509
0.5  0.30853  0.30502  0.30153  0.29805  0.29460  0.29116  0.28774  0.28434  0.28095  0.27759
0.4  0.34457  0.34090  0.33724  0.33359  0.32997  0.32635  0.32276  0.31917  0.31561  0.31206
0.3  0.38209  0.37828  0.37448  0.37070  0.36692  0.36317  0.35942  0.35569  0.35197  0.34826
0.2  0.42074  0.41683  0.41293  0.40904  0.40516  0.40129  0.39743  0.39358  0.38974  0.38590
0.1  0.46017  0.45620  0.45224  0.44828  0.44433  0.44038  0.43644  0.43250  0.42857  0.42465
0.0  0.50000  0.49601  0.49202  0.48803  0.48404  0.48006  0.47607  0.47209  0.46811  0.46414

At least it's reassuring to see that 68.27% of all results fall within 1 std dev of zero. Don't believe everything you read on the net. All we need to do is interpolate the number of standard deviations at cumulative intervals in multiples of 4.7619% to get the VP scale expressed in standard deviations. I've used straight-line interpolation for the third digit, as it's not particularly significant in the grand scheme of things.
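If you'd rather not interpolate by hand, here is a minimal sketch (Python, assuming SciPy is available; norm.ppf is the inverse of the normal CDF, so it is an exact version of reading Table 1 backwards):

    # Sketch: the VP-scale boundaries in standard deviations, computed
    # rather than interpolated by hand from Table 1. Assumes SciPy.
    from scipy.stats import norm

    STEP = 100.0 / 21.0  # each of the 21 results covers 4.7619% of outcomes

    for vps in range(11, 21):  # winning scores 11-9 up to 20-0
        # cumulative percentage from the mean to the lower edge of the band:
        # 11-9 starts 2.3810% above the mean, 20-0 starts 45.2381% above it
        lower = (vps - 10.5) * STEP
        z = norm.ppf(0.5 + lower / 100.0)
        print(f"{vps}-{20 - vps}: from {z:.3f} std devs")

This reproduces the lower edges in Table 2 below, give or take a digit of rounding in the hand interpolation.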
There is a case for awarding negative VPs to a side which has been soundly trounced. In particular it makes it much less attractive to "shoot" for a good result when one is currently on a score of about 1 or 2 VPs. An assertion of Mike Pomfrey's is that for negative VPs half of the range of a whitewash should be given to 20-0, and the remaining half of the range should be equally distributed from 20-(-1) through 20-(-5). There is good justification for this unequal split: in any head-to-head match, moving from 19-1 to 20-0 is a net difference of 2 VPs, whereas moving from 20-0 to 20-(-1) is a net difference of only 1 VP, and so the first half of the range of a score of 20 can appropriately be assigned to 20-0. This extends the table from 20-0 through to 20-(-5).

Table 2. The 20-0 VP scale expressed in standard deviations

Cum. %age from mean   Std devs from mean   VPs
00.0000-02.3810       0.000-0.059          10-10
02.3810-07.1429       0.059-0.180          11- 9
07.1429-11.9048       0.180-0.303          12- 8
11.9048-16.6667       0.303-0.431          13- 7
16.6667-21.4286       0.431-0.566          14- 6
21.4286-26.1905       0.566-0.712          15- 5
26.1905-30.9524       0.712-0.876          16- 4
30.9524-35.7143       0.876-1.068          17- 3
35.7143-40.4762       1.068-1.309          18- 2
40.4762-45.2381       1.309-1.668          19- 1
45.2381-50.0000       1.668+               20- 0

Table 3. Negative VPs for a 20-0 scale expressed in standard deviations

Cum. %age from mean   Std devs from mean   VPs
45.2381-47.6190       1.668-1.981          20- 0
47.6190-48.0952       1.981-2.074          20 -1
48.0952-48.5714       2.074-2.189          20 -2
48.5714-49.0476       2.189-2.344          20 -3
49.0476-49.5238       2.344-2.592          20 -4
49.5238-50.0000       2.592+               20 -5

3. How many imps is a VP?

Some years ago John Manning published what was at that time original research on ideal VP scales for EBU events. To get to a usable scale for a Teams-of-4 match, we need to know that the standard deviation expressed in imps is K x sqrt(n), where K = 6.5 and n is the number of boards in the match. The figure of 6.5 is the standard deviation of a 1-board match, and was derived empirically by Manning et al after considerable study of large numbers of match results. There is indeed some discussion as to the exact value of K, but there is confidence that it lies between 6.0 and 7.0, with higher values in this range preferred when contestants are not equally matched. Manning produced values of 6 (for a 13-round Swiss simulation) and 6.65 for round robins, whereas McKinnon, an Aussie, used 7 - but that was a while ago, when perhaps bidding was less accurate. Max Bavin for a long time used 20/3, which is simply vulgar.

As a result of some theory by Mike Pomfrey, we can convert Teams-of-4 scales to Teams-of-8 scales, as long as we cross-imp the Teams-of-8 results (giving 4 comparisons per board). Pomfrey asserts that the relationship between two scales varies as the square root of the product of the number of comparisons and the number of results [root(C x R)]. With a bit of juggling we note that Teams-of-4 has C=1, R=2, and so the generalised standard deviation for any cross-imped match is

    K x sqrt(n x C x R / 2)

So let's construct our VP table for an 8-board Teams-of-4 match using K=6.5, C=1, R=2 and n=8 (std dev = 18.38). The EBU uses this scale for 7-9 boards, but what the heck, it's actually computed for 8 boards. We might as well construct the table for negative VPs too.
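Here is a sketch of how the computed imp ranges in Table 4 below are derived (illustrative Python again; the z boundaries are the lower edges lifted from Table 2):

    import math

    # Sketch: imp boundaries for an 8-board Teams-of-4 match, using the
    # generalised standard deviation K * sqrt(n * C * R / 2) from the text.
    K, n, C, R = 6.5, 8, 1, 2
    sigma = K * math.sqrt(n * C * R / 2)  # 18.38 imps

    # Lower edge (in std devs) of each winning score, from Table 2
    z_edges = [0.059, 0.180, 0.303, 0.431, 0.566, 0.712,
               0.876, 1.068, 1.309, 1.668]

    for score, z in enumerate(z_edges, start=11):
        print(f"{score}-{20 - score} starts at {z * sigma:.1f} imps")

Multiplying the Table 3 boundaries by the same 18.38 gives the computed ranges in Table 5.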
Table 4. VP table for 8-board matches

Std devs from mean   VP score   Computed imp range   White Book imp range
0.000-0.059          10-10       0.0- 1.1             0- 0
0.059-0.180          11- 9       1.1- 3.3             1- 2
0.180-0.303          12- 8       3.3- 5.6             3- 4
0.303-0.431          13- 7       5.6- 7.9             5- 6
0.431-0.566          14- 6       7.9-10.4             7- 9
0.566-0.712          15- 5      10.4-13.1            10-12
0.712-0.876          16- 4      13.1-16.1            13-15
0.876-1.068          17- 3      16.1-19.6            16-18
1.068-1.309          18- 2      19.6-24.1            19-23
1.309-1.668          19- 1      24.1-30.7            24-29
1.668+               20- 0      30.7+                30+

Table 5. Extension to the table for negative VPs

Std devs from mean   VP score   Computed imp range   Sensible imp range
1.668-1.981          20- 0      30.7-36.4            31-36
1.981-2.074          20 -1      36.4-38.1            37-38
2.074-2.189          20 -2      38.1-40.2            39-40
2.189-2.344          20 -3      40.2-43.1            41-43
2.344-2.592          20 -4      43.1-47.7            44-47
2.592+               20 -5      47.7+                48+

It is worth noting that although 10-10 is computed as a 2.2 imp spread (-1.1 through +1.1), we score it as a single imp spread of zero to zero. This is done so that the other intervals can slowly increase up to the standard deviation (at the score of 17-3), at which point the ranges increase much more quickly. We need to be looking at matches of about 14 boards before it is sensible to assign the 3 imp spread (0-1) to the score of 10-10. Indeed the VP scale is at best an approximation, as one can see. Also, here we have used a K of 6.5, whereas there is a much better fit to the White Book with K=6.25, which gives a 20-0 of 29.5; I suspect that is what was used for the White Book tables. I wonder whether hand-dealt versus computer-dealt hands has an effect on K? Manning certainly used hand-dealt data.

Using the K x sqrt(n) formula we can easily see that for a match of 32 boards the 20-0 score will be about 60 imps - the ratio of sqrt(32) to sqrt(8) (actually 61 imps, since it's quoted as a range of boards). Following Pomfrey we have for Teams-of-4 C=1, R=2, and for cross-imp Teams-of-8 C=4, R=4, so the relationship is that of root(2) to root(16). If you want to devise a VP scale for cross-imp Teams-of-8 playing 8 boards, you simply multiply the Teams-of-4 imps by 2 x sqrt(2), and you get 85 imps for a 20-0 win.

4. How many matchpoints is a VP?

We can also use Tables 2 and 3 to create Swiss Pairs VPs. This is based on the score for a 1-board match, where 100% awards a 20-0. It follows that a 4-board match would have a 20-0 of 75%, and the generalised formula for Swiss Pairs is 50 + 50/sqrt(n) for the 20-0. fwiw this gives a standard deviation for the 8-board match of 50/(1.668 x sqrt(8)) = 10.598, with a mean of 50. Another way of looking at it: for 8 boards the 20-0 comes to 67.68%. What surprises me is that the overall frequency of the different VP scores really does behave much as the statisticians predict, but there you go; they must be clever people.
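Those Swiss Pairs figures are easy to check; here is a minimal sketch (Python, straight from the two formulas just quoted):

    import math

    # Sketch of the Swiss Pairs arithmetic: a 20-0 needs 50 + 50/sqrt(n)
    # percent, and since 20-0 sits 1.668 std devs above the mean of 50%,
    # the standard deviation is 50 / (1.668 * sqrt(n)).
    def pairs_20_0(n_boards):
        return 50.0 + 50.0 / math.sqrt(n_boards)

    def pairs_sigma(n_boards):
        return 50.0 / (1.668 * math.sqrt(n_boards))

    print(pairs_20_0(1))   # 100.0 -- a 1-board match needs 100% for 20-0
    print(pairs_20_0(4))   # 75.0
    print(pairs_20_0(8))   # 67.68
    print(pairs_sigma(8))  # 10.598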
5. How do we create a Butler Imp VP scale?

We now move on to Butler imp VP scales, and to put it mildly it is yucky. Max Bavin suggests: "To convert Butler IMPs to normal IMPs, I think that you should multiply by 5/6 of root (2R/C). Assuming that t = (c+1) [everyone plays all the boards], then the root factor (2R/C) is definitely correct. The 5/6 factor is a Bavin invention to do with the fact that IMP scales are non-linear." This is a discussion Max and I have been having for a while, but I think we're in agreement. Bavin also suggests, and I agree, that there is a "bigger variance in standard in On-Line bridge than in f2f national championships".

The problem is due to the use of a datum and to the non-linearity of the imp scale. If you win a board at Teams-of-4 by +420, +50 for 470, you get 10 imps. At Butler, if the datum is near the midpoint (ie half the field made it and half the field didn't), say +180, then we score 6 imps against the datum and our opponents lose 6 imps against the datum, so we are net +12 imps; but if the datum is close to 420 or 50 we only score 10. This perplexed all of us - how do we find a VP scale? We concluded, again from inspection of large numbers of imp results where the board has been played many times, that Butler imps overstate the imps scored at Teams-of-4 by a factor of between 1.18 and 1.2; let's call it 6/5. This conclusion has been reached relatively recently, and is empirical.

As to the higher variance in online bridge, one can reasonably say that the Brighton field is much more uniform than the online one, and so we should adjust our K upwards. But there is another problem with online games where we use barometer scoring. You know with 1 board to play what your score is, and if, say, you're trailing 19-1, the very shape of the VP scale makes it worthwhile to "shoot", as your maximum loss is 1 VP but your gain could be several VPs. This means that each board is not independent of all the others, as your approach to the last board is affected by your results on the others. Further, board 15 of such a match contains a larger number of wild swings than usual, and the datum is all over the shop. Because of the shooting effect there is a case for pushing the std dev up, but this is a 1-board effect and I've ascribed 1 imp to it (I have no justification, but I think it's a reasonable idea).

What we need to do is convert normal f2f imps to Butler online imps so that we can establish a VP scale. By inversion, the conversion factor is 6/5 x sqrt(C / (2 x R)), and plugging that into our extant formula for "normal teams" (ie Teams-of-4, where R=2 and C=1) gives

    B x K x sqrt(n x C / (2 x R)) + 1

as the std dev of an online Butler game, where B is the Bavin (or Butler) factor, C is the number of tables minus 1, R is the number of results per board, and K is now 7. So let us consider a 15-board match. At "normal" Teams-of-4 we know the std dev is 6.5 x sqrt(15) = 25.17, and we can multiply by 1.668 to get the 20-0 score = 42.00. For the equivalent online Butler game with 15 tables in play we get a std dev of 1.2 x 7 x sqrt(15 x 14 / 30) + 1 = 23.22 and a 20-0 score of 38.73. The nearest published VP scale is for 10-13 boards with a 20-0 of 36 imps, but we can devise our own - I won't bore you with the math.
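Well, here is a small sketch of it anyway (Python; the helper name is mine, and the +1 is the 1 imp shooting adjustment described above):

    import math

    # Sketch of the section 5 arithmetic: the std dev of an online Butler
    # game is B * K * sqrt(n * C / (2 * R)) + 1, where C = tables - 1 and
    # R = tables (one result per table on each board).
    def butler_sigma(n_boards, tables, B=1.2, K=7.0, shoot=1.0):
        C, R = tables - 1, tables
        return B * K * math.sqrt(n_boards * C / (2 * R)) + shoot

    Z_20_0 = 1.668  # std devs from the mean for a 20-0 (Table 2)

    sigma = butler_sigma(15, 15)      # 23.22 for the match in question
    print(sigma, Z_20_0 * sigma)      # 23.22 38.73 -> a 39-imp 20-0

    # The f2f Teams-of-4 comparison from the text:
    f2f = 6.5 * math.sqrt(15)         # 25.17
    print(f2f, Z_20_0 * f2f)          # 25.17 42.00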
There are two things we can do about the "shooting". Firstly, we can institute negative VPs to cut down the instances of shooting, in which case we can remove my 1 imp adjustment; and secondly, we can start each table on a different board number, to maximise the chances of an "honest" datum.

6. Conclusion in respect of the anomalous results

So there we have it. For the game in question, a 15-table, 15-board online Butler:

1) The 20-0 should have been 39 imps and not 30.
2) We should have used negative VPs.
3) Each table should have started with a different board.

... and inspection of the results with a 39 imp 20-0 shows good correlation with the requirement to equalise all likely scores.

7. What affects K?

Some of the things in this list are known to have an effect; others are just surmise:

1) Variance in the strength of the field
2) Barometer scoring
3) Online play
4) Computer dealing?
5) Homogeneity of bidding system?

Max Bavin has promised me the Brighton Swiss Teams 2004 cards, so perhaps I can answer a few more questions later.

8. Assigning matches

We also need to consider the "best" way of assigning the matches. I think it is reasonable to use a random draw for the first round, and indeed this is what the EBU normally does. Seeded first rounds seem to get a bad press. I believe some ABF games are seeded with top-half teams drawn against bottom-half teams; this round is known as the 'bloodbath'. It certainly gainsays the requirement, mentioned in section 2, that teams should be of equal strength.

We have quite a choice of methods: random draw, raw score difference, raw score quotient, capped raw score and Swiss count-back, to name but a few. Let's look at these methods in some detail, taking an 8-board Swiss Teams as the basis of discussion.

8.1 Raw score difference (goal difference). This looks quite attractive until one considers the team who took 70 imps out of a team of bunnies in the first round for their 20-0. The side effect is that they will be saddled with playing the next strongest team on their score for the rest of the competition.

8.2 Raw score quotient (goal average). Consider the wild but strong team who are imp generators and win their match 70-40 for a score of 20-0, and the tight and strong team who win theirs 35-5. The first team has a quotient of 1.75 and the second 7.0 - yet they have the same VP score. Should the tight team be saddled with other tight teams while the Frenzied Four have to play yet another Oxbridge 1st team?

8.3 Swiss count-back (strength of previous opponents). For the 2nd round we'd better have another random draw. By the third round we will find that the teams who started slowly will have played teams who are more likely to have got better scores generally, and will therefore be computed to have played stronger teams. This seems to be ok in a sense, but you could just as well rank the teams in order of strength of previous opponents and get a totally different assignment list, with vast numbers of mis-matches compared with their actual VP scores.

8.4 Capped raw score instead of VPs. We'll set the cap at the 20-0 score. This looks ok too, until you think of the team who takes 3 x 30 imp wins for +90. There won't be a team close to them, and the competition is over for the rest of the contestants. This is why we use a Swiss: so that no one team can run away, and to make the competition more attractive for the bulk of the contestants.

8.5 Random draw. I feel strongly that this must be the correct method. A Swiss is designed to find a winner, and if the 20-0 scale is fine enough to find a winner then it has done its job. If you don't like having a random draw then use a 30-0 or 40-0 scale to decrease the size of the groups on the same score - but why even bother, as it makes no difference to the scale's job of finding the winner. The important point of a Swiss is that it divides the field into equal parts: if you achieve a given score then, for that event, for that field, that is your measure of merit, and you have no more or less merit than any other team on that score.

In conclusion, if you are going to Swiss then by all means use a fine enough scale and pick any of the above methods, all of which are flawed, or recognise that a Swiss is designed to find a winner and that a random draw is equally unfair to everyone.

This article is still under construction.

John Probst, February 2004; revised April 2004
This article is in the public domain subject to acknowledgement.