Backgammon playing errors and rating in GNUBG

Kees van den Doel

Summary

Simulations were performed to establish the relation between the rating difference (R) between backgammon players of various strengths and the error rates of the two players. This allows a match analysis to translate the error rate of each player into an estimate of the absolute ratings of the players, once the rating of "perfect play" has been defined. Of course, this estimate is only as good as the quality of the analysis, and it becomes unreliable when the players being rated are at a level comparable to the level at which the analysis was performed. However, the measured relation does not depend on the analysis level and remains valid for very high levels of analysis such as rollouts. The main unproven assumption is that human errors can be modeled by injecting noise into the results of the evaluation function.

Data on human players is currently being analysed to compare the predicted ratings with the actual ratings of players. Results thus far are tabulated in Appendix I.

Two sets of simulations were performed. In the first, the relation between the difference in normalized error rate per unforced decision (DEPD) and the rating difference R was measured. The main result of the first simulation set is an almost linear relation between DEPD and R, which depends somewhat on the match length. As a by-product of this investigation, the relative strengths of various playing levels of GNUBG were determined precisely in terms of rating.

The second set of simulations attempts to improve on the linear model of the first set, by separating out the effects of cube errors and chequer play errors. A bilinear fit was made of R to both the normalized chequer play error per unforced move (EPM) and the cube error per unforced cube decision (EPC). The resulting bilinear fit is shown to improve the prediction of the rating difference, especially for matches with very poor cube handling compared to chequer play.

The scripts used for the simulations and a readme file explaining their use are available here.

A spinoff script to batch analyse .mat files can be downloaded here.

The final result is the following formula for the rating difference:
R = a2(N)*EPM+b(N)*EPC,
where
a2(N) = 8798 + 25526/N,
and
b(N) = 863 - 519/N.
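As a quick illustration of how the formula is applied, the following sketch (in Python; the function name and the example error rates are mine, not taken from the simulations) computes the estimated rating difference from the per-decision error rates and the match length:

    def rating_difference(epm, epc, n):
        """Estimated rating difference from the bilinear fit of Part II.

        epm: chequer play error per unforced move (EPM)
        epc: cube error per unforced cube decision (EPC)
        n:   match length (N)
        """
        a2 = 8798 + 25526 / n  # chequer play coefficient a2(N)
        b = 863 - 519 / n      # cube decision coefficient b(N)
        return a2 * epm + b * epc

    # Illustrative example: EPM = 0.010, EPC = 0.005 in a 7pt match
    print(rating_difference(0.010, 0.005, 7))  # about 128 rating points

A player with those (hypothetical) error rates would thus be estimated to play about 128 rating points below error-free play.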

Part I: Rating and DEPD


Introduction

In order to estimate the rating difference between playing levels, we played a series of 1000 matches of a given match length between various levels and estimated the rating difference by analyzing the results. From the analysis we obtain the normalized error rate per unforced decision and the "luck adjusted result" (P), which is an estimate of the match winning probability using variance reduction. These numbers were averaged over the 1000 matches and the variance was also estimated. The rating difference R and a 90% confidence interval were then estimated from the 90% confidence interval of P using the FIBS rating formula, P = 1/(1 + 10^(R*sqrt(N)/2000)), with N the match length.
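For concreteness, here is a minimal sketch (in Python; the function names are mine) of the inversion used in that last step: given a luck adjusted result P for a match to N points, the FIBS formula is solved for R, and the confidence interval on R follows by applying the same inversion to the endpoints of the confidence interval of P.

    import math

    def fibs_win_probability(r, n):
        """FIBS rating formula as quoted above: P = 1 / (1 + 10^(R*sqrt(N)/2000))."""
        return 1.0 / (1.0 + 10.0 ** (r * math.sqrt(n) / 2000.0))

    def rating_from_result(p, n):
        """Invert the FIBS formula: rating difference R from the (luck adjusted)
        match winning probability P for a match to N points."""
        return 2000.0 / math.sqrt(n) * math.log10((1.0 - p) / p)

    # Illustrative example: a luck adjusted result of 0.40 in a 7pt match
    print(rating_from_result(0.40, 7))  # roughly 133 rating points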

Simulations were performed for match lengths of 1, 3, 5, 7, 11, and 23. For practical reasons it was necessary to perform the error and luck analysis at 0-ply (expert level of GNUBG) and consider only levels at the expert level (0-ply) and below (0-ply with noise addition). Equal noise was generated for chequer play and for cube decisions. The following levels (0-ply with given noise) were paired:

Player 1 (noise)   Player 2 (noise)
0                  0.005
0                  0.015
0                  0.02
0                  0.04
0                  0.06
0.04               0.06

Level pairings

 

The assumption that the rating difference between players depends only on DEPD and not on the absolute level of play was verified implicitly with this choice, as the last pairing has both players playing with added noise. If the absolute EPD mattered, the last pairing would give results inconsistent with the first five; this is not the case, except for 1pt matches, as described in the results section.

With these settings, the simulation took about 100 hours on a 950 MHz Pentium III. Simulations were performed on Windows XP, using GAWK and sh scripts with the GNU utilities, in combination with the no-GUI version of GNUBG (build 03 08 07). They should work equally well on Linux.

Three AWK scripts are used to create GNUBG scripts that play a given number of matches, analyze them, and collect statistics. Separate scripts were used in order to modularize the tasks, so the simulation can be interrupted and continued without loss of data. The resulting match statistics were then processed through a fourth AWK script which computes the average rating R and the average DEPD over the set, as well as the 90% confidence intervals obtained from a variance estimate. The levels of the players and the analysis level are set in the .gnubgautorc file, outside the scripts. Further analysis was then performed in MATLAB.
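The statistics step amounts to averaging the per-match quantities and attaching a 90% confidence interval estimated from the sample variance. A minimal sketch of that computation (in Python rather than AWK; the function name and the example values are mine):

    import math

    def mean_and_ci90(values):
        """Mean of per-match values (e.g. DEPD or luck adjusted result) and the
        half-width of a 90% confidence interval, assuming the sample mean is
        approximately normally distributed."""
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
        half_width = 1.645 * math.sqrt(var / n)               # 90% two-sided normal quantile
        return mean, half_width

    # Illustrative per-match DEPD values:
    print(mean_and_ci90([0.0071, 0.0082, 0.0069, 0.0090, 0.0075]))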

Results

The results of the simulations are given in the figure below for all match lengths.

rall.jpg

R versus DEPD for various match lengths

More detailed results are given in the figures below for each specific match length. The measured data points are indicated by the circles. The red line is a piecewise linear interpolation through the data points. The black lines indicate the 90% confidence interval. The green line is a least-squares fit through the data points (minimizing the sum of squares of the rating differences).

r3.jpg

3pt match

 

r5.jpg

5pt match

 

r7.jpg

7pt match

 

r11.jpg

11pt match

 

r23.jpg

23pt match

 

r1.jpg

1pt match

For all match lengths except 1, the linear fit seems quite good, and we can approximate the rating difference by R = a(N)*DEPD, where N is the match length. The function N*a(N) is plotted in the figure below, along with a linear least-squares fit.

lsqcoeff.jpg

Coefficient N*a(N) versus match length

The linear fit of N*a(N) gives the following rating formula:
R = (11971 + 23681/N) * DEPD.
In the figure below the formula is plotted along with the actual data points:

lsqcoeff.jpg

Rating versus DEPD approximated by R = (11971 + 23681/N) * DEPD. The dotted lines are the prediction, the solid lines are piecewise linear segments through the data.
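The fit itself is straightforward to reproduce. The sketch below (in Python rather than the MATLAB used for the original analysis; the data values are illustrative, not the measured results) fits the slope a(N) for each match length by least squares and then fits N*a(N) with a straight line in N, which is equivalent to a(N) = c + d/N. The same approach applies to the coefficients a2(N) and b(N) of Part II.

    import numpy as np

    def fit_slope(depd, r):
        """Least-squares slope a for the model R = a * DEPD (no intercept)."""
        depd, r = np.asarray(depd), np.asarray(r)
        return float(np.dot(depd, r) / np.dot(depd, depd))

    # Illustrative (DEPD, R) measurements per match length N, not the real data:
    data = {3: ([0.004, 0.012, 0.024], [80, 230, 470]),
            5: ([0.004, 0.012, 0.024], [67, 200, 401]),
            7: ([0.004, 0.012, 0.024], [60, 185, 370])}
    Ns = sorted(data)
    a = [fit_slope(*data[n]) for n in Ns]
    c, d = np.polyfit(Ns, [n * an for n, an in zip(Ns, a)], 1)  # N*a(N) = c*N + d
    print("a(N) ~= %.0f + %.0f/N" % (c, d))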

The bump in the curve for the 1pt match around DEPD = 0.015 is caused by the data point from the 0.04 - 0.06 (intermediate - beginner) pairing. Apparently the approximation that the rating difference does not depend on the absolute values of the error rates is less accurate for 1pt matches.

Note that, for the purpose of measuring DEPD between simulated players, it is a reasonable approximation to analyze a match at expert level if both players are playing at expert level or worse. This is because we are only interested in the difference between the error rates. When a match is re-analyzed at World Class level (2-ply), the individual error rates increase, but the difference remains approximately constant. To verify this explicitly, we ran two sets of 40 matches between 0-ply expert level and 0-ply advanced level (0.015 noise) and analyzed them both at 0-ply and at 2-ply. The results are given in the table below.

0-ply    2-ply
0.0075   0.0068
0.0080   0.0072

2-ply versus 0-ply DEPD estimates

We see that the error in the 0-ply estimation appears to be on the order of 5%. As an additional verification, a set of 100 3pt matches was played between the World Class level (2-ply) and the expert level, and analyzed at 2-ply. The resulting DEPD and rating difference R were well predicted by the model. The data point is indicated by the 'X' in the figure for the 3pt match above and fits well on the curve.

Conclusions

The data is well approximated by a linear relation between the rating difference and the error-per-decision difference, of the form R = a(N)*DEPD, where N is the match length. An excellent approximation for a(N) is a(N) = 23681/N + 11971. The approximation is worst for N = 1. The assumption that R does not depend on the absolute EPD but only on the difference was validated, except for N = 1. The assumption that DEPD for matches between players at or below the GNUBG expert level can be approximated by a 0-ply analysis was verified by re-analyzing some matches at 2-ply, with small effect on the DEPD. An explicit simulation of a series of 2-ply versus 0-ply matches resulted in an estimated R which fitted the model very well, supporting the assumption that the linear relation between R and DEPD is valid for any level of analysis.

Part II: Rating and EPM and EPC

Introduction

In the above we have assumed that the rating depends only on DEPD = (totMoveErrors + totCubeErrors)/totalDecisions. In reality the rating difference will depend on some combination of chequerplay skill and cube skill, for which the above formula makes an educated guess. A more powerful predictor, taking into account the separate effects of cube errors and chequerplay errors, can be obtained by simulating matches with independently varied cube and chequerplay noise and fitting the rating by a bilinear form R = a2(N)*EPM + b(N)*EPC, with EPM = moveErr/unforcedMoves and EPC = cubeErr/unforcedCubeDecisions.

For match lengths of 3, 5, 7, 11, and 23, 500 matches were played between GNUBG 0-ply and GNUBG 0-ply with each of 16 different noise settings. The chequer play noise settings were 0, 0.02, 0.04, and 0.06, and the cube play noise settings were 0.05, 0.1, 0.4, and 0.8, giving 16 combinations in total. For each set of 500 matches the EPM and EPC were averaged. The rating difference was estimated as in Part I by using the luck adjusted result.

For each match length, the 16 data points were fitted to R = a2(N)*EPM + b(N)*EPC with a least-squares method, minimizing the squares of the rating differences. For N = 1 there are no cube decisions, and a2(1) = a(1) = 37916.
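A minimal sketch of such a bilinear least-squares fit for one match length, assuming the averaged EPM, EPC, and measured R for each noise combination are available (the numbers below are illustrative, and the function name is mine):

    import numpy as np

    def fit_bilinear(epm, epc, r):
        """Least-squares fit of R = a2*EPM + b*EPC (no intercept) over the
        measured points for one match length; returns [a2, b]."""
        design = np.column_stack([epm, epc])
        coeffs, *_ = np.linalg.lstsq(design, np.asarray(r, dtype=float), rcond=None)
        return coeffs

    # Illustrative numbers for four of the noise combinations:
    epm = [0.005, 0.005, 0.020, 0.020]
    epc = [0.02, 0.10, 0.02, 0.10]
    r = [75, 136, 255, 316]
    a2, b = fit_bilinear(epm, epc, r)
    print(a2, b)  # roughly 12000 and 760 for these illustrative values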

Results

In the figures below we plot R against DEPD. The green "O" indicates the measured rating differences, the green dots the 90% confidence interval of the data, the red "X" the prediction of the linear formula from Part I, and the blue "X" the prediction of the bilinear fit, with the coefficients a2 and b as indicated.

Measurements and fit for match length 3.

 

Measurements and fit for match length 5.

 

Measurements and fit for match length 7.

 

Measurements and fit for match length 11.

 

Measurements and fit for match length 23.

The coefficients a2(N) and b(N) as a function of match length are well approximated by a linear function of 1/N. We show these coefficients and a linear fit in the two figures below.

Fitting Na2(N) with a linear function.

 

Fitting Nb(N) with a linear function.

Conclusions

The bilinear fit improves on the linear fit, especially for large cube errors. The linear estimate is usually too high, reflecting too much weight given to the cube errors, which the bilinear formula corrects for. The data is well approximated by
R = a2(N)*EPM+b(N)*EPC,
where
a2(N) = 8798 + 25526/N,
and
b(N) = 863 - 519/N.

Appendix I: Human data

I have analysed match collections of several online players who have an actual rating, and computed their average expected rating using the bilinear formula from Part II. The interval given with the estimated rating is the 90% confidence interval of the estimation. Also indicated is the estimated rating of GNUBG if it were to play on that site; on FIBS and Gamesite2000, GNUBG bots are actually playing with ratings of 2050 and 2200. This parameter is called the rating offset in GNUBG. Note that the estimated 90% confidence interval is obtained by estimating the variance from the actual data, assuming a normal distribution, which is not reliable for small sample sizes. (I should actually use the t-distribution for smaller sample sizes, to make a better estimate of the confidence intervals.)

Player Site Rating offset # matches Actual rating Estimated rating
Albert Silver FIBS 2050 129 1829 1852 +- 11
Nardy FIBS 2050 140 1769 1749 +- 17
RJ Veldhuizen FIBS 2050 346 1805 1819 +- 15
csg FIBS 2050 24 1577 1540 +- 123
Holger FIBS 2050 94 1750 1753 +- 24
kvandoel Gamesite2000 2200 261 1920 1876 +- 22
quax Gamesite2000 2200 101 1894 1891 +- 45
slork Gamesite2000 2200 33 1863 1918 +- 34
cloots GamesGrid 2000 76 1770 1758 +- 30
Human matches analysed. Indicated are the name of the player, the site where the matches were played, the rating offset used (which is the rating of GNUBG on that site), the number of matches analysed, the player's actual rating, and the GNUBG 0-ply rating estimate based on the bilinear fit to the error rates.
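Regarding the remark above about small samples: a minimal sketch of a t-based 90% confidence interval (in Python with SciPy; the function name and the example values are mine), which widens the interval appropriately when only a few matches are available:

    import math
    from scipy import stats

    def ci90_t(values):
        """90% confidence interval for the mean using the Student t-distribution."""
        n = len(values)
        mean = sum(values) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        half = stats.t.ppf(0.95, df=n - 1) * sd / math.sqrt(n)
        return mean - half, mean + half

    # Illustrative per-match rating estimates for a player with few analysed matches:
    print(ci90_t([1840, 1795, 1862, 1810, 1833]))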

Appendix II: Misc. measurements

Player1       Player2                  Match length   R1-R2           DEPD (e2-e1)   EPM        EPC
expInt        intExp                   5pt            472 [451 493]   --             -0.0344    -0.01
expBeg        advExp                   5pt            115 [103 127]   --             -0.00865   +0.0176
expExp        expBeg                   5pt            31 [21 41]      --             0          -0.0185
expExp        expBeg                   11pt           29 [23 36]      --             0          -0.0165
expExp        expBeg2 (1.0 cubenoise)  5pt            191 [176 206]   --             0          -0.192
wclassWclass  wclassExp                5pt            6 [-15 27]      --             --         --
1-ply         0-ply                    5pt            13 [-4 30]      --             --         --

Some misc. measurements. expBeg, for example, means chequer play at level expert, cube play at level beginner. The rating interval is the 90% confidence interval.