About Revised European Ratings

I'm Dave de Vos, a Dutch go player.

I wanted to investigate the EGD (European Go Database) rating system, so I attempted to build a revised system that fixes some issues I noticed in it.

The EGD rating system was devised by Aleš Cieply and it is explained here. The EGD manager Aldo Podavini kindly provided the game history from the EGD for me to play with. He also suggested that I reverse engineer the EGD rating system, reproduce the EGD rating history and go from there to tweak it. That is what I did.

My main concern is the a function. It is used to compute the expected game result, so it should predict winrates reasonably well. But the winrates expected by the system's a function don't match observed winrates all that well (see 1/a predicted EGD vs 1/a observed EGD). The expected odds are about twice the observed odds, so the expectations of the EGD are clearly too high. Only around ratings 100 and 2700 do its predictions come closer to the observations.

This means that all players lose more than the system expects against a lower rated player and win more than the system expects against a higher rated player. Over time, this will contract the rating range: 1 grade difference will correspond to less than 100 points rating difference. The most frequent opponent has a rating of about 1700, and the frequency tapers off below and above. Because of this, I expect that the rating range will contract towards 1700. But this trend may be obscured by other deflation or inflation effects.

For some time, I've had this feeling that there is gradual deflation in the mid-dan region of a few rating points per year. I suspect that to some degree this deflation may be attributed to the above cause. And even if it isn't, I see no reason to use a model that doesn't match with observations. So I implemented a revised rating system that uses an a function that matches the observed winrates better.

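The expected game result for a player rated r1 against an opponent rated r2 is computed as: Se(r1, r2) => 1 / (exp(f(r1, r2)) + 1)
where f is the log-odds of the opponent winning, i.e. minus the log-odds of the player's own winning probability.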

The EGD rating system basically uses: f(r1, r2) => (r2 - r1) / a(min(r2, r1))
where a(r) is: a(r) => r < 2700 ? (4100 - r) / 20 : 70.
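
As a minimal sketch, the EGD formulas above could be written in Python as follows. The function names are my own choice for illustration, and the sketch leaves out the epsilon term that the EGD also applies.

```python
import math


def a(r: float) -> float:
    """The EGD 'a' parameter as quoted above: (4100 - r) / 20 below 2700, else 70."""
    return (4100 - r) / 20 if r < 2700 else 70.0


def egd_expected_result(r1: float, r2: float) -> float:
    """Expected result Se for a player rated r1 against an opponent rated r2.

    Hypothetical helper name; epsilon is deliberately ignored in this sketch.
    """
    f = (r2 - r1) / a(min(r1, r2))
    return 1.0 / (math.exp(f) + 1.0)


# A 2100 player against a 2000 and against a 2200 opponent.
print(round(egd_expected_result(2100, 2000), 2))
print(round(egd_expected_result(2100, 2200), 2))
```

This gives about 0.72 and 0.27, close to the 71% and 26% quoted in the first addendum below. The small gap is consistent with the epsilon term that this sketch leaves out.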

The revised system uses the Bradley-Terry model: f(r1, r2) => β(r2) - β(r1)
where β(r) is the antiderivative of the observed ρ(r): β(r) => r * 0.0022 + 0.000025 * 440 * exp(r / 440).
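
The revised expected result can be sketched in Python in the same way. Here ρ(r) = 0.0022 + 0.000025 * exp(r / 440) is simply the derivative of the β given above, and the function names are assumptions made for this illustration.

```python
import math


def rho(r: float) -> float:
    """Derivative of beta: log-odds gained per rating point at rating r."""
    return 0.0022 + 0.000025 * math.exp(r / 440)


def beta(r: float) -> float:
    """Antiderivative of rho, as given in the text."""
    return r * 0.0022 + 0.000025 * 440 * math.exp(r / 440)


def revised_expected_result(r1: float, r2: float) -> float:
    """Bradley-Terry expected result for a player rated r1 against r2."""
    f = beta(r2) - beta(r1)
    return 1.0 / (math.exp(f) + 1.0)


# Numerical sanity check that beta really is an antiderivative of rho.
h = 1e-3
slope = (beta(2100 + h) - beta(2100 - h)) / (2 * h)
print(abs(slope - rho(2100)) < 1e-6)

print(round(revised_expected_result(2100, 2000), 2))
print(round(revised_expected_result(2100, 2200), 2))
```

For a 2100 player this yields roughly 0.62 against a 2000 opponent and roughly 0.37 against a 2200 opponent.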

I also used a different epsilon to reduce the mismatch between declared ranks and ratings in the kyu region. That mismatch could result from underestimating the deflation caused by improving players removing points from the system, and I think it is fair to assume that this occurs more frequently towards the lower end of the rating range. I changed the con function to match better with the other changes, and I made some other changes as well, the most important of which is the rating reset policy.

Towards the lower end of the rating range there is still a discrepancy between observed winrates and the revised expected winrates, but I think this can be attributed to the hard minimum rating of 100 (20k) applied by the EGD. It seems that in the historical data, declared grades are retrofitted to this minimum grade. Sometimes declared grades lower than 20k can be inferred (for example, when a 16k gives a 9 stones handicap to a '20k', it could be inferred that the '20k' actually registered as a 25k), but it seems most of them can't be reconstructed from the data I got. I did find a request by the French Go association to lower the minimum grade to 30k in a recent annual EGF meeting report, but even if it were granted, I don't know if it's even possible to recover these grades from the historical data.

On the Player Rating History page you can compare the rating histories computed with this revised rating system.

This site is still under construction. As it is now, it's just a quick and dirty contraption to share my thoughts and results. I'm still tweaking the system, so the charts may also evolve over time. Please feel free to contact me at dave dot devos at planet dot nl if you have any questions, remarks or suggestions. I also created a topic on the lifein19x19 forum.

The domain name of this site is similar to goratings.org from Rémi Coulom, but there is no connection. I did ask Rémi if he is ok with me using this domain name and he did not mind. By the way, Rémi's system seems to use a standard Elo scale, so I can use my β function to estimate a crude conversion from the European ratings to his pro ratings:

European rating (grade) > goratings.org rating
2700 (1p) > 2750
2730 (2p) > 2824
2760 (3p) > 2902
2790 (4p) > 2985
2820 (5p) > 3073
2850 (6p) > 3167
2880 (7p) > 3266
2910 (8p) > 3371
2940 (9p) > 3483
2970 (10p) > 3602
3000 (11p) > 3729
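
A rough sketch of how such a conversion can be computed: on a standard Elo scale, 400 points correspond to a factor 10 in odds, so a difference in β (natural log-odds) is multiplied by 400 / ln(10). The anchor 2700 = 2750 is an assumption read off the first row of the list, and the function names are hypothetical.

```python
import math

# Standard Elo scale: 400 points per factor 10 in winning odds.
ELO_PER_LOG_ODDS = 400 / math.log(10)


def beta(r: float) -> float:
    """The beta function of the revised system, as given earlier."""
    return r * 0.0022 + 0.000025 * 440 * math.exp(r / 440)


def european_to_elo(r: float, anchor_r: float = 2700.0,
                    anchor_elo: float = 2750.0) -> float:
    """Convert a European rating to an Elo-scale rating.

    The anchor pair (2700, 2750) is an assumption taken from the list above.
    """
    return anchor_elo + (beta(r) - beta(anchor_r)) * ELO_PER_LOG_ODDS


for r in range(2700, 3001, 30):
    print(r, round(european_to_elo(r)))
```

Under these assumptions the output stays within a point or two of the values listed above. Small differences can come from rounding at intermediate steps.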

Addendum 2017-10-05

If the predicted winrates matched observed winrates exactly, on average players would neither gain nor lose points when everybody's skill stays the same.
But the expected winrates don't match observed winrates in the EGD. In the example below I show what happens because of it.

We have a player with rating 2100. He plays games against a player with rating 2000. The EGD expects him to win 71% of these games. In reality he wins about 60% (as observed in the statistics of the EGD).
So his winrate minus the expected winrate is -0.11. His K-factor is 24, so on average he will lose about 2.6 points per game played against this opponent.

The same player also plays against another player with rating 2200. The EGD expects him to win 26% of these games. In reality he wins about 35% (as observed in the statistics of the EGD).
So his winrate minus the expected winrate is +0.11. Using his K-factor of 24, we find that on average he will win about 2.6 points per game played against this opponent.
So if he plays both opponents with the same frequency, his rating will not change on average.

But the demographics of the EGD data show that since 2003, players rated around 2000 appear more frequently in tournament games than players rated around 2200 (the ratio is about 5:4). Correcting for this we find that his rating will change by about (5 * -2.6 + 4 * 2.6) / 9 = -0.29 points per game in this demographic distribution.
This is not much, but if this player plays 25 games a year, which is typical for tournament players in this rating region, he will lose about 7 points in a year. Over 10 years, every player around 2100 rating would lose about 70 points.
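
The weighted average above can be checked with a few lines of Python. This is only an illustration of the arithmetic: the per-game changes of -2.6 and +2.6 points and the 5:4 ratio are the figures quoted in the text, and the helper name is my own.

```python
def average_change_per_game(changes, weights):
    """Weighted mean of per-game rating changes, weighted by opponent frequency."""
    total_weight = sum(weights)
    return sum(c * w for c, w in zip(changes, weights)) / total_weight


# -2.6 points per game against the 2000 opponent (weight 5),
# +2.6 points per game against the 2200 opponent (weight 4).
drift = average_change_per_game([-2.6, +2.6], [5, 4])
print(round(drift, 2))  # -0.29
```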

But the EGD also uses an epsilon parameter. This will give this player 24 * 0.016 = 0.38 free points for every game he plays. This is more than enough to compensate for the expected winrate errors.
One could argue that this epsilon correction would not be necessary if the expected winrates were closer to reality (my finding is that this is indeed the case, and I see no reason to keep these winrate errors).
Nevertheless, it would seem that the expected winrate errors are more than compensated by the epsilon parameter.

Still, there is a gradually increasing difference between declared ranks and ratings in the EGD rating distributions, with a maximum of about 50 points around the lower dan region in 2012. This trend is reversing a bit in recent years, but my theory is that in recent years, dan players chose to comply with the rating system instead of looking at the ranks they have according to handicap.
What other causes could there be?
Is it that players around the lower dan region were overranking themselves more and more between 1996 and 2012?
I cannot rule this out, but neither can I rule out that this deflation is caused by a defect of the rating system.

Another possible cause for deflation is improving players. There are two mechanisms in the EGD that are supposed to compensate for this.
1: The rating reset policy. This is to prevent quickly improving players from removing many points from the system. But the EGD only resets players who get 2 stones stronger between tournaments. That is rather conservative, because getting stronger is usually a more gradual process. Most players don't get stronger that quickly.
2: The epsilon parameter. This should compensate for slowly improving players (which is a much bigger group I assume). But as we have seen above, 3/4 of the epsilon parameter is used up to compensate for the expected winrate errors. So of the original 0.016, only 0.004 is left to counter deflation from slowly improving players!
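
The epsilon budget in point 2 can be reproduced with the figures already quoted: K = 24, epsilon = 0.016 and an average loss of about 0.29 points per game from the expected winrate errors. The variable names below are only for illustration.

```python
K = 24                  # K-factor of the 2100 player in the example
EPSILON = 0.016         # EGD epsilon parameter
DRIFT_PER_GAME = 0.29   # average loss per game from expected winrate errors

bonus_per_game = K * EPSILON                  # free points per game from epsilon
share_used = DRIFT_PER_GAME / bonus_per_game  # fraction eaten by winrate errors
epsilon_left = EPSILON * (1 - share_used)     # what remains against deflation

print(round(bonus_per_game, 3))
print(round(share_used, 2))
print(round(epsilon_left, 3))
```

The share comes out at roughly three quarters, which leaves an effective epsilon of about 0.004.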

So in the end, there isn't much to counter the deflation caused by slowly improving players and they will inevitably take away points from the system, leading to deflation.

So what to do?
1: Fix the expected winrate errors.
2: Use a less conservative reset policy.

The revised system does both and I find it has no need for an epsilon parameter.

Addendum 2017-11-06

DeepMind, the creator of AlphaGo, also uses an Elo rating scale to measure strength. I can use my β function to estimate a conversion from the ratings in their papers about AlphaGo (note that DeepMind extrapolates 230 Elo points per rank downward from 7d, which is incorrect IMO):

Player: European rating (grade) < Elo rating
Fan Hui: 2800 (4p) < 2900
AlphaGo Fan: 2894 (7p) < 3200
AlphaGo Lee: 3020 (12p) < 3704
AlphaGo Master: 3216 (18p) < 4805
AlphaGo Zero: 3257 (20p) < 5099
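
Going from an Elo rating back to a European rating means inverting β, which has no simple closed form, so the sketch below uses bisection. It assumes that the list is anchored on Fan Hui's pair (2800 = 2900 Elo). That anchor is inferred from the list, not a documented choice, and the function names are hypothetical.

```python
import math

# Standard Elo scale: 400 points per factor 10 in winning odds.
ELO_PER_LOG_ODDS = 400 / math.log(10)


def beta(r: float) -> float:
    """The beta function of the revised system, as given earlier."""
    return r * 0.0022 + 0.000025 * 440 * math.exp(r / 440)


def elo_to_european(elo: float, anchor_r: float = 2800.0,
                    anchor_elo: float = 2900.0) -> float:
    """Invert the Elo conversion by bisection (beta is strictly increasing).

    The anchor pair (2800, 2900) is an assumption inferred from the list above.
    """
    target = beta(anchor_r) + (elo - anchor_elo) / ELO_PER_LOG_ODDS
    lo, hi = 0.0, 5000.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if beta(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for name, elo in [("AlphaGo Fan", 3200), ("AlphaGo Lee", 3704),
                  ("AlphaGo Master", 4805), ("AlphaGo Zero", 5099)]:
    print(name, round(elo_to_european(elo)))
```

With this anchor the results land within a point or two of the European ratings listed above.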