Evaluating rating systems

Page of 2 (40 records)

DeinFreund

8 years ago
(edited 8 years ago)

(Chart: ZK Elo applied to FFA games)

There have been many proposals for rating systems over time. Now I would like to give everybody the chance to prove the accuracy of their proposed systems.

Scoring

Your rating system has to make a prediction p for each team, representing its win chance as a fraction between 0.0 and 1.0. Each prediction is scored by the following formula:

Team won:  (1+log2(p)) 
Team lost: (1+log2(1-p))

Formula suggested by Brackman.

The score of all predictions will be summed for your final score.
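
A minimal Java sketch of the scoring rule above, for anyone who wants to check their own numbers. The class name, method names and the parallel-array layout are made up for illustration and are not taken from the evaluation code.

```java
/** Sketch of the scoring rule: 1 + log2(p) for a winner, 1 + log2(1 - p) for a loser. */
final class LogScore {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Score of one prediction p (0 < p < 1) for a team that won or lost. */
    static double score(double p, boolean won) {
        return 1.0 + log2(won ? p : 1.0 - p);
    }

    /** Final score: the sum over all predictions (parallel arrays, hypothetical layout). */
    static double total(double[] predictions, boolean[] won) {
        double sum = 0.0;
        for (int i = 0; i < predictions.length; i++) {
            sum += score(predictions[i], won[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        // A 1v1 where the eventual winner was given 0.75 and the loser 0.25.
        double[] p = {0.75, 0.25};
        boolean[] won = {true, false};
        System.out.println("informed guess: " + total(p, won));
        // Predicting 0.5 for everyone always sums to 0.
        System.out.println("always 0.5:     " + total(new double[] {0.5, 0.5}, won));
    }
}
```

Note that a prediction of exactly 1.0 for a team that then loses (or 0.0 for a team that wins) scores minus infinity, so a rating system should never output absolute certainty.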

Implementing your Rating

Here is my sample implementation of standard ZK Elo*. It would be best if you implemented your rating system by overriding the same methods. If Java is unfamiliar to you, just implement the corresponding methods in your language and I'm sure we'll be able to translate it.

*I left out malus as it doesn't change much, but if you want to improve it feel free to. Tell me if I got something else wrong.

Good luck!

+4 / -0

GoogleFrog

8 years ago

Is that formula just a minimum message length-derived scoring for how efficiently your prediction would compress a string of wins and losses?

+0 / -0

Brackman

8 years ago

Your scoring rule is very good :D. Interesting that it has so much in common with string compression. The formula is very fundamental in information theory.
I assume your class ELO is only a part of your code. The original code's weighting system would improve results further (even though it's a bit strange), but indeed that's a lot of effort for a small advantage. If it treated teams separately instead of putting all losers into one loser team (the original ZK code makes only 1 loser team, too), it would be much easier to generalize it to the teamstrength system.

+0 / -0

DeinFreund

8 years ago

Pshh, I was gonna display the scores as a fraction of the maximum possible score at the end. I just thought averaging might sound more complex and error-prone than it actually is.

My implementation is only the basic principle of zk elo. It already gives pretty similar results though. I also couldn't track whether battles counted for ELO or not, so there will always be some error.

While zk elo ignores actual teams, teams are preserved here so your system can make use of that:

Collection<Collection<Integer>> winnerTeams, Collection<Collection<Integer>> loserTeams

The outer Collection holds the teams; each team is a Collection of user IDs.
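
To make the nesting concrete, here is a small Java sketch of how a 2v2v1 could be passed in with these two parameters. Only the parameter types come from the post; the class, the merge helper and the user IDs are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class TeamsExample {

    /** Flattens several teams into one, e.g. to mimic "all losers in one team". */
    static Collection<Integer> merge(Collection<Collection<Integer>> teams) {
        List<Integer> merged = new ArrayList<>();
        for (Collection<Integer> team : teams) {
            merged.addAll(team);
        }
        return merged;
    }

    public static void main(String[] args) {
        // A 2v2v1 won by users 101 and 102 (made-up IDs).
        Collection<Collection<Integer>> winnerTeams =
                List.<Collection<Integer>>of(List.of(101, 102));
        Collection<Collection<Integer>> loserTeams =
                List.<Collection<Integer>>of(List.of(201, 202), List.of(301));

        System.out.println("winner teams:  " + winnerTeams.size());
        System.out.println("loser teams:   " + loserTeams.size());
        System.out.println("merged losers: " + merge(loserTeams));
    }
}
```

A system that cares about the real teams iterates over loserTeams directly; one that behaves like zk elo can merge them first.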

+0 / -0

Sprung

8 years ago

quote:
Your scoring rule is very good :D

quote:
Formula suggested by Brackman

lul

+2 / -0

Brackman

8 years ago

:D. Your code is very good, too. It is much more general than I thought at first and can be used with many teams. If I implemented zk's elo for testing purposes, I wouldn't do weighting at first either.
So I'd like to give every player with elo x a playerstrength of 10^(x/400) and every team a teamstrength equal to the sum (the average if extra coms are possible) of its players' playerstrengths, and distribute probabilities proportional to teamstrength. If all teams have equal size, extra com consideration makes no difference. I have written a class Teamstrength implements ELO and a class TeamstrengthWithExtraComs implements ELO. I'm not sure if it works; I wrote it in a self-made text editor. Besides the nicer version, I have a version that is more likely to work, without and with extra coms. I think with DeinFreund and Sprung we have here the leading experts to correct my teamstrength code if it doesn't work.

Furthermore, I have a more fundamental objection to teamstrength itself. Teamstrength would give a player in a 1v1v1 the same win chance as the 1 player in a 1v2. The win chance in a 1v2 should be lower. That is, if there are no extra coms. With extra coms, the 1v2 win chance for the 1 player should be higher than the 1v1v1 as has already been discussed somewhere. The whole problem could be solved by distributing probabilities proportional to squares of teamstrengths, but this would make win chances too extreme. Any thoughts, Sprung?
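
The teamstrength idea and the 1v1v1 versus 1v2 objection can be sketched in a few lines of Java. This is not the Teamstrength class mentioned above (that code is not shown here); the names, the array layout and the exponent parameter are assumptions made for illustration.

```java
final class TeamstrengthSketch {

    /** playerstrength = 10^(elo / 400), as described in the post. */
    static double playerStrength(double elo) {
        return Math.pow(10.0, elo / 400.0);
    }

    /** Sum of player strengths, or their average when extra coms are considered. */
    static double teamStrength(double[] teamElos, boolean average) {
        double sum = 0.0;
        for (double elo : teamElos) {
            sum += playerStrength(elo);
        }
        return average ? sum / teamElos.length : sum;
    }

    /** Win chances proportional to teamStrength^exponent (1 = plain, 2 = squares). */
    static double[] winChances(double[][] teams, boolean average, double exponent) {
        double[] chances = new double[teams.length];
        double total = 0.0;
        for (int i = 0; i < teams.length; i++) {
            chances[i] = Math.pow(teamStrength(teams[i], average), exponent);
            total += chances[i];
        }
        for (int i = 0; i < chances.length; i++) {
            chances[i] /= total;
        }
        return chances;
    }

    public static void main(String[] args) {
        double[][] ffa = {{1500}, {1500}, {1500}};      // 1v1v1, equal players
        double[][] oneVsTwo = {{1500}, {1500, 1500}};   // 1v2, equal players

        System.out.println("1v1v1, sum:      " + winChances(ffa, false, 1)[0]);
        System.out.println("1v2, sum:        " + winChances(oneVsTwo, false, 1)[0]);
        System.out.println("1v2, squares:    " + winChances(oneVsTwo, false, 2)[0]);
        System.out.println("1v2, extra coms: " + winChances(oneVsTwo, true, 1)[0]);
    }
}
```

With equal players the solo player gets 1/3 in both the 1v1v1 and the plain 1v2, which is the objection. Squaring pushes the 1v2 chance down to 1/5, and averaging (extra coms) pushes it up to 1/2.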

+0 / -0

DeinFreund

8 years ago
(edited 8 years ago)

Brackman, I'm sure you meant to put a minus sign on the Elo gain for loser teams.

Here's my compiling version of your code (fixed Sprung error): http://hastebin.com/emoyiditab.coffee

Unless I messed something up while fixing the compiler errors, the results are not very astonishing (and also wrong):

Always 0.5: 0.0
Teamstrength without Coms: 0.01926114687777866
Teamstrength with Coms: 0.019319722086680802
ZK Elo: 0.023980837112044084

(This was random balanced/unbalanced Teams from 2v2 up)

Even if I lock it to a specific game size, for example 2v2 or 8v8, ZK ELO still makes better predictions.

In 1v1 there's an improvement:

Teamstrength without Coms: 0.23044596621083455
Teamstrength with Coms: 0.23044596621083455
ZK Elo: 0.22747797449491308

FFA is where ZK Elo falls apart*:

Teamstrength without Coms: 0.28109929290835056
Teamstrength with Coms: 0.2793023428252123
ZK Elo: 0.055679590599419466

*AFAIK ZK doesn't even make FFA predictions; I just used the "put all enemies in one team" assumption.

If you're wondering how exactly this scoring was implemented, you can check https://github.com/DeinFreund/ZKForumParser/blob/master/src/zkforumparser/StatsBattles.java#L745

+0 / -0

Rafalpluk

8 years ago

What does the horizontal axis of the chart represent?

+0 / -0

DeinFreund

8 years ago

Rafalpluk: Battles. 0 represents the first ever FFA battle and the last number is the latest one in my dataset.

+0 / -0

Sprung

8 years ago

quote:
a more fundamental objection to teamstrength itself. Teamstrength would give a player in a 1v1v1 the same win chance as the 1 player in a 1v2. The win chance in a 1v2 should be lower

I don't consider 1v2 breaking teamstrength a problem because it breaks every single a-posteriori rating system. There is no winning.

quote:
the results are not very astonishing

The results are wrong (or rather, the code is). In 1v1, teamstrength and ZK Elo are the same so they must have the exact same rating.
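
The 1v1 equivalence is easy to check numerically. The sketch below assumes that ZK Elo uses the standard Elo expectation 1 / (1 + 10^((b - a) / 400)) for two players; the class and method names are made up.

```java
final class OneVsOneCheck {

    /** Standard Elo expected score of a player rated a against a player rated b. */
    static double eloExpected(double a, double b) {
        return 1.0 / (1.0 + Math.pow(10.0, (b - a) / 400.0));
    }

    /** Teamstrength win chance in a 1v1: own strength over the sum of both strengths. */
    static double teamstrength1v1(double a, double b) {
        double strengthA = Math.pow(10.0, a / 400.0);
        double strengthB = Math.pow(10.0, b / 400.0);
        return strengthA / (strengthA + strengthB);
    }

    public static void main(String[] args) {
        double a = 1600.0;
        double b = 1450.0;
        System.out.println("Elo:          " + eloExpected(a, b));
        System.out.println("Teamstrength: " + teamstrength1v1(a, b));
    }
}
```

Dividing numerator and denominator of the teamstrength formula by 10^(a/400) gives the Elo formula, so the two can only differ by floating-point rounding.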

+2 / -0

DeinFreund

8 years ago
(edited 8 years ago)

The teamstrength implementation was changing ELO ratings during the calculation. (The evaluation system was working correctly; it seems this unwanted bug yielded a slight improvement :D)

Results:

1v1:

Teamstrength without Coms: 0.22747797449491305
ZK Elo: 0.22747797449491308
Teamstrength with Coms: 0.22747797449491305

Teams:

ZK Elo: 0.023980837112044084
ZK Elo (K=64): 0.02683235735477585
Teamstrength without Coms: 0.018535611537062093
Teamstrength with Coms: 0.018567291478910775
Teamstrength without Coms (K=64): 0.022331963626771544
Teamstrength with Coms (K=64): 0.022197582594406972

2v2:

Teamstrength without Coms: 0.029091448997986557
ZK Elo: 0.0322964809911929
Teamstrength with Coms: 0.029091448997986585

The best K value for ZK ELO seems to lie between 63 and 64 (currently it's 32).

+0 / -0

Brackman

8 years ago
(edited 8 years ago)

Very good! Interesting results. I didn't think that K has such a big influence in the long run. With how many games did you do it? Nearly all?

A weighting system like zk's effectively increases K for certain battles and decreases it for others and will therefore probably yield better results than the best constant K.

Also, zk elo multiplies K by sqrt(number of all players in game / 2) (like you also did in your code). This only holds for 2 teams; in general it would have to be sqrt(average number of players per team). I didn't use this factor. Can you test whether teamstrength becomes better with it, or zk elo without it? If the ideal K of zk elo with the sqrt factor is ~64, then the ideal K for teamstrength without sqrt is probably higher, because the sqrt factor increases the effective K. So it would be best to always compare systems at their ideal K rather than at the same nominal K with a different effective K.
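
A short Java sketch of the two K factors described above, to show how much the effective K grows with game size. The class and method names are hypothetical.

```java
final class KMod {

    /** Two-team form from the post: K * sqrt(number of all players / 2). */
    static double twoTeamK(double k, int totalPlayers) {
        return k * Math.sqrt(totalPlayers / 2.0);
    }

    /** General form from the post: K * sqrt(average number of players per team). */
    static double generalK(double k, int[] teamSizes) {
        int totalPlayers = 0;
        for (int size : teamSizes) {
            totalPlayers += size;
        }
        return k * Math.sqrt((double) totalPlayers / teamSizes.length);
    }

    public static void main(String[] args) {
        System.out.println("1v1, K=32:   " + twoTeamK(32, 2));
        System.out.println("8v8, K=32:   " + twoTeamK(32, 16));
        System.out.println("1v1v1, K=32: " + generalK(32, new int[] {1, 1, 1}));
        System.out.println("3v2, K=32:   " + generalK(32, new int[] {3, 2}));
    }
}
```

For two teams both forms agree. In an 8v8 a nominal K of 32 already acts like roughly 90, which is why systems with and without the factor should not be compared at the same nominal K.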

The sqrt factor says that the real results of games with many players are more meaningful. I rather think that the importance of every game is the same, but that predictions for games with many players should be more distinct. Therefore you can also try using the factor D := sqrt(average number of players per team) to modify predicted probabilities p to (p^D)/(p^D+(1-p)^D) instead.
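
The D modification for two teams can be sketched directly from the formula above. The class and method names are hypothetical, and this only covers the two-team case.

```java
final class DMod {

    /** D := sqrt(average number of players per team), here for two equal teams. */
    static double dFactor(int playersPerTeam) {
        return Math.sqrt(playersPerTeam);
    }

    /** Maps p to p^D / (p^D + (1 - p)^D); D > 1 makes predictions more distinct. */
    static double sharpen(double p, double d) {
        double win = Math.pow(p, d);
        double lose = Math.pow(1.0 - p, d);
        return win / (win + lose);
    }

    public static void main(String[] args) {
        double p = 0.6;
        System.out.println("1v1 (D=1): " + sharpen(p, dFactor(1)));
        System.out.println("4v4 (D=2): " + sharpen(p, dFactor(4)));
        System.out.println("even game: " + sharpen(0.5, dFactor(4)));
    }
}
```

With D = 1 the prediction is unchanged, an even game stays at 0.5 for any D, and a 0.6 favourite in a 4v4 moves to about 0.69.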

So on the whole we would have 2^3=8 systems if we don't want to code a complicated weighting system:
1. zk elo or teamstrength
2. with/without sqrt factor to modify K
3. with/without D factor to modify p
(all with their ideal K, but even with constant K you can still compare systems that both either have sqrt factor to modify K or not)

I don't know if you want to test them all.. :)

+0 / -0

DeinFreund

8 years ago
(edited 8 years ago)

I'm using all publicly available battles for the testing.

Using sqrt(N/2) factor for teamstrength:

Teamstrength with Coms: 0.0214564709657561
Teamstrength without Coms: 0.02153829204252216
Teamstrength with Coms (K=64): 0.013549012710202778
Teamstrength without Coms (K=64): 0.013519467431481061
ZK Elo: 0.023980837112044084
ZK Elo (K=64): 0.02683235735477585

I've been playing around with D as well, but it seems to do pretty much the same as the K factor mod. If I combine both D and K mod the results go negative.. Wasn't there some idea of a fundamental improvement over ELO? I wonder if we could get to try some actually different rating systems.

+0 / -0

Brackman

8 years ago

I had expected that using D and K mod together would be bad unless K is reduced to compensate.

My main motivation for team strength was that it was the only reasonable FFA solution we had. Considering your results, I have combined elo and teamstrength to a new system you might want to try. I have coded it completely general so that any K, K mod, D mod and extra com consideration can be given in the constructor. I'd like you to test the following constructors:

GeneralEloTeamstrength(64, false, true, true)
GeneralEloTeamstrength(64, false, false, true)
GeneralEloTeamstrength(80, false, false, true)
GeneralEloTeamstrength(64, true, false, true)

The last one is the same as current zk elo for teams, but has reasonable FFA calculation. I assume you used a standard value for K in your class ELO if it isn't given in the constructor?

The constructors above all consider extra coms. My code also allows calculating without extra coms, in case we want to do uneven team balance without them one day. (Then it will rather be good players vs 1 more bad players than average players vs 1 more good and bad players.)

I have noticed that my D modification formula only works for 2 teams. My code above uses the generalized form.

+0 / -0

DeinFreund

8 years ago
(edited 8 years ago)

Fixed code: http://hastebin.com/rumowaluxu.coffee


FFA:

ZK Elo: 0.25485844438712807
ZK Elo (K=64): 0.2657360808127713
Teamstrength without Coms: 0.2866350228401899
Teamstrength without Coms (K=64): 0.28622489804229634
Teamstrength without Coms (K=48): 0.2888074993496073
Teamstrength with Coms: 0.2848292300975108
Teamstrength with Coms (K=64): 0.28448268049986297
GeneralTeamstrength without anything: 0.28051610927148507
GeneralTeamstrength with kMod: 0.2809349868024027
GeneralTeamstrength with kMod and extra coms: 0.27858346124298333
GeneralTeamstrength with kMod and Dmod: 0.28087469719441543
GeneralTeamstrength with extra coms: 0.2782526726828936
GeneralTeamstrength with everything: 0.2782126551467202
GeneralTeamstrength with Dmod: 0.2804710363447592
GeneralTeamstrength with Dmod and extra coms: 0.27791170092843487
Always 0.5: 0.0

Teams:

ZK Elo: 0.022266870981796708
ZK Elo (K=64): 0.025134314691304453
Teamstrength without Coms: 0.02036528875058607
Teamstrength without Coms (K=64): 0.015195225626115786
Teamstrength without Coms (K=48): 0.01940440219773913
Teamstrength with Coms: 0.020365288750586043
Teamstrength with Coms (K=64): 0.0151952256261157
GeneralTeamstrength without anything: 0.01727341975694278
GeneralTeamstrength with kMod: 0.022268228946067867
GeneralTeamstrength with kMod and extra coms: 0.022268228946067864
GeneralTeamstrength with kMod and Dmod: 0.023670405578113386
GeneralTeamstrength with extra coms: 0.017273419756942795
GeneralTeamstrength with everything: 0.023670405578113445
GeneralTeamstrength with Dmod: 0.020300624161614066
GeneralTeamstrength with Dmod and extra coms: 0.02030062416161411
Always 0.5: 0.0

Teams, your four suggestions:

GeneralEloTeamstrength(80, false, false, true): 0.025119824222315297
GeneralEloTeamstrength(64, false, true, true): 0.023044934707556364
GeneralEloTeamstrength(64, false, false, true): 0.02363564600606478
GeneralEloTeamstrength(64, true, false, true): 0.025135718615030066

+1 / -0

Brackman

8 years ago
(edited 8 years ago)

Very nice. I assume by GeneralTeamstrength you mean GeneralEloTeamstrength and that you used K=32 as the standard value? I think the latter is the reason for the low values for GeneralTeamstrength, while my proposals for GeneralEloTeamstrength are quite good. It's all about optimizing K. We know that we need K~64 with K mod. Therefore I proposed a higher K (80) for the variant without K mod and without D mod, and it was indeed better than its K=64 version. It can probably become even better with another K>64. (If everything was used, K<64 would be better.)

As expected, the values for GeneralEloTeamstrength are similar to elo for teams. But for FFA I expected a bigger improvement. Maybe this is also because you ran GeneralEloTeamstrength for FFA only with K=32? ZK elo doesn't even do a correct FFA calculation (probability sum > 1).

+0 / -0

DeinFreund

8 years ago
(edited 8 years ago)

Brackman, I fixed the ZK Elo calculation with a correction factor so that its probabilities sum to 1 as well. This is how it achieved the higher scores this time.
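
One plausible reading of such a correction factor is to divide every raw prediction by the sum of all raw predictions. The Java sketch below shows that reading only; it is an assumption, the actual fix in the evaluation code may differ, and the names are made up.

```java
final class FfaNormalizer {

    /** Rescales raw win chances so that they sum to 1 (assumed correction: divide by the sum). */
    static double[] normalize(double[] raw) {
        double sum = 0.0;
        for (double p : raw) {
            sum += p;
        }
        double[] fixed = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            fixed[i] = raw[i] / sum;
        }
        return fixed;
    }

    public static void main(String[] args) {
        // Three equal players, each given 0.5 against "everyone else as one team".
        double[] raw = {0.5, 0.5, 0.5};
        for (double p : normalize(raw)) {
            System.out.println(p);
        }
    }
}
```

In the example the raw chances sum to 1.5, which is the "probability sum > 1" problem; after rescaling each of the three equal players gets 1/3.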

FFA scores (with equalized per-game weighting):

ZK Elo: 0.2154381955120888
ZK Elo (K=64): 0.22597974298887324
Teamstrength without Coms (K=64): 0.24340332586710858
GTeamstrength(80, true, false, true): 0.24264502158205353
GTeamstrength(80, false, true, true): 0.24144056948238055
GTeamstrength(80, false, false, true): 0.24315627635387257
GTeamstrength(64, true, false, true): 0.24386827371311898
GTeamstrength(64, false, true, true): 0.24275771463505985
GTeamstrength(64, false, false, true): 0.24413863651114887
GTeamstrength(100, false, false, true): 0.23951963535695858
Always 0.5: 0.0

+0 / -0

Brackman

8 years ago
(edited 8 years ago)

We must really have a scout mindset to rating systems! In general, values for FFA are about 10 times higher than for teams, but FFAs are rarer. It would be nice to see how big the influence of the FFA improvement is across all games as a whole.

But further investigation is actually not needed to already conclude that GeneralEloTeamstrength is better for FFA than the current system. The differences are too small for sure conclusions about D and K mod. So finally, I propose using GeneralEloTeamstrength with the weighting system and with D and K mod as currently (which means only K mod), with extra coms, because it is the smallest change that yields a proven improvement. It is the same for teams but better for FFA.

+0 / -0

DeinFreund

8 years ago

Note that what's referred to as "ZK ELO" in my results is not what's currently being used for FFA balance. So even that should already improve FFA balance a lot.

+0 / -0

Firepluk

8 years ago
(edited 8 years ago)

Just record ELO separately for 1 vs 1, 2 vs 2, 3 vs 3, 4 vs 4, 5 vs 5, ... X vs Y.
FFA/COOP should be tracked separately as well,

and simply take the closest elo to the current game size when balancing before game start.

+0 / -1
