
Evaluating rating systems

40 posts, 2905 views

(Figure: ZK Elo applied to FFA games)

There have been many proposals for rating systems over time. Now I would like to give everybody the chance to prove the accuracy of their proposed systems.

Scoring


Your rating system has to make a prediction p for each team, representing its win chance as a fraction between 0.0 and 1.0. Each prediction is scored by the following formula:
Team won:  (1+log2(p)) 
Team lost: (1+log2(1-p))


Formula suggested by Brackman

The score of all predictions will be summed for your final score.
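
As a rough illustration of the scoring rule above, here is a minimal Java sketch. The class and method names (PredictionScore, score, total) are made up for this example and are not taken from the actual evaluation code.

```java
import java.util.List;

/** Sketch of the scoring rule: 1 + log2(p) for a win, 1 + log2(1 - p) for a loss. */
final class PredictionScore {
    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Score of a single prediction p (win chance between 0 and 1) given the real outcome. */
    static double score(double p, boolean won) {
        return 1.0 + log2(won ? p : 1.0 - p);
    }

    /** Sum of all scores; predictions.get(i) belongs to outcomes.get(i). */
    static double total(List<Double> predictions, List<Boolean> outcomes) {
        double sum = 0.0;
        for (int i = 0; i < predictions.size(); i++) {
            sum += score(predictions.get(i), outcomes.get(i));
        }
        return sum;
    }
}
```

A prediction of 0.5 scores 0 whatever happens, a correct prediction of 1.0 scores the maximum of 1, and a confident wrong prediction is punished without bound (p = 0 for a team that wins gives minus infinity).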

Implementing your Rating


Here is my sample implementation of standard ZK Elo*. It would be best if you implemented your rating system by overriding the same methods. If Java is unfamiliar to you, just implement the corresponding methods in your language and I'm sure we'll be able to translate it.

*I left out malus as it doesn't change much, but if you want to improve it feel free to. Tell me if I got something else wrong.


Good luck!
+4 / -0


8 years ago
Is that formula just a minimum message length-derived scoring for how efficiently your prediction would compress a string of wins and losses?
+0 / -0
8 years ago
Your scoring rule is very good :D. Interesting that it has so much in common with string compression. The formula is very fundamental in information theory.[Spoiler]
I assume your class ELO is only a part of your code. The original code's weighting system would improve results further (even though it's a bit strange), but indeed that's a lot of effort for a small advantage. If it treated teams separately instead of putting all losers into one loser team (the original ZK code makes only 1 loser team, too), it would be much easier to generalize it to the teamstrength system.
+0 / -0

8 years ago
Pshh I was gonna display the scores as fraction of maximum possible score at the end. I just thought averaging might sound more complex and error-prone than it actually is.

My implementation is only the basic principle of zk elo. It already gives pretty similar results though. I also couldn't track whether battles counted for ELO or not, so there will always be some error.

While zk elo ignores actual teams, teams are preserved here so your system can make use of that:
Collection<Collection<Integer>> winnerTeams, Collection<Collection<Integer>> loserTeams

The outer Collection is the Teams, where each Team is a collection of UserIDs.
+0 / -0

8 years ago
quote:
Your scoring rule is very good :D

quote:
Formula suggested by Brackman

lul
+2 / -0
8 years ago
:D. Your code is very good, too. It is much more general than I thought at first and can be used with many teams. If I implemented zk's elo for testing purposes, I wouldn't do weighting at first, either. [Spoiler]
So I'd like to give every player with elo x a playerstrength of 10^(x/400) and every team a teamstrength equal to the sum of its players' playerstrengths (the average if extra coms are possible), and distribute probabilities proportionally to teamstrength. If all teams have equal size, extra com consideration makes no difference. I have written a class Teamstrength implements ELO and a class TeamstrengthWithExtraComs implements ELO. I'm not sure if they work, as I wrote them in a self-made text editor. Besides the nicer version, I have a version that is more likely to work, without and with extra coms. I think with DeinFreund and Sprung we have the leading experts here to correct my teamstrength code if it doesn't work.
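
To make the idea concrete, here is a minimal Java sketch of the teamstrength prediction as described in this post. It is not the Teamstrength class mentioned above; the name TeamstrengthSketch, the method names, and the elo lookup map are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

/** Sketch: win chances proportional to teamstrength, with playerstrength = 10^(elo/400). */
final class TeamstrengthSketch {
    static double playerStrength(double elo) {
        return Math.pow(10.0, elo / 400.0);
    }

    /** Sum of playerstrengths, or their average when extra coms are considered. */
    static double teamStrength(Collection<Integer> team, Map<Integer, Double> elo, boolean extraComs) {
        double sum = 0.0;
        for (int userId : team) {
            sum += playerStrength(elo.get(userId)); // assumes every user ID has a rating
        }
        return extraComs ? sum / team.size() : sum;
    }

    /** One win chance per team, in iteration order of teams; the chances sum to 1. */
    static List<Double> winChances(Collection<? extends Collection<Integer>> teams,
                                   Map<Integer, Double> elo, boolean extraComs) {
        List<Double> strengths = new ArrayList<>();
        double total = 0.0;
        for (Collection<Integer> team : teams) {
            double s = teamStrength(team, elo, extraComs);
            strengths.add(s);
            total += s;
        }
        List<Double> chances = new ArrayList<>();
        for (double s : strengths) {
            chances.add(s / total);
        }
        return chances;
    }
}
```

In a 1v1 this reduces to the usual Elo expectation: a 400 point gap gives the weaker player 1/(1+10) = 1/11.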

Furthermore, I have a more fundamental objection to teamstrength itself. Teamstrength would give a player in a 1v1v1 the same win chance as the 1 player in a 1v2. The win chance in a 1v2 should be lower. That is, if there are no extra coms. With extra coms, the win chance of the 1 player in a 1v2 should be higher than in a 1v1v1, as has already been discussed somewhere. The whole problem could be solved by distributing probabilities proportionally to the squares of teamstrengths, but this would make win chances too extreme. Any thoughts, Sprung?
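
A small worked example of this objection, written as a Java sketch with equally rated players (playerstrength 1 each). The class name and the exponent parameter are invented for the illustration; exponent 1 is plain teamstrength, exponent 2 is the squared variant.

```java
/** Sketch: win chances proportional to teamstrength^exponent. */
final class TeamstrengthExponentDemo {
    static double[] winChances(double[] teamStrengths, double exponent) {
        double[] chances = new double[teamStrengths.length];
        double total = 0.0;
        for (int i = 0; i < teamStrengths.length; i++) {
            chances[i] = Math.pow(teamStrengths[i], exponent);
            total += chances[i];
        }
        for (int i = 0; i < chances.length; i++) {
            chances[i] /= total;
        }
        return chances;
    }

    public static void main(String[] args) {
        double[] ffa = {1.0, 1.0, 1.0}; // 1v1v1, equal players
        double[] oneVsTwo = {1.0, 2.0}; // 1v2, summed strengths, no extra coms
        // Plain teamstrength: both give the single player 1/3.
        System.out.println(winChances(ffa, 1.0)[0]);
        System.out.println(winChances(oneVsTwo, 1.0)[0]);
        // Squared teamstrength: 1v1v1 stays at 1/3, 1v2 drops to 1/5.
        System.out.println(winChances(ffa, 2.0)[0]);
        System.out.println(winChances(oneVsTwo, 2.0)[0]);
    }
}
```

The "too extreme" part shows up in a plain 1v1: with a 400 point gap the strengths are 1 and 10, so squaring turns the underdog's 1/11 into 1/101.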
+0 / -0
Brackman
I'm sure you meant to put a minus sign on the ELO gain for loser teams.

Here's my compiling version of your code (fixed Sprung error): http://hastebin.com/emoyiditab.coffee

Unless I messed something up while fixing the compiler errors, the results are not very astonishing (and also wrong):
[Spoiler]

If you're wondering how exactly this scoring was implemented, you can check https://github.com/DeinFreund/ZKForumParser/blob/master/src/zkforumparser/StatsBattles.java#L745
+0 / -0


8 years ago
What does horizontal axis of the chart represent?
+0 / -0

8 years ago
Rafalpluk: Battles. 0 represents the first ever FFA battle, and the last number is the latest one in my dataset.
+0 / -0

8 years ago
quote:
a more fundamental objection to teamstrength itself. Teamstrength would give a player in a 1v1v1 the same win chance as the 1 player in a 1v2. The win chance in a 1v2 should be lower

I don't consider 1v2 breaking teamstrength a problem because it breaks every single a-posteriori rating system. There is no winning.

quote:
the results are not very astonishing

The results are wrong (or rather, the code is). In 1v1, teamstrength and ZK Elo are the same so they must have the exact same rating.
+2 / -0
The teamstrength implementation was changing ELO ratings during the calculation. (The evaluation system was working correctly, seems this unwanted bug yielded a slight improvement :D)

Results:
[Spoiler]

The best K value for ZK ELO seems to lie between 63 and 64 (currently it's 32).
+0 / -0
Very good! Interesting results. I didn't think that K has such a big influence in the long run. With how many games did you do it? Nearly all?

A weighting system like zk's effectively increases K for certain battles and decreases it for others and will therefore probably yield better results than the best constant K.

Also, zk elo multiplies K by sqrt(number of all players in game / 2) (as you also did in your code). This is only valid for 2 teams. In general it would have to be sqrt(average number of players per team). I didn't use this factor. Can you test whether teamstrength becomes better with it, or zk elo without it? If the ideal K of zk elo with the sqrt factor is ~64, then the ideal K for teamstrength without sqrt is probably higher, because the sqrt factor increases the effective K. So it would be best to always compare systems at their ideal K instead of at the same nominal K but a different effective K.
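
The two versions of the K modification can be put side by side in a short Java sketch. The class and method names are invented for this example; only the two formulas come from the post.

```java
/** Sketch of the K modification: two-team form versus the general form. */
final class KFactorSketch {
    /** Two-team form: K * sqrt(number of all players in game / 2). */
    static double twoTeamK(double baseK, int totalPlayers) {
        return baseK * Math.sqrt(totalPlayers / 2.0);
    }

    /** General form: K * sqrt(average number of players per team). */
    static double generalK(double baseK, int[] teamSizes) {
        int totalPlayers = 0;
        for (int size : teamSizes) {
            totalPlayers += size;
        }
        return baseK * Math.sqrt((double) totalPlayers / teamSizes.length);
    }
}
```

For a 4v4 both forms give 2K. For a 1v1v1 the general form leaves K unchanged, while the two-team form would inflate it by sqrt(1.5).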

The sqrt factor says that real results of games with many players are more meaningful. I rather think that the importance of every game is the same, but that predictions of games with many players should be more distinct. Therefore you can also try using the factor D := sqrt(average number of players per team) to modify predicted probabilities p to (p^D)/(p^D+(1-p)^D) instead.
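
The D modification for two teams is short enough to show as a Java sketch. The class and method names are made up for the illustration.

```java
/** Sketch of the D modification for two teams: p -> p^D / (p^D + (1-p)^D). */
final class DModSketch {
    static double sharpen(double p, double d) {
        double a = Math.pow(p, d);
        double b = Math.pow(1.0 - p, d);
        return a / (a + b);
    }

    /** D := sqrt(average number of players per team). */
    static double dFactor(int totalPlayers, int teamCount) {
        return Math.sqrt((double) totalPlayers / teamCount);
    }
}
```

With D = 1 (1v1) the prediction is unchanged. In a 4v4, D = 2 and a prediction of 0.75 becomes 0.5625/(0.5625+0.0625) = 0.9, while 0.5 stays 0.5.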

So on the whole we would have 2^3=8 systems if we don't want to code a complicated weighting system:
1. zk elo or teamstrength
2. with/without sqrt factor to modify K
3. with/without D factor to modify p
(all with their ideal K, but even with constant K you can still compare systems that both either have sqrt factor to modify K or not)

I don't know if you want to test them all.. :) [Spoiler]
+0 / -0
I'm using all publicly available battles for the testing.

Using sqrt(N/2) factor for teamstrength:
Teamstrength with Coms: 0.0214564709657561
Teamstrength without Coms: 0.02153829204252216
Teamstrength with Coms (K=64): 0.013549012710202778
Teamstrength without Coms (K=64): 0.013519467431481061
ZK Elo: 0.023980837112044084
ZK Elo (K=64): 0.02683235735477585


I've been playing around with D as well, but it seems to do pretty much the same as the K factor mod. If I combine both D and K mod the results go negative.. Wasn't there some idea of a fundamental improvement over ELO? I wonder if we could get to try some actually different rating systems.
+0 / -0
8 years ago
I had expected that using D and K mod together would be bad unless K is reduced to compensate.

My main motivation for teamstrength was that it was the only reasonable FFA solution we had. Considering your results, I have combined elo and teamstrength into a new system you might want to try. I have coded it completely generally, so that any K, K mod, D mod and extra com consideration can be given in the constructor. I'd like you to test the following constructors:
GeneralEloTeamstrength(64, false, true, true)
GeneralEloTeamstrength(64, false, false, true)
GeneralEloTeamstrength(80, false, false, true)
GeneralEloTeamstrength(64, true, false, true)
The last one is the same as current zk elo for teams, but has reasonable FFA calculation. I assume you used a standard value for K in your class ELO if it isn't given in the constructor?

The constructors above all consider extra coms. My code also allows calculating without extra coms, in case we want to do uneven team balance without them one day. (Then it will rather be good players vs 1 more bad players than average players vs 1 more good and bad players.)

I have noticed that my D modification formula is only for 2 teams. My code above uses the generalized form.
+0 / -0
Fixed code: http://hastebin.com/rumowaluxu.coffee

[Spoiler]
+1 / -0
Very nice. I assume by GeneralTeamstrength you mean GeneralEloTeamstrength and that you used K=32 as standard value? I think the latter is the reason for the low values for GeneralTeamstrength, while my proposals for GeneralEloTeamstrength are quite good. It's all about optimizing K. We know that we need K~64 with K mod. Therefore I proposed a higher K (80) for the variant without K mod and without D mod, and it was indeed better than its K=64 version. It can probably become even better with another K>64. (If everything were used, K<64 would be better.)

As expected, the values for GeneralEloTeamstrength are similar to elo for teams. But for FFA I expected a bigger improvement. Maybe this is also because you did GeneralEloTeamstrength for FFA only with K=32? ZK elo doesn't even do correct FFA calculation (probability sum > 1).[Spoiler]
+0 / -0
Brackman: I fixed the ZK elo calculation with a correction factor so that its probabilities sum to 1 as well. This is how it achieved the higher scores this time.

FFA scores (with equalized per game weighting):
[Spoiler]
+0 / -0
We really must have a scout mindset about rating systems! In general, values for FFA are about 10 times higher than for teams, but FFAs are rarer. It would be nice to see how big the influence of the FFA improvement is across all games as a whole.

But further investigation is actually not needed to conclude that GeneralEloTeamstrength is better for FFA than the current system. The differences are too small to draw firm conclusions about D and K mod. So finally, I propose using GeneralEloTeamstrength with the weighting system and with D and K mod as currently (which means only K mod), with extra coms, because it is the smallest change that yields a proven improvement. It is the same for teams but better for FFA.[Spoiler]
+0 / -0

8 years ago
Note that what's referred to as "ZK ELO" in my results is not what's currently being used for FFA balance. So even that should already improve FFA balance a lot.
+0 / -0
Firepluk
Just record ELO separately for 1 vs 1, 2 vs 2, 3 vs 3, 4 vs 4, 5 vs 5, ... X vs Y.
FFA/COOP should differ as well

and simply take the closest elo to the current game size when doing balancing before game start
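
One possible reading of this suggestion, as a Java sketch: keep one rating per team size and look up the recorded size nearest to the upcoming game. The class name, the fallback value, and the tie-breaking rule (prefer the smaller size) are assumptions of this example and not part of the proposal.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: one player's ratings keyed by team size, with nearest-size lookup. */
final class PerSizeRating {
    private final TreeMap<Integer, Double> eloByTeamSize = new TreeMap<>();

    void set(int teamSize, double elo) {
        eloByTeamSize.put(teamSize, elo);
    }

    /** Rating for the recorded team size closest to teamSize; fallback if nothing is recorded. */
    double closest(int teamSize, double fallback) {
        Map.Entry<Integer, Double> lower = eloByTeamSize.floorEntry(teamSize);
        Map.Entry<Integer, Double> higher = eloByTeamSize.ceilingEntry(teamSize);
        if (lower == null && higher == null) {
            return fallback;
        }
        if (lower == null) {
            return higher.getValue();
        }
        if (higher == null) {
            return lower.getValue();
        }
        // On a tie the smaller team size wins (arbitrary choice for this sketch).
        boolean lowerIsCloser = teamSize - lower.getKey() <= higher.getKey() - teamSize;
        return lowerIsCloser ? lower.getValue() : higher.getValue();
    }
}
```

Each separate rating would be based on fewer games than a single shared one, which is worth keeping in mind when comparing this against the systems above.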
+0 / -1