
Evaluating rating systems

7 years ago
quote:
Note that what's referred to as "ZK ELO" in my results is not what's currently being used for FFA balance. So even that should already improve FFA balance a lot.
Yes, indeed I considered that. Therefore GTeamstrength will yield even bigger improvements than your last numbers suggest.

quote:
Just record ELO separately for 1 vs 1 2 vs 2 3 vs 3 4 vs 4 5 vs 5, ... X vs Y,
FFA/COOP should differ as well
I have (had) a simpler idea: just use a weighted average of 1v1 elo and team elo depending on game size:
elo = 1v1elo/n + (1 - 1/n)*teamElo,
where n is the average number of players per team (2.5 in a 2v3). If 1v1 elo should still be kept separate from teams, you can have smallTeamElo and bigTeamElo with
elo = smallTeamElo*1.5/n + (1 - 1.5/n)*bigTeamElo or
elo = smallTeamElo*2/n + (1 - 2/n)*bigTeamElo.
Probably, using 2/N instead of 1/n, where N is the number of all players in the game, is even better. It would be interesting to test those weighted averages on the data! FFA elo could be separated additionally. CZrankAdminLicho 's objection here is that FFA elo needs a lot more games to converge.
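A minimal sketch of that first weighted average (the function and parameter names are illustrative only, not anything from the actual infrastructure):

```
def blended_elo(elo_1v1, team_elo, avg_team_size):
    # weight shifts from 1v1 elo towards team elo as teams get bigger
    n = avg_team_size  # e.g. 2.5 for a 2v3
    return elo_1v1 / n + (1 - 1 / n) * team_elo

blended_elo(2000, 1800, 1.0)  # pure 1v1: 2000.0
blended_elo(2000, 1800, 2.5)  # 2v3: 800 + 0.6 * 1800 = 1880.0
```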
+0 / -0
7 years ago
Just curious: do any of the rating systems discussed here take into account the time as an active player (not resigned, not afk)? I guess it could make a difference for the winning team. For example: a high elo player quits/resigns early; if their team still wins, the remaining team mates should get a higher elo increase.
+0 / -0
Can you define that mathematically? Like for example EloWeight = ResignTime/GameDuration. Also what if they lose?

This will only include resigners, as afk/alt+f4 is not recorded in any way currently.
+1 / -0
7 years ago
I did not think it through in detail, I was just curious whether anybody else is doing it (there is advanced stuff around, and I would not want to reinvent the wheel). Even if no one else is doing it, I will probably not have the time to do it myself, but asking the question is not much of an investment.

Regarding "when you loose", I do not think the computation should be different. It's like a "promise". When you start you "promise" some of your ELO if you loose. If team lost, the "promised" ELO should be given. On the winning side though the question is "whom should get the 'won' ELO". I think it should be taken by people that actually put more effort (not the quitters)...
+0 / -0

20 months ago
I can't get my head around the maths of this scoring system. It seems to punish any attempt to make a prediction with certainty, because the negative score for a mistake costs more than the positive score for a win. For example:

```
import math

score = 0
# Predicted a win for the correct team 51/100, with a probability of 52%
score += 51 * (1 + math.log2(0.52))
# Predicted a win for the wrong team 49/100, with a probability of 52%
score += 49 * (1 + math.log2(0.48))
...
score: -6.163388023594507e-05
```

(Note: for 2 teams I'm not sure if I should be adding 2 * the expression or not, and also I'm not sure if I'm supposed to get the mean either - but I feel like it should be the mean)

Another perspective on this, I'm looking at some data using Trueskill and it can predict the winner ~57% of the time. But it gets a negative score (mean -0.1589) from this. If, on the other hand, I normalise the probability with a fudge:

```
p = 0.54 if p >= 0.5 else 0.46
```

Then the score comes out positive (mean 0.0095). This seems off to me. Similarly I can write p = 0.5 and get a score of 0 with no ability to calculate anything.

What am I missing?
+1 / -0
20 months ago
quote:
I can't get my head around the maths of this scoring system. It seems to punish any attempt to make a prediction with certainty, because the negative score for a mistake costs more than the positive score for a win.
The score is made to compare 2 rating systems; I don't think it makes sense to think about the value on its own (negative/positive).

For example, we have 3 algorithms for 100 games:
- algorithm A - always predicts 50% (useless algorithm, just as reference)
- algorithm B - makes correct decision in 50 games while predicting win chance of 75%, rest fails while predicting win chance of 75%
- algorithm C - makes correct decision in 25 games while predicting win chance of 100%, rest fails while predicting win chance of 50%

which would you say is "better"?

The scoring system in the first post would give:
- algorithm A = 0
- algorithm B = -20
- algorithm C = 25
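
For reference, a rough sketch of how those numbers fall out, assuming the 1 + log2(p) scoring used elsewhere in this thread (the list above rounds B to -20):

```
import math

def score(games):
    # games: list of (count, probability assigned to the outcome that actually happened)
    return sum(n * (1 + math.log2(p)) for n, p in games)

score([(100, 0.50)])             # algorithm A:   0.0
score([(50, 0.75), (50, 0.25)])  # algorithm B: -20.75
score([(25, 1.00), (75, 0.50)])  # algorithm C:  25.0
```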

So, qualitatively I would say that the scoring system favours systems that are correct when they think they are correct, and penalizes systems that are "sure of themselves" when they are wrong.

Compared to the A, B, C algorithms, the algorithm that gave the predictions in your example (the one with 52% win chance) is almost indistinguishable from a constant 50% prediction (never too sure of itself, whether it is right or wrong).

quote:
I'm looking at some data using Trueskill and it can predict the winner 57%
On what data/scenario? (FFA, teams, 1v1, etc.) Also, how "sure" is the algorithm when it fails versus when it is correct? If in the 43% of cases in which it failed it gave a winning chance of 100%, while in the 57% of cases in which it was correct it gave 51%, would you call that a good algorithm?...
+1 / -0

20 months ago
quote:
The score is made to compare 2 rating systems; I don't think it makes sense to think about the value on its own (negative/positive).

Thanks @Malric, that helps a lot. I was stuck thinking it gave you a number that was meaningful in isolation, and indeed the magnitude of the prediction is important.

I'm still feeling there's something not ideal about it though because of the log2. If I compare two predictions:

```
>>> 65 * (1 + math.log2(0.51)) + 35 * (1 + math.log2(0.49))
0.8368727947070336

>>> 65 * (1 + math.log2(0.79)) + 35 * (1 + math.log2(0.21))
-0.9087605487041657
```

Aren't these equally good predictions? They are both wrong 14/100 times. But, not only does the prediction which estimated low get a better score, the sign flips completely for the prediction that estimated high; seemingly worse than just guessing 50:50. For the high estimation, we need to predict more than an extra 1 in 100 battles correctly before it starts winning. This feels intuitively wrong to me (but, indeed, maths/stats aren't always intuitive).


I'll do a new post about my experiments in the future. Don't worry, I'm not going to argue it's better than WHR - I'm quite sure it isn't - just looking if I can see any patterns in the data :)
+1 / -0

20 months ago
quote:
Don't worry, I'm not going to argue it's better than WHR - I'm quite sure it isn't

As far as I understand it, WHR as a system doesn't have much to say about predicting the outcome of teams games, much less teams games with an uneven number of players. WHR provides a rating number which establishes chances of winning a 1v1 against another player with their own rating number. Our current generalisation to teams games is "average ratings and hope for the best". You may be able to improve on that generalisation without improving on WHR itself.
+3 / -0
The log scoring system is related to information theory. There are some links to follow in this paragraph. https://en.wikipedia.org/wiki/Scoring_rule#Logarithmic_score

One way to look at it is that the assigned probability is a statement about the compressibility of a series of games. Consider a long string of wins and losses generated by the two teams playing each other a large number of times. The predicted win value then corresponds to a claim about how to compress this string.

A predicted win chance of 50% amounts to saying that the best way to compress the string is just to write out a series of 1s and 0s, to represent the outcome of each game with a single bit. In other words, that the string is incompressible. Any prediction other than 50% is a claim about some knowledge that can be used to compress the string more than this. The knowledge being encoded is that of how much stronger one team is than the other team. For example (and this example is human-readable to drive intuition, not optimal), if team A has a 99% win chance then we expect them to have long win streaks. Using this knowledge we can be fairly sure (I haven't done the calculations though) that the following is a better way to compress the wins and losses than simply writing a 1 for a win and a 0 for a loss:
  • Write 00 if team B wins.
  • Write 01 to represent 1 win by team A.
  • Write 10 to represent a block of 10 wins by team A.
  • Write 11 to represent a block of 100 wins by team A.
Now it only costs us 12 bits to represent 123 wins in a row by team A, but the tradeoff is that it costs 246 bits to represent 123 wins by team B. This tradeoff means that if we predicted wrongly and the teams are actually evenly matched, then this proposed compression algorithm does worse than just writing 1 for a win and 0 for a loss. This is due to streaks of 10 wins being rare with a 50% win chance, so we'll mostly be using two bits to write down each win or loss.
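
A rough sketch of that bit accounting (greedy blocks of 100, 10 and 1, i.e. the illustrative scheme from the list above, not an optimal code):

```
def bits_for_team_a_streak(wins):
    # one 2-bit code per greedy block of 100, 10 or 1 consecutive team A wins
    blocks = wins // 100 + wins % 100 // 10 + wins % 10
    return 2 * blocks

bits_for_team_a_streak(123)  # 12 bits for 123 wins in a row by team A
123 * 2                      # 246 bits if team B wins those 123 games instead
```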

Optimal compressions do not look like what I wrote above. Luckily we don't even need to know what they are. The upshot of information theory is that the compressibility of a string of 1s and 0s with a 1 frequency of X is some linear function of log(X). Another bit of information theory is the idea of compressibility being equivalent to information content. The final piece of the puzzle is that this sense of information is the same sense used for Bayesian updating, which is a measure of how much can be learnt from a piece of information. The goal of a match prediction system is to know so much that it doesn't learn anything new from the outcomes of games, or in other words, to minimise how surprised the system is by game outcomes. Log scoring achieves this by scoring based on the upper bound on how much a prediction system can learn from the actual outcome of a game, in a formal sense.
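
Put as code, the per-game score is just 1 minus the surprisal (in bits) of the outcome that actually happened; a minimal sketch:

```
import math

def game_score(p_assigned_to_actual_winner):
    surprisal_bits = -math.log2(p_assigned_to_actual_winner)  # information content of the outcome
    return 1 - surprisal_bits                                 # same as 1 + log2(p)

game_score(0.5)  #  0.0  (outcome carries exactly one bit, nothing learnt)
game_score(0.9)  #  0.85 (expected outcome, little surprise)
game_score(0.1)  # -2.32 (very surprising outcome, heavy penalty)
```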

quote:
score = 0
# Predicted a win for the correct team 51/100, with a probability of 52%
score += 51 * (1 + math.log2(0.52))
# Predicted a win for the wrong team 49/100, with a probability of 52%
score += 49 * (1 + math.log2(0.48))
...
score: -6.163388023594507e-05

If I've thought through this correctly, then a negative score means that the predictions of 52% were worse than 50/50 guessing. This is because the system that just guesses always receives a score of 0. The predictions of 52% were overconfident. If trueskill is receiving a negative score, then it is worse than blind guessing on the data you are running it on.

quote:
>>> 65 * (1 + math.log2(0.51)) + 35 * (1 + math.log2(0.49))
0.8368727947070336
>>> 65 * (1 + math.log2(0.79)) + 35 * (1 + math.log2(0.21))
-0.9087605487041657
Aren't these equally good predictions? They are both wrong 14/100 times.

Predictors are not right or wrong about outcomes, they are right or wrong about the probabilities of outcomes. Predicting a 90% chance of victory and seeing defeat in isolation isn't "getting it wrong". Such a prediction expects defeat 10% of the time. It doesn't make sense to then take these individual events and aggregate the number of times the predictor "got it wrong" without taking into account the probabilities assigned. Both of the predictors in this example are wrong in the sense that one assigned a probability of 79% and the other 51% when the underlying probability is 65%. They aren't wrong in the sense of each having predicted "in the wrong direction" 14 times because this isn't an important metric.

The remainder of the discrepancy is explained by expected surprise not being linear with the 0 to 1 probability scale. The difference between predictions of 80% and 90% is greater than the difference between predictions of 45% and 55%, even though they both differ by 10 percentage points. Percentage point difference simply isn't useful. The prediction of 79% is being punished more than the prediction of 51% because it is further away from 65% on the log scale.
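
A small worked comparison, assuming the same 1 + log2 score: the extra cost of being wrong grows much faster near the extremes than near 50%.

```
import math

def loss_penalty(p):
    # score received on a game where the predicted favourite (probability p) loses
    return 1 + math.log2(1 - p)

loss_penalty(0.55) - loss_penalty(0.45)  # ~ -0.29: small extra cost between 45% and 55%
loss_penalty(0.90) - loss_penalty(0.80)  # ~ -1.00: a full extra bit between 80% and 90%
```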
+3 / -0
quote:
also I'm not sure if I'm supposed to get the mean either - but I feel like it should be the mean
Yes, it's best to use the mean. It's also not wrong to use the sum instead, or to use ln instead of 1+log_2; any affine transformation with positive slope works. But only if you use 1+log_2 do you have the nice property that always guessing 50% yields a score of 0. And only if you use the mean does always guessing 100% right yield a score of 1.
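
A minimal sketch of those two properties (the mean of 1 + log_2 over the probabilities assigned to the actual outcomes):

```
import math
from statistics import mean

def mean_log_score(probs_assigned_to_actual_outcomes):
    return mean(1 + math.log2(p) for p in probs_assigned_to_actual_outcomes)

mean_log_score([0.5] * 100)  # always guessing 50%         -> 0.0
mean_log_score([1.0] * 100)  # always 100% sure and right  -> 1.0
```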

Here's another way to see why guessing probabilities far away from 50% must be punished harder when they are wrong (using information theory, like AUrankAdminGoogleFrog): unexpected events contain a much higher amount of information. For example, if you win against SErankGodde, that's more interesting than if you win against an equal player. If you choose a more complicated password, it becomes exponentially more unexpected.
probability = exp(- information)
information = - log(probability)
The log scoring rule is a direct consequence of the entropy of information. If something very unexpected happens, this means that a lot of information in your prediction system might be wrong. If something 100% unexpected happens, this means that all information in your prediction system is wrong. This is the logical principle ex falso quodlibet: if you believe even one wrong statement, you can use it to prove logically that every statement is true and false at the same time. Hence this yields a score of minus infinity. So choose your religion carefully.

quote:
I'll do a new post about my experiments in the future. Don't worry, I'm not going to argue it's better than WHR - I'm quite sure it isn't - just looking if I can see any patterns in the data :)
Feel free to try. It would be a great achievement to find a better system than WHR. Like AUrankAdminAquanim, I'd rather expect small improvements from details about the WHR implementation. Finding a better system that is very different from the established systems like Glicko or WHR is very unlikely imo - but I'm not saying 0% because this might give me a very bad score ;).
+3 / -0
Thanks for the information theory explanations. It was interesting, but above my ability to argue over. However, I still hope it's worthwhile to look at things intuitively.

quote:
WHR as a system doesn't have much to say about predicting the outcome of teams games, much less teams games with an uneven number of players.

Unfortunately Trueskill 1 doesn't either, AIUI. Apparently Trueskill 2 can, but I haven't seen an open source implementation of it. Anyway, we're way off this discussion being relevant :)

quote:
If trueskill is receiving a negative score, then it is worse than blind guessing on the data you are running it on.

Worse at maximising the expression, but it's hard to imagine it's worse at determining the probability of the outcome of a random match. And yet, this scoring system is being interpreted to say that Trueskill is worse than saying the outcome of any random matchup is 50:50. This seems like a red flag to me, unless we believe all games are perfectly balanced at 50:50 - but there is strong evidence that this isn't true, since ELO/WHR/Trueskill predicts outcomes significantly better than random.

quote:
The prediction of 79% is being punished more than the prediction of 51% because it is further away from 65% on the log scale.

Agreed, but I think this causes you to favour worse algorithms (i.e. ones that would predict the correct outcome less often) because they hedge their bets a bit. It's possible that when balancing teams this bias doesn't matter, since you're comparing different combinations under the same algorithm and just want to push p towards 0.5. But when comparing algorithms it surely does matter, because an algorithm that is ostensibly better at pushing p towards 0.5 might score lower if it tends to estimate win chances too high rather than too low.

quote:
If something very unexpected happens, this means that a lot of information in your prediction system might be wrong.

I can see why it is desirable to punish very bad predictions in a nonlinear way, I just feel like the punishment for deviation from the actual value ought to be symmetric.

What about something like:

```
import math
from statistics import mean

# p is the predicted win probability for team1 in each game
predictions.append(p if winner == team1_id else 1 - p)
...
success_rate = [x > 0.5 for x in predictions].count(True) / len(predictions)
score = success_rate - 0.5 - mean([math.pow(success_rate - x, 2) for x in predictions])
```

With some real data...

score: mean of 1 + log2(p_winner)
adj score: as defined above

Fixed p of 0.5402115158636898 (actual win rate for team 1)
---------------------------------------------------------
success rate: 0.5402115158636898
score: 0.004670620126253189
adj score: 0.037237666464714145

Trueskill
---------
success rate: 0.5811515863689777
score: -0.019440827083669232
adj score: 0.038751475717060516

So Trueskill appears to be able to predict results better (0.58 vs 0.54), but gets a worse and negative "score". Do we really think that Trueskill is a worse balancing system than p = 0.5402115158636898 for any Team 1? OTOH the suggested fudge in "adj score" says Trueskill is marginally better.
+0 / -0

20 months ago
Another way to see the great importance of the magnitude of the prediction (rather than just being on the right side of 50%) is through repeated betting.

Suppose the real win rate of team 1 is 75%. You're lucky and someone stupid offers you bets with even odds. Using the Kelly criterion to decide how much to bet (to maximize the expected geometric growth rate of your capital, and almost surely beat any other strategy in the long term), you bet a ratio of 2p-1 of your current capital c each round, where p is your prediction of the win rate.

You bet for 100 rounds, of which team 1 wins 75.

On each win, you win (2p-1) c, so your new capital is 2 p c (wealth multiplier: 2p).

On each loss, you lose (2p-1) c, so your new capital is c - (2p-1) c = 2 (1-p) c (wealth multiplier: 2 (1-p)).

Assuming you start with $1, at the end, you have:

$ (2p)^75 * (2(1-p))^25
= $ 2^100 * p^75 * (1-p)^25
= $ 2^(100 + 75 * log2(p) + 25 * log2(1-p))

Depending on your prediction p of the win rate, your ending money is:

p=51% --> $ 2.66
p=60% --> $ 3,279.76
p=70% --> $ 259,050.20
p=75% --> $ 479,837.88
p=80% --> $ 229,349.86
p=90% --> $ 46.90
p=99% --> $ 0.00 (more precisely, 5.9 * 10^-21)
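
A quick sketch to reproduce these figures from the closed form above (function name is just for illustration):

```
def final_wealth(p, wins=75, losses=25):
    # Kelly bet of 2p-1 each round: capital multiplies by 2p on a win, 2(1-p) on a loss
    return (2 * p) ** wins * (2 * (1 - p)) ** losses

final_wealth(0.51)  # ~2.66
final_wealth(0.75)  # ~479838, the true win rate maximises long-run growth
final_wealth(0.99)  # ~5.9e-21, near-total overconfidence goes (almost) broke
```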

(Incidentally, this is a good example of why the quantity you want to maximize in repeated bets is the expected geometric growth rate, or equivalently the expectation of the logarithm of the wealth multiplier, not the expected profit. The p=99% bet (betting 2p-1 = 98% of your money each round) actually has the highest expected gain each round! Betting your entire money each round would have an even higher expected gain, but you'd go broke on a single win by team 2.)

(Sorry for any calculation errors, but I think the general principle holds)
+2 / -0
The relation to the Kelly criterion is very interesting!

quote:
(adj) score = success_rate - 0.5 - mean([math.pow(success_rate - x, 2) for x in predictions])
Measuring the deviation between success rate and predicted probability doesn't make sense because you are not looking at a series of equal games. The (actual and predicted) probabilities can vary between different games and they are not very directly related to the total success rate. A system that gets this variation right is punished by this term in adj score. For example, Trueskill only has a higher adj score than the constant p=0.5402115158636898 because Trueskill has a higher success rate. The last adj score term actually punishes Trueskill arbitrarily. About the success rate, I want to quote AUrankAdminGoogleFrog:
quote:
It doesn't make sense to then take these individual events and aggregate the number of times the predictor "got it wrong" without taking into account the probabilities assigned.

All you really have is a vector of predicted probabilities and a vector of actual results, and somehow you have to calculate their deviation. If you don't believe in the logarithmic relation between information and probability, there are other valid scoring rules. But I think it is worth thinking about the presented arguments for nonlinearity.
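
For example, the quadratic (Brier) score is the classic proper scoring rule with a penalty that is symmetric on the probability scale; a minimal sketch for comparison (this is not what the scoring in this thread uses):

```
from statistics import mean

def brier_score(predictions, outcomes):
    # predictions: win probability assigned to team 1; outcomes: 1 if team 1 won, else 0
    return mean((p - o) ** 2 for p, o in zip(predictions, outcomes))  # lower is better

brier_score([0.5, 0.9, 0.9], [1, 1, 0])  # ~0.36
```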
+0 / -0
Thanks everyone for taking time to explain this to me!

DErankdunno I appreciate that betting example, I can follow that :)

quote:
DErankBrackman The last adj score term actually punishes Trueskill arbitrarily.


Yes, you're right. I'm not close to understanding anything about information theory, but I've heard enough to see that my arguments are false and I'm properly out of my depth :D

One additional mistake I made when comparing to a fixed probability was not randomizing the team order. This appears to be necessary because of the team 1 win bias - does anyone know why this exists? Does/did the balancer always assign the higher probability to team 1? Anyway, randomizing the order stops a fixed probability from getting such good results.

What I hadn't really understood is that Trueskill works with individual ratings. In fact the function I took (from https://github.com/sublee/trueskill/issues/1) isn't even in the library proper. It stands to reason that it knows nothing about how individual ratings combine into a team win probability in Zero-K, so this is clearly a place for adjustment (as noted by AUrankAdminAquanim). Halving the difference from p=0.5 seems to yield a substantial increase in the score, so I guess this is closer to the true team win probability.
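
For concreteness, halving the difference from p=0.5 amounts to this (a sketch; p is the raw win probability for team 1):

```
def halved(p):
    # pull a raw win probability halfway back towards 50%; identical to (p + 0.5) / 2
    return 0.5 + (p - 0.5) / 2

halved(0.9)  # 0.7
halved(0.3)  # 0.4
```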
+0 / -0
20 months ago
quote:
Halving the difference from p=0.5 seems to yield a substantial increase in the score, so I guess this is closer to the true team win probability.
It depends on which formulae you actually used. The Trueskill formulae from your link use the rating sums of the teams instead of averages, which inserts a factor of N, where N is the number of players. Then they divide by the combined standard deviations, which multiplies by 1/sqrt(N). Effectively, this multiplies rating differences by sqrt(N), which is equivalent to a D mod of D = sqrt(N). This moves predictions further away from 50% for higher player numbers. In principle, it is fine to do this. But it can also make the predictions too distinct, which may be the reason for what you observed. If it is, you can compensate for the effect by applying an inverse D mod of 1/sqrt(N).
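
In code, applying a D mod to an already-computed prediction could look like the sketch below (assuming the prediction came from a normal CDF, as in the Trueskill formula; the function name is illustrative):

```
from statistics import NormalDist

def apply_d_mod(p, d):
    # scale the underlying (normal) rating difference by d, then map back to a win probability
    nd = NormalDist()
    return nd.cdf(d * nd.inv_cdf(p))

apply_d_mod(0.75, 1.0)           # unchanged: 0.75
apply_d_mod(0.75, 1 / 8 ** 0.5)  # inverse D mod of 1/sqrt(N) for an N = 8 player game: ~0.59
```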
+0 / -0
Finally rewrote things so I can quickly try different fudges to the win probability function from here: https://trueskill.org/#win-probability.

Anyway, I figured I'd concentrate on the way the team skill delta (delta_mu) is calculated, as that felt the most tangible to me and is also, logically, something that will vary between different kinds of games (chess, football, Zero-K etc). After trying various fudges, I found that a simple mean of the individual player ratings got a far better score than the sum. Even a small factor (0.9x or 1.1x) away from the mean reduced the score.

Example, 2v2-4v4 games, ranking from all games (edit: oops, I meant ranking from 2v2-4v4 games):

win_probability function | score
base | -0.0181
my fudge (`(p + 0.5)/2`) | 0.0297
DErankBrackman's suggestion from https://zero-k.info/Forum/Post/250763#250763 | -0.0862 (1)
delta mu from mean | 0.0315
0.9x delta mu from mean | 0.0312
1.1x delta mu from mean | 0.0313

(1) should be *4? that gives 0.0222

The delta mu from mean peak is so tightly centered around the exact mean that it can't be a coincidence. It seems likely to be a property either of the win_probability function or the scoring function? This seems to work very well for 1v1 games too, increasing the score from about 0.1 to 0.14. I haven't checked anything else yet.

None of these changes affect the success rate of the function.

For clarity, my delta-mu-from-mean function is below:

```
# ts is assumed to be a trueskill.TrueSkill() environment;
# needs: from statistics import mean; import itertools, math
def win_probability(team1, team2):
    delta_mu = mean(r.mu for r in team1) - mean(r.mu for r in team2)
    sum_sigma = sum(r.sigma ** 2 for r in itertools.chain(team1, team2))
    size = len(team1) + len(team2)
    denom = math.sqrt(size * (ts.beta ** 2) + sum_sigma)
    return ts.cdf(delta_mu / denom)
```
+2 / -0
Very good!

quote:
The delta mu from mean peak is so tightly centered around the exact mean that it can't be a coincidence.
I know what delta_mu is but what is the mean peak and what is the exact mean?

quote:
None of these changes affect the success rate of the function.
This is obvious from the underlying math. Multiplying delta_mu by a factor D is equivalent to dividing denom by D, which has the same effect as applying a D mod of D. All this does is move the predictions further away from 50% for D > 1 and closer to 50% for D < 1. It is just the correct way of doing it, in contrast to the 0.5 fudge, which can never produce chances below 25% or above 75%. By fitting the D mod to maximize the log score, it should be possible to eliminate data poisoning.

Here I define "base" as using D = 1. [Spoiler]Calculating delta_mu from mean instead of sum uses D == 2/size which is indeed expected to perform better. [Spoiler]My suggestion uses D = 2/sqrt(size) which is a compromise of the two for size >= 4. [Spoiler]How can the compromise be worse than each of the extremes? Did you apply the 0.5 fudge on it but not on the others?

quote:
Example, 2v2-4v4 games, ranking from all games:
From the number 0.0297, I guess that this was only with ranking data from the class of 2v2-4v4 battles. If it was from all games, you would get 0.0408, right?
+0 / -0

20 months ago
quote:
I know what delta_mu is but what is the mean peak and what is the exact mean?


I meant the mean, 0.9x mean, 1.1x mean results (I tested more disparate values too). I was surprised that the peak score was dead on the mean.

Doesn't it surprise you that there's no absolute effect?

quote:
How can the compromise be worse than each of the extremes? Did you apply the 0.5 fudge on it but not on the others?


Thanks for all the explanation about D mod. I see why it's superior to the average with 0.5, and I will try it. I think what you described in the linked post wasn't D mod though, it was just trying to encode something like my fudge into the win_probability function?

That said, I feel like performing a function on the team mu values is important, since it's more likely to be able to consider team composition, e.g. uneven teams (which are currently excluded from my analysis).

quote:
From the number 0.0297, I guess that this was only with ranking data from the class of 2v2-4v4 battles. If it was from all games, you would get 0.0408, right?


Ooops, you are correct!
+0 / -0
20 months ago
quote:
I think what you described in the linked post wasn't D mod though, it was just trying to encode something like my fudge into the win_probability function?
If you mean this post, it describes a mechanism by which considering a different set of games than what was originally considered by the balancer could potentially introduce a "poisoning" due to an imperfect D mod, and how that can be eliminated by improving the D mod.

You already tried different D mods, because it's effectively the same as multiplying delta_mu by D. So that's fine. There's no need to apply D mods on top of a delta_mu multiplication. That D = 1 (rather than 0.9 or 1.1) maximizes the score can be used as an argument that potential poisoning is eliminated and that doubts about using different subsets of the data are unfounded.

I also knew from your code that you exclude uneven teams, which is fine. Otherwise I would have complained about using mu sums instead of means, and I would expect an even worse score for an uneven-teams calculation with sums instead of means.

The only thing that doesn't make sense to me yet is that the compromise is worse than the two extremes. There might be something strange in the code.
+0 / -0
20 months ago
Just for everyone else, regarding Whole History Rating (WHR) & Zero-K by CHrankAdminDeinFreund:

PDF slides: https://drive.google.com/file/d/1sDqfKhThaQqDbFQ5Xb6rit-ZOZo303TI/view
+1 / -0