
Evaluating rating systems

The log scoring system is related to information theory. There are some links to follow in this paragraph: https://en.wikipedia.org/wiki/Scoring_rule#Logarithmic_score

One way to look at it is that the assigned probability is a statement about the compressibility of a series of games. Consider a long string of wins and losses generated by the two teams playing each other a large number of times. The predicted win value then corresponds to a claim about how to compress this string.

A predicted win chance of 50% amounts to saying that the best way to compress the string is just to write out a series of 1s and 0s, representing the outcome of each game with a single bit. In other words, that the string is incompressible. Any prediction other than 50% is a claim about some knowledge that can be used to compress the string more than this. The knowledge being encoded is how much stronger one team is than the other. For example (and this example is human-readable to drive intuition, not optimal), if team A has a 99% win chance then we expect them to have long win streaks. Using this knowledge we can be fairly sure (I haven't done the calculations though) that the following is a better way to compress the wins and losses than simply writing a 1 for a win and a 0 for a loss:
* Write 00 if team B wins.
* Write 01 to represent 1 win by team A.
* Write 10 to represent a block of 10 wins by team A.
* Write 11 to represent a block of 100 wins by team A.
Now it only costs us 12 bits to represent 123 wins in a row by team A, but the tradeoff is that it costs 246 bits to represent 123 wins by team B. This tradeoff means that if we predicted wrongly and the teams are actually evenly matched, then this proposed compression algorithm does [i]worse[/i] than just writing 1 for a win and 0 for a loss. This is due to streaks of 10 wins being rare with a 50% win chance, so we'll mostly be using two bits to write down each win or loss.
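Here is a minimal sketch of that bit counting in Python. The greedy split into blocks of 100, then 10, then single wins is my own assumption for illustration, and is not claimed to be optimal either.

# Cost in bits of a streak of n consecutive wins by team A under the scheme above:
# greedily spend blocks of 100, then 10, then single wins, at 2 bits per codeword.
def team_a_streak_bits(n):
    hundreds, rest = divmod(n, 100)
    tens, ones = divmod(rest, 10)
    return 2 * (hundreds + tens + ones)

print(team_a_streak_bits(123))  # 12 bits for 123 wins in a row by team A
print(123 * 2)                  # 246 bits for 123 wins by team B, at one 2-bit codeword each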

Optimal compressions do not look like what I wrote above. Luckily we don't even need to know what they are. The upshot of information theory is that the compressibility of a string of 1s and 0s with a 1 frequency of X is some linear function of log(X). Another bit of information theory is the idea of compressibility being equivalent to information content. The final piece of the puzzle is that this sense of information is the same sense used for Bayesian updating, which is a measure of how much can be learnt from a piece of information. The goal of a match prediction system is to know so much that it learns nothing new from the outcomes of games, or in other words, to minimise how surprised it is by game outcomes. Log scoring achieves this by scoring a prediction based on, in a formal sense, an upper bound on how much the prediction system could learn from the actual outcome of the game.
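As a sketch, this is the per-game scoring rule used in the snippets below, where p is the probability the predictor assigned to the outcome that actually happened. The +1 offset matches the quoted code and just shifts the scale so that a flat 50% prediction scores exactly 0.

import math

# Per-game log score: log2(p) is the (negative) surprisal in bits for the outcome
# that actually happened, and the +1 makes a 50% prediction score exactly 0.
def game_score(p):
    return 1 + math.log2(p)

print(game_score(0.5))   # 0.0    -- a 50/50 guess neither gains nor loses
print(game_score(0.99))  # ~0.99  -- a confident, correct prediction scores well
print(game_score(0.01))  # ~-5.64 -- a confident, wrong prediction is punished hard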

[q]score = 0
# Predicted a win for the correct team 51/100, with a probability of 52%
score += 51 * (1 + math.log2(0.52))
# Predicted a win for the wrong team 49/100, with a probability of 52%
score += 49 * (1 + math.log2(0.48))
...
score: -6.163388023594507e-05[/q]
If I've thought through this correctly, then a negative score means that the predictions of 52% were worse than 50/50 guessing. This is because a system that always just guesses 50% receives a score of exactly 0. The predictions of 52% were overconfident. If trueskill is receiving a negative score, then it is worse than blind guessing on the data you are running it on.
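A quick sketch of why 52% counts as overconfident against a 51/49 outcome: the total score is maximised by predicting the observed frequency itself, and predictions more confident than the data supports drop below zero. The helper below is mine, written to mirror the quoted calculation.

import math

# Total score over 100 games with 51 wins and 49 losses, for a predictor that
# always assigns probability p to the side that ended up winning 51 times.
def total_score(p, wins=51, losses=49):
    return wins * (1 + math.log2(p)) + losses * (1 + math.log2(1 - p))

print(total_score(0.51))  # ~+0.029   -- predicting the observed frequency scores best
print(total_score(0.52))  # ~-0.00006 -- slightly overconfident, slightly below guessing
print(total_score(0.60))  # ~-2.36    -- badly overconfident, clearly worse than guessing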

[q]>>> 65 * (1 + math.log2(0.51)) + 35 * (1 + math.log2(0.49))
0.8368727947070336
>>> 65 * (1 + math.log2(0.79)) + 35 * (1 + math.log2(0.21))
-0.9087605487041657
Aren't these equally good predictions? They are both wrong 14/100 times.[/q]
Predictors are not right or wrong about outcomes, they are right or wrong about the probabilities of outcomes. Predicting a 90% chance of victory and then seeing a defeat isn't, in isolation, "getting it wrong". Such a prediction expects defeat 10% of the time. It doesn't make sense to take these individual events and aggregate the number of times the predictor "got it wrong" without taking into account the probabilities assigned. Both of the predictors in this example are wrong in the sense that one assigned a probability of 79% and the other 51% when the underlying probability is 65%. They aren't wrong in the sense of each having predicted "in the wrong direction" 14 times, because that isn't an important metric.

The remainder of the discrepancy is explained by expected surprise not being linear in the 0 to 1 probability scale. The difference between predictions of 80% and 90% is greater than the difference between predictions of 45% and 55%, even though they both differ by 10 percentage points. Percentage point difference simply isn't a useful measure. The prediction of 79% is being punished more than the prediction of 51% because it is further away from 65% on the log scale.
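One way to see this asymmetry, as a sketch: compute the [i]expected[/i] per-game score of each predictor against the underlying 65% win probability. The 51% prediction is underconfident but still beats guessing, while the 79% prediction is overconfident enough to fall below the 50/50 baseline, even though both miss the true probability by 14 percentage points.

import math

# Expected per-game score of a predictor that always assigns probability p to team A,
# when team A's true win probability is q.
def expected_score(p, q):
    return q * (1 + math.log2(p)) + (1 - q) * (1 + math.log2(1 - p))

true_p = 0.65
print(expected_score(0.65, true_p))  # ~+0.066  -- the best any predictor can do here
print(expected_score(0.51, true_p))  # ~+0.0084 -- underconfident, but still above guessing
print(expected_score(0.79, true_p))  # ~-0.0091 -- overconfident enough to fall below guessing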