Evaluating rating systems

Thanks for the number theory explanations. It was interesting, but beyond my ability to argue over. However, I still hope it's worthwhile to look at things intuitively.

[quote]
WHR as a system doesn't have much to say about predicting the outcome of teams games, much less teams games with an uneven number of players.
[/quote]

Unfortunately Trueskill 1 doesn't either, AIUI. Apparently Trueskill 2 can, but I haven't seen an open source implementation of it. Anyway, we're way off this discussion being relevant :)

[quote]
If trueskill is receiving a negative score, then it is worse than blind guessing on the data you are running it on.
[/quote]

Worse at maximising the expression, yes, but it's hard to imagine it's worse at determining the probability of the outcome of a _random_ match. And yet this score is being interpreted as saying that Trueskill is worse than calling the outcome of any random matchup 50:50. That seems like a red flag to me, unless we believe all games are perfectly balanced at 50:50 - and there's strong evidence that isn't true, since ELO/WHR/Trueskill all predict outcomes significantly better than random.
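
To make the zero point concrete - a minimal sketch, using the score = 1 + log2(p_winner) definition from the results further down:

```
import math

# A blind guesser assigns 0.5 to every outcome, so every game contributes
# exactly 1 + log2(0.5) = 0: the zero point a negative score falls below.
print(1 + math.log2(0.5))  # 0.0
```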

[quote]
The prediction of 79% is being punished more than the prediction of 51% because it is further away from 65% on the log scale.
[/quote]

Agreed, but I think this causes you to favor worse algorithms (i.e. ones that would predict the correct outcome less often) because they hedge their bets a bit. It's possible that this bias doesn't matter when balancing teams, since there you're comparing different combinations under the same algorithm and just want to get as close to p = 0.5 as possible. But when comparing _algorithms_ it surely does matter, because an algorithm that is ostensibly better at finding p = 0.5 balances might score lower simply because it tends to overestimate win chances rather than underestimate them.
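
To put numbers on that - my own illustration, reusing the 79%/51%/65% figures from the quote above. Both predictions miss the true 65% by exactly 14 points, yet the overconfident one fares worse in expectation:

```
import math

def expected_log_score(p_true, p_pred):
    # Expected 1 + log2(probability assigned to the winner),
    # averaged over the true win/loss odds.
    return (p_true * (1 + math.log2(p_pred))
            + (1 - p_true) * (1 + math.log2(1 - p_pred)))

for p_pred in (0.79, 0.51, 0.65):
    print(p_pred, round(expected_log_score(0.65, p_pred), 4))
# 0.79 -> -0.0091 (overestimate by 0.14: pushed negative)
# 0.51 -> 0.0084  (underestimate by 0.14: still positive)
# 0.65 -> 0.0659  (the true probability maximises the expected score)
```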

[quote]
If something very unexpected happens, this means that a lot of information in your prediction system might be wrong.
[/quote]

I can see why it is desirable to punish very bad predictions in a nonlinear way, I just feel like the punishment for deviation from the actual value ought to be symmetric.

What about something like:

```
import math
from statistics import mean

predictions.append(p if winner == team1_id else 1 - p)  # probability assigned to the actual winner
...
success_rate = [x > 0.5 for x in predictions].count(True) / len(predictions)
# symmetric quadratic penalty for straying from the observed success rate
score = success_rate - 0.5 - mean([math.pow(success_rate - x, 2) for x in predictions])
```
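
The quadratic term is what buys the symmetry I'm after. With the 79%/51%/65% numbers again (my illustration), both misses are charged identically:

```
# Both predictions miss the 0.65 reference point by 0.14, so the
# quadratic penalty treats them the same, unlike the log score.
for x in (0.79, 0.51):
    print(x, round((0.65 - x) ** 2, 4))  # 0.0196 in both cases
```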

With some real data...

score: 1 + log2(p_winner)
adj score: as above

Fixed p of 0.5402115158636898 (actual win rate for team 1)
---------------------------------------------------------
success rate: 0.5402115158636898
score: 0.004670620126253189
adj score: 0.037237666464714145

Trueskill
---------
success rate: 0.5811515863689777
score: -0.019440827083669232
adj score: 0.038751475717060516

So Trueskill appears to predict results better (0.58 vs 0.54), yet gets a worse, and negative, "score". Do we really think Trueskill is a worse balancing system than a fixed p = 0.5402115158636898 for any Team 1? OTOH, the suggested fudge in "adj score" says Trueskill is marginally better.