Evaluating rating systems

Thanks for the number theory explanations. It was interesting, but beyond my ability to argue over. However, I still hope it's worthwhile to look at things intuitively.

[quote]
WHR as a system doesn't have much to say about predicting the outcome of teams games, much less teams games with an uneven number of players.
[/quote]

Unfortunately Trueskill 1 doesn't either, AIUI. Apparently Trueskill 2 can, but I haven't seen an open source implementation of it. Anyway, we're way off this discussion being relevant :)

[quote]
If trueskill is receiving a negative score, then it is worse than blind guessing on the data you are running it on.
[/quote]

Worse at maximising the expression, yes, but it's hard to imagine it's worse at determining the probability of the outcome of a _random_ match. And yet this score is being interpreted as saying that Trueskill is worse than calling the outcome of any random matchup 50:50. That seems like a red flag to me, unless we believe all games are perfectly balanced at 50:50 - and there's strong evidence that isn't true, since ELO/WHR/Trueskill all predict outcomes significantly better than random.
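
To make the zero point concrete - a minimal sketch, using the score = 1 + log2(p_winner) definition from the results further down:

```
import math

# A blind guesser assigns 0.5 to every outcome, so every game contributes
# exactly 1 + log2(0.5) = 0: the zero point a negative score falls below.
print(1 + math.log2(0.5))  # 0.0
```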

[quote]
The prediction of 79% is being punished more than the prediction of 51% because it is further away from 65% on the log scale.
[/quote]

Agreed, but I think this causes you to favor worse algorithms (i.e. ones that would predict the correct outcome less often) because they hedge their bets a bit. It's possible that this bias doesn't matter when balancing teams, since there you're comparing different combinations under the same algorithm and just want to get as close to p = 0.5 as possible. But when comparing _algorithms_ it surely does matter, because an algorithm that is ostensibly better at finding p = 0.5 balances might score lower simply because it tends to overestimate win chances rather than underestimate them.
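
To put numbers on that - my own illustration, reusing the 79%/51%/65% figures from the quote above. Both predictions miss the true 65% by exactly 14 points, yet the overconfident one fares worse in expectation:

```
import math

def expected_log_score(p_true, p_pred):
    # Expected 1 + log2(probability assigned to the winner),
    # averaged over the true win/loss odds.
    return (p_true * (1 + math.log2(p_pred))
            + (1 - p_true) * (1 + math.log2(1 - p_pred)))

for p_pred in (0.79, 0.51, 0.65):
    print(p_pred, round(expected_log_score(0.65, p_pred), 4))
# 0.79 -> -0.0091 (overestimate by 0.14: pushed negative)
# 0.51 -> 0.0084  (underestimate by 0.14: still positive)
# 0.65 -> 0.0659  (the true probability maximises the expected score)
```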

[quote]
If something very unexpected happens, this means that a lot of information in your prediction system might be wrong.
[/quote]

I can see why it is desirable to punish very bad predictions in a nonlinear way, I just feel like the punishment for deviation from the actual value ought to be symmetric.

What about something like:

```
import math
from statistics import mean

predictions.append(p if winner == team1_id else 1 - p)  # probability assigned to the actual winner
...
success_rate = [x > 0.5 for x in predictions].count(True) / len(predictions)
# symmetric quadratic penalty for straying from the observed success rate
score = success_rate - 0.5 - mean([math.pow(success_rate - x, 2) for x in predictions])
```
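
The quadratic term is what buys the symmetry I'm after. With the 79%/51%/65% numbers again (my illustration), both misses are charged identically:

```
# Both predictions miss the 0.65 reference point by 0.14, so the
# quadratic penalty treats them the same, unlike the log score.
for x in (0.79, 0.51):
    print(x, round((0.65 - x) ** 2, 4))  # 0.0196 in both cases
```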

With some real data...

score: 1 + log2(p_winner)
adj score: as above

Fixed p of 0.5402115158636898 (actual win rate for team 1)
---------------------------------------------------------
success rate: 0.5402115158636898
score: 0.004670620126253189
adj score: 0.037237666464714145

Trueskill
---------
success rate: 0.5811515863689777
score: -0.019440827083669232
adj score: 0.038751475717060516

So Trueskill appears to predict results better (0.58 vs 0.54), yet gets a worse, and negative, "score". Do we really think Trueskill is a worse balancing system than a fixed p = 0.5402115158636898 for any Team 1? OTOH, the suggested fudge in "adj score" says Trueskill is marginally better.