Zero-K

Post edit history

Evaluating rating systems


	Date	Editor	Before	After
	7/21/2022 6:43:04 AM	Brackman	before revert	after revert

	Before		After
1	Very good!	1	Very good!
2	\n	2	\n
3	[q]The delta mu from mean peak is so tightly centered around the exact mean that it can't be a coincidence.[/q]I know what delta_mu is but what is the mean peak and what is the exact mean?	3	[q]The delta mu from mean peak is so tightly centered around the exact mean that it can't be a coincidence.[/q]I know what delta_mu is but what is the mean peak and what is the exact mean?
4	\n	4	\n
5	[q]None of these changes affect the success rate of the function. [/q]This is obvious from the underlying math. Multiplying delta_mu by a factor D is equivalent to dividing denom by D, which has the same effect as applying a D mod of D. D mod means modifying the probability p to (p^D)/(p^D+(1-p)^D). All this does is bring the predictions further away from 50% for D > 1 and closer to 50% for D < 1. It is just the correct way of doing it, in contrast to the 0.5 fudge, which can never produce chances <= 25% or >= 75%. [url=https://zero-k.info/Forum/Post/250860#250860]By fitting the D mod to maximize the log score, it should be possible to eliminate data poisoning[/url].	5	[q]None of these changes affect the success rate of the function. [/q]This is obvious from the underlying math. Multiplying delta_mu by a factor D is equivalent to dividing denom by D, which has the same effect as applying a D mod of D. [spoiler]D mod means modifying the probability p to (p^D)/(p^D+(1-p)^D). Same effect means exact equivalence for the cases of elo and WHR and at least similar for TrueSkill. But for TrueSkill I have not proven the equivalence. [/spoiler] All this does is bring the predictions further away from 50% for D > 1 and closer to 50% for D < 1. It is just the correct way of doing it, in contrast to the 0.5 fudge, which can never produce chances <= 25% or >= 75%. [url=https://zero-k.info/Forum/Post/250860#250860]By fitting the D mod to maximize the log score, it should be possible to eliminate data poisoning[/url].
6	\n	6	\n
7	Here I define "base" as using D = 1. [spoiler]By calculating team rating sums instead of averages and then having a denom proportional to sqrt(size), it effectively assumes that big team game outcomes should be more distinct proportional to sqrt(size) which is probably wrong. I'm alternating between == and = to avoid forum format breaking.[/spoiler]Calculating delta_mu from mean instead of sum uses D == 2/size which is indeed expected to perform better. [spoiler]By still having a denom proportional to sqrt(size), it effectively assumes that big team games become less distinct proportional to sqrt(size).[/spoiler]My suggestion uses D = 2/sqrt(size) which is a compromise of the two for size >= 4. [spoiler]It effectively uses team means and removes the sqrt(size) proportionality in denom and thereby assumes that big team outcomes do not become more or less distinct with size which is what traditional ZK elo does.[/spoiler]How can the compromise be worse than each of the extremes? Did you apply the 0.5 fudge on it but not on the others?	7	Here I define "base" as using D = 1. [spoiler]By calculating team rating sums instead of averages and then having a denom proportional to sqrt(size), it effectively assumes that big team game outcomes should be more distinct proportional to sqrt(size) which is probably wrong. I'm alternating between == and = to avoid forum format breaking.[/spoiler]Calculating delta_mu from mean instead of sum uses D == 2/size which is indeed expected to perform better. [spoiler]By still having a denom proportional to sqrt(size), it effectively assumes that big team games become less distinct proportional to sqrt(size).[/spoiler]My suggestion uses D = 2/sqrt(size) which is a compromise of the two for size >= 4. [spoiler]It effectively uses team means and removes the sqrt(size) proportionality in denom and thereby assumes that big team outcomes do not become more or less distinct with size which is what traditional ZK elo does.[/spoiler]How can the compromise be worse than each of the extremes? Did you apply the 0.5 fudge on it but not on the others?
8	\n	8	\n
9	[q]Example, 2v2-4v4 games, ranking from all games:[/q]From the number 0.0297, I guess that this was only with ranking data from the class of 2v2-4v4 battles. If it was from all games, you would get 0.0408, right?	9	[q]Example, 2v2-4v4 games, ranking from all games:[/q]From the number 0.0297, I guess that this was only with ranking data from the class of 2v2-4v4 battles. If it was from all games, you would get 0.0408, right?
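
The D mod described in row 5 and the three D choices from row 7 can be sketched in Python. This is only an illustration under stated assumptions: the prediction is taken to be a logistic Elo curve with a 400-point scale (the post does not state the scale), `size` is treated as the player count the post refers to, and the names `d_mod`, `elo_win_chance` and `d_factor` are hypothetical and not taken from the Zero-K codebase. For a logistic curve, multiplying delta_mu by D should give the same prediction as applying the D mod to the unscaled prediction.

```python
import math


def d_mod(p: float, d: float) -> float:
    """Apply the D mod from the post: p -> p^D / (p^D + (1-p)^D)."""
    num = p ** d
    return num / (num + (1.0 - p) ** d)


def elo_win_chance(delta_mu: float, scale: float = 400.0) -> float:
    """Logistic Elo prediction. The 400-point scale is an assumption."""
    return 1.0 / (1.0 + 10.0 ** (-delta_mu / scale))


def d_factor(scheme: str, size: int) -> float:
    """D values named in the post; the scheme labels are hypothetical."""
    if scheme == "base":        # team rating sums, D = 1
        return 1.0
    if scheme == "mean":        # delta_mu from team means, D = 2/size
        return 2.0 / size
    if scheme == "compromise":  # suggested D = 2/sqrt(size)
        return 2.0 / math.sqrt(size)
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    delta_mu, d = 150.0, 2.0
    scaled = elo_win_chance(d * delta_mu)       # multiply delta_mu by D
    modded = d_mod(elo_win_chance(delta_mu), d)  # apply the D mod instead
    print(f"scaled delta_mu: {scaled:.6f}, D mod: {modded:.6f}")

    # D > 1 pushes a prediction away from 50%, D < 1 pulls it toward 50%.
    for d in (0.5, 1.0, 2.0):
        print(f"D = {d}: 0.70 -> {d_mod(0.70, d):.4f}")

    # For size >= 4 the compromise lies between the mean and base values.
    for size in (4, 8, 16):
        row = [d_factor(s, size) for s in ("mean", "compromise", "base")]
        print(size, [round(v, 3) for v in row])
```

Unlike a fixed blend toward 50%, `d_mod` can return values arbitrarily close to 0 or 1 when D is large, which is the contrast with the 0.5 fudge drawn in row 5.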
