Regarding the discussion on http://zero-k.info/Forum/Thread/35303 I played around with some real Zero-K data I got from the back of a bus.

Notes:
- Used the scoring function from http://zero-k.info/Forum/Post/163322#163322: `mean([1 + log2(pwinner)])`
- I'm not sure how to select "real" ranking games from the tables that I have accessible (battles, battle_player). I've done some obvious things to filter out games that seemed inappropriate.
- Because WHR is a bit scary, I used Python trueskill to generate rankings.
- I've only considered games with equal numbers of players in each team.

Code: https://colab.research.google.com/drive/1KziZMjYJUR-oeK2ZL9RdZclT8I3-KfrB?usp=sharing

Results: comparing the prediction of particular classes of games when using all available battles for generating the rankings, versus only the battles of that class.

1v1 battles:

Ranking data | Ranking battles | Rated battles | Correct % | Score
All          | 146446          | 95899         | 68.2%     | 0.1060
Class        | 114660          | 95899         | 67.8%     | 0.1024
2v2-4v4 battles:

Ranking data | Ranking battles | Rated battles | Correct % | Score
All          | 146446          | 21275         | 59.8%     | 0.0408
Class        | 21637           | 21275         | 58.1%     | 0.0297
5v5+ battles:

Ranking data | Ranking battles | Rated battles | Correct % | Score
All          | 146446          | 10149         | 52.7%     | -0.0046
Class        | 10149           | 10149         | 52.8%     | -0.0072
Conclusion:
- For 1v1 battles there's a little advantage to using the whole dataset.
- For 2v2-4v4 battles there's some advantage to using the whole dataset.
- For 5v5+ battles there doesn't seem to be a clear winner either way; there's no evidence that it's worse to use the whole dataset.

P.S.: I've heard this has all been done before with similar conclusions when investigating WHR :)
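For reference, the scoring function from the linked post can be sketched in a few lines (a minimal sketch; `predictions` is assumed to hold, for each battle, the probability the model assigned to the team that actually won):

```python
import math

def score(predictions):
    """Mean of 1 + log2(p) over all battles, where p is the probability
    assigned to the actual winner. A coin-flip model (p = 0.5 always)
    scores 0; a perfect model (p = 1.0 always) scores 1; a model that
    is confidently wrong goes sharply negative."""
    return sum(1 + math.log2(p) for p in predictions) / len(predictions)
```

So the small positive scores in the tables above mean the rankings are only slightly better than a coin flip on average, which is expected when the balancer is actively pushing games toward 50:50.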
+2 / -0
|
That's interesting, and a good confirmation that combined ratings are more accurate! Given how much weaker the predictions get as team size grows, I'm wondering if you could improve the score for big teams by replacing

`denom = math.sqrt(size * (ts.beta * ts.beta) + sum_sigma)`

with

`denom = math.sqrt(size/4 * (size * (ts.beta * ts.beta) + sum_sigma))`
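For context, the `denom` line comes from the usual TrueSkill-style win-probability calculation. Here's a standard-library sketch of it with the proposed rescaling as an option; the function and flag names are mine, and `BETA` is assumed to be the trueskill package's default of 25/6:

```python
import math

BETA = 25.0 / 6.0  # default trueskill beta (mu0 = 25, beta = mu0 / 6)

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_probability(team1, team2, scale_by_size=False):
    """team1/team2: lists of (mu, sigma) ratings.
    scale_by_size applies the proposed size/4 factor: it leaves 2v2
    (size = 4) untouched, sharpens 1v1 predictions, and pulls
    predictions for bigger teams toward 0.5."""
    delta_mu = sum(mu for mu, _ in team1) - sum(mu for mu, _ in team2)
    sum_sigma = sum(s * s for _, s in team1 + team2)
    size = len(team1) + len(team2)
    denom = math.sqrt(size * (BETA * BETA) + sum_sigma)
    if scale_by_size:
        denom = math.sqrt(size / 4 * (size * (BETA * BETA) + sum_sigma))
    return normal_cdf(delta_mu / denom)
```

Note this only rescales confidence, not which team is predicted to win, so it can improve the log-score without changing the "Correct %" column at all.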
+0 / -0
|
Did you use only the information that was known before the evaluated battle?
+1 / -0
|
quote: Silent_AI Did you use only the information that was known before the evaluated battle? |
Yes, that's the intent, anyway. You'd expect e.g. 1v1 to predict with high accuracy, since the matchmaker doesn't attempt to balance very close to 50:50, for obvious reasons. I'd be interested to see what the stats are for WHR. Maybe a project to try out later :)
+0 / -0
|
quote: Yes, that's the intent anyway. |
Sorry, I do not understand. Do the above results "use only the information that was known before the evaluated battle" or not? (and information includes player rating) Also, any reason to separate on that battle size? (not sure if really applies, but reminds me of: Simpson's paradox)
+0 / -0
|
|
quote: Did you use only the information that was known before the evaluated battle? |
It does seem so to me from the provided code. I cannot say it with absolute certainty though, because I don't know the source code of the imported TrueSkill package. I know a way to hack this package to include knowledge about battle outcomes, but I'm not accusing fiendicus_prime of this. It's probably fine.

I think that there is no Simpson's paradox here. The claim holds for each of the three classes, and since the score on all data would just be a weighted average of the classes, the claim would still hold for all data.

I only hope that you are not actually averaging the probability with 0.5, and that the line quote: score = 1 + mean([math.log2(0.5 - (0.5 - p) * 0.5) for p in predictions]) |
is just an alternative version for testing purposes. But even if not, that would be an easy fix.
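To spell out the concern: the quoted line shrinks each prediction halfway toward 0.5 before scoring, so it isn't equivalent to the plain `mean([1 + log2(pwinner)])` score from the opening post. A quick sketch of the difference (function names are mine):

```python
import math

def raw_score(predictions):
    # the scoring function from the opening post
    return sum(1 + math.log2(p) for p in predictions) / len(predictions)

def blended_score(predictions):
    # the quoted variant: each p is first shrunk halfway toward 0.5,
    # since 0.5 - (0.5 - p) * 0.5 == 0.25 + 0.5 * p
    return 1 + sum(math.log2(0.5 - (0.5 - p) * 0.5)
                   for p in predictions) / len(predictions)
```

The two agree only at p = 0.5; for confident correct predictions the blended version reports a lower score, so scores computed with it aren't comparable to the raw ones.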
+0 / -0
|
Where is this imaginary dataset of team games that were balanced without 1v1 games contaminating the team compositions?
+0 / -0
|
quote: TinySpider Where is this imaginary dataset of team games that were balanced without 1v1 games contaminating the team compositions? |
1. Not all team games are autobalanced (I don't know how to tell which... possibly there are more tables that I don't have?).
2. It seems likely that if team games could be balanced better without the 1v1 data, then including the 1v1 data would make the prediction worse; but it doesn't (and it makes the prediction better for small teams).

I don't doubt that there are some players who buck this rule, but we'd need a deterministic way to identify them. Possibly you could interpolate between their overall ranking and the class-specific one, weighted by the sigma (uncertainty).
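One natural reading of "interpolate ... weighted by the sigma" is inverse-variance weighting, so whichever rating is more certain dominates the blend. A minimal sketch under that assumption (the function name is mine, not from the analysis code):

```python
def interpolated_mu(overall, class_specific):
    """overall and class_specific are (mu, sigma) rating pairs.
    Weight each mu by its precision 1/sigma^2, so the rating with
    lower uncertainty contributes more to the blended estimate."""
    (mu_a, sig_a), (mu_b, sig_b) = overall, class_specific
    w_a = 1.0 / (sig_a * sig_a)
    w_b = 1.0 / (sig_b * sig_b)
    return (w_a * mu_a + w_b * mu_b) / (w_a + w_b)
```

With this, a player who has barely played a class keeps roughly their overall rating, while a class specialist's blended rating drifts toward their class-specific one as its sigma shrinks.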
+0 / -0
|
quote: 2. It seems likely that if team games could be balanced better without the 1v1 data then the 1v1 data would make the prediction worse, but it doesn't (and makes it better for small teams). |
Can you walk me through this thought process? Start with how you got team games data without 1v1 data.
+0 / -0
|
|
quote: TinySpider Can you walk me through this thought process? Start with how you got team games data without 1v1 data. |
IRL you don't always have randomized controlled trials, so you try to make statements about the data that you do have, like: if 1v1 data is a poor predictor of multiplayer skill, then the ability to predict the outcome of a battle ought to be better if you exclude that data. How would you prove that it's better not to include 1v1 data? Everything I've seen suggests that's not the case.

You could employ 1000 people to play teams / 1v1 Zero-K for a week, using random teams, and perform a similar analysis. That would be a very valuable dataset.
+0 / -0
|
quote: How would you prove that it's better not to include 1v1 data? |
That's not how burden of proof works. You cannot make a claim about any contribution of 1v1 data when you do not have any data that doesn't include 1v1 data.
+0 / -0
|
quote: TinySpider You cannot make a claim about any contribution of 1v1 data when you do not have any data that doesn't include 1v1 data. |
But, as stated earlier, I do, since not all games are autobalanced. I agree it would be nice to know which ones these are; then we could test your prediction on that dataset _alone_. Instead of repeatedly asserting that you are right, perhaps it would be more productive to show evidence that supports your theory. That, indeed, is why I did the analysis in the first place. I would have loved to find evidence supporting the idea that a separate multiplayer ranking would produce better balance, but I have failed to.
+0 / -0
|
You can simply exclude 1v1s by looking at the number of players as fiendicus_prime did. That some of those 1v1s influenced the team balance at the time doesn't matter because
+1 / -0
|
quote: You can simply exclude 1v1s by looking at the number of players |
You cannot simply do that. Any team game on record has been balanced using 1v1 games and cannot be separated from them; team games that were not balanced at all are the only relevant available sample.
+0 / -0
|
quote: You cannot simply do that. Any team game on record has been balanced using 1v1 games and cannot be separated from 1v1 games, team games that were not balanced at all are the only relevant available sample. |
It seems like you think that at least one of these is true:
- the dataset includes the ranks used to balance the historical games, and they are being used in this evaluation;
- the two (actually more) disparate algorithms used to balance and predict games back when they were played somehow influence the new algorithm, despite their numbers not being used.

If it's the former, you would simply be wrong. If it's the latter, then you have to demonstrate how you think this influence would happen. Frankly, that sounds like an extraordinary claim, but I'm sure you have extraordinary mathematical proof to back it up, if so. If it's none of the above, then please clarify.
+2 / -1
|
Anarchid Ignoring the elaborate strawman you constructed, here's an allegory of my own:

There's a dataset of crop yields for a 10-year period. These crop yields were achieved by using a specific amount of fertilizer, based on past predictions of how much it impacted crop yields, in order to meet a production goal. Nobody kept track of carrots, potatoes or wheat in relation to fertilizer usage (mistake).

In an attempt to correct this and more accurately predict required fertilizer usage, a brilliant scientist decided to use this old dataset to determine that separate tracking of carrots, potatoes and wheat does not produce more accurate results.

Can you spot the mistake this scientist made?
+1 / -0
|
quote: There's a dataset of crop yields for a 10 year period. These crop yields were achieved by using a specific amount of fertilizer based on past predictions of how much it impacted crop yields, in order to meet a production goal. Nobody kept track of carrots, potatoes or wheat in relation to fertilizer usage (mistake).
In an attempt to correct this and more accurately predict required fertilizer usage, a brilliant scientist decided to use this old dataset to determine that separate tracking of carrots, potatoes and wheat does not produce more accurate results.
Can you spot the mistake this scientist made? |
moving swiftly on
+2 / -2
|
Jadem, here, have some upvotes to validate your existence.
+1 / -1
|