
Do 1v1 games improve multiplayer balance

132 posts, 2361 views

2 years ago
Regarding discussion on http://zero-k.info/Forum/Thread/35303 I played around with some real Zero-K data I got from the back of a bus.

Notes:

- Used the scoring function from http://zero-k.info/Forum/Post/163322#163322 `mean([1 + log2(pwinner)])`
- I'm not sure how to select "real" ranking games from the tables that I have accessible (battles, battle_player). I've done some obvious things to filter games that seemed inappropriate.
- Because WHR is a bit scary, I used python trueskill to generate rankings.
- I've only considered games with equal numbers of players in each team.
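
For anyone not following the linked post, here is a minimal sketch of that scoring function. The name `prediction_score` and the sample probabilities are mine, made up for illustration; `pwinner` is the probability the rating system gave to the team that actually won.

```python
import math

def prediction_score(winner_probs):
    """Mean of 1 + log2(p), where p is the probability assigned to the actual winner."""
    return sum(1 + math.log2(p) for p in winner_probs) / len(winner_probs)

# Always predicting 50:50 scores exactly zero.
print(prediction_score([0.5, 0.5]))

# Two confident correct calls and one confident wrong call: the wrong call
# is punished hard enough to push the mean below zero.
print(prediction_score([0.9, 0.9, 0.1]))
```

So a positive score means the predictions beat a coin flip, and overconfident wrong predictions are penalised heavily.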

Code:

https://colab.research.google.com/drive/1KziZMjYJUR-oeK2ZL9RdZclT8I3-KfrB?usp=sharing

Results:

Comparing the prediction of particular classes of games when using all available battles for generating the rankings, versus only the battles of that class.

1v1 battles:

| Ranking data | Ranking battles | Rated battles | Correct % | Score |
|---|---|---|---|---|
| All | 146446 | 95899 | 68.2% | 0.1060 |
| Class | 114660 | 95899 | 67.8% | 0.1024 |

2v2-4v4 battles:

| Ranking data | Ranking battles | Rated battles | Correct % | Score |
|---|---|---|---|---|
| All | 146446 | 21275 | 59.8% | 0.0408 |
| Class | 21637 | 21275 | 58.1% | 0.0297 |

5v5+ battles:

| Ranking data | Ranking battles | Rated battles | Correct % | Score |
|---|---|---|---|---|
| All | 146446 | 10149 | 52.7% | -0.0046 |
| Class | 10149 | 10149 | 52.8% | -0.0072 |

Conclusion:

For 1v1 battles there's a little advantage to using the whole dataset.
For 2v2-4v4 battles there's some advantage to using the whole dataset.
For 5v5+ battles there doesn't seem to be a clear winner either way; there's no evidence that it's worse to use the whole dataset.

P.S.:

I've heard this has all been done before with similar conclusions when investigating WHR :)
+2 / -0
That's interesting and a good confirmation that combined ratings are more accurate! Due to this effect, I'm wondering if you could improve the score for big teams by replacing
denom = math.sqrt(size * (ts.beta * ts.beta) + sum_sigma)
by
denom = math.sqrt(size/4 * (size * (ts.beta * ts.beta) + sum_sigma))
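
A rough sketch of what that change does, written without the trueskill package so it runs on its own. Assumptions on my part: players are plain `(mu, sigma)` tuples, `size` is the total number of players in the battle, `BETA` is the package default of 25/6, and the `scale_by_size` flag is a hypothetical switch between the two denominators.

```python
import math

BETA = 25 / 6  # default beta in the python trueskill package

def normal_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def win_probability(team1, team2, scale_by_size=False):
    """Probability that team1 beats team2; teams are lists of (mu, sigma)."""
    delta_mu = sum(mu for mu, _ in team1) - sum(mu for mu, _ in team2)
    sum_sigma = sum(sigma ** 2 for _, sigma in team1 + team2)
    size = len(team1) + len(team2)  # assumption: all players in the battle
    variance = size * BETA * BETA + sum_sigma
    if scale_by_size:
        variance *= size / 4  # the proposed extra factor
    return normal_cdf(delta_mu / math.sqrt(variance))

strong, weak = (30.0, 2.0), (25.0, 2.0)
for n in (2, 4):
    original = win_probability([strong] * n, [weak] * n)
    scaled = win_probability([strong] * n, [weak] * n, scale_by_size=True)
    print(f"{n}v{n}: original={original:.3f} scaled={scaled:.3f}")
```

Under these assumptions the two versions agree for 2v2 (size/4 is 1) and the scaled one pulls predictions towards 50% as teams get bigger.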
+0 / -0
2 years ago
Did you use only the information that was known before the evaluated battle?
+1 / -0

2 years ago
quote:
Silent_AI Did you use only the information that was known before the evaluated battle?


Yes, that's the intent anyway.
You'd expect e.g. 1v1 to predict with high accuracy since matchmaker doesn't attempt to balance very close to 50:50, for obvious reasons. I'd be interested what the stats are for WHR. Maybe a project to try out later :)
+0 / -0
2 years ago
quote:
Yes, that's the intent anyway.
Sorry, I do not understand. Do the above results "use only the information that was known before the evaluated battle" or not? (and information includes player rating)

Also, any reason to separate on that battle size? (not sure if really applies, but reminds me of: Simpson's paradox)
+0 / -0
quote:
malric Sorry, I do not understand. Do the above results "use only the information that was known before the evaluated battle" or not? (and information includes player rating)


You can check the code. I sorted by battle_id and I determined the prediction of the battle before adjusting the ratings, so I think so, but I might have made a mistake (e.g. lost the sort order at some point).
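
Roughly, the loop has this shape. This is a simplified sketch and not the notebook code: it swaps in a plain Elo update on mean team rating instead of TrueSkill, and the `evaluate` function, its parameters and the toy battles are all made up for illustration.

```python
def expected(r_a, r_b):
    """Elo-style probability that the side rated r_a beats the side rated r_b."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def evaluate(battles, rated_size, ranking_sizes=None, k=32, start=1500):
    """Fraction of battles with `rated_size` players per team predicted correctly.

    ranking_sizes=None feeds every battle into the ratings ("All");
    a set such as {2} feeds only battles of those team sizes ("Class").
    """
    ratings = {}
    correct = rated = 0
    for b in sorted(battles, key=lambda b: b["battle_id"]):  # chronological order
        size = len(b["winners"])
        win = sum(ratings.get(n, start) for n in b["winners"]) / size
        lose = sum(ratings.get(n, start) for n in b["losers"]) / size
        p = expected(win, lose)  # prediction uses pre-battle ratings only
        if size == rated_size:
            rated += 1
            correct += p > 0.5  # a dead-even 0.5 counts as a miss
        if ranking_sizes is None or size in ranking_sizes:
            for n in b["winners"]:
                ratings[n] = ratings.get(n, start) + k * (1 - p)
            for n in b["losers"]:
                ratings[n] = ratings.get(n, start) - k * (1 - p)
    return correct / rated

battles = [
    {"battle_id": 3, "winners": ["A", "C"], "losers": ["B", "D"]},
    {"battle_id": 1, "winners": ["A"], "losers": ["B"]},
    {"battle_id": 2, "winners": ["A"], "losers": ["B"]},
]
print(evaluate(battles, rated_size=2))                     # ratings from all battles
print(evaluate(battles, rated_size=2, ranking_sizes={2}))  # ratings from 2v2 only
```

In this toy data the all-battles run already knows A beats B from the 1v1s when it reaches the 2v2, while the class-only run has nothing to go on and sits at 50%.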

quote:
malric Also, any reason to separate on that battle size?


There have been several requests for ladders or balance split by game size. I suppose the theory is that some players are much better at 1v1 than they are at teams, or vice versa. What we're seeing is that - if this is true - the effect isn't strong enough to give a better prediction overall.
+1 / -0
2 years ago
quote:
Did you use only the information that was known before the evaluated battle?
It does seem so to me from the provided code. I cannot say it with absolute certainty though because I don't know the source code of the imported TrueSkill package. I know a way to hack this package to include knowledge about battle outcomes but I'm not accusing fiendicus_prime of this. It's probably fine.

I think that there is no Simpson's paradox here. The claim holds for each of the three classes. Since the score on all data would just be a weighted average of the classes, the claim would still hold for all data.
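
To make the weighted-average point concrete, here is the arithmetic on the numbers from the first post (rated battles and scores per class). The score is a mean over battles, so the pooled score is just the battle-weighted mean of the class scores.

```python
# (rated battles, score with "All" ranking data, score with "Class" ranking data)
classes = {
    "1v1": (95899, 0.1060, 0.1024),
    "2v2-4v4": (21275, 0.0408, 0.0297),
    "5v5+": (10149, -0.0046, -0.0072),
}

total = sum(n for n, _, _ in classes.values())
overall_all = sum(n * s_all for n, s_all, _ in classes.values()) / total
overall_class = sum(n * s_cls for n, _, s_cls in classes.values()) / total

print(f"all data:   {overall_all:.4f}")
print(f"class only: {overall_class:.4f}")
```

That comes out at roughly 0.086 for all data against 0.082 for class-only data. Since "All" scores at least as well within every class and both columns use the same weights, pooling cannot flip the ordering.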

I only hope that you
quote:
Used the scoring function from http://zero-k.info/Forum/Post/163322#163322 `mean([1 + log2(pwinner)])`
instead of averaging the probability with 0.5 and that the line
quote:
score = 1 + mean([math.log2(0.5 - (0.5 - p) * 0.5) for p in predictions])
is just an alternative version for testing purposes. But even if not, that would be an easy fix.
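
Why it matters which version was used: averaging with 0.5 shrinks every prediction towards a coin flip, which can rescue the score of an overconfident predictor. A small sketch with made-up probabilities (the names `score` and `shrink` are mine):

```python
import math

def score(winner_probs):
    """mean(1 + log2(p)) over the probabilities given to the actual winners."""
    return sum(1 + math.log2(p) for p in winner_probs) / len(winner_probs)

def shrink(p):
    """The expression from the quoted line; algebraically (p + 0.5) / 2."""
    return 0.5 - (0.5 - p) * 0.5

raw = [0.95, 0.95, 0.05]  # two confident hits, one confident miss
print(score(raw))
print(score([shrink(p) for p in raw]))
```

With these numbers the raw predictions score below zero and the shrunk ones score above zero, so the two variants are not interchangeable when comparing results.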
+0 / -0
2 years ago
Where is this imaginary dataset of team games that were balanced without 1v1 games contaminating the team compositions?
+0 / -0

2 years ago
quote:
TinySpider Where is this imaginary dataset of team games that were balanced without 1v1 games contaminating the team compositions?


1. Not all team games are autobalanced (I don't know how to tell which... possibly there are more tables that I don't have?)
2. It seems likely that if team games could be balanced better without the 1v1 data then the 1v1 data would make the prediction worse, but it doesn't (and makes it better for small teams).

I don't doubt that there are some players who buck this rule, but we'd need a deterministic way to identify them. Possibly you could interpolate between their overall ranking vs the class-specific one, weighted by the sigma (uncertainty).
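
One way that interpolation could look, as a sketch only: weight each rating by its precision (1 / sigma squared), so whichever estimate is less uncertain dominates. The `blend` helper is hypothetical, and it treats the two ratings as independent estimates, which they are not since the overall rating already contains the class games.

```python
def blend(overall, class_specific):
    """Precision-weighted mean of two (mu, sigma) ratings."""
    (mu_o, sigma_o), (mu_c, sigma_c) = overall, class_specific
    w_o = 1 / sigma_o ** 2
    w_c = 1 / sigma_c ** 2
    return (w_o * mu_o + w_c * mu_c) / (w_o + w_c)

# Equal uncertainty: lands halfway between the two ratings.
print(blend((30.0, 2.0), (20.0, 2.0)))

# Few team games, so the class rating is uncertain: result stays near the overall rating.
print(blend((30.0, 1.0), (20.0, 8.0)))
```

A player with plenty of team games would drift towards their team-specific rating, while everyone else would effectively keep the combined one.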
+0 / -0
2 years ago
quote:
2. It seems likely that if team games could be balanced better without the 1v1 data then the 1v1 data would make the prediction worse, but it doesn't (and makes it better for small teams).

Can you walk me through this thought process? Start with how you got team games data without 1v1 data.
+0 / -0

2 years ago
Brackman Yes, I was just fudging the probabilities at the end to maximise the scoring function, because otherwise all the scores using that particular prediction function are negative. I'll try out your suggestion incorporating team size later and see if it improves things.
+1 / -0
quote:
TinySpider Can you walk me through this thought process? Start with how you got team games data without 1v1 data.


IRL you don't always have randomized controlled trials, so you try to make statements about the data that you have like: if 1v1 data is a poor predictor of multiplayer skill, then the ability to predict the outcome of a battle ought to be better if you exclude that data.

How would you prove that it's better not to include 1v1 data? Everything I've seen suggests that's not the case. You could employ 1000 people to play teams / 1v1 Zero-K for a week, using random teams, and perform a similar analysis. This would be a very valuable dataset for analysis.
+0 / -0
quote:
How would you prove that it's better not to include 1v1 data?

That's not how burden of proof works. You cannot make a claim about any contribution of 1v1 data when you do not have any data that doesn't include 1v1 data.
+0 / -0

2 years ago
quote:
TinySpider You cannot make a claim about any contribution of 1v1 data when you do not have any data that doesn't include 1v1 data.


But, as stated earlier, I do; since not all games are autobalanced. I agree, it would be nice to know which ones these are, then we could test your prediction on that dataset _alone_.

Instead of trying to assert that you are right repeatedly perhaps it would be more productive to try and show evidence that supports your theory. This, indeed, is why I did the analysis in the first place. I would have loved to have shown evidence to support the idea that a separate multiplayer ranking would produce better balance, but I have failed.
+0 / -0
2 years ago
You can simply exclude 1v1s by looking at the number of players as fiendicus_prime did. That some of those 1v1s influenced the team balance at the time doesn't matter because
quote:
It is fine to test a rating system on data that has not been balanced with it because a rating system can also evaluate unbalanced games.
+1 / -0
2 years ago
quote:
You can simply exclude 1v1s by looking at the number of players

You cannot simply do that. Any team game on record has been balanced using 1v1 games and cannot be separated from 1v1 games; team games that were not balanced at all are the only relevant available sample.
+0 / -0
quote:
You cannot simply do that. Any team game on record has been balanced using 1v1 games and cannot be separated from 1v1 games; team games that were not balanced at all are the only relevant available sample.


It seems like you think that at least one of these is true:

- the dataset includes the ranks used to balance the historical games and they are being used in this evaluation
- the two (actually more) disparate algorithms used to balance and predict games back when they were played somehow influence the new algorithm, even though their numbers are not used

If it's the former, you would be simply wrong. If it's the latter, then you have to demonstrate how you think this influence would happen. Frankly, that sounds like an extraordinary claim, but I'm sure you have extraordinary mathematical proof to back it up, if so.

If it's none of the above, then please clarify.
+2 / -1
2 years ago
AdminAnarchid Ignoring the elaborate strawman you constructed, here's an allegory of my own:

There's a dataset of crop yields for a 10 year period. These crop yields were achieved by using a specific amount of fertilizer based on past predictions of how much it impacted crop yields, in order to meet a production goal. Nobody kept track of carrots, potatoes or wheat in relation to fertilizer usage (mistake).

In an attempt to correct this and more accurately predict required fertilizer usage, a brilliant scientist decided to use this old dataset to determine that separate tracking of carrots, potatoes and wheat does not produce more accurate results.

Can you spot the mistake this scientist made?
+1 / -0
2 years ago
quote:
There's a dataset of crop yields for a 10 year period. These crop yields were achieved by using a specific amount of fertilizer based on past predictions of how much it impacted crop yields, in order to meet a production goal. Nobody kept track of carrots, potatoes or wheat in relation to fertilizer usage (mistake).

In an attempt to correct this and more accurately predict required fertilizer usage, a brilliant scientist decided to use this old dataset to determine that separate tracking of carrots, potatoes and wheat does not produce more accurate results.

Can you spot the mistake this scientist made?

moving swiftly on
+2 / -2
2 years ago
Jadem here have some upvotes to validate your existence.
+1 / -1