Regarding the discussion on http://zero-k.info/Forum/Thread/35303 I played around with some real Zero-K data I got from the back of a bus.

Notes:
- Used the scoring function from http://zero-k.info/Forum/Post/163322#163322: `mean([1 + log2(pwinner)])`
- I'm not sure how to select "real" ranking games from the tables that I have accessible (battles, battle_player). I've done some obvious things to filter out games that seemed inappropriate.
- Because WHR is a bit scary, I used Python trueskill to generate rankings.
- I've only considered games with equal numbers of players in each team.

Code: https://colab.research.google.com/drive/1KziZMjYJUR-oeK2ZL9RdZclT8I3-KfrB?usp=sharing

Results: comparing the prediction of particular classes of games when using all available battles for generating the rankings, versus only the battles of that class.

1v1 battles:

Ranking data | Ranking battles | Rated battles | Correct % | Score
All          | 146446          | 95899         | 68.2%     | 0.1060
Class        | 114660          | 95899         | 67.8%     | 0.1024
2v2-4v4 battles:

Ranking data | Ranking battles | Rated battles | Correct % | Score
All          | 146446          | 21275         | 59.8%     | 0.0408
Class        | 21637           | 21275         | 58.1%     | 0.0297
5v5+ battles:

Ranking data | Ranking battles | Rated battles | Correct % | Score
All          | 146446          | 10149         | 52.7%     | -0.0046
Class        | 10149           | 10149         | 52.8%     | -0.0072
Conclusion:
- For 1v1 battles there's a little advantage to using the whole dataset.
- For 2v2-4v4 battles there's some advantage to using the whole dataset.
- For 5v5+ battles there doesn't seem to be a clear winner either way; there's no evidence that it's worse to use the whole dataset.

P.S.: I've heard this has all been done before with similar conclusions when investigating WHR :)
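For reference, the scoring function from the linked post can be sketched in a few lines (a minimal sketch; `predictions` is assumed to hold, for each battle, the probability the model assigned to the team that actually won):

```python
import math

def score(predictions):
    """Mean of 1 + log2(p) over all battles, where p is the probability
    assigned to the actual winner. A coin-flip model (p = 0.5 always)
    scores 0; a perfect model (p = 1.0 always) scores 1; a model that
    is confidently wrong goes sharply negative."""
    return sum(1 + math.log2(p) for p in predictions) / len(predictions)
```

So the small positive scores in the tables above mean the rankings are only slightly better than a coin flip on average, which is expected when the balancer is actively pushing games toward 50:50.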
+2 / -0
|
That's interesting, and a good confirmation that combined ratings are more accurate! Given how much weaker the predictions get as team size grows, I'm wondering if you could improve the score for big teams by replacing

`denom = math.sqrt(size * (ts.beta * ts.beta) + sum_sigma)`

with

`denom = math.sqrt(size/4 * (size * (ts.beta * ts.beta) + sum_sigma))`
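For context, the `denom` line comes from the usual TrueSkill-style win-probability calculation. Here's a standard-library sketch of it with the proposed rescaling as an option; the function and flag names are mine, and `BETA` is assumed to be the trueskill package's default of 25/6:

```python
import math

BETA = 25.0 / 6.0  # default trueskill beta (mu0 = 25, beta = mu0 / 6)

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_probability(team1, team2, scale_by_size=False):
    """team1/team2: lists of (mu, sigma) ratings.
    scale_by_size applies the proposed size/4 factor: it leaves 2v2
    (size = 4) untouched, sharpens 1v1 predictions, and pulls
    predictions for bigger teams toward 0.5."""
    delta_mu = sum(mu for mu, _ in team1) - sum(mu for mu, _ in team2)
    sum_sigma = sum(s * s for _, s in team1 + team2)
    size = len(team1) + len(team2)
    denom = math.sqrt(size * (BETA * BETA) + sum_sigma)
    if scale_by_size:
        denom = math.sqrt(size / 4 * (size * (BETA * BETA) + sum_sigma))
    return normal_cdf(delta_mu / denom)
```

Note this only rescales confidence, not which team is predicted to win, so it can improve the log-score without changing the "Correct %" column at all.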
+0 / -0
|
Did you use only the information that was known before the evaluated battle?
+1 / -0
|
quote: Silent_AI Did you use only the information that was known before the evaluated battle? |
Yes, that's the intent, anyway. You'd expect e.g. 1v1 to predict with high accuracy, since the matchmaker doesn't attempt to balance very close to 50:50, for obvious reasons. I'd be interested to see what the stats are for WHR. Maybe a project to try out later :)
+0 / -0
|
quote: Yes, that's the intent anyway. |
Sorry, I do not understand. Do the above results "use only the information that was known before the evaluated battle" or not? (and information includes player rating) Also, any reason to separate on that battle size? (not sure if really applies, but reminds me of: Simpson's paradox)
+0 / -0
|
|
quote: Did you use only the information that was known before the evaluated battle? |
It does seem so to me from the provided code. I cannot say it with absolute certainty though, because I don't know the source code of the imported TrueSkill package. I know a way to hack this package to include knowledge about battle outcomes, but I'm not accusing fiendicus_prime of this. It's probably fine.

I think that there is no Simpson's paradox here. The claim holds for each of the three classes, and since the score on all data would just be a weighted average of the classes, the claim would still hold for all data.

I only hope that you are not actually averaging the probability with 0.5, and that the line quote: score = 1 + mean([math.log2(0.5 - (0.5 - p) * 0.5) for p in predictions]) |
is just an alternative version for testing purposes. But even if not, that would be an easy fix.
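To spell out the concern: the quoted line shrinks each prediction halfway toward 0.5 before scoring, so it isn't equivalent to the plain `mean([1 + log2(pwinner)])` score from the opening post. A quick sketch of the difference (function names are mine):

```python
import math

def raw_score(predictions):
    # the scoring function from the opening post
    return sum(1 + math.log2(p) for p in predictions) / len(predictions)

def blended_score(predictions):
    # the quoted variant: each p is first shrunk halfway toward 0.5,
    # since 0.5 - (0.5 - p) * 0.5 == 0.25 + 0.5 * p
    return 1 + sum(math.log2(0.5 - (0.5 - p) * 0.5)
                   for p in predictions) / len(predictions)
```

The two agree only at p = 0.5; for confident correct predictions the blended version reports a lower score, so scores computed with it aren't comparable to the raw ones.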
+0 / -0
|
Where is this imaginary dataset of team games that were balanced without 1v1 games contaminating the team compositions?
+0 / -0
|
quote: TinySpider Where is this imaginary dataset of team games that were balanced without 1v1 games contaminating the team compositions? |
1. Not all team games are autobalanced (I don't know how to tell which... possibly there are more tables that I don't have?).
2. It seems likely that if team games could be balanced better without the 1v1 data, then including the 1v1 data would make the prediction worse; but it doesn't (and it makes the prediction better for small teams).

I don't doubt that there are some players who buck this rule, but we'd need a deterministic way to identify them. Possibly you could interpolate between their overall ranking and the class-specific one, weighted by the sigma (uncertainty).
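One natural reading of "interpolate ... weighted by the sigma" is inverse-variance weighting, so whichever rating is more certain dominates the blend. A minimal sketch under that assumption (the function name is mine, not from the analysis code):

```python
def interpolated_mu(overall, class_specific):
    """overall and class_specific are (mu, sigma) rating pairs.
    Weight each mu by its precision 1/sigma^2, so the rating with
    lower uncertainty contributes more to the blended estimate."""
    (mu_a, sig_a), (mu_b, sig_b) = overall, class_specific
    w_a = 1.0 / (sig_a * sig_a)
    w_b = 1.0 / (sig_b * sig_b)
    return (w_a * mu_a + w_b * mu_b) / (w_a + w_b)
```

With this, a player who has barely played a class keeps roughly their overall rating, while a class specialist's blended rating drifts toward their class-specific one as its sigma shrinks.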
+0 / -0
|
quote: 2. It seems likely that if team games could be balanced better without the 1v1 data then the 1v1 data would make the prediction worse, but it doesn't (and makes it better for small teams). |
Can you walk me through this thought process? Start with how you got team games data without 1v1 data.
+0 / -0
|
|
quote: TinySpider Can you walk me through this thought process? Start with how you got team games data without 1v1 data. |
IRL you don't always have randomized controlled trials, so you try to make statements about the data that you do have, like: if 1v1 data is a poor predictor of multiplayer skill, then the ability to predict the outcome of a battle ought to be better if you exclude that data. How would you prove that it's better not to include 1v1 data? Everything I've seen suggests that's not the case.

You could employ 1000 people to play teams / 1v1 Zero-K for a week, using random teams, and perform a similar analysis. That would be a very valuable dataset.
+0 / -0
|
quote: How would you prove that it's better not to include 1v1 data? |
That's not how burden of proof works. You cannot make a claim about any contribution of 1v1 data when you do not have any data that doesn't include 1v1 data.
+0 / -0
|
quote: TinySpider You cannot make a claim about any contribution of 1v1 data when you do not have any data that doesn't include 1v1 data. |
But, as stated earlier, I do, since not all games are autobalanced. I agree it would be nice to know which ones these are; then we could test your prediction on that dataset _alone_. Instead of repeatedly asserting that you are right, perhaps it would be more productive to show evidence that supports your theory. That, indeed, is why I did the analysis in the first place. I would have loved to find evidence supporting the idea that a separate multiplayer ranking would produce better balance, but I have failed to.
+0 / -0
|
You can simply exclude 1v1s by looking at the number of players as fiendicus_prime did. That some of those 1v1s influenced the team balance at the time doesn't matter because
+1 / -0
|
quote: You can simply exclude 1v1s by looking at the number of players |
You cannot simply do that. Any team game on record has been balanced using 1v1 games and cannot be separated from them; team games that were not balanced at all are the only relevant available sample.
+0 / -0
|
quote: You cannot simply do that. Any team game on record has been balanced using 1v1 games and cannot be separated from 1v1 games, team games that were not balanced at all are the only relevant available sample. |
It seems like you think that at least one of these is true:
- the dataset includes the ranks used to balance the historical games, and they are being used in this evaluation;
- the two (actually more) disparate algorithms used to balance and predict games back when they were played somehow influence the new algorithm, despite their numbers not being used.

If it's the former, you would simply be wrong. If it's the latter, then you have to demonstrate how you think this influence would happen. Frankly, that sounds like an extraordinary claim, but I'm sure you have extraordinary mathematical proof to back it up, if so. If it's none of the above, then please clarify.
+2 / -1
|
Anarchid Ignoring the elaborate strawman you constructed, here's an allegory of my own:

There's a dataset of crop yields for a 10-year period. These crop yields were achieved by using a specific amount of fertilizer, based on past predictions of how much it impacted crop yields, in order to meet a production goal. Nobody kept track of carrots, potatoes or wheat in relation to fertilizer usage (mistake).

In an attempt to correct this and more accurately predict required fertilizer usage, a brilliant scientist decided to use this old dataset to determine that separate tracking of carrots, potatoes and wheat does not produce more accurate results.

Can you spot the mistake this scientist made?
+1 / -0
|
quote: There's a dataset of crop yields for a 10 year period. These crop yields were achieved by using a specific amount of fertilizer based on past predictions of how much it impacted crop yields, in order to meet a production goal. Nobody kept track of carrots, potatoes or wheat in relation to fertilizer usage (mistake).
In an attempt to correct this and more accurately predict required fertilizer usage, a brilliant scientist decided to use this old dataset to determine that separate tracking of carrots, potatoes and wheat does not produce more accurate results.
Can you spot the mistake this scientist made? |
moving swiftly on
+2 / -2
|
Jadem, here, have some upvotes to validate your existence.
+1 / -1
|