Balancing uneven team games - An attempt at quantifying the advantage of the larger team - forum thread

dunno

2 years ago
(edited 2 years ago)

Rendered notebook here (for ease of including plots and formulas):

* https://jklw.github.io/WHR-ZeroK/posting1_v2.html

Summary: Uneven team games should probably be balanced such that the smaller team has about 50-200 higher average Elo than the larger team (depending on the specifics). Also a little background on how WHR (which I had to reimplement here) works mathematically.

The code is here, though parts of the notebooks other than posting1.ipynb have already bitrotten, and I didn't include the input data (not sure if there are any privacy issues), nor the sampling traces (which are needed to run posting1) because they are about 16 GB :) You can recreate these by running the initial (still working) parts of whr_cumsum_matrix_model.ipynb and then whr_fixed_rating_model.ipynb though (after downloading malric's data).

I used malric's data from Zero-K Local Analysis (thanks!); if the admins are interested, I could re-run this on a DB dump (of game dates, teams and outcomes) as a check.

+9 / -0

malric

2 years ago
(edited 2 years ago)

For "Top 100" you mention "(only last available day for each player)" - is that the date for which you found a game that was not filtered out?

If yes, there might be some issues. In my local database, for example, I have Godde playing in game 1351963, which was on Monday, April 25, 2022 12:45:10 AM, but in your graph for Godde it says 2021-12-22. The game characteristics are:

battle_id|is_elo|started_sec|duration_sec|title|has_bots|map_id|team_won|game_name_version|game_id|game_version|engine_version|host|is_chicken|is_ffa|is_teams|is_autohost_teams|is_1v1|is_matchmaker
1351963|1|1650847510|1680|[A] Teams All Welcome (32p)|0|34970|2|Zero-K v1.10.4.1|1|v1.10.4.1|105.1.1-841-g099e9d0||||1|1||

I don't see anything in your list that would exclude it.
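The date can be double-checked from the row itself. A minimal sketch, assuming `started_sec` is a Unix timestamp in UTC (that interpretation is my assumption, not something the schema states):

```python
from datetime import datetime, timezone

# started_sec of battle 1351963, copied from the row above.
# Assumption: the column is a Unix timestamp interpreted as UTC.
started_sec = 1650847510
started = datetime.fromtimestamp(started_sec, tz=timezone.utc)

# Last date shown for Godde in the Top 100 graph.
graph_date = datetime(2021, 12, 22, tzinfo=timezone.utc)

# Whole days between the graph date and this game.
gap_days = (started - graph_date).days

print(started.isoformat())
print(gap_days)
```

Under that assumption the game falls about four months after the last date in the graph.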

+3 / -0

dunno

2 years ago
(edited 2 years ago)

Oh no, you're right, I accidentally left out 2022-01 to 2022-04 (yes, the Top 100 is supposed to contain the date of that player's most recent game not filtered out). Good catch.

Will re-run it overnight; I expect/hope that this won't have a large effect on the `epad` results. This will embiggen the dataset from 22150 to 25813 games.

+2 / -0

malric

2 years ago

quote:
, but for some reason none of the 2017 or 2018 games passed the filter

=> None of the games before Saturday, March 16, 2019 10:49:52 PM are marked is_autohost_teams because the code looks for "[A] Teams All Welcome " in the title, and before that date the host was called "TEAMS: Newbies welcome!".
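If someone wanted to tag the older games too, the title check could accept both names. This is only a sketch: `is_autohost_teams` here is a hypothetical helper, not the actual tagging code, and it assumes the room title is available as a plain string.

```python
# Hypothetical helper; the real tagging code may look different.
AUTOHOST_TEAMS_MARKERS = (
    "[A] Teams All Welcome ",   # title the current check looks for
    "TEAMS: Newbies welcome!",  # earlier name of the host
)


def is_autohost_teams(title: str) -> bool:
    """Return True if the room title contains a known teams-autohost name."""
    return any(marker in title for marker in AUTOHOST_TEAMS_MARKERS)


print(is_autohost_teams("[A] Teams All Welcome (32p)"))
print(is_autohost_teams("TEAMS: Newbies welcome!"))
```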

+1 / -0

Stuey72727

2 years ago

Due to the way metal is split, I think an extra weaker player can sometimes have a negative impact on a team. Only in cases where there is a fair difference in rank.

+3 / -0

Brackman

2 years ago

quote:
Due to the way metal is split, I think an extra weaker player can sometimes have a negative impact on a team.

Obviously. This effect is already considered in every model that has ever been used.

If I understand Fig. 1 correctly, then the blue graphs show the big teams' advantage for the current version of the rating history and the red graphs consider this advantage in the rating changes. At first glance, it may seem surprising that the advantage increases with a better model. But if my understanding and the joint fit are correct, those graphs would support my hypothesis that the big teams' advantage causes a rating distortion such that the advantage is partially compensated. Players who tend to get second coms or who avoid games in which they would be outnumbered would have their ratings systematically distorted (and other players too, such that the average rating remains constant). By considering the advantage in the model, the ratings would no longer be distorted to compensate for the advantage and therefore the advantage in the model would increase. I'm wondering if the rating distortion that can be seen in dunno's Top100 data is sufficient to explain this.

First, it should be checked if the code to reproduce the rating history is correct.

I would like to have a function that describes epad in dependency of team size without having to treat it as a vector with many dimensions. This would reduce the number of parameters in the system that can be overfitted. That the 4v5 epad is smaller than the 5v6 epad seems to be an outlier due to too little data. For a proof of concept, the epad vector is fine, though.

How epad depends on BPST is a bit arbitrary. It's probably good enough to be better than the simple adjusted model though. It is questionable if the advantage is worth the arbitrary complexity.

What is the scoring rule for the elpd_loo calculation? For fixed-rating models, I see that elpd_loo ~ 7231*ln(.55). For joint models, elpd_loo ~ 22150 ln(.515). So is the scoring rule ln(probability) summed over the number of games? In this case, it would be nice to calculate 1+elpd_loo/ln(2)/number of games. This would make the score independent of the number of games and score 0 would mean always predicting 50% and score 1 would mean always predicting 100% right. Instead of LOO, you could use only the previous games as training data for every game. But your calculation should be fine, too. It is interesting to see those results.
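The normalisation suggested above is easy to write down. A small sketch, assuming elpd_loo really is a sum of per-game log probabilities; `score_per_game` is just an illustrative name, and the inputs are the rough values quoted above, not exact notebook output:

```python
import math


def score_per_game(elpd_loo: float, n_games: int) -> float:
    """1 + elpd_loo / (n_games * ln 2).

    0 corresponds to always predicting 50%, 1 to always predicting
    the actual winner with 100%.
    """
    return 1.0 + elpd_loo / (n_games * math.log(2))


# Approximate values read off the notebook (assumed, see above).
fixed = score_per_game(7231 * math.log(0.55), 7231)
joint = score_per_game(22150 * math.log(0.515), 22150)
print(f"fixed-rating: {fixed:.3f}, joint: {joint:.3f}")
```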

+1 / -0

dunno

2 years ago

Updated (under new URL) to include the missing 4 months. Luckily no big changes in the inferred epads.

I also disaggregated the “8v9+” category into 8v9, 9v10, ... 15v16+ because the epad for 8v9+ was still clearly positive (if small). It now looks like we can exclude a zero effect up to 10v11.

Took more than 18 hours to run this time ._. (6-7 hours for each of the joint models).

malric: I see. Couldn't include these in the new version because I don't think I have the room titles.

Stuey72727: I agree, but the system currently in use already reflects this, since the weak extra player will drag down the average rating. I'm generally trying to find just a correction term to the current system ("how much better is the large team compared to what the current system thinks", or equivalently "how much higher than the average rating of the large team does the average rating of the small team need to be to get a 50% win chance").

The model is unbiased between negative and positive epad; if the smaller team actually had an advantage on average, the results would have pointed to a negative epad.
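To make the meaning of the correction term concrete, here is a minimal sketch. It assumes the usual Elo logistic curve (base 10, 400 points) applied to average team ratings; the function name is made up for illustration and the model in the notebook may be parameterised differently.

```python
def win_probability_small_team(avg_small: float, avg_large: float, epad: float) -> float:
    """Win chance of the smaller team.

    Assumption: standard Elo logistic curve on the difference of average
    ratings, with epad subtracted from the small team's rating edge.
    """
    diff = avg_small - avg_large - epad
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


# With epad = 100, equal averages leave the small team below 50%,
# and a 100 Elo edge for the small team gives an even game.
print(win_probability_small_team(1500, 1500, epad=100))
print(win_probability_small_team(1600, 1500, epad=100))
```

A negative epad would push the first number above 50%, which is what the fit would have reported if the smaller team had the advantage.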

+1 / -0

malric

2 years ago

quote:
Couldn't include these in the new version because I don't think I have the room titles.

Don't think you should (would be nice, but it already seems to take a lot of processing time). The ~4 years of data processed seems reasonable to me. I just wanted to make sure there is no issue with the data (wrong things tagged etc.).

My interpretation of Stuey72727's remark is: why analyze "Best Player of Smaller Team" when you could also use "Worst Player of Smaller Team"? (It's more of a rhetorical question; I have seen the reason in the notebook.)

I do agree, though, that "Best Player of Smaller Team" may not be the only possible discriminator; I would wonder about any measure of dispersion of the ranks in each team. The reason is that some low-ranked players are much more "random" in performance than high-ranked players (e.g. people only starting, people occasionally trolling), so it might be that having fewer wild cards (due to smaller dispersion) is "better". Maybe I will try to run this myself at some point, but I'm on the road now, so I do not have access to reasonable computing power for this exercise.
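As a rough idea of what such a per-team feature could look like, here is a sketch with made-up ratings. `team_summary` is a hypothetical helper, and the population standard deviation is just one possible dispersion measure.

```python
from statistics import mean, pstdev


def team_summary(ratings):
    """Summarise one team's ratings (hypothetical feature set)."""
    return {
        "mean": mean(ratings),
        "spread": pstdev(ratings),  # population standard deviation
        "best": max(ratings),
        "worst": min(ratings),
    }


# Made-up example: same average rating, very different dispersion.
even_team = team_summary([1500, 1500, 1500, 1500])
mixed_team = team_summary([2100, 1700, 1200, 1000])
print(even_team)
print(mixed_team)
```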

+0 / -0

dunno

2 years ago
(edited 2 years ago)

Brackman:

quote:
If I understand Fig. 1 correctly, then the blue graphs show the big teams' advantage for the current version of the rating history and the red graphs consider this advantage in the rating changes.

Yes. Put another way, in the red graphs, all player-day ratings and the epads are parameters in a single model. In the blue graphs, I first inferred ratings using the unadjusted WHR model, then took the posterior mean for each player-day rating as fixed, then inferred the epads in their own model.

This latter process throws away all the uncertainty and correlations that might be in the posterior player ratings, which seems significant, but I can't say whether it should lead to higher or lower epad estimates.

(Player ratings can be correlated - for example, if two players always play as a team, then we only really have evidence about what the sum of their rating is, but can't distinguish (2000, 1000) from (1500, 1500), so their ratings would be highly negatively correlated).
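The duo example can be checked numerically. A minimal sketch, assuming team strength is the average rating fed into a standard Elo logistic curve; the opponent rating and the win/loss counts are made up:

```python
import math


def team_log_likelihood(team_ratings, opponent_avg, wins, losses):
    """Log-likelihood of a win/loss record for a fixed team.

    Assumption: team strength is the average rating, compared to the
    opponents' average on a standard Elo logistic curve.
    """
    team_avg = sum(team_ratings) / len(team_ratings)
    p = 1.0 / (1.0 + 10.0 ** (-(team_avg - opponent_avg) / 400.0))
    return wins * math.log(p) + losses * math.log(1.0 - p)


# Two very different rating pairs with the same sum.
ll_a = team_log_likelihood((2000, 1000), 1450, wins=12, losses=8)
ll_b = team_log_likelihood((1500, 1500), 1450, wins=12, losses=8)
print(ll_a, ll_b)
```

Both pairs give exactly the same likelihood, so the data alone only pin down the sum of the two ratings.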

quote:
At first glance, it may seem surprising that the advantage increases with a better model. But if my understanding and the joint fit are correct, those graphs would support my hypothesis that the big teams' advantage causes a rating distortion such that the advantage is partially compensated. Players who tend to get second coms or who avoid games in which they would be outnumbered would have their ratings systematically distorted (and other players too, such that the average rating remains constant). By considering the advantage in the model, the ratings would no longer be distorted to compensate for the advantage and therefore the advantage in the model would increase. I'm wondering if the rating distortion that can be seen in dunno's Top100 data is sufficient to explain this.

Not sure if anyone is systematically in the smaller or larger team on autohost. When I hadn't yet restricted to autohost games, I did find some players who were very often in the smaller or in the larger team; these are probably odd friend groups. But for autohost games, I assumed it's more or less a coin flip (thus my lack of surprise that the ratings changed little in the adjusted Top 100). But now that I think about it, at least for 1v2 games, there would be an automatic bias for medium-rated players to be in the small team.

quote:
First, it should be checked if the code to reproduce the rating history is correct.

That would be nice. One possible test (which I haven't gotten around to yet) would be to generate a known ground truth (epad and player-day ratings), then generate games according to the model likelihood and check that the results converge to the ground truth (actually, I would only expect rating differences to converge, since adding a constant to every player-day rating keeps the likelihood unchanged. The only thing constraining this constant is the prior about first-day ratings).
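A heavily simplified version of such a test, to show the idea: it treats the rating differences as known and only tries to recover a single epad, so it is far from the full WHR model. The logistic scale, the spread of rating differences and all function names are assumptions for illustration.

```python
import math
import random

K = math.log(10) / 400  # Elo logistic scale (assumption)


def p_small_wins(diff, epad):
    """Win chance of the small team given avg_small - avg_large."""
    return 1.0 / (1.0 + math.exp(-K * (diff - epad)))


def simulate_games(n_games, true_epad, rng):
    """Generate (rating difference, small team won) pairs from the model."""
    games = []
    for _ in range(n_games):
        diff = rng.gauss(0.0, 150.0)  # treated as known ground truth
        small_won = rng.random() < p_small_wins(diff, true_epad)
        games.append((diff, small_won))
    return games


def fit_epad(games, lo=-2000.0, hi=2000.0, iters=60):
    """Maximum-likelihood epad by bisection.

    The derivative of the log-likelihood with respect to epad is
    K * sum(p - y), which decreases monotonically in epad.
    """
    def score(epad):
        return sum(p_small_wins(d, epad) - (1.0 if won else 0.0)
                   for d, won in games)

    for _ in range(iters):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid  # likelihood still increasing, root is to the right
        else:
            hi = mid
    return (lo + hi) / 2


rng = random.Random(0)
games = simulate_games(5000, true_epad=100.0, rng=rng)
estimate = fit_epad(games)
print(f"true epad 100, estimated {estimate:.1f}")
```

Because only the difference of average ratings enters, adding a constant to every rating leaves `diff` and hence the estimate unchanged, which matches the remark about only rating differences being identifiable.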

quote:
I would like to have a function that describes epad in dependency of team size without having to treat it as a vector with many dimensions. This would reduce the number of parameters in the system that can be overfitted. That the 4v5 epad is smaller than the 5v6 epad seems to be an outlier due to too little data. For a proof of concept, the epad vector is fine, though.

I think that at least in the simple model, there's little room for overfitting since each epad component is being fitted to hundreds of games (except 1v2 is a little rare). epad_4v5 does have a lower mean than epad_5v6, but due to the uncertainty expressed in the posterior, it's still fairly consistent with epad_4v5 > epad_5v6 (I should plot the posterior for their difference).

It's still true that we'd probably get a tighter estimate with even fewer parameters. (Also, my epad prior is a bit crazily broad, not ruling out an epad of 2000 or -2000 Elo)

quote:
How epad depends on BPST is a bit arbitrary. It's probably good enough to be better than the simple adjusted model though. It is questionable if the advantage is worth the arbitrary complexity.

True. I tried to have both "low" and "high" rating well covered by roughly fitting the "rating highness" function to the CDF of the ratings of the BPST, but this plan was foiled by the strong correlation of the BPST Elo with team size.

I didn't go for a linear relation between BPST Elo and epad because it seems a bit implausible that epad should grow without bound as the BPST gets very good or bad, but not sure.

quote:
What is the scoring rule for the elpd_loo calculation? For fixed-rating models, I see that elpd_loo ~ 7231*ln(.55). For joint models, elpd_loo ~ 22150 ln(.515). So is the scoring rule ln(probability) summed over the number of games? In this case, it would be nice to calculate 1+elpd_loo/ln(2)/number of games. This would make the score independent of the number of games and score 0 would mean always predicting 50% and score 1 would mean always predicting 100% right. Instead of LOO, you could use only the previous games as training data for every game. But your calculation should be fine, too. It is interesting to see those results.

Yes, this is my understanding too, and I agree it would be better to show the ELPD per game instead of the sum.

To be precise, I think ELPD, for one game, is an approximation of the expected value (over the posterior) of ln p(actual winner team wins). Which is a bit different from the ln p using a point estimate like MAP or the mean.
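The difference between these quantities can be seen on made-up numbers. In the sketch below the posterior draws are invented, not taken from the model; as far as I understand, LOO works with the log of the probability averaged over the (leave-one-out) posterior, so I show that variant as well.

```python
import math
import random


def p_win(diff):
    """Elo-style win probability for a rating edge `diff` (assumed curve)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


rng = random.Random(1)
# Invented posterior draws of the actual winner's rating edge.
draws = [rng.gauss(100.0, 200.0) for _ in range(10000)]

# Average of ln p over the posterior draws.
mean_log_p = sum(math.log(p_win(d)) for d in draws) / len(draws)
# ln of the posterior-averaged probability.
log_mean_p = math.log(sum(p_win(d) for d in draws) / len(draws))
# ln p at a point estimate (here: the posterior mean of the edge).
log_p_at_mean = math.log(p_win(sum(draws) / len(draws)))

print(mean_log_p, log_mean_p, log_p_at_mean)
```

By Jensen's inequality the averaged log probability can never exceed either of the other two, so a point estimate tends to look more confident than the full posterior warrants.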

Using only previous games to predict the next would be more realistic, but so far I don't know how to do this practically in reasonable time. I would guess this "clairvoyance effect" isn't so bad in this domain though because it isn't much easier to predict intermediate games than to predict the future.

+0 / -0
