Brackman:
quote: If I understand Fig. 1 correctly, then the blue graphs show the big teams' advantage for the current version of the rating history and the red graphs consider this advantage in the rating changes.
Yes. Put another way, in the red graphs, all player-day ratings and the epads are parameters in a single model. In the blue graphs, I first inferred ratings using the unadjusted WHR model, then took the posterior mean of each player-day rating as fixed, then inferred the epads in a separate model.
This latter process throws away all the uncertainty and correlations that might be in the posterior player ratings, which seems significant, but I can't say whether it should lead to higher or lower epad estimates.
(Player ratings can be correlated: for example, if two players always play as a team, then we only really have evidence about the sum of their ratings and can't distinguish (2000, 1000) from (1500, 1500), so their ratings would be highly negatively correlated.)
quote: At first glance, it may seem surprising that the advantage increases by a better model. But if my understanding and the joint fit were correct, those graphs would support my hypothesis that the big teams' advantage causes a rating distortion such that the advantage is partially compensated. Players who tend to get second coms or who avoid games in which they would be outnumbered would have their ratings systematically distorted (and other players too such that the average rating remains constant). By considering the advantage in the model, the ratings would no longer be distorted to compensate the advantage and therefore, the advantage in the model would increase. I'm wondering if the rating distortion that can be seen in dunno's Top100 data is sufficient to explain this.
I'm not sure whether anyone is systematically in the smaller or larger team on autohost. Before I restricted to autohost games, I did find some players who were very often in the smaller or in the larger team; these are probably odd friend groups. But for autohost games, I assumed it's more or less a coin flip (thus my lack of surprise that the ratings changed little in the adjusted Top 100). But now that I think about it, at least for 1v2 games, there would be an automatic bias for medium-rated players to be in the small team.
quote: First, it should be checked if the code to reproduce the rating history is correct. 
That would be nice. One possible test (which I haven't gotten around to yet) would be to generate a known ground truth (epad and player-day ratings), then generate games according to the model likelihood and check that the results converge to the ground truth. (Actually, I would only expect rating differences to converge, since adding a constant to every player-day rating keeps the likelihood unchanged. The only thing constraining this constant is the prior on first-day ratings.)
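A much-reduced sketch of such a test, not the actual WHR code. It assumes (hypothetically) static ratings with no player-day dynamics, a single epad, and an Elo-style logistic likelihood on the difference of team mean ratings with epad credited to the larger team. All function names and numbers are made up for illustration. It only recovers epad with the ratings held at their true values, and checks that shifting every rating by a constant leaves the likelihood unchanged.

```python
import numpy as np

def win_prob(diff):
    """Elo-style logistic win probability for a rating difference `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def simulate_games(ratings, epad, n_games, small_size, rng):
    """Draw random k vs k+1 games and sample who wins under the assumed model."""
    n = len(ratings)
    lineup = np.array([rng.choice(n, size=2 * small_size + 1, replace=False)
                       for _ in range(n_games)])
    small, big = lineup[:, :small_size], lineup[:, small_size:]
    # Assumed likelihood: mean-rating difference, shifted by epad for the big team.
    diff = ratings[small].mean(axis=1) - ratings[big].mean(axis=1) - epad
    small_won = rng.random(n_games) < win_prob(diff)
    return small, big, small_won

def log_likelihood(ratings, epad, small, big, small_won):
    """Log-likelihood of the observed outcomes under the same assumed model."""
    diff = ratings[small].mean(axis=1) - ratings[big].mean(axis=1) - epad
    p = win_prob(diff)
    return float(np.sum(np.log(np.where(small_won, p, 1.0 - p))))

rng = np.random.default_rng(0)
true_ratings = rng.normal(1500.0, 300.0, size=40)   # made-up ground truth
true_epad = 150.0                                   # made-up ground truth
small, big, small_won = simulate_games(true_ratings, true_epad, 5000, 2, rng)

# Recover epad by a 1-D grid search, ratings held at the ground truth.
grid = np.arange(0.0, 301.0, 5.0)
lls = [log_likelihood(true_ratings, e, small, big, small_won) for e in grid]
epad_hat = float(grid[int(np.argmax(lls))])

# Adding a constant to every rating should not change the likelihood.
ll_base = log_likelihood(true_ratings, true_epad, small, big, small_won)
ll_shifted = log_likelihood(true_ratings + 250.0, true_epad, small, big, small_won)
print(epad_hat, ll_base - ll_shifted)
```

The real test would fit ratings and epad jointly with the full model and compare rating differences, but the same generate-then-recover structure applies.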
quote: I would like to have a function that describes epad in dependency of team size without having to treat it as a vector with many dimensions. This would reduce the number of parameters in the system that can be overfitted. That the 4v5 epad is smaller than the 5v6 epad seems to be an outlier due to too little data. For a proof of concept, the epad vector is fine, though. 
I think that at least in the simple model, there's little room for overfitting since each epad component is being fitted to hundreds of games (except 1v2 is a little rare). epad_4v5 does have a lower mean than epad_5v6, but due to the uncertainty expressed in the posterior, it's still fairly consistent with epad_4v5 > epad_5v6 (I should plot the posterior for their difference).
It's still true that we'd probably get a tighter estimate with even fewer parameters. (Also, my epad prior is a bit crazily broad, not ruling out an epad of -2000 or 2000 Elo.)
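A minimal sketch of how the posterior of the difference could be summarized, assuming paired posterior draws of the two components are available as arrays. The draws below are synthetic placeholders with made-up numbers, not my actual results, and `prob_greater` is a hypothetical helper.

```python
import numpy as np

def prob_greater(draws_a, draws_b):
    """Return (fraction of paired draws with a > b, array of differences a - b)."""
    diff = np.asarray(draws_a) - np.asarray(draws_b)
    return float(np.mean(diff > 0)), diff

# Placeholder draws standing in for posterior samples of epad_4v5 and epad_5v6.
rng = np.random.default_rng(1)
epad_4v5 = rng.normal(90.0, 30.0, size=4000)
epad_5v6 = rng.normal(110.0, 30.0, size=4000)

p_4v5_larger, diff = prob_greater(epad_4v5, epad_5v6)
lo, hi = np.percentile(diff, [2.5, 97.5])   # central 95% interval of the difference
print(p_4v5_larger, lo, hi)
```

The draws should be paired by sample index (same row of the posterior), so that any correlation between the two components is preserved in the difference.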
quote: How epad depends on BPST is a bit arbitrary. It's probably good enough to be better than the simple adjusted model though. It is questionable if the advantage is worth the arbitrary complexity. 
True. I tried to have both "low" and "high" ratings well covered by roughly fitting the "rating highness" function to the CDF of the ratings of the BPST, but this plan was foiled by the strong correlation of the BPST Elo with team size.
I didn't go for a linear relation between BPST Elo and epad because it seems a bit implausible that epad should grow without bound as the BPST gets very good or bad, but I'm not sure.
quote: What is the scoring rule for the elpd_loo calculation? For fixed-rating models, I see that elpd_loo ~ 7231*ln(.55). For joint models, elpd_loo ~ 22150*ln(.515). So is the scoring rule ln(probability) summed over the number of games? In this case, it would be nice to calculate 1+elpd_loo/ln(2)/number of games. This would make the score independent of the number of games and score 0 would mean always predicting 50% and score 1 would mean always predicting 100% right. Instead of LOO, you could use only the previous games as training data for every game. But your calculation should be fine, too. It is interesting to see those results.

Yes, this is my understanding too, and I agree it would be better to show the ELPD per game instead of the sum.
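As a small sketch, the proposed per-game score can be computed like this. The function name is hypothetical, and the inputs are just the rough approximations from the quote, not exact results.

```python
import math

def normalized_score(elpd_loo, n_games):
    """1 + elpd_loo / ln(2) / n_games: 0 for always predicting 50%, 1 for always 100% right."""
    return 1.0 + elpd_loo / math.log(2.0) / n_games

# Approximate figures taken from the quote above.
fixed_rating_score = normalized_score(7231 * math.log(0.55), 7231)
joint_score = normalized_score(22150 * math.log(0.515), 22150)
print(round(fixed_rating_score, 4), round(joint_score, 4))  # 0.1375 0.0426
```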
To be precise, I think the ELPD for one game is an approximation of the ln of the expected value (over the leave-one-out posterior) of p(actual winner team wins), which is a bit different from the ln p obtained from a point estimate like the MAP or the mean.
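To make the distinction concrete, here is a sketch that computes three related quantities for a single game: the log of the posterior-averaged win probability, the posterior average of the log probability, and the log probability at the posterior-mean rating difference. It assumes an Elo-style logistic win probability and uses made-up draws of the rating difference (winner minus loser). For simplicity it averages over the full posterior rather than a leave-one-out posterior.

```python
import numpy as np

def win_prob(diff):
    """Elo-style logistic win probability for a rating difference `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

rng = np.random.default_rng(2)
# Hypothetical posterior draws of (winner rating - loser rating) for one game.
diff_draws = rng.normal(100.0, 200.0, size=20000)

p_draws = win_prob(diff_draws)
log_mean_p = float(np.log(p_draws.mean()))             # ln E[p]
mean_log_p = float(np.log(p_draws).mean())             # E[ln p]
plug_in = float(np.log(win_prob(diff_draws.mean())))   # ln p at the posterior mean
print(log_mean_p, mean_log_p, plug_in)
```

As far as I understand the usual LOO definition, elpd_loo sums the first kind of quantity over games. By Jensen's inequality the second is never larger than the first, and the plug-in value ignores the posterior uncertainty entirely.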
Using only previous games to predict the next would be more realistic, but so far I don't know how to do this practically in reasonable time. I would guess this "clairvoyance effect" isn't so bad in this domain though because it isn't much easier to predict intermediate games than to predict the future.