Thanks for the number theory explanations. They were interesting, but above my ability to argue over. However, I still hope it's worthwhile to look at things intuitively.

[quote]
WHR as a system doesn't have much to say about predicting the outcome of teams games, much less teams games with an uneven number of players.
[/quote]

Unfortunately Trueskill 1 doesn't either, AIUI. Apparently Trueskill 2 can, but I don't see an open-source implementation of it. Anyway, we're way off this discussion being relevant :)

[quote]
If trueskill is receiving a negative score, then it is worse than blind guessing on the data you are running it on.
[/quote]

Worse at maximising that expression, yes, but it's hard to imagine it's worse at determining the probability of the outcome of a _random_ match. And yet this rating system is being interpreted to say that Trueskill is worse than saying the outcome of any random matchup is 50:50. That seems like a red flag to me, unless we believe all games are perfectly balanced at 50:50 - and we have strong reason to believe they aren't, since ELO/WHR/Trueskill predicts outcomes significantly better than random.

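To make that concrete, here's a quick sanity check (the numbers are illustrative, not from the real data): under the 1 + log2(pwinner) score, a fixed 50:50 prediction scores exactly 0, while a predictor that picks the winner 58% of the time but always claims 0.9 comes out negative.

```python
import math

def expected_log_score(accuracy, p):
    """Expected 1 + log2(p_winner) when the predicted side actually wins
    with the given accuracy and the prediction is always p for that side."""
    return accuracy * (1 + math.log2(p)) + (1 - accuracy) * (1 + math.log2(1 - p))

print(expected_log_score(0.5, 0.5))   # fixed 50:50 guess scores exactly 0
print(expected_log_score(0.58, 0.9))  # right 58% of the time, but overconfident: negative
print(expected_log_score(0.58, 0.58)) # same accuracy, calibrated prediction: positive
```

So a negative score doesn't necessarily mean worse predictions than a coin flip - it can just mean overconfident ones.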
[quote]
The prediction of 79% is being punished more than the prediction of 51% because it is further away from 65% on the log scale.
[/quote]

Agreed, but I think this causes you to favor worse algorithms (i.e. that would predict the correct outcome less often) because they hedge their bets a bit. It's possible that when balancing teams this bias doesn't matter, since you're talking about different combinations for the same algorithm, and we just want to maximise for p = 0.5. But when comparing _algorithms_ it surely does matter, because an algorithm that is ostensibly better at maximising p = 0.5 might score lower if it tends to estimate win chance too high compared to too low.

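A quick check of that asymmetry, using the 65%/79%/51% numbers from the quote above: if the true win probability is 0.65, a fixed prediction of 0.51 gets a better expected 1 + log2(pwinner) score than a fixed prediction of 0.79, even though both miss the truth by exactly 0.14.

```python
import math

def expected_score(true_p, predicted_p):
    # Expected 1 + log2(p_winner) when team 1 truly wins with probability
    # true_p and we always predict predicted_p for team 1.
    return (true_p * (1 + math.log2(predicted_p))
            + (1 - true_p) * (1 + math.log2(1 - predicted_p)))

print(expected_score(0.65, 0.79))  # overshoots by 0.14 -> negative
print(expected_score(0.65, 0.51))  # undershoots by 0.14 -> positive, scores higher
print(expected_score(0.65, 0.65))  # the true probability maximises the expected score
```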
[quote]
If something very unexpected happens, this means that a lot of information in your prediction system might be wrong.
[/quote]

I can see why it is desirable to punish very bad predictions in a nonlinear way, I just feel like the punishment for deviation from the actual value ought to be symmetric.

What about something like:

```
import math
from statistics import mean

# p is this algorithm's predicted win probability for team 1, so each entry
# is the probability that was given to the team that actually won
predictions.append(p if winner == team1_id else 1 - p)
...
# fraction of games where the actual winner was given > 0.5
success_rate = [x > 0.5 for x in predictions].count(True) / len(predictions)
score = success_rate - 0.5 - mean([math.pow(success_rate - x, 2) for x in predictions])
```

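For anyone who wants to play with it, here's the same idea wrapped into self-contained functions alongside the original score (the toy numbers below are made up, not from the real data):

```python
import math
from statistics import mean

def adj_score(winner_probs):
    """winner_probs[i] is the probability the algorithm gave to the team
    that actually won game i (the p or 1 - p from the fragment above)."""
    success_rate = sum(x > 0.5 for x in winner_probs) / len(winner_probs)
    return success_rate - 0.5 - mean((success_rate - x) ** 2 for x in winner_probs)

def log_score(winner_probs):
    """The original 1 + log2(p_winner) score, averaged over games."""
    return mean(1 + math.log2(x) for x in winner_probs)

# Toy data: both predictors call 3 of 4 games correctly, but one is bolder.
confident = [0.8, 0.8, 0.8, 0.2]
timid = [0.55, 0.55, 0.55, 0.45]
print(log_score(confident), log_score(timid))
print(adj_score(confident), adj_score(timid))
```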
With some real data...

score: 1 + log2(pwinner)
adj score: as above

Fixed p of 0.5402115158636898 (actual win rate for team 1)
---------------------------------------------------------
success rate: 0.5402115158636898
score: 0.004670620126253189
adj score: 0.037237666464714145

Trueskill
---------
success rate: 0.5811515863689777
score: -0.019440827083669232
adj score: 0.038751475717060516

So Trueskill appears to predict results better (0.58 vs 0.54), but gets a worse, and negative, "score". Do we really think Trueskill is a worse balancing system than a fixed p = 0.5402115158636898 for any Team 1? OTOH the suggested fudge in "adj score" says Trueskill is marginally better.
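Aside: the squared-deviation idea in "adj score" is close to an established symmetric scoring rule, the Brier score, which measures squared deviation from the actual outcome (1 for the winner) rather than from the success rate. Its expected penalty works out to (p - q)^2 plus a constant, where q is the true win probability, so in expectation it punishes predicting 0.14 too high and 0.14 too low equally. A sketch, using the same winner-probability convention as the code above:

```python
from statistics import mean

def brier(winner_probs):
    # winner_probs[i]: probability given to the team that actually won game i.
    # The winner's outcome is 1, so the squared error is (1 - p)^2. Lower is better.
    return mean((1 - p) ** 2 for p in winner_probs)

print(brier([0.5, 0.5]))  # coin-flip baseline: exactly 0.25
```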