On the back of our attempts at a more neutral ranking scheme for the World Cup national teams, and some subsequent conversation on the Rollerderby subreddit and Twitter, we've made a few tweaks to our ranking algorithm, and fed it the games from the European Roller Derby Tournament and Road To Dallas as well as the World Cup games it had before. [Technical notes: we've been persuaded that the topological sorting constraint is overly aggressive in asserting groupings from the initial ground truth, which adds some inflexible structure to the sort. Instead, we now iterate the inference rankings to self-consistency and sort on the relative strengths in the self-consistent inference matrix, using the final inferences only. Code for this has been added to the repository here.]
We hear that there was some controversy when our previous rankings were published, although none of the teams have actually contacted us to make any comment. (The only direct comments we've had have been from people expressing interest in the statistical methods used, which were chosen to be neutral. In fact, this article was written partly at the request of a number of people who wanted to see what the inclusion of additional datasets did to our ranking calculations.) We, of course, welcome any criticism of our methods and discussion of the results of our analysis.
In order to allay any possible sense of bias being present in what was designed as a neutral computational statistical ranking, we have also calculated the least squares rankings for the National Teams from the same data set (plus the Italy v Switzerland result from the same timeframe), with both score difference and log(score ratio)1 rankings. In this, we follow Massey, considering that least squares rankings are generally held to be amongst the most accurate ranking methods for predictive sports ranking.2
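As a rough sketch of what the two fits assume, the models can be written as below. The notation is ours, chosen for illustration: $s_i$ and $s_j$ are the final scores of teams $i$ and $j$ in a game, $r_i$ is a rating in points, and $p_i$ is a multiplicative strength. None of these symbols are taken from Massey's own text.

```latex
% Score-difference model: the margin of a game estimates the rating gap.
s_i - s_j \approx r_i - r_j

% Score-ratio model: strengths multiply, so taking logs makes the model linear
% in the unknowns \log p_i, which is what a least squares fit requires.
\frac{s_i}{s_j} \approx \frac{p_i}{p_j}
\quad\Longrightarrow\quad
\log\frac{s_i}{s_j} \approx \log p_i - \log p_j

% In both cases the fit minimises the summed squared residuals over all games,
% shown here for the score-difference model.
\min_{r} \sum_{\text{games } (i,j)} \bigl( (s_i - s_j) - (r_i - r_j) \bigr)^2
```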
[Technical notes: The approach used for least squares ranking was to form the usual matrix of games from Massey, rescaling scores from the 40 minute bouts by 60/40 to estimate the score difference in a full game. For score ratios, this was not necessary, as ratios are scale-invariant measures (which is the reason we decided to use them for our metric), but we do apply the same cap on blowouts as in our own measure, treating a zero-scoring team as if it had really scored 1/2 a point. In the National Teams dataset, this only affects the Sweden-Japan game: altering the game to give a single point to Japan instead does not affect the final placement of Sweden by the ranking, although it does adjust their predicted power. We prepared data using Python scripts for processing in GNU Octave for the least squares fit; the scripts used are also available in the same repository as our ranking code.]
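The steps in the technical note can be sketched in a few lines of Python. This is a minimal illustration, not the repository's scripts (which hand the fit to GNU Octave): the game list is hypothetical, the function names are ours, and pinning the top team to 0 (or 1.0 for ratios) is an assumption based on how the tables present their powers.

```python
import math

# Hypothetical results: (team_a, team_b, score_a, score_b, minutes played).
GAMES = [("A", "B", 200, 100, 60), ("B", "C", 100, 40, 40), ("A", "C", 250, 60, 60)]

def score_difference(sa, sb, minutes):
    # Rescale short bouts by 60/minutes to estimate a full-game margin.
    return (sa - sb) * 60.0 / minutes

def log_score_ratio(sa, sb, minutes):
    # Ratios are scale invariant, so the game length is ignored.
    # A zero score is treated as half a point to cap blowouts.
    return math.log(max(sa, 0.5) / max(sb, 0.5))

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def massey(games, margin):
    """Least squares ratings from the normal equations of the game matrix."""
    teams = sorted({t for g in games for t in g[:2]})
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for ta, tb, sa, sb, minutes in games:
        i, j, y = idx[ta], idx[tb], margin(sa, sb, minutes)
        A[i][i] += 1.0
        A[j][j] += 1.0
        A[i][j] -= 1.0
        A[j][i] -= 1.0
        b[i] += y
        b[j] -= y
    # Ratings are only defined up to a constant, so replace the last
    # equation with the constraint that the ratings sum to zero.
    A[-1] = [1.0] * n
    b[-1] = 0.0
    r = solve(A, b)
    top = max(r)
    return {t: r[idx[t]] - top for t in teams}  # top team pinned to 0

diff_power = massey(GAMES, score_difference)
ratio_power = {t: math.exp(v) for t, v in massey(GAMES, log_score_ratio).items()}
```

Here `diff_power` reads as an expected full-game score difference against the top team, and `ratio_power` as an expected score ratio against the top team, matching the two "Power" columns in the tables.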
Our modified iterative ranking predicts the following ranking of teams, shown alongside the old algorithm for comparison, together with the Least Squares ranks4. We've colour coded the teams by their divergence from the official B&T World Cup rankings for interest (Black is the same as B&T, Green is higher ranked than B&T, Red is lower ranked than B&T).
| Rank | Topological Sort (old model) | Iterative consistency (new model) | Least Squares Rank with respect to Score Difference (Power is expected score diff) | Least Squares Rank with respect to Score Ratio (Power is expected score ratio) |
|---|---|---|---|---|
| 1 | USA | USA | USA 0.0 | USA 1.0 |
| 2 | England | England | England -191.00 | England 0.3009 |
| 3 | Australia | Australia | Australia -213.273 | Australia 0.2700 |
| 4 | Canada | Sweden | Canada -294.666 | Sweden 0.1516 |
| 5 | Finland | Canada | Sweden -380.901 | Canada 0.1317 |
| 6 | Sweden | NewZealand | Finland -460.143 | NewZealand 0.06539 |
| 7 | Scotland | Finland | NewZealand -465.461 | Finland 0.05634 |
| 8 | Argentina | France | France -539.895 | France 0.04562 |
| 9 | Ireland | Ireland | Germany -554.373 | Ireland 0.04243 |
| 10 | France | Germany | Scotland -567.636 | Germany 0.04091 |
| 11 | Germany | Scotland | Argentina -570.828 | Scotland 0.03381 |
| 12 | NewZealand | Argentina | Ireland -572.967 | Argentina 0.03077 |
| 13 | Belgium | Netherlands | Belgium -721.009 | Belgium 0.01542 |
| 14 | Norway | Belgium | Norway -729.422 | Netherlands 0.01384 |
| 15 | Netherlands | Spain | Spain -743.184 | Norway 0.01355 |
| 16 | WestIndies | Norway | Chile -744.361 | Chile 0.01341 |
| 17 | Wales | Wales | Wales -748.199 | Wales 0.01295 |
| 18 | Denmark | Chile | Denmark -752.251 | Denmark 0.01169 |
| 19 | Colombia | Denmark | Netherlands -759.382 | Spain 0.01157 |
| 20 | Spain | WestIndies | Colombia -768.326 | WestIndies 0.01128 |
| 21 | Greece | Colombia | WestIndies -788.968 | Colombia 0.009998 |
| 22 | Brazil | Greece | Mexico -870.077 | Greece 0.006950 |
| 23 | Chile | Brazil | Greece -886.9503 | Brazil 0.005783 |
| 24 | Italy | SouthAfrica | Portugal -887.2898 | Portugal 0.005648 |
| 25 | Mexico | Portugal | SouthAfrica -893.530 | SouthAfrica 0.005354 |
| 26 | Portugal | PuertoRico | Brazil -895.490 | Mexico 0.005206 |
| 27 | SouthAfrica | Mexico | PuertoRico -944.083 | PuertoRico 0.005150 |
| 28 | Japan | Switzerland | Switzerland -955.1116 | Switzerland 0.004456 |
| 29 | PuertoRico | Italy | Italy -975.691 | Italy 0.003946 |
| 30 | Switzerland | Japan | Japan -1062.202 | Japan 0.001298 |
We also see that it is still unambiguous that Germany was unfairly relegated from the Top 16 due to poor group selection in the World Cup (another three big upward movers that we missed before are Spain, Greece and Chile, with Spain looking like another possible Top 16 team). We're particularly impressed by the performance of Greece, as they had very little practice time before the World Cup itself.
The "amount of divergence" from the B&T tournament ranking increases as we move away from the top ranks, which is precisely as expected for a single-elimination tournament ranking. (The case of Wales, which is incongruous in being ranked precisely as B&T does for all of the rankings, is probably due to it being a pivot point, on the edge of the Top 16 ranking. As our rankings are all tournament-neutral, there's a tendency for teams to shuffle ranking relative to the tournament, pivoting around the tournament boundaries. We also see this around the Top 8 boundary, with more fuzz due to the higher concordance with the B&T rankings in general. That is: this apparent structure is a reflection of the tournament structure itself, rather than the ranking methods here, which are structureless except for the Topological Sort.)
In general, the pure ratio based models (our new model and log(ratio) least squares) agree with high correlation for the majority of the table, with the ranks around the 15th to 19th positions showing the worst concordance. As both methods are global optimisation schemes, we'd expect them to agree substantially on the rankings, given the same metric; the least squares method has the advantage of executing substantially more quickly! The topological sort has the highest rank disagreement with the other models, although it agrees with some general properties of the ordering (it is the only sort to agree generally with some of the B&T Cup ranking properties, as it tends to lower the rank of teams who played fewer games, a property enforced on some teams by the tournament structure itself). The linear score difference least squares is also surprisingly congruent with the ratio-based metrics, outside of a few anomalies like the lower placement of the Netherlands, and it does tend to uprank and downrank the same teams as the ratio methods, relative to the B&T tournament ranks.
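One way to put a number on the agreement between the two ratio-based rankings is Spearman's rank correlation. The sketch below is our own illustration rather than part of the published analysis: the team orders are copied from the table above, and the choice of Spearman's rho as the summary statistic is an assumption.

```python
# Team orders copied from the "Iterative consistency (new model)" and
# "Least Squares ... Score Ratio" columns of the Women's table.
NEW_MODEL = (
    "USA England Australia Sweden Canada NewZealand Finland France Ireland "
    "Germany Scotland Argentina Netherlands Belgium Spain Norway Wales Chile "
    "Denmark WestIndies Colombia Greece Brazil SouthAfrica Portugal PuertoRico "
    "Mexico Switzerland Italy Japan"
).split()

LOG_RATIO_LS = (
    "USA England Australia Sweden Canada NewZealand Finland France Ireland "
    "Germany Scotland Argentina Belgium Netherlands Norway Chile Wales Denmark "
    "Spain WestIndies Colombia Greece Brazil Portugal SouthAfrica Mexico "
    "PuertoRico Switzerland Italy Japan"
).split()

def rank_gaps(order_a, order_b):
    """Map each team to (position in order_a) - (position in order_b)."""
    pos_b = {team: i for i, team in enumerate(order_b)}
    return {team: i - pos_b[team] for i, team in enumerate(order_a)}

def spearman_rho(order_a, order_b):
    """Spearman's rank correlation for two orderings with no tied ranks."""
    n = len(order_a)
    d2 = sum(gap * gap for gap in rank_gaps(order_a, order_b).values())
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

gaps = rank_gaps(NEW_MODEL, LOG_RATIO_LS)
rho = spearman_rho(NEW_MODEL, LOG_RATIO_LS)            # about 0.994
biggest_mover = max(gaps, key=lambda team: abs(gaps[team]))  # Spain, 4 places
```

The squared rank gaps sum to 28 over the 30 teams, and the largest single disagreement is Spain (15th against 19th), which sits in the 15th to 19th band noted above.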
The pattern of up and down ranked teams in the 8-16 rank positions, with substantial agreement across all of the three latter rankings, is largely a consequence of the "score difference from last bout" ranking chosen by B&T. As we mentioned in other comments, there are issues with such a ranking mechanism, as score-difference is only a measure of the relative skill difference between two teams, not an absolute measure. As the difference in skill in the top 8 is unambiguously large, the score difference for 8-16 rank teams can be dominated by which of the Top 8 teams they played, rather than the actual difference in skill within the 8-16 rank.
On the basis of this comparison, and for additional interest, we also calculated Least Squares rankings for the Men's National Teams who attended the Men's Roller Derby World Cup 2014. Again, we've colour coded for alterations relative to the official tournament rankings; as the MRDWC2014 allowed draws for 7th and 11th places, we've half-coloured teams which are ranked in the "7,8"th places or "11,12"th places when MRDWC2014 assigned them to the drawn 7th and 11th positions.
| Rank | Rank with respect to Score Difference (Power is expected score difference) | Rank with respect to Score Ratio (Power is expected score ratio) |
|---|---|---|
| 1 | USA 0.0 | USA 1.0 |
| 2 | England -160.067 | England 0.273 |
| 3 | Canada -196.457 | Canada 0.267 |
| 4 | France -288.076 | France 0.132 |
| 5 | Australia -414.936 | Australia 0.0664 |
| 6 | Wales -425.478 | Wales 0.0531 |
| 7 | Argentina -490.412 | Argentina 0.0394 |
| 8 | Finland -566.641 | Scotland 0.0271 |
| 9 | Scotland -573.354 | Ireland 0.0236 |
| 10 | Ireland -576.750 | Finland 0.0147 |
| 11 | Belgium -676.660 | Belgium 0.0136 |
| 12 | Germany -723.834 | Netherlands 0.00898 |
| 13 | Netherlands -738.020 | Germany 0.00791 |
| 14 | Sweden -788.960 | Japan 0.00630 |
| 15 | Japan -905.176 | Sweden 0.00491 |
Returning to the Women's National Teams, our main conclusion is that we would really like to see Canada play Sweden at some point in the near future (and Finland take on Sweden in a rematch). In general, we'd like to promote the use of fairer ranking schemes, and more thought in planning large tournaments in order to encourage the fairer ranking of those competing. The example of MRDWC2014 shows that it is quite possible to manage a tournament, with care, to maximise the neutrality of the contest, whilst still admitting other constraints (such as getting teams from different geographical locations to play each other).
1We have to use log(score ratio) for a least squares regression to make the measure linear. This issue with the linearity requirement was one reason we didn't adopt a least squares method for our initial inference model.
2For example, Sports Rankings REU Final Report 2012 notes that least squares minimisation provides the most accurate predictive rankings for Basketball and Football out of all of the (non-simulational) methods they compare, and the Bracketology review of College Basketball rankings and this comprehensive analysis of ranking predictive systems across many sports also show that "Massey"/least squares methods have good predictive power. Even in a comparison of football team prediction, where home-team advantage is not modelled by simple Massey predictions, it is still one of the best "simple" models tested. This is unsurprising3, as least squares regression is one of the most tested means of statistical modelling of (linear) functions in modern science.
3Of course, least squares methods, like our inference scheme, assume that "superiority at a game" is a transitive condition, which is not necessarily true in sports (you can imagine a team whose tactics are simply ill-suited to an opponent of similar ability). However, the real world performance tests of the method suggest that transitivity does hold strongly enough in many sports for least squares methods to provide good metrics.
4 FlatTrackStats uses Elo ranking methods instead, which do not assume transitivity, and have similar performance properties to Massey least squares rankings (the FTS algorithm uses a normalised score difference method, slightly different to our pure ratio, to determine team strength, for the same "scale-invariance" property that we value5, and also applies a non-Gaussian error estimator). Elo rankings tend to perform better with lots of contests between players, as the estimator works by "transferring" points from a losing team to a winning one. This also means that it scales better with huge numbers of contests - it's a good choice for FTS to use, given the size of their bout database. Global estimator methods, like linear least squares, are better suited to tournament style prediction, however, where the number of contest pairs is small compared to the total space, and the games are all played in a relatively short timeframe (and there's no home-field advantage). The supplied Python scripts also generate a ranking based on the FTS normalised score difference using least squares optimisation, so the interested reader can generate the pseudo FTS ranking themselves. We don't publish it here to avoid filling the table with too many very similar results (the rankings produced are generally half-way between the score difference and pure ratio least squares rankings, with the only significant deviation being a particularly low estimated ranking for Portugal, which we don't really understand).
5Direct evidence in favour of scale-invariant ratio measures like log(ratio) and FTS style "normalised score difference" comes from the Men's Roller Derby World Cup. Belgium and Japan faced each other twice during the tournament, once in the group stage and once in a full length bout, as did Germany and Ireland. Computing the ratio of scores and the normalised difference of scores for both bouts produces estimated relative strengths for the two teams which match very well (almost perfectly for Germany/Ireland, and within 20% for Belgium/Japan, where we would expect a higher disparity due to Japan's own rapid skill development). Computing the score difference does much more poorly (off by 100%+ in both cases)!
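The scale-invariance argument can be illustrated with a small sketch. The scores below are hypothetical, chosen so that the two bouts have the same scoring rate; they are not the actual Belgium/Japan or Germany/Ireland results. The "difference over sum" form used for the normalised score difference is also an assumption about the FTS-style metric, not something stated in this article.

```python
def score_ratio(a, b):
    return a / b

def normalised_difference(a, b):
    # Assumed form of an FTS-style normalised score difference.
    return (a - b) / (a + b)

def score_difference(a, b):
    return a - b

# Hypothetical pair of bouts between the same two teams at the same
# scoring rate: a 40 minute group game and a 60 minute full-length game.
short_game = (120, 60)
full_game = (180, 90)

results = {
    metric.__name__: (metric(*short_game), metric(*full_game))
    for metric in (score_ratio, normalised_difference, score_difference)
}

for name, (short, full) in results.items():
    print(f"{name}: short={short:.3f} full={full:.3f}")
```

With these made-up scores the ratio and the normalised difference are identical across the two bouts, while the raw score difference grows by half simply because the second game was longer.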