An Examination of Pairwise Comparisons of College Open Ultimate Frisbee Teams
Michael Silger
11/28/2012
Under the direction of Dr. Christopher Wikle Department of Statistics University of Missouri
146 Middlebush Hall, Columbia, MO 65211, United States
1.1 The importance of an accurate ranking system
1.2 The current ranking method
1.3 Scrutiny of the current ranking method
1.4 An alternative approach to ranking frisbee teams
2 Data
3 Methodology
3.1 Explanation of Elo Algorithm
3.2 Changing the Update Parameter to a Function
3.3 Explanation of Assumptions for Proposed Algorithm
4 Results
5 Conclusion
6 Discussion
References
1.4 An alternative approach to ranking frisbee teams
A different approach to ranking frisbee teams should be considered, using a model that addresses the concerns raised about the USAU algorithm. The model should reflect the same principles exhibited by participants in ultimate frisbee, such as the variability of a team's performance. Variability can be captured using statistical models; one of the first models developed for pairwise comparisons was created by Arpad Elo for the game of chess. The Elo (1978) model can measure pairwise comparisons of high dimensionality across a given time period and calculate a rating of relative strength for each competitor. Extensions of the Elo model can be seen in more complex systems such as Glickman's (1993) Glicko rating system. The biggest difference between the two models is that the Glicko system includes an estimate of the reliability of a contender's rating. I chose the Elo method for its simplicity of implementation and programming.
For the game of chess, Elo (1978) starts each competitor with a provisional rating period of about 30 games. Once the provisional period ends, a player is given a rating reflective of their performance over that span. That rating follows a distribution whose mean equals the rating and whose variance is a measure of reliability; Elo (1965) and McClintock (1977) show that the many performances of an individual will be normally distributed on an appropriate scale. The player's rating is updated on a continuous or periodic basis and will eventually converge to the player's true strength. An unexpected result could be the consequence of a statistical fluctuation or an actual change in the player's ability; thus, new performances are weighted according to how much importance is given to past performances. The weight is a measure of
reliability for a team’s distribution. I use the Elo algorithm as the basis for my rating
algorithm with some minor modifications to the assumptions in the model. The specifics
of the Elo algorithm and my adjustments to its assumptions will be discussed in detail in
the methodology section.
2. Data
The data used in this analysis were collected from www.usaultimate.org in the month of
June 2012 and represent the 2012 USAU open sanctioned tournaments. The data were not
readily accessible and had to be harvested using AutoHotKey, a program that allows the
user to automate keystrokes. Using AutoHotKey, I was able to copy all of the information
on the page for each competitor and paste it into a text file. I then converted the text file
to an excel document and parsed the data to obtain the information necessary to create
two data sets. The first dataset was in correspondence with all of the teams listed under
the RRI webpage (450 teams) while the second dataset included only USAU sanctioned
teams (371 teams). It was necessary to evaluate two separate datasets to obtain a strict
comparison between the proposed algorithm and the USAU algorithm. A comparison is
also done between the USAU sanctioned teams and teams that competed in USAU
sanctioned tournaments to determine if the exclusion of additional teams affects the final
rankings. Similar results are expected when comparing the two datasets. Irrelevant
information was omitted; for instance, several outcomes were listed F-F or F-L, both of
which have no merit when considering the winner of a game. Game outcomes with a
single score reported were not included and are shown as 9-_. A sample of the data set is
included in Figure 2.
Figure 2: The column labeled RatePer represents the rating period in which the contest took place; a RatePer corresponding to 1 stands for the first week of competition.
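As an illustration of the cleaning step described above, the following is a minimal sketch assuming the parsed games carry a single Score column; the column name and layout are assumptions for illustration, not the actual AutoHotKey output.

```python
# Minimal sketch of the data-cleaning step: drop forfeit outcomes
# (F-F, F-L) and games with only a single reported score (e.g. '9-_').
# The "Score" column name is an assumption about the parsed layout.
import pandas as pd

def clean_games(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only games with two reported scores and a true winner."""
    forfeits = df["Score"].isin(["F-F", "F-L"])
    partial = df["Score"].str.contains("_", regex=False)
    return df[~(forfeits | partial)].reset_index(drop=True)

games = pd.DataFrame({"Score": ["15-10", "F-F", "9-_", "13-11", "F-L"]})
print(clean_games(games))  # keeps only the 15-10 and 13-11 games
```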
3. Methodology
3.1 Explanation of Elo Algorithm
Elo (1978) defines a proper rating system as one that can effectively rank teams and
provide a measure of the relative strength of competitors, however strength may be
defined. The initial assumption in the model is that each competitor’s rating follows a
normal distribution. There are different approaches proposed by Elo (1978) for instituting
pairwise comparisons. I chose to use his method of continuous rating updates, as I will be
calculating the rating of each competitor on a weekly basis. The continuous rating
method is given by the formula:
$$R_n = R_o + K(W - W_e), \qquad (3)$$

where $R_n$ is the new rating and mean of the player's distribution, $R_o$ is the old rating, $K$ is the weighting function, and $W$ is a binary variable taking the value 0 for a loss or 1 for a win. In addition, $W_e$ for a particular team is the expected value of winning a game against some rated opponent, with rating $R_{opp}$, given by:

$$W_e = \frac{1}{1 + 10^{(R_{opp} - R_o)/400}}. \qquad (4)$$
From (3), the new rating is updated by adding points to or subtracting points from the old rating based on the game outcome. Because $W$ is binary, the winner of a game will always gain points and the loser will always lose points. This is important because, regardless of the score of a game, the winning team should not lose rating points. Another central aspect of the Elo model is that the rating update depends in part on the expected outcome of the game. If a team is rated far higher than its opponent, this is reflected in the expected value calculation: the winner of the contest will still gain points, but a high expected value of winning results in a small point gain. This accurately portrays the structure of an ultimate frisbee tournament, as highly rated teams will play lower rated competition in pool play². It also deters teams from playing only low rated competition to build a ranking that may not accurately represent their strength.
During the season, $R_n$ will move toward the mean of its normal distribution. The variance of this normal distribution is needed to model the consistency of each team and, under Elo's methodology, reflects the weight of a performance in a given time period. In this case, $K$ is defined as a weight that blends the new result with the previous rating. Under the Elo model, this weight lies between 10 and 32, with larger values giving more influence to recent results.
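As an illustration, the continuous update in (3) and (4) can be implemented in a few lines. This is a minimal sketch assuming the logistic expected-score curve in (4) with the conventional 400-point scale; the function names are illustrative.

```python
# Minimal sketch of the continuous Elo update in equations (3) and (4).
# Assumes the logistic expected-score curve with the conventional
# 400-point scale; names are illustrative only.

def expected_score(r_old: float, r_opp: float) -> float:
    """W_e: expected value of beating an opponent rated r_opp, eq. (4)."""
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_old) / 400.0))

def elo_update(r_old: float, r_opp: float, won: bool, k: float = 24.0) -> float:
    """R_n = R_o + K(W - W_e), eq. (3). W is 1 for a win, 0 for a loss."""
    w = 1.0 if won else 0.0
    return r_old + k * (w - expected_score(r_old, r_opp))

# A 1500-rated team upsets a 1600-rated team: the winner gains more
# than K/2 points because its expected score was below one half.
print(elo_update(1500, 1600, won=True))  # ≈ 1515.4
```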
As a modification to the standard algorithm, I wanted to incorporate the score differential
into the model because it contains a significant amount of information. In an attempt to
include the game score in the algorithm, I looked at the model from a Bayesian
perspective with a normal likelihood and a normal prior. Each team’s rating is assumed to
² Pool play is a round robin style tournament among a small group of teams, usually four or five, often taking place on the first day of competition in a 2-day tournament.
be the result of a normal distribution, so the justification for a normal likelihood is
obvious. The prior distribution quantifies our a priori understanding of the unobservable
quantities of interest (Wikle 2007) and should also be considered normal. The posterior
mean can then be written as:
$$E[X \mid \mathbf{y}] = \mu + \frac{\tau^2}{\tau^2 + \sigma^2/n}\,(\bar{y} - \mu) = \mu + K(\bar{y} - \mu). \qquad (5)$$
The prior mean ($\mu$) is adjusted toward the sample mean estimate ($\bar{y}$), where $\tau^2$ is the variance of the prior, $\sigma^2$ is the variance of the sample estimate, and $n$ is the number in
the sample. The K function in (5) is a ratio of the variances from the likelihood function
and the prior information. This is very similar to our Elo algorithm and serves as the
motivation for a different update function for K.
The ratio of variances in the context of rating teams is hard to define. The variance is understood as the under- and over-performance of a team, but there is no mathematical measure to show that a team has over- or under-performed aside from differences between ratings and opinion. Essentially, the ratio of variances weights the reliability of the new information using a priori knowledge. The score of the game is part of that a priori knowledge, which led to the decision to incorporate the score differential into the weighting update function.
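The shrinkage in (5) is easy to verify numerically. The following is a minimal sketch assuming a normal prior and a normal likelihood with known variances; all values and names are illustrative.

```python
# Minimal numeric check of the posterior mean in equation (5):
# a normal prior N(mu, tau^2) combined with n observations of
# variance sigma^2 shrinks the sample mean toward the prior mean.

def posterior_mean(mu: float, tau2: float, ybar: float,
                   sigma2: float, n: int) -> float:
    """E[X | y] = mu + K*(ybar - mu), with K = tau^2 / (tau^2 + sigma^2/n)."""
    k = tau2 / (tau2 + sigma2 / n)
    return mu + k * (ybar - mu)

# Prior rating 1500 with variance 100^2; five games suggest a mean
# performance of 1600 with per-game variance 200^2.
print(posterior_mean(mu=1500, tau2=100**2, ybar=1600, sigma2=200**2, n=5))
# K = 10000 / (10000 + 8000) ≈ 0.556, so the posterior mean is ≈ 1555.6
```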
3.2 Changing the Update Parameter to a Function
When constructing the updated weighting function, I first tried to avoid some of the errors I believe to be present in the USAU algorithm. The primary issue with their updating scheme is that the curve of the points gained is concave: the value of scoring a point on the eventual winner decreases as the game becomes closer. To correct this, I propose a point allocation function of the final score, defined in terms of the following quantities:

UnvPtWin = points awarded for winning on universe point³,
TotalPt = total points possible awarded,
p = percentage of points awarded,
diff = the score differential of the game,
WinScore = the number of points the winner scored.
The proposed allocation scheme corrects the concavity of the K weighting function, as seen in Figure 3. Figure 3 also illustrates that there is no cutoff value for beating a team; instead, all of the information indicated by the final score of the contest is taken into account. Table 2 shows that the marginal point allocation increases as the game becomes closer, and there are no unexpected outlying values. Forfeits are also included in the model and are treated as the maximum number of points gained for the winner. I allotted 200 total possible points, with 50 points awarded for a universe point win and a p of 0.80. These values were chosen on a subjective assessment of point allocation in order to clearly discern differences in the score; they scale down the point allocation relative to the USAU algorithm to avoid over-inflating the K weight values and to help retain the normality assumption. To check the effect of converting the weighting parameter to a function, results were also calculated with K fixed at 24.
³ Universe point occurs when two teams are tied and the next point will win the game for either team.
Figure 3: The proposed allocation’s rating point allocation scheme assuming the winning score is 15
Table 2: A table showing the marginal difference in the proposed rating points gained. The left column represents the two scores that are being compared.
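For concreteness, one allocation with this shape is a power-law interpolation between the stated anchors: 50 points for a universe point win, 200 points at the maximum margin or a forfeit, with exponent p = 0.80. The sketch below is illustrative only; the power-law form is an assumption that reproduces the behavior described above, not necessarily the exact allocation used in the analysis.

```python
# Illustrative sketch only: a power-law point allocation matching the
# stated anchors (50 points at a universe-point win, 200 at the maximum
# margin or a forfeit, exponent p = 0.80). The exact allocation used
# in the analysis may differ in functional form.

UNV_PT_WIN = 50.0    # points for winning on universe point
TOTAL_PT   = 200.0   # maximum points possible (also used for forfeits)
P          = 0.80    # exponent controlling curvature

def points_awarded(diff: int, win_score: int = 15) -> float:
    """Points gained by the winner for a given score differential."""
    if diff <= 0:
        raise ValueError("winner must outscore loser")
    frac = (diff - 1) / (win_score - 1)   # 0 at universe point, 1 at max margin
    return UNV_PT_WIN + (TOTAL_PT - UNV_PT_WIN) * frac ** P

# With p < 1, the marginal allocation grows as the game gets closer:
for d in (1, 2, 3, 14, 15):
    print(d, round(points_awarded(d), 1))
# 1 -> 50.0, 2 -> 68.2, 3 -> 81.6, 14 -> 191.4, 15 -> 200.0:
# the 2-vs-1 gap (≈18.2) exceeds the 15-vs-14 gap (≈8.6), as in Table 2.
```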
3.3 Explanation of Assumptions for Proposed Algorithm
A continuous rating system updates a new rating after a single game; subsequently, each new rating is blended into the old rating. Elo (1978) proposes different methods of blending, but I discuss an alternative approach below. Elo (1978) suggests that long events should be divided into ratable segments for each application of (3). My methodology considers the entire weekend tournament to be one segment, treating games played on Saturday as equivalent to those played on Sunday. This assumption has some weaknesses, but I believe its merits outweigh the drawbacks. A typical 2-day tournament seeds each attending team according to its perceived strength before either pool play or immediate bracket play commences. The results of pool play are then used to reseed the teams and place them into bracket play for varying places on Sunday.
It is intuitive that pool play and bracket play are dependent on one another, but I believe that teams performing to expectations will play similarly rated competition in placement games. If a team over- or under-performs, it is reflected in the results of individual games rather than in the final placement from the weekend. Therefore, each game will be treated as independent of the next.
Another consideration when examining the structure of ratable segments is the order of the games played and how that sequence can affect a team's rating. For instance, if a team is overrated and loses to a low rated opponent in its first game, the low rated opponent reaps the reward of playing the over-ranked team first. To combat rating points gained as a result of game order, I randomly sample the matches at each tournament without replacement, which I believe nullifies any order-imposed gains a team may receive. I sample without replacement 100 times, which I believe is sufficient, and then blend the 100 ratings by taking the mean rating for each team, which yields the new rating ($R_n$). The integrity of the normality assumption of the rating is preserved, and any rating points gained from the order of games are forgone.
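The following is a minimal sketch of this order-randomization step, assuming games are stored as (winner, loser) pairs and an Elo update as in Section 3.1; all names are illustrative.

```python
# Sketch of the order-randomization step: rate each tournament under
# 100 random game orders and blend the results by averaging, so no
# team gains rating points purely from the sequence of its games.
import random
from collections import defaultdict

def _update(r, r_opp, w, k=24.0):
    """One Elo update, eq. (3), with the logistic expected score (4)."""
    w_e = 1.0 / (1.0 + 10.0 ** ((r_opp - r) / 400.0))
    return r + k * (w - w_e)

def rate_tournament(games, base_ratings, n_orders=100):
    """games: list of (winner, loser) pairs; base_ratings: {team: rating}."""
    sums = defaultdict(float)
    for _ in range(n_orders):
        ratings = dict(base_ratings)
        for winner, loser in random.sample(games, len(games)):
            rw = _update(ratings[winner], ratings[loser], w=1.0)
            rl = _update(ratings[loser], ratings[winner], w=0.0)
            ratings[winner], ratings[loser] = rw, rl
        for team, r in ratings.items():
            sums[team] += r
    # The blended new rating R_n is the mean over the sampled orderings.
    return {team: s / n_orders for team, s in sums.items()}
```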
The Elo algorithm requires a provisional rating period of 30 games before a participant is given a rating. Many teams cannot satisfy this condition, as USAU only requires a team to participate in ten games to be included in the ranking system. Therefore, in lieu of a provisional rating period, a reiterative process similar to USAU's algorithm is used. Reiteration helps account for the opponent's strength at the time of the match; it helps avoid over- and under-ranked teams by rerating previous encounters with the inclusion of the new information provided by the new rating ($R_n$). Once a ratings period is completed (say $R_1$), its ratings are used as the initial values for the first week, and the ratings are then recomputed up through the current ratings period (say $R_{1,new}$). The rank order of the teams under $R_1$ and $R_{1,new}$ is then compared using the Spearman rank correlation. If the correlation is above 0.99, the process continues on to the next week, where the reiteration repeats. The Spearman rank correlation is used to help reduce the number of reiterations required: once the rank order of teams has stabilized, there is no reason to reiterate further, which would unnecessarily spread the ratings of teams and waste computing time. The Spearman rank correlation is the most intuitive measure here because we are concerned with the ranked order of the teams, and it provides a measure of agreement between iterations. To check the robustness of the reiteration process, Kendall's tau could also be considered.
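The following is a minimal sketch of the stopping rule, assuming a rate_through_week function that recomputes ratings from the first week given starting values; that interface is an assumption for illustration.

```python
# Sketch of the reiteration stopping rule: rerate all weeks using the
# latest ratings as initial values, stopping once the rank order is
# stable (Spearman rank correlation above 0.99 between passes).
from scipy.stats import spearmanr

def reiterate(initial_ratings, rate_through_week, week,
              threshold=0.99, max_iters=400):
    """rate_through_week(start_ratings, week) -> {team: rating} after
    re-rating weeks 1..week from the given starting values."""
    current = rate_through_week(initial_ratings, week)
    for _ in range(max_iters):
        new = rate_through_week(current, week)
        teams = sorted(current)                  # fix a common team order
        rho, _ = spearmanr([current[t] for t in teams],
                           [new[t] for t in teams])
        current = new
        if rho > threshold:                      # rank order is stable
            break
    return current
```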
4. Results
In this section, a sample of only 30 teams is included due to the high dimensionality of
the data set and the belief that these teams have the best chance to attend nationals.
Table 3: The column headings involving "Parameter K" reference K as a constant, following the original logic of Elo (1978). Alternatively, the "Function K" headings reference the weighting point-differential function created by Murray (2012). The dataset used for each algorithm is denoted in parentheses. Lastly, the PR column is the final rating calculation produced by each algorithm and dataset.
5. Conclusion
Due to the inherent subjectivity of rating schemes, there is no single best or most efficient strategy for creating a rating system. The first way to recognize whether a rating scheme is
viable is to evaluate whether or not it agrees with the common belief held by the group
that is being rated. The second evaluation would use the rating scheme as a predictive
tool to see if the model produces results similar to those that occur in the real world. The
second evaluative technique was not the focus of this paper and is a measure that can be
explored at a later time. It should be noted that direct comparison of the PR measure
across datasets and rating schemes is not accurate because of the difference in
dimensionality and the rating update method. Also, when comparing the USAU
Algorithm to the "Function K" and "Parameter K" values, only the USAU dataset can be used. The full results can be seen in Table 3, and the agreement between the methods is summarized in Table 4.
Results Compared                              Correlation
USAU Algorithm vs. Function K                 0.974
USAU Algorithm vs. Parameter K                0.868
Parameter K vs. Function K (USAU dataset)     0.986
Parameter K vs. Function K (RRI dataset)      0.991

Table 4: The Spearman rank correlation is used to compare the datasets and the different weight-update methodologies.
The RRI dataset was included to assess whether a difference existed between the final
rankings in the top 30 teams of the competitors in USAU sanctioned tournaments and
USAU sanctioned teams. In the “Parameter K” top 30 results, minor changes in the order
of teams can be observed. In the case of the “Function K” results, only two teams
(Oregon and Wisconsin) differ in rank between the two datasets. This would suggest that
the excluded teams are likely lower level teams that have little impact on the teams vying
for a spot at nationals. Next, the “Parameter K” model was included to provide evidence
that a change in weighting scheme is appropriate by giving baseline results for ranking
teams. From Table 4, correlations of 0.986 and 0.991 between the "Parameter K" and "Function K" results are high enough to suggest that changing the weight function to
reflect score is appropriate. When comparing the overall outcomes from the USAU
algorithm with my proposed algorithm of “Function K”, the resulting Spearman rank
correlation stands at 0.974; this high level of correlation suggests that the results are very
similar. In conclusion, based on a subjective assessment of my final results and the high
correlation measure observed, I believe I have developed a sound rating method.
6. Discussion
In this section, I will be discussing specific details of the USAU algorithm and my
“Function K” algorithm for the USAU dataset, along with ideas for future work. I believe
I have established a quicker method to rate teams when compared to the USAU
algorithm. The USAU algorithm uses 400 iterations while my algorithm used only 122
iterations. The high correlation threshold allows my program to continue to the next week
if there is no significant change in the order of the teams. This differs from the USAU approach because their rating values can converge to a fixed number while mine cannot: convergence is attainable under USAU's scheme because winning teams may still lose rating points, whereas I adjusted that assumption so that winners always gain rating points. I
cannot comment on the statistical approach for the USAU algorithm because there is no
accessible formal paper detailing Shalom Simon’s approach.
Burruss (2012) explains that for a team to attain a high rating under the USAU algorithm,
all they should do is win. Simply put, he explains that strength of schedule has little
bearing on a team’s ability to rank in the top twenty, although the team cannot solely play
weak competition as explained in Section 1.3. This idea is conveyed by Table 3 and will
be specifically discussed in the cases of Whitman, Iowa, and Texas A&M. Whitman is a
textbook example of what was described earlier as “gaming the rankings.” They are
within the top 20 in the USAU algorithm, yet outside the top 30 in my algorithm at the
end of the year; I believe this is largely due to the inclusion of forfeits in my model.
Whitman was able to play on par with several of the elite college teams, but at two
tournaments where their rating was in jeopardy, they decided to forfeit. As previously
discussed, I included forfeits in my calculations because neglecting them discourages
game play. Iowa and Texas A&M are similar when discussing the effects of winning on a
team’s rating for the USAU algorithm. They earned high ratings by simply winning
games, many by large margins, and had regular season records of 23-5 and 29-2,
respectively. The final ratings of Iowa and Texas A&M fell within my top twenty, and
reflect that my algorithm also values winning as a determinant of higher ratings.
The last part of the discussion will focus on future analyses of this methodology and
other approaches of interest. Ideally, I would establish a less arbitrary means of choosing
the parameters for my proposed function. While it would be possible to estimate these
values, there is still a measure of subjectivity because the true rank of each team is
unknown. Using values between 10 and 30, I could model the K function based on the
volatility of a team and the score of a game. While this would preserve Elo’s normality
assumption, I believe that the results would be similar to my findings, albeit rescaled. I
would like to be able to check the accuracy of my method by predicting game outcomes
at nationals. The only way I can currently accomplish this is through the “Parameter K”
method, as I do not have priors to predict the score of each team. I believe it would be
possible to create such a model, and my current research reflects a good starting point for
this. Next, I would like to do some research in the field of network analysis. I believe that
network analysis would be very promising when discussing the bid allocation process for
regional qualifiers and the national tournament. However, I think it would be difficult to
utilize in rating teams due to the structure of the regular season. Network analysis relies
on using clusters to rank its components, but teams rarely play within their entire
conference or region prior to qualifying tournaments. This concern is purely speculative,
and necessitates further research into different methods of pairwise comparisons.
References
Burruss, Lou. (2012). Rankings Under the Hood, Skyd Magazine. Retrieved from