An Examination of Pairwise Comparisons of College Open Ultimate Frisbee Teams
Michael Silger
11/28/2012
Under the direction of Dr. Christopher Wikle Department of Statistics University of Missouri
146 Middlebush Hall, Columbia, MO 65211, United States
1.1 The importance of an accurate ranking system
1.2 The current ranking method
1.3 Scrutiny of the current ranking method
1.4 An alternative approach to ranking frisbee teams
2 Data
3 Methodology
3.1 Explanation of Elo Algorithm
3.2 Changing the Update Parameter to a Function
3.3 Explanation of Assumptions for Proposed Algorithm
4 Results
5 Conclusion
6 Discussion
References
1.4 An alternative approach to ranking frisbee teams
A different approach to ranking frisbee teams should be considered, using a model that addresses the concerns raised about the USAU algorithm. The model should reflect the same principles exhibited by participants in ultimate frisbee, such as the variability of a team's performance. Variability can be captured using statistical models; one of the first models developed for pairwise comparisons was created by Arpad Elo for the game of chess. The Elo (1978) model can measure pairwise comparisons of high dimensionality across a given time period and calculate a rating of relative strength for each competitor. Extensions of the Elo model can be seen in more complex systems such as Glickman's (1993) Glicko rating system. The biggest difference between the two models is that the Glicko system includes an estimate of the reliability of a contender's rating. I chose the Elo method for its simplicity of implementation and programming.
For the game of chess, Elo (1978) starts each competitor with a provisional rating period of about 30 games. Once the provisional period ends, a player is given a rating reflective of their performance over that span. That rating follows a distribution whose mean equals the rating and whose variance is a measure of reliability; Elo (1965) and McClintock (1977) show that the many performances of an individual will be normally distributed on an appropriate scale. The player's rating is updated on a continuous or periodic basis and will eventually converge to the player's true strength. An unexpected result could be the consequence of a statistical fluctuation or an actual change in the player's ability; thus, new performances are weighted according to how much importance is given to past performances. The weight is a measure of
reliability for a team’s distribution. I use the Elo algorithm as the basis for my rating
algorithm with some minor modifications to the assumptions in the model. The specifics
of the Elo algorithm and my adjustments to its assumptions will be discussed in detail in
the methodology section.
2. Data
The data used in this analysis were collected from www.usaultimate.org in the month of
June 2012 and represent the 2012 USAU open sanctioned tournaments. The data were not
readily accessible and had to be harvested using AutoHotKey, a program that allows the
user to automate keystrokes. Using AutoHotKey, I was able to copy all of the information
on the page for each competitor and paste it into a text file. I then converted the text file
to an excel document and parsed the data to obtain the information necessary to create
two data sets. The first dataset was in correspondence with all of the teams listed under
the RRI webpage (450 teams) while the second dataset included only USAU sanctioned
teams (371 teams). It was necessary to evaluate two separate datasets to obtain a strict
comparison between the proposed algorithm and the USAU algorithm. A comparison is
also done between the USAU sanctioned teams and teams that competed in USAU
sanctioned tournaments to determine if the exclusion of additional teams affects the final
rankings. Similar results are expected when comparing the two datasets. Irrelevant
information was omitted; for instance, several outcomes were listed F-F or F-L, both of
which have no merit when considering the winner of a game. Game outcomes with a
single score reported were not included and are shown as 9-_. A sample of the data set is
included in Figure 2.
Figure 2: The column labeled RatePer represents the rating period in which the contest took place; a RatePer corresponding to 1 stands for the first week of competition.
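As an illustration of the cleaning step described above, the following is a minimal sketch assuming the parsed games carry a single Score column; the column name and layout are assumptions for illustration, not the actual AutoHotKey output.

```python
# Minimal sketch of the data-cleaning step: drop forfeit outcomes
# (F-F, F-L) and games with only a single reported score (e.g. '9-_').
# The "Score" column name is an assumption about the parsed layout.
import pandas as pd

def clean_games(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only games with two reported scores and a true winner."""
    forfeits = df["Score"].isin(["F-F", "F-L"])
    partial = df["Score"].str.contains("_", regex=False)
    return df[~(forfeits | partial)].reset_index(drop=True)

games = pd.DataFrame({"Score": ["15-10", "F-F", "9-_", "13-11", "F-L"]})
print(clean_games(games))  # keeps only the 15-10 and 13-11 games
```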
3. Methodology
3.1 Explanation of Elo Algorithm
Elo (1978) defines a proper rating system as one that can effectively rank teams and
provide a measure of the relative strength of competitors, however strength may be
defined. The initial assumption in the model is that each competitor’s rating follows a
normal distribution. There are different approaches proposed by Elo (1978) for instituting
pairwise comparisons. I chose to use his method of continuous rating updates, as I will be
calculating the rating of each competitor on a weekly basis. The continuous rating
method is given by the formula:
$$R_n = R_o + K(W - W_e), \qquad (3)$$

where $R_n$ is the new rating and mean of the player's distribution, $R_o$ is the old rating, $K$ is the weighting function, and $W$ is a binary variable taking the value 0 for a loss or 1 for a win. In addition, $W_e$ for a particular team is the expected value of winning a game against some rated opponent, with rating $R_{opp}$, given by:

$$W_e = \frac{1}{1 + 10^{(R_{opp} - R_o)/400}}. \qquad (4)$$
From (3), the new rating is updated by adding points to or subtracting points from the old rating based on the game outcome. Because $W$ is binary, the winner of a game will always gain points and the loser will always lose points. This is important because, regardless of the score of a game, the winning team should not lose rating points. Another central aspect of the Elo model is that the rating update depends in part on the expected outcome of the game. If a team is rated far higher than its opponent, this is reflected in the expected value calculation: the winner of the contest will still gain points, but a high expected value of winning results in a small point gain. This accurately portrays the structure of an ultimate frisbee tournament, as highly rated teams will play lower rated competition in pool play². It also deters teams from playing only low rated competition to build a ranking that may not accurately represent their strength.
During the season, $R_n$ will move toward the mean of its normal distribution. The variance of this normal distribution is needed to model the consistency of each team and, under Elo's methodology, reflects the weight of a performance in a given time period. In this case, $K$ is defined as a weight that blends the new result with the previous rating. Under the Elo model, this weight lies between 10 and 32, with larger values giving more influence to recent results.
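As an illustration, the continuous update in (3) and (4) can be implemented in a few lines. This is a minimal sketch assuming the logistic expected-score curve in (4) with the conventional 400-point scale; the function names are illustrative.

```python
# Minimal sketch of the continuous Elo update in equations (3) and (4).
# Assumes the logistic expected-score curve with the conventional
# 400-point scale; names are illustrative only.

def expected_score(r_old: float, r_opp: float) -> float:
    """W_e: expected value of beating an opponent rated r_opp, eq. (4)."""
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_old) / 400.0))

def elo_update(r_old: float, r_opp: float, won: bool, k: float = 24.0) -> float:
    """R_n = R_o + K(W - W_e), eq. (3). W is 1 for a win, 0 for a loss."""
    w = 1.0 if won else 0.0
    return r_old + k * (w - expected_score(r_old, r_opp))

# A 1500-rated team upsets a 1600-rated team: the winner gains more
# than K/2 points because its expected score was below one half.
print(elo_update(1500, 1600, won=True))  # ≈ 1515.4
```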
As a modification to the standard algorithm, I wanted to incorporate the score differential
into the model because it contains a significant amount of information. In an attempt to
include the game score in the algorithm, I looked at the model from a Bayesian
perspective with a normal likelihood and a normal prior. Each team’s rating is assumed to
² Pool play is a round robin style tournament among a small group of teams, usually four or five, often taking place on the first day of competition in a 2-day tournament.
be the result of a normal distribution, so the justification for a normal likelihood is
obvious. The prior distribution quantifies our a priori understanding of the unobservable
quantities of interest (Wikle 2007) and should also be considered normal. The posterior
mean can then be written as:
$$E[X \mid \mathbf{y}] = \mu + \frac{\tau^2}{\tau^2 + \sigma^2/n}\,(\bar{y} - \mu) = \mu + K(\bar{y} - \mu). \qquad (5)$$
The prior mean ($\mu$) is adjusted toward the sample mean estimate ($\bar{y}$), where $\tau^2$ is the variance of the prior, $\sigma^2$ is the variance of the sample estimate, and $n$ is the number in
the sample. The K function in (5) is a ratio of the variances from the likelihood function
and the prior information. This is very similar to our Elo algorithm and serves as the
motivation for a different update function for K.
The ratio of variances in the context of rating teams is hard to define. The variance is understood as the under- and over-performance of a team, but there is no mathematical measure to show that a team has over- or under-performed aside from differences between ratings and opinion. Essentially, the ratio of variances weights the reliability of the new information using a priori knowledge. The score of the game is part of that a priori knowledge, which led to the decision to incorporate the score differential into the weighting update function.
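The shrinkage in (5) is easy to verify numerically. The following is a minimal sketch assuming a normal prior and a normal likelihood with known variances; all values and names are illustrative.

```python
# Minimal numeric check of the posterior mean in equation (5):
# a normal prior N(mu, tau^2) combined with n observations of
# variance sigma^2 shrinks the sample mean toward the prior mean.

def posterior_mean(mu: float, tau2: float, ybar: float,
                   sigma2: float, n: int) -> float:
    """E[X | y] = mu + K*(ybar - mu), with K = tau^2 / (tau^2 + sigma^2/n)."""
    k = tau2 / (tau2 + sigma2 / n)
    return mu + k * (ybar - mu)

# Prior rating 1500 with variance 100^2; five games suggest a mean
# performance of 1600 with per-game variance 200^2.
print(posterior_mean(mu=1500, tau2=100**2, ybar=1600, sigma2=200**2, n=5))
# K = 10000 / (10000 + 8000) ≈ 0.556, so the posterior mean is ≈ 1555.6
```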
3.2 Changing the Update Parameter to a Function
When constructing the updated weighting function, I first tried to avoid some of the errors I believe to be present in the USAU algorithm. The primary issue with their updating scheme is that the curve of the points gained is concave: the value of scoring a point on the eventual winner decreases as the game becomes closer. To correct this, I propose a point allocation function of the final score, defined in terms of the following quantities:

UnvPtWin = points awarded for winning on universe point³,
TotalPt = total points possible awarded,
p = percentage of points awarded,
diff = the score differential of the game,
WinScore = the number of points the winner scored.
The proposed allocation scheme corrects the concavity of the K weighting function, as seen in Figure 3. Figure 3 also illustrates that there is no cutoff value for beating a team; instead, all of the information indicated by the final score of the contest is taken into account. Table 2 shows that the marginal point allocation increases as the game becomes closer, and there are no unexpected outlying values. Forfeits are also included in the model and are treated as the maximum number of points gained for the winner. I allotted 200 total possible points, with 50 points awarded for a universe point win and a p of 0.80. These values were chosen on a subjective assessment of point allocation in order to clearly discern differences in the score; they scale down the point allocation relative to the USAU algorithm to avoid over-inflating the K weight values and to help retain the normality assumption. To check the effect of converting the weighting parameter to a function, results were also calculated with K fixed at 24.
³ Universe point occurs when two teams are tied and the next point will win the game for either team.
Figure 3: The proposed allocation’s rating point allocation scheme assuming the winning score is 15
Table 2: A table showing the marginal difference in the proposed rating points gained. The left column represents the two scores that are being compared.
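For concreteness, one allocation with this shape is a power-law interpolation between the stated anchors: 50 points for a universe point win, 200 points at the maximum margin or a forfeit, with exponent p = 0.80. The sketch below is illustrative only; the power-law form is an assumption that reproduces the behavior described above, not necessarily the exact allocation used in the analysis.

```python
# Illustrative sketch only: a power-law point allocation matching the
# stated anchors (50 points at a universe-point win, 200 at the maximum
# margin or a forfeit, exponent p = 0.80). The exact allocation used
# in the analysis may differ in functional form.

UNV_PT_WIN = 50.0    # points for winning on universe point
TOTAL_PT   = 200.0   # maximum points possible (also used for forfeits)
P          = 0.80    # exponent controlling curvature

def points_awarded(diff: int, win_score: int = 15) -> float:
    """Points gained by the winner for a given score differential."""
    if diff <= 0:
        raise ValueError("winner must outscore loser")
    frac = (diff - 1) / (win_score - 1)   # 0 at universe point, 1 at max margin
    return UNV_PT_WIN + (TOTAL_PT - UNV_PT_WIN) * frac ** P

# With p < 1, the marginal allocation grows as the game gets closer:
for d in (1, 2, 3, 14, 15):
    print(d, round(points_awarded(d), 1))
# 1 -> 50.0, 2 -> 68.2, 3 -> 81.6, 14 -> 191.4, 15 -> 200.0:
# the 2-vs-1 gap (≈18.2) exceeds the 15-vs-14 gap (≈8.6), as in Table 2.
```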
3.3 Explanation of Assumptions for Proposed Algorithm
A continuous rating system updates a new rating after a single game; subsequently, each new rating is blended into the old rating. Elo (1978) proposes different methods of blending, but I discuss an alternative approach below. Elo (1978) suggests that long events should be divided into ratable segments for each application of (3). My methodology considers the entire weekend tournament to be one segment, treating games played on Saturday as equivalent to those played on Sunday. This assumption has some weaknesses, but I believe its merits outweigh the drawbacks. A typical 2-day tournament seeds each attending team according to its perceived strength before either pool play or immediate bracket play commences. The results of pool play are then used to reseed the teams and place them into bracket play for varying places on Sunday.
It is intuitive that pool play and bracket play are dependent on one another, but I believe that teams performing to expectations will play similarly rated competition in placement games. If a team over- or under-performs, it is reflected in the results of individual games rather than in the final placement from the weekend. Therefore, each game will be treated as independent of the next.
Another consideration when examining the structure of ratable segments is the order of the games played and how that sequence can affect a team's rating. For instance, if a team is overrated and loses to a low rated opponent in its first game, the low rated opponent reaps the reward of playing the over-ranked team first. To combat rating points gained as a result of game order, I randomly sample the matches at each tournament without replacement, which I believe nullifies any order-imposed gains a team may receive. I sample without replacement 100 times, which I believe is sufficient, and then blend the 100 ratings by taking the mean rating for each team, which yields the new rating ($R_n$). The integrity of the normality assumption of the rating is preserved, and any rating points gained from the order of games are forgone.
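The following is a minimal sketch of this order-randomization step, assuming games are stored as (winner, loser) pairs and an Elo update as in Section 3.1; all names are illustrative.

```python
# Sketch of the order-randomization step: rate each tournament under
# 100 random game orders and blend the results by averaging, so no
# team gains rating points purely from the sequence of its games.
import random
from collections import defaultdict

def _update(r, r_opp, w, k=24.0):
    """One Elo update, eq. (3), with the logistic expected score (4)."""
    w_e = 1.0 / (1.0 + 10.0 ** ((r_opp - r) / 400.0))
    return r + k * (w - w_e)

def rate_tournament(games, base_ratings, n_orders=100):
    """games: list of (winner, loser) pairs; base_ratings: {team: rating}."""
    sums = defaultdict(float)
    for _ in range(n_orders):
        ratings = dict(base_ratings)
        for winner, loser in random.sample(games, len(games)):
            rw = _update(ratings[winner], ratings[loser], w=1.0)
            rl = _update(ratings[loser], ratings[winner], w=0.0)
            ratings[winner], ratings[loser] = rw, rl
        for team, r in ratings.items():
            sums[team] += r
    # The blended new rating R_n is the mean over the sampled orderings.
    return {team: s / n_orders for team, s in sums.items()}
```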
The Elo algorithm requires a provisional rating period of 30 games before a participant is given a rating. Many teams cannot satisfy this condition, as USAU only requires a team to participate in ten games to be included in the ranking system. Therefore, in lieu of a provisional rating period, a reiterative process similar to USAU's algorithm is used. Reiteration helps account for the opponent's strength at the time of the match; it helps avoid over- and under-ranked teams by rerating previous encounters with the inclusion of the new information provided by the new rating ($R_n$). Once a ratings period is completed (say $R_1$), its ratings are used as the initial values for the first week, and the ratings are then recomputed up through the current ratings period (say $R_{1,new}$). The rank order of the teams under $R_1$ and $R_{1,new}$ is then compared using the Spearman rank correlation. If the correlation is above 0.99, the process continues on to the next week, where the reiteration repeats. The Spearman rank correlation is used to help reduce the number of reiterations required: once the rank order of teams has stabilized, there is no reason to reiterate further, which would unnecessarily spread the ratings of teams and waste computing time. The Spearman rank correlation is the most intuitive measure here because we are concerned with the ranked order of the teams, and it provides a measure of agreement between iterations. To check the robustness of the reiteration process, Kendall's tau could also be considered.
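The following is a minimal sketch of the stopping rule, assuming a rate_through_week function that recomputes ratings from the first week given starting values; that interface is an assumption for illustration.

```python
# Sketch of the reiteration stopping rule: rerate all weeks using the
# latest ratings as initial values, stopping once the rank order is
# stable (Spearman rank correlation above 0.99 between passes).
from scipy.stats import spearmanr

def reiterate(initial_ratings, rate_through_week, week,
              threshold=0.99, max_iters=400):
    """rate_through_week(start_ratings, week) -> {team: rating} after
    re-rating weeks 1..week from the given starting values."""
    current = rate_through_week(initial_ratings, week)
    for _ in range(max_iters):
        new = rate_through_week(current, week)
        teams = sorted(current)                  # fix a common team order
        rho, _ = spearmanr([current[t] for t in teams],
                           [new[t] for t in teams])
        current = new
        if rho > threshold:                      # rank order is stable
            break
    return current
```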
4. Results
In this section, a sample of only 30 teams is included due to the high dimensionality of
the data set and the belief that these teams have the best chance to attend nationals.
Table 3: The column headings involving "Parameter K" reference K as a constant, following the original logic of Elo (1978). Alternatively, the "Function K" headings reference the weighting point-differential function created by Murray (2012). The dataset used for each algorithm is denoted in parentheses. Lastly, the PR column is the final rating calculation produced by each algorithm and dataset.
5. Conclusion
Due to the inherent subjectivity of rating schemes, there is no single best or most efficient strategy for creating a rating system. The first way to recognize whether a rating scheme is
viable is to evaluate whether or not it agrees with the common belief held by the group
that is being rated. The second evaluation would use the rating scheme as a predictive
tool to see if the model produces results similar to those that occur in the real world. The
second evaluative technique was not the focus of this paper and is a measure that can be
explored at a later time. It should be noted that direct comparison of the PR measure
across datasets and rating schemes is not accurate because of the difference in
dimensionality and the rating update method. Also, when comparing the USAU
Algorithm to the "Function K" and "Parameter K" values, only the USAU dataset can be used. The full results can be seen in Table 3, and the agreement between the methods is summarized in Table 4.
Results Compared                              Correlation
USAU Algorithm vs. Function K                 0.974
USAU Algorithm vs. Parameter K                0.868
Parameter K vs. Function K (USAU dataset)     0.986
Parameter K vs. Function K (RRI dataset)      0.991

Table 4: The Spearman rank correlation is used to compare the datasets and the different weight-update methodologies.
The RRI dataset was included to assess whether a difference existed between the final
rankings in the top 30 teams of the competitors in USAU sanctioned tournaments and
USAU sanctioned teams. In the “Parameter K” top 30 results, minor changes in the order
of teams can be observed. In the case of the “Function K” results, only two teams
(Oregon and Wisconsin) differ in rank between the two datasets. This would suggest that
the excluded teams are likely lower level teams that have little impact on the teams vying
for a spot at nationals. Next, the “Parameter K” model was included to provide evidence
that a change in weighting scheme is appropriate by giving baseline results for ranking
teams. From Table 4, correlations of 0.986 and 0.991 between the "Parameter K" and "Function K" results are high enough to suggest that changing the weight function to
reflect score is appropriate. When comparing the overall outcomes from the USAU
algorithm with my proposed algorithm of “Function K”, the resulting Spearman rank
correlation stands at 0.974; this high level of correlation suggests that the results are very
similar. In conclusion, based on a subjective assessment of my final results and the high
correlation measure observed, I believe I have developed a sound rating method.
6. Discussion
In this section, I will be discussing specific details of the USAU algorithm and my
“Function K” algorithm for the USAU dataset, along with ideas for future work. I believe
I have established a quicker method to rate teams when compared to the USAU
algorithm. The USAU algorithm uses 400 iterations while my algorithm used only 122
iterations. The high correlation threshold allows my program to continue to the next week
if there is no significant change in the order of the teams. This differs from the USAU approach because their rating values can converge to a fixed number while mine cannot: convergence is attainable under USAU's scheme because winning teams may still lose rating points, whereas I adjusted that assumption so that winners always gain rating points. I
cannot comment on the statistical approach for the USAU algorithm because there is no
accessible formal paper detailing Shalom Simon’s approach.
Burruss (2012) explains that for a team to attain a high rating under the USAU algorithm,
all they should do is win. Simply put, he explains that strength of schedule has little
bearing on a team’s ability to rank in the top twenty, although the team cannot solely play
weak competition as explained in Section 1.3. This idea is conveyed by Table 3 and will
be specifically discussed in the cases of Whitman, Iowa, and Texas A&M. Whitman is a
textbook example of what was described earlier as “gaming the rankings.” They are
within the top 20 in the USAU algorithm, yet outside the top 30 in my algorithm at the
end of the year; I believe this is largely due to the inclusion of forfeits in my model.
Whitman was able to play on par with several of the elite college teams, but at two
tournaments where their rating was in jeopardy, they decided to forfeit. As previously
discussed, I included forfeits in my calculations because neglecting them discourages
game play. Iowa and Texas A&M are similar when discussing the effects of winning on a
team’s rating for the USAU algorithm. They earned high ratings by simply winning
games, many by large margins, and had regular season records of 23-5 and 29-2,
respectively. The final ratings of Iowa and Texas A&M fell within my top twenty, and
reflect that my algorithm also values winning as a determinant of higher ratings.
The last part of the discussion will focus on future analyses of this methodology and
other approaches of interest. Ideally, I would establish a less arbitrary means of choosing
the parameters for my proposed function. While it would be possible to estimate these
values, there is still a measure of subjectivity because the true rank of each team is
unknown. Using values between 10 and 30, I could model the K function based on the
volatility of a team and the score of a game. While this would preserve Elo’s normality
assumption, I believe that the results would be similar to my findings, albeit rescaled. I
would like to be able to check the accuracy of my method by predicting game outcomes
at nationals. The only way I can currently accomplish this is through the “Parameter K”
method, as I do not have priors to predict the score of each team. I believe it would be
possible to create such a model, and my current research reflects a good starting point for
this. Next, I would like to do some research in the field of network analysis. I believe that
network analysis would be very promising when discussing the bid allocation process for
regional qualifiers and the national tournament. However, I think it would be difficult to
utilize in rating teams due to the structure of the regular season. Network analysis relies
on using clusters to rank its components, but teams rarely play within their entire
conference or region prior to qualifying tournaments. This concern is purely speculative,
and necessitates further research into different methods of pairwise comparisons.
References
Burruss, Lou. (2012). Rankings Under the Hood, Skyd Magazine. Retrieved from