Highlights
Football: Discovering elapsing-time bias
in the science of success
L. Galli, G. Galvan, T. Levato, C. Liti, V. Piccialli, M. Sciandrone
• We conjecture that players' behavior is more and more correlated with the match outcome as the 90 minutes elapse.
• We demonstrate the effect of this elapsing-time bias by applying a host of machine learning techniques to a large corpus of finely detailed football matches from European leagues.
• We show that we can predict the outcome of a match with high confidence simply by looking at the last 15 minutes of the game.
• We design a new task and show that it is not affected by elapsing-time bias.
Football: Discovering elapsing-time bias
in the science of success
L. Galliᵃ,∗, G. Galvanᵇ, T. Levatoᵇ, C. Litiᶜ, V. Piccialliᶜ, M. Sciandroneᵇ
ᵃ Chair for Mathematics of Information Processing, RWTH Aachen University, Pontdriesch 10, 52062 Aachen (Germany)
ᵇ Dipartimento di Ingegneria dell'Informazione, Università di Firenze, Via di Santa Marta 3, 50139 Firenze (Italy)
ᶜ Dipartimento di Ingegneria Civile e Ingegneria Informatica, Università degli Studi di Roma "Tor Vergata", Via del Politecnico 1, 00133 Roma (Italy)
Abstract
One of the fundamental topics in sports analytics is the science of success, i.e., the study of the correlation between players' performances and their success. This is a very challenging task, especially in the case of team sports, among which football is a prominent example. This paper is concerned with uncovering a dangerous bias that is present in most of the approaches proposed in the literature that apply statistical techniques or machine learning models to study the correlation between team performances and match outcome. In particular, we find that players' behavior on a time interval is more and more correlated with the match outcome as the 90 minutes elapse. As an extreme example, we show that we can predict the outcome of a match with high confidence simply by looking at the last 15 minutes of the game.
Table 4: AUC (mean ± ci) for the different models using statistics over the whole match.
curve (AUC) [16] in the macro-averaging fashion¹. In particular, we compute the AUC on 10 random splits where 90% of the data is used for training and the remaining 10% as the test set. For each independent 90% split, the hyper-parameters of the models are chosen with a 5-fold cross-validation scheme, except for NNs, where we use early stopping on a validation set containing 20% of the original training data.
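As an illustration, this protocol can be sketched as follows; this is a minimal sketch assuming a feature matrix X and Win/Draw/Loss labels y (names and the hyper-parameter grid are illustrative, and logistic regression stands in for any of the compared models).

```python
# Minimal sketch of the evaluation protocol (X, y and the grid are illustrative).
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

aucs = []
for seed in range(10):                      # 10 independent 90%/10% splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=seed, stratify=y)
    # hyper-parameters chosen by 5-fold cross-validation on the training split
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
    search.fit(X_tr, y_tr)
    probs = search.predict_proba(X_te)
    # macro-averaged one-vs-rest AUC over Win / Draw / Loss
    aucs.append(roc_auc_score(y_te, probs, multi_class="ovr", average="macro"))
print(f"AUC: {np.mean(aucs):.3f} +/- {1.96 * np.std(aucs) / np.sqrt(10):.3f}")
```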
3.1. Study of the Final Output with Whole-Match Statistics
The mean AUC (macro) and confidence intervals are reported in Table 4. Logistic regression, XGBoost, SVM and NNs proved to be statistically equivalent², while k-NN and RF scored significantly worse. The AUC of the top four models is around 0.82, meaning that the features extracted are highly correlated with the outcome of the match. Moreover, the performances do not seem to be strongly influenced by the choice of the machine learning model (considering the top five models).
¹ We choose macro- over micro-averaging as the former weights all the different classes (Win, Draw, Lose) in the same way, regardless of the number of examples in each class.
² This was assessed using Welch's t-test to compare the 10 AUC test scores.

As a further analysis of the model performances, we show the ROC curves
for the logistic regression model in Figure 3. To extract the curves we modify the test procedure. Namely, instead of randomly sampling 10 different test sets, we perform a 10-fold cross-validation test. In this way we gather the predictions of the model on all the examples of the dataset (which are, however, unseen when the predictions are computed). We use such predictions to extract four different curves: one curve for each class (in a one-vs-all fashion), obtained by comparing the probability of belonging to a class against the probability of belonging to the other two (as in the standard two-class case), and a cumulative curve according to the macro-averaging scheme (a minimal sketch of this procedure is given after the observations below). From Figure 3 we can make the following observations.
• The results of the win-vs-all curve are very similar to those obtained in Figure 2 of [10]: 0.88 AUC in comparison with 0.89 ± 0.02. In fact, the win-vs-all curve is basically simulating the task addressed in [10], where only the Win and Not win outputs are considered.
• In accordance with the literature (e.g., [9, 11, 10]), it can be noticed that, while Wins and Losses are well identified by the method, Draws are not classified as clearly. This means that the machine learning model is (at least to some extent) able to understand when one of the teams is prevailing over the other, but does not have the same confidence in classifying matches that end in draws.
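The curve-extraction procedure described above can be sketched as follows, reusing the same illustrative X and y as before; the class ordering returned by np.unique matches the column ordering of scikit-learn's predict_proba.

```python
# Minimal sketch: out-of-fold probabilities via 10-fold CV, then one-vs-all ROC curves.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

classes = np.unique(y)  # sorted order, matching predict_proba's columns
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=10, method="predict_proba")
y_bin = label_binarize(y, classes=classes)
per_class_auc = {}
for i, c in enumerate(classes):
    # one class against the other two, as in the standard two-class case
    fpr, tpr, _ = roc_curve(y_bin[:, i], probs[:, i])
    per_class_auc[c] = auc(fpr, tpr)
# the macro-average weights all classes equally
print(per_class_auc, np.mean(list(per_class_auc.values())))
```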
To help understand how the models behave w.r.t. the different classes, we show the confusion matrix for the latter experiment. The confusion matrix is obtained by discretizing the probabilities computed by the model, choosing, for each instance, the most probable class. The confusion matrix is shown in Figure 4.
Figure 3: ROC curves for the logistic regression model. Win vs All (AUC = 0.88), Loss vs All (AUC = 0.87), Draw vs All (AUC = 0.70), macro-average (AUC = 0.82).
true \ predicted | Draw | Win | Loss
Draw | 0.45 | 0.28 | 0.27
Win | 0.20 | 0.72 | 0.08
Loss | 0.21 | 0.08 | 0.71

Figure 4: Confusion matrix for the logistic regression model.
We notice that wins and losses have high accuracy (∼72%) and that mistaking wins for losses is a rare kind of error (∼8%). The most frequent errors are, instead, those where draws are taken for wins or losses and vice versa. This does not come as a surprise, as it confirms the lower AUC obtained on the draw-vs-all curve of Figure 3.
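Reusing probs and classes from the previous sketch, the discretization step can be written as follows (a minimal sketch; confusion_matrix's normalize option yields the row-normalized frequencies of Figure 4).

```python
# Minimal sketch: discretize probabilities into hard labels, then build the
# row-normalized confusion matrix (rows: true class, columns: predicted class).
from sklearn.metrics import confusion_matrix

pred = classes[probs.argmax(axis=1)]  # most probable class per instance
cm = confusion_matrix(y, pred, labels=classes, normalize="true")
print(cm)                             # each row sums to 1
```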
3.2. Study of the Final Output with Intervals Statistics
As a second step, we introduce time intervals to analyze the effect of exploiting a more detailed description of the match. In this phase, we keep the result of the whole match as the labels, but we extract the 141 features described in Table A.6 from each of the 15-minute intervals (as described in Section 2).

Table 5: AUC (mean ± ci) for the different models using statistics extracted every 15 minutes over the whole match.
The results are shown in Table 5. By comparing it with Table 4 we can notice that each method gains from 3 to 6 points when exploiting a finer description of the match. This means that a single aggregation of the performances over all the intervals is hiding some useful information. The feature vector extracted this way is now even more correlated with the outcome of the match. As in Table 4, Table 5 also shows that the same group of five machine learning models (logistic regression, XGBoost, linear SVM, Gaussian SVM and NNs) perform similarly (they are actually statistically equivalent). This seems to suggest, once again, that the choice of the model is not crucial for this task.
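As a rough sketch of this feature construction, assuming a hypothetical compute_features helper that returns the 141 statistics of Table A.6 for a window of events:

```python
# Minimal sketch of the per-interval feature construction; compute_features is
# a hypothetical helper returning the 141 statistics of Table A.6 for a window.
import numpy as np

INTERVALS = [(0, 15), (15, 30), (30, 45), (45, 60), (60, 75), (75, 90)]

def match_vector(events):
    """events: chronological list of (minute, event, ...) tuples for one match."""
    parts = []
    for start, end in INTERVALS:
        window = [e for e in events if start <= e[0] < end]
        parts.append(compute_features(window))  # 141 values per interval
    return np.concatenate(parts)                # 6 * 141 = 846 features
```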
3.3. Study of the Partial Output with Partial-Intervals Statistics
We now present a set of experiments that will progressively bring us to
discover the presence of elapsing-time bias. As a first step in this direction,
we focus on understanding whether the high correlation between game statistics and final match outcome can also be found between partial game statistics
and partial outcomes. Namely, we consider different sections of the match of increasing duration (0-15, 0-30, . . . , 0-90) and we train a model to output the partial outcome at the end of each one. For example, if we consider the first 30 minutes of the match, we train a model that receives as input the game statistics (divided into 15-minute intervals as before) of the first 30 minutes and must output who is winning at the end of those 30 minutes. For the sake of simplicity we show only the performance of the logistic regression model, since, as shown in the previous experiments, all the other models are not significantly better. The results are shown in terms of AUC (macro) in Figure 5.

Figure 5: AUC (macro) for logistic regression trained on different sections of the match.

From Figure 5 we can observe the following.
• The performance of the models for partial outcomes is far worse than the performance of the model for the whole match, with roughly 17 points of difference between the 0-90 and 0-30 models.
• If we exclude the 0-15 model, there is an evident increasing trend in the performance of the model that is positively correlated with both
(a) labels that are getting closer to the end of the match;
(b) features that are taking into account a larger section of the match
(as well as becoming more numerous).
To explain this phenomenon, and to understand whether it is a matter of (a) labels or (b) features (or both), we first shift our focus to the model itself. Namely, we analyze how the model exploits the different features. To this aim, we compute the permutation importance, also known as Mean Decrease Accuracy. According to this technique, which can be traced back to [17], the model is treated as a black-box estimator and the importance of a feature is measured by the decrease in test performance obtained by replacing the entire corresponding test column with random noise. For this method to work, noise must be drawn from the same distribution as the original feature values. We follow [18] and compute the new column by a random shuffle of the old one. Notice that the model is not retrained (as it would be in, for instance, feature selection techniques); we simply use the model to make predictions on permuted test sets. Once this procedure is performed on all the columns (several times), we have a measure of the importance of each feature. We consider a logistic regression model trained on 80% of the dataset and we apply the permutation importance procedure to the remaining 20%. We use this tool to analyze how each model exploits the features. In Figure 6 we show the feature importance aggregated by intervals. Namely, we sum the feature importances of all features belonging to the same 15-minute interval.
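A minimal sketch of this procedure, using scikit-learn's permutation_importance (which implements the same shuffle-based idea); the final reshape assumes, for illustration, that features are ordered interval by interval.

```python
# Minimal sketch of permutation importance on a held-out 20% split
# (X and y are illustrative; the model is trained once and never retrained).
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# each test column is shuffled several times; the importance of a feature is
# the mean drop in test score caused by destroying that column
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
# aggregate by 15-minute interval, assuming features ordered interval by interval
interval_importance = result.importances_mean.reshape(6, 141).sum(axis=1)
```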
Figure 6: Feature importance for the different models.
We notice that, except for the 0-30 model, all the other models rely much more on the last interval to make their predictions. This is particularly evident in the 0-90 model, where the last interval is ∼20 times more important than the others.
Given this evidence, to deepen the analysis, we propose to build models that, to the extreme, only exploit the features from the last available 15-minute interval. Even if the labels are not changed, we will refer to these models as 0-15, 15-30, . . . , 75-90, to differentiate them from the previous models. The plot is shown in Figure 7.

Figure 7: AUC (macro) for logistic regression trained on different sections of the match.
Now, by comparing Figures 5 and 7 we can observe the following.
• In both cases (again excluding the 0-15 model), there is an evident increasing trend in the performance of the models as the labels get closer to the final outcome of the match. Note that in this new study the number of features is not changing.
• Each model of Figure 7 loses some AUC points w.r.t. the corresponding one of Figure 5 (except for the 0-15 model, which is exactly the same):
– 15-30: 3 AUC points (0.65 vs 0.68);
– 30-45: 4 AUC points (0.66 vs 0.70);
– 45-60: 3 AUC points (0.72 vs 0.75);
– 60-75: 6 AUC points (0.73 vs 0.79);
– 75-90: 1 AUC point (0.84 vs 0.85).
We can see that the loss of points is not correlated with the number of features removed between the models of Figure 5 and those of Figure 7. In fact, even if the 75-90 model only uses 141 features, in comparison with the 846 of the 0-90 model, it only loses 1 AUC point w.r.t. it.
From these observations, we can first conclude that the increasing trend shown in Figure 5 is not related to the number of features considered, but rather to the labels getting closer to the final outcome.

Moreover, if we now focus on the final outcome (as in the 0-90 and 75-90 models), we can conclude that the last 15 minutes of the match are the most important for understanding it. In fact, it is possible to train a model only on this last interval (the 75-90 model) and obtain AUC performances comparable to those obtained by exploiting the whole match (the 0-90 model). Note that the performances obtained by this 75-90 model are even better than any of those in Table 4. We believe this can be considered strong proof of the existence of what we named elapsing-time bias.
Our conjecture here is that players are reacting to the partial outcome (e.g., [12, 13, 14, 15]), either by trying to change it before the end or by trying to maintain it. Moreover, the partial and the final outcome get closer in expectation as the 90 minutes elapse. This means that the expected outcome becomes a stronger and stronger bias as time elapses. Even if this phenomenon is intuitive and has already influenced some researchers [2, 6], to the best of our knowledge, this is the first quantitative study performed to show the significant consequences that it has on soccer match analysis.
3.4. Study of the Real-Time Output with Real-Time Statistics
In this section, we address the issue of elapsing-time bias by introducing a novel way to set up the learning problem. In particular, we study the correlation between the statistics extracted on a single interval and the outcome of that specific interval. More precisely, an interval is labeled as Win/Draw/Loss if team A scored more/the same/fewer goals than team B within that time interval. This yields six different tasks for each match, one for each of the six time intervals (0-15, 15-30, . . . , 75-90). Note that the features here are the same as those of Figure 7, but the labels are different. Thanks to this modification of the output, the partial (and final) outcome is never observed. This means that game statistics are not studied jointly with the partial (or final) outcome, but rather w.r.t. the real-time consequences of the 15-minute interval.

Figure 8: AUC (mean ± ci) for the different intervals.
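A minimal sketch of this labeling rule, assuming goal times are available as lists of minutes for the two teams (names and the example are illustrative):

```python
# Minimal sketch of the real-time labeling rule: an interval is labeled by the
# goals scored *within* it only (goal-minute lists are illustrative inputs).
def interval_label(goals_a, goals_b, start, end):
    ga = sum(start <= m < end for m in goals_a)  # team A goals in the interval
    gb = sum(start <= m < end for m in goals_b)  # team B goals in the interval
    if ga > gb:
        return "Win"
    if ga < gb:
        return "Loss"
    return "Draw"

# team A wins the match 2-1, but the 75-90 interval itself ends 1-1:
print(interval_label([12, 78], [80], 75, 90))  # -> "Draw"
```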
The results are reported in Figure 8 and from them we can observe the
following.
• The results are fairly stable among the different intervals, in contrast with those of Figure 7. The last interval of each half is, apparently, slightly more difficult to analyze than the others, but this can probably be explained by noticing that these two intervals are inherently more chaotic from a game perspective.
• As in the previous experiments, logistic regression, XGBoost, SVM and NNs performed equally well, while k-NN and RF scored significantly worse.
• The AUC scores obtained here are not as high as those obtained in Section 3.3, suggesting that the task is more challenging. Thanks to the real-time nature of this task, these models never address the partial (or final) output.
We can now conclude that elapsing-time bias does not affect the newly proposed task. In particular, since in Figures 7 and 8 the features are the same, the issue is solved by not exploiting them to understand the partial (or final) output. Note that the game statistics are not modified, so they may still be affected by the expected output. However, since we are not addressing the partial (or final) output, we have a model that does not exploit this bias to understand the game.
To better evaluate the performance of these models we propose two additional comparisons against:

• a baseline model which employs as features only the score at the beginning of the interval;

• a model trained with the same feature vector of game statistics with the addition of the score at the beginning of the interval.
We focus only on the logistic regression model for simplicity. The results are depicted in Figure 9. We notice that the inclusion of the initial score in the feature vector does not seem to change the results significantly. This suggests that the model leverages the match statistics we propose in a meaningful and fairly effective way. This is also highlighted by the comparison with the baseline, which is significantly lower and approaches a value of 0.5, i.e., random guessing.

Figure 9: AUC (mean ± ci) for the different intervals.
Finally, we show that the newly proposed models have the additional desirable characteristic of exploiting the game statistics in a fairly uniform way. In particular, we propose an analysis very similar to that of Figure 6. In this case, we further partition each of the 15-minute intervals into 3 sub-intervals³ of 5 minutes⁴. In Figure 10 the feature importance is now aggregated over intervals of 5 minutes. It is interesting to notice that the newly designed task yields models that observe the whole match in a much more uniform way. In contrast with Figure 6, this means that there is no single interval (e.g., the last 15 minutes) that can be exploited to understand the whole match.

Figure 10: Feature importance for the different models.
³ The 75-90 model has 4 intervals because it also includes the injury time.
⁴ The results are slightly deteriorated in this new 5-minute setting (∼1-2 points w.r.t. the 15-minute model). This is caused by a fragmentation of the features into pieces of information that are too fine-grained and thus not easily exploited by the models.
4. Conclusions
In this work, we conducted a series of experiments devoted to exposing what we called elapsing-time bias. Namely, we discovered that, as the match progresses, we are able to predict the final match outcome with high confidence simply by looking at the in-game statistics of the last 15-minute interval. We conjecture that the latter is due to players reacting to the partial outcome, which gets closer, as time goes by, to the final score.

We believe that such a bias poses a serious problem when using the obtained models to perform any kind of subsequent analysis, e.g., as is often done in the literature, determining which factors contribute most to winning a match.
Hence, we proposed to re-frame the learning problem in such a way that it is not affected by elapsing-time bias. In particular, the novel task does not use the final (or partial) outcome as the mapping labels. For each interval, the mapping output is instead computed taking into account only the goals scored within that time frame.
Finally, we presented a novel set of experiments to analyze the perfor-
mance of the models in this new scenario. This study shows that the novel
task is not affected by elapsing-time bias.
Appendix A. Models
In this appendix we briefly describe the different machine learning models employed in this work, which implementations were used, and how they were set up.
name | description | events | mean | std | min | max
aerial absolute | # of aerial | aerial | 28.64 | 13.51 | 1.0 | 106.0
aerial percentage | succ % on aerial | aerial | 0.49 | 0.12 | 0.0 | 1.0
attack event | # of almost any event within zone | almost any event | 192.01 | 63.25 | 27.0 | 578.0
ball possession | # of possession events | possession events | 0.49 | 0.1 | 0.16 | 0.84
ball recovery | # of ball recovery | ball recovery | 47.2 | 11.97 | 1.0 | 119.0
ball touch | # of ball touch | ball touch | 17.17 | 9.07 | 0.0 | 48.0
blocked pass | # of blocked pass | blocked pass | 3.71 | 4.61 | 0.0 | 28.0
challenge absolute | # of challenge | challenge | 8.37 | 4.5 | 0.0 | 44.0
challenge defence absolute | # of challenge within zone | challenge | 2.36 | 1.96 | 0.0 | 15.0
chance missed | # of chance missed | chance missed | 0.06 | 0.27 | 0.0 | 3.0
claim | # of claim | claim | 1.35 | 1.37 | 0.0 | 11.0
clearance | # of clearance | clearance | 28.05 | 12.62 | 1.0 | 100.0
corner | # of corner | corner | 5.02 | 2.87 | 0.0 | 21.0
cross absolute | # of crosses | pass | 12.17 | 5.76 | 0.0 | 48.0
cross not claimed | # of cross not claimed | cross not claimed | 0.05 | 0.24 | 0.0 | 4.0
cross percentage | succ % on crosses | pass | 0.24 | 0.14 | 0.0 | 1.0
defence event | # of almost any event within zone | almost any event | 191.88 | 41.85 | 59.0 | 496.0
dispossessed | # of dispossessed | dispossessed | 11.82 | 4.61 | 0.0 | 35.0
error | # of error | error | 0.27 | 0.54 | 0.0 | 4.0
error defence | # of error within zone | error | 0.19 | 0.44 | 0.0 | 4.0
formation change | # of formation change | formation change | 0.75 | 0.95 | 0.0 | 12.0
foul | # of foul | foul | 13.82 | 4.55 | 1.0 | 38.0
good skill | # of good skill | good skill | 0.29 | 0.67 | 0.0 | 9.0
h index | h index on length of actions | pass | 7.69 | 1.59 | 3.0 | 15.0
inner center shots | # of shot events within zone | shot events | 5.35 | 3.09 | 0.0 | 23.0
inner left shots | # of shot events within zone | shot events | 0.09 | 0.31 | 0.0 | 3.0
inner right shots | # of shot events within zone | shot events | 0.08 | 0.29 | 0.0 | 3.0
interception | # of interception | interception | 17.22 | 7.03 | 1.0 | 65.0
keeper pickup | # of keeper pickup | keeper pickup | 7.09 | 3.09 | 0.0 | 25.0
keeper sweeper | # of keeper sweeper | keeper sweeper | 0.77 | 1.05 | 0.0 | 9.0
mean n seq | mean length of actions | pass | 5.55 | 0.98 | 3.29 | 11.96
mean pass length | mean length of passages | pass | 22.24 | 1.91 | 14.68 | 31.81
missed shots | # of missed shots | missed shots | 5.21 | 2.67 | 0.0 | 19.0
n excessive seq | # of long actions | pass | 8.79 | 5.62 | 0.0 | 42.0
n seq | # of actions | pass | 44.0 | 10.63 | 9.0 | 93.0
n shot seq | # of pass + shots events | pass + shots events | 0.98 | 1.24 | 0.0 | 12.0
offside given | # of offside given | offside given | 2.49 | 1.94 | 0.0 | 15.0
offside pass | # of offside pass | offside pass | 2.48 | 1.93 | 0.0 | 15.0
offside provoked | # of offside provoked | offside provoked | 2.48 | 1.93 | 0.0 | 15.0
other shots | # of shot events within zone | shot events | 0.42 | 0.72 | 0.0 | 8.0
outer center shots | # of shot events within zone | shot events | 5.19 | 2.94 | 0.0 | 23.0
outer left shots | # of shot events within zone | shot events | 0.2 | 0.47 | 0.0 | 5.0
outer right shots | # of shot events within zone | shot events | 0.15 | 0.4 | 0.0 | 4.0
pass absolute | # of pass | pass | 470.08 | 111.69 | 156.0 | 1096.0
pass attack absolute | # of pass within zone | pass | 54.72 | 23.19 | 4.0 | 231.0
pass attack percentage | succ % on pass within zone | pass | 0.57 | 0.09 | 0.1 | 0.95
pass defence absolute | # of pass within zone | pass | 68.72 | 16.61 | 22.0 | 215.0
pass defence percentage | succ % on pass within zone | pass | 0.67 | 0.12 | 0.23 | 1.0
pass percentage | succ % on pass | pass | 0.74 | 0.07 | 0.31 | 0.93
passages possession | passage distance covered | pass | 7906.97 | 2450.43 | 1575.01 | 19466.68
penalty faced | # of penalty faced | penalty faced | 0.08 | 0.28 | 0.0 | 3.0
punch | # of punch | punch | 0.58 | 0.86 | 0.0 | 9.0
red card | # of red card | red card | 0.05 | 0.23 | 0.0 | 3.0
save | # of save | save | 6.04 | 3.27 | 0.0 | 31.0
saved shot | # of saved shot | saved shot | 6.06 | 3.29 | 0.0 | 31.0
shield ball opp | # of shield ball opp | shield ball opp | 0.54 | 0.79 | 0.0 | 7.0
shot on post | # of shot on post | shot on post | 0.23 | 0.49 | 0.0 | 4.0
smother | # of smother | smother | 0.11 | 0.36 | 0.0 | 6.0
substitution | # of substitution | substitution | 2.79 | 0.48 | 0.0 | 9.0
tackle absolute | # of tackle | tackle | 20.03 | 5.92 | 3.0 | 47.0
tackle defence absolute | # of tackle within zone | tackle | 6.75 | 3.29 | 0.0 | 24.0
tackle defence percentage | succ % on tackle within zone | tackle | 0.77 | 0.19 | 0.0 | 1.0
tackle percentage | succ % on tackle | tackle | 0.75 | 0.1 | 0.15 | 1.0
take on absolute | # of take on | take on | 18.58 | 7.39 | 0.0 | 65.0
take on attack absolute | # of take on within zone | take on | 6.71 | 3.74 | 0.0 | 28.0
take on attack percentage | succ % on take on within zone | take on | 0.32 | 0.22 | 0.0 | 1.0
take on percentage | succ % on take on | take on | 0.44 | 0.15 | 0.0 | 1.0
various error | # of error events | error events | 0.39 | 0.67 | 0.0 | 5.0
x mass center | mean x position of events | almost any event | 47.57 | 4.83 | 23.66 | 65.95
y mass center | mean y position of events | almost any event | 48.53 | 3.37 | 32.87 | 60.88
yellow card | # of yellow card | yellow card | 2.03 | 1.37 | 0.0 | 10.0

Table A.6: For every feature the table shows the name, the description, the events involved in the extraction, and a few basic statistics. The features shown here are extracted from the performances of both teams over the whole match.
k-Nearest Neighbor
A popular classification method is k-Nearest Neighbor (k-NN) [19]. This learning algorithm is memory-based: the fitting procedure amounts to memorizing the training set. Given a query point x, k-NN searches for the k nearest neighbors {x_{i1}, . . . , x_{ik}} of x among the training points according to some distance function. The output is then produced by a majority vote among the training labels {y_{i1}, . . . , y_{ik}}. When probabilities are required, no voting is performed and probabilities are obtained simply by counting the number of examples of each class among the k neighbors. This is how we produce probabilistic outputs in our experiments in Section 3.
We used k-NN, for its simplicity, as a baseline against the other algorithms. We employed the k-NN implementation available in the scikit-learn [20] collection with the Euclidean norm as the distance function. As an example, k-NN has been used to predict the outcome of football matches given the bookmakers' odds in [21].
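A minimal sketch of this setup (the value of k and the X_tr/X_te split names are illustrative):

```python
# Minimal sketch of the k-NN baseline with Euclidean distance.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=15, metric="euclidean")
knn.fit(X_tr, y_tr)              # memory-based: fitting stores the training set
probs = knn.predict_proba(X_te)  # class frequencies among the k neighbors
```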
Support Vector Machines
Support Vector Machines [22] were originally developed for binary classification problems. The idea is to find a hyperplane (w, b) that separates positive (P) and negative (N) examples and, once such a hyperplane has been determined, express the classification function⁵ f : Rⁿ → {−1, 1} as

f(x) = sign(⟨w, x⟩ + b).    (A.1)

⁵ Here, the two classes are −1 and 1.
In practice, linear SVMs may perform poorly when the data are highly non-linear. Non-linear SVMs map the input vectors into a higher-dimensional space F, called the feature space, through a non-linear mapping function φ. In this work we used both the linear version of SVM and SVM equipped with the RBF kernel K(x, z) = exp(−γ‖x − z‖²), with γ > 0.
When probabilistic outputs are required, the distance from the hyperplane can be turned into a probability estimate through calibration techniques like isotonic regression or Platt's scaling. For the experiments we employed the Platt's scaling technique available in scikit-learn [20].
As for the multi-class case, several extensions of the SVM model have been proposed in the literature. Two of the most popular are the one-vs-all and one-vs-one techniques. Both rely on training several binary SVM classifiers. In our experiments we used the one-vs-all version. We used libSVM [23] to train the SVMs. For an extensive treatment of SVMs we refer the reader to the book [24]. SVMs have been widely used for sports applications as well [25, 26].
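A minimal sketch of this setup through scikit-learn's wrapper of libSVM (C and gamma are illustrative; probability=True fits a Platt-style calibrator internally):

```python
# Minimal sketch of the one-vs-all RBF SVM with Platt-scaled probabilities.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

svm = OneVsRestClassifier(
    SVC(kernel="rbf", C=1.0, gamma=0.1, probability=True))  # one binary SVM per class
svm.fit(X_tr, y_tr)
probs = svm.predict_proba(X_te)
```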
Neural Networks
Neural Networks (NNs) are a powerful class of functions that can approximate any continuous function, provided that enough parameters are employed, as stated in the universal approximation theorem [27].

Neural networks can be arranged in a variety of ways depending on the number and size of layers and on the choice of activation, output and loss functions. In our experiments, we used the hyperbolic tangent as the activation function, the softmax as the output function and the cross-entropy as the loss function. The number and size of layers were determined using cross-validation.
We used TensorFlow [28] to implement the NNs and Stochastic Gradient Descent [29] to train them. Applications of neural networks in sports prediction can be found, for example, in [30].
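A minimal sketch of such a network in tf.keras (layer sizes, the learning rate, the 846-feature input of the whole-match setting, and the integer-encoded labels y_tr_int are illustrative; in the paper the architecture was chosen by cross-validation and early stopping used 20% of the training data):

```python
# Minimal sketch of the NN setup: tanh activations, softmax output,
# cross-entropy loss, SGD training, early stopping on a validation split.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="tanh", input_shape=(846,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(3, activation="softmax"),   # Win / Draw / Loss
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss="sparse_categorical_crossentropy")  # cross-entropy loss
model.fit(X_tr, y_tr_int, epochs=200,
          validation_split=0.2,  # early stopping on 20% of the training data
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=10)])
```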
Logistic Regression
Although Logistic Regression was first developed as a modification of Linear Regression for classification, it can be described as a simple neural network. It is, in fact, equivalent to a single-layer feed-forward neural network with a single output neuron and a sigmoid output function. The training loss of choice is the binary cross-entropy. Its multi-class extension can be achieved in a one-vs-all or one-vs-one fashion, as described for SVM, or by modifying the architecture to include multiple output units; in this case, the softmax is chosen as the output function, as in NNs. In our experiments, we used the latter. We took the implementation from the scikit-learn collection [20] and used the ℓ2 norm for the regularization term.
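A minimal sketch of this setup (the regularization strength C is illustrative; with scikit-learn's default lbfgs solver the multi-class extension is the multinomial/softmax one described above):

```python
# Minimal sketch of l2-regularized, multinomial logistic regression.
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
logreg.fit(X_tr, y_tr)
probs = logreg.predict_proba(X_te)  # softmax probabilities over the three classes
```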
Random Forests and XGBoost
Random Forests [17] and XGBoost [31] are two very popular machine
learning techniques that both exploit Decision Trees as a building block for
composing larger methods.
Random Forests are probably the most famous application of ensembling: given a variety of different models, each prediction is obtained by combining their outputs. In the case of Random Forests, this combination is obtained by employing many randomized Decision Tree models built by sub-sampling the dataset along the axes of examples and/or features, together with a random selection of the splitting feature.
XGBoost is a specific implementation of the Gradient Boosting algorithm originally proposed by Friedman [32]. Boosting techniques also belong to the class of ensemble methods, since the final function is obtained by combining various simpler functions (in this case, Decision Trees), but they differ in the way these are trained. We refer the reader to [31] for further details on the Gradient Boosting algorithm.
We used the implementations from [20] and [33] for Random Forest and XGBoost, respectively. In [25] both Random Forest and XGBoost have been used to predict the outcome of a football match.
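A minimal sketch of the two ensembles (hyper-parameter values are illustrative; y_tr_int denotes integer-encoded labels, which recent xgboost versions require):

```python
# Minimal sketch of the two tree ensembles.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=0)  # bagged randomized trees
rf.fit(X_tr, y_tr)

xgb = XGBClassifier(n_estimators=500, learning_rate=0.1)       # gradient-boosted trees
xgb.fit(X_tr, y_tr_int)
```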
References
[1] L. Pappalardo, P. Cintia, A. Rossi, E. Massucco, P. Ferragina, D. Pedreschi, F. Giannotti, A public data set of spatio-temporal match events in soccer competitions, Scientific Data 6 (1) (2019) 1–15.
[2] H. Liu, M.-A. Gomez, C. Lago-Penas, J. Sampaio, Match statistics related to winning in the group stage of 2014 Brazil FIFA World Cup, Journal of Sports Sciences 33 (12) (2015) 1205–1213.
[3] F. A. Moura, L. E. B. Martins, S. A. Cunha, Analysis of football game-related statistics using multivariate techniques, Journal of Sports Sciences 32 (20) (2014) 1881–1887.
[4] C. Lago-Penas, J. Lago-Ballesteros, A. Dellal, M. Gomez, Game-related statistics that discriminated winning, drawing and losing teams from the Spanish soccer league, Journal of Sports Science & Medicine 9 (2) (2010) 288.
[5] C. Lago-Penas, J. Lago-Ballesteros, E. Rey, Differences in performance indicators between winning and losing teams in the UEFA Champions League, Journal of Human Kinetics 27 (2011) (2011) 135–146.
[6] H. Lepschy, H. Wasche, A. Woll, Success factors in football: an analysis of the German Bundesliga, International Journal of Performance Analysis in Sport 20 (2) (2020) 150–164.
[7] J. Castellano, D. Casamichana, C. Lago, The use of match statistics that discriminate between successful and unsuccessful soccer teams, Journal of Human Kinetics 31 (1) (2012) 137–147.
[8] H. Liu, W. G. Hopkins, M.-A. Gomez, Modelling relationships between match events and match outcome in elite football, European Journal of Sport Science 16 (5) (2016) 516–525.
[9] L. Pappalardo, P. Cintia, Quantifying the relation between performance and success in soccer, Advances in Complex Systems 21 (4) (2018) 1750014.
[10] Y. Li, R. Ma, B. Goncalves, B. Gong, Y. Cui, Y. Shen, Data-driven team ranking and match performance analysis in Chinese Football Super League, Chaos, Solitons & Fractals 141 (2020) 110330.
[11] R. P. Bunker, F. Thabtah, A machine learning framework for sport result prediction, Applied Computing and Informatics 15 (1) (2019) 27–33.
[12] J. Sampaio, C. Lago, L. Casais, N. Leite, Effects of starting score-line, game location, and quality of opposition in basketball quarter score, European Journal of Sport Science 10 (6) (2010) 391–396.
[13] L. Vaz, M. Van Rooyen, J. Sampaio, Rugby game-related statistics that discriminate between winning and losing teams in IRB and Super Twelve close games, Journal of Sports Science & Medicine 9 (1) (2010) 51.
[14] M.-A. Gomez, A. DelaSerna, C. Lupo, J. Sampaio, Effects of situational variables and starting quarter score in the outcome of elite women's water polo game quarters, International Journal of Performance Analysis in Sport 14 (1) (2014) 73–83.
[15] C. Lupo, G. Condello, L. Capranica, A. Tessitore, Women's water polo world championships: Technical and tactical aspects of winning and losing teams in close and unbalanced games, The Journal of Strength & Conditioning Research 28 (1) (2014) 210–222.
[16] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
[17] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.