Every set of fans thinks they’re hard done by. It doesn’t matter how big or small the team is, there will still be fans that believe the league and the referees are out to get them. Bayern fans think they never get any calls, and Dortmund fans think Bayern get favored all the time. This goes on in every country and at every level. But who is right in this situation? And is it the product of widespread corruption favoring a certain team, or something else?
I’m not entirely convinced that there is any great conspiracy. There rarely is. On every single election day in Britain, a rumor spreads claiming that we are given pencils to mark our ballots so that the government can erase and change our votes. The actual reason is that the ink from pens can smudge and spoil the ballot, and pencils are a cheap solution. But that doesn’t stop people showing up with their own pen and spoiling their ballot. The earth isn’t flat, climate change is real, and I’m pretty certain that referees are not actively seeking to ruin any particular team. Referee corruption does exist, but I’m not convinced it is a widespread problem. However, that doesn’t mean that referee bias isn’t real. If referees are biased towards a certain team (or teams), it isn’t necessarily the product of corruption, but may be implicit.
There are lots of possible explanations for referees making “incorrect” decisions. Referees might be influenced by the size of the crowd, by reputations, or by the state of the game at the time. If there is a specific conspiracy, we should see teams consistently favored, or punished unfairly, over the course of several seasons, or certain referees making significantly more mistakes than others. In this article I hope to dig into this topic, test for the presence of referee bias, and consider potential explanations for its existence.
The Existence of Bias: Prior Research
Agency theory provides a starting point for explaining referee impartiality and bias. The principal-agent problem is a situation where a person or group (the agent) acts on behalf of another person or group (the principal) but is incentivized to act in self-interest and against the interests of the principal. Agency theory states that the agent’s and principal’s incentives must be aligned in order to produce impartiality. Applied to this context, the referee is the agent, and in order to guarantee impartiality, their incentives must be aligned with those of the football association (the principal). The football association can use monetary rewards and punishments to ensure that referees are incentivized to officiate as the association sees fit, but the problem is that material rewards are not the only factor. Social psychology argues that individual decisions are not solely based on material reward, and therefore material incentives alone are insufficient. Social goods also incentivize individual decisions; in this case, fan responses to referee decisions act as social approval or sanction. Economists have made similar arguments regarding the influence of social forces on an individual’s utility function. Simply put, referees will not only be driven by financial reward or sanction, and may instead be led astray by social incentives. When referees exhibit bias, it doesn’t mean they must have been incentivized by money.
Many of the social forces influencing referees are not deliberately trying to do so, but are cues that trigger biased decisions. Dohmen & Sauermann (2016) have compiled a very accessible review of the literature on referee bias. The most common explanation is that home fans influence referees to favor the home team (Sutter & Kocher 2004; Garicano et al 2005; Rickman & Witt 2008).
Research finds that referees add more stoppage time when the home team is behind by one goal than when the home team is ahead by one goal, and on average findings suggest stoppage time may be extended by as much as two minutes (Garicano et al 2005; Sutter & Kocher 2004). Similar effects have been found when studying the difference in penalties awarded (Sutter & Kocher 2004), and goal difference (Boyko, Boyko, & Boyko 2007). Some studies have even gone as far as to identify incorrect decisions using expert assessments, finding that controversial goals and penalties awarded to the home team were significantly more likely to be incorrect decisions (Dohmen 2008).
The academic debate regarding referee bias has also sought to identify causes beyond the home team advantage. Referees are typically influenced by the size and composition of the supporting crowd, the distance of the crowd from the referee, and the potential returns from a win (the importance of the game) (Garicano et al 2005; Dohmen 2008; Pettersson-Lidbom & Priks 2010). Dohmen (2008) finds that running tracks between the crowd and the pitch reduce referee bias, while Pettersson-Lidbom & Priks (2010) use stadium bans in Italy to test the effect of crowd noise/presence on referee decisions, finding that referees award significantly fewer yellow cards and fouls to the home team when the crowd is present. Garicano et al. (2005) find that an increase in crowd size and density increases referee bias. They also find that stoppage time bias increases towards the end of the season, suggesting that as the stakes increase referees are more likely to favor the home team.
Data & Methodology
This article will use data collected from the German football website Wahre Tabelle (which translates to True Table) on referee errors from 08/09 to 18/19. Wahre Tabelle is a website dedicated to judging referee decisions and calculating the “correct” table, without referee error or bias. Community members nominate possible mistakes (penalties, offsides, red cards etc.) and the editors of the site then study the incident to decide whether the result needs correcting. Finally, they calculate the True Table, with the points tallied up based on the corrected results. For more information, here is a full explanation of how Wahre Tabelle calculates their table.
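To make the idea of a corrected result concrete, here is a minimal sketch of how one overturned decision could change the points a team takes from a match. It is an illustration only, not Wahre Tabelle’s actual procedure: the function names are my own, the match is hypothetical, and I am assuming that each correction simply adds or removes one goal.

```python
def league_points(goals_for, goals_against):
    """Standard 3-1-0 league scoring for one team in one match."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0


def corrected_score(goals_for, goals_against, wrongly_awarded=0, wrongly_denied=0):
    """Adjust one team's goal tally (assumption: one correction = one goal).

    The opponent's tally is left untouched in this simplified sketch.
    """
    return goals_for - wrongly_awarded + wrongly_denied, goals_against


# Hypothetical match: the home side wins 1-0 through a penalty
# that is later judged to have been an incorrect call.
actual_home = league_points(1, 0)
true_for, true_against = corrected_score(1, 0, wrongly_awarded=1)
true_home = league_points(true_for, true_against)
points_difference = actual_home - true_home

print(actual_home, true_home, points_difference)  # 3 1 2
```

Summing these per-match differences over a season gives the gap between a team’s actual points and its True Table points.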
Wahre Tabelle is not a precise measure of referee bias or error. In reality, we’re relying on the site’s judgment to be impartial and fair. The biggest factor is hindsight: being able to look back at decisions and judge them helps clear up a lot of calls. I think the method is reliable enough for us to trust that it is not skewed by systematic bias, and that it is a good enough judgment of referee decisions to use as the basis for this analysis.
I will use regression analyses and hypothesis tests to explain referee bias, testing the effect of crowd size, club stature, league position and home team favoritism. Finally, I will compare individual referee bias to see if there is any evidence of corruption or ineptitude.
To get us started, I looked at recent seasons to see which teams have been favored and how this bias might have altered the final league table. The bad news is that according to Wahre Tabelle we can’t blame the referees for Dortmund’s collapse last season. The actual table finished with Bayern on 78 points and Dortmund on 76, but Wahre Tabelle has Bayern finishing 4 points ahead instead.
In the 17/18 season, Bayern got the same number of points in both tables (84). However, Dortmund were unlucky (receiving 5 fewer points) and should have finished 2nd ahead of Schalke (who received 4 more). The season before that, Bayern got a little unlucky, getting 82 points instead of 85, though it ultimately didn’t make a difference. Dortmund, however, received the correct number of points. Finally, in the 15/16 and 14/15 seasons respectively, Bayern received 1 point fewer and 3 points more, and Dortmund received 1 point more and 4 points fewer, than their True Table points.
These numbers don’t suggest that anyone is getting any major favors, but it’s necessary to go a little deeper in order to see if there is anything untoward going on. Table 1 presents the 20 best Bundesliga teams in the last 12 seasons, ranked by total points.
Table 1: Cumulative Points Table
|Position||Team||Total Points||True Table Points||Points Difference||Corrections Difference|
Unsurprisingly, Bayern and Dortmund top both the actual and amended points tables. However, there would be very little difference in the league table if we ranked by the amended points.
In order to identify referee bias, we can use either the corrections difference or points difference. Figure 1 presents teams rank ordered by their corrections difference, which is the cumulative sum of goals denied (goals that should have been) minus the sum of goals awarded (goals that shouldn’t have been). Positive differences mean that a team was denied more goals than it was wrongly awarded, so bias negatively affected the team. Corrections difference can be thought of as the number of goals that should be added to (or subtracted from) a team’s total.
The teams most negatively affected were Werder Bremen and Bayern Munich. Borussia Dortmund were in 4th behind Hertha BSC. Bayern were awarded 64 goals, but they were denied 101. This might suggest that Bayern were the victims of negative referee bias, but in reality I think this is a product of them having more goalscoring opportunities. Dortmund, similarly, were awarded 60 goals they didn’t deserve, and were denied 79 that they did deserve.
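As a quick check on the arithmetic, the corrections difference for the two clubs can be worked out from the tallies quoted above. This is only a sketch using the definition given earlier (goals denied minus goals awarded); the function and variable names are my own.

```python
def corrections_difference(goals_denied, goals_awarded):
    """Goals that should have counted minus goals that should not have."""
    return goals_denied - goals_awarded


# (goals denied, goals awarded), as quoted in the text
tallies = {
    "Bayern Munich": (101, 64),
    "Borussia Dortmund": (79, 60),
}

differences = {
    team: corrections_difference(denied, awarded)
    for team, (denied, awarded) in tallies.items()
}

for team, diff in sorted(differences.items(), key=lambda item: item[1], reverse=True):
    print(f"{team}: {diff:+d}")
# Bayern Munich: +37
# Borussia Dortmund: +19
```

On this definition, a larger number means more goals were wrongly taken away than were wrongly given.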
When we rank teams by points differences, the team that has benefited from incorrect referee decisions the most is Freiburg! They earned a total of 35 points more than they should have, followed by Hannover having received 31 points more than they deserved. Dortmund actually received fewer points than they deserved, but only 7 fewer. This places them 12th in the points difference table, while Bayern Munich are 15th, receiving 10 fewer points than they should have.
Figure 2 splits the corrections for each team by season, in order to identify instances where a team has experienced an especially low or high number of corrections.
Though there are some instances of teams being favored heavily in a particular season (Bayern and Koln), the distribution doesn’t suggest anything untoward is going on.
Following this, I used regression analysis to try and identify the sources of this bias. I ran regressions of the corrections difference on market values, to test whether the big clubs were favored more than smaller clubs. The regression demonstrates a statistically significant and positive effect on corrections differences, which means that the bigger clubs actually suffer more negative referee bias than smaller clubs. However, I suspect this is just because bigger teams will have more attacking opportunities, and the corrections are based on goalscoring opportunities. The bigger teams tend to be better too, and they produce more opportunities for referees to take away goals. Controlling for goals scored should reduce this, and indeed seems to do so. It remains significant (barely) and positive, but the reduction in significance gives me confidence in the conclusion.
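To show why controlling for goals scored matters, here is a small sketch on synthetic data. None of these numbers come from Wahre Tabelle: I am inventing club-seasons in which corrections depend only on goals scored, and goals scored rise with a made-up market value. The helper names are my own, and the result only illustrates the mechanism, not my actual estimates (where the effect shrank but stayed barely significant).

```python
import random


def slope_intercept(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (b, a)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return b, mean_y - b * mean_x


def residuals(x, y):
    """What is left of y after removing its linear relationship with x."""
    b, a = slope_intercept(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


def slope_controlling_for(x, control, y):
    """Coefficient on x in a regression of y on x and control.

    Uses the Frisch-Waugh-Lovell result: regress the part of y not
    explained by the control on the part of x not explained by it.
    """
    return slope_intercept(residuals(control, x), residuals(control, y))[0]


# Synthetic club-seasons (invented for illustration only).
rng = random.Random(42)
market_value = [rng.uniform(20, 600) for _ in range(1000)]  # hypothetical units
goals_scored = [30 + 0.08 * mv + rng.gauss(0, 8) for mv in market_value]
# Corrections depend on goals scored only, not directly on market value.
corrections_diff = [0.05 * g + rng.gauss(0, 1) for g in goals_scored]

raw_slope, _ = slope_intercept(market_value, corrections_diff)
controlled_slope = slope_controlling_for(market_value, goals_scored, corrections_diff)

print(f"market value only:            {raw_slope:.4f}")
print(f"controlling for goals scored: {controlled_slope:.4f}")
```

In this toy setup the market value coefficient is clearly positive on its own and falls towards zero once goals scored is held fixed, which is the pattern I would expect if bigger clubs simply create more chances for goals to be wrongly taken away.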
I was also unable to identify a relationship between the size of the crowd and referee bias. However, research has identified a crowd size effect, so the most likely explanation is that the aggregation of the data in this analysis is hiding the effect of crowd size. The analysis has been carried out at season level, meaning that the crowd size variable is measured as the average attendance per season for each club. It is possible that if the analysis were carried out at match level, we might observe an effect. Given the wealth of evidence supporting the theory that crowd size matters, I think it is safe to conclude this is probably the result of limitations of this analysis.
Wahre Tabelle also count the number of dismissals received by teams in a season, and compare this with their true table of dismissals over the course of the season. They assign a point value of 3 points for two yellows, and 5 points for a straight red. Unfortunately they have only tallied up these differences for the last three seasons, so the analysis is a little more restricted, but what is immediately noticeable is that referees tend to send fewer players off than they should. It seems like referees tend towards leniency.
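The card points tally can be sketched as follows, using the point values above (3 for a second yellow, 5 for a straight red). The season shown is hypothetical, the names are my own, and I am assuming the difference is taken as actual minus true, so a negative value means the referees were more lenient than they should have been.

```python
CARD_POINTS = {"second_yellow": 3, "straight_red": 5}


def card_points(dismissals):
    """Total card points for a list of dismissals."""
    return sum(CARD_POINTS[kind] for kind in dismissals)


# Hypothetical season for one team.
given = ["second_yellow", "straight_red"]                     # dismissals that happened
deserved = ["second_yellow", "straight_red", "straight_red"]  # after corrections

difference = card_points(given) - card_points(deserved)
print(card_points(given), card_points(deserved), difference)  # 8 13 -5
```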
Table 2: Cumulative Card Points
|Position||Team||Card Points||True Card Points||Card Points Difference|
There is a statistically significant difference between the actual card points and Wahre Tabelle’s corrected card points, with a mean difference of about -6.68. However, regression analysis fails to identify a statistically significant relationship with any measure of either stature or crowd size. The data for dismissals also isn’t split by home or away, so I was unable to test for a home team effect.
Home vs Away
The existing literature on referee bias has primarily argued that the biggest factor is a bias for the home team. This makes some intuitive sense, and is backed up by research in multiple fields that has identified the effect of social incentives on human decision-making. There are real, immediate social costs for a referee making calls against the home team, especially if those calls are disputable. As a result, there’s reason to suspect referees are more likely to favor the home team when making close calls.
I use hypothesis testing methods to test whether referee bias in home games is statistically different from referee bias in away games. A hypothesis test starts from the “null hypothesis” that the means (or medians) of the two groups are equal, in other words that the true difference between them is 0. If the observed difference would be very unlikely under the null, we reject the null and conclude that there is a statistically significant difference between the groups.
I use a bootstrapped Yuen’s trimmed mean test to test whether there is a significant difference between home and away teams in their points differences (corrected points vs actual points). This method uses 20% trimmed means, removing the largest 20% and the smallest 20% of observations in each group, which reduces the effect of extreme values and allows us to better identify the central tendency of the two groups. Following this, a hypothesis test is applied using a bootstrap method. This means the observed data are resampled 10,000 times to approximate the sampling distribution of the difference between the two groups’ trimmed means, which is then used to calculate a p-value and confidence intervals. The test result is statistically significant, meaning we can reject the null, and conclude that the “true” home and away points differences are not the same. In order to further validate this test, I also computed several other hypothesis tests that might identify errors in my assumptions, but all find statistically significant differences between the groups. The estimated difference is about 1.2. This means that teams, on average, gain about 1.2 more points at home from incorrect calls than they would away from home. Though this doesn’t seem like a huge difference, I’d argue it’s pretty substantively important.
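For readers who want to see the moving parts, here is a simplified sketch of the idea: compare 20% trimmed means and use resampling to judge the difference. It is not the exact test I ran. It uses a plain percentile bootstrap for two independent groups rather than the bootstrap-t version usually associated with Yuen’s test, the function names are my own, and the home and away numbers are invented.

```python
import random


def trimmed_mean(values, trim=0.2):
    """Mean after dropping the smallest and largest `trim` share of values."""
    xs = sorted(values)
    k = int(trim * len(xs))  # observations removed from each tail
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


def bootstrap_trimmed_diff(home, away, trim=0.2, n_boot=10_000, seed=0):
    """Percentile bootstrap for the difference in trimmed means (home - away)."""
    rng = random.Random(seed)
    observed = trimmed_mean(home, trim) - trimmed_mean(away, trim)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        h = rng.choices(home, k=len(home))
        a = rng.choices(away, k=len(away))
        diffs.append(trimmed_mean(h, trim) - trimmed_mean(a, trim))
    diffs.sort()
    ci = (diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1])
    # Two-sided p-value: how often the resampled difference falls at or below 0.
    share_low = sum(d <= 0 for d in diffs) / n_boot
    p_value = min(1.0, 2 * min(share_low, 1 - share_low))
    return observed, ci, p_value


# Invented points differences (actual minus true) for ten teams.
home = [2, 1, 3, 0, 2, 4, 1, 2, 3, 1]
away = [0, -1, 1, 0, -2, 1, 0, -1, 2, 0]

observed, (ci_low, ci_high), p_value = bootstrap_trimmed_diff(home, away)
print(f"difference in trimmed means: {observed:.2f}")
print(f"95% interval: [{ci_low:.2f}, {ci_high:.2f}], p = {p_value:.4f}")
```

If the interval excludes 0 and the p-value is small, the home and away groups differ by more than resampling noise alone would explain.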
Figure 3 visualizes the Yuen’s trimmed means test. Though it looks like there is quite a lot going on here, let me explain. The home “violin” (the plots are called violin plots) is longer and thinner, meaning that the range of home points differences is greater, while the away team differences are often negative (the away violin is thinner above 0). The red line shows the difference in the two means (the dashed lines are the differences in each team’s home and away points difference), while the ξ = 0.3 is the effect size. 0.3 is a moderate effect size, suggesting this is both statistically significant and substantively important.
Clearly, referees favor the home team more than anything else. This supports the existing research on the topic of referee bias.
Individual Referee Performance
Finally, Wahre Tabelle tally up the performances of individual referees. They compile data on each referee over the seasons, including the number of corrections they have applied to their games. If something untoward was going on with just one or two referees, rather than the whole lot, we would potentially notice a spike in the mistakes they’re making.
On the whole, there doesn’t appear to be anything in the data that suggests anything untoward is going on. The distribution of proportions is about normal, and in the cases where the proportions are especially high, the referee didn’t last long in the league. This is presumably because they were ditched for being terrible at their job!
Figure 4 shows the distribution of each referee’s proportion of mistakes per season. There is little evidence to suggest any referees are doing anything corrupt, and there’s very little evidence to suggest any referee is particularly bad at their job either.
For the most part referees make errors in around a quarter of games, but there are also times when a referee’s proportion of errors rises to 50% or above. The vast majority of the peaks are between 30% and 60%. There’s little reason to suspect any of this is any more than human error and implicit bias.
People will continue to believe that referees are either out to get their team or favor another team, but there’s very little evidence to suggest this goes on in any intentional, corrupt capacity. There will be instances of corruption, but these will be the minority of cases. For the most part, it seems that referees favor the home team, and do so even more when the crowd is bigger and denser. There is some research that suggests referees might favor bigger teams, or teams at the top of the table, but those findings are far from the consensus.
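The kind of check described here can be sketched in a few lines: work out each referee’s share of games with a correction and flag anyone far above the rest. The referees and numbers below are hypothetical, and the cut-off of 1.5 standard deviations above the mean is my own arbitrary assumption for illustration, not something Wahre Tabelle uses.

```python
from statistics import mean, stdev

# Hypothetical referee-seasons: (games officiated, games with a correction)
records = {
    "Referee A": (20, 5),
    "Referee B": (18, 4),
    "Referee C": (22, 6),
    "Referee D": (16, 4),
    "Referee E": (19, 5),
    "Referee F": (10, 7),
}

proportions = {name: errors / games for name, (games, errors) in records.items()}

centre = mean(list(proportions.values()))
spread = stdev(list(proportions.values()))
threshold = centre + 1.5 * spread  # arbitrary cut-off, assumed for illustration

flagged = [name for name, share in proportions.items() if share > threshold]

for name, share in proportions.items():
    print(f"{name}: {share:.0%}")
print("Flagged:", flagged)
```

A flag like this only says a referee-season is unusual; it cannot distinguish bad luck or a small number of games from anything more sinister.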
In reality, the most feasible explanation is the simplest explanation: referees are swayed when thousands of people scream at them. Referees are human, and human beings make mistakes. Individuals are affected by social incentives, and there are few more convincing incentives than a stadium full of angry people telling you what to do.
I think the commonly held view that the biggest and best teams are favored more than others can also be explained by social psychology. Individuals are led by their own confirmation bias and motivated reasoning. We see what we want to see. The reality is that the bigger teams are being gifted more goals than other teams, but they’re also being denied more. However, I suspect fans remember the calls that favor the bigger teams, and forget the calls that don’t align with their view. This is pretty common behavior, and is certainly not exclusive to sports. Partisans will carefully pick and choose their opinions based on the political party position, and will even change their view when they realize the party position is different to their own. We’re all bad at being neutral. Research has even found that being smarter can’t save you from human nature. Smarter partisans are just better at motivated reasoning. We’re all at it!
It turns out the earth isn’t flat. I’m sorry to have done this to you.
References
Boyko, Boyko, & Boyko (2007) – Referee Bias Contributes to Home Advantage in English Premiership Football
Dohmen & Sauermann (2016) – Referee Bias
Garicano et al (2005) – Favoritism Under Social Pressure
Pettersson-Lidbom & Priks (2010) – Behavior Under Social Pressure: Empty Italian Stadiums and Referee Bias
Rickman & Witt (2008) – Favouritism and Financial Incentives: A Natural Experiment
Sutter & Kocher (2004) – Favoritism of agents – The case of referees’ home bias
Soares & Shamir (2016) – Quantitative Analysis of Penalty Kicks and Yellow Card Referee Decisions in Soccer