Two more of David's articles from his 2023 blog. These involved some heavy duty math, which was normal for him, but pretty much Greek to me.
Article 3 (written 5/23/23): How to Implement the Metropolis-Hastings Algorithm
When I create models for ranking sports teams and predicting outcomes of games, one of my favorite methods for fitting the model is the Metropolis-Hastings algorithm. I learned the MH algorithm back in college, when I used it on a project looking at star-forming regions. In hindsight, using MH was completely unnecessary for that work (fitting a line to a scatter plot), but I have since used it for a variety of more complicated problems.
So, what is the Metropolis-Hastings Algorithm?
In simplest terms, the Metropolis-Hastings algorithm is a way to fit a model to data. Rather than just providing the best-fit values for the model, it creates a sample of the parameter space that can be used to estimate the distribution of the parameters and the correlations between them. This is one of the main reasons to use it over a simpler method, and it is particularly useful when the parameters are not normally distributed.
The Metropolis-Hastings algorithm is an example of a Markov Chain Monte Carlo (MCMC) technique. MH samples the parameter space by starting at some location in that space (called a state) and then jumping to nearby locations. This sounds like a random walk, but unlike a true random walk, the MH algorithm randomly selects a new state and then rejects it if the new parameter values fit the data significantly worse than the current values (I will give the exact criterion for rejecting a jump in the next section). The result is that, as the algorithm performs this semi-random walk, it tends to prefer states that provide better fits to the data, while still allowing lower-likelihood states to be sampled.
How Does the Metropolis-Hastings Algorithm Work?
Say you have some data and you want to fit a model to it that has multiple parameters, maybe even lots and lots of parameters. The model will calculate the likelihood of the data given a choice of parameter values (a state).
One of the other reasons to use the MH algorithm is because the model contains a large number of parameters. If you are fitting a line where you only have two parameters (slope and intercept), then you can easily just do a grid search of the parameter space. However, if you have dozens or even hundreds of parameters, then it is no longer feasible to search the entire parameter space that way. The MH algorithm does not have this limitation.
The MH algorithm is iterative. Each iteration, it starts at a state in parameter space, proposes a new state to jump to, and either accepts or rejects that new state. The new state is drawn at random from a proposal function centered on the current state.
A typical choice for the proposal function is a Gaussian since it is symmetric and easy to implement, but there are some fancier choices that can make the algorithm more efficient.
After the proposal function creates a new state, the algorithm has to determine whether to accept or reject it. In general, the algorithm will accept new proposals if they provide a better fit to the data, but it won't necessarily reject a proposal if the fit is worse. The decision is based on an acceptance ratio: the ratio of the new state's likelihood to the current state's likelihood, multiplied by a correction factor that accounts for any asymmetry in the proposal function (the probability of proposing the reverse jump divided by the probability of proposing the jump that was actually made). If the proposal function is symmetric, the correction factor is always 1. In fact, that correction factor is the Hastings part of the algorithm; without it, we just have the Metropolis algorithm.
The criterion for accepting or rejecting the proposed state is whether this acceptance ratio is greater than or less than a uniform random number between 0 and 1, which I will call u.
Put more simply, if the new state has a higher likelihood than the old one, the new state is accepted and used as the starting point for the next iteration. If the new state has a lower likelihood, it can still be accepted if the ratio is larger than u. If it isn't, the new state is rejected and the next iteration starts from the previous state. And that's it. The algorithm repeats this process, collecting samples until there are enough to estimate the distribution of the parameters. Seems easy, right? There are a couple of things to be careful about.
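The loop described above can be written in just a few lines. This is a generic sketch, not code from the article: the Gaussian proposal width and the toy target in the usage note are my own choices, and the likelihood is handled in log space to avoid numerical underflow (accepting when log u < log L' − log L is equivalent to u < L'/L).

```python
import numpy as np

def metropolis_hastings(log_likelihood, initial_state, n_samples,
                        proposal_scale=1.0, rng=None):
    """Minimal Metropolis sampler with a symmetric Gaussian proposal.

    Because the Gaussian proposal is symmetric, the Hastings correction
    factor is 1 and only the likelihood ratio matters.
    """
    rng = np.random.default_rng() if rng is None else rng
    state = np.atleast_1d(np.asarray(initial_state, dtype=float))
    current_ll = log_likelihood(state)
    samples = np.empty((n_samples, state.size))
    for i in range(n_samples):
        # Propose a jump drawn from a Gaussian centered on the current state.
        proposal = state + rng.normal(scale=proposal_scale, size=state.size)
        proposal_ll = log_likelihood(proposal)
        # Accept if the likelihood ratio beats a uniform random number u.
        if np.log(rng.uniform()) < proposal_ll - current_ll:
            state, current_ll = proposal, proposal_ll
        # Record the state whether or not the jump was accepted.
        samples[i] = state
    return samples
```

As a toy usage example, feeding it the log-likelihood of a Gaussian fit to a handful of data points, e.g. `lambda m: -0.5 * np.sum((data - m) ** 2)`, produces samples whose mean approaches the mean of `data`.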
Optimizing the Acceptance Rate
The choice of the proposal function has a major effect on the efficiency of the algorithm. If the proposal function proposes states that are too far away, it will tend to propose too many low-likelihood states, leading to a very high rejection rate. The algorithm will get stuck in the same place for many iterations in a row, causing the random walk to sample very slowly. If the proposal function proposes very small jumps, then the acceptance rate will be high (close to 100%), but it will still sample the space slowly simply because the jumps are small. There is an optimal middle ground where the jumps are large enough to get around the parameter space without reducing the acceptance rate too much. Studies have shown that, as the number of parameters grows, the optimal acceptance rate approaches about 23%. It is not important to hit this exact rate; accepting roughly 1 out of every 4 proposed jumps will lead to good sampling.
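One simple way to land near that acceptance rate is to nudge the proposal width between batches of iterations. This is a sketch of my own, not from the article: the `tune_scale` helper and the 0.9/1.1 nudge factors are arbitrary choices, and the tuning should only run during the burn-in, since adapting the proposal after sampling has started can bias the chain.

```python
def tune_scale(acceptance_rate, scale, target=0.25):
    """Adjust the Gaussian proposal width toward a target acceptance rate.

    Call this between batches of iterations during burn-in only.
    """
    if acceptance_rate < target:
        return scale * 0.9   # rejecting too often: take smaller jumps
    return scale * 1.1       # accepting too often: take bigger jumps
```

In use, you would run a few hundred iterations, measure the fraction of accepted jumps, update the scale, and repeat until the rate settles near the target.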
Autocorrelation
Even when sampling the parameter space most efficiently, the MH algorithm will still reject states more often than it accepts them. This means that the state may remain the same for several iterations. If the state is recorded every iteration, then nearby samples will be highly correlated rather than being independent draws from the parameter distribution. This correlation between samples is called autocorrelation. It can be avoided by not recording the state every iteration; in fact, it is common to record the state only every hundred or even every thousand iterations. This gives the algorithm enough time to move through the parameter space that the recorded samples are effectively independent.
Burn-in
The algorithm has to start at some state, and you could choose any state you like. You might not know ahead of time what a reasonable starting point is; ideally, you would choose a state with a high likelihood. The way around this is simply to let the algorithm run for a while without recording any samples. This initial period is called the burn-in. It is important to perform a burn-in because the states at the beginning won't necessarily reflect the underlying parameter distribution you are trying to sample. After the burn-in, the algorithm will have automatically gravitated toward a better starting point.
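Burn-in and thinning together amount to simple slicing of the recorded chain. An equivalent approach to recording only every Nth state is to record everything and slice afterwards; this sketch (the `thin_chain` helper and the specific numbers are my own illustration) does the latter:

```python
import numpy as np

def thin_chain(chain, burn_in, thin):
    """Drop the burn-in samples, then keep every `thin`-th state."""
    return np.asarray(chain)[burn_in::thin]
```

For example, a chain of 100,000 recorded states with a 10,000-iteration burn-in, thinned to every 100th state, leaves 900 roughly independent samples.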
Conclusion
This was a quick overview of how to implement the Metropolis-Hastings algorithm. While this description is hopefully enough for an interested person to write their own MH code, there is much more to learn like why the MH algorithm works, how to optimize the shape of the proposal function, and additions to the algorithm that make it more effective for complicated models and parameter distributions.
At some point, I hope to post about some of the models I have created for college basketball, college football, and professional soccer which showcase the Metropolis-Hastings algorithm.
Article 4 (written 6/10/23): A Predictive Soccer Model
I am a huge nerd. I am also a huge sports fan. One of my favorite hobbies is to be both at the same time by creating models to assess the performance of teams and predict the outcomes of future games. In this post, I am going to describe a simple model for predicting soccer goals. In future posts, I will add some more complexity to the model. Also, I am going to be using the word “soccer” instead of “football” because I live in the United States. If this bothers you, you can send your complaints to my Twitter* account.
*I am not on Twitter
For this post, I will be applying the model to the 2022-23 English Premier League season, even if that might be a little painful for me (I am an Arsenal fan).
The Model
A common practice when modeling the number of goals scored in a soccer match is to use a Poisson distribution. Rather than assuming a single expected value for each team, I assign an offensive coefficient, a, and a defensive coefficient, d, to each team. A team's expected number of goals in a match is the product of its own offensive coefficient and its opponent's defensive coefficient.
The model also includes a parameter for home advantage, h. The expected value is multiplied by h for the home team’s expected goals and is divided by h for the away team’s expected goals. Typically, h is a number slightly greater than 1.
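The expected values and the resulting scoreline probabilities can be sketched in a few lines. The coefficient values in the usage note below are made up for illustration, not fitted values from the article:

```python
from math import exp, factorial

def expected_goals(a_home, d_away, a_away, d_home, h):
    """Expected goals for each side: offense times the opposing defense,
    with the home side multiplied by h and the away side divided by h."""
    return a_home * d_away * h, a_away * d_home / h

def poisson_pmf(k, lam):
    """Poisson probability of scoring exactly k goals given expected value lam."""
    return lam ** k * exp(-lam) / factorial(k)
```

For example, with made-up coefficients `expected_goals(1.8, 0.9, 1.2, 0.7, 1.1)`, the probability of a 2-1 home win is `poisson_pmf(2, lam_home) * poisson_pmf(1, lam_away)`.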
One way to obtain values for the model parameters (offense, defense, home advantage) is maximum likelihood estimation (MLE). This means I need to maximize the likelihood: the product of the Poisson probabilities of the observed scores over every game in the season.
Finding the parameter values that maximize this likelihood requires taking a derivative with respect to each individual parameter, setting the derivative to 0, and solving. The problem is that the likelihood is made up of many, MANY terms being multiplied, so taking the derivative would mean using the product rule about a billion times and then dealing with a somewhat unwieldy function. A nice way around this is to take the log of both sides. This takes advantage of a nice property of logs: the log of a product equals the sum of the logs.
This property works with any logarithm. [Bob Ross voice]: I am partial to the natural log, but you can use any logarithm you like.
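Written as code, the log-likelihood is then just a plain sum over matches. A sketch of my own (the tuple-based `matches` format and the dictionary coefficients are my choices, not the article's):

```python
from math import lgamma, log

def log_likelihood(matches, a, d, h):
    """Sum of log Poisson probabilities over all matches.

    `matches` is a list of (home, away, home_goals, away_goals) tuples;
    `a` and `d` map team names to offensive/defensive coefficients.
    """
    total = 0.0
    for home, away, hg, ag in matches:
        for goals, lam in ((hg, a[home] * d[away] * h),
                           (ag, a[away] * d[home] / h)):
            # log of the Poisson pmf: g*ln(lam) - lam - ln(g!)
            total += goals * log(lam) - lam - lgamma(goals + 1)
    return total
```

Because the terms add rather than multiply, each term can be differentiated independently, which is exactly what makes the next step tractable.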
Now taking a derivative is a little easier since terms are being added rather than multiplied. Taking the derivative of the log-likelihood with respect to team k's offensive coefficient, a[k], and setting it to 0 gives

(total goals scored by team k) = Σ over team k's matches of a[k] × d[opponent] × h^(±1)

where a Kronecker delta in the full expression just selects the matches featuring team k, and h^(±1) means h when team k is at home and 1/h when they are away. A similar equation exists for each team, and taking the derivative with respect to the defensive coefficients results in a very similar set of equations.
The left side of the equation is just the total number of goals scored by team k. The right side of the equation is the total number of goals predicted by the model. This shows that the MLE parameter values will be those that match the actual number of goals scored by each team. If we went through the same steps for the defensive parameters, we would find that the MLE values of those parameters would match the actual number of goals allowed by each team.
We can go through the same hullabaloo with the home advantage parameter. If we take the derivative of the log-likelihood with respect to h, set it to 0, and rearrange the equation, we get

(total home goals) − (total away goals) = h × Σ(a[home] × d[away]) − (1/h) × Σ(a[away] × d[home])
The left side is the total home goals minus the total away goals. The right side is the predicted home goals minus the predicted away goals. The MLE value of h will make the real and predicted values of home minus away goals the same.
Finding Coefficients Iteratively
The English Premier League contains 20 teams. Since each team has two coefficients (offense and defense) and there is also a home advantage parameter, that means there are 41 parameters in the model. We want to find the values of each parameter so that the predicted number of goals scored and allowed by each team matches reality. This might seem daunting because we are solving a system of 41 equations with 41 variables and each equation includes both offensive and defensive coefficients. This means that a depends on d and vice versa.
Rather than actually solving the 41 equations simultaneously (which you could hypothetically do), I find the solution iteratively. I start with a guess for each parameter (the guess doesn't affect the answer, only how long it takes to get there), then solve for the offensive coefficients given the current values of the other coefficients. This particular model makes each of those one-parameter solves especially easy because the predicted number of goals is directly proportional to both the offensive and defensive coefficients: if a[k] is doubled, then the predicted number of goals for team k is also doubled.
As an example, let’s look at Arsenal (and try to forget how they blew their chance to win the league for the first time in almost two decades). Arsenal scored 88 goals throughout the season. If we use our initial guess of parameter values to predict the goals Arsenal will score, we get 71. We can make the predicted number of goals equal exactly 88 if we multiply Arsenal’s offensive coefficient by 88/71. We would also multiply each other team’s offensive coefficients by their (actual goals scored) / (predicted goals scored). After doing this, each team’s predicted goals scored will match the real goals scored. We would then do the same for the defensive coefficients. After multiplying each defensive coefficient by (actual goals allowed) / (predicted goals allowed), each team’s actual and predicted goals allowed will match.
We have to continue adjusting the parameters because when we adjust the defensive parameters of each team, the predicted goals scored by each team changes. This necessitates adjusting the offensive coefficients again, which changes the predicted goals allowed. If we alternate between adjusting the offensive and defensive coefficients, eventually the values converge to the MLE solution. In practice, we also need to adjust the home advantage parameter each iteration.
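The alternating rescaling described above can be sketched as follows. This is my own simplified version, not the article's code: it omits the home-advantage parameter for brevity, so each team's expected goals in a match are just its offensive coefficient times the opponent's defensive coefficient.

```python
def fit_coefficients(matches, teams, n_iter=200):
    """Alternately rescale coefficients until predicted goals match actual.

    `matches` is a list of (home, away, home_goals, away_goals) tuples.
    Each pass multiplies every coefficient by (actual / predicted) goals,
    exactly the 88/71-style adjustment described in the text.
    """
    a = {t: 1.0 for t in teams}   # offensive coefficients
    d = {t: 1.0 for t in teams}   # defensive coefficients
    for _ in range(n_iter):
        # scored=True adjusts offense (goals scored);
        # scored=False adjusts defense (goals allowed).
        for coeffs, scored in ((a, True), (d, False)):
            actual = {t: 0.0 for t in teams}
            predicted = {t: 0.0 for t in teams}
            for home, away, hg, ag in matches:
                actual[home] += hg if scored else ag
                actual[away] += ag if scored else hg
                predicted[home] += a[home] * d[away] if scored else a[away] * d[home]
                predicted[away] += a[away] * d[home] if scored else a[home] * d[away]
            for t in teams:
                coeffs[t] *= actual[t] / predicted[t]
    return a, d
```

After enough alternating passes, each team's predicted goals scored and allowed match the actual totals, which is the MLE condition derived earlier.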
In practice, I find that it takes on the order of tens of iterations to converge to the MLE solution for a league of 20 teams.
Results
After leading the league for most of the season, Arsenal slumped down the stretch and ended up losing the title by 5 points. It felt like the most Arsenal thing for them to do. ESPN even published an article on March 30 about how Arsenal fans were terrified of their team blowing the title race. Less than two weeks later, Arsenal began a stretch of four games that blew the title race. In two consecutive away matches (Liverpool and West Ham), the Gunners blew a 2-goal lead and settled for a draw. Then they had to scrape their way to a 3-3 draw at home against bottom-of-the-table Southampton. Finally, Manchester City themselves dealt a nearly fatal blow by winning 4-1 against Arsenal. Even if Arsenal were outmatched by City, they still would have been champions if they didn’t drop 6 points in 3 very winnable matches.
Now that I have vented, let’s look at the results of the model. First, the offensive and defensive parameters.
According to these numbers, Manchester City was the best offensive team (having Erling Haaland on your team helps with that). The best defensive team was Newcastle United (smaller defensive numbers = better defense).
An interesting feature of the model that I haven’t mentioned yet is that there isn’t actually just one MLE solution. If all the offensive coefficients were multiplied by the same factor and all the defensive coefficients were divided by that factor, then none of the model’s predictions would change. The results were scaled so that the averages of the offensive and defensive coefficients were the same.
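That rescaling step can be sketched in a couple of lines (a helper of my own, not from the article):

```python
from math import sqrt

def normalize(a, d):
    """Rescale so the league-average offensive and defensive coefficients
    are equal. Multiplying every a by c and dividing every d by c leaves
    every product a*d (and so every prediction) unchanged, so this just
    picks one member of the family of equivalent MLE solutions."""
    c = sqrt((sum(d.values()) / len(d)) / (sum(a.values()) / len(a)))
    return ({t: v * c for t, v in a.items()},
            {t: v / c for t, v in d.items()})
```

Any other convention, such as fixing one team's offensive coefficient to 1, would work equally well.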
Manchester United: The red team from Manchester finished third in the league, but the model actually has them as the sixth best team. They ended up with 75 points, whereas the model says they should have expected about 62.4. The team that finished one place behind them, Newcastle United, scored 10 more goals and allowed 10 fewer goals, so the model has them about 12.5 points better than Man U. Newcastle's issue is that they simply drew too many games: they ended up with 14 draws, tied for the most in the league.
Leicester City: The Foxes really struggled throughout the season and ended up being relegated. Even more painfully for their fans, the model shows that they probably should have been able to stay up. The model had Leicester City on 43.82 points, good enough for 14th place. They had a better goal difference than 4 of the 5 teams above them in the table, but weren’t able to make the most of their abilities. In fact, Leeds United was also relegated despite the model having them safely outside the bottom three. In addition to Southampton at the bottom, the model had Bournemouth and Wolverhampton Wanderers being relegated.
I am going to look more closely at how well this model fit the actual results. I am also going to look at some other parameters that can make the model more accurate. In particular, the model presented here assumes that the probability of scoring a goal is constant throughout a match. Intuitively, we expect that to not be the case; we would expect a team that is trailing to adjust their tactics to emphasize scoring a goal.
Yeah, pretty much Greek to me, but not to him. Norm Schenck, David's dad