Monday, 28 January 2013

Football prediction: Simulating the season so far

I've been working for a while on an agent-based simulation of football matches, to see how close it can get to predicting the real results of Premier League games.

Last week, I explained roughly what the model is and how it works.

This week: is the model any good? I'll be honest, I may have cherry-picked that Chelsea result a little for the last post, but this will be a warts-and-all picture of how well the model predicts, or rather how it would have predicted the season so far (with one important caveat that we'll come to in a while).

One last thing before we get stuck in... The most obvious use of a model like this (if it works) is to gamble based on its predictions, but I'm building it more for the technical challenge of seeing if I can do it. What really interests me is using a model like this as a scenario planning tool - what would happen to your season if you signed player x? Or if player y got injured? If the model can be made to work, you could run 'what ifs' and work out the value of players in terms of expected points added to the team's total across a season.

Back to betting: I might have a punt, but I'm not really a gambler. Bet on it if you like (sensibly!) as I start to predict games on Wallpapering Fog, and don't forget to add a comment to let us know how you got on. I'll talk about odds a fair bit below, because they're an obvious source of another prediction to compare with the model. Having said that, if you'd bet on every game so far this season using the model's predictions - up to 20th January 2013 - you'd have just about broken even. Improvements to the model from here could make it profitable. Got your attention? Here we go.

What I've been doing this weekend is building some code to run a whole series of games in succession, not just one at a time. Then I fed in the fixtures, starting line-ups and player statistics for each game this season, up to the 20th January, using data from EPL Index. The model simulates each individual game 100 times to get an overall predicted likelihood of home win, away win or draw, plus the most likely scoreline.
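To make the "simulate 100 times and count" step concrete, here's a minimal sketch in Python. It is not the real model: `simulate_match` is a made-up stand-in that gives each side a small chance of scoring in every minute, where the actual sim is agent-based and driven by player statistics. The function names and the goal rates are assumptions for illustration only.

```python
import random
from collections import Counter

def simulate_match(home_rate, away_rate, rng):
    """Hypothetical stand-in for one run of the match sim.

    Each side gets a small chance of scoring in each of 90 minutes, so
    home_rate and away_rate are roughly 'expected goals per game'.
    """
    home_goals = sum(rng.random() < home_rate / 90 for _ in range(90))
    away_goals = sum(rng.random() < away_rate / 90 for _ in range(90))
    return home_goals, away_goals

def predict_match(home_rate, away_rate, n_runs=100, seed=0):
    """Run the match n_runs times; return outcome shares and the modal scoreline."""
    rng = random.Random(seed)
    outcomes = Counter()
    scorelines = Counter()
    for _ in range(n_runs):
        home_goals, away_goals = simulate_match(home_rate, away_rate, rng)
        scorelines[(home_goals, away_goals)] += 1
        if home_goals > away_goals:
            outcomes["home"] += 1
        elif home_goals < away_goals:
            outcomes["away"] += 1
        else:
            outcomes["draw"] += 1
    probs = {k: outcomes[k] / n_runs for k in ("home", "draw", "away")}
    most_likely_score = scorelines.most_common(1)[0][0]
    return probs, most_likely_score

# Made-up rates, purely to show the call.
probs, score = predict_match(home_rate=1.6, away_rate=1.1)
print(probs, score)
```

With only 100 runs per game the percentages are fairly noisy, so a 40% versus 38% split shouldn't be read as a confident call.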

Remember that caveat I mentioned? Here it is. I'm simulating each player, using their average performance across the whole season so far, which isn't strictly fair. When Fulham played Norwich on the first day of the season, I wouldn't actually have had any 2012/2013 data to feed into the model at the time - only the previous season's numbers. It's something else on the long list of development tasks that need dealing with...

Here are the predictions anyway. Correct calls in green.

Overall, the model calls 50% of results correctly, on the criterion that the team it gave the best chance of winning ran out as winners in real life.
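For anyone who wants that criterion spelled out, here's a small sketch of the scoring. The three predictions and results are invented to show the mechanics, they are not games from the table, and the function names are my own.

```python
def predicted_outcome(probs):
    """Pick the outcome ('home', 'draw' or 'away') with the highest probability."""
    return max(probs, key=probs.get)

def accuracy(predictions, results):
    """Share of games where the most likely predicted outcome actually happened."""
    correct = sum(predicted_outcome(p) == r for p, r in zip(predictions, results))
    return correct / len(results)

# Invented example data, not real predictions.
predictions = [
    {"home": 0.55, "draw": 0.25, "away": 0.20},
    {"home": 0.30, "draw": 0.28, "away": 0.42},
    {"home": 0.48, "draw": 0.30, "away": 0.22},
]
results = ["home", "draw", "home"]

print(accuracy(predictions, results))  # 2 of 3 calls correct
```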

I was initially a bit disappointed with that. Only 50%? I was hoping for more.

Then I had a look to see how often the bookies get it right. No doubt this will be incredibly obvious to some readers, but as I said I'm not a gambler. How often did the bookies' favourite win those games? 51%. (odds from

Suddenly 50% doesn't seem all that bad!

A big part of the error comes from draws, both in my model and in the bookies' odds. A draw is almost never the most likely result of a single game, but overall, around 30% of games will end in a draw. My model only called one game as having a draw as the overall most likely outcome - Aston Villa vs. Stoke.

When you simulate game-by-game and pick the most likely result, you'll predict almost zero draws, which means you'll be wrong on roughly 30% of games before you even start. Predicting a season where you only simulate each game once would give a 'normal' number of draws, but each individual game's prediction would be much less accurate. It's swings and roundabouts, depending on whether we're trying to predict final league placings, or the result of a single game.
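A bit of arithmetic shows the size of that trade-off. Assume, purely for illustration, that a game's true chances are 45% home win, 28% draw and 27% away win, and that the model knows them exactly. Always calling the favourite is right 45% of the time. Running the game once and calling whatever comes out is only right when the simulated and real results happen to match, which is the sum of the squared probabilities, about 35% here.

```python
def expected_accuracy_most_likely(probs):
    """Always call the most likely outcome: right as often as that outcome occurs."""
    return max(probs.values())

def expected_accuracy_single_run(probs):
    """Call the result of one simulated game: right only when the simulated and
    real outcomes coincide, which happens with probability sum(p * p)."""
    return sum(p * p for p in probs.values())

# Illustrative probabilities only, not output from the model.
probs = {"home": 0.45, "draw": 0.28, "away": 0.27}

print(expected_accuracy_most_likely(probs))  # 0.45
print(round(expected_accuracy_single_run(probs), 4))  # 0.3538
```

The single-run approach does call draws at a realistic rate, though, which is what matters when adding up a league table.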

If you'd bet on the model's prediction for every game, using the same stake, at Bet365's odds, you'd have lost 3% of your money so far. If you'd taken the best odds available in the market each time, not just Bet365's, you'd actually be up 1%. That's not a disaster for a first effort! At least we're not ruined.
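Here's roughly how a flat-stake return like that can be worked out. I'm assuming decimal odds (so 2.50 returns 2.50 for every 1 staked, stake included) and measuring profit against the total amount staked. The four bets are invented for the example.

```python
def flat_stake_return(bets, stake=1.0):
    """Profit or loss as a fraction of the total staked.

    bets is a list of (decimal_odds, won) pairs, one per game, where won is
    True if the outcome we backed actually happened.
    """
    total_staked = 0.0
    total_returned = 0.0
    for decimal_odds, won in bets:
        total_staked += stake
        if won:
            total_returned += stake * decimal_odds
    return (total_returned - total_staked) / total_staked

# Invented bets: two winners at 2.50 and 1.80, two losers.
bets = [(2.50, True), (2.00, False), (1.80, True), (3.00, False)]
print(round(flat_stake_return(bets), 3))  # 0.075, i.e. up 7.5%
```

Swapping one bookmaker's odds for the best price in the market only changes the `decimal_odds` column, which is how the same set of predictions can sit either side of break-even.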

Let's see what that accuracy looks like. I tweeted this one over the weekend; it shows the ten-game rolling average prediction accuracy for the model and also for the bookies.

When the shaded area between the lines is red, the bookies are predicting more results correctly than the model, over the previous ten games. When it's green, the model is out-performing the bookies.
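The rolling figure is simple to compute once each game is marked as a correct (1) or incorrect (0) call. A minimal sketch, with an invented run of results:

```python
def rolling_accuracy(correct_flags, window=10):
    """Accuracy over the previous `window` games, starting once enough games exist."""
    values = []
    for end in range(window, len(correct_flags) + 1):
        recent = correct_flags[end - window:end]
        values.append(sum(recent) / window)
    return values

# Invented flags: 1 = called correctly, 0 = called wrongly.
model_flags = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(rolling_accuracy(model_flags))  # [0.5, 0.5, 0.5]
```

Running the same function over the bookies' flags and subtracting one series from the other gives the gap that the shading shows.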

It's interesting that a couple of weeks into the season, accuracy for the model and for the bookies plummets to just 10-20%. It may be that the early season is harder to predict - we'll need to run a few more seasons to find out. That period certainly dents any hopes of winning a fortune, as the bookmakers do slightly better than the model even though both are doing badly.

Here's the same data, cumulatively. The chart shows total accuracy across the season so far, with 20th January on the far right-hand side.
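The cumulative version is just correct calls so far divided by games so far. A quick sketch, again with invented flags:

```python
from itertools import accumulate

def cumulative_accuracy(correct_flags):
    """Running share of correct calls after each game (1 = correct, 0 = wrong)."""
    running_correct = accumulate(correct_flags)
    return [correct / games for games, correct in enumerate(running_correct, start=1)]

# Invented flags for four games.
print(cumulative_accuracy([1, 0, 1, 1]))  # ends at 0.75 after four games
```

Early values swing wildly because they rest on only a handful of games, so the left-hand side of a chart like this deserves less attention than the right.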

You can see that the bookies' favourites have won more often than the model's across the whole season so far, ending with just the one percentage point gap that I described earlier, 51% to 50%.

What's very interesting to me is that the model looks like it's improving slowly across the season and closing the gap; the games may be becoming more predictable as the season goes on. I'm not jumping to that conclusion just yet, but I'm certainly going to keep an eye on it.

If you'd bet using the model only since the New Year, at best market odds, you'd currently be up 19% on your original stake.

Let's finish with the next big improvement for the model, or at least what I hope will be a significant improvement in predictive power. I'm seeing these developments as positives, even though there's a fair bit of work involved in building them, because our simple view of the world is doing pretty reasonably and there's huge scope for improvement from here.

At the moment, once a team has the ball in the sim, the opposition can't win it back; the team in possession can only lose it. This is fine in a game against an 'average' team, but taking Arsenal as an example, their passing accuracy was 90% against Sunderland and 81% against Manchester City. I've got no doubt that Man City caused that to happen by harrying their opponents, so the model needs to account for it.
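To see why that nine-point drop matters so much, here's a toy version of possession. It assumes (my simplification, not necessarily how the sim works internally) that a spell of possession is a chain of passes, each completed with the team's passing accuracy, and that the ball is lost at the first failed pass. Under that assumption the expected number of completed passes per spell is p / (1 - p).

```python
import random

def passes_before_losing_ball(pass_accuracy, rng):
    """Count completed passes until the first failed one (toy possession spell)."""
    completed = 0
    while rng.random() < pass_accuracy:
        completed += 1
    return completed

def expected_passes(pass_accuracy):
    """Mean completed passes per spell under the toy assumption: p / (1 - p)."""
    return pass_accuracy / (1 - pass_accuracy)

rng = random.Random(2013)
for accuracy in (0.90, 0.81):  # Arsenal's figures against Sunderland and Man City
    spells = [passes_before_losing_ball(accuracy, rng) for _ in range(20000)]
    simulated = sum(spells) / len(spells)
    print(accuracy, round(expected_passes(accuracy), 2), round(simulated, 2))
```

At 90% a spell lasts 9 completed passes on average; at 81% it lasts about 4.3. A team-wide average that ignores the opponent would badly overstate how long Arsenal keep the ball against Man City.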

Even more important, it's not every player's passing accuracy that drops against higher quality opponents - some will cope much better than others. We need a way to predict what each individual player's passing accuracy will be, against this week's opponents. I'm working on it.

Stay tuned for predictions for Tuesday's games!
