Monday, 21 January 2013

Simulating football matches: An experiment.

Prediction. The holy grail of analysis. Diagnostics are good and they help us to understand the past, but if we can't use that work to get better at predicting the future, then isn't it all a bit pointless? If your theory doesn't predict the future, then it isn't science. Unlike football punditry, you don't get meteorologists diagnosing that last weekend's weather was rubbish because "the Sun must have had an off day". Scientists test their theories through prediction.

In my day job, we predict advertising. You've run adverts for your business before, so we measure the effect of those and then we can tell you how much you'll sell in the future.

Advertising's mostly quite dull. Can we do it for football?

Since Moneyball, a lot of people have been paying close attention to football statistics, writing up analyses and discovering relationships in OPTA's football data. Man City are even trying to tap into those amateur insights through the MCFC Analytics Project.

I couldn't resist diving in, so armed with a subscription to EPL Index (four quid a month for player stats for each individual game? Yes, please) I've been running a little side project and playing with the data.

If you want to predict football, there are a few ways you could go about it and I did have a crack at this project years ago, but with only very top-line data on past game results. You can build a regression model (a bit like the advertising models I build for a living) that uses each team's form to predict the outcome of the next game.

It works something like this...

Predicted result = f(Home Team form, Away Team form, plus lots of other things like goal-scoring and conceding rates...)

All weighted for the quality of opposition over the past few games.
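A top-line model of this shape can be sketched in a few lines. To be clear, everything here is illustrative: the softmax link and the `home_advantage` constant are arbitrary assumptions for the sketch, not the regression I actually built.

```python
import math

def predict_outcome(home_form, away_form, home_advantage=0.3):
    """Toy top-line model: map a form difference to home/draw/away
    probabilities via a softmax. The link function and the
    home_advantage constant are arbitrary choices, purely for
    illustration."""
    diff = home_form - away_form + home_advantage
    scores = {"home": diff, "draw": 0.0, "away": -diff}
    exp = {k: math.exp(v) for k, v in scores.items()}
    total = sum(exp.values())
    return {k: v / total for k, v in exp.items()}
```

Feed it a bigger form gap and the favourite's probability climbs, which is exactly the "Man U usually beat Wigan" behaviour described below, and exactly as uninformative.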

This type of model kind of works. Basically, it will predict things like Man U should beat Wigan, but we already know that. The model I built a few years ago didn't do any better than my own guesses and that's not tremendously useful.

Top-line models like this also have a massive issue in that there are simply too many variables that need to be accounted for. What are the chances that Man U beat Wigan if Van Persie's injured? Our model based purely on past form will struggle with that, especially if he's played all season and we've got no experience with him not present (and just as important, a different player playing) in the team.

You also get very little in the way of explanation with top-line models like this. Man U will win because they usually win and Wigan will lose because (sorry Wigan fans) they usually lose. What can Wigan do about that? Well obviously they need to improve their form. Thanks, Mr. Consultant, you're fired.

Long story short, we need a different technique and the one I've been using is called Agent Based Modelling (ABM).

ABM simulates the world from the bottom up rather than the top down, which in football means simulating the players rather than the result. We set up an artificial game - using real world OPTA statistics about the performance of individual players - and we run the game to find a predicted result. The result is an outcome of the simulation that we can't control directly.

If you're thinking, "Is he trying to build Football Manager using OPTA data?", that's basically the size of it, yes. Told you it was more fun than advertising.

Inside the model, you kick off the game and from then on, it's all down to the simulation. The player with the ball will make a decision, based on what they do most often in real games - pass, shoot, dribble... each decision is randomly generated, but weighted towards the probability of what that player does most often in real life, based on the OPTA data.

If they choose to pass, the simulation checks for a successful pass and then works out who the ball went to, again a randomly generated choice but weighted by real data. It's the same if they shoot, when we work out the chances that their shot went in. If they lose the ball, it transfers to a player on the opposition, again determined by a weighting of... you get the idea. Then the whole thing starts again with the player who has the ball now.

We play the game through, with players passing, shooting, losing the ball etc. and we get a result, which is our prediction for the match.
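The decision loop above can be sketched like this. Every rate below is invented for illustration (in the model proper they come from the OPTA data), and the real thing obviously tracks full squads and far more event types than three actions across four players.

```python
import random

# All rates are made up for illustration; in the real model they come
# from OPTA stats for each individual player.
PLAYERS = {
    "Mata":    {"team": "Chelsea", "pass": 0.85, "shoot": 0.06, "dribble": 0.09,
                "pass_success": 0.88, "shot_conversion": 0.11},
    "Torres":  {"team": "Chelsea", "pass": 0.70, "shoot": 0.20, "dribble": 0.10,
                "pass_success": 0.80, "shot_conversion": 0.13},
    "Cazorla": {"team": "Arsenal", "pass": 0.87, "shoot": 0.07, "dribble": 0.06,
                "pass_success": 0.89, "shot_conversion": 0.10},
    "Giroud":  {"team": "Arsenal", "pass": 0.72, "shoot": 0.18, "dribble": 0.10,
                "pass_success": 0.78, "shot_conversion": 0.14},
}

def teammates(name):
    team = PLAYERS[name]["team"]
    return [p for p in PLAYERS if p != name and PLAYERS[p]["team"] == team]

def opponents(name):
    team = PLAYERS[name]["team"]
    return [p for p in PLAYERS if PLAYERS[p]["team"] != team]

def simulate_match(n_events=1000, rng=random):
    """Play one artificial game as a chain of on-the-ball events."""
    goals = {"Chelsea": 0, "Arsenal": 0}
    holder = rng.choice(list(PLAYERS))
    for _ in range(n_events):
        p = PLAYERS[holder]
        # Weighted random decision, biased towards what this player
        # does most often.
        action = rng.choices(["pass", "shoot", "dribble"],
                             weights=[p["pass"], p["shoot"], p["dribble"]])[0]
        if action == "pass":
            if rng.random() < p["pass_success"]:
                holder = rng.choice(teammates(holder))   # completed pass
            else:
                holder = rng.choice(opponents(holder))   # turnover
        elif action == "shoot":
            if rng.random() < p["shot_conversion"]:
                goals[p["team"]] += 1
            holder = rng.choice(opponents(holder))       # keeper / restart
        # dribble: keeps the ball (success/failure omitted for brevity)
    return goals
```

One call gives one artificial game and one scoreline, and because every decision is a weighted coin-flip, two calls can give two different results.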

Now, you might say, "but there were loads of randomly generated decisions in the model. If we ran it twice we might get different results", and you'd be absolutely right. It's just like the real world and if the same teams play each other a few times, you can get a different result every time.

What we're after is the probability of winning for each team, so we run the match 1000 times (for now - it's a nice round number) and count up how many times each team wins.
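That tallying step is just a Monte Carlo loop. Here's a minimal self-contained sketch, with a stub `simulate_once` (made-up per-minute scoring rates, nothing to do with any real team) standing in for the full player-level simulation:

```python
import random
from collections import Counter

def simulate_once(rng):
    # Stub standing in for the full player-level simulation:
    # invented per-minute scoring rates over 90 minutes.
    home = sum(rng.random() < 0.018 for _ in range(90))
    away = sum(rng.random() < 0.013 for _ in range(90))
    return home, away

def win_probabilities(n_runs=1000, seed=42):
    """Run the match n_runs times and tally who wins."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_runs):
        home, away = simulate_once(rng)
        key = "home" if home > away else "away" if away > home else "draw"
        tally[key] += 1
    return {k: tally[k] / n_runs for k in ("home", "draw", "away")}
```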

After some teething problems (it wouldn't be interesting if it was too easy) the model's starting to turn up sensible results, and I promise I'll share its predictions for the next set of Premier League games (29th and 30th Jan). I'm not standing by those predictions yet, but publishing them will keep me honest and motivated to do the development in public. To an extent it already has: after tweeting on Saturday that Norwich might have more chance against Liverpool than the bookies thought, I watched them get battered 5-0. I now know why the model did that and it doesn't do it any more!

The model is a huge oversimplification of a real game but over time, it should help to teach us about what's really important. As a quick example, the model currently doesn't treat crosses any differently from other passes - they're a complete or incomplete pass and that's it. If they're a complete pass, then the player who receives the ball might shoot. But if we keep seeing that teams with traditional wingers win more games than the model would predict, then that might need sorting out.

I'll end for today with a bit about the Chelsea vs. Arsenal game on Sunday, to illustrate what you get from the model and the sorts of things that we might be able to do with it. Here are the teams (no subs yet by the way):

Run that one 1000 times and what happens?

Chelsea:    44%
Arsenal:    26%
Draw:       30%

So Chelsea are predicted to win 44% of the time. We get a scoreline from the model too and here are the 15 most likely, adding up to 94% of all results. An interesting outcome is that although we predict Chelsea to win overall, the single most likely result is 1-1.

The actual result was 2-1 to Chelsea, so the model got the winner right and 2-1 was our third most likely score. That looks potentially ok! We'll only find out if it really works by testing across a lot of games though.

Another way to see if the predictions are sensible is to compare with the bookies' odds. There's probably something wrong if we're not in the same ball-park as professionals taking advantage of the wisdom of their crowds of punters. The bookies had these odds (decimal odds, with implied percentage in brackets):

Chelsea:    1.83 (55%)
Arsenal:    3.5 (29%)
Draw:       3.25 (31%)

On an Arsenal win, or a draw, we're almost bang on the market odds. On Chelsea we're below, but bear in mind that the model's probabilities have to add up to 100%. The bookmakers' implied percentages add up to 114%, and that margin is why bookmakers make money.
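Converting decimal odds to implied percentages is just taking reciprocals, and the amount by which they exceed 100% is the bookmaker's margin (the overround):

```python
def implied_probability(decimal_odds):
    # A stake of 1 at decimal odds d returns d, so the
    # break-even probability is 1 / d.
    return 1.0 / decimal_odds

odds = {"Chelsea": 1.83, "Arsenal": 3.5, "Draw": 3.25}
implied = {team: implied_probability(d) for team, d in odds.items()}
overround = sum(implied.values())  # ~1.14, i.e. the 114% above
```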

Lastly, just as an example of what else you can do with this type of model, there were some raised eyebrows that Torres started for Chelsea instead of Demba Ba. What if they'd been switched and Ba had started the game? Let's run it another 1000 times...

Chelsea's chance of winning goes up by two percentage points to 46%.

Arsenal's chances go down, right? Wrong. This is why agent based models are good - they can show us things that we don't expect.

With Demba Ba starting for Chelsea, Arsenal's chances of winning actually go up by one point to 27% (to complete the picture, the chances of a draw drop three points to 27%). The balance of how often Ba receives the ball and how often he gives it away compared to Torres makes all the difference. All in all, the predictions barely move, but it's an interesting outcome. We'll see more of these.

OK, that will do for now. Predictions next week and maybe, just maybe, a cheeky punt based on our forecasts.

Oh and if you're running your own project like this, I'd LOVE to hear about it! Stay tuned, this is just the start.


Anonymous said...

Really interesting Neil and yes a lot more fun and interesting than advertising!!

Robert Weatherhead said...

I suppose the question is how do you translate this into a betting model. As you point out, the bookies know what they are doing so the odds are weighted.

Do you try to weight the stake and cover numerous outcomes, or do you apply a threshold of certainty at which a bet becomes worthwhile?

In this example you could take the Chelsea or draw as being a strong outcome. A five pound bet on each would produce a profit on the total £10 stake but leaves the outside chance of Arsenal win as a loss.

Scores are massively unpredictable (hence the highest % being 14) so can you turn this into a both teams to score, or over a certain number of goals prediction? and then can you weight the stake to mean most outcomes are positive?

Neil C said...

If you trusted the model, you'd probably go with something like a threshold, where the model's predicted likelihood of an event is above the probability implied by the bookies' odds.
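A minimal sketch of that threshold rule (the 0.05 safety margin is an arbitrary choice, not a recommendation):

```python
def is_value_bet(model_prob, decimal_odds, margin=0.05):
    """Back an outcome only when the model's probability beats the
    bookies' implied probability (1 / decimal odds) by a safety
    margin."""
    return model_prob > 1.0 / decimal_odds + margin
```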

Trouble at the moment is that when the model strongly disagrees with the bookies, I'd tend to think it's because there's something wrong with the model...

Thinking that for betting it might well have more success simulating large chunks of a season to predict league placings - a lot of the individual game errors should (hopefully) average out.

Tom said...

Maybe use the odds from a betting exchange, where the market prices are set by individual punters; the overround usually drops from around 14% to about 3%. If your percentage chance is higher than the betfair/betdaq price implies, then you have value.

Unknown said...

"after tweeting on Saturday that Norwich might have more chance against Liverpool than the bookies thought and then watching them get battered 5-0. I now know why the model did that and it doesn't do it any more!"

Why did the model do that? Fwiw I also thought Norwich were overpriced and backed them accordingly.

Neil C said...

Few reasons for the Liverpool v Norwich thing...

The shot to goal calibration was wrong, especially for Norwich - it gave them far too high a chance of scoring. Also, it wasn't weighted enough towards current form. For most games I've run it hasn't mattered all that much, but this one really did and using player stats that included 2011 / 2012 data gave Norwich a lot more chance of winning. Suarez might be a big part of that but I haven't had time to check.

And last one (that wasn't so important) - I was predicting before the teams were announced and had Skrtel starting for Liverpool and Sturridge benched.

I'm still with you that 10/1 was maybe worth a punt though.

dickie said...

I would like to discuss this further over a pint, I have a vested interest in modelling, I am findable on facebook.

Neil C said...

Hmmm, cryptic! I might need another clue...

Unknown said...

I backed Norwich before realising Hoolahan, Pilkington and Bassong weren't playing. Moral of story: wait for team news before taking plunge.

Anonymous said...

Hello, this simulation stuff is very interesting! Could you share what software you use? Cheers!

Neil C said...

All coded in VBA for the moment until I decide on something better.

For Agent Based models, I've used Anylogic (paid) and Repast (free) in the past, but setting up the sims in either would be a pain, so I built from scratch.