Tuesday, 29 January 2013

Football Sim: Predictions for 29-Jan-13

OK, here they are! The first time I've let the football simulator loose in public, on games that haven't actually been played yet. Sounds like a recipe for disaster to me, but let's do it anyway.

If you've got no idea what I'm talking about, read this first.

A few things to be aware of... (in other words, here are my get out clauses, terms and conditions, caveats, call them what you like...)

I got the teams' predicted starting line-ups here.

Your guess is as good as mine (or Fantasy Football Scout's) what Newcastle's starting line-up is going to be.

I've made the new NUFC Frenchmen into completely average players for the purposes of the sim. They might be better than that and they might be worse!

As per my big "season so far" post yesterday, the model calls winners correctly around 50% of the time on average. Follow my tips at your own risk...

It gets exact scores right, around 10% of the time.

Both of those mean that for betting, it seems to just about break even (but not quite). I'm working on it.

That Sunderland prediction looks a little too heavily skewed to a home win for my liking.

But I'm off to put a bet on it anyway.

Monday, 28 January 2013

Football prediction: Simulating the season so far

I've been working for a while on an agent-based simulation of football matches, to see how close it can get to predicting the real results of Premier League games.

Last week, I explained roughly what the model is and how it works.

This week, is the model any good? I'll be honest, I may have cherry picked that Chelsea result a little for the last post, but this will be a warts-and-all picture of how well the model predicts. Or how it would have predicted the season so far (with one important caveat that we'll come to in a while.)

One last thing before we get stuck in... The most obvious use of a model like this (if it works) is to gamble based on its predictions, but I'm building more for the technical challenge of seeing if I can do it. I'm really interested in using a model like this as a scenario planning tool - what would happen to your season if you signed player x? Or if player y got injured? If the model can be made to work, you could run 'what ifs' and work out the value of players in terms of expected points added to the team's total across a season.

Back to betting, I might have a punt, but I'm not really a gambler. Bet on it if you like (sensibly!) as I start to predict games on Wallpapering Fog and don't forget to add a comment to let us know how you got on. I'll talk about odds a fair bit below, because they're an obvious source of another prediction, to compare with the model. Having said that, if you'd bet on every game so far this season using the model's predictions - up to 20th January 2013 - you'd have just about broken even. Improvements to the model from here, would make it profitable. Got your attention? Here we go.

What I've been doing this weekend, is building some code to run a whole series of games in succession, not just one at a time. Then I fed in the fixtures, starting line-ups and player statistics for each game this season, up to the 20th January, using data from EPL Index. We simulate each individual game 100 times and get an overall predicted likelihood of home win, away win or draw, plus the most likely scoreline.

Remember that caveat I mentioned? Here it is. I'm simulating each player, using their average performance across the whole season so far, which isn't strictly fair. When Fulham played Norwich on the first day of the season, I wouldn't actually have had any 2012/2013 data to feed into the model at the time - only the previous season's numbers. It's something else on the long list of development tasks that need dealing with...

Here are the predictions anyway. Correct calls in green.
Google's determined to open the image below in its G+ gallery, which isn't readable. Here for bigger.

Overall, the model calls 50% of results correctly, on the criteria that the team it gave the most chance of winning, ran out as winners in real life.

I was initially a bit disappointed with that. Only 50%? I was hoping for more.

Then I had a look to see how often the bookies get it right. No doubt this will be incredibly obvious to some readers, but as I said I'm not a gambler. How often did the bookies' favourite win those games? 51%. (odds from football-data.co.uk)

Suddenly 50% doesn't seem all that bad!

A big part of the error comes from draws, both in my model and in the bookies' odds. A draw is almost never the most likely result of a single game, but overall, around 30% of games will end in a draw. My model only called one game as having a draw as the overall most likely outcome - Aston Villa vs. Stoke.

When you simulate game-by-game, you'll predict almost zero draws, which means you'll be wrong 30% of the time before you even start. Predicting a season, where you only simulate each game once would give a 'normal' number of draws, but each individual game's prediction would be much less accurate. It's swings and roundabouts, depending on whether we're trying to predict final league placings, or the result of a single game.

If you'd bet on the model's prediction for every game, using the same stake, at Bet365's odds, you'd have lost 3% of your money so far. If you'd taken the best odds available in the market each time, not just Bet365's, you'd actually be up 1%. That's not a disaster for a first effort! At least we're not ruined.

Let's see what that accuracy looks like. I tweeted this one over the weekend; it shows ten game rolling average prediction accuracy for the model and also for the bookies.

When the shaded area between the lines is red, the bookies are predicting more results correctly than the model, over the previous ten games. When it's green, the model is out-performing the bookies.

It's interesting that a couple of weeks into the season, accuracy for the model and for the bookies plummets to just 10-20%. It may be that the early season is harder to predict - we'll need to run a few more seasons to find out. That period certainly screws up any hopes of winning a fortune as the bookmakers do slightly better, even though both are doing badly.

Here's the same data, cumulatively. The chart shows total accuracy across the season so far, with 20th January on the far right hand side.

You can see that the bookies' favourites have won more often than the model's across the whole season so far, ending with just the 1% gap that I described earlier, 51% to 50%.

What's very interesting to me, is that the model looks like it's improving slowly across the season and closing the gap; the games may be becoming more predictable as the season goes on. I'm not jumping on that conclusion just yet, but I'm certainly going to keep an eye on it.

If you'd bet using the model, only since the New Year at best market odds, you'd currently be up 19% on your original stake.

Let's finish with the next big improvement for the model - at least I hope it will see a significant improvement in predictive power. I'm seeing these developments as positives, even though there's a fair bit of work involved in building them, because our simple view of the world is doing pretty reasonably and there's huge scope for improvement from here.

At the moment, once a team has the ball in the sim, the opposition can't win it back, the team in possession can only lose it. This is fine in a game against an 'average' team, but taking Arsenal as an example, their passing accuracy was 90% against Sunderland and 81% against Manchester City. I've got no doubt that Man City caused that to happen by harrying their opponents, so the model needs to account for it.

Even more important, it's not every player's passing accuracy that drops against higher quality opponents - some will cope much better than others. We need a way to predict what each individual player's passing accuracy will be, against this week's opponents. I'm working on it.

Stay tuned for predictions for Tuesday's games!

Monday, 21 January 2013

Simulating football matches: An experiment.

Prediction. The holy grail of analysis. Diagnostics are good and they help us to understand the past, but if we can't use that work to get better at predicting the future, then isn't it all a bit pointless? If your theory doesn't predict the future, then it isn't science. Unlike football punditry, you don't get meteorologists diagnosing that last weekend's weather was rubbish because "the Sun must have had an off day". Scientists test their theories through prediction.

In my day job, we predict advertising. You've run adverts for your business before, so we measure the effect of those and then we can tell you how much you'll sell in the future.

Advertising's mostly quite dull. Can we do it for football?

Since Moneyball, a lot of people have been paying close attention to football statistics, writing up analyses and discovering relationships in OPTA's football data. Man City are even trying to tap into those amateur insights through the MCFC Analytics Project.

I couldn't resist diving in, so armed with a subscription to EPL Index (four quid a month for player stats for each individual game? Yes, please) I've been running a little side project and playing with the data.

If you want to predict football, there are a few ways you could go about it and I did have a crack at this project years ago, but with only very top-line data on past game results. You can build a regression model (a bit like the advertising models I build for a living) that uses each team's form to predict the outcome of the next game.

It works something like this...

Predicted result = f(Home Team form, Away Team form, plus lots of other things like goal-scoring and conceding rates...)

All weighted for the quality of opposition over the past few games.

This type of model kind of works. Basically, it will predict things like Man U should beat Wigan, but we already know that. The model I built a few years ago didn't do any better than my own guesses and that's not tremendously useful.

Top-line models like this also have a massive issue in that there are simply too many variables that need to be accounted for. What are the chances that Man U beat Wigan if Van Persie's injured? Our model based purely on past form will struggle with that, especially if he's played all season and we've got no experience with him not present (and just as important, a different player playing) in the team.

You also get very little in the way of explanation with top-line models like this. Man U will win because they usually win and Wigan will lose because (sorry Wigan fans) they usually lose. What can Wigan do about that? Well obviously they need to improve their form. Thanks, Mr. Consultant, you're fired.

Long story short, we need a different technique and the one I've been using is called Agent Based Modelling (ABM).

ABM simulates the world from the bottom up rather than the top down, which in football means simulating the players rather than the result. We set up an artificial game - using real world OPTA statistics about the performance of individual players - and we run the game to find a predicted result. The result is an outcome of the simulation that we can't control directly.

If you're thinking, "Is he trying to build Football Manager using OPTA data?", that's basically the size of it, yes. Told you it was more fun than advertising.

Inside the model, you kick off the game and from then on, it's all down to the simulation. The player with the ball will make a decision, based on what they do most often in real games - pass, shoot, dribble... each decision is randomly generated, but weighted towards the probability of what that player does most often in real life, based on the OPTA data.

If they choose to pass, the simulation checks for a successful pass and then works out who the ball went to, again a randomly generated choice but weighted by real data. It's the same if they shoot, when we work out the chances that their shot went in. If they lose the ball, it transfers to a player on the opposition, again determined by a weighting of... you get the idea. Then the whole thing starts again with the player who has the ball now.

We play the game through, with players passing, shooting, losing the ball etc. and we get a result, which is our prediction for the match.

Now, you might say, "but there were loads of randomly generated decisions in the model. If we ran it twice we might get different results", and you'd be absolutely right. It's just like the real world and if the same teams play each other a few times, you can get a different result every time.

What we're after is the probability of winning for each team, so we run the match 1000 times (for now - it's a nice round number) and count up how many times each team wins.

After some teething problems (it wouldn't be interesting if it was too easy) the model's starting to turn up sensible results and I promise I'll share its predictions for the next set of Premier League games (29th and 30th Jan). I'm not standing by those predictions yet, but it will keep me honest and motivated to do the development in public and to an extent already has... after tweeting on Saturday that Norwich might have more chance against Liverpool than the bookies thought and then watching them get battered 5-0. I now know why the model did that and it doesn't do it any more!

The model is a huge oversimplification of a real game but over time, it should help to teach us about what's really important. As a quick example, the model currently doesn't treat crosses any differently from other passes - they're a complete or incomplete pass and that's it. If they're a complete pass, then the player who receives the ball might shoot. But if we keep seeing that teams with traditional wingers win more games than the model would predict, then that might need sorting out.

I'll end for today with a bit about the Chelsea vs. Arsenal game on Sunday, to illustrate what you get from the model and the sorts of things that we might be able to do with it. Here are the teams (no subs yet by the way):

Run that one 1000 times and what happens?

Chelsea:    44%
Arsenal:    26%
Draw:       30%

So Chelsea are predicted to win 44% of the time. We get a scoreline from the model too and here are the 15 most likely, adding up to 94% of all results. An interesting outcome is that although we predict Chelsea to win overall, the single most likely result is 1-1.

The actual result was 2-1 to Chelsea, so the model got the winner right and 2-1 was our third most likely score. That looks potentially ok! We'll only find out if it really works by testing across a lot of games though.

Another way to see if the predictions are sensible is to compare with the bookies' odds. There's probably something wrong if we're not in the same ball-park as professionals taking advantage of the wisdom of their crowds of punters. The bookies had these odds (decimal odds, with implied percentage in brackets):

Chelsea:    1.83 (55%)
Arsenal:    3.5 (29%)
Draw:       3.25 (31%)

On an Arsenal win, or a draw, we're almost bang on the market odds.  On Chelsea we're below, but bear in mind that the model's odds have to add up to 100%. The bookmakers' odds add up to 114%, which is why bookmakers make money.

Lastly, just as an example of what else you can do with this type of model, there were some raised eyebrows that Torres started for Chelsea instead of Demba Ba. What if they'd been switched and Ba had started the game? Let's run it another 1000 times...

Chelsea's chance of winning goes up by two percentage points to 46%.

Arsenal's chances go down, right? Wrong. This is why agent based models are good - they can show us things that we don't expect.

With Demba Ba starting for Chelsea, Arsenal's chances of winning actually go up by 1% to 27% (to complete the picture, the chances of a draw drop 3%). The balance of how often Ba receives the ball and how often he gives it away compared to Torres makes all the difference. All in all, the predictions barely move but it's an interesting outcome. We'll see more of these.

OK, that will do for now. Predictions next week and maybe, just maybe, a cheeky punt based on our forecasts.

Oh and if you're running your own project like this, I'd LOVE to hear about it! Stay tuned, this is just the start.

Friday, 11 January 2013

Are we branding? Or selling?

If you've worked in advertising for a while, you'll have come across the question of whether it's ok to compromise the 'creative vision' of a 'brand' ad, by sullying it with practical things like phone numbers and web addresses.

Being a data-type person and so not the sort to think that the majority of TV ads have a great deal of 'creative vision', I don't really understand this question, but it's a question that comes up a lot.

I wrote some time ago about adverts, which don't have the product that they're advertising in them and ads which don't tell you how to buy the product, fall into pretty much the same category. The challenge of making an ad is to make something that holds people's attention, so that you can talk to them about a product. Holding people's attention for 30" without the discipline of showing the product, is easy. It's a music video. You just show whatever you want and anyone can do that, even me.

So we need an ad with the product in it. A comment on the post I linked to above, summed it up perfectly: "Can you describe the advertisement without mentioning the product?" Keep that one in mind as you watch an ad break tonight - it's scary how often the answer is, "yes you can and by the way, what was the product?"

OK, Mr. Super-Brand so now you've communicated your product to me and I want to buy it. What do I do next? Oh, you've gone.

If it's not blazingly obvious what to do next then you need to tell me, because your ad wasn't that gripping and I'm very, very lazy. Give me a phone number, give me a web address, tell me where to buy your product. Do not make me work hard to buy you.

Apple can get away with 'pure' brand ads that have no direction afterwards (but that always, always, have the product in them), because everybody knows where to buy Apple's stuff. Most brands aren't Apple.

If a creative agency tells you that their advert won't work so well if it includes your phone number, or retail stores, or web address, or (God forbid) the product that you sell, then they're not doing their job properly. Their job is to include all of those things and still make the ad interesting enough that people will pay attention. That's hard. It's why creative agencies cost money.

Here's something that's caught my eye recently; Samsung have taken to sticking their logo in the corner of the screen for virtually the entire duration of a TV spot.

In a laptop ad, this makes perfect sense because let's face it, all laptops look the same. Without that logo, you're asking somebody to really pay close attention to the ad in order to realise whose laptop they're looking at.

Why don't many, many more advertisers do this?

TV channels do it. Their only purpose is to entertain and they still stick their logo in the corner of the screen!

But the advertisers who pay to show their products on those TV channels, don't do it. Why on earth not? The only reason can be due to a feeling that it would make the ad look 'cheap'. Putting your logo on the screen isn't subtle. Maybe we're afraid to be caught actually doing marketing, so we pretend not to.

Me? In my 30" spot, I'd want my product front and centre, my brand logo next to it and a dirty great web address on the screen at the end. When you say that would compromise the 'quality' of the ad, are you saying my brand looks cheap? You might want to try another argument.