Monday, 4 March 2013

Big Data madness and my football prediction model

If you read Wallpapering Fog on a regular basis, you'll know I'm a Big Data sceptic. I've written about its limitations before, here and here for starters.

The main problems seem to stem from the fact that very often, what non-specialists think is 'Big Data' isn't actually big at all. Your data might not fit easily into an Excel spreadsheet, but that doesn't make it Big Data. Big Data doesn't fit on your laptop.

And the second problem - the really big one - is that having loads and loads of data doesn't usually help very much. I was reminded of that this morning by the news that Netflix has come over all Big Data. What tends to happen (as in the Guardian regional poverty example linked above) is that analysts spend ages processing loads of data and end up with an answer they could have reached much earlier, via a much simpler method.

I had a question recently about my football prediction model: could it use more detailed data? The answer? Maybe, but not yet. I'm nowhere near wringing everything out of Opta's top-line player performance stats, and hugely detailed game event data feeds would very likely bog the analysis down.

My football model is agent-based, so it runs as a computer simulation and can produce unpredictable outcomes, but it's now as good as the bookies at predicting the results of football games, and it does that with only relatively top-line data as inputs.

For the record, even with all the different things that happen in a football game, you can still predict results fairly effectively (and, for the last three weeks running, win at the bookies) using only...

Pass completion rates
Goalscoring rates
Player dispossession rates
A measure of how good the opposition are at winning the ball back

And essentially, that's it.
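To make that concrete, here's a rough sketch of how a simulation could run on just those four inputs. It's deliberately simplified - the real model is agent-based at the player level, whereas this toy version collapses everything into team-level rates - and every number and name in it is invented purely for illustration.

```python
import random
from dataclasses import dataclass


# Illustrative only: team-level rates standing in for the player-level
# inputs, and none of the numbers below are real.
@dataclass
class Team:
    name: str
    pass_completion: float  # chance a pass finds a team-mate
    dispossession: float    # chance the player on the ball is dispossessed
    goalscoring: float      # chance a shot ends up in the net
    ball_winning: float     # how good the side is at winning the ball back


def simulate_match(home, away, possessions=200):
    """Play one match as a chain of alternating possessions."""
    score = {home.name: 0, away.name: 0}
    attacking, defending = home, away
    for _ in range(possessions):
        completed = 0
        while True:
            # Dispossessed? More likely against a side that presses well.
            if random.random() < attacking.dispossession * (1 + defending.ball_winning):
                break
            # Misplaced pass? Also more likely against a good pressing side.
            if random.random() > attacking.pass_completion * (1 - defending.ball_winning):
                break
            completed += 1
            # Assume a move of five completed passes ends in a shot.
            if completed >= 5:
                if random.random() < attacking.goalscoring:
                    score[attacking.name] += 1
                break
        attacking, defending = defending, attacking
    return score[home.name], score[away.name]


def match_odds(home, away, n_sims=10000):
    """Monte Carlo a fixture to get home/draw/away probabilities."""
    tally = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        h, a = simulate_match(home, away)
        if h > a:
            tally["home"] += 1
        elif a > h:
            tally["away"] += 1
        else:
            tally["draw"] += 1
    return {k: v / n_sims for k, v in tally.items()}


if __name__ == "__main__":
    # Made-up example inputs.
    home = Team("Home", 0.80, 0.03, 0.10, 0.05)
    away = Team("Away", 0.74, 0.04, 0.09, 0.06)
    print(match_odds(home, away))
```

Run enough simulated matches and the share of home wins, draws and away wins becomes a set of probabilities you can put next to the bookies' odds.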

Of course the model isn't perfect and there are tons of improvements to be made, but the crucial point is that if I'd started with Opta's event-level statistics, I'd be nowhere. I'd probably still be trying to pull that feed into a useful database, never mind understanding any of the underlying relationships in the data.

I trained as an economist and, more precisely, as an econometrician, so my instinct is to simplify problems: build the 80% accurate answer before you go for the 95%, because if you dive straight into complexity, you'll fail. People forget that even Google, the poster child for Big Data, started with a much simpler algorithm than it uses now.

Management will learn through experience eventually, but at the moment (largely) IT staff who are capable of assembling huge amounts of data are promising nirvana, and business owners are listening. Very few companies have any idea what they're going to do with all this information. An unspecific goal of "data mine it" is a business case that should never get past its first review.

This drive to collect and process massive amounts of data, by businesses that don't understand their simple data yet, is madness. Hugely expensive madness.
