Thursday, 28 February 2013

Just stop it. There's no such thing as "Data Science".

I haven't had a good rant for a while. Rants are what blogs are for. Here comes a rant.

The term "Data Scientist" is getting well out of hand. I'm seeing articles all over the place with titles like "What is a Data Scientist?" and "do you need one?".

Data Scientists are what happens when marketing people and journalists spot a trend that's been going on for ages, but decide to act like they've just discovered fire. I should probably decide to call myself one and then double what I used to charge for work, before I was a Data Scientist.

I'm not doing that though, I'm ranting on a blog.

"Data Scientist" is a tautology.

You know what they call people who use data to do science?



Monday, 25 February 2013

Curse Blogger post scheduler. Predictions for erm... last Saturday

This set of predictions was meant to go live on Saturday morning, but a glitch in Blogger's post scheduler meant it didn't happen. Sorry about that. I had a very kind trail from @OptaPro on Twitter this week too, which made it an even bigger disappointment. On the plus side, I couldn't post manually because I'd gone paragliding, and I love paragliding. It's even better than statistics and football matches.

Back to football. The model's new and improved, so I thought it would be worth reposting these predictions and explaining what I've been up to.

If you'd like a bit of history, try my past posts. I'm using an agent-based model of football matches to try to predict results and as usual, predicted starting line-ups for the teams are from Fantasy Football Scout. At some point, I'll build an engine to scrape the actual announced line-ups half an hour before kick off and re-run the model automatically, but one step at a time...

The big improvement I've been working on, which has turned out to make a small overall improvement in prediction accuracy, is to allow players to have a good or bad game. Previously, each player always performed at their average level - so for example if their passing accuracy averaged 80% they'd always pass at 80% accuracy - but now I use the standard deviation of each player's passing accuracy and sample from a normal distribution around their average, to decide how a player will perform in each game. What this means is (to pick a couple of random examples) a very consistent player like Paul Scholes will always pass well in the model. A player like Darren Bent will have passing accuracy that's all over the place, with some very good games and some very bad ones.
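For anyone who wants to see the good game / bad game idea in code, here's a minimal sketch in Python. It isn't the model's actual code. The function name, the two example players and their mean and standard deviation figures are all made up for illustration, and clipping the result to the 0 to 1 range is my assumption about how you'd keep a sampled accuracy sensible.

```python
import random
import statistics

def sample_pass_accuracy(mean, sd, rng):
    """Draw one game's passing accuracy from a normal distribution.

    mean and sd are the player's season average and standard deviation,
    expressed as fractions (0.80 means 80%). The draw is clipped to [0, 1]
    because a normal distribution can stray outside that range.
    """
    value = rng.gauss(mean, sd)
    return min(1.0, max(0.0, value))

rng = random.Random(42)  # fixed seed so the example is repeatable

# Hypothetical figures, not real player stats.
consistent = [sample_pass_accuracy(0.88, 0.02, rng) for _ in range(1000)]
erratic = [sample_pass_accuracy(0.70, 0.10, rng) for _ in range(1000)]

# The erratic player's game-to-game spread comes out much wider.
print(round(statistics.pstdev(consistent), 3))
print(round(statistics.pstdev(erratic), 3))
```

The consistent player's sampled games all land close to his average, while the erratic one swings between very good and very bad, which is the behaviour described above.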

This "form" feature is random for the minute, although I have spotted some interesting relationships in the data and I think to an extent it's predictable when players will have a bad game. Check out this tweet for an example. I've promised EPL Index (where all the data comes from) an article on this though so it will have to wait for a minute.

Onto the predictions! They were predictions, honest.

And how did we do this week?

I actually had a small bet on these and am up for the weekend already, with the Spurs game still to play, so it didn't go too badly. From here, I picked:

Fulham to win (won)
Newcastle to win (won)
Wigan to win (won)
Sunderland to win (lost)
Norwich and Everton to draw (lost)
Spurs to win (playing tonight)

Of the remaining games, I don't trust Arsenal in the model at the moment. They pass well and it's largely a passing-based sim, so it seems to overestimate their chances, although it called this result correctly (just about). Who trusts Arsenal to reliably get a result anyway? The model called Man City and Man United's results correctly, but the odds were rubbish so I left those two.
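A quick sketch of why a correct call can still be a bet worth leaving alone. This is just the standard expected-value sum, not necessarily how I pick bets, and the probabilities and odds in the example are invented rather than taken from these fixtures.

```python
def expected_profit(model_probability, decimal_odds, stake=1.0):
    """Average profit per bet, assuming the model's probability is exactly right."""
    win = model_probability * (decimal_odds - 1.0) * stake
    lose = (1.0 - model_probability) * stake
    return win - lose

# Hypothetical numbers for illustration only.
decent_price = expected_profit(0.60, 2.10)   # positive: worth a look
rubbish_price = expected_profit(0.80, 1.15)  # negative: likely winner, poor odds
print(decent_price > 0, rubbish_price < 0)
```

A strong favourite at very short odds can lose money on average even if the model is right about who wins.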

If Norwich hadn't been allowed to take that last corner, I'd have had an even better weekend! This model's not doing so badly, if I do say so myself. Definitely worth persevering with.

I promise faithfully, on my honour, to have predictions up before the next set of matches this weekend.

Saturday, 2 February 2013

Football Sim: Predictions for 2/3 Feb 13

This is probably going to be the last set of predictions before I put some proper time into improving the model. We know that on current performance, it's going to slowly lose money if you bet on it and that's not tremendously exciting. Improvements from here are much harder than building the simulator in the first place, but I've got a few promising ideas to follow up.

Populating the fixtures with expected starting line-ups is also a complete pain in the neck and takes far too long. I'm going to have to sort that out, because sometimes my Friday evenings are based around beer rather than football match modelling.

Having said that, putting this set of forecasts together has thrown up a few interesting effects and led to me tweaking the algorithm a little already.

Here's what we've got. Starting line-ups from Fantasy Football Scout.

A few of those percentages stick out as disagreeing with the bookies' odds this morning. Taking those in order...

Everton vs. Aston Villa

Everton are predicted to win, sure, but the bookies give Villa almost no chance and my model thinks they could win it. Why does it think that?

The big reason (that we'll see again for the Man City game) is that the model doesn't really understand defending yet. It will penalise teams that have only average ball retention but which are above average at defending. Conversely for Villa, it doesn't know that their back line has shipped 46 goals so far this season. The model also currently sees a player like Fellaini as a striker with decent shooting accuracy and below average passing - it doesn't understand the physicality of his game.

It's far from perfect! I did say I was doing my development work in public. Anyway, on to...

Manchester City vs. Liverpool

I'm sure this is the defending factor again. Could happen though and maybe this prediction will make some Liverpool fans happy.

Reading vs. Sunderland

I like this one, it's interesting! I've got Reading at 10% (decimal odds 10.0). The bookies' odds say they're going to win the game. What's that all about then?
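In case the jump from 10% to decimal odds of 10.0 isn't obvious, fair decimal odds are just one divided by the probability. A small sketch (the function names are mine, and it ignores the bookmaker's margin, so real prices will be a bit shorter than this):

```python
def fair_decimal_odds(probability):
    """Decimal odds at which a bet breaks even for the given win probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / probability

def implied_probability(decimal_odds):
    """Win probability implied by decimal odds, ignoring the bookie's margin."""
    if decimal_odds < 1.0:
        raise ValueError("decimal odds must be at least 1.0")
    return 1.0 / decimal_odds

print(fair_decimal_odds(0.10))    # the model's 10% for Reading
print(implied_probability(10.0))  # and back again
```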

Well, first of all, the model's using player stats over the season so far, not just the past few games. Up until Christmas, Reading really weren't good, which drags their performance down.

The big question in this game though is what's going to happen with Adam Le Fondre? The sim doesn't do substitutes yet and he's not in Fantasy Football Scout's predicted starting line-up. We can't do super subs.

Without Le Fondre starting in the sim, Reading will struggle badly to score.

We've played the game 1000 times without him. Let's stick Le Fondre in for Guthrie, play it another 1000 times and see what happens. We'll be giving Le Fondre his super-sub stats over the whole game.

That's quite a difference! Sunderland still win it, mind.
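The "play it 1000 times, swap a player, play it another 1000 times" routine looks roughly like this in Python. To be clear, this is a toy stand-in and not the agent-based sim: each side simply has a fixed chance of scoring in each of 90 minutes, and every number here (the scoring chances, and the bump for starting a better striker) is invented for illustration.

```python
import random

def play_match(home_chance, away_chance, rng, minutes=90):
    """Toy match: in each minute, each side scores with a small fixed probability."""
    home = sum(rng.random() < home_chance for _ in range(minutes))
    away = sum(rng.random() < away_chance for _ in range(minutes))
    return home, away

def simulate(home_chance, away_chance, runs=1000, seed=0):
    """Play the fixture `runs` times and return the share of each result."""
    rng = random.Random(seed)
    tally = {"home": 0, "draw": 0, "away": 0}
    for _ in range(runs):
        home, away = play_match(home_chance, away_chance, rng)
        if home > away:
            tally["home"] += 1
        elif home < away:
            tally["away"] += 1
        else:
            tally["draw"] += 1
    return {result: count / runs for result, count in tally.items()}

# Hypothetical per-minute scoring chances; only the home side's changes.
without_striker = simulate(home_chance=0.008, away_chance=0.016)
with_striker = simulate(home_chance=0.013, away_chance=0.016)

print(without_striker)
print(with_striker)
```

Both runs use the same seed, so the only thing that changes between them is the line-up. That makes the before and after comparison less noisy than two independent sets of 1000 games.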

Now let's hope the favourites don't let us down this time and we can do a little better than last Tuesday evening.