Wednesday, 3 December 2014

There's a storm coming to marketing

Rory Sutherland tweeted a fascinating link a few weeks ago. He does that a lot, but this one in particular has stuck with me.

The link points to this article on Wikipedia, about Moravec's Paradox. Essentially, Moravec's Paradox explains that it's easier to program computers to do stuff that we think is complicated, than to do stuff that we think is easy.

Teach a robot to play world class chess? Done. Deep Blue beat Kasparov in 1997 and it's all downhill from there.

Teach a robot to walk as well as a human toddler? Nope. Now we're stuck.

As artificial intelligence improves, Moravec's Paradox suggests that you should be fearful for your job if you work with data analysis and structured processes. On the other hand, there's no imminent danger of somebody building a robot that's adaptable enough to fix the central heating in every different home. The plumbers will be fine. The jobs that we think of as 'easy' - manual labouring and skills that require some physical coordination - are way beyond the capability of today's computing, but the jobs that we think of as 'hard' may not be.

Keep Moravec's Paradox in mind as we look at a couple of new tools.

First, CausalImpact by Google. (Yes, that name needs a space. No, it hasn't got one.)

CausalImpact is a tool for estimating what advertising has done to web traffic. You feed it your traffic stats and your advertising stats and it estimates how hard the advertising is working to create more traffic.

In essence, this is how I've been earning a living for the past fifteen years. Google just automated it.

OK, that's over-dramatic. Google hasn't made me redundant, yet. CausalImpact is a very small stepping stone: it only works for website traffic, in many cases it won't work at all and you need a fair bit of technical knowledge to deploy it, because it comes as an R plugin.
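To make the idea concrete, here is a minimal sketch of the reasoning in Python. It is not CausalImpact itself: the real package fits a Bayesian structural time-series model, whereas this swaps in a plain least-squares line. The weekly numbers, the control series and the function names are all made up for illustration. It learns how traffic tracks a control series before the adverts start, predicts what traffic would have been without them, and calls the gap the uplift.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def estimated_uplift(control, traffic, campaign_start):
    """Total extra visits in the campaign period versus a predicted baseline."""
    # Learn the relationship from the weeks before the campaign only.
    intercept, slope = fit_line(control[:campaign_start], traffic[:campaign_start])
    # Predict what traffic would have looked like with no advertising.
    baseline = [intercept + slope * c for c in control[campaign_start:]]
    actual = traffic[campaign_start:]
    return sum(a - b for a, b in zip(actual, baseline))


# Made-up weekly numbers. The control is something the adverts can't move
# (say, overall category searches); traffic is visits to our own site.
control = [100, 110, 105, 120, 115, 125, 130, 120]
traffic = [200, 220, 210, 240, 230, 300, 310, 290]  # adverts start in week 6

uplift = estimated_uplift(control, traffic, campaign_start=5)
print(round(uplift))
```

The whole thing stands or falls on the control series: if the advertising can move the control as well, the baseline is contaminated and the uplift is understated.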

R plugins are hard, because R is hard. But then people like @jjmulz do helpful things like this.

And suddenly the ground I'm standing on starts to look shakier again. All programming tools are hard until somebody sticks an easy front end on them. If CausalImpact doesn't do it for you, try another Google funded project - the Automated Statistician. The machines are definitely coming.

This is all early days, but you can see where it's headed. Marketing analysis is a process and it has to be a fairly repeatable one, or you'd never be able to sell it to clients as a product. Without a process, every single project would be its own piece of R&D that might or might not work.

Marketing return on investment analysis is difficult, but so is chess and computers are better at chess than we are. You just have to teach them a framework for understanding the game.

What about the other end of the media planning process? The planning bit, before you get to measure what you've achieved? Charging into audience discovery, comes Profiler, from YouGov.

You probably saw YouGov Profiler via social media in the past few weeks. It's great.

Type in almost any subject area and it will tell you about the people who are interested in that topic. The scope of what you can look up is seriously amazing - you have to give it a try.

If you work in marketing, you'll quickly recognise the screens that pop out of Profiler as 'pen portraits'. These portraits are front and centre in every agency's pitch documents and annual plans. First we tell you about the audience who we want to see your adverts and then we tell you how we're going to achieve that.

Click on the 'Media' tab that you get on the output screen, bearing in mind that this is a demo and the full product will have loads more detail.

Damn, somebody's just automated another part of what marketing agencies do.

It is true that few businesses - other than marketing agencies - will buy access to the whole of YouGov's tool, because it would be too expensive for a piece of kit you'll use once or twice per year. Marketing agencies could still act as an intermediary, holding data and tools and running them for clients. We do this a lot now.

Except that if we've learned one thing about the web, it's that the web disintermediates. If you're sat in the middle of a transaction, making money by being a gatekeeper who controls access to a resource, then you should be scared of the internet. High street shops, travel agents, music labels, publishers... sooner or later, intermediary businesses get slapped by the web, because it puts buyers directly in touch with sellers.

If I was YouGov, I'd sell the Big Expensive Tool version of Profiler, but I'd also make it available on a 'pay as you go' model and let individual companies buy data, one query at a time. At the point YouGov or one of their competitors does that, the insight that agencies can create by profiling an audience becomes quite seriously devalued.

Just like Google's CausalImpact, the profiles that a company runs for itself probably won't be as sophisticated as they'd get from a professional analyst working in a marketing agency, but in many cases that won't matter. Amazon can't recommend books like an independent book store can, but it still forced most of the independents out of business.

Marketing agencies have spent years refining their processes. We're proud of our processes and they're what we use to differentiate ourselves from other agencies. We talk constantly about how we have a process to discover things differently, or connect them differently, or to measure the results better.

Computers are good at processes and this is going to become a serious problem for marketing companies and the people who work in them. Pieces of what we do are going to get automated. Pieces of what we do are already being automated.

The real revolution is quite some way off and you probably don't need to worry about it too much yet, because sexy bits of technology that you've only just heard about are usually ten to twenty years away from actually working properly. In the meantime though, we're going to see many innovations that chip away at the agency model and marketing agencies are going to have to work out - again - what it is that they can actually charge clients for.

We only did planning, until our clients mostly evolved onto fairly similar, effective, best-practice media plans.

Then we did 'added value': Processes, discovery, insight and post-campaign analysis.

When the processes, insight and analysis start to be automated, what will we do then?

My strong suspicion is that a marketing agency's true value lies in human interactions and in explaining the world, person-to-person, to our clients. Rather than selling 'things' - media plans, PowerPoint decks, research studies and analyses - we're going to have to become much better at charging for these human interactions. If we don't, we'll slowly be automated into irrelevance.

Monday, 10 November 2014

Visualising football analysts on Twitter

Building on my new-found love of network diagrams, I thought it would be fun to visualise a social graph of football analysts on Twitter.

Who should you follow? These guys. They're fascinating.

Click the image for a (much) bigger and zoomable version.

 Large version

Small print:

Lots of users following each other moves those users' nodes closer together.

Following, replying to, or mentioning a user on Twitter gets you linked.

Nodes are sized by number of inbound links (i.e. shouting a lot and following lots of people doesn't get you a big circle, other people mentioning and following you does).

Twitter rate limits mean that once you hit a certain number of followers, you don't get any bigger. That's why all of the core people have nodes that are the same size.

This graph undoubtedly flatters my own profile because it's built from people I follow and talk to.

The starting point for the graph was Twitter users in this list. Who's missing? Let me know!
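For the curious, the node sizing rule in the small print comes down to counting inbound links. This is a minimal Python sketch with made-up users, not the process used to build the graph. Collapsing repeat interactions between the same pair into a single link is an assumption on my part.

```python
from collections import Counter

# Hypothetical interactions as (who, whom) pairs. The same pair can turn up
# several times, because a follow, a reply and a mention each count.
interactions = [
    ("ann", "bob"),  # ann follows bob
    ("ann", "bob"),  # ann also mentions bob: assumed to still be one link
    ("cat", "bob"),
    ("bob", "ann"),
    ("cat", "ann"),
    ("cat", "dan"),
]

links = set(interactions)  # one link per pair, however often they interact
inbound = Counter(target for _source, target in links)

for user, size in inbound.most_common():
    print(user, size)
```

In this toy data cat is the most active user, but nobody links back, so cat gets no circle at all.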

Thursday, 6 November 2014

Visualising 45,000 football transfers

Football's an international business and it's obvious to anybody watching a Premier League game, that players have been transferred in from all over the world.

But which countries' clubs are the most interconnected? Is the old cliché true, that British players don't travel as much as their foreign counterparts? And can we show the relationships between clubs in an interesting way?

I drew the following images with Gephi, using data on just under 45,000 player transfers, taken from SoccerWiki. Gephi clusters teams by the closeness of their transfer activity; a lot of players moving between teams means that they will group together, while teams that are far apart rarely acquire each other's players.

Some of these images benefit from clicking through to the larger version link and zooming in...

45,000 player transfers

Big version. Zoom in and scroll to see detail.

A rough guide to national connections
The UK and Italy stand apart from an interconnected Europe.

Big version

The British peninsula
Note the Scottish spur and island of Ireland.

A few technical notes:

Node sizing is by number of transfers in and out. A larger node indicates more transfer activity.

SoccerWiki isn't a perfect repository of transfer data, but it's more than good enough to draw this sort of network diagram and overall is a really fantastic resource. Although the way that SoccerWiki stores information makes it impossible to put an exact time-stamp on transfers, data covers a range from 2007 to 2014.

I've dropped any team with fewer than 20 player movements - in or out - in order to clean up the diagram. With everything switched on, it renders very slowly and you get a cloud of small, barely attached teams floating around the edges. They're distracting without adding any information to the visualisation.

Views were rendered using Gephi's 'Force Atlas 2' algorithm.
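The sizing and clean-up rules above are simple enough to sketch in a few lines of Python. This is an illustration with made-up transfers, not the script behind the images. The function name and the (from_club, to_club) row layout are assumptions.

```python
from collections import Counter


def busy_teams(transfers, min_moves=20):
    """Count player movements in and out per team and keep the busy ones."""
    moves = Counter()
    for seller, buyer in transfers:
        moves[seller] += 1  # a player out
        moves[buyer] += 1   # a player in
    return {team: n for team, n in moves.items() if n >= min_moves}


# Hypothetical (from_club, to_club) rows. The real file has just under 45,000.
transfers = [
    ("Leeds", "Everton"),
    ("Everton", "Arsenal"),
    ("Arsenal", "Leeds"),
    ("Leeds", "Arsenal"),
    ("Minnows FC", "Leeds"),
]

# A threshold of 3 stands in for the 20 used on the full dataset.
print(busy_teams(transfers, min_moves=3))
```

The surviving counts double as node sizes, since both are just transfers in plus transfers out.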

Tuesday, 15 July 2014

The quiet BI revolution (part one)

Three years ago on Wallpapering Fog, I wrote a post about why our company (or more precisely, since the company's huge, my department) had adopted Tableau software.

At the time, I said:

"I feel like I'm giving away a trade secret here, but what the hell, you're going to hear about it from somewhere soon anyway."

Having just attended the London Tableau Conference, I can confirm that the secret is well and truly out. It was a brilliant event, brimming with enthusiastic people and great ideas, that deserves its own write-up away from this post.

For this post, I'd like to indulge in one of my occasional crystal ball gazes and look at the future of Business Intelligence (BI). Not BI on the cutting edge - although that is an exciting topic - but BI in regular businesses. Businesses that have small analytics teams, no time and aren't PR'ing a project to the trade press, with all of the doubts and the dirty laundry Tippexed out.

So where is BI - and in particular, regular reporting - for a normal analytics team going to head over the next five to ten years?

1. Data Visualisation and Reporting

Data vis, as it applies to most businesses, is now a solved problem. (What to visualise isn't. That's part two of this post.) You can have good looking reports, automatically refreshed and delivered onto any device you like and even on paper, if you must. They're quick to build, easy to adapt and easy to maintain - more so than Excel-based reports ever were and much more flexible.

The only things you can't do easily, are weird and wonderful innovative visuals that nobody's ever seen before and you can't have all of this functionality for free.

On the first of these problems, I'd argue that this isn't a business issue. Businesses need straightforward charts, tables and standard reports, not animated 3D network diagrams, so software like Tableau will do a great job. I'd also argue that if you're looking for real flexibility, Lyra is something that I'm quite excited about...

On the second problem - cost - you just have to bite the bullet. $20,000 spent on the right BI software will transform your analytics department.

(That's if you give the $20k to your analytics department. DO NOT give it to a centralised IT team. They'll very likely ask for another $230k to make a nice round number, disappear for six months and then reappear asking for more money.)

The real change in data reporting, investigation and visualisation over the next five years or so, is going to be from a situation where many businesses don't yet realise that it's a solved problem, to one where they do.

Tableau's solved this problem and in my opinion is by some distance the best of the new breed of reporting and investigation tools, but if it hadn't been Tableau it would have been QlikView. And if not them, Spotfire. And... you get the point.

What's going to happen over the next few years is that Tableau knowledge will become more valuable - because more businesses will want to hire those skills - and also less valuable, because loads more people are going to know how to use the software. The end result is basic supply and demand. It might swing back and forth for a bit, but we'll settle onto a situation where many (most?) analysts know Tableau as a regular part of their job. There'll be specialists, just like there are specialist Excel consultants, but most businesses will sort themselves out and nobody will be paid a fortune just for knowing how to use Tableau.

So far, no real surprises and if you read Wallpapering Fog regularly then you've probably heard those ideas before. The next two points are where I see a quiet revolution happening.

2. (not) Data Warehousing

You probably already know how this works. Analysts with Tableau do the visuals, but there's a big SQL database in the back end, looked after by a centralised IT team, which contains exactly 73% of what you want to visualise. A big enough gap that you can't just ignore data that isn't in the data warehouse, but not so big that the data warehouse as it stands is useless.

What often happens in response to an incomplete data warehouse, is that analysts build a hack. The data that isn't centralised is pulled in from ad-hoc spreadsheets and mashed together in Excel or Tableau, which works OK until you need more than a couple of people to update those spreadsheets, or somebody's on holiday. This is the issue we often hit in media agencies; you can solve a problem once, but can't roll out the solution everywhere to all clients because some parts of your 'solution' are held together with gaffer tape and bits of string.

What's needed is some software that's built for analysts and allows them to merge different data sources and to schedule updates, without recourse to a database administrator.

If you were at the Tableau Conference last week, then you'll have seen Alteryx sat squarely in this area. Drag-and-drop, hugely flexible and very friendly, I played with the demo a few months ago and I loved it.

But, it is quite pricey. Especially if, like us, you wouldn't plan on using all of Alteryx's capabilities and are only really interested in blending data sources together.

Did somebody say, "What about open source?" Here's my tip of the day. Go and download the Community Edition of Pentaho Kettle and persevere through the thirty minute skirmish it will take you to get it installed and working properly. Your reward will be drag and drop data acquisition, blending and output, all for free. This is how I process a lot of my football data and it's brilliant.
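If you haven't seen a blend step written down, this is roughly the shape of one, sketched in Python rather than in either tool. The two extracts, the column names and the functions are invented for illustration, and a real ETL tool adds scheduling, type handling and error reporting on top.

```python
import csv
import io

# Hypothetical extracts: one from the data warehouse, one from a
# hand-maintained spreadsheet saved as CSV.
warehouse_csv = """campaign,sales
spring,1200
summer,1500
autumn,900
"""
spreadsheet_csv = """campaign,spend
spring,300
summer,450
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def blend(left_rows, right_rows, key):
    """Left join: keep every left row and add matching right-hand columns."""
    lookup = {row[key]: row for row in right_rows}
    blended = []
    for row in left_rows:
        merged = dict(row)
        merged.update(lookup.get(row[key], {}))
        blended.append(merged)
    return blended


blended = blend(read_rows(warehouse_csv), read_rows(spreadsheet_csv), "campaign")
for row in blended:
    print(row)
```

Note the autumn row, which comes through with no spend column because nobody has filled in the spreadsheet yet. That gap is exactly the sort of thing a blend needs to flag rather than hide.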

In terms of crystal ball gazing, the analytics department now starts to look quite different. It's running a lot of reports on schedules, freeing up time for investigation and innovation. Nobody does the whole "getting into work at 7am on Monday for a frantic three hours of board report running" any more, which retailers in particular are very fond of. And thank God for that.

In our new world, IT only handles data when it needs to flow in large volumes from a point-of-sale or distribution system. IT does the bit that it already does very well now, but everybody stops moaning that the data warehouse doesn't also contain lots of the smaller user-maintained pieces of information that make a business run properly.

If you're thinking that the new world sounds like the same old BI promises, then you're right, it does. We should have been able to do these things ages ago but it didn't work due to the disconnect between analysts and IT and the slow build time, inflexibility and high cost of software. Analysts received questions and understood what output was needed, but usually only IT had the (inflexible) technology to make that output happen automatically.

The big differences now are speed, cost, flexibility and the number of companies that will be working in this new way. It's no exaggeration to say that you're able to go from raw data, to first-version business reports in two days. You can pin those down to a format everybody's happy with in a couple of months (faster if you make decisions quickly) and then you can fully automate them. Reports are able to evolve because they can be rebuilt and republished very quickly, in hours rather than weeks.

Then what do you do next? It's a serious question with which some reporting teams are going to struggle. When nobody needs you to move data from Google Analytics to Excel and chart the same charts every week, what will you do? The time to start thinking about that is now.

3. Data acquisition

This one's not solved; it's currently being solved and we've got a little way to go yet. Data acquisition is the last barrier between analysts, managers and an automated dashboard containing absolutely everything on which they wish to report.

Alteryx and Pentaho Kettle are fantastic data assembly (ETL) tools, provided your data isn't stored somewhere really stupid. Unfortunately, I work in marketing and our industry specialises in making data as difficult as possible to access.

- It's in untidy, bespoke web interfaces, behind login screens.

- It's in the colour key that somebody has chosen to fill cells in Excel

- It's emailed across, with a friendly "Hello! Hope you had a good weekend. Today's spend number is £2,486."

Database that, smartarse.
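In fairness, the emailed number is the easiest of the three to rescue, as long as the sender words it the same way every day. A minimal Python sketch follows. The pattern and function name are illustrative and it assumes the first pound amount in the message is the spend figure.

```python
import re

# A pound sign, optional space, then digits with optional thousands
# separators and optional pence.
SPEND_PATTERN = re.compile(r"£\s*([\d,]+(?:\.\d+)?)")


def spend_from_email(body):
    """Return the first pound amount in the text as a float, or None."""
    match = SPEND_PATTERN.search(body)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


email = "Hello! Hope you had a good weekend. Today's spend number is £2,486."
print(spend_from_email(email))
```

It breaks the day somebody writes "about two and a half grand", which is rather the point.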

What I see happening over the next few years is some new tools and some new ways of working. Provided data is delivered in a consistent format, then the likes of Alteryx or Kettle can make the data acquisition and blending problem go away.

Where data is in web interfaces, we can already scrape it using Python or R, but then you need an analyst who knows how to scrape and that's not such a common skill-set. (Top tip: look for a football analyst - by necessity, we're getting quite good at it.)

We're going to evolve towards XML and other data feeds in addition to the usual user facing tables that come from the majority of web data sources, which again brings the likes of Alteryx into play. The data providers who don't do this should gradually become extinct through a process of natural selection.

Eventually, these changes will form an almost universal API. Every provider's data is different, but you'll be able to get to the data in an automated way and that's 90% of the battle. When you've done that, you only need to solve the data transformation problem once.

We'll also see - as is happening already - advanced data providers like Datasift starting to deliver information into services such as Google's Cloud Platform. A few years ago this wouldn't have helped, because you're just swapping one API for another, but when a critical mass of services all use that same cloud, easy connectors start to appear.

So why do I say that data acquisition isn't a solved problem yet?

Well for one, too many sources are still silos, but a second issue is that user input is still much too difficult. There's no Tableau for manual data entry and we still have to call a developer to create web forms and database schemas and data validation and to link it all together for us. Either that, or we have a central spreadsheet for people to fill in and we pray that they don't break it, or all try to edit it simultaneously.

I'm sure this software will come, but I haven't yet seen it. Microsoft Access forms and VBA really aren't it and neither are Google Forms. Microsoft, for all that they had a massive head start and will claim to have solutions to all of these problems, are nowhere in the BI race and are falling further behind.

If you've seen another solution to the problem of regularly taking validated user input without embarking on a software build or trying to lock down a spreadsheet, I'd love to hear about it in the comments.

The future's bright

In our future analytics department a lot has changed, but it's been a quiet revolution. A lot of things that were difficult are now easy and the business analyst's scope has extended well into traditional IT territory. Or, more accurately, that territory is more clearly delineated between the two departments and issues which neither IT nor analysts could previously solve (for a sensible budget in a sensible time-frame), have been dealt with.

Reports have moved to web browser interfaces - except for those staff who absolutely insist that they need printed ones - and automation takes care of putting them together. Analysts can quickly and visually interrogate their data and as an aside, Excel has moved to being a secondary tool for serious analysts, behind Tableau (or a competitor of your choice).

We were promised all of this a long, long time ago. Most businesses might actually get there in the next five years or so. It's interesting that the process of assembling Business Intelligence is being solved backwards... Rather than from data collection, to merge, to visualise, solving the visualisation element has driven a requirement to be able to better blend data, which in turn drives changes in how we acquire it.

And you know what happens after that? Businesses will start to realise that a lot of the information they've spent years trying expensively to assemble, won't on its own work the miracles that they hoped it would. Not without some other major changes happening too.

My favourite quote from last week's conference came from Fawad Qureshi of Teradata.

"Old business process + expensive new technology = expensive old business process"

That will be part two of this post. When you've got to your ultimate suite of business reports and they're easy to maintain, what happens then? What changes? Does anything happen at all?

Thursday, 22 May 2014

The insular world of marketing

It's election day! And it's an election day that I'm personally fascinated by, in terms of whether the pre-election polls are anywhere near accurate.

Take a look at the image above. The Sun and YouGov are predicting a narrow UKIP win.

Do you know anybody who's said they're voting UKIP? I don't. Maybe you've got a batty aunt, or a slightly racist grandparent who makes you cringe now and again in public, but do over a quarter of people you know intend to vote UKIP?

Probably not.

This effect caused me to lose a tenner, betting on the London Mayoral election that saw Boris Johnson beat Ken Livingstone. The bookies had Boris as nailed on favourite, but I only knew one person who planned to vote for him. Nobody I knew could name many people who planned to vote for Boris either.

Of course you often surround yourself with like-minded friends, but work colleagues and acquaintances were vehemently anti Boris and surely your work colleagues are a decent random(ish) sample of different opinions?

It turns out not and I lost my tenner.

If you're here, reading this, then you're likely a thoughtful, analytically minded person with either a marketing or football analysis interest. Probably, you're not planning to vote UKIP and you don't know many - or even any - people who are.

Does this matter? In marketing, I think it does. We're trying to sell products to the population of the UK in general and to do that, we need to understand what motivates people in general, not just people like ourselves.

Walk into any big marketing agency in London and the people you'll meet will predominantly be:

  • Under 35. Many will be under 25.
  • University educated.
  • White.
  • Renting their home.
  • Unmarried.
  • No kids.
  • Travelling daily on public transport. Mainly on the tube, which obviously only exists in London.

That's a very narrow selection. Even the simple fact that all of these people live in London makes their day-to-day life quite unlike that of 85% of the UK population.

I work for MediaCom North - based in Leeds - and so some of the regional biases are removed in our office, but I bet I still couldn't find a UKIP voter here. I'd be staggered if over a quarter of the voters in the office supported UKIP.

As marketing people, we need to be acutely aware of our own inherent biases so that we can avoid them. Look at the adverts running on TV on any night of the week and ask yourself how many are designed to appeal to an under thirty year old audience. Then ask yourself, honestly, if most of the people buying that product are likely to be under thirty. Cars? Nope. Supermarket shoppers? Nope. Holidays? Nope.

For me, agencies need to be doing much more immersion into the lives of people who don't think like themselves (and I mean real immersion; I love stats as much as the next guy but they're a starting point, not the whole solution). A once a year factory visit or focus group just doesn't cut it.

We should also be hiring and retaining a more diverse mix of people, particularly people over thirty five. If the problem is that those people leave London when they hit their mid-thirties then maybe we need some more innovative solutions to tap into their opinions and experience.

Finally, as a client, I'd be looking seriously at non-London agencies to get some wider perspective. A global car manufacturer would naturally look to the scale of the big London agencies - and maybe they should - but they need to be aware that the people working on their account almost certainly don't own a car, don't have the money to buy one and wouldn't have anywhere to park it if they did. That's why virtually all car ads are either full of young people, or a very crude caricature of older people.

Could your agency advertise UKIP and really understand what motivates all of those people who plan to vote for them? Or would you end up with a stereotyped portrait, produced by a youthful, liberal-leaning, well educated planner?

Of course, the question of whether you should take that brief is a whole other issue.

Monday, 19 May 2014

Bigger data isn't necessarily better

Sometimes it's hard being a statistician. Sometimes a long established statistical concept jars with your audience and no matter how hard you try to explain it in plain terms, you can see in the audience's eyes that they don't really believe you. Those suspicious eyes staring back at you are fairly sure you're pulling some shenanigans to get out of working harder, or to wring an answer from the data that isn't really there. What you're saying just feels wrong.

Explaining sampling can be like that, particularly when you're dealing with online data that comes in huge volumes and fighting against a tidal wave of 'Big Data' PR.

The audience's thinking goes...

More data is just better, because more of a good thing is always better.

More data must be more accurate, more robust.

More impressive.

Then a statistician says, "We only need 10% of your file to get you all the answers that you need".

And rather than sounding like an efficient, cost effective analysis, it feels disappointing.

"You only need a spoonful of soup to know what the whole bowl tastes like"

A common question from non-statisticians is to ask, "Overall, I have five million advert views [or search advert clicks, or people living in the North East of England, or whatever], so how big does my sample size need to be?"

Which sounds like a sensible question, but it's wrong.

Statisticians call that overall views number the "Universe" or "Population". It's the group from which you're going to draw your sample.

Once your population is bigger than about twenty thousand, it makes no difference at all to the size of the sample that you need. If you say that you've got one hundred million online advert views, and ask how big your sample needs to be, then the answer is exactly the same as if you had fifty million views. Or two hundred million.

Which probably sounds like statistical shenanigans again.

Think about it like this. I've got lots of ping-pong balls in a really big box and I tell you that some are red and some are white and they've all been thoroughly mixed. You can draw balls from the box one at a time until you're happy to tell me what proportion of each colour you think is in the box. How many ping-pong balls do you want to draw?

Seriously, pause and have a think, how many do you want to draw? It's a really big box and you'll be counting ping pong balls for a week if you check them all.

Let's start with ten. You draw ten balls and get four red and six white.

Is the overall proportion in the box 60/40 in favour of white? It might be, but you're not really sure. Ten isn't very many to check.

You pull another ten and this time you get five more of each colour. Now you've got eleven white and nine red. Happy to tell me what's in the box yet? No?

Let's keep drawing all the way up to 100 ping pong balls.

Now you've got 47 whites and 53 reds. The proportion seems like it's close to 50/50, but is it exactly 50/50 in the rest of the box?

Every time you draw more ping-pong balls, you get a bit more sure of your result. But have you noticed that we haven't mentioned once how many balls are in the box in total; only that it was a big box? It's because it doesn't matter.

As long as the population is "big" and we draw balls at random, it doesn't matter how big it is.
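If you'd rather see the arithmetic than take it on trust, here is a minimal Python sketch using the standard sample size formula for a proportion, with the finite population correction. The ±2 point margin, the 95% confidence level (z = 1.96) and the worst-case 50/50 split are my assumptions for the example, not figures from the ping-pong story.

```python
import math


def required_sample(population, margin=0.02, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion to within +/- margin."""
    # Size needed if the population were infinite.
    n_infinite = (z ** 2) * p * (1 - p) / margin ** 2
    # The finite population correction shrinks that for small populations.
    return math.ceil(n_infinite / (1 + (n_infinite - 1) / population))


for population in (20_000, 50_000_000, 100_000_000, 200_000_000):
    print(f"{population:>11,} in the box -> draw {required_sample(population):,}")
```

Fifty million, one hundred million and two hundred million all ask for the same sample of about 2,400. The size of the box only starts to matter once it gets small.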

Here's how your confidence in the result changes as you draw more ping-pong balls from the box:

The bigger your sample, the better your accuracy, but beyond a certain size - say 5,000 - your result is highly accurate and having an even bigger sample doesn't make very much difference.
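The same pattern falls out of the textbook margin of error for a proportion. A minimal Python sketch, assuming a simple random sample, 95% confidence (z = 1.96) and a worst-case 50/50 split:

```python
import math


def margin_of_error(sample_size, z=1.96, p=0.5):
    """Half-width of the confidence interval for a sampled proportion."""
    return z * math.sqrt(p * (1 - p) / sample_size)


for n in (10, 100, 1_000, 5_000, 50_000):
    print(f"{n:>6,} balls drawn: +/- {margin_of_error(n) * 100:.1f} points")
```

Going from 10 balls to 100 takes you from roughly ±31 points to about ±10. Going from 5,000 to 50,000 costs ten times the counting and only moves you from about ±1.4 points to about ±0.4.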

"But!", say the objectors, "Online, data is basically free and we can use the whole dataset, so we should!"

And that's true, up to a point. Data storage is so cheap it's close to free, but data processing isn't. A large part of the cost is in your own time - you can wait ten minutes for a results dashboard to refresh, or you can sample the data, wait thirty seconds and get the same answer. It's your choice, but personally I like faster.

Outside the digital world, storage is still cheap, but data collection can get really expensive.

The TV industry in the UK is constantly beaten with a stick based on the fact that TV audience figures are estimated using a sample of 'only' 5,100 homes. It costs a lot to put tracking boxes into homes and this number has been arrived at very carefully, by very well trained statisticians. It's just enough to measure TV audiences with high accuracy, without wasting money.

In fairness, the BARB TV audience panel is challenged by a proliferation of tiny satellite TV channels - because sometimes nobody at all out of those 5,100 homes is watching them - and by Sky AdSmart, which delivers different adverts to individual homes. It may need to adapt using new technology and grow to cope, but nobody is seriously suggesting tracking what everybody in the UK watches on TV, at all times, on all devices. That would be ridiculous.

I'll be blunt. Any online data specialist who uses the 5,100 home sample to beat 'old fashioned' TV viewing figures, doesn't know what they're talking about.

Sampling is an incredibly useful tool and sometimes more isn't better, it's just more. More time to wait, more computer processing power, more cost and more difficulty getting to the same answer.


Monday, 7 April 2014

Visualising Everton 3 - 0 Arsenal

I've been playing with 3D visualisations of Opta football data over the past few weeks, trying to build a picture of the action areas in a game. This post is me thinking out loud more than a finished product, but there's definitely something about 3D mapping that does work.

3D is usually to be avoided (particularly in pie charts!) and I've said as much in my guide to data visualisation for marketers. The problem when visualising touches in a football game on a flat pitch though, is that very often you'll see something like this:

It's obviously displaying too much data. Converting to a heat or contour map helps, but unless differences between areas are very starkly defined, it doesn't make important areas of the pitch really jump out.

So, 3D...

I've taken the data from the Everton vs. Arsenal game yesterday and with R and rgl, used it to create a contoured surface. Add flags for shot locations and a textured surface for the pitch and you get the images below.

You can see - as we've found before - how Everton concede the centre in favour of the wings and the importance of Leighton Baines on Everton's left. Despite that ball movement through the wings, Everton's shot locations are more central than Arsenal's, with Arsenal taking a number of inaccurate shots from wide on the left. Everton's two goals came from almost the same spot, with the third being an Arteta own goal.

I'll keep posting these from time to time and working on the visualisation. They're not a finished product, but I like the effect and think it's worth persevering with. Any ideas, or games you'd really like to see? Let me know in the comments.

Tuesday, 11 March 2014

Mapping UK Adland

I've been putting together a lot of advertiser spend data recently, for our own internal Tableau dashboards, and thought it might be fun to throw the dataset at R too and make something less functional but a little bit prettier.

These are contour maps showing the locations of UK advertisers spending more than £500k on TV, radio, print and posters last year. Darker equals more businesses in the area and I've deliberately dropped legends to avoid cluttering up the maps.

Huge thanks to the people behind R and the ggmap package, who are much, much cleverer than I am!

UK businesses spending more than £500k on advertising in 2013 (Click for bigger)

Focussing on England and Wales...

It's not all about London...

Nobody goes South of the River...

Friday, 14 February 2014

Premier League attack patterns visualised

Yesterday, I posted some visualisations of approach play in the Premier League. They describe how passes into a 'shooting zone' in front of the goal tend to be more successful when they come directly, rather than from wide areas.

I've started to play with these visualisations for individual teams and a few people have asked how they look, so today I'm posting attack patterns for the current Premier League top seven. We're looking at the number and success rate of passes played into a boxed-out 'shooting zone'. Data covers the first half of the current Premier League season, up to the end of January.

For the following heat maps...

Size of square = number of passes
Colour of square = pass success rate

Large and green is good; large and red is not! It's important to look for clusters of colour rather than concentrating on individual squares because when we're looking at only one team, the number of passes included is lower.

Teams are attacking the goal on the right and are listed in order of current league position. Yes, I picked top seven because everybody wants to see how the Man United one looks.

Mixed approach with occasional long passes from deep. Larger number of incomplete passes from wide on the right.

High success rates with close, central passes and very rarely played long from deep. Significant volume of passes from advanced wide positions, but with low success rates.

Manchester City
Varied approach with good success rates from almost all areas.

Mixed approach with low volume of passes from very wide touchline positions. Attacks from right wing weaker than left.

Tottenham Hotspur
Greater success rates through the centre than from either wing, but high volumes of unsuccessful passes played from advanced and wide.

The Leighton Baines effect. High volume of passes from wide left but with low completion rates. Passes from advanced right also with low completion. Very few attempts through the centre and occasional long balls from deep.

Manchester United
Some approaches through the centre but attacks weighted towards wings. High volume of longer diagonal balls from the right, with low success rates.

Thursday, 13 February 2014

How can an attacking team get close enough to expect a goal?

There's been some great work done in football analytics recently, looking at a team's scoring chances from different positions on the pitch, which has led to the calculation of various Expected Goals (ExpG) metrics. However it's calculated, in essence ExpG gives a player's chance of scoring from a shot, given his position on the pitch. Add up the probabilities for a group of shots and you can work out how many goals a team 'should' have scored from them. Have a look at Statsbomb if you'd like to read up on what's been available up to now.
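The "add up the probabilities" step is simple enough to sketch in a few lines of Python. The shot probabilities below are invented for illustration and don't come from any real ExpG model.

```python
def expected_goals(shot_probabilities):
    """Sum per-shot scoring probabilities to get the goals a team 'should' have scored."""
    return sum(shot_probabilities)

# Hypothetical chances of scoring from five shots in one game.
shots = [0.30, 0.08, 0.05, 0.12, 0.03]
print(f"Expected goals from {len(shots)} shots: {expected_goals(shots):.2f}")
```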

I've managed to assemble a decent sized database of pass and shot locations from across the first half of the 2013-14 Premier League season and wanted to see if I could take Expected Goals a step further. As an indicator of shot success, Expected Goals typically paints a picture of the penalty area, with the six yard box as a hotspot and becoming colder the further out you move from goal. To a certain extent, its outputs are relatively obvious; if you shoot from closer in, you have a higher chance of scoring and shots from further out are less likely to be converted.

That's not to say Expected Goals isn't a useful metric - far from it - but it doesn't do a great deal for our understanding of how to create goals. We can quantify how much better it is to shoot from closer to the goal, but how do you get closer to the goal in the first place? If your attacks break down trying to reach the shot conversion hotspot, should you even try to get there, or just take your chances from range?

A couple of days ago, I tweeted an image of pass completion data, which we'll be building on in this post.

Pass success rate by destination

The image shows the probability of completing a pass into different areas of the pitch. We're not worried about where the ball is coming from for the moment, but are looking at the chances of passes into different areas being successful.

It's clear to see how - playing from left to right - passing accuracy starts to break down in the opposition half and then drops dramatically at the boundaries of their penalty area.

Even with half a season's worth of passes and shots, we're going to struggle with the number of data points available as this analysis progresses, so let's merge the granularity of that first image into some larger pitch areas.

Pass success rate by destination

We now have a picture of how difficult it is to pass into each area of a football pitch. What about shots?

From the same dataset, here's an average player's probability of scoring with shots from different pitch locations. Penalties are excluded and I've hidden squares with fewer than twenty shots to clean the data up a little.

Shot conversion rate by shot location

As a manager, you're on the horns of a dilemma. Scoring probability climbs to over 30% in the centre of the six yard box, but your chances of passing the ball into that location are slim.

What if we combine the two visualisations?

Pass success rate multiplied by scoring probability, gives an indication of the likely success of an attacking strategy. Pass to an easier area outside the box and shoot from there? Or attempt to work the ball closer, at the risk of losing possession?

Pass success probability * shot conversion rate

It turns out to be far from a clear-cut choice. There's a relatively large area, stretching from the edge of the six yard box to well outside the area, where the combined chance of passing the ball into that area and then scoring from it is quite evenly balanced at 2-3%. It's not as simple as 'closer to the goal is better' and the balance in any one game is almost certainly dependent on the passing quality of the individual teams and how well their opponents defend.
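The combination itself is just one probability multiplied by the other, zone by zone. Here is a minimal sketch with made-up zone names and rates, picked only to show how a hard-to-reach zone and an easy-to-reach zone can end up with similar overall value. They are not figures from the dataset.

```python
def attack_value(pass_success, shot_conversion):
    """Chance of completing a pass into a zone AND scoring with the resulting shot."""
    return pass_success * shot_conversion

# Hypothetical zones: (pass success rate into zone, shot conversion rate from zone).
zones = {
    "six yard box": (0.10, 0.30),
    "penalty spot": (0.20, 0.15),
    "edge of area": (0.40, 0.06),
    "long range": (0.70, 0.02),
}

combined = {name: attack_value(p, s) for name, (p, s) in zones.items()}
for name, value in combined.items():
    print(f"{name:<13} {value:.1%}")
```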

If we box out that 2-3% conversion area, we can move the analysis on another step.

Pass success probability * shot conversion rate

How should a team attempt to move the ball into that boxed-out shooting zone? There are three broad choices: Directly from the direction of the centre circle, diagonally, or from the wings.

David Moyes has come in for a lot of criticism this week following Manchester United's draw with Fulham, where his players hit over eighty crosses in ninety minutes. We should be able to show here whether crossing, or a direct approach, is the more successful strategy.

Probability of achieving a successful pass into shooting area

Note that I've changed the colour scale on the above image to peak at 75% rather than 100%, since the average success rate of these passes is lower than when considering the whole pitch. Squares are only shown if they've been the origin of at least twenty passes.
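For anybody wanting to try something similar, the "at least twenty passes" filter might look like the sketch below. The data layout (a grid square paired with a completed flag) and the function name are assumptions for illustration, not the format of the original pass data.

```python
from collections import defaultdict

def success_rate_by_square(passes, min_passes=20):
    """Pass success rate per origin square, hiding squares with too few passes."""
    counts = defaultdict(lambda: [0, 0])  # square -> [attempts, completions]
    for square, completed in passes:
        counts[square][0] += 1
        counts[square][1] += int(completed)
    return {
        square: done / attempts
        for square, (attempts, done) in counts.items()
        if attempts >= min_passes
    }

# Hypothetical passes: square (10, 3) has 25 attempts, square (11, 0) only 3.
passes = [((10, 3), True)] * 15 + [((10, 3), False)] * 10 + [((11, 0), True)] * 3
rates = success_rate_by_square(passes)
print(rates)
```

The square with only three passes shows a perfect record, which is exactly the kind of noise the threshold is there to hide.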

Once you move beyond the eighteen yard line, pass success probability drops off quickly. Touchline crosses from a 'chalk on his boots' classic winger have success rates as low as 30%. Other things being equal, the best chance of passing the ball into our key zone comes from a direct, or diagonal move.

If you're thinking "but that's not fair, most of the passes included here will be targeted at locations outside the box", then you're right. Let's tighten up our key shooting zone, to a central area of the eighteen yard box surrounding the penalty spot.

Probability of achieving a successful pass into close shooting area

Still want to hit crosses all day?

The probability of a pass from the wings finding a team mate in the shooting zone is 30-40%, while moving through the central area has a success rate of 40-50%.

This isn't the end of the story, but it's where I'll stop for now. There are many more factors to be considered, including absolute volume of passes and the fact that a successful pass isn't the same as creating a shooting chance. This analysis will provide a base to work from though and one that I'd like to extend next into different types of teams.

Ultimately, I hope that this type of analysis could answer questions such as...

Should teams with worse passing shoot more often from long range? And vice versa, where is the optimal shooting area for a team that passes with a very high success rate?

How do optimal strategies change, based on specific opponents?

(using significantly more data) Can we identify hotspots where passes into the shooting zone have higher success rates? Versus specific opponents? When specific defenders are on the pitch?

Eventually, I believe an approach like this might be able to identify defensive weaknesses in a specific team and optimal attack strategies for their opponents.

Friday, 7 February 2014

24,000 tweets about #Sochi

Who's excited about the Winter Olympics? Happy about the games? Angry about their location?

Let's find out...

Searching Twitter for #Sochi yields 29,800 individual tweets.

Running those through TextBlob yields 24,000 tweets that can be analysed for sentiment - positive, negative, or neutral*.
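TextBlob gives each piece of text a polarity score between -1 and 1 (via `TextBlob(text).sentiment.polarity`), which then needs turning into a label. A minimal sketch of that last step follows. The 0.1 threshold and the function name are assumptions, not necessarily what was used for these maps, and the example takes plain numbers so that it runs without TextBlob installed.

```python
def label_sentiment(polarity, threshold=0.1):
    """Turn a polarity score in [-1, 1] into positive, negative or neutral."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Hypothetical polarity scores, as a sentiment library might return them.
for score in (0.65, 0.02, -0.4):
    print(f"{score:+.2f} -> {label_sentiment(score)}")
```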

And throwing the whole lot at Google Fusion Tables lets us map them.

Here they all are. Blue for neutral, green for positive and red for negative.

Or for bigger, go here.

Just the happy people?

And just the angry people.

That was fun.

Thanks to some brilliant people who make brilliant tools; Google for Fusion Tables, and the development teams behind TextBlob and Tweepy for their Python modules.

* Please note that automated sentiment analysis is far from perfect. Especially the way I've implemented it.

The three rules of business data visualisation

I love data visualisation; sometimes just for its own sake, but mostly when it makes life easier.

The Earth Wind Map is an example of the former. It's hypnotically beautiful.

This type of data visualisation isn't so good in business though, except to use as marketing material. If you want to build a stunning animation of your customers' behaviour to put on a big screen in the office, that's great, but watching it for five minutes every Monday morning is unlikely to help you identify problems with your website. If we want to gawp at something beautiful, we call up the Earth Wind Map; if we want to know whether to take an umbrella tomorrow, we go for a simpler forecast.

In business - and I count non-traditional businesses like sport within this too - data visualisation has two main purposes.
  • To help you understand the best strategy to adopt.
  • To get you to that strategy faster than you otherwise could have.

In order to achieve those ends, I work to three simple rules when visualising business data. The ideal business report, (visualisation, dashboard, call it whatever you like), should achieve these three things as quickly and as simply as possible. The higher up the management chain the report's audience, the simpler it needs to be and the more 'added extras' become a distraction rather than useful additions.

I'm not dismissing visualisations for inspiration, or for investigation, but in business the aim of communicating data is to make the right decision and to make it quickly. This is what reports are for and so I try to design reports to communicate these three things.

1. Where am I right now?

For the metrics that you know are important (you have identified those metrics, haven't you?) Where are they right now? This could mean yesterday, a total for the past seven days, a summary of the last fixture your sports team played, or any other - relatively short - time period that works for the business.

It's absolutely vital that you don't get carried away with which metrics you visualise here. I've written before:

"As analysts, we're often the ones selling dashboards, so let's be honest about what they do well. They show data. So to be useful, you have to be someone who needs to see that data - and I mean really needs to see it. Just the number. Not why the number, or where it came from, or what you might want to do about it."

Only visualise metrics where you fully understand what they mean and know at least some of the levers that you can pull to make them change. If sales drop, you know what that means. If some single number that's a complex blend of customer values, retention, acquisition, marketing ROI and God knows what else changes, what are you going to do about it? Simplicity is good. It's also much harder than complexity.

To divert into my football analytics sideline for a moment, this is why I'm not a big fan of numbers like PDO. The definition is complicated, the name is confusing and as a manager it's hard to know what to do about it, when it's not where you'd like it to be.

That's not to say there shouldn't be complicated metrics (for example to use as predictive tools) but I don't want them on my management visualisation.

Very often, the best way to communicate some simple KPI numbers is a simple table. Who says a data visualisation can't be 'just' a table? In the right place, tables are awesome.

Here's a visualisation of website metrics, that will work well provided you already know a bit about your website.

2. Is that good?

So now I know how much I sold last week and how much traffic we got to the website. But is that good? Put each number in context.

Context can mean a comparison with the past, or with a fixed target, or even vs. key competitors.

It doesn't matter how you do this - colour coding, text flags, Harvey Balls - as long as it communicates quickly and clearly. Personally, I'm quite partial to an old school traffic light, if only because even in the marketing industry, it's hard to find somebody who can get 'green is good' wrong.

Our weekly table of web traffic stats gains week-on-week or year-on-year comparisons and a set of traffic lights. Now you can instantly see if any of these numbers need attention.
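A traffic light rule can be as simple as the sketch below. The 5% amber band and the assumption that higher is always better are illustrative choices; a real report would set them metric by metric.

```python
def traffic_light(current, previous, amber_band=0.05):
    """Colour a metric by its change versus a comparison period (higher = better)."""
    change = (current - previous) / previous
    if change >= 0:
        return "green"
    if change > -amber_band:
        return "amber"
    return "red"

# Hypothetical weekly figures: (this week, same week last year).
metrics = {"visits": (11000, 10000), "orders": (485, 500), "revenue": (16000, 20000)}
for name, (now, before) in metrics.items():
    print(f"{name:<8} {now:>6} vs {before:>6}: {traffic_light(now, before)}")
```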

3. Is it changing?

The last piece of the puzzle is to know if your metrics (otherwise known as KPIs - this isn't revolutionary stuff!) are changing.

Part one told us where we are.

Part two told us if where we are is good.

Part three tells us if the position is becoming better, or worse.

This section is where things can get overcomplicated if you're not careful. If you've got eight KPIs and you want to show a twelve week trend, then you've now got ninety six numbers to communicate. Tables just became a really bad idea.

Sparklines however, are fabulous.

Sparklines are mini charts designed only to communicate spikes and trends in data. All this section of the report is designed to do, is to give a manager a quick visual representation of 'going up', or 'going down' and how fast.

Our website report gains a set of twelve week sparklines and we're done. In one small report, we can see at a glance where we are, whether that's good and whether it's getting better, or worse.
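Sparklines don't need a charting package. As a quick sketch, the Unicode block characters will draw one in plain text; the twelve weekly figures here are invented.

```python
BARS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Draw a one-line text chart, scaled between the smallest and largest value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BARS[0] * len(values)  # flat series: nothing to scale
    scale = (len(BARS) - 1) / (hi - lo)
    return "".join(BARS[round((v - lo) * scale)] for v in values)

# Hypothetical twelve weeks of website visits.
weekly_visits = [980, 1010, 1005, 1100, 1180, 1150, 1240, 1300, 1290, 1350, 1420, 1500]
print(sparkline(weekly_visits))
```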

I love data visualisation, but in business, we need to drop the pretty pictures and understand why we're visualising in the first place. Infographics are awful for communication. They're actually worse than writing down your report as long-hand text.

If a business visualisation doesn't help you to understand the best strategy to adopt and do it faster than a table of numbers would, then it's not worth having. Build infographics (if you must), learn D3 and build beautiful animations, but recognise that they're marketing collateral, not serious business tools.

In business, we need to know where we are, if that's good and if it's getting better or worse. The longer it takes to communicate that, the further behind your competitors you'll be.

Monday, 20 January 2014

Betting using my model of Premier League football

I've been getting some questions over the past few weeks about the betting calls I make using my EPL model, so this post will explain how the betting choices work. If you just like to see who the model thinks will win each week then maybe skip this one, but if you're one of the people who's been looking at the calls and thinking, "What? He can't do that!" then this should help to explain my methodology. This is also going to be more than usually geeky so Wallpapering Fog felt like a better home for it than the EPL Index site.

If you're thinking "what EPL model?", have a look on

First, a bit of history on where the model came from. That journey is how we got to here...

I've said before that this model wasn't originally built as a tool for betting and it's true. I first found last season (back when you could access all of their Opta stats for a few pounds a month), subscribed to the Stats Centre and built the model mainly to see if it would work. I had a vague thought that if it did work, then it could be interesting for a football club to use to forecast match results based on picking different players, but also assumed that the bigger clubs would already have sophisticated models of their own to do this type of work.

The model churned out a set of results for the first half of the 2012/13 season and I needed something to compare them with. Was my model any good? Bookmakers' odds are an obvious place to look for alternative results predictions, with easily accessed historical data available ( if you're looking.)

That first version of the model didn't quite equal the bookmakers, in terms of the results that it said were most likely to happen, actually happening. The bookies' favourites won games slightly more often than the model's predicted most likely outcomes.

Despite this, the model was projected by that analysis to make a small return if you used it to bet. The model didn't say the bookies' favourites would win all of the time, so picked up some wins at decent odds. Bookmakers also almost never say that a draw is the most likely outcome of a game and if you backed a draw when the model said its likelihood was over 25%, you made a healthy return.

I started to predict results on Wallpapering Fog ahead of the games being played.

For betting, the rules were simple. Back a draw if the draw likelihood was over 25%, otherwise back whoever the model said was most likely to win. That's backing winners with no regard whatsoever to the market odds on that game. You could be backing a long shot that the model likes a lot, or backing a very short odds favourite that the model gives only a 40% chance of winning. For draws, the odds are usually around 3.5 but again, I was paying them no attention when picking the bets.
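Written as code, the rule is only a few lines. The function name and the handling of an exact tie between home and away (backing the home side) are assumptions made for this sketch.

```python
def pick_bet(home, draw, away, draw_line=0.25):
    """Back the draw if its likelihood beats the line, otherwise the likelier winner.

    Market odds are deliberately ignored, as described above.
    """
    if draw > draw_line:
        return "draw"
    return "home" if home >= away else "away"

# Hypothetical model likelihoods for three fixtures: (home win, draw, away win).
for probs in [(0.45, 0.28, 0.27), (0.40, 0.24, 0.36), (0.30, 0.20, 0.50)]:
    print(probs, "->", pick_bet(*probs))
```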

This method has periodically upset more seasoned gamblers, who point out that you shouldn't make picks like that. I do understand why not and I'll come back to it. Please bear with me.

The method arises as a result of having a primary objective for the model of calling as many results correctly as possible, rather than trying to maximise betting profits. This objective is also why I've never looked at the potential returns from using my model to call correct scores, or accumulators, or both teams to score.

It works like this:

1. Get as many results right as possible.

2. See if the strategy that achieves point 1, also makes money.

It did make a profit last season and is winning this season too, so that 'most likely outcome' method isn't as naive as it might look.

For any readers who aren't seasoned gamblers, the issue with backing the most likely outcome regardless of what odds the bookies are offering, is that you could be backing a result you think is a very close call, when the bookies are offering only a poor return if you're right.

If I flip a coin then you know the chance of it coming up heads is 50%. If I offer you odds of 1.5 on a bet on heads (£5 profit if you bet £10), you'd be mad to take it. You might win once, but in the long term, you're guaranteed to lose.
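The coin flip arithmetic works out as below, using decimal odds where the profit on a win is the stake multiplied by (odds - 1). The function name is just for illustration.

```python
def expected_profit(win_probability, decimal_odds, stake=10.0):
    """Average profit per bet over the long run, at decimal odds."""
    profit_if_win = stake * (decimal_odds - 1)
    return win_probability * profit_if_win - (1 - win_probability) * stake

# The coin flip: 50% chance of winning 5 pounds, 50% chance of losing 10.
print(expected_profit(0.5, 1.5))
# Fair odds on a coin flip are 2.0, where the bet breaks even.
print(expected_profit(0.5, 2.0))
```

At odds of 1.5 the bet loses £2.50 on average for every £10 staked, which is why the price matters as much as the prediction.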

It's time to share some data... If you run the latest version of my model over the first 200 fixtures of the 2013/14 season, betting £10 on the predicted most likely result of each game, or on a draw if the predicted chances are over 27% (it's gone up a little from a 25% draw line since that first version) then here's what happens.

Important note: The data I'm using here to populate the simulation is the data that we had after week 20 had been played. I also know the exact starting line-ups for each of these games, which I won't when I post on a Friday ahead of a weekend's fixtures.

This is very much a best case performance. The model's good. But it's not quite this good.

So betting on the most likely result, regardless of market odds, seems to work. Part of the reason for this is that we're imposing quite a harsh line before an upset is picked as a bet. In its raw results, the model predicts too many upsets, so rather than just saying it has to like the underdog more than the bookies do, we have a rule that it must like the underdog enough to actually return a prediction that they will win the game.

Very probably a better gambling strategy would be to avoid betting on certain fixtures at all, but we come back to my bullet points above; I'm forcing myself to give a prediction for every game. There is also very likely a better gambling strategy to be found in this model, but I like the simplicity of betting on the predicted winner. It works.

If you'd like to come up with your own strategy, I've put a link to all of the data behind the first 200 games of this season at the end of this post.

Let's have a look at what happens with an alternative strategy of backing 'value'. What happens when we bet on whichever of the three results (home win, away win, or draw) gives the biggest difference between the model's simulated likelihoods and the bookies' odds? If the model's got an 'edge', then this should work.
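One way to sketch a 'value' pick is to convert decimal odds into an implied probability (1 divided by the odds) and back whichever result the model rates furthest above the bookies. Measuring the difference this way is an assumption for illustration, as are the likelihoods and odds in the example.

```python
def value_pick(model_probs, decimal_odds):
    """Pick the outcome where the model's likelihood most exceeds the implied one."""
    edges = {
        outcome: model_probs[outcome] - 1 / decimal_odds[outcome]
        for outcome in model_probs
    }
    best = max(edges, key=edges.get)
    return best, edges

# Hypothetical fixture: the model favours the home side, but the value is elsewhere.
model = {"home": 0.50, "draw": 0.27, "away": 0.23}
odds = {"home": 1.80, "draw": 3.60, "away": 5.50}

pick, edges = value_pick(model, odds)
print(pick, {k: round(v, 3) for k, v in edges.items()})
```

In this made-up fixture the 'most likely result' rule backs the home win while the value rule backs the away side at long odds, which is where the extra volatility comes from.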

The 'value' strategy's cumulative profit is in red below, with my usual method remaining in blue.

So the value strategy is also predicted to work, but returns are more volatile, as you'd expect since you're backing more long-odds results. Using the value strategy, you also win 38% of bets, rather than the 56% you're predicted to win by backing the most likely result. Both strategies should work (provided you don't mix-and-match between them) but the 'most likely result' is less risky in terms of long, bad runs.

To recap, the strategy I'm currently following arises from:

1. A self imposed rule that I must bet on every game and stake the same amount on every bet.

2. There is a benefit of moderating the model, so an upset must be predicted as being very likely, before we back it.

3. Evidence (the above, plus last season and this season so far) that backing the most likely predicted result is effective.

If you'd like to dive into the data, see where these numbers come from and pick your own strategy based on the EPL Model's calls, it's all here.