Monday, 9 February 2015

How to do football analysis in Tableau | Part 1

I'm back from a hugely enjoyable OptaPro Forum last week and buzzing with new ideas. It was a fantastic day with some great presentations and it was brilliant to finally put faces to a few analysts' names (and Twitter handles!)

One question came up often enough at the forum to make me think that there might be an audience for this post. How do you use Tableau for football analysis...?

Want to know a secret? Tableau's easy. And it's free. There's really no need to restrict yourself to the purgatory of Excel's chart gallery. This post will cover a few basics and then if there's interest, I might do a follow up that shows how to do some more advanced visuals with X,Y pitch coordinate data.

I'm going to keep this post  (actually, since the length blew up as it was being written, these three posts) at a pretty high level: Where to get data, how to connect to it, how to draw some basic views and then make them interactive.

The guide will assume that you can handle yourself with the basics of copying and pasting and inserting columns in Excel and that "install this software; here's a link" isn't too taxing! In general, I'll try to take things step-by-step, with screenshots so you can see where we're going.

First things first, we're going to need a copy of Tableau. It comes in professional and free versions and unless you've got a grand to spare, you're going to want the free "Public" version. Tableau Public is almost fully featured, with just a few restrictions on the types of data you can load and how you save and export your dashboards (more on this later).

If you haven't already, download and install the latest version of Tableau Public from here.

Next, we'll need some data. Very often acquiring data and cleaning it up is the biggest barrier to football analysis and this post isn't going to cover the various sources you could use. Once you get into very large volumes of data, you start to need programming skills in a tool like Python or R and this is only supposed to be a basic intro! We need a small, simple dataset and WhoScored can help us out.

Let's build some visuals to compare Liverpool and Everton this season.

TL:DR. If you don't care about this bit, you can download the mini dataset that we'll use in Tableau and go straight to Part 2.

Go to the Liverpool page on WhoScored and you'll find a table that looks like this.

It's a nice little dataset with player statistics for this season, but it's not doing us any good stuck on the website - we need to extract that data.

There are a few automated tools that are worth trying for data extraction. is new and is shaping up to be a really great tool, but unfortunately it doesn't pick up the data that we're after when it scans WhoScored's web page. Excel's load data from web feature doesn't either.

We're going to have to do this the old fashioned way.

Click and drag to select everything from the little 'R' in the black title bar at the top of the table, to the bottom right hand corner of the data. Make sure you've got exactly that selected and then Control-C to copy.

(There are probably programmers yelling at their monitors right now. I'm not covering web scraping in this post!)

In Excel, hit paste (Control-V) in cell A1 and you get...

Now go and get the Everton data and do exactly the same thing, pasting it onto a different sheet.

Finally, copy the Everton data from Excel excluding the black titles row and paste it just below the Liverpool data. If you don't skip out the titles row, you'll get column titles mixed up with your player data in Tableau - we only need titles at the top of our table!

As a general rule, Tableau likes data to be in lists. You want your data to be listed downwards, not across the page as loads of columns. WhoScored's data is a list of players with facts about them and that's ideal - a separate set of columns for every player, or data scattered across different sheets won't work.

You can get rid of the Everton data sheet now, you don't need it any more. Give the worksheet with both teams on it a sensible name and save your workbook.

Now we're going to clean the data up a bit. We don't need all of the formatting, so highlight everything, find 'Clear' on the Home Menu (it's got a little pink eraser icon) and choose Clear Formats.

Simple black and white text. Proper analytics.

You could load your list of data into Tableau straight away, but it's still a little messy. Column B has got nothing in it, so highlight that whole column and delete it (Delete icon on the Home menu). The remaining data will move in from the right to fill the gap.

Those "-" signs indicating no data also aren't great for Tableau. Text saying "-" isn't no data, it's a cell with "-" in it. Find and replace "-" with nothing (Control-H) to get rid of them.

The last job in cleaning up is to sort out our player names. If you look in cell B2, you'll see that player names are currently mixed up with their ages and positions. The technical term for this is, "a pain in the arse".

Insert some columns to create a bit of space, starting from column C. Five columns should do it.

Now your data looks like this...

Highlight everything in column B and choose Text to Columns from the Data menu.

Choose "Delimited", tick "Comma" and click finish.

The positions data gets split into column C and because some players can play in a few positions, you'll get data in column D and maybe column E too. For this little project, we don't care about the extra positions, so delete columns D and E. Don't just clear the data out of them, delete them so that everything moves left and your data looks like this.

The playing positions are sorted, but our player's names are still mixed up with their ages. We can split these with a quick text formula.

Age is always a two digit number at the end of the name. Assuming no players are over 99 or under 10 years old, putting this formula in cell D2 will grab the player's age.

"Right" gets the two characters at the end of the text string and "Value" tells Excel to see the result as a number, not as text (e.g. 20, not "20").

The player's name is everything except the last two characters. Put this in cell E2 to grab that bit.

Copy and paste those two new formulas downwards, to fill up the table with names and ages.

You could have waited until Tableau to make these splits as it has exactly the same LEFT, RIGHT and LEN formulas as Excel, but I want you to be able to see the raw data table before we load it. Once you're in Tableau, the data table sits in the background and as a beginner it can sometimes be a little tricky to visualise what your formulas have actually done.

Nearly finished, I promise. If you look at your table of data, you'll notice that there's not a way to know which team each player is on. We've got a record at the top for Raheem Sterling, but it doesn't say "Liverpool" anywhere. That will limit us in Tableau because we won't be able to easily compare teams if that information isn't in our dataset.

The data in column A, labelled "R" isn't useful for anything, so let's put team names there instead. Select cells in column A down to the last Liverpool player, type "Liverpool" and hit Control-Enter to fill all of those cells. Then do the same for Everton and re-title column A as "Team".

The very last job is to add some column titles for the age, name and position columns you've created. Save your workbook and then if it looks like this, we're ready to hit Tableau in Part 2.