Monday, 23 March 2015

In praise of modesty

One of these people invented the World Wide Web.

The other talks about it.

One of these people is the best footballer in the world.

He's not the one in the gold shoes.

More of this please.

Monday, 16 March 2015

The Hitchhiker's Guide to the Solar System | Tableau Iron Viz Entry

So this year, I couldn't resist. Tableau set the challenge to build something - anything - with Wikipedia data and they've put prizes on the line. It's enough to tweak the competitive streak in any analyst...

But what to build? I started out with Football League data as it's familiar and I do a fair bit with those numbers already, but then @ChrisLuv did something really nice along the same lines and I got a bit bored scraping even more football results. It felt like time for a short break from football analysis and it would be nice to learn something new during the build.

I'm not sure where the Solar System idea that I've gone with came from - possibly a friend complaining recently on Facebook that his daughter keeps attacking his ship in Elite: Dangerous - but a quick browse on Wikipedia turned up nice, consistent datasets for planets and moons and some really lovely imagery. Definitely dashboardable.

I've been trying recently to reduce the numbers of screens in my Tableau work, because experience is showing that users can sometimes be reluctant to move around a multi-screen dashboard. If the landing page does a good job, you've got a good chance of converting new users to your interface, but if it requires more clicks to get to data, then things become much less smooth.

So, the task I set myself was a one page dashboard to explore the Solar System, and with enough interesting features to be worthwhile entering into the Iron Viz contest (I hope! That's not an intimidating task, honest).

Building on games like Elite, I also wanted it to look and feel something like the in-flight navigation computer in a space exploration game.

There's a bit of how-to guidance in the rest of this post, but first, here's the Hitchhiker's Guide to the Solar System...

 Hitchhiker's Guide to the Solar System

Navigation's a straightforward task of clicking the planet you want to see at the bottom of the screen and then you can also browse that planet's moons on the right-hand side.

If you really want to pull apart its inner workings, then download the workbook from Tableau Public and have a play. Here's a quick top-line guide to get things started.

Look and feel

Most of setting up the look and feel of the dashboard involved turning off features in Tableau. No axis lines, no grid lines, no titles for the dashboard elements, no tooltips and no row banding in tables. Everything was switched off via the Format menu and the background set to black everywhere.

If you set up one sheet the way you want it, then you can copy that sheet to make more without doing all of the formatting again. (P.S. Please can we have templates, Tableau? Pretty please?)

A green web-safe "Lucida Console" font on top of the black background gives a nice flight-computer feel.

Select a Planet

This is pretty easy once you know how Tableau custom shapes work.

I downloaded a high quality image (more on image quality later) of each planet from Wikipedia, saved them to my local Tableau shapes folder and then mapped them to the planet names.

The selector is a regular chart, with the axis set to logged distances from the Sun, planet images as shapes and planet sizes defined by their radius. Dashboard actions using that chart as a source, control the rest of the vis.

And a couple of extra tricks to tidy things up...

  • The planet labels are on the second axis, so that they all display at the same height rather than each one hugging close to its own icon
  • You can't use planet radius data exactly as it is to set the sizes of shapes, because Saturn has rings and so the planet itself ends up being too small. I created an extra field to boost Saturn's size up to where it should be.

Large image of selected planet

This is done the same way as mapping shapes onto the 'Select a Planet' chart, right? Unfortunately not. You can do it that way, but it looks a bit rubbish, because Tableau compresses shapes to speed things up. If you blow them up too large, they look like this.

It's not totally unacceptable, but seems a real shame to lose those gorgeous high-resolution planet images from Wikipedia.

The trick to getting high resolution images that change on a filter, is to use Map - Background Images. You make a 1x1 scatter chart, with the planet image set to show as background, depending on what's selected on the filter. Don't forget to make the single data point on the chart transparent!

Here are the settings:

The 'target' surrounding the planet is just a floating png image with a transparent background. I drew the four corners in PowerPoint and exported them as a picture. As much as PowerPoint is the work of the Devil, it's a really quick, easy way to create simple coloured lines and add them to your dashboard!

Planet data

This one's easy. It's a straightforward table with a black background and all of the lines and other formatting turned off.

The little benchmarking chart to the right of the table is designed to provide some context. When you look at a statistic like 24.79m/s2 for Jupiter's gravity, is that a lot compared to the other planets?

The chart itself is a Gantt bar, with a dashboard action that highlights the currently selected planet and lets you quickly see how it compares to its neighbours. Using a percentage table calculation instead of absolute values lets us put all the planets on the same mini-chart and stops the ticks for huge "Mass" numbers, compressing everything else down towards zero.

Select a moon, to see data and image

We've seen these techniques before - it's exactly the same as the large planet image and planet data table. I used a simpler chart - without custom shapes - as the selector because some moons are very small and others have very similar orbits, so they can cluster closely together.

Gustav Holst. Just because we can.

This one's a bit of a cheat in that it's not data from Wikipedia, but I just couldn't resist.

Classical music and space games and movies go hand-in-hand, so I used a URL action to call Grooveshark's embedded music player with tracks from Gustav Holst's Planets Suite.

It's not really hard to do - click share on a Grooveshark track and pull the link out of the embed code that it generates. Add that link to your dataset.

Then drop an empty Web Page element onto your dashboard and create a URL action that calls the field containing the link. Tableau will automatically throw the link at the empty Web Page element when you make a selection.

Edit: I previously had some discussion here of issues with Tableau Public, when you try to embed a link to an http address. Thanks to @ChrisLuv you can safely ignore all of that and only need to know that you have to use an https web address for whatever you link to, in order for it to work properly in Tableau Public. Just switch "http" for "https" in whatever you link to and you should be fine.

There's a little warning message hidden underneath the Grooveshark player link, just in case the web browser that's viewing doesn't have flash enabled and so doesn't load the player, but most people should never see it.

I also put that warning text in a calculated field, that hides it when Earth is selected. Holst took inspiration for The Planets Suite from astrology and so there isn't a track for Earth. Wikipedia says so. See, I used Wikipedia data on Holst. It wasn't cheating after all!

And we're done. Phew. Fingers crossed, it's time to press submit on the entry...

Wednesday, 11 March 2015

My data analysis toolkit

Growing out of posts like "Losing touch... Or why Excel and VBA won't cut it any more" and "How to do football analysis in Tableau", I've been asked a steady trickle of questions this year about what analytical software I use.

One of the biggest discoveries I've made as I branched out from Excel and VBA is that there isn't a right answer to what software you should use. There are loads of programming languages, loads of dashboard solutions, loads of databases and you can't possibly get experienced with all of them.

It's better to find yourself a set of tools that work and to know those tools well, than to have bits of knowledge all over the place, but not be using any of your kit to its full potential. If your data size isn't measured in terabytes, then you don't need to be right on the bleeding edge.

For outside of work projects, I've got a couple of machines that aren't anything fancy. An older Core i5 laptop with 4GB ram and a Core i7 laptop with 16GB ram. A regular laptop like that, costing between £500 and £1000 is more than enough for chucking around datasets up to tens of millions of rows and analysing them.

On to the software... As I said, there isn't a right answer to what you should use, but I hope this post might be useful to a few people as a starting point. If you've broken Excel and if Access has you tearing your hair out then read on. Opening Microsoft Access at all, is a very strong signal that you need to get acquainted with some of the tools on this list.

And as an added bonus, almost everything I'm going to talk about is free!

Question: Can I get by without Microsoft Windows?

Answer: Yes! You need Linux Mint.

Mint is a very Windows-like desktop environment and since the Windows 8 (Metro) car crash, it's arguably more Windows-like than Windows is.

Easy to install and easy to use, I wrote a short intro to Mint, a couple of years ago.

Be warned, that if you use Linux for anything more than a bit of web browsing, you are going to end up using the command prompt and Googling to fix broken things. But then you'll be doing that with the DOS prompt in Windows too, because a lot of data analysis software is designed around Linux and needs persuading to work properly with Windows.

If you've got an older machine lying about, stick Mint on it. You might be surprised at how good it is.

Question: Which database?

Answer: MySQL Community edition

Yes, there are newer, fancier Big Data technologies out there and I'll learn them at some point, but I want the SQL language I know, in a fast, familiar, free package and MySQL does that.

On Linux, you might have to do a bit of reading to get it to work properly (watch that you need to install the Server and Client packages), but it powers half the internet - including Twitter - and isn't too hard if you're patient and don't mind Googling an error message or two.

On Windows, make sure you download the MySQL Installer. Don't try to work out what packages you want and install them individually. You will inevitably bugger it up.

Once you've installed MySQL, if you're using it for data analysis rather than to power a small website, you MUST customise my.ini (Windows) or my.cnf (Linux), or it will run like treacle. Out of the box, MySQL is designed to run on really low-powered hardware and its memory usage settings are turned way down. Have a Google.

To go with your MySQL Server, you'll want MySQL Workbench, which you can use to write queries and maintain your database. It's packaged up in the Windows Installer that I mentioned, or you can install it separately on Linux.

Don't know any SQL? You should. Start here.

Question: Which programming language?

Answer: I use Python and I like it a lot.

If you're making the switch from VBA Macros, or taking your first steps in programming, then Python's a powerful, approachable place to start.

On Windows, I'd recommend installing ActiveState's Python package, because it handles all the set up for for you and prevents you from getting into a situation where you're pretty sure you installed Python but it just doesn't work.

On Linux, you already have Python installed and you just need a nice piece of editing software to write your code in.

On Windows and Linux, for writing code I've recently adopted PyCharm following a recommendation from @penaltyblog and it's the most straightforward editor I've seen that still has all the features you'll want as you progress.

It might feel a bit weird coming from Excel Macros that the editor you use to write code and the programming language itself are separate things, but that's Open Source for you - it gives you choices. MySQL is the same; you don't have to use MySQL Workbench to talk to your database if you don't want to and there are loads of other choices, but Workbench is an option that works and that's what this list is all about.

No idea how to use Python? Start here.

Question: What do you use for statistical analysis?

Answer: I use R.

R is free and tremendously powerful. It's a proper, highly capable programming language for working with data and the output you can produce using it is amazing.

Unfortunately, it's also got a learning curve that's close to vertical. Your first steps after working through tutorial examples, will not be easy. I'm definitely not an expert, but I'm improving...

The first step with R is to install R Studio, which makes life much easier. Don't try to install and use R without R Studio - it's unnecessarily painful.

The second step is to have a look at a beginners tutorial.

And when you get stuck, have a look on Stack Overflow. R errors can be really awkward things to Google, partly because it's just a single letter and partly because its user forums are horrible. Stack Overflow and R Bloggers will already have answers to most problems you'll hit and if they don't, you've probably got the wrong end of the stick and are asking the wrong question.

Ask a question on Stack Overflow at your own risk. Chances are somebody else has asked the question before and you could have found it by searching. The bar on what is a good question asked in the right way, is also set quite high!

As an aside, if you don't know R or Python, then they are interchangeable for a lot of work, such as web scraping. R is a statistics package with a lot of general programming capability and Python is a programming language with some really good statistics packages. You could try to stick with just one of them.

Question: And for graphs and data visualisation?

Answer: Tableau Public. And R again.

For a basic grounding in Tableau, try my  "How to do football analysis in Tableau" guide. Tableau's a fabulous piece of software and a brilliant way to get interactive charts and tables onto the web. It's also my first stop for visually interrogating a new dataset and seeing what it contains.

Tableau Public - which is free - is limited vs. Professional in that you can't load more than a million rows of data at a time and you can only connect to spreadsheets, text files and Microsoft Access. It's well worth working around those limitations though, to get access to the powerful tools that Tableau offers.

When I want a visualisation that's more bespoke than Tableau offers, I turn to R packages, because that's what I know. There are a whole host of other amazing technologies out there for bespoke data visualisation, like D3 and Processing, but it's very much a case of picking your battles and learning what will be most useful to you. I'm not a graphic designer and so am sticking with R and Tableau for now.

Question: Is there still a place for Excel?

Answer: Absolutely, there is.

Excel's still a very useful tool, even if only because you know exactly how it works, so you can use it to solve problems quickly. It's also very handy that almost everybody else has got a copy so you can share spreadsheets easily.

More and more though, I'm seeing Excel as a scratchpad for hacking data around before I put it somewhere more permanent. You can see what you're doing with Excel and for small to medium sized datasets, it works really well, provided what you're building is a one-off view of data that doesn't need to be updated.

Actually, I wish Microsoft would recognise this use for Excel, strip out all the crap and cut it right down. Just a blazingly fast spreadsheet, with worksheet formulas, pivot tables and simple charting. Nothing else. It'll never happen though.

As soon as your work starts to morph into repeatable analytics, or proper dashboarding, get it out of Excel. We used to use Excel for everything because it was all we had. That's not true any more.

The free alternative to Excel is Libre Office Calc and it's just about okay. If you haven't got Excel, it's worth installing but you'll end up in MySQL, R and Tableau earlier, because Calc is a lot less capable when you throw a sizeable amount of data at it.

Question: Any other bits and pieces?

Answer: One or two...

Notepad++ is a must for text editing and definitely put the Poor Man's T-SQL formatter plugin on it, so that you can clean up your SQL queries. You don't realise Wordpad is rubbish until you try something else that works properly.

Pentaho Kettle is cool and worth a look if you want drag and drop ETL (extract, transform, load) for your data. I got very excited about Kettle a while ago and I still use it quite a bit, but you may find that R steps up to do the same jobs as you get better at it.

Gephi is what I used to draw my analyst network visualisation. It's a piece of software focussed on networks rather than general purpose analysis, but a lot of fun to play with.

And finally, Digital Ocean is an awesome web service, where you can spin up a virtual Linux PC (they call it a Droplet) for $5 a month. When I want to run a web scraping Python script without the risk that my laptop will reboot half way through, I stick it on Digital Ocean. You can also put a 20GB MySQL server in the cloud this way and access it from anywhere. It won't be super fast for the basic $5 a month, but it's very handy if you access data from a few different places. If you use this link to sign up, you'll get $10 free credit and I'll get a bonus too. Everybody wins!

I hope that this post might set one or two people off trying new tools. As I said at the beginning, this isn't claiming to be the right answer to what you should use, or even the best answer, but it's a tool kit that's working for me and took quite a bit of sifting through different options to arrive at.

If you disagree with any of the choices, do hit the comments section. I'm always looking for better options, but only if learning them will save time in the long run...