Wednesday, 12 August 2015

Adblocking could be the saviour of high quality journalism

Reading the latest news, adblocking is heralded as the harbinger of the online apocalypse.

Many internet users have decided that they don't like to be tracked and profiled.

They don't like to be irritatingly diverted from the article that they're trying to read.

They don't like auto-play videos.

They don't like ads for things they've already bought, following them around the internet.

And they especially don't like their mobile data allowance being chewed up by advertising.

All of this means that the user base of adblocking software is on an upward trajectory and publishers are starting to worry.

According to PageFair, 21% of UK internet users use an adblocker. Amongst younger, tech-savvy audiences, the rate of blocking is much higher.

For publishers, this is a serious problem, because they get paid per thousand eyeballs viewing ads on their webpages.

For example, if I visit The Guardian, I see this

That banner at the top is generating a small amount of money per visitor for The Guardian.

With a blocker? Poof! It's gone. The page resizes into the gap and you'd never even know it had been there. The Guardian gets no money at all for my visit.

As an aside, Ghostery gives me a list of every tracker that's logging my visit to The Guardian and many of these will subsequently try to follow me around the web and see where else I go.

It's quite a big list...

That's FIFTY EIGHT separate trackers, just from a single visit to The Guardian homepage. No wonder some users want to opt out.

So adblocking continues on its upward trajectory, gains increasing penetration into mobile devices, and in the near future hits and then passes 50% of all users. All of our national newspapers go bust. It's inevitable, right?

Well no, actually, I don't think so.

I think adblocking could be very good for high quality journalism, but there'll be some short-term pain before we get there.

First a little bit of technical info...

Adverts are fairly easy for users to block, due to the way that they're bought and sold and embedded onto web pages. When you visit a newspaper's website, it loads the article you want to read from its own server and makes a request to some other server(s) for adverts. If you keep a big list of the servers that are used to deliver advertising, you can block them in your browser. The article loads and the adverts don't. Simple.

Now imagine the adblocking rate is 100% of internet users. What happens?

Any user, visiting any website, sees no third party adverts at all and publishers get no money.

But advertisers still want to talk to the The Guardian's readers. Of course they do, The Guardian's readers spend a long time staring at those pages and they're an educated bunch, with disposable income. Advertisers like people with disposable incomes and will work quite hard to get their brands in front of them.

Advertisers also no longer have an option to place their ads programmatically - splattering them across the web on loads of different sites and trying to follow individual users around - because as soon as they try to do that, the adblockers kick in and their messages disappear.

Suddenly, there are big online advertising budgets available, but no way to spend them programatically across lots of smaller websites.

So newspapers start selling advertising space the old fashioned way. They agree to place adverts on their site - delivered from their own servers, woven into articles and much tougher to block - and they agree a price to do it.

They might even negotiate that deal on the telephone. How quaint.

It's a model that looks a lot more like the traditional way of selling advertising space. Individually struck, higher value deals with large companies, rather than automated small-ads scattered across the web.

Why doesn't that already happen? Well sometimes it does, but currently, large proportions of online advertising budgets are used to buy programmatically. If you've got a choice between negotiations with individual newspapers, or bunging some cash at an ad exchange on the promise that the exchange will find your exact target audience wherever they are on the web and show them your ad, then of course you'll pick the automated route.

Adblocking takes away the automated route. Users don't like the tracking it entails and in increasing numbers, they're opting themselves out.

If you can't track and follow an individual user onto sites where its cheap to advertise, then you're going to have to buy your audiences in bulk, in the places where they congregate.

Adblocking doesn't leave an internet barren of advertising. It leaves large websites, with high quality audiences, in an incredibly strong position to negotiate their own deals.

For newspapers, it removes a huge number smaller websites from their competitive set and puts them back in the much stronger position that they used to enjoy.

Adblocking at scale would redirect online budgets towards those sites that can guarantee large, desirable audiences for advertisers. It could be the saviour of high quality journalism.

Thursday, 6 August 2015

Is Microsoft's Power BI a Tableau killer?

Have you heard of Betteridge's law of headlines? Then you already know the answer to this post's headline. No it's not.

Why not? Good question. And there is one feature where Microsoft's Power BI  manages to lay a glove on Tableau, but we'll save that for later. First, why are we here?

My last post on Wallpapering Fog was part reminiscence, part rant and part lament about where it's all going wrong with Microsoft Excel. You see, I used to like Excel a lot and mostly, I really don't any more. It's still very useful, but in terms of new features, the bad stuff is starting to outweigh the good.

I had a comment on that post from the design leader on Power BI - which is very flattering - asking if I'd tried the general availability version of their new software and I hadn't, yet. I have now. My IT department is also pushing Power BI so I really needed to take a proper look.

A bit of background if you're unfamiliar with Wallpapering Fog - I've been using Tableau (full-fat and Public) for four years and love it, but wouldn't say I'm wedded to it. I'll be very comfortable jumping ship if something better comes along, but for now Tableau is the best BI software on the market by quite a distance. I think using the best and looking for better is a healthy attitude to take.

Before Tableau, I used Excel Services. It was rubbish. Hopefully Power BI is better.

Onto the test. Is Power BI any good?

I went to and signed up.

I got a button that said "Get Started" and clicked it.

The website paused and said "working on it".

Hmmm. I haven't asked you to do anything yet. What could you possibly be working on? As a (reluctant) SharePoint user, I know that "working on it" message all too well. This is not the most promising start.

Still, it's gone in a couple of seconds and I've promised to be as objective as possible with this review. We're in, let's move on.

I'm deliberately going to pile into Power BI without reading any instructions at all, because that's exactly the approach I took with two of its competitors - Qlik View and Tableau - and with both, I was able to make good things happen pretty quickly. I only needed help later, as more advanced features came into play. That's the benchmark.

This will also be an early review and I haven't tested Power BI extensively. I may have missed things, but seeing what you can achieve with a piece of software in 24 hours is a useful exercise. In my experience, great software - like Photoshop or Tableau - will blow you away immediately and then keep on giving as you discover more depth. Before I'd invested any time in Tableau and long before I'd got involved with its community, you can read that it did that.

What to do first? I need to connect to some data so we can put Power BI through its paces. Let's have a look at the options.

I can load a csv file or spreadsheet, but decided to have a crack at an API connector and was presented with a slightly strange array of options.

Acumatica (who?) and Circuit ID (also who?) but not the usual array of social, stock price and economic data connectors. Odd.

Anyway, Google Analytics is there. Everything connects to Google Analytics. Let's try that.

Oops. Five minutes of Power BI thinking about things and then I gave up and refreshed my browser.

Maybe I'm being too ambitious (I'm not, I managed to do this in the competition's software). I loaded the example retail dataset.

A dashboard appeared!

If I'm being picky, some of the formatting is a bit scrappy - especially considering it's the only bundled example - but it's a dashboard. Let's not be too picky yet.

I can click around, highlight and filter and go into edit mode, but after a couple of minutes, I found a button for "Power BI for desktop" and downloaded it. I had hopes of more connectors for local databases and did find those. After I eventually installed it.

Yeah, so like I said. Eventually installed. Including rebooting a work PC that doesn't hurry itself to reboot, that little lot took the best part of twenty minutes.

Why don't I already have the latest version of IE? Because it's rubbish. I use Chrome and Firefox.

On the plus side, from the desktop, the Google Analytics connector works.

From this point on, the niggles stopped and Power BI was... Fine. Not good, not great. Fine.

I loaded some traffic data from Google Analytics and drew a couple of charts. I noticed that dates don't automatically drill from years, through months, to days like they do in Tableau and so manually created a month field.

Power BI's got a weird distinction between measures and columns, so creating that date column involved a brief false start. Apparently dates can only be a column, not a measure. What's the difference? Tableau switches effortlessly between dimensions and metrics, which is a very useful feature. You can split data by a numeric field (using it as a dimension), or you can add that numeric field up. At the point you create the field, the distinction doesn't matter - it's just a column of data and you choose what to do with it afterwards.

Creating charts was straightforward enough, but quite limited. Microsoft have ditched the rows and columns interface from Pivot Tables (that Tableau adopted and supercharged) and gone with a vertical interface. It works, but it's nowhere near as intuitive.

That list of options for summing, averaging and counting is again... fine. Just the basics. No frills.

Enough Google Analytics. I restarted Power BI and connected to one of our advertising datasets on SQL Server. 350k rows that describe what different companies have been spending on advertising over the past few years, to see how Power BI would handle a little bit more data.

It coped fine. Connecting was slightly counterintuitive because you have to tick a little box next to the table name, that's not obviously a tick box. Just clicking on the table name brings up a preview, but leaves the "Load" button greyed out and leaves you scratching your head for a bit.

Data loaded, I quickly put together this summary view of UK advertising spend.

You can filter across the whole page, or just on one chart. If you click on something - a date, or a category - then every element on the dashboard immediately filters to what you've selected.

I like that sort of universal filtering behaviour, when I can control it, but it doesn't look like you can here. Everything filters when you click something, whether you like it or not. That has the potential to confuse the hell out of non-technical end users of your dashboard as they idly highlight a date and most of the data on the view suddenly vanishes.

In keeping with the imposed filtering, you can drop a filter element onto the view (top right on my little dashboard) to make it more obvious what's going on, but you can't attach that filter to specific dashboard elements. You filter everything, or nothing.

If you look closely at my dashboard, you'll see that the dates (which are slanted, yuk) are in the wrong order. That's because in our SQL Server, they're stored as strings.

I tried to make a new column that would be a correctly formatted date and this happened.

What's wrong with that? I don't know, it looks fine to me. Apparently Power BI doesn't know either, because the error message is blank.

I tried calling it Formatted_Date, removing the space. Nope.

I tried a few other ideas. Nope.

Maybe DATEVALUE doesn't work like in Excel. That would be daft.

I gave up.

Overall, Power BI feels like it's doing the bare minimum. You can drop charts, text and maps onto the screen. You can sum and average data. You can format axes and change colours. That's about it.

There are features to create variables (when they work) and to connect data sources together, but BI software will stand or fall on the front end and how you are able to present data back to a user.

Drawing on a couple of my own recent projects, this is Tableau. So's this. Both built in the free, Public version. As far as I can see, you've got no chance of producing visualisations like these in Power BI. It's not outright bad, it works, it's just miles behind the competition.

Power BI hasn't come out swinging. It's a cagey, cautious entry into data visualisation that seems competent, but nothing more than that.

Before we go on, I should mention price, because it's important. Full fat Tableau Desktop is over £1000 a copy. It bloody well ought to be good.

Full fat Power BI is $9.99 a month. These two pieces of software aren't targeting the same market.

It's fairer to compare Power BI with Tableau's free offering - Tableau Public.

Tableau Public is full-fat Tableau Desktop, with database connections stripped out and you can only save your visualisations to Tableau's cloud. We've established that in terms of the sophistication of visualisations you can build, this means Tableau will blow Power BI out of the water. What about when you publish your dashboard to the cloud, for others to see?

Data access and refresh is where Power BI wins over Tableau. It can connect to database sources and APIs and auto-refresh from those sources so that your online dashboard stays updated without you needing to do anything. This functionality is free and for $9.99 a month you can do those refreshes at high frequency, with a bigger data storage allowance, though one that only takes it up to the allowance that Tableau offers for free.

OK Microsoft, you just got my attention. It might be just a line chart and a map, but an auto-refreshing line chart and a map is interesting.

What's the catch?

Well if I had an auto-refreshing dashboard, I'd want to embed it into this blog, or into, but you can't do that without paying $5 a month for SharePoint Online. I use SharePoint at work and it might just be the worst designed piece of software I've ever come across. I'm not even sure it was designed. No, I'm not paying my own money for it.

For now, if you want to build a sophisticated, good looking visualisation, you need Tableau.

If you want to build a more basic visualisation that refreshes itself, then Power BI is an option, but you'll only be able to share it on, not embed it.

There are hopes within the Tableau community that Public will get Tableau's new web data connector, which is coming in the next mini-release. If that happens, then Power BI is dead in the water, because its auto-refreshing (at a low price) USP will be gone.

In a business context, Power BI is in a potentially strong position. If your company has already bought into Microsoft's Office 365 and SharePoint stack - as mine has - then Power BI will integrate with that fairly cheaply and allow users to publish visualisations to each other. I wouldn't be at all surprised to see it gain a fair amount of traction.

Unfortunately, as analysts, that will mean investing our time making the argument that Power BI's competitors, while significantly more expensive, are significantly better. Tableau and Qlik View are changes to how analysts work and massive boosts to their capability and efficiency. Power BI is not that. If you just want to automatically publish a table and chart of financial results to the exec team though, it will certainly be useful.

Overall, Power BI is a 6/10 product and in places it doesn't feel finished. It's useable, but limited. Give it a try, but don't expect too much.


I haven't mentioned Power BI's other innovation of natural language search. The idea is that users can type in "sales in France" (or similar) and the dashboard will show them that.

I haven't mentioned it because it's a gimmick. Cortana (I assume it's got Cortana's engine) isn't the Starship Enterprise computer and it's not going to be intelligent enough to be useful. You'll see this message a lot.

Unfortunately, I already have experience of this feature playing really well in a controlled demo, where the salesman knows it will work. It's going to make it harder to explain to an excited senior exec that Power BI isn't really very good, when "you can talk to it and it just understands!"

Footnote 2:

In this review, I may have said Power BI can't do something, which it actually can. If that functionality comes from bolting on Power-something-else, then I'm not interested. Power BI, PowerView, PowerPivot... the Microsoft data ecosystem is a bit of a mess at the moment and it needs a more coherent offering. Power BI should work out of the box as a single download, as its competitors do and that is how I've tested it.

Tuesday, 2 June 2015

We need to talk about Excel

I'm not sure when it happened. I've got a feeling that the writing's been on the wall since the introduction of the 'Ribbon' menu.

I was the guy who could make Excel dance... Shortcuts flying, interactive dashboards, external data connections and VBA. I loved Excel.

A colleague, observing me building a spreadsheet a few years ago, said, "F*ck me, it's like watching Minority Report".

Proud moment.

Not any more.

Modern Excel is a mess.

It's worth taking a moment to consider how we got here. Excel was first released as Windows software with version 2.0 in 1987. It's nearly thirty years old.

Back when I started out as an analyst in 2000 - who was genuinely excited that a company had seen fit to employ him and to allocate him a desk and a PC - we were using Excel 97. This was the first version to contain proper VBA and also came with Clippy, the universally reviled Office assistant.

Clippy aside, Excel 97 was pretty good. It had most of the useful functions and features that you'd find in modern Excel and it worked.

Crucially, Excel was what you got. It was restricted to 65k rows and its charts looked bloody awful, but there wasn't really an alternative.

Having VBA baked-in made Excel tremendously flexible (and the bane of IT departments everywhere). With a bit of creativity, you could use it for statistical modelling, interactive dashboards, as a calendar, a project planner, a to-do list... And we did. Excel got (ab)used as a solution to every business problem going.

Back in 2000, Excel was the centre of an analyst's world. What happened?

Specialist software has chipped away at Excel's 'jack of all trades' USP.

If a computer can do it, you can probably make Excel do it. That's no exaggeration. VBA is behind Excel and so if some functionality doesn't exist out of the box, then you can add it. You want games in Excel? Here are fifty. Be warned: I make no guarantee those games won't royally screw up your PC. VBA can do that too.

When you break down the uses for Excel, you find new competitors are encroaching on all sides. Competitors that are designed to do a specialist job, to do it really well and that integrate with each other to provide a complete solution. For statistical modelling, you've got R, SciPy, Matlab... For visualisation, you've got R (again), Tableau, Qlik View... For data storage you've got a vast array of options and for data processing (ETL), you've got Alteryx, Pentaho and again, the list goes on.

That's just the things that Excel is actually for. Under the list of things that Excel has been abused to make it do, there are hundreds of better options. Many of them free. If you want a to-do list, for goodness sake pick something that's designed to do that job.

Excel is like a Leatherman multi-tool. You can get most DIY jobs done with it if you try hard enough.

But a Leatherman is rarely the best way to do any specific job. You want a proper screwdriver, or a full-size hacksaw, or to have a corkscrew for your dinner party that's not also attached to a pair of pliers.

A specialist's toolkit looks like this. One tool - the right tool - for each job.

This is Excel's problem in 2015. It's trying to do everything - often by bolting on more plugin tools - and so it's doing almost everything badly.

Excel's a great way to make some very average looking data visualisations, or to store your data in a way that makes it really difficult to manipulate quickly and to refresh. Excel can deliver a crap interactive dashboard to a (SharePoint) web page and it can do statistical modelling that's really hard to repeat, and leaves no audit trail.

Yes, you can sort of fix those issues, with plugins and macros and hacking and creative thinking, but that's back to fixing your motorbike with a Leatherman, when you could have had the full range of Snap-On tools.

Modern Excel has one more problem. And it's a biggie...

You can't be a beginner's introduction and a specialist's cutting-edge tool at the same time

Yes, I'm going to start with a rant about the Ribbon menu. It was a stupid idea when it was introduced and it's still a stupid idea now. When you watch an experienced user manipulate a familiar piece of software, you'll rarely see them touch the mouse, because it's a slow way to do what you want.

Microsoft introduced the ribbon to make features more prominent for selection with the mouse (and presumably with a view to the arrival of touch-screens). With subsequent releases, more and more features have moved into areas where they are difficult or impossible to access with the keyboard; try formatting a chart, or even saving a file in Excel 2013.

This might sound like a petty complaint, but it's a symptom of a very serious issue. The Ribbon and mouse / touch control were a big two-fingers to experienced Excel users.

Excel has been progressively dumbed-down to make it easier to access for inexperienced users.

Which is absolutely fine.

Except that simultaneously, Microsoft has introduced PowerBI, with features that aim squarely at advanced data manipulation and visualisation. I've tried them and to be frank, they're not up to scratch. They're awkward to install, difficult to use and when you do get them to work, they produce very average looking output.

Excel has ended up in a place where it's too advanced and has too many features for novice users and it's not as good as a dedicated toolkit for specialists. That's not a comfortable place to be.

Where now?

Excel has a strong defensive position, in that big IT departments like it because it's part of a suite of Microsoft software that they're already buying. As a business analyst, you also need Excel plus other tools - if only because everyone else still uses it - so it's not going anywhere in a hurry.

That defensive position is being eroded on all sides though. Particularly because you can get a lot of the competitors that I've been discussing for free. If your corporate IT environment isn't completely locked down, then you can make a lot of headway with open source, start to get your best work out into the world and then argue about commercial software licences later...

There is also one thing that Excel is truly brilliant at and it's not to be dismissed lightly. Sometimes you want a multi-tool. Just for a quick job, because it's easier than delving into the big toolbox. Excel is a fabulous tool for this. For quickly reformatting one-off data, for banging out a functional chart, or for correlating a couple of variables, you can't beat Excel.

Microsoft should recognise this use for Excel and take it right back to basics. Turn it into a Leatherman; a lightweight, portable, do-anything, data scratch-pad, that's not trying to be more.

They'll still need a full featured BI solution of course, and possibly something else that targets less experienced users, but stop trying to make Excel the scaffold that holds the whole data analysis structure together. It's not working and if my experience is anything to go by, it's leading experienced users to actively dislike the product.

If Microsoft don't produce that lightweight scratch-pad for data, I firmly believe that somebody else will and that could spell the end of Excel as a tool for serious analysts. Excel will have been replaced for the one task at which it is still the best option.

Monday, 23 March 2015

In praise of modesty

One of these people invented the World Wide Web.

The other talks about it.

One of these people is the best footballer in the world.

He's not the one in the gold shoes.

More of this please.

Monday, 16 March 2015

The Hitchhiker's Guide to the Solar System | Tableau Iron Viz Entry

So this year, I couldn't resist. Tableau set the challenge to build something - anything - with Wikipedia data and they've put prizes on the line. It's enough to tweak the competitive streak in any analyst...

But what to build? I started out with Football League data as it's familiar and I do a fair bit with those numbers already, but then @ChrisLuv did something really nice along the same lines and I got a bit bored scraping even more football results. It felt like time for a short break from football analysis and it would be nice to learn something new during the build.

I'm not sure where the Solar System idea that I've gone with came from - possibly a friend complaining recently on Facebook that his daughter keeps attacking his ship in Elite: Dangerous - but a quick browse on Wikipedia turned up nice, consistent datasets for planets and moons and some really lovely imagery. Definitely dashboardable.

I've been trying recently to reduce the numbers of screens in my Tableau work, because experience is showing that users can sometimes be reluctant to move around a multi-screen dashboard. If the landing page does a good job, you've got a good chance of converting new users to your interface, but if it requires more clicks to get to data, then things become much less smooth.

So, the task I set myself was a one page dashboard to explore the Solar System, and with enough interesting features to be worthwhile entering into the Iron Viz contest (I hope! That's not an intimidating task, honest).

Building on games like Elite, I also wanted it to look and feel something like the in-flight navigation computer in a space exploration game.

There's a bit of how-to guidance in the rest of this post, but first, here's the Hitchhiker's Guide to the Solar System...

 Hitchhiker's Guide to the Solar System

Navigation's a straightforward task of clicking the planet you want to see at the bottom of the screen and then you can also browse that planet's moons on the right-hand side.

If you really want to pull apart its inner workings, then download the workbook from Tableau Public and have a play. Here's a quick top-line guide to get things started.

Look and feel

Most of setting up the look and feel of the dashboard involved turning off features in Tableau. No axis lines, no grid lines, no titles for the dashboard elements, no tooltips and no row banding in tables. Everything was switched off via the Format menu and the background set to black everywhere.

If you set up one sheet the way you want it, then you can copy that sheet to make more without doing all of the formatting again. (P.S. Please can we have templates, Tableau? Pretty please?)

A green web-safe "Lucida Console" font on top of the black background gives a nice flight-computer feel.

Select a Planet

This is pretty easy once you know how Tableau custom shapes work.

I downloaded a high quality image (more on image quality later) of each planet from Wikipedia, saved them to my local Tableau shapes folder and then mapped them to the planet names.

The selector is a regular chart, with the axis set to logged distances from the Sun, planet images as shapes and planet sizes defined by their radius. Dashboard actions using that chart as a source, control the rest of the vis.

And a couple of extra tricks to tidy things up...

  • The planet labels are on the second axis, so that they all display at the same height rather than each one hugging close to its own icon
  • You can't use planet radius data exactly as it is to set the sizes of shapes, because Saturn has rings and so the planet itself ends up being too small. I created an extra field to boost Saturn's size up to where it should be.

Large image of selected planet

This is done the same way as mapping shapes onto the 'Select a Planet' chart, right? Unfortunately not. You can do it that way, but it looks a bit rubbish, because Tableau compresses shapes to speed things up. If you blow them up too large, they look like this.

It's not totally unacceptable, but seems a real shame to lose those gorgeous high-resolution planet images from Wikipedia.

The trick to getting high resolution images that change on a filter, is to use Map - Background Images. You make a 1x1 scatter chart, with the planet image set to show as background, depending on what's selected on the filter. Don't forget to make the single data point on the chart transparent!

Here are the settings:

The 'target' surrounding the planet is just a floating png image with a transparent background. I drew the four corners in PowerPoint and exported them as a picture. As much as PowerPoint is the work of the Devil, it's a really quick, easy way to create simple coloured lines and add them to your dashboard!

Planet data

This one's easy. It's a straightforward table with a black background and all of the lines and other formatting turned off.

The little benchmarking chart to the right of the table is designed to provide some context. When you look at a statistic like 24.79m/s2 for Jupiter's gravity, is that a lot compared to the other planets?

The chart itself is a Gantt bar, with a dashboard action that highlights the currently selected planet and lets you quickly see how it compares to its neighbours. Using a percentage table calculation instead of absolute values lets us put all the planets on the same mini-chart and stops the ticks for huge "Mass" numbers, compressing everything else down towards zero.

Select a moon, to see data and image

We've seen these techniques before - it's exactly the same as the large planet image and planet data table. I used a simpler chart - without custom shapes - as the selector because some moons are very small and others have very similar orbits, so they can cluster closely together.

Gustav Holst. Just because we can.

This one's a bit of a cheat in that it's not data from Wikipedia, but I just couldn't resist.

Classical music and space games and movies go hand-in-hand, so I used a URL action to call Grooveshark's embedded music player with tracks from Gustav Holst's Planets Suite.

It's not really hard to do - click share on a Grooveshark track and pull the link out of the embed code that it generates. Add that link to your dataset.

Then drop an empty Web Page element onto your dashboard and create a URL action that calls the field containing the link. Tableau will automatically throw the link at the empty Web Page element when you make a selection.

Edit: I previously had some discussion here of issues with Tableau Public, when you try to embed a link to an http address. Thanks to @ChrisLuv you can safely ignore all of that and only need to know that you have to use an https web address for whatever you link to, in order for it to work properly in Tableau Public. Just switch "http" for "https" in whatever you link to and you should be fine.

There's a little warning message hidden underneath the Grooveshark player link, just in case the web browser that's viewing doesn't have flash enabled and so doesn't load the player, but most people should never see it.

I also put that warning text in a calculated field, that hides it when Earth is selected. Holst took inspiration for The Planets Suite from astrology and so there isn't a track for Earth. Wikipedia says so. See, I used Wikipedia data on Holst. It wasn't cheating after all!

And we're done. Phew. Fingers crossed, it's time to press submit on the entry...

Wednesday, 11 March 2015

My data analysis toolkit

Growing out of posts like "Losing touch... Or why Excel and VBA won't cut it any more" and "How to do football analysis in Tableau", I've been asked a steady trickle of questions this year about what analytical software I use.

One of the biggest discoveries I've made as I branched out from Excel and VBA is that there isn't a right answer to what software you should use. There are loads of programming languages, loads of dashboard solutions, loads of databases and you can't possibly get experienced with all of them.

It's better to find yourself a set of tools that work and to know those tools well, than to have bits of knowledge all over the place, but not be using any of your kit to its full potential. If your data size isn't measured in terabytes, then you don't need to be right on the bleeding edge.

For outside of work projects, I've got a couple of machines that aren't anything fancy. An older Core i5 laptop with 4GB ram and a Core i7 laptop with 16GB ram. A regular laptop like that, costing between £500 and £1000 is more than enough for chucking around datasets up to tens of millions of rows and analysing them.

On to the software... As I said, there isn't a right answer to what you should use, but I hope this post might be useful to a few people as a starting point. If you've broken Excel and if Access has you tearing your hair out then read on. Opening Microsoft Access at all, is a very strong signal that you need to get acquainted with some of the tools on this list.

And as an added bonus, almost everything I'm going to talk about is free!

Question: Can I get by without Microsoft Windows?

Answer: Yes! You need Linux Mint.

Mint is a very Windows-like desktop environment and since the Windows 8 (Metro) car crash, it's arguably more Windows-like than Windows is.

Easy to install and easy to use, I wrote a short intro to Mint, a couple of years ago.

Be warned, that if you use Linux for anything more than a bit of web browsing, you are going to end up using the command prompt and Googling to fix broken things. But then you'll be doing that with the DOS prompt in Windows too, because a lot of data analysis software is designed around Linux and needs persuading to work properly with Windows.

If you've got an older machine lying about, stick Mint on it. You might be surprised at how good it is.

Question: Which database?

Answer: MySQL Community edition

Yes, there are newer, fancier Big Data technologies out there and I'll learn them at some point, but I want the SQL language I know, in a fast, familiar, free package and MySQL does that.

On Linux, you might have to do a bit of reading to get it to work properly (watch that you need to install the Server and Client packages), but it powers half the internet - including Twitter - and isn't too hard if you're patient and don't mind Googling an error message or two.

On Windows, make sure you download the MySQL Installer. Don't try to work out what packages you want and install them individually. You will inevitably bugger it up.

Once you've installed MySQL, if you're using it for data analysis rather than to power a small website, you MUST customise my.ini (Windows) or my.cnf (Linux), or it will run like treacle. Out of the box, MySQL is designed to run on really low-powered hardware and its memory usage settings are turned way down. Have a Google.

To go with your MySQL Server, you'll want MySQL Workbench, which you can use to write queries and maintain your database. It's packaged up in the Windows Installer that I mentioned, or you can install it separately on Linux.

Don't know any SQL? You should. Start here.

Question: Which programming language?

Answer: I use Python and I like it a lot.

If you're making the switch from VBA Macros, or taking your first steps in programming, then Python's a powerful, approachable place to start.

On Windows, I'd recommend installing ActiveState's Python package, because it handles all the set up for for you and prevents you from getting into a situation where you're pretty sure you installed Python but it just doesn't work.

On Linux, you already have Python installed and you just need a nice piece of editing software to write your code in.

On Windows and Linux, for writing code I've recently adopted PyCharm following a recommendation from @penaltyblog and it's the most straightforward editor I've seen that still has all the features you'll want as you progress.

It might feel a bit weird coming from Excel Macros that the editor you use to write code and the programming language itself are separate things, but that's Open Source for you - it gives you choices. MySQL is the same; you don't have to use MySQL Workbench to talk to your database if you don't want to and there are loads of other choices, but Workbench is an option that works and that's what this list is all about.

No idea how to use Python? Start here.

Question: What do you use for statistical analysis?

Answer: I use R.

R is free and tremendously powerful. It's a proper, highly capable programming language for working with data and the output you can produce using it is amazing.

Unfortunately, it's also got a learning curve that's close to vertical. Your first steps after working through tutorial examples, will not be easy. I'm definitely not an expert, but I'm improving...

The first step with R is to install R Studio, which makes life much easier. Don't try to install and use R without R Studio - it's unnecessarily painful.

The second step is to have a look at a beginners tutorial.

And when you get stuck, have a look on Stack Overflow. R errors can be really awkward things to Google, partly because it's just a single letter and partly because its user forums are horrible. Stack Overflow and R Bloggers will already have answers to most problems you'll hit and if they don't, you've probably got the wrong end of the stick and are asking the wrong question.

Ask a question on Stack Overflow at your own risk. Chances are somebody else has asked the question before and you could have found it by searching. The bar on what is a good question asked in the right way, is also set quite high!

As an aside, if you don't know R or Python, then they are interchangeable for a lot of work, such as web scraping. R is a statistics package with a lot of general programming capability and Python is a programming language with some really good statistics packages. You could try to stick with just one of them.

Question: And for graphs and data visualisation?

Answer: Tableau Public. And R again.

For a basic grounding in Tableau, try my  "How to do football analysis in Tableau" guide. Tableau's a fabulous piece of software and a brilliant way to get interactive charts and tables onto the web. It's also my first stop for visually interrogating a new dataset and seeing what it contains.

Tableau Public - which is free - is limited vs. Professional in that you can't load more than a million rows of data at a time and you can only connect to spreadsheets, text files and Microsoft Access. It's well worth working around those limitations though, to get access to the powerful tools that Tableau offers.

When I want a visualisation that's more bespoke than Tableau offers, I turn to R packages, because that's what I know. There are a whole host of other amazing technologies out there for bespoke data visualisation, like D3 and Processing, but it's very much a case of picking your battles and learning what will be most useful to you. I'm not a graphic designer and so am sticking with R and Tableau for now.

Question: Is there still a place for Excel?

Answer: Absolutely, there is.

Excel's still a very useful tool, even if only because you know exactly how it works, so you can use it to solve problems quickly. It's also very handy that almost everybody else has got a copy so you can share spreadsheets easily.

More and more though, I'm seeing Excel as a scratchpad for hacking data around before I put it somewhere more permanent. You can see what you're doing with Excel and for small to medium sized datasets, it works really well, provided what you're building is a one-off view of data that doesn't need to be updated.

Actually, I wish Microsoft would recognise this use for Excel, strip out all the crap and cut it right down. Just a blazingly fast spreadsheet, with worksheet formulas, pivot tables and simple charting. Nothing else. It'll never happen though.

As soon as your work starts to morph into repeatable analytics, or proper dashboarding, get it out of Excel. We used to use Excel for everything because it was all we had. That's not true any more.

The free alternative to Excel is Libre Office Calc and it's just about okay. If you haven't got Excel, it's worth installing but you'll end up in MySQL, R and Tableau earlier, because Calc is a lot less capable when you throw a sizeable amount of data at it.

Question: Any other bits and pieces?

Answer: One or two...

Notepad++ is a must for text editing and definitely put the Poor Man's T-SQL formatter plugin on it, so that you can clean up your SQL queries. You don't realise Wordpad is rubbish until you try something else that works properly.

Pentaho Kettle is cool and worth a look if you want drag and drop ETL (extract, transform, load) for your data. I got very excited about Kettle a while ago and I still use it quite a bit, but you may find that R steps up to do the same jobs as you get better at it.

Gephi is what I used to draw my analyst network visualisation. It's a piece of software focussed on networks rather than general purpose analysis, but a lot of fun to play with.

And finally, Digital Ocean is an awesome web service, where you can spin up a virtual Linux PC (they call it a Droplet) for $5 a month. When I want to run a web scraping Python script without the risk that my laptop will reboot half way through, I stick it on Digital Ocean. You can also put a 20GB MySQL server in the cloud this way and access it from anywhere. It won't be super fast for the basic $5 a month, but it's very handy if you access data from a few different places. If you use this link to sign up, you'll get $10 free credit and I'll get a bonus too. Everybody wins!

I hope that this post might set one or two people off trying new tools. As I said at the beginning, this isn't claiming to be the right answer to what you should use, or even the best answer, but it's a tool kit that's working for me and took quite a bit of sifting through different options to arrive at.

If you disagree with any of the choices, do hit the comments section. I'm always looking for better options, but only if learning them will save time in the long run...

Monday, 9 February 2015

How to do football analysis in Tableau | Part 3

If you've been working through these posts, so far you've copied some football data from a website into Excel, cleaned it up and learned to build simple tables and charts with that data on a Tableau worksheet.

In Part 3, we're going to create our first interactive dashboard.  We'll make a dashboard that lets you select a team and then it will show you which are that team's strongest players.

If you had fun messing about with Tableau in Part 2 and learning by building loads of different views (I really hope you did. Screw instruction manuals. Including this one), then you might want to start a brand new Tableau workbook at this stage and connect to your Excel data again. Have a look at the early bits of Part 2 if you can't remember how that worked.

Make a new worksheet and drag Mins, PS%, Assists and Goals into the view, then split the rows by Player Name and Position. You're looking for a view like this.

PS% is a downright unhelpful name for Pass Completion %, so right click it in the Measures Area (bottom left of the screen, remember?) and rename it. Tableau will then automatically change the name wherever else you've already used it.

It might be good at this point to also change Pass Completion from Sum to Average. It doesn't really matter, because you've told Tableau to split the data by player and there's only one row of raw data for each player, so Sum and Average are the same. If you'd split by team instead though, and then summed pass completion, the data wouldn't make any sense because Tableau would add all of each team's players' pass completion rates together.

Right click the little green lozenge for Pass Completion % and change it from sum to average. If you really want to see the difference that makes, take the Player Name split out of the view for a moment and try Sum and then Average.

It always helps when you're using Tableau, to keep in mind what your underlying dataset looks like, otherwise sum and average (and as you get more advanced, weighted averages) can get you into trouble.

Back to the table... At the moment, our table has got players from both teams in it and we don't want that, because we're building a dashboard that will show us one team at a time.

Drag "Team" from the Dimensions area and drop it in "Filters".

You'll get a pop up showing you Liverpool and Everton. Tick Liverpool and click OK.

Now our table only has Liverpool players in it. Name the worksheet "Player List" (double click the worksheet's tab at the bottom of the screen) and then we're done with this one for the moment.

Add a new, blank worksheet and we'll also make a chart for our dashboard.

Let's try for a chart that will show us who's played a lot of minutes and who's playing well.

Drop "Rating" from the Measures area onto the Columns shelf and "Pass Completion %" onto the Rows shelf.

Tableau will guess that you're trying to draw a scatter chart, because you've dropped Measures where you'd normally drop Dimensions.

What we've got isn't tremendously useful though. It's a scatter, but with all of the players summed into a single point and we want them split out.

Drop "Player Name" onto Detail in the Marks area and Tableau will give you a point for each player.

Detail is essential for scatter plots and for maps. It tells Tableau at what level you want to see your data split out.

This still isn't a great chart though. Nobody has really low passing percentages or a really low rating, so let's tell Tableau that we don't need to see zero on the axes. Right click each axis, choose "Edit Axis", untick "Include Zero" in the top right corner, and press OK.

We're getting there. The chart shows who has good pass completion and who gets a good rating on WhoScored. If you hover the mouse over a point, it will tell you which player it is and whoever is up in the top right hand corner has good passing and a good WhoScored rating.

We can do better than that though. Grab "Mins" from your Measures area and drop it on Size.

Now the chart is showing who's played well (ish. This isn't very advanced analysis!) and whether they've had lots of minutes this season. Larger blobs means more minutes.

Drop "Team" from Dimensions onto Color and you'll be able to See Liverpool and Everton separated out. It would be nice if Liverpool were in red though...

Double click on a colour in the legend and you can change it.

(Worried about colour blind people? You could drop "Team" onto Shape as well. Don't leave it there though, because it will look a bit rubbish and we're only going to show one team at a time in a minute.)

Finally, drop "Player Name" onto Label and suddenly you'll be able to see who's who.

Rename your worksheet to "Player Scatter". We've got two worksheets and can build a dashboard!

The idea of dashboards is that you can drop multiple worksheets onto them and link them together. It lets you build screens that contain a mix of tables, charts and maps, where each piece of the dashboard is coming from one worksheet.

Make a new dashboard screen, by clicking the button next to your sheet tabs at the bottom of the screen.

In the new dashboard screen that pops up, you should see all of your worksheets appear in a list in the top left. Grab "Player List" (you did rename your sheets, didn't you?) and drop it onto the large white dashboard space.

The table you built earlier appears.

Now grab "Player Scatter" and do the same thing. Keep the left mouse button held down and you can see what will happen if you drop the worksheet into different places - at the top, on the left, or on the right. We want it on the right.

You should get something like this. Tableau put the chart on the screen and also the legend for your chart in separate boxes on the right hand side that you can move if you like. It looks rubbish. We can fix that.

Depending on the resolution of your monitor, Tableau has drawn you a default dashboard screen size and shape, but it's unlikely to be exactly what you want.

As a little aside, you can tell Tableau to draw a fixed dashboard size - that will never change no matter who loads it - or one that will try to adapt to different users' monitors by resizing itself. In my opinion, dashboards that resize themselves are usually crap and to be avoided. Tableau's not good enough at reshaping each element on the screen and if somebody loads your lovingly constructed views on a low res monitor, they'll look awful. You can set limits to the resizing, but it's much better to just fix the size and make users with low resolution screens scroll around a bit.

Use the dashboard size options in the bottom left to make a screen that is Exactly ('Exactly' is in the dropdown box) 1150 x 600.

That gives you a nice widescreen shape that will fit on a decent resolution laptop monitor. If you find it's half off your screen and there are scroll bars, you can make it smaller if you want.

Now that you can see both of your dashboard elements properly, it's time to make this dashboard interactive.

Remember adding the Team filter to your table so that it only showed Liverpool? We want to show that filter on this dashboard screen.

If you click in the area occupied by "Player List", you'll see a grey box appear around it. In the top-right corner of that grey boundary box, there's a tiny little down arrow, next to the x. It's really small but it is there and it's one of the things brand new users tend to miss.

Click that little down arrow, find "Quick Filters" and click "Team".

The filter that you created earlier will be pulled onto the dashboard, with only Liverpool ticked. If you hadn't already created that filter, Tableau would have just made it for you and also automatically put it on the underlying worksheet.

Tableau's default filter for a text column like Team, is a list of options with a tick box next to each, so in this case you get an option for Everton and Liverpool and (All) at the top. Sometimes this is what you want, but we'd like this dashboard to just show one team at a time, with no option to select both.

Filters have options on a little down arrow, just like the one we used on the "Player List" screen. Click in the area that your filter is occupying so that its grey boundary box is shown and use the menu to select "Single Value (Dropdown)".

That's given us a tidier dropdown box that will always take up the same amount of space on the dashboard if later on we decide to add more teams to our data.

You can also get rid of the (All) option on that box, so that people are forced to choose a team. In the same menu you just used, go to "Customize" and untick "Show All Value".

Our menu is ready!

Well, nearly. Change it from Liverpool to Everton and watch what happens. The table will switch from Liverpool to Everton, but the chart stays the same. That's no good. Currently, the filter is only linked to our table, not to our chart.

Use the menu on the filter (tiny little down arrow again) one last time to select "Apply to Worksheets" and "Selected Worksheets".

Then tick "Player Scatter" in the box that pops up and click OK.

Now try changing the filter. You've got an interactive dashboard!

When you add filters, you can choose to apply them to just one worksheet, to a selection of worksheets, or to everything that uses this data. The key thing to bear in mind is to make it obvious to users how your dashboard works. Filters that randomly change some things but not others are really, really confusing.

As far as this quick introduction goes, we're done! You've acquired data, loaded it, drawn tables and charts, built a dashboard and made it interactive and that's quite enough for today.

The only thing that remains is to share your masterpiece with the world.

In the File menu at the very top of the screen, click "Save to Web".

You'll get a popup box asking for your Tableau Public login. Remember your Tableau Public login? You made it in Part 2.

Give your workbook a name and hit Save. After a few moments, Tableau will pop up your workbook in all its glory, on the web. You can share it with anybody from here and they'll have exactly the same interactivity through filters, as you did in the desktop software.

Here's my version.

Helpfully, just as I wrote this guide last weekend, Tableau completely overhauled the look and feel of the old Tableau Public site. Thanks guys. Suffice to say, there are share and download links on the web dashboard screen and you can also access your profile and any other dashboards you've uploaded. Have a browse around the site and all will become clear.

Tableau really is a fantastic tool and you can achieve amazing visualisations, much faster, than with anything else I've come across (and I reviewed quite a few). I love it and I hope these three posts will help to remove that initial trepidation for a few football analysts. Get stuck in, connect to data and start playing. You'll get it and if you don't, Google it. And if you still don't, feel free to tweet me @neilcharles_uk.