Data Science Explainer

The Importance of Zero – A Graph Crimes Story

This morning I read an article on the latest covid wave on the Australian ABC news site.

Our tracking of covid waves is now all but non-existent, since we’re not collecting case data, but there are still some numbers, presumably gathered from hospitals. It’s hard to figure out exactly where the data used in the article comes from: it’s not on the page the article links to, nor on any of the pages that page links to. But assuming it’s real – or as real as it can be under the circumstances – let’s look at the story it’s telling.

Here is a graph from that article. It shows national Covid-19 cases as a 7-day rolling average from August to October 2023. It looks scary, with a sharp upwards jump at the end. What’s wrong with this graph?

A graph headed "National Covid-19 cases on a 7 day rolling average. Case numbers between August and October 2023." The X axis goes from Aug 1 to Oct 24, the Y axis goes from 550 to 950. The data on the graph starts at around 780, going gently up and down until a sharp dip on Oct 3 to 580 or so, and a sharp peak on Oct 24 at around 940. The dip and peak look quite dramatic.

If you said “The Y axis doesn’t start at 0”, you are correct. Why does that matter? Let’s see…

The ABC helpfully provided a “get the data” link, which gives you the numbers they used to construct this graph. Here’s what happens if you make the same graph with the Y axis starting from 0.

A graph headed "Case numbers vs Data" The X axis goes from Aug 16 to Oct 15, the Y axis goes from 0 to 1000. The data on the graph starts at around 780, going gently up and down before rising a little on Oct 24 at around 940. The dip and peak look quite minor.

See how much less dramatic the uptick at the end is? That’s because starting from 0 gives a much better sense of scale. The first graph uses slightly more than the range of the data for the Y axis, so it starts at 550 and goes to 950. This, by the way, is the default in a lot of graphing software, including Excel. The second graph starts at 0 and goes to 1000.
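To put rough numbers on that, here is a small Python sketch of how much of each graph's height the swing from the dip to the peak takes up. The function name `share_of_axis` is made up for illustration, and the 580 and 940 are approximate values read off the graph, not exact figures from the data file.

```python
def share_of_axis(low, high, axis_min, axis_max):
    """Fraction of the plotted Y-axis height covered by a move from low to high."""
    return (high - low) / (axis_max - axis_min)


# Approximate values read off the graph (assumed, not the exact data).
dip, peak = 580, 940

truncated = share_of_axis(dip, peak, 550, 950)    # Y axis as published
zero_based = share_of_axis(dip, peak, 0, 1000)    # Y axis starting at 0

print(f"Y axis 550 to 950: swing fills {truncated:.0%} of the graph height")
print(f"Y axis 0 to 1000:  swing fills {zero_based:.0%} of the graph height")
```

The same change in case numbers fills roughly 90% of the height of the first graph and roughly 36% of the second. The data is identical; only the axis has changed.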

Let’s put them close together to make it easier to see. (I’ve also made them smaller so you can see them both at the same time on a phone screen.)

I’m certainly not saying we don’t have a covid problem. Given the data that we’re NOT collecting (and check out this podcast episode with Margaret Hellard and Richard Denniss for more on that), we almost certainly have a much bigger problem than this graph shows. However, this is a nice clear illustration of how not including a zero on the scale gives a very distorted impression of the story.

It’s one of the first things many of the guests on Make Me Data Literate say when they look at graphs in the media. Where’s the zero? What story is this graph trying to tell, and is it the truth?

This is a key part of data literacy. How accurately does this graph represent the data story? In this case, not very accurately at all!
