Data Science Explainer

Axes of awful

I’ve ranted before about the importance of the y axis on graphs starting at 0. Most software autoscales graphs so that the range on the y axis is just slightly larger than the range of the data, which makes graphs difficult to compare, and often wildly over-emphasises the scale of change. You can see a classic example in the post linked above.

But. It turns out that, just sometimes, using the range of the data as the scale allows you to see change that would otherwise be very difficult to see. This is particularly important for climate and weather data, because small changes can be incredibly significant.

Consider that the world is heading for (or possibly has already exceeded) a warming of 1.5 degrees Celsius, which is quite catastrophic, and we desperately need to avoid hitting 2 degrees of warming. That extra 0.5 of a degree is deeply significant. But when you apply that to local maximum temperatures, it can be quite difficult to see. It also makes no sense, really, to apply climate change numbers to local weather, in part because climate and weather are two different things, and also because climate change can take temperatures down as well as up. But sometimes you want to see what’s going on in your own hood. It makes it more personal.

To that end, I grabbed the temperature data for Melbourne from the Bureau of Meteorology’s brilliant weather station data download page, and set to work. The page lets you download monthly or daily data, but I wanted to play with some code, so I got daily maximum temperatures. There were lots of stations to choose from.

Screenshot of a table of Melbourne weather stations, with columns for distance (from the chosen place, in this case Melbourne), station number, station name, date of first recorded value, date of last recorded value, years of data, percentage, and data availability. Stations listed include Melbourne (Olympic Park), Kew, Brighton Bowls Club and Essendon Airport, among others. Years of data range from 159.8 for Melbourne Regional Office to 0.3 for Laverton Salines.

I wanted a station with a good long recording period, so I went with Essendon Airport. It turns out this dataset has a gap from 1972 to 2003, for some reason, but there was still plenty to work with.

First I wrote a program in Python to just give me temperatures for December. You can do this in a spreadsheet, but I know Python better than I know spreadsheets, so it was easier for me to do it this way. It’s important to remember that there is no “best way to do things” and no points for using one system over another. Whatever works for you and your students is the right way to do it!
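For anyone who wants to try the Python route, here is a minimal sketch of that kind of filter. It is not the script used for this post. It assumes the download is a CSV with a "Month" column and a "Maximum temperature (Degree C)" column, so check those names against the header row of your own file. The sample rows are made up, purely to show the function running.

```python
import csv
import io

# Assumed column names; check them against the header row of your own download.
MONTH_COL = "Month"
TEMP_COL = "Maximum temperature (Degree C)"


def december_rows(csv_file):
    """Return the rows for December that have a recorded maximum."""
    reader = csv.DictReader(csv_file)
    kept = []
    for row in reader:
        is_december = int(row[MONTH_COL]) == 12
        has_reading = row[TEMP_COL].strip() != ""
        if is_december and has_reading:
            kept.append(row)
    return kept


# Tiny made-up sample in the assumed layout (not real observations).
sample = io.StringIO(
    "Year,Month,Day,Maximum temperature (Degree C)\n"
    "1939,11,30,21.4\n"
    "1939,12,01,24.0\n"
    "1939,12,02,\n"
    "1940,12,01,26.5\n"
)
december = december_rows(sample)
print([(r["Year"], r["Day"], r[TEMP_COL]) for r in december])
```

To run it on a real download, open the file with open(path, newline="") and pass the file object to december_rows in place of the sample. Days with no recorded maximum are skipped rather than counted as zero, which matters once you start averaging.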

I then wrote another script to give me a file containing the average December maximum for each year in the dataset. I opened that file in Excel and graphed it, which gave me this. Note that I had to faff about quite a lot to take it from the default graph choice to a graph that made some kind of sense!
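The averaging step can be sketched in a few lines too. Again, this is an illustration rather than the script used here: it assumes you already have (year, maximum temperature) pairs for the December days, and the example readings are invented.

```python
from collections import defaultdict


def yearly_averages(readings):
    """Average the maximum temperatures for each year.

    `readings` is an iterable of (year, max_temp) pairs, one per day.
    Years with no readings simply do not appear in the result.
    """
    temps_by_year = defaultdict(list)
    for year, max_temp in readings:
        temps_by_year[year].append(max_temp)
    return {
        year: sum(temps) / len(temps)
        for year, temps in sorted(temps_by_year.items())
    }


# Made-up readings, only to show the shape of the input and output.
example = [(1939, 24.0), (1939, 26.0), (1940, 22.0), (1940, 23.0), (1940, 27.0)]
averages = yearly_averages(example)
for year, avg in averages.items():
    print(year, round(avg, 1))
```

Because missing years never make it into the dictionary, a gap in the records shows up as a jump in the years rather than as a row of zeros.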

A graph of Melbourne Average December Temperatures. The values bump around a line and seem to range from 20ish to 30ish. The values on the x axis are the years, but if you look closely it jumps from 1971 to 2004. There's a trendline that might be going upwards a bit, but it's quite hard to tell.

Once I added a trend line, you could see that the average temperature was rising overall, but with a y axis range from 0 to 35, it wasn’t easy to see how large the change was. This is a situation where not starting the y axis from zero makes sense – to zoom in on the change and get a better sense for how large it is.

So I changed the minimum value on the graph to 20, which gives us this graph.

The same graph as before, but this time the y axis starts from 20, not from 0. The line is more jagged, but most importantly, the trend line clearly goes from just under 24 to just over 25.
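A bit of arithmetic shows why the second graph is easier to read. The sketch below works out what share of the axis height a change of about one degree takes up. The 0 to 35 range comes from the first graph; keeping 35 as the top of the second graph is an assumption, since only the minimum was changed.

```python
def share_of_axis(change, y_min, y_max):
    """Fraction of the y axis height that a given change occupies."""
    return change / (y_max - y_min)


change = 1.0  # roughly the rise shown by the trend line, in degrees

from_zero = share_of_axis(change, y_min=0, y_max=35)
from_twenty = share_of_axis(change, y_min=20, y_max=35)  # y_max=35 is assumed

print(f"axis from 0:  {from_zero:.1%} of the plot height")
print(f"axis from 20: {from_twenty:.1%} of the plot height")
```

Under those assumptions the same one degree goes from about 3% of the plot height to about 7%, which is the whole effect of moving the minimum.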

Now the trend line is much easier to read. It clearly starts just below 24, and continues to a little over 25. That’s a change of just over one degree from 1939 to 2023. Now I’m curious to try it for other weather stations, to see if the trend holds right across Melbourne, and Australia. I love the way finding something out with one dataset raises more questions, and more opportunities to explore!
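If you would rather compute the trend than read it off a graph, a straight trend line can be fitted by ordinary least squares. This is a sketch under that assumption (a linear fit), with invented yearly averages that include a gap in the years. It is not the data from the graphs above.

```python
def fit_trend_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up yearly averages with a deliberate gap in the years.
years = [1939, 1940, 1941, 2021, 2022, 2023]
december_avg = [23.5, 24.5, 23.9, 25.4, 24.6, 25.3]

slope, intercept = fit_trend_line(years, december_avg)
start = slope * years[0] + intercept
end = slope * years[-1] + intercept
print(f"trend: {start:.1f} to {end:.1f}, a change of {end - start:.1f} degrees")
```

Using the actual year as the x value keeps a gap in the records at its true width, so the slope is in degrees per year rather than degrees per data point.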

This is another example of how there are no absolute rules in data science (except for: there’s no such thing as a perfect dataset – that one holds inviolable!). Everything is context. The y-axis not starting at 0 is sometimes ok. Pie charts are sometimes a great way to compare values. A line graph is sometimes useful for discrete data.

There are no hard and fast rules in data science. Anyone who says otherwise is selling something.
