Our first featured dataset is renewable energy installations around Australia by postcode.
I downloaded the CSV file “Postcode data for small-scale installations – SGU-solar”, which is a beautifully rich dataset that offers a range of options for exploration.
When you open it in a spreadsheet package it looks like this:
Can you work out what has happened to the postcodes? The first is 0! The second is 200. Australian postcodes are four digits, so what the heck is that about? This is an example of your spreadsheet hiding things it thinks you don’t need to know about – in this case, leading 0s. Mathematically speaking, there’s no difference between 0, 00, 000, and 0000. They all just mean 0. So spreadsheets (and other software) tend to remove the leading 0s, which means postcode 0 is actually 0000, 200 is actually 0200, etc.
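If you are loading the file in Python rather than a spreadsheet, one way to put the missing zeros back is to treat each postcode as text and pad it to four characters. This is only a minimal sketch: the column names and the sample values below are invented for illustration, and the real file's headings will differ.

```python
import csv
import io

# Toy stand-in for the downloaded file. The column names and numbers are
# made up for this example and are NOT taken from the real dataset.
SAMPLE_CSV = """postcode,installations
0,12
200,3
800,450
3029,910
"""


def normalise_postcode(raw):
    """Pad a postcode with leading zeros so it is always four characters."""
    return raw.strip().zfill(4)


# csv.DictReader hands back every field as text, so nothing is silently
# turned into a number (and stripped of its zeros) along the way.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
postcodes = [normalise_postcode(row["postcode"]) for row in rows]

print(postcodes)  # ['0000', '0200', '0800', '3029']
```

Keeping postcodes as text is the important design choice here: they are labels, not quantities, so you never want to do arithmetic on them.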
Now let’s look at the first two solar columns. The first is historical installations from 2001 to 2016. We don’t seem to have any data from before 2001, but that’s not because nobody was installing solar before that. It turns out that it’s because 2001 is when the government introduced the mandatory renewable energy target and began tracking renewable energy.
Next question: how much solar is actually operating now? Answer? We don’t know. This data tracks installations. It doesn’t track people getting rid of their solar panels, or the panels ceasing to work. Installations are a reasonable measure of how much solar we have, but not perfect.
This opens the way for a great conversation about the data we want, versus the data we have, and how many data studies work with flawed or missing data, simply because it’s all we have available.
Ok, so let’s look at the first column. Having it sorted by postcode is logical, but not terribly interesting. Let’s look at the top 20 postcodes – to do that, we can sort the entire table by the second column (how many installations happened between 2001 and 2016), in descending order. In other words, put the largest values up the top.
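The same sort can be done in a few lines of Python. This is a sketch with a handful of invented rows standing in for the real table, and `top_n` is a hypothetical helper name rather than anything built in.

```python
# Invented (postcode, installations) pairs, purely for illustration.
# The real table has thousands of rows and different numbers.
rows = [
    ("3029", 510),
    ("4670", 940),
    ("0800", 75),
    ("6210", 820),
    ("2000", 130),
]


def top_n(rows, n):
    """Return the n rows with the most installations, largest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


top_three = top_n(rows, 3)
for postcode, installations in top_three:
    print(postcode, installations)
```

With the real data you would ask for `top_n(rows, 20)` to get the same top 20 the spreadsheet sort gives you.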
A quick glance shows us that the majority of the top 20 postcodes start with a 4, meaning they’re in Queensland. (If you’re not sure which postcode is where, as I’m not, you can check at a postcode site.) The top postcode, 4670, covers 53 regions, including Bundaberg. There’s a surprisingly large gap between the top postcodes and the bottom of the top twenty, which is interesting. Most of the postcodes in this list that aren’t in Queensland are in Western Australia, except for 3029, which is west of Melbourne, around Hoppers Crossing, and 3977, which is south-east of Melbourne, in the Cranbourne area.
There’s a rich conversation to be had around why these suburbs have so much more solar than other places in Victoria. Toorak, for example, a notoriously wealthy suburb, comes in at 1701 on the list. My suspicion is that areas with a lot of new housing are more likely to have solar, as it gets put in when the house is built as a way to increase the energy rating of the house. But this is a topic worth exploring! You don’t have to know all the answers, as it’s an opportunity for the kids to research and explore, and come up with their own theories for why it might be the case.
Let’s look at column 4: solar installations in January 2017. How different are the top 20 if you sort the whole table by this column?
Now WA scores better, and most of the remaining places still go to Queensland, except for one Victorian postcode (Cranbourne area again), and this time one NSW representative.
Why do WA and Queensland do so well on both historic and recent measures? This is an opportunity to explore the politics and have your students find out what incentives there are to install solar in those states. Could it be due to solar feed-in tariffs, government incentives, or home energy rating requirements?
You can keep going and explore the different columns, or you could step it up a notch and start to look at how the columns are related. For example, are postcodes with a lot of historical solar installations also likely to have a lot of recent ones? You can do that roughly by eye, simply by looking at whether the top twenty lists you get when sorting by those two columns are similar or very different, or you can go heavy on the statistics and try to work out whether both values are equally predictive of a postcode’s place in the ranking. (I won’t go into that here, lest I scare away the non-stattos among us!)
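The "by eye" comparison can also be sketched in Python by counting how many postcodes appear in both top lists. The table below is invented for illustration, and the column names `historical` and `recent` are assumptions, not the dataset's real headings.

```python
# Invented figures for five postcodes; not taken from the real dataset.
table = {
    "4670": {"historical": 940, "recent": 40},
    "6210": {"historical": 820, "recent": 55},
    "3029": {"historical": 510, "recent": 12},
    "3977": {"historical": 480, "recent": 48},
    "2000": {"historical": 130, "recent": 5},
}


def top_postcodes(table, column, n):
    """Return the set of n postcodes with the largest values in a column."""
    ranked = sorted(table, key=lambda pc: table[pc][column], reverse=True)
    return set(ranked[:n])


# Postcodes that make the top 3 on both measures.
shared = top_postcodes(table, "historical", 3) & top_postcodes(table, "recent", 3)
print(sorted(shared))  # ['4670', '6210']
```

A large overlap suggests the two columns tell a similar story, and a small one suggests something has changed. It is a rough check, not a statistical test.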
You can use this dataset to explore different attitudes to solar power around the country, and the possible reasons for them. You can use it to question which incentives work and which ones fall flat, or whether solar incentives actually make a difference.
Now, what if you wanted to visualise this data? Well, you could find out the names of the top 10 postcode areas and graph them. (You could just graph the postcodes but it’s not terribly meaningful to anyone who hasn’t memorised the postcodes of Australia!) Top 10 is a fairly arbitrary selection, aimed at not putting too many places into the one graph. It would make more sense to choose a place in the data where there’s a big drop from one value to the next. In this case I might go top 5, since there’s a big drop from 5 to 6. It shows you the top performers well, but doesn’t show you much else.
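Finding the "big drop" can be automated too. Here is one possible sketch: sort the values, measure the gap between each neighbouring pair, and cut where the gap is largest. The numbers are invented so that the biggest gap falls between 5th and 6th place, and `biggest_drop_cutoff` is a hypothetical helper name.

```python
def biggest_drop_cutoff(values):
    """Return how many top values sit above the largest single drop."""
    ordered = sorted(values, reverse=True)
    # Gap between each value and the next one down the ranking.
    drops = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    # Position of the largest gap; everything before it is "the top group".
    biggest = max(range(len(drops)), key=lambda i: drops[i])
    return biggest + 1


# Invented installation counts for ten postcodes, for illustration only.
values = [950, 930, 900, 880, 860, 600, 590, 570, 560, 550]
cutoff = biggest_drop_cutoff(values)
print(cutoff)  # 5
```

On real data it is worth running this on just the top 20 or so values, and checking the result by eye, because one runaway leader can put the largest gap between 1st and 2nd place.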
Another technique would be to colour a map by number of solar installations. Say, bright red for >9000, and becoming paler for each drop of 1000. This would be rather time consuming given that there are 2795 postcodes listed, so this is an opportunity to consider aggregating your data. What happens if you use average stats for each state?
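To make the colouring idea concrete, here is a rough sketch that turns an installation count into a shade of red. The thresholds follow the ones suggested above (over 9000, then a band per 1000). The function names and the exact way of paling the colour are assumptions made for illustration, and drawing the map itself is left out.

```python
def colour_band(installations, top=9000, step=1000):
    """Return 0 for counts above `top`, then 1, 2, ... for each band below."""
    if installations > top:
        return 0
    return (top - installations) // step + 1


def band_to_hex(band, top=9000, step=1000):
    """Turn a band number into a hex colour: band 0 is bright red,
    and higher bands are progressively paler."""
    max_band = top // step + 1
    # Raise the green and blue channels to wash the red out towards pink.
    paleness = round(230 * band / max_band)
    return f"#ff{paleness:02x}{paleness:02x}"


for count in (9500, 8500, 4200, 300):
    print(count, band_to_hex(colour_band(count)))
```

Stopping the paleness at 230 rather than 255 is a deliberate choice, so that even the lowest band is still faintly pink instead of vanishing into a white background.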
You can do that in Excel or any other spreadsheet package by sorting the data by postcode, and then just copying and pasting each state into a separate sheet, but it’s also nice and easy in Python. (I’ve been lazy and lumped the ACT in with NSW.)
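One way the Python version might look is sketched below. It guesses the state from the first digit of the postcode, which is a rough guide rather than an official lookup, and it lumps the ACT in with NSW as described above. The rows are invented for illustration, and this is not necessarily the code used to produce the averages discussed here.

```python
from collections import defaultdict
from statistics import mean

# Rough first-digit guide to Australian postcodes (an approximation).
# The ACT ranges inside 2xxx fall under NSW here, matching the lumping above.
STATE_BY_FIRST_DIGIT = {
    "1": "NSW", "2": "NSW", "3": "VIC", "4": "QLD", "5": "SA",
    "6": "WA", "7": "TAS", "8": "VIC", "9": "QLD",
}


def state_for(postcode):
    """Guess the state for a postcode, restoring leading zeros first."""
    code = postcode.zfill(4)
    if code.startswith("02"):
        return "NSW"  # 02xx is ACT, lumped in with NSW
    if code.startswith(("08", "09")):
        return "NT"
    return STATE_BY_FIRST_DIGIT.get(code[0], "Unknown")


def average_by_state(rows):
    """Average the installation counts of all postcodes in each state."""
    by_state = defaultdict(list)
    for postcode, installations in rows:
        by_state[state_for(postcode)].append(installations)
    return {state: mean(values) for state, values in by_state.items()}


# Invented (postcode, installations) rows, for illustration only.
rows = [("4670", 900), ("4000", 100), ("6210", 600),
        ("3029", 500), ("200", 40), ("2000", 60)]
averages = average_by_state(rows)
print(averages)
```

In this toy data Queensland has the single biggest postcode, yet its average is lower than Western Australia's, because its other postcode drags the figure down.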
Interestingly this shows that the state that dominates the top 20 doesn’t perform as well when you average over all of its postcodes, so there is another rich conversation to be had about different ways of ranking data outcomes, and how you can characterise data in accurate but misleading ways.
It’s a great example of not needing complex technical skills to explore a dataset. Being able to program unlocks more ways of looking at the data, but to get started all you need here is the ability to sort a spreadsheet by different columns, and a wealth of information is at your fingertips!
We will publish more datasets and more explorations as we go along, but in the meantime why not find your own datasets, and explore the things they can tell you? There are no right or wrong answers in this game, just different ways to play with the data. The more you play, the greater your data literacy.