Conversations about renewable energy

Our first featured dataset is renewable energy installations around Australia by postcode.

I downloaded the CSV file “Postcode data for small-scale installations – SGU-solar”, which is a beautifully rich dataset that offers a range of options for exploration.

When you open it in a spreadsheet package it looks like this:

Screen Shot 2018-05-06 at 3.22.53 pm

Can you work out what has happened to the postcodes? The first is 0! The second is 200. Australian postcodes are four digits, so what the heck is that about? This is an example of your spreadsheet hiding things it thinks you don’t need to know about – in this case, leading 0s. Mathematically speaking, there’s no difference between 0, 00, 000, and 0000. They all just mean 0. So spreadsheets (and other software) tend to remove the leading 0s, which means postcode 0 is actually 0000, 200 is actually 0200, etc.
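If you load the file with code instead of a spreadsheet, you can put the missing zeros back yourself. Here is a minimal sketch using only Python's standard library. The column names (`postcode`, `installations`) and the three sample rows are made up for illustration, and the real file's headers will differ.

```python
import csv
import io


def restore_postcode(value, width=4):
    """Pad a postcode back to four digits, e.g. 200 -> '0200'."""
    return str(value).strip().zfill(width)


# A tiny invented sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "postcode,installations\n"
    "0,12\n"
    "200,3\n"
    "3029,5000\n"
)

# csv.DictReader keeps every field as text, so nothing is silently
# converted to a number on the way in.
rows = list(csv.DictReader(sample_csv))
postcodes = [restore_postcode(row["postcode"]) for row in rows]
print(postcodes)  # ['0000', '0200', '3029']
```

The key idea is to treat postcodes as labels (text) and not as numbers, since nobody ever needs to add two postcodes together.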

Now let’s look at the first two solar columns. The first is historical installations from 2001 to 2016. We don’t seem to have any data from before 2001, but that’s not because nobody was installing solar before that. It turns out that it’s because 2001 is when the government introduced the mandatory renewable energy target and began tracking renewable energy.

Next question: how much solar is actually operating now? Answer? We don’t know. This data tracks installations. It doesn’t track people getting rid of their solar panels, or the panels ceasing to work. Installations are a reasonable measure of how much solar we have, but not perfect.

This opens the way for a great conversation about the data we want, versus the data we have, and how many data studies work with flawed or missing data, simply because it’s all we have available.

Ok, so let’s look at the first column. Having it sorted by postcode is logical, but not terribly interesting. Let’s look at the top 20 postcodes – to do that, we can sort the entire table by the second column (how many installations happened between 2001 and 2016), in descending order. In other words, put the largest values up the top.
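The same sort can be sketched in a few lines of Python. The installation counts below are invented purely to show the mechanics and are not taken from the dataset. `top_n` is a hypothetical helper name.

```python
# Invented rows: (postcode, installations 2001-2016). Not real figures.
rows = [
    ("3029", 410),
    ("4670", 950),
    ("6210", 720),
    ("0200", 3),
    ("4551", 880),
]


def top_n(rows, n):
    """Return the n rows with the most installations, largest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


for postcode, installs in top_n(rows, 3):
    print(postcode, installs)
```

With the real file you would ask for the top 20 instead of the top 3.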

Screen Shot 2018-05-06 at 3.34.48 pm

A quick glance shows us that the majority of the top 20 postcodes start with a 4, meaning they’re in Queensland. (If you’re not sure which postcode is where, as I’m not, you can check at a postcode site.) The top postcode, 4670, covers 53 regions, including Bundaberg. There’s a surprisingly large gap between the top postcodes and the bottom of the top twenty, which is interesting. Most of the postcodes in this list that aren’t in Queensland are in Western Australia, except for 3029, which is west of Melbourne, around Hoppers Crossing, and 3977, which is south-east of Melbourne, in the Cranbourne area.

There’s a rich conversation to be had around why these suburbs have so much more solar than other places in Victoria. Toorak, for example, a notoriously wealthy suburb, comes in at 1701 on the list. My suspicion is that areas with a lot of new housing are more likely to have solar, as it gets put in when the house is built as a way to increase the energy rating of the house. But this is a topic worth exploring! You don’t have to know all the answers, as it’s an opportunity for the kids to research and explore, and come up with their own theories for why it might be the case.

Let’s look at column 4: solar installations in January 2017.  How different are the top 20 if you sort the whole table by this column?

Screen Shot 2018-05-06 at 3.44.32 pm

Now WA scores better, and the rest is still largely over to Queensland, except for one Victorian postcode (Cranbourne area again), and this time one NSW representative.

Why do WA and Queensland do so well on both historic and recent measures? This is an opportunity to explore the politics and have your students find out what incentives there are to install solar in those states. Could it be due to solar feed in tariffs, government incentives, or home energy rating requirements?

You can keep going and explore the different columns, or you could step it up a notch and start to look at how the columns are related. For example, are postcodes with a lot of historical solar installations also likely to have a lot of recent ones? You can do that roughly by eye, simply by looking at whether the top twenty lists you get when sorting by each of those two columns are similar or very different, or you can go heavy on the statistics and try to work out whether both values are equally predictive of a postcode’s place in the ranking. (I won’t go into that here, lest I scare away the non-stattos among us!)
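The "by eye" comparison can also be turned into a single number: what fraction of the top postcodes on one measure also appear in the top postcodes on the other? This is a rough sketch of that overlap check, not a proper statistical test. The postcode labels and counts are invented, and the function names are hypothetical.

```python
def top_postcodes(counts, n):
    """Return the set of n postcodes with the highest counts."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    return set(ranked[:n])


def top_n_overlap(historical, recent, n):
    """Fraction of the historical top n that is also in the recent top n."""
    shared = top_postcodes(historical, n) & top_postcodes(recent, n)
    return len(shared) / n


# Invented figures for five pretend postcodes A to E.
historical = {"A": 900, "B": 800, "C": 700, "D": 100, "E": 50}
recent = {"A": 40, "B": 5, "C": 35, "D": 30, "E": 1}

print(top_n_overlap(historical, recent, 3))
```

An overlap near 1 suggests the same places lead on both measures, while an overlap near 0 suggests the leaders have changed.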

You can use this dataset to explore different attitudes to solar power around the country, and the possible reasons for them. You can use it to question which incentives work and which ones fall flat, or whether solar incentives actually make a difference.

Now, what if you wanted to visualise this data? Well, you could find out the names of the top 10 postcode areas and graph them. (You could just graph the postcodes but it’s not terribly meaningful to anyone who hasn’t memorised the postcodes of Australia!) Top 10 is a fairly arbitrary selection, aimed at not putting too many places into the one graph. It would make more sense to choose a place in the data where there’s a big drop from one value to the next. In this case I might go top 5, since there’s a big drop from 5 to 6. It shows you the top performers well, but doesn’t show you much else.
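Looking for the big drop can be automated. The sketch below sorts the values, measures the fall between each neighbouring pair near the top, and cuts at the largest one. The numbers are invented so that the biggest drop falls between 5th and 6th place, and `cutoff_at_biggest_drop` is a hypothetical helper name.

```python
def cutoff_at_biggest_drop(values, max_n=10):
    """Return how many top values to keep, cutting at the largest drop."""
    ranked = sorted(values, reverse=True)[:max_n + 1]
    if len(ranked) < 2:
        return len(ranked)
    # drops[i] is the fall from the value in place i+1 to the one in place i+2.
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    return drops.index(max(drops)) + 1


# Invented installation counts, not taken from the dataset.
installs = [9500, 9300, 9100, 9000, 8900, 6000,
            5900, 5800, 5700, 5600, 5500, 5400]

print(cutoff_at_biggest_drop(installs))  # 5
```

The `max_n` limit is a design choice: it keeps the graph readable by never suggesting more than ten places, even if a bigger drop exists further down the list.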

Another technique would be to colour a map by number of solar installations. Say, bright red for >9000, and becoming paler for each drop of 1000. This would be rather time consuming given that there are 2795 postcodes listed, so this is an opportunity to consider aggregating your data. What happens if you use average stats for each state?

You can do that in Excel or any other spreadsheet package by sorting the data by postcode, and then just copying and pasting each state into a separate sheet, but it’s also nice and easy in Python. (I’ve been lazy and lumped the ACT in with NSW.)
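Here is one way the Python version might look. It is a sketch under assumptions, not the code behind the results shown here: the mapping from a postcode's first digit to a state is a rough guide that ignores special cases, the ACT is lumped in with NSW as described above, and the sample figures are invented.

```python
from collections import defaultdict

# Rough guide to Australian postcode ranges by first digit.
# This is an approximation that ignores special cases.
FIRST_DIGIT_TO_STATE = {
    "1": "NSW", "2": "NSW", "3": "VIC", "4": "QLD",
    "5": "SA", "6": "WA", "7": "TAS", "8": "VIC", "9": "QLD",
}


def state_for(postcode):
    """Guess the state for a postcode, restoring leading zeros first."""
    postcode = str(postcode).zfill(4)
    if postcode.startswith("02"):
        return "NSW"  # ACT postcodes such as 0200, lumped in with NSW
    if postcode.startswith("0"):
        return "NT"
    return FIRST_DIGIT_TO_STATE[postcode[0]]


def average_by_state(rows):
    """Average the installation counts over all postcodes in each state."""
    totals = defaultdict(lambda: [0, 0])  # state -> [sum, count]
    for postcode, installs in rows:
        entry = totals[state_for(postcode)]
        entry[0] += installs
        entry[1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


# Invented rows: (postcode, installations). Not real figures.
sample = [("4670", 900), ("4551", 700), ("3029", 400), ("3977", 100), ("200", 4)]
print(average_by_state(sample))
```

Averaging per postcode is only one choice. Totals per state, or installations per household, would each rank the states differently.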

Screen Shot 2018-05-06 at 5.04.20 pm

Interestingly this shows that the state that dominates the top 20 doesn’t perform as well when you average over all of its postcodes, so there is another rich conversation to be had about different ways of ranking data outcomes, and how you can characterise data in accurate but misleading ways.

It’s a great example of not needing complex technical skills to explore a dataset. Being able to program unlocks more ways of looking at the data, but to get started all you need here is the ability to sort a spreadsheet by different columns, and a wealth of information is at your fingertips!

We will publish more datasets and more explorations as we go along, but in the meantime why not find your own datasets, and explore the things they can tell you? There are no right or wrong answers in this game, just different ways to play with the data. The more you play, the greater your data literacy.



Bringing Data Science TO YOU

ADSEI is super excited to be partnering with CSIRO and AeRO (Australian eResearch Organisations) to run some public events at the C3DIS Collaborative Conference on Computational and Data Intensive Science at the Melbourne Convention and Exhibition Centre (MCEC).

For the general public we have a science panel event at 7:30 on Wednesday May 30th: Data Intensive Science: from Astronomy to Zoology. Come and hear scientists talk about the uses and abuses of data, and ask your questions about science, data, and everything! You can book tickets for only $5 per person.

For Year 10-12 Students there is a student day on May 30th from 10am until 2pm. 

This is an outstanding opportunity for students to learn about cutting edge STEM research, hear talks from world class scientists, and to meet researchers using Computation to solve problems in areas as diverse as Biology, Climate Science, Astronomy, Marine Science, Bushfire Prevention and Management, and much, much more.

Teachers can bring students to this day for FREE, and student groups are welcome to enter the Visualisation Competition. 

Students can work in teams to choose a real world dataset (such as the workforce equality dataset), analyse it to answer an interesting question, and visualise the results.

More information on the student day and optional visualisation competition can be found here.

For teachers there is a workshop on integrating STEM into your classes using Data Science. For $250 you can see keynotes at the conference, attend a workshop where you will build classroom projects and lesson plans around CSIRO and other datasets, and attend the Poster Session drinks and the Gala reception in the evening.



Girls in STEM

If I hear one more person say “Girls just aren’t interested in tech” or “girls naturally go into the life sciences, it’s biological” I swear I will explode in a way that puts thermonuclear weapons in the shade.

At the same time, I get very frustrated with programmes that aim to attract girls to technology using 3D printed jewellery and sparkly shiny things.

I applaud people making efforts to get girls into tech. I really do. And having a diverse range of such programmes probably gives us a better shot at attracting a diverse range of people to the field. Which is great.

But I have two problems with the sparkly pink approach. First of all, I think it grossly underestimates and trivialises girls. Are we, as a gender, so shallow that it takes sparkly pink things to attract us? I reject that premise utterly.

And the second problem is that lack of girls is merely the obvious, measurable diversity issue in tech. We have a severe diversity problem that is not measurable with chromosomes.

The issue we have is that we are attracting the same types of people to STEM fields, especially technology, that we already have in those fields. That’s natural, to some extent – like attracts like. But if we are to design new technologies to be truly inclusive – like making our payment devices accessible for the blind, or creating wireless microphones for female speakers*[footnote] – then we need a truly diverse range of designers who will question, challenge, and innovate with everyone in mind, not just people like them.

If we only have people in technological roles who have been immersed in technology their whole lives, then we will only have products designed for those people. And that can render those products inaccessible, and indeed inexplicable, to the rest of us.

So we need to attract a broader range of people into tech than we are at present. And I don’t believe that sparkly pink things are going to cut it.

We are grossly underestimating not just girls, but all of our kids, if we think that they are only attracted to fun and frivolous things. Attract girls with sparkly pink and boys with video games – you’ll just get more monoculture. What we need to do, more than anything, is to show our kids the relevance of technology. What can you use this stuff for? How can you make a difference? What does it mean?

When we used to teach our year 10s programming by having them write code to draw pretty pictures, we had low numbers choosing to study computing in year 11, and very few girls (around 5 at best). The single most common piece of feedback we got was “Why are you making us do this? It’s just not relevant or interesting.”

When we started to teach Data Science using authentic datasets with real problems to solve, we doubled the number of girls going into Computing in year 11 (although as a data nerd I do have to point out that one data point does not make a trend! What it does make is an excellent start.), and the most common piece of feedback we got is now “This is SO useful, and so relevant to what I want to do.”

That’s why I’m so passionate about the Australian Data Science Education Institute. Because if we can support teachers to put Data Science into the way they teach everything – from history and geography through to science and maths – using real datasets, then we are showing the kids how technology is relevant to everything they do.


[footnote] The microphone issue may sound trivial, but I was presented with a wireless microphone last week that had a receiver designed to clip onto a belt. I was wearing a dress. With no belt. Fortunately I had a scarf around my neck that I could tie around my waist for clipping the receiver onto. But I should not have to rearrange my clothing in order to accommodate the technology. And what would we have done in the absence of that scarf? Seriously, how hard can it be to design devices that work for everybody?

Data Science for Primary Schools

People tend to assume that Data Science is a high level skill, only applicable to high school – and the senior years of high school at that. But engaging with data is something we can do from very early on.

Got a kinder class you want to do some data with? How about getting the class to keep track of who does what activity each day, using tally marks on a white board or flip chart, and then work out which activity is the most popular? Then do it again, only this time tally which activities girls do and which activities boys do. (The results may surprise them.) This is data science.

In primary school, kids can collect and analyse data from their own environments. They can do a rubbish audit and work out which types of rubbish are the biggest problem in the yard.

The younger kids can do that simply by piling the chip packets in one pile, the ice cream wrappers in another, and the cling wrap in a third, and then looking at which one is bigger.

The older kids can be making graphs. They could look at which types of rubbish are more common on canteen days, versus when the canteen is not open. Then they could work out a solution to their worst rubbish types – for example, if it’s chip packets, maybe the canteen could use large chip packets and distribute them in smaller lots in reusable containers.
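For classes that are starting to code, the canteen comparison can be sketched as a simple tally. The items below are invented to stand in for what the kids actually pick up, and `tally_by_day_type` is a hypothetical helper name.

```python
from collections import Counter

# Invented audit: (day type, rubbish type) for each item picked up.
items = [
    ("canteen", "chip packet"), ("canteen", "chip packet"),
    ("canteen", "ice cream wrapper"), ("canteen", "chip packet"),
    ("no canteen", "cling wrap"), ("no canteen", "cling wrap"),
    ("no canteen", "chip packet"),
]


def tally_by_day_type(items):
    """Count each rubbish type separately for each kind of day."""
    tallies = {}
    for day_type, rubbish in items:
        tallies.setdefault(day_type, Counter())[rubbish] += 1
    return tallies


tallies = tally_by_day_type(items)
for day_type, counts in tallies.items():
    worst, count = counts.most_common(1)[0]
    print(day_type, worst, count)
```

The counts per day type are exactly what the kids would need to draw a pair of bar graphs and compare them.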

Or they could do a biodiversity audit of a section of garden in the playground, perhaps comparing a garden which has only one type of plant with a garden which has a variety. They could plant a veggie garden and measure plant growth in a bed with compost versus a bed without compost.

Anything works, as long as it allows them to collect data about their own environment and then use that data to enact positive change: reducing rubbish, increasing biodiversity, attracting native birds to the playground, etc.

It’s really important that we start engaging kids with data science and computation early, because by the time they reach High School they’ve often already lost interest. And that’s a problem for them, and a much bigger problem for our society! But more on that in another blog post.

PS If you’re a primary teacher and need some help with the Australian Digital Technologies Curriculum, you might like to check out my “Demystifying the Digital Technologies Curriculum”  posts on my old blog.

Basic Data Literacy

It’s easy to get caught up in highly technical aspects of Data Science. To focus on complex numeric analysis using programming languages like Python or R, and think of outputs like fantabulous heatmaps and stunning geospatial visualisations.

But an article I saw in The Age today highlighted some of the deceptive data practices we see every day. Some of them are wholeheartedly deliberate, designed to mislead us and persuade us of untruths. Some, like this one I suspect, are purely accidental. But the journalist who wrote this article should never have let it stand, and the readers need to be able to think critically about what these numbers mean. Read the paragraph below for a moment.

Screen Shot 2018-02-23 at 3.38.47 pm

Can you see why it got my hackles up?

“From a 2-million-ton butter stockpile… dwindled to less than 12 days’ supply.”

So hands up if you know how many tons of butter constitutes a day’s supply?

Are you actually able to compare those two figures without further research? I certainly couldn’t. As it happens, I did further research and found a web page that puts global butter consumption at 8,000,000 tons annually. I have no idea how valid that webpage is, but let’s roll with it for a moment. To compare the figures we divide 8 million by 365 to get a daily figure, and then multiply by 12 to get back to tons per 12 days. I get 263,013 tons (and some assorted decimal places which I am going to wickedly ignore for now, but that will be a whole other blog post).

By dividing 263,000 by 2,000,000 we find that it’s roughly 13% of the stockpile we had before. Which is, it’s true, a significant decline. But now we can see how much of a decline it is, which trying to compare 2 million tons with 12 days’ supply made impossible. Even better, let’s represent it visually:

Screen Shot 2018-02-23 at 4.07.07 pm
Butter stocks then and now


This is just the default graph from Google Sheets, but it conveys the size difference quite effectively. (There are, of course, plenty of ways we could improve the graph, but one snark at a time, ok?)
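The butter arithmetic is easy to check in a few lines of Python. The 8 million ton annual figure is the unverified one from the web page mentioned above, and the variable names are just for illustration.

```python
annual_consumption_tons = 8_000_000  # unverified figure from the web page
old_stockpile_tons = 2_000_000       # stockpile figure from the article
days_of_supply = 12                  # "less than 12 days' supply"

# Convert days of supply into tons so both figures share a unit.
daily_consumption_tons = annual_consumption_tons / 365
current_stock_tons = daily_consumption_tons * days_of_supply
share_of_old_stockpile = current_stock_tons / old_stockpile_tons

print(int(current_stock_tons))           # 263013
print(f"{share_of_old_stockpile:.0%}")   # 13%
```

Because the article says "less than" 12 days, these results are upper bounds on what is left.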

It wasn’t hard maths. Or a challenging graphing exercise. Or even a tricky research problem – although, as noted, I have no idea how valid the figure of 8 million tons per year is, or indeed what year the measurements were taken (although the graph on that page implies either 2004 or 2009, but from the information available on that page I can’t be sure about the aggregate figures). But then I don’t know how valid the figures are in the original article (and, to be honest, I’m just not that excited about butter consumption, unless it’s on my own toast, and if we’re measuring that in tons then I probably have a problem).

The point is that if you’re going to use figures to support your argument, and you’re going to compare them by saying they “dwindled” from one value to another, it’s not rocket science to make those figures easily comparable.

This is one of the reasons Data Science needs to be core in schools. To make sure that when we present our own data, we present it in a way that’s both valid, and easy to interpret. And to ensure that when others show us data, we can analyse it critically, and call it out when it doesn’t make sense. Whether it’s by mistake, or by design.