Fairness is not the default

KJ Pittl from Google spoke brilliantly at C3DIS (The Collaborative Conference on Computational and Data Intensive Science) about fairness in Machine Learning in May. Although I’ve thought and read a lot about this topic, her talk was electrifying. I want to try to capture here the points that I thought were key, and none registers more strongly than this one:

“Humans have not got a history of being fair. Fair is not the default.”

To back up this point, KJ used the following slides, which really speak for themselves.


I am almost certain that none of these situations came about by malicious intent. They were just design decisions by a small group of people, for a small group of people, and they simply assumed it would work for everyone the way it worked for them.

But right there, that’s why we urgently need diversity in tech, and in data science. Because as long as the groups that are designing our future are largely homogeneous, they won’t be able to say “But are there any people of colour in our image set?” – a question that could have averted this:

[Screenshot]

or to say “Hey, do you know that blind people won’t be able to use this device to enter their PINs?”

Or “But what happens if you’re in a wheelchair or pushing a pram?”

or “What if you’re homeless?” “What if you have kids?” “What if you’re part time?” “What if English isn’t your native language?” “What if your eyesight isn’t great?” “What if you have food allergies?” “What if you’re a refugee?” “What if you don’t have a car?” or any one of the myriad questions that might prevent us from designing a future that inadvertently locks a section of our population out.

Diversity helps us design better solutions, but it also helps us ask important questions of the solutions we have. And given that, by default, our systems will not be fair, inclusive, or equitable, we really want to make sure those questions get asked.

Why robots are a disaster for tech education

It’s very tempting to see robots and other shiny tech toys as fantastic motivators for STEM education. After all, who doesn’t love playing with cool toys? Unfortunately this kind of hardware has huge drawbacks in the classroom. To show you why, let me tell you a story.

On the weekend I took my kids to Oz Comic Con. My 11-year-old, Jen, is a HUGE tech nerd and loves all things hardware, software, mathematical, and, of course, STAR WARS. Dressed as a Jedi and wielding a lightsaber, Jen was magnetically drawn to the stall selling Star Wars drones. Jen had been saving for Comic Con for months, so the $50 cost, while more than they had ever spent on anything before, was well within their reach.

I did a quick bit of online research and it seemed like a good buy.

Behold Jen’s X-Wing in all its glory.


You can imagine the excitement when we got it home, but we were out to dinner that night and didn’t have time to unbox and charge it. The next day Jen bounced out of bed and went straight to the box. Eating, drinking, and other necessities of life were not on the agenda, so it was lucky it was a public holiday and I didn’t have to try to get them to school.

Once charged (the drone), batteries installed (the controller), and with the beginner-pilot’s safety cage installed, we fired it up. The controller even buzzed when we inserted batteries and had Yoda saying “feel the force!”. The excitement was INTENSE. The instructions said to power up the controller and the drone, flip the left hand lever up and down, whereupon it would beep, and the flashing lights would then stop flashing to show that the devices were synced.

But there was a catch. Beeping occurred as expected, but the lights on both devices continued to flash. We powered both devices off and on again. We tried different batteries. We even went shopping for new batteries. We spent all day trying to get the damned thing to work, to no avail. Three days later it still didn’t work and we were waiting for tech support from the drone company to reply to our emails.

Now you may think we were doing something wrong – and perhaps we were – but I have a PhD in Computer Science, and my husband is an Electrical Engineer. If we can’t make it work, what hope does your average teacher have?

Unlike with programming, a student, a teacher, and even an electrical engineer have very little hope of debugging a device such as this one, because there is no feedback. There’s no way of knowing its internal state. Short of taking the device apart and resoldering each of the connections and testing each component (not skills taught in your typical primary education course last I checked), there’s no way to troubleshoot these things.

Whether it’s a robot, a Raspberry Pi, or an Arduino, all hardware suffers from these issues. There’s a significant chance that it won’t work out of the box. Even if it does, connections come loose and it might stop working mid-lesson, or not work next time it comes out of the cupboard. And what we teach kids with these kinds of intensely frustrating experiences – when they are trying to do the same things as everyone else, but for them it doesn’t work – is that these problems are insurmountable. That they have no control over technology, no power to fix it when it breaks, and no way of understanding how it does what it does.

These are not the lessons we want to be teaching our kids.

*Update: The company got back to us the day after I wrote this, and very quickly replaced the drone. 10 days after the initial purchase we have a drone that works – but Jen’s enthusiasm – and confidence – has taken a severe battering.

ADSEI in the news

ADSEI has been in the news lately. Check out our Executive Director, Dr Linda McIver, on ABC Radio Sydney’s Focus Program, talking about Big Data and data literacy.

There was a profile piece on Linda in the Australian Financial Review, in BOSS magazine.

There was also an op-ed in The Age, the Sydney Morning Herald, and other Fairfax publications on why kids need to be data literate.

Linda also gave a recent YOW night talk on how kids can solve our data problems with Citizen Data Science.

Data Science Education is an idea whose time has clearly come!


Conversations about renewable energy

Our first featured dataset is renewable energy installations around Australia by postcode.

I downloaded the csv file “Postcode data for small-scale installations – SGU-solar” which is a beautifully rich dataset that offers a range of options for exploration.

When you open it in a spreadsheet package it looks like this:

[Screenshot: the dataset opened in a spreadsheet]

Can you work out what has happened to the postcodes? The first is 0! The second is 200. Australian postcodes are four digits, so what the heck is that about? This is an example of your spreadsheet hiding things it thinks you don’t need to know about – in this case, leading 0s. Mathematically speaking, there’s no difference between 0, 00, 000, and 0000. They all just mean 0. So spreadsheets (and other software) tend to remove the leading 0s, which means postcode 0 is actually 0000, 200 is actually 0200, etc.
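If you read the file with code rather than a spreadsheet, you can put the leading 0s back yourself. Here is a minimal Python sketch, assuming a tiny made-up sample in place of the real file. The column names and numbers are placeholders for illustration, not the dataset's actual headers or values.

```python
import csv
import io

# Made-up sample standing in for the first rows of the real file.
# The column names and counts are assumptions for illustration only.
SAMPLE_CSV = """postcode,installations_2001_2016
0,12
200,3
800,1540
3029,8021
"""


def restore_postcode(raw):
    """Pad a postcode back out to four digits, e.g. '200' becomes '0200'."""
    return raw.strip().zfill(4)


def read_postcodes(text):
    # The csv module hands every field back as a string, so nothing is
    # silently turned into a number (and stripped of its 0s) on the way in.
    reader = csv.DictReader(io.StringIO(text))
    return [restore_postcode(row["postcode"]) for row in reader]


postcodes = read_postcodes(SAMPLE_CSV)
print(postcodes)  # ['0000', '0200', '0800', '3029']
```

The key idea is to treat postcodes as text, not numbers. They are labels, and you would never want to add two of them together.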

Now let’s look at the first two solar columns. The first is historical installations from 2001 to 2016. We don’t seem to have any data from before 2001, but that’s not because nobody was installing solar before that. It turns out that it’s because 2001 is when the government introduced the mandatory renewable energy target and began tracking renewable energy.

Next question: how much solar is actually operating now? Answer? We don’t know. This data tracks installations. It doesn’t track people getting rid of their solar panels, or the panels ceasing to work. Installations are a reasonable measure of how much solar we have, but not perfect.

This opens the way for a great conversation about the data we want, versus the data we have, and how many data studies work with flawed or missing data, simply because it’s all we have available.

Ok, so let’s look at the first column. Having it sorted by postcode is logical, but not terribly interesting. Let’s look at the top 20 postcodes – to do that, we can sort the entire table by the second column (how many installations happened between 2001 and 2016), in descending order. In other words, put the largest values up the top.
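The same sort can be done in a few lines of Python if you prefer code to a spreadsheet. This is only a sketch: the column name `installations_2001_2016` and every number in the sample are invented for illustration, so swap in the real headers from the file you downloaded.

```python
import csv
import io

# Invented sample rows. These are NOT the real figures from the dataset.
SAMPLE_CSV = """postcode,installations_2001_2016
3029,8021
4670,10433
6210,9877
3142,310
4655,9012
"""


def top_postcodes(text, column, n):
    """Return the n rows with the largest value in `column`, biggest first."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # reverse=True gives descending order, i.e. largest values up the top.
    rows.sort(key=lambda row: int(row[column]), reverse=True)
    return [(row["postcode"].zfill(4), int(row[column])) for row in rows[:n]]


top_three = top_postcodes(SAMPLE_CSV, "installations_2001_2016", 3)
for postcode, count in top_three:
    print(postcode, count)
```

Changing `n` to 20 and pointing the function at the full file would give the same kind of top-20 list as sorting the spreadsheet.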

[Screenshot: the table sorted by 2001 to 2016 installations, largest first]

A quick glance shows us that the majority of the top 20 postcodes start with a 4, meaning they’re in Queensland. (If you’re not sure which postcode is where, as I’m not, you can check at a postcode site.) The top postcode, 4670, covers 53 regions, including Bundaberg. There’s a surprisingly large gap between the top postcodes and the bottom of the top twenty, which is interesting. Most of the postcodes in this list that aren’t in Queensland are in Western Australia, except for 3029, which is west of Melbourne, around Hoppers Crossing, and 3977, which is south-east of Melbourne, in the Cranbourne area.

There’s a rich conversation to be had around why these suburbs have so much more solar than other places in Victoria. Toorak, for example, a notoriously wealthy suburb, comes in at 1701 on the list. My suspicion is that areas with a lot of new housing are more likely to have solar, as it gets put in when the house is built as a way to increase the energy rating of the house. But this is a topic worth exploring! You don’t have to know all the answers, as it’s an opportunity for the kids to research and explore, and come up with their own theories for why it might be the case.

Let’s look at column 4: solar installations in January 2017.  How different are the top 20 if you sort the whole table by this column?

[Screenshot: the table sorted by January 2017 installations, largest first]

Now WA scores better, and the rest still largely belong to Queensland, except for one Victorian postcode (Cranbourne area again), and this time one NSW representative.

Why do WA and Queensland do so well on both historic and recent measures? This is an opportunity to explore the politics and have your students find out what incentives there are to install solar in those states. Could it be due to solar feed-in tariffs, government incentives, or home energy rating requirements?

You can keep going and explore the different columns, or you could step it up a notch and start to look at how the columns are related. For example, are postcodes with a lot of historical solar installations also likely to have a lot of recent ones? You can do that roughly by eye, simply by looking at whether the top twenties when sorted by those two columns are similar or very different, or you can go heavy on the statistics and try to work out whether both values are equally predictive of a postcode’s place in the ranking. (I won’t go into that here, lest I scare away the non-stattos among us!)
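For anyone who does want to try both approaches in code, here is a rough Python sketch. It counts how many postcodes appear in both top lists (the "by eye" check), and then computes a Pearson correlation as one simple stand-in for the heavier statistics. All the counts are invented, and correlation is just one possible measure, chosen here for illustration.

```python
from math import sqrt

# Invented counts for illustration: {postcode: (historical, recent)}.
SAMPLE = {
    "4670": (10000, 90),
    "6210": (9500, 120),
    "4655": (9000, 80),
    "3977": (7000, 100),
    "2170": (3000, 95),
    "3142": (300, 5),
}


def top_n(data, index, n):
    """Set of the n postcodes with the largest value at position `index`."""
    ranked = sorted(data, key=lambda pc: data[pc][index], reverse=True)
    return set(ranked[:n])


def pearson(xs, ys):
    """Pearson correlation: near 1 means the two columns rise together."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# The "by eye" check: which postcodes are in both top-three lists?
shared = top_n(SAMPLE, 0, 3) & top_n(SAMPLE, 1, 3)
print(shared)  # {'6210'}

# The statistical check: do the two columns move together overall?
historical = [h for h, _ in SAMPLE.values()]
recent = [r for _, r in SAMPLE.values()]
r = pearson(historical, recent)
print(round(r, 2))
```

In this made-up sample only one postcode is in both top-three lists, yet the correlation is still positive, which is a nice prompt for discussing why the two checks can tell different stories.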

You can use this dataset to explore different attitudes to solar power around the country, and the possible reasons for them. You can use it to question which incentives work and which ones fall flat, or whether solar incentives actually make a difference.

Now, what if you wanted to visualise this data? Well, you could find out the names of the top 10 postcode areas and graph them. (You could just graph the postcodes but it’s not terribly meaningful to anyone who hasn’t memorised the postcodes of Australia!) Top 10 is a fairly arbitrary selection, aimed at not putting too many places into the one graph. It would make more sense to choose a place in the data where there’s a big drop from one value to the next. In this case I might go top 5, since there’s a big drop from 5 to 6. It shows you the top performers well, but doesn’t show you much else.
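Finding that "big drop" can also be automated. Here is a small Python sketch, assuming you already have the installation counts as a list. The function name `biggest_drop_cutoff` and the counts are made up for illustration, and the list needs at least two values.

```python
def biggest_drop_cutoff(values, max_items=10):
    """Return how many top items to keep, cutting at the largest drop.

    Only gaps between neighbours within the top `max_items` (plus the
    next value down) are considered, so the graph never gets too crowded.
    """
    ranked = sorted(values, reverse=True)[: max_items + 1]
    # drops[i] is the gap between the item in place i+1 and place i+2.
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    return drops.index(max(drops)) + 1


# Invented counts with an obvious gap between 5th and 6th place.
counts = [10400, 9900, 9700, 9500, 9300, 7800,
          7700, 7600, 7400, 7300, 7250, 7100]
cutoff = biggest_drop_cutoff(counts)
print(cutoff)  # 5
```

With these invented numbers the largest gap falls between fifth and sixth place, so the sketch suggests graphing the top 5.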

Another technique would be to colour a map by number of solar installations. Say, bright red for >9000, and becoming paler for each drop of 1000. This would be rather time consuming given that there are 2795 postcodes listed, so this is an opportunity to consider aggregating your data. What happens if you use average stats for each state?
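The colour scheme described above can be sketched as a small Python function that turns an installation count into a shade of red. The exact colours, the number of shades, and the name `shade_for` are all assumptions made for illustration. Any mapping tool you use will have its own way of setting colours.

```python
def shade_for(count, top=9000, step=1000, levels=10):
    """Map a count to a hex colour: bright red above `top`, paler below.

    Each further drop of `step` installations moves one shade paler,
    until the palest of the `levels` shades is reached.
    """
    if count > top:
        steps_paler = 0
    else:
        steps_paler = min(levels - 1, (top - count) // step + 1)
    # Raising green and blue together washes pure red out towards white.
    gb = steps_paler * 255 // levels
    return f"#ff{gb:02x}{gb:02x}"


print(shade_for(9500))  # #ff0000 (bright red)
print(shade_for(8500))  # #ff1919 (one shade paler)
print(shade_for(0))     # #ffe5e5 (palest shade)
```

Grouping values into bands like this is itself a design decision worth discussing, since a different `step` can make the same map look very different.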

You can do that in Excel or any other spreadsheet package by sorting the data by postcode, and then just copying and pasting each state into a separate sheet, but it’s also nice and easy in Python. (I’ve been lazy and lumped the ACT in with NSW.)

[Screenshot: average solar installations for each state]

Interestingly this shows that the state that dominates the top 20 doesn’t perform as well when you average over all of its postcodes, so there is another rich conversation to be had about different ways of ranking data outcomes, and how you can characterise data in accurate but misleading ways.
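For anyone curious about the Python route, here is one way the state averages might be computed. It is a sketch under stated assumptions, not the code used for the results above. The state lookup is a deliberately rough rule based on the first digit of the postcode, it lumps the ACT in with NSW, and all the counts are invented.

```python
from collections import defaultdict

# Rough first-digit rule. It ignores special cases, such as the ACT
# postcodes that start with 2, which land in NSW here anyway.
FIRST_DIGIT_TO_STATE = {
    "2": "NSW", "3": "VIC", "4": "QLD",
    "5": "SA", "6": "WA", "7": "TAS",
}


def state_for(postcode):
    """Guess the state from a postcode (ACT is lumped in with NSW)."""
    postcode = postcode.zfill(4)  # put back any stripped leading 0s
    if postcode.startswith("0"):
        # 02xx postcodes are in the ACT; other 0xxx ones are treated as NT.
        return "NSW" if postcode.startswith("02") else "NT"
    return FIRST_DIGIT_TO_STATE.get(postcode[0], "other")


def average_by_state(counts):
    """Average the installation counts of all postcodes in each state."""
    grouped = defaultdict(list)
    for postcode, count in counts.items():
        grouped[state_for(postcode)].append(count)
    return {state: sum(vals) / len(vals) for state, vals in grouped.items()}


# Invented counts, NOT real figures from the dataset.
SAMPLE = {"4670": 10000, "4655": 9000, "4000": 200,
          "3029": 8000, "3977": 7000, "200": 3, "0800": 1500}
averages = average_by_state(SAMPLE)
for state, avg in sorted(averages.items()):
    print(state, avg)
```

In this made-up sample Queensland holds the top two postcodes, yet Victoria has the higher average, which is the same kind of effect: a state can dominate the top of the list and still rank lower once every postcode is counted.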

It’s a great example of not needing complex technical skills to explore a dataset. Being able to program unlocks more ways of looking at the data, but to get started all you need here is the ability to sort a spreadsheet by different columns, and a wealth of information is at your fingertips!

We will publish more datasets and more explorations as we go along, but in the meantime why not find your own datasets, and explore the things they can tell you? There are no right or wrong answers in this game, just different ways to play with the data. The more you play, the greater your data literacy.



Bringing Data Science TO YOU

ADSEI is super excited to be partnering with CSIRO and AeRO (Australian eResearch Organisations) to run some public events at the C3DIS Collaborative Conference on Computational and Data Intensive Science at the Melbourne Convention and Exhibition Centre (MCEC).

For the general public we have a science panel event at 7:30 on Wednesday May 30th: “Data Intensive Science: from Astronomy to Zoology”. Come and hear scientists talk about the uses and abuses of data, and ask your questions about science, data, and everything! You can book tickets for only $5 per person.

For Year 10-12 Students there is a student day on May 30th from 10am until 2pm. 

This is an outstanding opportunity for students to learn about cutting edge STEM research, hear talks from world class scientists, and to meet researchers using Computation to solve problems in areas as diverse as Biology, Climate Science, Astronomy, Marine Science, Bushfire Prevention and Management, and much, much more.

Teachers can bring students to this day for FREE, and student groups are welcome to enter the Visualisation Competition. 

Students can work in teams to choose a real world dataset (such as the workforce equality dataset), analyse it to answer an interesting question, and visualise the results.

More information on the student day and optional visualisation competition can be found here.

For teachers there is a workshop on integrating STEM into your classes using Data Science. For $250 you can see keynotes at the conference, attend a workshop where you will build classroom projects and lesson plans around CSIRO and other datasets, and attend the Poster Session drinks and the Gala reception in the evening.