Raising Heretics on a Diet of Open Data

Following on from my rant on how ChatGPT would not be a threat to our assessment, if only our assessment were authentic and valid to start with, here is the text of a talk I gave today at Everything Open Conference in Melbourne, on how Open Data can help kids become rational heretics.

I’d like to begin by acknowledging the first scientists, the first environmentalists, the traditional owners and custodians of these unceded lands, the Wurundjeri Woi Wurrung people of the Kulin Nation, their elders, past and present. We have much to learn from them.

When I started teaching CS in a secondary school, I arrived to find a course that used robots, and pretty pictures, and other toys to try to engage folks with tech. And it failed. It failed dismally. The kids couldn’t see the point of learning this stuff. They couldn’t imagine it ever being relevant to them. They were bored, disengaged, and wholly unwilling, and this was at a science school, where you’d think the kids would be all over technology.

When we switched to using Data Science, we were teaching the same coding skills, but now we were doing it in the context of real datasets, with a large side of data literacy. We were using those datasets to answer meaningful questions. And suddenly kids were super engaged. Rather than complaining that they didn’t see the point, they came to me saying “Omg, this is so useful, I used it in my Science project,” (which was actually part of the reason I started teaching Data Science, because the graphs in those science projects regularly made me cry), “I used it in my maths exam. And there was a graph on the news last night and it was outrageous, there was no zero on the scale, it was so misleading…!”

There’s a lot of talk about boosting the pipeline. About getting more women and non binary folks into tech in general, and Data Science in particular. But as long as we focus on recruitment and, at a pinch, university education, as the means to address the problem, we will continue to fail, because we know that kids are being put off STEM, technology, and data skills as early as lower primary school.

Kids’ interest in STEM is caught or lost in those early years, when STEM skills, if they are being taught at all, are often being taught by folks who have never done STEM themselves, never been taught to teach STEM, and, all too often, are actually quite terrified of it. Relatedly, kids’ self-identified skill, or lack thereof, in maths (and probably tech, though it hasn’t yet been studied as far as I know) also solidifies quite early – around grade 4 or 5. Yet we aim a lot of our STEM recruitment drives at late high school, when kids are deciding their future careers (or are told they are, at least; I suspect many folks in this audience can attest that it is possible to switch courses, and even careers, at any time).

It makes sense to focus on high school kids if you think they are deciding their path, but the truth is they have pruned their available paths already. “Nothing involving maths – I’m no good at maths”, or “Nothing involving technology, I hated those robots”.

We need to teach those kids that STEM skills are meaningful and useful. That they can use them to effect change. AND we REALLY need to teach those kids that STEM skills are not terrifying, super difficult, and impossible to learn, but that those skills are actually readily learned, accessible to anyone. We know that one way to engage kids with STEM is to solve real problems with Data Science, so clearly we need to bring Data Science into schools from the very beginning. Well, I have good news and great news.

The good news is that we are already building data science into education, and kids are loving it. The great news, though, is that Open Data gives us the power to give kids deeply meaningful and engaging projects, and school Data Science gives us the sheer people power to solve serious data problems at the same time.

We all know there’s more data out there than the field of Data Science could analyse even if we collectively forgo sleep and food forever. Hands up if you’ve heard of the Japanese term Tsundoku? A stack of books beside your bed that you haven’t got around to reading yet. Now hands up if you have a data Tsundoku?

So how about we throw kids some of that data? Get them working with raw, messy, and above all REAL data and challenge them to make sense of it. Now we can give them the chance to break new ground, make real discoveries, AND put a dent in our data tsundoku at the same time.

What if we taught probability using gender pay datasets instead of black and white balls in an urn? “Charlie is a non binary software engineer. Given that they have been working in the field for three years, what is the probability they are receiving the same pay as James, a cis, het, white man doing the same job?” But, of course, we need open pay data in order to run that project!

The amazing thing about using real datasets for classroom projects is that the possibility exists for the kids to find out new things. To make discoveries that no one has made before. They’re suddenly doing real science, with new datasets that haven’t been fully analysed. They have the chance to ask questions no one has asked with that data.

For example, the first data science project I ran with my year 10s used a dataset from the AEC. We downloaded a csv file where each line was a vote for the Victorian senate candidates in the federal election. Over 3 million lines of text, so it wouldn’t even open in Excel. They had to learn to code, though, for the most part, their assignments could be completed with 10-20 lines of code, depending on their question. So they weren’t learning A LOT of code, but they were learning a manageable amount of code that enabled them to do something real, and made it clear that coding was something they could actually do, and that it was worth doing.
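To give a feel for how little code that can be, here is a minimal sketch of the kind of 10-20 line script a student might write. It is not the original assignment code: the column names (polling_place, first_preference) and the sample rows are made up for illustration, and the real AEC file was laid out differently. The point is that reading a file row by row means even millions of lines never have to fit in a spreadsheet.

```python
import csv
import io
from collections import Counter


def tally(lines, column):
    """Count how often each value appears in one column, reading row by row."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Made-up sample standing in for the real file; the column names are hypothetical.
sample = io.StringIO(
    "polling_place,first_preference\n"
    "Booth A,Party X\n"
    "Booth A,Party Y\n"
    "Booth B,Party X\n"
)

party_counts = tally(sample, "first_preference")
for party, votes in party_counts.most_common():
    print(party, votes)
```

For a real file, the same function could be called as tally(open("votes.csv", newline=""), "first_preference"), where votes.csv is a placeholder name.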

Imagine if everyone’s first experience of coding was successfully solving a MEANINGFUL problem!?

The assignment was to find a question the dataset could answer, analyse the data to get that answer, and then visualise the results. The visualisations were largely done by hand; there was no requirement to code that part. (Mostly because making kids interact with Python graph libraries is a fast road to tears – heck, most of the visualisation libraries make ME cry – but also because they could make their visualisations much more creative and compelling – and learn more in the process – if they had the freedom to create them in whatever way they wanted.)

Every student (of around 180) had to find a different question to answer.

They asked things like “how did the people at my local polling booth vote, compared to the people in my whole electorate/the rest of Victoria?”

What was the proportion of women as candidates, and how much of the vote did they get compared to the men? An excellent question, and one which raised the thorny problem of gendering people based on their given names, plus whatever was revealed by a Google search. A great opportunity to discuss some of the flaws and complexities inherent in data!

Which parties did people vote for who voted above/below the line?

Which parties were more likely to receive the 2nd preference of people who voted 1 for (some party)?

Where are Pauline Hanson’s One Nation voters? Metro, urban, or rural?

Which party’s voters were more likely to follow the how to vote cards? (It will probably not surprise you to learn that Green voters are the most disobedient in this respect.)

In short, they got to take a large dataset and ask the questions that most interested them. This dataset had just been released (it was just after the 2016 federal election) so most of these questions had not yet been asked. The focus was on the data literacy aspects – what questions can the dataset answer? And how can we communicate the answers to those questions accurately and compellingly?

I picked this dataset because I had a student who was very interested in politics, and so I spent some time rummaging around the AEC site to see what I could find. But even the kids who weren’t interested in politics were really interested in this project. They were able to make it meaningful, and to see the relevance to the rest of their studies and their futures. This turns out to be central to motivation. Kids who can’t see the point of learning something, for the most part, won’t learn it. And making it “fun” is very dependent on the definition of fun. There is no single thing that every student will agree is fun.

But if you’re using real data, kids can see the relevance to things they are interested in, even if this particular dataset doesn’t do it for them.

When we use real datasets, there are a lot of questions to ask before you even begin to analyse them.

Questions like:

  • How was the data collected?
  • What are the problems/limitations of the way the data was collected?
  • What was the sample size?
  • What biases are embodied in the data?
  • What were the limitations of any sensors used?
  • Has the data been processed at all, if so how, and what was lost in that process?
  • What does each of the fields mean?
  • How do the fields relate to each other?
  • What definitions underpin this data? What assumptions have been made in those definitions?

Already you’re starting a data literacy conversation that builds critical thinking and problem solving skills, and we haven’t even opened the file!

We have a tendency to kind of bend at the knees when we see a graph, or some statistics. Teaching Data Science using real datasets builds a culture of rational scepticism that makes it normal to ask where that data came from, how reliable is it, what biases might there be?

Of course, there are challenges. If you have 180 students all asking different questions of a dataset, the teacher is not going to have 180 answers they can check those assignments against. This is actually an upside, not a downside, because now the kids have to say why they think their answer is valid. They have to check it, test it, challenge it, and try to prove themselves wrong. They have to see how many other explanations there might be for the results they got, and figure out ways to test for those. So not only are they learning to be rationally sceptical about data, but now they are learning to be rationally sceptical of their own work. They’re learning to critically evaluate – without being able to look up the answer in the back of the textbook, or on the answer sheet prepared by their teacher.

When we give kids real things to do, and the power to create change, they see the purpose of tech & data science skills, and are eager to learn. Black and white balls in an urn, or teaching robots to push each other out of circles, don’t have nearly the same impact. The more open data we have, the greater the potential for projects that empower kids to make real change in their communities.

Imagine kids exploring pedestrian data in their local town centre, or tracking covid cases in their community. Imagine them evaluating the impact of nearby development on threatened species, or looking at the impact of dredging in Port Phillip Bay on dolphin numbers and behaviour.

Imagine them analysing traffic around their school and devising safer traffic management for school drop off and pick up times.

Or using Google mobility data to analyse covid lockdowns and figure out which country really had the strictest and the longest lockdowns in the world.

Or using public health and road accident data to try to figure out which is really the most dangerous in the long term – cycling, or inactivity.

Imagine… well… imagine exploring real, current data on anything the students are interested in!

As someone who is currently mobility impaired, I am fascinated and enraged by how much FURTHER all the accessible stuff is. The lift or ramp is miles away. The button you use to open the doors is so far from the doors themselves that anyone who is moving slowly can’t get there before the damned things close again. The “ambulant” toilet that has a door that opens outwards and bars I can use to help sit down and get up again is always, ALWAYS at the far end of the toilets.

What about a project where kids measure how much further people have to go to get around in a wheelchair, or on crutches?

What about one where kids measure, track, and try to fix (or have fixed) anything that’s a hazard for someone who’s vision impaired? Uneven footpaths, low hanging signs (also a hazard for any tall people on the clumsy side (hi)). Or measure and track accessibility on websites or social media – what about a project tracking the use of alt text on images on Mastodon?

Really, when you start thinking this way, there’s no limit to what you can study. And in every project, when kids come up with solutions to problems, because there’s no textbook answer, and no perfect solution to a real world problem, they have to evaluate their own solutions. Who does it help? Who does it harm? Does it make things better or worse? How can we improve it? Imagine if that was the standard approach to programmes in government!

None of these are easy questions. None of them have easy answers. But that’s fantastic preparation for life in the real world, where easy questions and obvious answers are conspicuous by their absence. And the Australian Data Science Education Institute is building these kinds of projects – using open data, or getting kids to collect their own data about problems in their local area – for kids as young as five years old. We’re building their critical thinking, rational scepticism, and STEM skills from the very start of their education, and we’re teaching them that they have the power to change the world.

A friend of mine recently hired a newly graduated data scientist who freaked out when his analysis did not produce the perfect curve his Masters of Data Science had led him to expect. Fortunately my friend is an expert statistician, and she was able to reassure him that real data almost never gives a perfect curve. But it’s a dreadful indictment of that Masters degree that all of this person’s training had been on perfect, textbook datasets. No wonder his poor brain exploded on contact with his first real dataset.

Now you have to bear in mind that not every school project will produce immediate, tangible results. But the fascinating thing is that just knowing that they are working on something real and meaningful gives students a mind blowing level of motivation and engagement. It also makes it much easier for them to imagine how the skills they are learning could be useful elsewhere. Even though they are learning the same essential coding skills they would learn in a unit where the goal is to push robots out of circles or draw pretty pictures, those projects seem to the students to lack relevance, and hence motivation.

Open Data combined with this kind of data science education gives students the power to change the world. And goodness knows we could use some of that!

So, as Open Data enthusiasts, I charge you with a few things that will help us turn your data into school projects.

  • Annotate your data for non experts. If you have a csv file with fields labelled with things like “fg62p”, it may be open, but it’s not really accessible. Provide a data dictionary that explains every field in clear and non-specialist language, and as much information as you can about how the data was collected, as well as any processing that’s been done on it. It’s fine, indeed helpful, to provide this as a separate file.
  • Use accessible, non proprietary formats wherever possible. CSV is ideal. We can’t make any assumptions about available software, hardware, or even internet access in schools. We might even be reduced to sneakernet, which brings me to
  • If your dataset is huge, provide some subsets if possible, with the dataset broken into meaningful units. This not only helps schools with bandwidth/connectivity issues, it also helps provide more manageable datasets for kids – and teachers – who are just getting started. For example, I have broken down the Google mobility data for Australia by state. For me it’s a simple Python script. For some teachers, it’s an insurmountable barrier.
  • Be open and explicit about the issues and limitations of your data. Are there significant time periods missing, patches where sensors failed or the internet went down, biases in the sample, and so on?
  • DON’T, please, please don’t clean your data. Real, messy, complicated datasets are the best possible preparation for the real, messy, complicated world. Let’s stop teaching kids that problems have simple answers, or even correct answers, and instead equip them to handle real life.
  • Provide contact information where possible for someone who can answer questions about the data. When I first used data from the AEC, I literally spent hours on the phone to the AEC trying to find someone who could explain the data to me. The file had a flat string representing a two dimensional ballot paper, and I was hoping to find someone who could confirm the mapping from ballot paper to string. I never did, so I had to figure it out from first principles. It all took a long time, and really, any project involving an interesting dataset takes ages to build. It’s part of the reason I founded ADSEI – because I could only do these cool data science projects because I was teaching part time, so I could use my own time to find interesting datasets, figure out what they meant, and construct a project around them. If I’d been full time, there’s no chance I could have made that happen. If you can’t provide your own contact details for some reason, or your dataset is particularly thrilling and you’re afraid you’ll be fielding questions from a zillion teachers and students every day, feel free to use me as an intermediary. It’s part of what I do. So we can put the dataset up with my email address on it, as long as I can contact you if/when I’m seriously stuck for an answer. At least that way you only get asked each question once.
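If you want to offer subsets, as suggested in the list above, the splitting script can be quite short. What follows is a rough sketch, not the script I actually used: the function name split_csv, the column names, and the sample values are all invented for illustration, and a real mobility file will have different headers. It writes one CSV per distinct value in a chosen column.

```python
import csv
import io
import tempfile
from pathlib import Path


def split_csv(source, column, out_dir):
    """Write one CSV per distinct value of `column`; return the paths created."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    reader = csv.DictReader(source)
    handles, writers, paths = {}, {}, {}
    try:
        for row in reader:
            key = row[column]
            if key not in writers:
                # Simple file name from the value; blank values go to unknown.csv.
                safe = key.replace(" ", "_") or "unknown"
                paths[key] = out_dir / f"{safe}.csv"
                handles[key] = open(paths[key], "w", newline="")
                writers[key] = csv.DictWriter(handles[key], fieldnames=reader.fieldnames)
                writers[key].writeheader()
            writers[key].writerow(row)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(paths.values())


# Made-up rows standing in for a real file; the headers are hypothetical.
sample = io.StringIO(
    "state,date,change_from_baseline\n"
    "Victoria,2020-04-01,-40\n"
    "Tasmania,2020-04-01,-35\n"
    "Victoria,2020-04-02,-42\n"
)

with tempfile.TemporaryDirectory() as tmp:
    created = split_csv(sample, "state", tmp)
    names = [path.name for path in created]
    victoria_lines = (Path(tmp) / "Victoria.csv").read_text().splitlines()

print(names)
```

Each output file keeps the original header row, so a teacher can open a single state's file without needing to know anything about the full dataset.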

A key part of ADSEI’s mission is to relieve teachers of the burden of finding interesting datasets and figuring them out, as well as helping to skill up teachers to teach Data Science themselves. Long term, of course, the goal is to put ADSEI out of business, because this will be the way teachers teach, the way they’re taught to teach, and the way the curriculum is written. But we’ve got a week or two before that happens sooo….

I set ADSEI up as a charity, because it is fundamental to me that funding must never be a barrier to access. Educational equality is how we build societal equality. That means we do charge, but those who can’t afford to pay don’t. It also means we’re always strapped for cash. Please donate!

The more authentic and meaningful data we have access to, the more kids we can empower to change the world using data science. And those kids who have learned that data science is a tool they can use to change the world, might just go on to make things better for all of us!
