One of my early Data Science projects in class came about because of a student’s fascination with politics. I am obsessed with politics myself, so this student, Jack, and I had lots of conversations about the federal election as it happened. During one of these conversations, Jack told me that he had been playing with the data on the Australian Electoral Commission website.
When I went and had a look, I discovered that it was possible to download a spreadsheet file that contained every single vote in the federal election. So I downloaded the Victorian Senate votes. Each line in the file contained the electorate, polling booth, and box-by-box information about how each voter had numbered their ballot paper. Obviously the votes are anonymous, but there’s a wealth of information in that file that had my head spinning.
You could analyse the data to answer not only the simple questions, like who got the most first preferences, but also more complex ones. Where did people who voted 1 for The Australian Greens put their 2? How many people voted below the line? Did the first preferences of people who voted below the line differ from those of people who voted above it?
(If you are not familiar with the Australian voting system and these terms aren’t making sense to you, you can read some of the details here. It’s bizarrely complex.)
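To give a feel for the kind of analysis involved, here is a minimal sketch in Python of two of those questions: counting first preferences, and finding where one party's voters put their 2. The ballot layout, the party names and the helper functions (`party_ranked`, `first_preferences`, `second_preferences_of`) are all made up for illustration. The real AEC file is laid out differently and is far larger.

```python
from collections import Counter

# Hypothetical, simplified ballots: each dict maps a party to the number
# the voter wrote in that party's box. This is NOT the real AEC file format.
ballots = [
    {"Greens": 1, "Labor": 2, "Liberal": 3},
    {"Greens": 1, "Liberal": 2, "Labor": 3},
    {"Labor": 1, "Greens": 2, "Liberal": 3},
    {"Liberal": 1, "Labor": 2, "Greens": 3},
    {"Greens": 1, "Labor": 2, "Liberal": 3},
]


def party_ranked(ballot, rank):
    """Return the party given this preference number, or None if nobody got it."""
    for party, number in ballot.items():
        if number == rank:
            return party
    return None


def first_preferences(ballots):
    """Count how many ballots gave each party the number 1."""
    return Counter(party_ranked(b, 1) for b in ballots)


def second_preferences_of(ballots, party):
    """Among ballots that put `party` first, count who received the number 2."""
    return Counter(
        party_ranked(b, 2) for b in ballots if party_ranked(b, 1) == party
    )


first = first_preferences(ballots)
greens_second = second_preferences_of(ballots, "Greens")
print(first.most_common())
print(greens_second.most_common())
```

Note that this sketch quietly assumes every ballot is valid, which a real dataset will not guarantee.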
The project was a tricky one because we first had to understand the format of the data. The boxes on the ballot paper are not numbered, so it took us some time to match each box with the correct field in the spreadsheet.
It was great, too, because being a real dataset, it didn’t always follow the rules. In the Australian Senate you are only allowed to use the number 1 once, and you either vote below the line OR above it. Not both. Some voters, though, had not followed the rules, so an analysis that assumed valid votes was doomed to failure. It was a great lesson in the complexity of real datasets.
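A check for the two rules described above might look something like the following sketch. It assumes, purely for illustration, that each ballot has already been split into two lists of box values, one for above the line and one for below, with `None` for a blank box. The function name `classify_ballot` and its labels are my own invention, and this is a deliberately simplified test, not the AEC's actual formality rules.

```python
def classify_ballot(above, below):
    """Classify one ballot under the simplified rules described in the text.

    `above` and `below` are hypothetical lists of the numbers written in the
    boxes above and below the line (None for a blank box). This is only a
    sketch, not the AEC's real formality test.
    """
    ones_above = above.count(1)
    ones_below = below.count(1)
    marked_above = any(value is not None for value in above)
    marked_below = any(value is not None for value in below)

    # Rule: vote above the line OR below it, not both.
    if marked_above and marked_below:
        return "both sides marked"
    if not marked_above and not marked_below:
        return "blank"
    # Rule: the number 1 may be used only once (and must appear).
    if ones_above + ones_below != 1:
        return "number 1 not used exactly once"
    return "above the line" if marked_above else "below the line"


print(classify_ballot([1, 2, None], [None, None, None]))
print(classify_ballot([1, 1, None], [None, None, None]))
print(classify_ballot([1, None, None], [1, 2, 3]))
```

Sorting the ballots into categories like these before doing any counting makes it obvious how many votes an analysis would otherwise silently mishandle.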
I had the students first come up with a question the data could answer – and that was fascinating to start with, because some would ask questions such as “which is the best party?” which, of course, the data cannot answer. But it can answer which party got the most votes.
Some students chose to explore the differences between rural and urban voters, which necessitated finding a way of categorising a particular electorate – an interesting can of worms in its own right.
Some looked at the voting patterns of their own electorates. Some looked at which party’s voters followed the How To Vote cards provided by their parties – now that was an interesting one!
This idea of students taking a rich and complex dataset and exploring the questions they find most interesting is a really powerful one. It provides built-in differentiation, and gives the students a lot of motivation when they have the freedom to explore.