One of the things ADSEI does in its lesson plans is ask the question: What is wrong with this data?
This is a really crucial question, because there is no such thing as a perfect dataset. All data has issues. Often it's not the data you want; it's simply the data you were able to get. For example:
- whale observations tell you how many whales were seen, when what you really want to know is how many whales were there. Some whales might have breached but not been observed (shades of Schrödinger's Whale), or swum by without breaching, or even been spotted twice and counted as two whales when it was really just one.
- speed cameras tell you the instantaneous speed of the car when what police really want to know is: has that car exceeded the speed limit at any time on this trip?
- counting the litter found in the schoolyard tells you how much litter you found, not how much litter was dropped – some of it may have blown away or be hidden under things. It also only tells you how much litter was there that day. What if a year level was out on excursion, or it was a wet day timetable…
And even when the data is actually what you want, there may be data that’s missing or flawed for various reasons. For example:
- Facial recognition systems that were trained on images of faces that were almost exclusively white and male.
- Phone polls that can’t include people with unlisted numbers.
- Internet polls that can’t include people without internet access.
- Surveys where people don't or can't tell the truth – for example about healthy eating, or sexuality – or where people don't actually know the truth, for example about why they did things, or things they don't remember (like what you had for breakfast yesterday, or how often you eat broccoli).
- Skipped data where someone forgot to record a daily observation or the system went down and didn’t record any values.
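That last kind of flaw – skipped observations – is one you can actually check for. Here's a minimal sketch (with made-up dates) of how you might find the gaps in a set of daily observations:

```python
# Sketch: find days with no recorded observation in a daily dataset.
# The dates here are hypothetical, purely for illustration.
from datetime import date, timedelta

observed = {
    date(2020, 3, 1),
    date(2020, 3, 2),
    date(2020, 3, 4),  # note: March 3 was never recorded
    date(2020, 3, 7),  # note: March 5 and 6 are also missing
}

def missing_days(observed_dates):
    """Return the days between the first and last observation
    that have no recorded value."""
    start, end = min(observed_dates), max(observed_dates)
    all_days = {start + timedelta(days=i)
                for i in range((end - start).days + 1)}
    return sorted(all_days - set(observed_dates))

print(missing_days(observed))
# prints the three missing days: March 3, 5 and 6
```

Finding the gaps is the easy part – the harder question, and the one worth asking students, is what the missing days might have looked like, and whether ignoring them biases the analysis.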
Consider the reporting around the coronavirus. The reported death rate of around 2% is highly speculative, because we have no idea how many mild cases of coronavirus are out there that are not being identified or reported. Some sources report the numbers and stress the uncertainty, while others report them as solid facts.
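You can make that uncertainty concrete with a few lines of arithmetic. This sketch uses hypothetical, illustrative numbers (not real case counts) to show how sensitive the fatality rate is to unreported mild cases:

```python
# Sketch: how the apparent death rate changes if mild cases go uncounted.
# All numbers here are illustrative, not real epidemiological figures.

reported_cases = 80_000  # confirmed cases (hypothetical)
deaths = 1_600           # deaths among them (hypothetical)

def fatality_rate(cases, deaths, undercount_factor=1.0):
    """Fatality rate if the true case count is
    reported cases * undercount_factor."""
    true_cases = cases * undercount_factor
    return deaths / true_cases

for factor in (1, 2, 5, 10):
    rate = fatality_rate(reported_cases, deaths, factor)
    print(f"If true cases are {factor}x reported: fatality rate = {rate:.2%}")
# If true cases are 1x reported: fatality rate = 2.00%
# If true cases are 10x reported: fatality rate = 0.20%
```

The same death count gives a fatality rate anywhere from 2% down to 0.2% depending on an assumption we can't verify – which is exactly why "around 2%" deserves the hedging that some reports leave out.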
This is a kind of scepticism and critical thinking that we don't often leave room for – in education, business, or journalism. Often we are in such a rush to get the "right" answer that we don't have time to pause and evaluate the data we're working with, to consider the flaws and uncertainty that are built into any dataset, and any analysis.
If we can teach our students, from pre-school onwards, to question their data – to ask "how many ways is this data flawed?" rather than assuming the data is perfect – then perhaps we can build a world that centres critical thinking and evaluates evidence.
This is why using real datasets, rather than nice clean sets of fake numbers, is crucially important to teaching data science. Real world datasets are never nice, clean, and straightforward. There is no need for scepticism and critical thinking in textbook examples. But kids who have used real data in their learning are equipped to tackle real world problems.
Can you share some examples of flawed data? What consequences have you seen from people assuming data is perfect?