
What’s wrong with this data?

When you’re dealing with real data, rather than textbook datasets that always come out with a nice clear answer and a perfect curve, there’s always something wrong with it. There will be flaws or unexpected complexity in the collection process, limitations of sensors, ambiguous survey questions, or sampling bias that means your answers would be different if you collected the data somewhere else or in some other way. There will be outliers that might be real or might be errors. There will be conditions under which the data was collected that can change, and affect the data in unexpected ways. There is always something. Usually there are quite a lot of things!

We could easily spend years talking about all the possible problems with data, but that’s not particularly helpful when you have a dataset, or someone else’s analysis, in front of you here and now. So what are some simple questions you can ask about quantitative data to figure out whether you can trust it or not? We’ll start with the data itself, then do separate posts on the analysis and the visualisation. This list is not exhaustive, by any means, so if you have other favourite questions, please add them in the comments!

Question 1 – What is the sample size?

One of the most important questions for figuring out the validity of the data is “what is the sample size?” Sample size helps determine how accurately the population or situation you are considering is represented. The larger the sample, the more likely it is to represent the situation accurately, but the more expensive it is to collect and analyse. It can be tempting to stop collecting data when you have proven your point, but collecting for another day might show that your point is actually wrong.

It’s important to ask how large the sample size was, and how that relates to the size of the population, or the complexity of the situation you’re looking at. For example: How many people were surveyed to collect this data? How many birds were measured? How many days did you count cars? How many years of temperature data did you use? How long did the sensors run for, and how often did they collect data?
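
If you want to see why this matters, here is a tiny Python sketch. The population and its 30% “true” proportion are entirely invented for illustration: it simply re-estimates the proportion at different sample sizes, and small samples can land a long way from the truth while larger ones settle down.

```python
import random
import statistics

random.seed(1)

# An invented population of 10,000 people, 30% of whom like large shopping centres.
population = [1] * 3000 + [0] * 7000

for n in (10, 100, 1000, 5000):
    sample = random.sample(population, n)
    print(f"sample size {n:>5}: estimated proportion = {statistics.mean(sample):.2f}")

# Typically the n=10 estimate can land anywhere from around 0.0 to 0.6,
# while the n=5000 estimate sits very close to the true 0.30.
```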

Question 2 – How was it collected?

The companion to sample size is how the sample was collected, and whether it was representative of the population or situation as a whole. For example, if you are collecting data about whether people like going to large shopping centres or not, collecting data at a large shopping centre is going to include significant bias, because people who really don’t like large shopping centres (like me!) won’t be there. Similarly, if you use weather data from Melbourne in December, you can’t use that to represent the entire year, because Melbourne’s weather varies seasonally. You also can’t use December 2023 to represent all Decembers, because the seasons also vary from year to year, and in fact Melbourne’s weather was quite unusual last December.

If you collect data about traffic on a Monday, it may look different to data on a Wednesday, particularly if Monday was a rostered day off for all the local building sites and factories. Some years ago my local council collected data about traffic on a major feeder road near a university, but it collected the data during university holidays, so the resulting traffic looked much lighter than it would during term time. That’s fine if you want to know what traffic is like during university holidays, but not fine if you want to know what traffic is like year round.

Whether a sample is biased depends on the context, but also on the question being asked. If I want to know how people who shop at large shopping centres feel about them, then surveying people at large shopping centres makes sense. But if I want to know how the whole population feels about large shopping centres, I’d have to find another way to collect my data.

Question 3 – What are the flaws in the way the data was measured or observed?

No measurement is perfectly accurate. No observation is perfectly complete. There are always errors and flaws. The question is not “is it accurate?” The question is “Is it accurate enough?” If you want to measure the heights of a group of people in order to sort them into height order, then accuracy is probably not super important. You don’t even need numbers, you just need to measure people against each other. But if you want to know the average height of the same group, you do need measurements, though probably accuracy to the centimetre is enough. You don’t need to (and probably can’t) measure to the micron.

Similarly, if you are counting all of the white cars in the traffic passing a certain point, there is a good chance you’ll miss some when the traffic is heavy, you are distracted for a moment, or you sneeze at the wrong time. If you are counting whales passing a particular point on the coast, you are unlikely to see them all. When you mark an exam, there’s always a chance you mark a question incorrectly, or misread a student’s handwriting.

We often think of tech devices as perfectly accurate, but it’s important to remember that there is no such thing as perfect accuracy, even using electronics. All sensor data is flawed. The questions we need to ask are: how flawed, and in what ways? A sensor measuring CO2 levels, for example, might be thrown off by your breathing if you are too close to it.

One set of sensor data I worked with measured pollutants in the air. Occasionally the concentration of some pollutants measured by the sensors went negative. Concentrations can’t physically go lower than 0, though – how can you have negative amounts of nitrous oxide in the air? The answer, when I contacted the scientists who had made the measurements, was that the sensors were unreliable at low concentrations, so the values fluctuated a bit. The negative values weren’t real; they were artefacts of the sensors’ limitations. Similarly, a set of solar data I worked with had negative amounts of power being produced at dawn and dusk – again, an indication of the inaccuracy of the sensors.
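
A simple sanity check, along the lines of the sketch below, is often enough to surface this kind of artefact. The readings are invented, and a real pipeline would be more careful, but the idea is to flag the physically impossible values and go back to the people who collected the data, rather than silently deleting them.

```python
# Invented hourly pollutant readings: (time, concentration in parts per million).
readings = [
    ("01:00", 0.031),
    ("02:00", 0.012),
    ("03:00", -0.004),   # physically impossible
    ("04:00", 0.019),
    ("05:00", -0.001),   # physically impossible
]

# Flag anything below zero for follow-up rather than deleting it.
suspect = [(time, value) for time, value in readings if value < 0]

print(f"{len(suspect)} of {len(readings)} readings are below zero")
for time, value in suspect:
    print(f"  {time}: {value}  <- likely a sensor artefact at low concentrations")
```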

Sensors always have limitations. The question is: are the values collected by the sensor accurate enough to answer the question I’m asking?

Survey answers are particularly likely to include errors, because people give false answers to surveys for a huge range of reasons.

How much you can trust survey answers depends on many factors, such as how personal, embarrassing, or controversial the questions are, who is asking them, whether the survey is conducted face to face, over the phone, or online, what the purpose of the survey is, and how seriously people take it.

Question 4 – Is the data measuring what we think it’s measuring?

Often the data we want is not actually available or collectable, so we collect the next best thing. This is known as proxy data – the data we have is a proxy for the data we need. For example, exam results are a proxy for how much the student has learned. We hope they’re close to the same thing, but there’s a risk we forget that they’re not actually the same. Exam results can be influenced by all kinds of things – ambiguous questions, noisy or uncomfortable exam conditions, exams that don’t cover the full content of the course, exams that test recall instead of understanding, external stress, among many others.

The number of covid cases reported (back when covid data was collected and reported) was not actually the number of covid cases in the community. It was the number of people who tested positive for covid AND who reported their result. It excluded people who tested negative even though they did actually have covid (false negatives) and people who didn’t test but did have covid, and it included people who tested positive but did not have covid (false positives – these were far more rare than false negatives).

I don’t have a weather station or rain gauge, so if I want to know how much rain fell on my garden last week, I rely on the weather bureau’s measurements. Those measurements are taken at the weather station for my suburb, though, which is not on my property. Rainfall can be surprisingly patchy, so there might be more or less rainfall at the weather station than my garden actually receives. In most instances, though, it’s a reasonable proxy (as far as I can tell, I haven’t actually done a study!).

The data we collect or have access to is very often a proxy for the data we want, and it’s important to understand, and factor in, the differences that might exist between the proxy and the real thing.

Question 5 – Do we have access to the raw data?

Is the raw data (the data exactly as it was collected) publicly available, or has it been aggregated or summarised in some way? What was lost in that process? If the data is not available at all, why not?

Any time data is summarised, information is lost. Averages mean we lose information about the highest and lowest values, and the extent of the variation. Sometimes that doesn’t matter, sometimes it does. If you want to know how hot it was in Melbourne yesterday, the average maximum temperature for Melbourne over the past week is meaningless, because it varied so much. The average maximum temperature, however, is useful to track if you are investigating climate change. Whether the data lost in the summarising process is a problem or not depends on the question you want answered.
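
A two-line example makes the point. The temperatures below are made up, not Bureau of Meteorology data: the weekly average is a perfectly good number, it just can’t tell you about the extremes.

```python
import statistics

# Invented daily maximum temperatures for one Melbourne week, in degrees Celsius.
daily_max = [24, 41, 19, 22, 38, 21, 26]

print("average max for the week:", round(statistics.mean(daily_max), 1))  # about 27.3
print("actual range:", min(daily_max), "to", max(daily_max))              # 19 to 41
```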

Even if the data is simply aggregated – e.g. all the results from survey question 1 combined, then all the results from question 2, and so on – we can sometimes lose information. Sometimes we want details that combine answers to different questions, like: what did people who ate broccoli for lunch then have for dinner? Or who did people who put the Greens candidate first then rank second on their ballot papers?
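
Here is a sketch of that loss, using invented survey responses: once the answers have been aggregated question by question, the link between lunch and dinner choices is gone, and it can’t be reconstructed from the totals alone.

```python
from collections import Counter

# Invented individual responses: each dictionary is one person's survey answers.
responses = [
    {"lunch": "broccoli", "dinner": "pasta"},
    {"lunch": "broccoli", "dinner": "salad"},
    {"lunch": "sandwich", "dinner": "pasta"},
    {"lunch": "sandwich", "dinner": "soup"},
]

# The aggregated summary: totals per question, which is often all that gets published.
print(Counter(r["lunch"] for r in responses))
print(Counter(r["dinner"] for r in responses))

# The cross-question detail we might actually want, only recoverable from the raw data.
print(Counter(r["dinner"] for r in responses if r["lunch"] == "broccoli"))
```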

If the data is not available at all, it can be very difficult to evaluate the quality of the data and the integrity of the analysis. In that case, it’s significant whether the author includes details about the flaws in the data. We know that real data is always flawed, so any report that does not acknowledge that is immediately suspect.

Question 6 – What else might be going on here?

What confounding factors could there be in the context? Is there some feature of the time or way that the data was collected that would explain it? For example, if a person were collecting data about dead antechinuses, they might assume that there was a sudden outbreak of disease in late winter that only affects males. If you know much about antechinuses, though, you know that the males die of exhaustion after their first breeding season, which happens in late winter.

Despite the context, this antechinus is neither dead nor a boy. It is, however, the only photo I have of an antechinus.

Similarly if you count traffic on a Monday, you might think that people drive to work less consistently on Mondays, unless you know that the local building sites and factories all had a rostered day off that day, and that only happens once a month. Or if you count whales passing by the coast off Lorne in December, you might think whales are extinct, unless you know that whales mostly only pass by that coast from May to September.

Question 7 – What is missing from the simulation or model?

Sometimes data is simulated, or modelled, rather than collected or observed. It’s impossible to simulate a real world phenomenon exactly, so what factors are missing from the simulation or modelling process, and how will that affect the data? Sometimes details that are used in the simulation are simplified or summarised. For example, a simulation of the spread of fire might use the average vegetation cover in an area, rather than a tree-by-tree, shrub-by-shrub map of the terrain. That might change the way the simulated fire behaves, and the ways in which the simulation differs from the real situation are not necessarily predictable.

Modelling the impact of a particular drug on the body might only model the nervous system, which ignores any possible input from the circulatory system. Sometimes these differences are important, sometimes they’re not. Once again, it depends on the question you’re asking. It’s important to know, though, whether the data is simulated or real, and how much real world detail the system contains.

Question 8 – What bias or conflict of interest might there be?

When a drug company funds a study evaluating its own product, or a government funds research into its own programs, the funding can influence the researchers’ findings – sometimes deliberately, sometimes subconsciously. Our own unconscious biases can also influence the way we investigate an issue, so it’s important to know whether the people collecting, analysing, and communicating the data had any particular goal in mind. There’s a crucial difference between setting out to explore something and setting out to prove something, even if we think we’re being objective.

It’s important to ask whether the people who collected the data have any conflicts of interest, or obvious biases that might have led them, consciously or unconsciously, to want to find a particular result.

Question 9 – Does the data look too good?

Data that is super neat and produces a perfect curve, or that increases or decreases steadily and predictably, without any fluctuations, is immediately suspicious. The two graphs below were produced in Georgia, USA, during 2020. The first was touted as showing that cases of covid were decreasing. It is suspicious because each county’s data trends very consistently downwards. Upon investigation, it turned out the data had been sorted by value, not by date (a close look at the dates on the X axis shows that they are not in chronological order). The more accurate second graph, using the same data, shows a much more complicated story.
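
You can reproduce the effect with a few lines of Python and some invented numbers (these are not the actual Georgia figures): sort daily counts by value and any dataset looks like a tidy decline; sort by date and the real fluctuations come back.

```python
# Invented daily case counts, for illustration only.
cases = {
    "2020-04-01": 120,
    "2020-04-02": 95,
    "2020-04-03": 160,
    "2020-04-04": 140,
    "2020-04-05": 80,
    "2020-04-06": 170,
}

by_value = [count for _, count in sorted(cases.items(), key=lambda item: item[1], reverse=True)]
by_date = [count for _, count in sorted(cases.items())]  # chronological, since the dates are ISO format

print("sorted by value:", by_value)  # [170, 160, 140, 120, 95, 80] -- looks like a steady decline
print("sorted by date: ", by_date)   # [120, 95, 160, 140, 80, 170] -- the messier truth
```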

Data that shows exactly what the author is trying to show, with no variation, complexity, or outliers, always requires a second look.

These are just a few of the questions we should be asking about data we are presented with. Still to come: questions to ask about data analysis, and questions to ask about visualisation. Please add the questions you ask about data in the comments, or email them to me!

For more about how to teach using real data, grab a copy of Raising Heretics: Teaching Kids to Change the World. And to support this work, donate today at givenow.com.au/adsei
