Data Science Explainer

What is wrong with this data analysis?

So someone is using quantitative data to justify something. How can you figure out whether the analysis is valid, or whether there are holes in it you could drive a truck through? It’s not always easy. Sometimes it’s not even possible without access to the raw data! But here are some starter questions you can ask about the data analysis, to help figure out where the issues are. (There are always issues: no analysis is perfect, just as no dataset is ever perfect. The question is: is it good enough for my purposes?) We mostly don’t have access to the analysis itself, so we need to figure out as much as possible from the write-up.

I’m not going to get into the intricacies of technical statistics here. This piece is intended, rather, to be a guide to thinking critically about what other people say about their data analysis: a guide for high school students, or adults without any data experience, who are presented with a report about data and need to evaluate its credibility. It follows on from “What is wrong with this data?”, which should be read first.

Question 1: Are the limitations of the data & analysis acknowledged?

If the write-up assumes, or outright claims, that the data is perfect and unassailable, this is a huge red flag. There is no such thing as perfect data. A report that does not clearly set out the limitations of its data is immediately suspect. Limitations include flaws in the data (see previous post), samples that are too small or not representative of the whole population, analysis techniques that don’t perfectly suit the data for some reason, results that are statistically significant but only just, and results that are stated as a relative change in risk rather than as absolute risk.

That last one bears particular attention, because it’s a very common problem. It often shows up in the shape of a headline that screams “Eating <bacon> increases risk of <bowel cancer> <tenfold>!” (where you can sub in any number of foods, diseases, and risks). This sounds objectively terrible, and may well be so, except that often the existing risk was something tiny, like 0.0001 cases per million people, which, when you multiply it by 10, is still only 0.001 cases per million. Because headlines saying “risk goes from 0.0001 cases per million to 0.001 cases per million” aren’t nearly so clickbaity, reporting of change in risk is disturbingly popular.
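To make the arithmetic concrete, here is a minimal Python sketch that turns a "tenfold" headline into absolute numbers. The function name `describe_risk` and the population of 25 million are assumptions made up for illustration; only the 0.0001 per million baseline and the factor of 10 come from the example above.

```python
def describe_risk(baseline_per_million, relative_increase, population_millions):
    """Turn a 'risk goes up N-fold' claim into absolute numbers."""
    new_per_million = baseline_per_million * relative_increase
    absolute_increase = new_per_million - baseline_per_million
    return {
        "baseline_per_million": baseline_per_million,
        "new_per_million": new_per_million,
        "absolute_increase_per_million": absolute_increase,
        # Hypothetical population size, used only to give a sense of scale.
        "extra_cases_in_population": absolute_increase * population_millions,
    }


summary = describe_risk(
    baseline_per_million=0.0001,  # baseline from the example above
    relative_increase=10,         # the "tenfold" in the headline
    population_millions=25,       # assumed population, not from any real study
)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The same tenfold increase applied to a baseline of, say, 1,000 cases per million would be a very different story, which is why a report should give you the absolute figures as well as the multiplier.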

Any analysis that doesn’t state its flaws clearly and publicly is immediately suspect.

Question 2: Does the report claim the data explains why?

Quantitative data cannot tell you why. It can tell you what, but never why. So it might show that fewer people caught public transport in Melbourne in 2022 than in 2019, but it cannot explain why that happened. Data doesn’t explain things. It can answer quantitative questions like “how many?”, or “what’s the largest/hottest/farthest/most popular?” but it can never answer qualitative questions like “why?” or “what’s the best/nicest?”

Any report that claims the data tells you why something happened is, at best, misinterpreting the data. At worst it’s trying to mislead you. Telling you the data explains something is another red flag.

Be on the lookout for reports that say “the data shows we must…” or “the data explains why…” Data can’t explain things, and it cannot tell you what to do.

Question 3: What happens to outliers?

Outliers in the data can be a problem. They can be mistakes due to measurement errors, where the sensors went a bit haywire, or they could be due to people simply typing the wrong number. Sometimes, though, outliers are real, meaningful values. We like to believe that there are simple rules for telling the difference, but in practice it’s disappointingly complicated.

I recently heard a story at a dinner party from someone who had just completed a third-year research methods class. Their class was told by the lecturer that if any of the data didn’t look right, they should just delete that line. I nearly choked on my dinner. We can easily rule out outliers that are impossible values, like someone’s height being recorded as 185m. That’s an obvious mistake, when you know the context. These are human heights. It’s far more likely to be 185cm. 185m is not a reasonable height for a human being.

But even when we know the context, sometimes outliers that look wrong are, in fact, real values that can tell us things about the situation we are exploring. For example, if a normal count of bees in a particular area is between 30-100, a sudden count of 30,000 could be an error, or it could be a swarm. Similarly a sudden spike in my heart rate could be an error, or it could be that a spider jumped onto my arm.

I once filled in a sleep tracking questionnaire for a month that took in all times as 24-hour times. The first few times I did it, I unthinkingly put 10 as my bedtime, because I don’t think in 24-hour time. While I knew that was an error, the researchers who received that data had no way of knowing (without checking with me) whether it was a mistake, or whether I had stayed up all night partying (It’s possible, ok!?! Even at my age!).

Because of this complexity, while there are certainly formulae to identify outliers, there are no golden rules or formulae to know which outliers should be thrown out and which ones should be kept for analysis. It is very dependent on the context of the data, and sometimes we simply can’t be sure. Outliers are a signal to investigate the data further. Any data analysis that throws out outliers without an incredibly strong explanation for doing so is immediately suspect. Anyone who tells you there are hard and fast rules for eliminating outliers is vastly oversimplifying a fiendishly complex issue.
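As an illustration of "a formula to identify outliers", here is a small sketch of one common convention, the 1.5 × interquartile range rule. It is only one of several rules in use, and the bee counts are invented for the example. Note that the function (a hypothetical `flag_outliers`) only flags values for a human to investigate. It deliberately does not delete anything.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the k * IQR fences.

    This flags values for investigation; it does not remove them.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [
        (i, v) for i, v in enumerate(values)
        if v < lower_fence or v > upper_fence
    ]


# Invented bee counts: mostly within the usual range, plus one huge count.
bee_counts = [30, 45, 60, 72, 88, 100, 30000]

for index, count in flag_outliers(bee_counts):
    print(f"Reading {index} ({count}) looks unusual: error, or swarm? Go and check.")
```

The formula can tell you that 30,000 is unusual. It cannot tell you whether it was a typo or a swarm, and that part still needs context.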

Question 4: Do the results look perfect?

Perfect results from real data are a huge red flag. In fact, the stronger the result, the more likely it is that the data analysis is not doing what we think it’s doing. With machine learning systems, getting 100% accuracy is an indicator that the system has learned to do something other than the intended task – such as finding a copyright statement rather than a horse in an image. Similarly, analyses that find huge statistical significance, or perfectly clear results from real data are immediately suspect. Sometimes it’s because the data has been cleaned over-enthusiastically, so that all the meaningful mess has been removed. Sometimes it’s because the wrong analysis was used. Sometimes it’s because ChatGPT was asked to do the analysis. Sometimes it’s an outright lie.
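Here is a toy sketch of how a "perfect" score can hide a system that learned the wrong thing. Everything in it is invented for illustration: the data is made up, and `shortcut_classifier` is a stand-in for a model that has latched onto a copyright statement instead of the horse.

```python
# Each photo is (has_copyright_text, is_actually_a_horse). All data is invented.
# In the training photos, every horse picture happens to carry copyright text.
training_photos = [(True, True)] * 50 + [(False, False)] * 50

# New photos from a different source: horses, but no copyright text.
new_photos = [(False, True)] * 50 + [(False, False)] * 50


def shortcut_classifier(has_copyright_text):
    """Stand-in for a model that 'detects horses' by spotting copyright text."""
    return has_copyright_text


def accuracy(photos):
    """Fraction of photos where the prediction matches the true label."""
    correct = sum(
        shortcut_classifier(has_text) == is_horse
        for has_text, is_horse in photos
    )
    return correct / len(photos)


print(f"Accuracy on training photos: {accuracy(training_photos):.0%}")
print(f"Accuracy on new photos:      {accuracy(new_photos):.0%}")
```

The perfect score on the first set says nothing about horses. It only shows up as a problem when the shortcut stops working.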

We’ll talk about this issue more in the post on visualisations, as it’s often particularly obvious in graphs.

Question 5: Have they confused correlation with causation?

Assuming correlation means causation is a classic and incredibly common mistake. “When value A goes up, so does value B, therefore A causes B.” It’s a useful kind of brain hack for surviving the world. If you start getting sick every time you eat bread, not eating bread seems like a wise move. If your partner gets asthma every time you apply a particular perfume, it makes sense to stop using that perfume. In these cases, acting as though correlation means causation is a sensible precaution. The trouble is, they are not the same. Things being correlated doesn’t always mean there’s a causal relationship. Just because B follows A doesn’t mean that A causes B.

You might assume that suddenly getting sick every time you eat bread means bread is bad for you, but it might also be that the weather is more humid than usual, and the bread has gone mouldy. Similarly, it might be that you only apply a particular perfume when you go out to dinner, and since you always go out to the same restaurant, it turns out that the asthma is caused by the spray they use to clean the tables. Correlation might mean causation, but it might not. Sometimes the causal relationship is more complicated, and sometimes it doesn’t exist. Correlation might be coincidence, or it might disappear if you collected more data. More investigation is always needed to rule out different causes, and to test the actual cause. So if a report finds two values trending together and says it means one causes the other, that’s a red flag.
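The perfume story can be sketched as a small simulation. All of the probabilities below are invented for illustration. In this made-up world, asthma depends only on whether you went out to dinner (the table spray), and never on the perfume, yet perfume and asthma still end up strongly linked in the raw numbers.

```python
import random


def simulate_nights(n_nights, seed=0):
    """Simulate (dinner_out, perfume, asthma) for each night. Probabilities are invented."""
    rng = random.Random(seed)
    nights = []
    for _ in range(n_nights):
        dinner_out = rng.random() < 0.3
        # Perfume is mostly worn on dinner nights.
        perfume = rng.random() < (0.9 if dinner_out else 0.1)
        # Asthma depends ONLY on being at the restaurant, not on perfume.
        asthma = rng.random() < (0.8 if dinner_out else 0.05)
        nights.append((dinner_out, perfume, asthma))
    return nights


def asthma_rate(nights):
    return sum(asthma for _, _, asthma in nights) / len(nights)


def perfume_gap(nights):
    """Asthma rate on perfume nights minus asthma rate on no-perfume nights."""
    with_perfume = [n for n in nights if n[1]]
    without_perfume = [n for n in nights if not n[1]]
    return asthma_rate(with_perfume) - asthma_rate(without_perfume)


nights = simulate_nights(20000)
overall_gap = perfume_gap(nights)
gap_dinner_nights = perfume_gap([n for n in nights if n[0]])
gap_home_nights = perfume_gap([n for n in nights if not n[0]])

print(f"All nights:         perfume gap = {overall_gap:+.2f}")
print(f"Dinner nights only: perfume gap = {gap_dinner_nights:+.2f}")
print(f"Home nights only:   perfume gap = {gap_home_nights:+.2f}")
```

Looking at all nights together, perfume appears to be the culprit. Once you compare like with like (dinner nights with dinner nights, home nights with home nights), the gap shrinks to roughly nothing, because the restaurant was doing the work all along. In real data you rarely know in advance what the "restaurant" is, which is why more investigation is needed.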

Now what?

These questions are by no means the only ones to ask, but they’re great first steps in trying to figure out how valid any given data analysis might be. The next post in this series, “what’s wrong with this data visualisation?” is perhaps the most useful of the lot, because visualisations tell us so much about the data, the analysis, and the stories being told, and they are very easy to get wrong, or to use deceptively.

What other questions would you add to this list? Add them in the comments!

2 thoughts on “What is wrong with this data analysis?”

  1. When I see quantitative research claims, the first thing I always ask is “what was the question?”
