There’s a recurring theme in the interviews I’ve done for the Make Me Data Literate podcast, and it turns out to be absolutely central to everything we do with data, though I hadn’t really thought about it before. It’s a simple idea, but it’s only once you start exploring examples that it blows your mind. The idea is this: before you collect data, you have to define what you’re collecting. And that definition can change everything.
We tend to think of data as clear, inviolable, simple, and obvious. If we count the number of students who sat Naplan, for example, that must obviously be one simple number, and we’d get the same result regardless of who we asked to count them. Right? Surely?
Well. Actually. It’s unexpectedly complicated. Naplan is four tests – Reading, Writing, Language Conventions, and Numeracy. So how do we count a student who only sat one of the four, because they were sick for the other three? Do we count them as a quarter of a student? Or do we only count students who sat all four tests?
Oh, and how do we define “sat the test”? What about a student who logged in to the online system to complete a test, but the internet dropped out part way through? What if they wrote their name on the top of the paper version but didn’t write anything else? What if a school had a power failure, or a fire alarm, or, all too likely now, was evacuated because of imminent flooding or bushfire? Do we count those kids as having sat Naplan? Do we count them if they were there for half the time, or only if they were there for the whole time?
Suddenly you can see how different people might return different numbers for how many kids sat Naplan, even though they’re working with the same information.
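To make that concrete, here is a minimal Python sketch. The five student records are invented for illustration (they are not real Naplan data), and the three counting rules are only my guesses at definitions someone might reasonably choose.

```python
# Invented records: which of the four Naplan tests each student completed.
# These are made-up illustrations, not real data.
TESTS = {"reading", "writing", "language_conventions", "numeracy"}

students = {
    "A": {"reading", "writing", "language_conventions", "numeracy"},
    "B": {"reading"},                         # sick for the other three
    "C": {"reading", "writing", "numeracy"},  # missed one test
    "D": set(),                               # logged in, completed nothing
    "E": {"reading", "writing", "language_conventions", "numeracy"},
}

def count_any(records):
    """Definition 1: a student 'sat Naplan' if they completed at least one test."""
    return sum(1 for done in records.values() if done)

def count_all(records):
    """Definition 2: only students who completed all four tests count."""
    return sum(1 for done in records.values() if done == TESTS)

def count_fractional(records):
    """Definition 3: each completed test counts as a quarter of a student."""
    return sum(len(done) for done in records.values()) / len(TESTS)

print(count_any(students))         # 4
print(count_all(students))         # 2
print(count_fractional(students))  # 3.0
```

Same five kids, same records, three different answers to "how many students sat Naplan?".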
Data always comes back to definitions. Even something as notionally simple as average global temperature is bizarrely complex. I asked a group of people in a workshop recently to define average global temperature for me. They said you add up all the temperatures and then calculate the average. Ok. Define “all the temperatures.” Which locations do we include? And how many measurements per day? Over what time frame? There is no single place we can measure, given the wide variation from place to place. There is no single time we can measure, given the wide variation over time. So how do we define average global temperature? How many datapoints over what period of time? Are the datapoints themselves raw values or averages? Do we include both land and water?
It turns out that average global temperature typically uses the daily maximum in each of the locations used, but that’s just a convention. There’s no reason it couldn’t be daily minimum. And the set of locations used is important if you want to end up with a value that you can compare with previous values.
So, what about the change in average global temperature over time? Change compared to what? For this, we need to define a reference point, or in this case a reference dataset.
The IPCC says that, “unless otherwise specified, warming is expressed relative to the period 1850–1900”, because it’s considered to be largely pre-industrial, and hence before humanity started pumping vast amounts of greenhouse gases into the air.
The climate spiral uses 1951-1980.
The Australian Bureau of Meteorology uses 1961-1990, which is the standard defined by the World Meteorological Organization.
In May 2021 the U.S. National Oceanic and Atmospheric Administration defined US Climate Normals as 1991-2020.
Before that, the standard reference period for Climate Normals was 1981-2010. They still recommend 1961-1990 for climate change tracking.
NASA, at least sometimes, uses 1951-1980 as a reference period.
The 5th IPCC report uses 1986-2005.
And that’s not the full list of reference periods. That’s just the point where my head exploded and I stopped looking.
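Here is a rough Python sketch of why the reference period matters. The temperature series is entirely synthetic (a made-up straight-line trend, not real measurements), and the function names are my own. The baseline periods are three of the ones listed above.

```python
from statistics import mean

def synthetic_temperature(year):
    """Made-up temperature: 13.5 degrees C in 1850, rising 0.01 per year.
    This is an invented trend for illustration, not real climate data."""
    return 13.5 + 0.01 * (year - 1850)

def anomaly(year, baseline_start, baseline_end):
    """Temperature in `year` minus the mean over the baseline period (inclusive)."""
    baseline = mean(
        synthetic_temperature(y) for y in range(baseline_start, baseline_end + 1)
    )
    return synthetic_temperature(year) - baseline

# Three of the reference periods mentioned above.
baselines = {
    "1850-1900": (1850, 1900),
    "1961-1990": (1961, 1990),
    "1991-2020": (1991, 2020),
}

for label, (start, end) in baselines.items():
    print(f"2020 relative to {label}: {anomaly(2020, start, end):+.2f} degrees C")
```

The year and the underlying temperature never change, yet the reported "warming" shrinks as the baseline moves later. Two sources can both be right and still quote very different numbers.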
It’s clear that definitions matter. Like the person who takes the minutes, the person who makes the definitions has the power.
The definitions determine all kinds of things. Who is normal? Who is included? Who is an outlier? What are we measuring? What do we value? And the trouble is that our biases inevitably invade our definitions. We tend to be careful to include ourselves, or things we are familiar and comfortable with, in our definitions, and we often forget to include people who might be different to us, or things we haven’t encountered before.
It’s distressingly common, for example, to see datasets counting men and women, but completely failing to include non-binary folks. Or to count people entering a building via the front steps, and fail to count those using the ramp round the side, thus excluding people pushing prams, using wheelchairs or other mobility aids, or even wheeling suitcases or trolleys. Or assuming marriages are heterosexual, or that kids have one Mum and one Dad, thus excluding gay parents and step parents, among others.
It’s quite common for government agencies, in particular, to ask whether you are working or a full time student, thus excluding part time students, people who are working AND full time students, or people who are unemployed or not working for health reasons. Indeed, the definition of unemployed is, itself, a minefield. Is it the number of people not in paid employment? How long must they have been out of work before they count as unemployed? Are we then counting retirees as unemployed? What about children? Shall we change the definition to those who have not worked in the last month AND are looking for work? What if they’ve worked for just an hour in the last month? What about volunteer work?
Once again, it’s unexpectedly complicated, which is an opportunity for the unscrupulous to shape the data to tell their preferred story. Want to minimise the unemployment rate? Tighten the definition to exclude anyone who has done volunteer work, or had just an hour of paid work, or anyone under or over certain ages. Want to maximise the unemployment rate? Change your definitions in the other direction.
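A small Python sketch shows how easily the count moves. The five people are invented, and none of the three rules is any agency's official definition. They are hypothetical rules I made up to illustrate the point, including the arbitrary 20-hour threshold.

```python
from dataclasses import dataclass

@dataclass
class Person:
    age: int
    paid_hours_last_month: int
    looking_for_work: bool

# Invented people, purely for illustration.
people = [
    Person(age=30, paid_hours_last_month=160, looking_for_work=False),  # full-time worker
    Person(age=45, paid_hours_last_month=0, looking_for_work=True),     # job seeker
    Person(age=22, paid_hours_last_month=1, looking_for_work=True),     # one hour of paid work
    Person(age=70, paid_hours_last_month=0, looking_for_work=False),    # retiree
    Person(age=10, paid_hours_last_month=0, looking_for_work=False),    # child
]

def unemployed_broad(p):
    """Hypothetical loose rule: anyone with no paid work in the last month."""
    return p.paid_hours_last_month == 0

def unemployed_narrow(p):
    """Hypothetical tight rule: no paid work at all AND looking for work."""
    return p.paid_hours_last_month == 0 and p.looking_for_work

def unemployed_underworked(p):
    """Hypothetical rule: looking for work and under 20 paid hours (arbitrary cut-off)."""
    return p.looking_for_work and p.paid_hours_last_month < 20

for rule in (unemployed_broad, unemployed_narrow, unemployed_underworked):
    print(rule.__name__, sum(rule(p) for p in people))
# unemployed_broad 3
# unemployed_narrow 1
# unemployed_underworked 2
```

The broad rule sweeps in the retiree and the child, and the narrow rule drops the person who worked a single hour. Nobody's circumstances changed, only the definition.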
One definition that shocked me arose in my interview with economist Dr Cameron Murray. The definition of inflation in the USA is different to the definition of inflation in Australia. Inflation, you might think, is a simple measure of how much prices have increased over a fixed period of time. But which prices? Groceries? Consumer goods? Utilities? And how much weight do we give to different categories? The American definition includes used cars and gives rent a weight of 30%, whereas the Australian definition doesn’t include used cars at all, and uses only an 8% weight for rent. So the American rate of inflation and the Australian one are not comparable at all. They have the same name, but entirely different definitions.
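To see how much the weights alone can matter, here is a deliberately oversimplified Python sketch. Only the two rent weights (30% and 8%) come from the figures above. The two-category basket and the price changes are invented, and real inflation baskets have many more categories than this.

```python
def weighted_inflation(price_changes, weights):
    """Weighted average of percentage price changes; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[category] * price_changes[category] for category in weights)

# Invented price changes over one year, in percent (not real figures).
price_changes = {"rent": 10.0, "everything_else": 2.0}

# Rent weights taken from the figures quoted above; the rest of each
# basket is lumped into a single made-up category.
rent_heavy = {"rent": 0.30, "everything_else": 0.70}
rent_light = {"rent": 0.08, "everything_else": 0.92}

print(f"{weighted_inflation(price_changes, rent_heavy):.2f}")  # 4.40
print(f"{weighted_inflation(price_changes, rent_light):.2f}")  # 2.64
```

Identical price changes, but one basket reports 4.40% and the other 2.64%, purely because of how much weight rent gets.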
Definitions can change the story, and hence the world. I have a whole blog post brewing on medical definitions, which you will probably find horrifying, but for now, I leave you with this one take away: It doesn’t matter what technology you teach, when you’re teaching Data Science. I don’t care whether you use Python, R, spreadsheets, or stacking blocks to make graphs and analyse your data. What matters, above all else, is that you teach your students to ask critical questions about the data. How was it collected? What are the definitions you used? How do we know the definitions are valid? What other definitions could we use, and how would that change the data?
If we all, as a society, can collectively look at data and ask these questions, we will be vastly better informed than we are now, and much less likely to be fooled.
What horror stories do you have about data definitions? Share them in the comments!
4 thoughts on “The power of definitions”
Reading this reminded me of a customer feedback survey from a lab where I worked years ago. We had customers signing up to get their favorite genomes sequenced (for free). It was like winning the lottery, so you’d think people would be pretty happy. We surveyed them about how they felt about winning (the lottery itself) and then at each step downstream (sending in their DNA, getting it sequenced, getting it assembled, holding the annotation jamboree, writing a publication) and to our dismay found that people got unhappier the further we went down the pipeline. We focused particularly on assembly and annotation, which had pretty average scores, and wrung our hands for a year or more trying to improve things. Finally, reexamining the original survey (“How do you like stage X?” on a scale of 1 to 10, 10 being really awesome), we realized we had never given people an N/A option. The vast majority of customers were still working through the pipeline and, if they hadn’t gotten to stage X, seemed to have given us a 5 (an average score, since they didn’t know how it was going to go). We had just spent a year optimizing the wrong thing due to bad survey design.
That’s a great story! Surveys are such a trap like this. If you insist people answer every question, you’d better be sure every question has a meaningful answer for every respondent!
I think the funniest example is the ABS classifying Luna Park as “Parkland” in the 2016 mesh blocks. From the ABS mesh block definitions:
“Parkland: will mainly contain parks, nature reserves, public open space, and other minimal use protected or conserved areas and, where possible, will have a zero population count. Parkland Mesh Blocks may also include sporting arenas or facilities, including racecourses, golf courses and stadiums. These facilities may not be open to the public.”
This is stretching it for Luna Park which I don’t think is “minimal use”. But this lumps local parks, golf courses, and national parks all in one category and that makes for a larger conundrum when you want to see how much “Parkland” Australia has.
On a different note, I think the Americans are right for once: They include transport and a substantial amount of rent in the cost of living!
That’s so interesting! It would never have occurred to me to include Luna Park as “Parkland”. Nor the MCG or Flemington Racecourse, come to that!
And I definitely agree, I think the USA definition of cost of living makes more sense than ours… I wonder if it’s time to create a new definition!