Measurable or Meaningful: Pick One

This is an excerpt from Raising Heretics: Teaching Kids to Change the World

What kind of people do we want our kids to be?

When I first sat down to write Raising Heretics, I did a trawl of university websites around the world to find out which attributes they consider most important for their graduates to display.

Monash University in Melbourne, for example, says this:

Monash University prepares its graduates to be:
Responsible and effective global citizens who:
* engage in an internationalised world
* exhibit cross-cultural competence
* demonstrate ethical values

Critical and creative scholars who:
* produce innovative solutions to problems
* apply research skills to a range of challenges
* communicate perceptively and effectively.

While Adelaide University has this list:
Attribute 1: Deep discipline knowledge and intellectual breadth
Attribute 2: Creative and critical thinking, and problem solving
Attribute 3: Teamwork and communication skills
Attribute 4: Professionalism and leadership readiness
Attribute 5: Intercultural and ethical competency
Attribute 6: Australian Aboriginal and Torres Strait Islander cultural competency
Attribute 7: Digital capabilities
Attribute 8: Self-awareness and emotional intelligence

Most institutions publicise such a list, and they vary in the details, but the high-level attributes are remarkably consistent. They say they want Creative, Ethical Problem Solvers, With Knowledge of a Discipline.

Which is interesting, because most university courses don’t measure the first three of those points much, if at all. This is largely because assessment of knowledge of a discipline is easy – simply measure how many facts a student knows, how many of the procedures and processes they can apply. This is quite straightforward to assess with a standard assignment/exam combination.

One of the key qualities of an assessment is validity – does the assessment actually measure what we think it measures? With facts and known processes and procedures, validity is relatively easy to achieve. Even so, there are always issues with exam validity. Some students don’t perform well in exams, for example. If they freeze with anxiety under exam conditions, their answers to exam questions might not be very accurate measures of how much they know. Sometimes exam questions are ambiguous, or confusingly worded. Sometimes they don’t ask what we thought they asked.

Reliable assessment is assessment that gets the same result every time. If you run the same assessment, with the same student, under the same conditions, you should get the same result. It’s not the same as validity, but it’s important. Most exams are pretty reliable. But do they measure what we want them to measure?
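To make the difference between reliability and validity concrete, here is a toy simulation. It is only a sketch under loud assumptions: the students, the trait names (`recall`, `creativity`), and the `sit_exam` function are all invented for illustration, and the two traits are made independent on purpose, which is a simplification and not a claim about real students. The exam in this sketch rewards recall only, so two sittings agree closely (reliable) while the scores say almost nothing about creativity (not valid for that purpose).

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_students = 1000

# Hypothetical traits, each between 0 and 1, independent by construction.
recall = [rng.random() for _ in range(n_students)]
creativity = [rng.random() for _ in range(n_students)]


def sit_exam(recall_level):
    """Assumed exam: the score depends on recall plus a little noise on the day."""
    return recall_level + rng.gauss(0, 0.05)


first_sitting = [sit_exam(r) for r in recall]
second_sitting = [sit_exam(r) for r in recall]

# Reliability: do two sittings of the same exam agree with each other?
reliability = pearson(first_sitting, second_sitting)
# Validity for creativity: does the score track the thing we say we care about?
validity_for_creativity = pearson(first_sitting, creativity)

print(f"test-retest agreement:        {reliability:.2f}")
print(f"agreement with 'creativity':  {validity_for_creativity:.2f}")
```

In this made-up setting the first number comes out high and the second sits near zero: the exam gets the same result every time, and still does not measure what we said we wanted.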

As Associate Professor Nick Falkner, from Adelaide University, points out, getting 60% on a multiple choice exam does not mean a person knows 60% of the course, even in the rare event that the exam actually covers 100% of the course curriculum. Instead, results on a multiple choice exam actually show that when someone is presented with a limited set of options, their understanding and vocabulary have led them to pick this one. As Dr Falkner puts it, “The real question is have they actually learnt something, or have they been trained in pattern recognition in a limited space?”
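One narrow way to see why 60% on a multiple choice exam need not mean 60% known is the textbook correction for guessing. This is my illustration, not Dr Falkner’s argument, and it rests on assumptions that real students rarely satisfy: every question has four options, a student either knows an answer or guesses uniformly at random, and nothing is left blank. The function names below are hypothetical.

```python
def expected_score(known_fraction: float, options: int = 4) -> float:
    """Expected score when every unknown question is guessed at random."""
    return known_fraction + (1.0 - known_fraction) / options


def implied_knowledge(score: float, options: int = 4) -> float:
    """Invert expected_score: the fraction known that would produce this score."""
    chance = 1.0 / options  # score expected from guessing alone
    return max(0.0, (score - chance) / (1.0 - chance))


if __name__ == "__main__":
    # A student who knows nothing still expects a quarter of the marks.
    print(f"{expected_score(0.0):.2f}")      # 0.25
    # Under these assumptions, a 60% result implies knowing under half the questions.
    print(f"{implied_knowledge(0.60):.2f}")  # 0.47
```

Even this simple model only adjusts for luck. It says nothing about the deeper point above, which is whether picking the right option from a short list reflects learning at all.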

Academics often ruefully say that “Knowledge is not transferable between semesters,” which raises another concern with this form of measurement – do the students actually remember the facts for any significant length of time after the exam?

To summarise, we tend to use exams because we have been using them for a long time, because they are relatively easy to write and to mark, and because they are fairly reliable. And, of course, because they allow us to rank students in what we can pretend is an objective way.

But do they measure creativity?

Possibly the big problem with assessment “instruments” such as tests and exams as measures of learning is that we assume learning can be tested in a mechanised, human-free fashion, whereas true assessment of where a student is at requires interaction with the student. Of course, this introduces the risk of bias. But the notion that tests and exams are bias free – even if they are blind marked – is somewhat farcical anyway. Students who interpret things differently can be penalised by exam questions that are rote marked. For example, my daughter, Zoe, once had this maths question on an assignment: You have $300 in your wallet, which you deposit into your bank account. Is this a positive or negative transaction?

Most of the kids said it was negative, because the amount in your wallet goes down. But Zoe pointed out that the question was ambiguous: It’s a negative transaction from the perspective of the wallet, but a positive one from the perspective of the bank account, and a neutral one from the perspective of yourself – you have the same amount of money you had before, you’re just storing it differently. When she took this problem to her maths teacher he sympathised, said she was right, but told her to put the “right” answer (according to the answers in the teacher version of the text). Had Zoe had that question on an exam, she would very probably either have given the wrong answer, or wasted valuable time trying to make sense of it, when she could have been answering other questions. That question was effectively teaching my daughter to pass assessments, not to think critically, logically, and carefully.

The problem of making sure you teach and assess the things you want students to learn is something of an elephant in the room of the teaching profession. When you are under extreme time pressure, as most teachers are, it is natural to reach for tried and tested techniques – to do things the way they were done when you were at school. And to teach things the way you were taught to teach them. Where evaluation of teachers exists, it tends to reinforce this approach.

Years ago I got into an argument with Ted, the leader of the faculty I was working in at the time. I argued that the course we were teaching was fundamentally flawed. The students we were teaching were not learning what we wanted them to learn, and they couldn’t see the point of the subject. Ted argued that the course was great, the kids loved it, and we didn’t need to change a thing. I wanted to run an anonymous survey to find out for sure, which he finally agreed to – with one small flaw. He didn’t make it anonymous. Out of 200 students he got 20 replies, and, what do you know? They said “the course is great, it doesn’t need changing”.

Ted was outraged when I suggested that collecting the kids’ identities made it unsafe for them to say what they truly believed, directly to the teacher responsible for the course. We argued a lot, and in the end we asked a neutral party – a teacher from another faculty – to run a focus group. Now the kids were reporting on the subject to someone who was uninvolved in it, and uninvested in the outcomes. And what they said was horrifying to Ted, but no surprise to me. They hated it. They couldn’t see the point. Some of the kids in the focus group had also answered the survey, but their responses were very different.

That survey did not answer our questions. What we wanted to know was “what do kids really think about this course?” and what we wound up asking was “what will kids say about this course when asked by the guy who designed it, and when he knows who they are?”

Had we based the continuing shape of the course on that survey feedback, we would have been basing our ideas on data that didn’t say what we thought it did. This is one of the big problems with all data, not just assessment data. No dataset is perfect. No collection technique is foolproof. It’s very easy to ask leading questions on a survey that get you the result you are hoping for (such as “how awesome was the course?” rather than “how did you feel about the course?”), to survey a subset of people that don’t represent the entire population, or to assume the data is complete when there are significant parts missing. So we need to be super cautious when we use data to justify our actions, and to shape changes in our systems. Does the data say what we think it does? How can we be rationally sceptical and test our assumptions?

Our education system is heavily measured. Between standardised testing such as NAPLAN and PISA, external final year exams, and all of the performance measures imposed on teachers, we are measuring outcomes constantly. Unfortunately, we don’t seem to pay a lot of attention to the question of whether those outcomes are the ones we really want to aim for.

We’re looking for assessment reliability (will we get the same result if we do the same test again?) rather than validity (are we measuring what we think we’re measuring?), let alone asking what we should be measuring. And that, in a nutshell, is the issue with education. We shape our education systems to maximise outcomes. Unfortunately, the outcomes we are maximising are PISA scores and exam results. In an ideal world, these would be measures of learning but, as we will see, they don’t always measure what we think they do. The other issue, of course, is that we also shape our education systems to minimise cost.

The myth of perfect data

Any dataset we work with has flaws. Usually, it is not exactly the information we want; rather, it is as close as we can easily, cheaply, or quickly get to that information, or simply the data that we have access to. For example, population datasets are nearly always data from a sample of the population, rather than from everyone, which means some people will not be represented by that data.

Consider the census data for Australia, which covers the population who filled out the census. This will not necessarily cover people who were homeless at the time (because census forms are delivered to fixed addresses), or overseas, or who chose not to fill out the form. These people might leave significant gaps in the data if it is being used to calculate, for example, how many hospital beds we might need in a particular area.

The census data is the data we have. What we want is how many people live in an area, and how that might change over the next four years. What we have is how many people filled out the census on census night (and the previous census nights). This is, ideally, a close approximation, but it’s not the same. In 2016 it was a very poor approximation indeed, because the census was run primarily online for the first time in Australian history, and the system actually crashed under the load on census night. There were also widespread concerns about the privacy and security of the system, and the information collected. Not to mention barriers to access for people with no internet connection, or for whom English is not their first language. Some people chose not to fill out the census at all, or to use false data to protect themselves.

Some of the data we want is just not easy to measure. For example, we want to measure kids’ learning, so we have them sit exams. They are great for measuring recall of facts, and application of known procedures. They are rarely used to measure problem solving, creativity, or ethics – the attributes we say we care about. Plus they don’t necessarily even measure recall very well, depending on how well the exam was written, what the conditions were on the day, how good students are at doing exams, etc. We tend to use exam results as a proxy measure for learning, especially when we use those results, for example, to decide who gets into a particular university course.

Unfortunately, in education we sometimes forget that what we are measuring is not actually what we want to know. We tend to shape our education to be measurable, rather than to be meaningful.

All datasets have issues like these. The challenge is to identify the issues, and take them into account when we’re using the data to shape our future.

The easiest way to assess students in schools is to use things like multiple choice questions, which have been computer markable for decades – this is why students have to fill in some kinds of tests with very particular types of pencils, or particular coloured pens, and by colouring in circles on a page rather than writing their answers. It makes the tests simple to feed into an automated marking system. Now, of course, we can do that online even more easily.

Multiple choice questions are very simple ways to measure students’ knowledge of facts and known processes. You can even ask reasoning style questions, though they are more difficult to write in ways that all students can understand and interpret correctly, and, unlike a written answer, multiple choice gives the teacher no room to spot any misunderstanding that they didn’t see coming.

Doing this kind of assessment produces a very convenient number that you can effectively stamp the student with. “You received 90% on this test, therefore you know 90% of the material taught to you this semester.” Of course, that relies on the test covering 100% of the material taught, on every question being correctly written, and on no test using questions like the photosynthesis question mentioned in chapter 3, which have one answer according to the course material taught so far, but another answer (or multiple possible answers) in the real world. And, for the most part, these tests are a test of a student’s memorisation rather than their understanding.

I’ve lost track of the number of teachers I’ve heard talk about their teaching goals and say things like: “In the literacy program, our goal is to improve the NAPLAN and VCE results.” This is heartbreaking, because the goal of a literacy program is surely to improve literacy, not to improve the NAPLAN score. NAPLAN literacy results are intended to be a measure of literacy, but what they actually measure accurately is how well students do on the literacy section of the NAPLAN test. Ideally this will be close to a reasonable measure of literacy, but we always need to remember that the measurement is not actually the same as the thing being measured.

You might think, then, that the important question is this: How do we shift assessment so that we are assessing the things we really care about? Things like ethics, creativity, problem solving, and logical reasoning. And this is an important question. But I actually think we need to take the argument one step further, though it feels extremely heretical: What is assessment for?

What is assessment for?

What even is assessment? If you ask a bunch of teachers to explain assessment to you, you will likely get an explanation of summative versus formative assessment. The short version of that is that summative assessment gives you a measure of how much the student knows or can do – a sum of their learning – and formative assessment is a kind of feedback to the student to help them improve – it helps form their learning. Summative assessment often forms the end-of-course grade, while formative assessment happens throughout the course, to highlight gaps in a student’s knowledge, and to identify “skills they are great at” vs “skills they need to improve in”. Sitting a practice exam counts as formative if the student gets the opportunity to go through the results and note the types of questions they got wrong. Sitting the final exam is purely summative, unless the student has the opportunity to go through the marked exam afterwards and learn from their mistakes, in which case it can also be formative.

Formative assessment is obviously useful to students trying to improve their performance, but the question is whether they are trying to improve those important attributes of creativity, problem solving, and ethical behaviour, or whether the formative assessment is more along the lines of practice exams: trying to improve their performance on the all-important summative assessment that determines which university they can go to, and what courses they can study. When the focus is on summative assessment, students often become obsessed with a single mark lost, or with whether they did better than the other kids in the class.

Summative assessment is often used to judge students, to rank them relative to their peers, and to determine entry into subsequent courses or degrees. The ranking aspect can be particularly problematic if there are factors that vary between, or indeed within, cohorts. Consider the ATAR, or Australian Tertiary Admission Rank. It is clear from the data that there is benefit from being at particular schools, sometimes even in particular classes. From not getting sick or not having traumatic family circumstances. From being in a metropolitan school rather than a rural one. There are many factors that the ATAR endeavours to compress down into a ranking that effectively says: This student is more likely to do well in this degree than a student with a lower rank. It’s a way of saying “This student makes it in. That student doesn’t.” and being able to justify it with a nice, “objective” number.

The trouble is, the number might not be as objective as all that. Many studies have been conducted to try to determine whether socioeconomic status (SES) has an impact on ATAR. As with a lot of educational research, it’s difficult to find a definitive answer, because you can’t control conditions entirely. However, a recent study by Emmaline Bexley and her colleagues reports that high SES students who were achieving similar grades to low SES students in NAPLAN in Year 9 went on to achieve ATARs around 10 points higher than the low SES students three years later. Given that the maximum ATAR is 99.95, that’s about 10% of the scale that’s closely related to your socioeconomic status. It may not be causative – there are many factors impacting low SES students that might impact their ATAR, such as the need to work part time to supplement family income, potentially insecure housing, etc – but the fact that the correlation exists is, in itself, disturbing. If we have any commitment at all to equitable educational outcomes, this has to change.

Policy makers who are in favour of a particular approach, such as standardised testing, have a bad habit of using their own confirmation bias to support their preferred approach. In other words, they look for evidence that tells them what they want to hear, and ignore evidence to the contrary. As Finnish education researcher Pasi Sahlberg points out: “Evidence-based education policies use research to link selected treatment and expected outcomes, but they almost always ignore possible harmful side effects they may have on schools, teachers or children. Take NAPLAN, for example. Those who advocate the necessity of national standardised testing regimes back their views by positive consequences of high-stakes testing while ignoring the associated risks that research has exposed: narrowing curriculum, teaching to the tests, and declining student motivation, just to mention some.”

The end product of the Australian school system for most kids is the ATAR, and which university course they can get into with it. A particularly disturbing aspect of this focus on a final ranking is that kids often choose – and are encouraged to choose – subjects in which they are more likely to do well, rather than subjects that they are actually interested in.

The other obvious problem with this system is that what we are doing is training kids to be very good at exams. We then use how good they are at exams to select the courses they will do, and those courses mostly use exams to determine how well they do in those courses. So we are selecting kids to train to be engineers, doctors, architects, lawyers, teachers, scientists, etc, on the basis of how well they do in exams. And then we rank them as engineers, doctors, architects, lawyers, teachers, scientists, etc on the basis of how well they do in exams. And then we send them out to be engineers, doctors, architects, lawyers, teachers, scientists, etc, where they will be required to do a huge range of things that bear no resemblance to sitting exams at all.

If we truly want creative, ethical, rational, critically thinking problem solvers, then it makes sense to ask if our school system is actually producing kids with those characteristics. It’s not clear that we’re even turning out kids who value these kinds of characteristics. The system currently runs on marking criteria and constrained outcomes that punish the kind of kids who see a problem with the assignment definition and create a whole system to solve that problem. The kids who misinterpret exam questions because they think laterally. The kids who know that there are creatures in extreme environments that produce their own energy without photosynthesis. The kids who solve problems differently and come up with creative solutions that don’t fit the rubric. The kids who write more, like my year 11 student Chris, who struggled so much with the word limit on an assignment that he made a separate web page to add in the in-depth data that explored the topic in so much more detail. We teach kids not to do that. We teach them to meet the criteria and stop. Do the minimum.

Chris was also one of the students who took his year 11 Computational Science project, together with his partner Matt, and continued working on it in year 12. Chris and Matt were producing software to enable a cancer researcher to do more powerful and effective research. But they received no credit for it. There was no room in the ATAR for it. Indeed, their teachers advised against putting too much effort into it (despite it being used by a scientist to contribute to cancer research), in case it distracted them from the really important thing – not cancer research, but receiving the highest possible ATAR. What kind of values are we teaching with that kind of message?

What do we WANT the purpose of education to be?

So, here we are. Assessing our students mostly using exams and assignments that have no purpose other than to contribute to assessment (and maybe, if we’re lucky, to learning). From the early school years right through to undergraduate courses at university, we are teaching kids that what matters above all is the final mark. Ethics, critical thinking, community service, even health and wellbeing, and yes, even learning, become subservient to that final mark.

[Screenshot of a Google search result from www.deakin.edu.au. A large blue heading reads “Find your future | Deakin University”, followed by “Discover Courses by ATAR. Enter your ATAR to browse the courses available in your range.” and “Filter: ATAR”.]

Consider this Google search result, which came up when I searched for “atar future”. “Enter your ATAR to browse the courses available in your range.” The clear message here is: Do not look at courses for which you do not have the ATAR. They are not for the likes of you. Don’t go considering alternative pathways, or how you might study something related, do well, and transfer into your course of choice. Be realistic. Accept your limitations.

When I was teaching, one of my students told her form group teacher that she was aiming to get into one of the big tech universities in the United States. CMU, Caltech, MIT, or somewhere like that. He talked her down, told her she needed to be realistic. She’s now studying at MIT. Another student wanted to do a subject in year 12 for which you needed really strong marks in year 11. He wasn’t on track to get those marks, so his form group teacher also told him he needed to be realistic. That it wasn’t something he could do. He came to me, distressed, and I took a slightly different line. I said he could do it, but that it would be tough. He’d have to work super hard. And there were no guarantees. He did not, in the end, get the marks he needed to do that subject. But he had worked harder, reached higher, and developed some confidence from the fact that I believed in him. (I still do.)

I actually think that judging his ability to do that subject by his marks was foolish. It was an attempt at objectivity. If we make decisions based on a number, then surely we can’t be accused of bias, or prejudice? It is a way of wriggling out of the ethical and emotional complexity of a decision. How do we choose which kids are given the chance to become doctors? How do we choose which kids could become engineers? How do we select teachers, nurses, or data scientists? We remove ourselves from any possibility of making it personal – no-one can say “She just didn’t like me”, or “He didn’t like the colour of my skin”, or “She didn’t let me in because I’m a girl.” It’s all down to this simple number. Objectivity guaranteed.

But if there is a correlation between socioeconomic status and ATAR… if girls are driven out of particular subjects by the perception that they are not suited to them… if rural kids don’t have access to the same range of subjects… if some schools don’t have great teachers or support structures… then what we have is the pretence of objectivity and fairness, rather than actual objectivity and fairness.

The bottom line is that if we want to teach the things we say we care about – creativity, ethics, problem solving, and collaboration – then we have to show the students that we do, indeed, care about them. We have to stop using an ATAR based largely on assessments that don’t, and probably can’t, measure those things.

The good news is that we can absolutely do that, by giving students the opportunity to learn by solving real problems. By teaching them to critically evaluate their own results, and by optimising for carefully tested outcomes, rather than right answers, we can build habits of critical thinking and scepticism that will last a lifetime.
