Why get excited about Data Science?

This is an edited version of a talk I gave in Perth for the Innovation Institute, for the opening of Data Science Saturdays, aimed at 12-18 year olds. Huge thanks to the folks at Pawsey Supercomputing Centre, NeSI, and NCI for awesome examples!

Hi there! I’m going to start by recognising that I am coming to you from unceded Wurundjeri land, and pay my respects to the Wurundjeri people and their elders, past, present and emerging. 

My name is Dr Linda McIver. I’m the Founder and Executive Director of The Australian Data Science Education Institute (ADSEI), a charity dedicated to empowering every student with critical thinking, data literacy, and STEM skills in the context of projects that matter.

I’ve been asked to talk to you today because I get crazy excited about Data Science, and I want you to know why. You’re welcome to mock me for that, but before you do, let me tell you that I wasn’t terribly keen on maths at school. I didn’t understand logarithms, and I found calculus terribly dull. I couldn’t see the point of a lot of the stuff we were learning. I just needed marks on the exam, and that’s not particularly exciting or motivating.

So how did I get from there to someone who’s crazy excited about data science? I founded ADSEI because I found out that Data Science is a superpower. It says so on my shirt, which was a gift from the San Diego Supercomputer Center. I don’t know how clearly you can see it, but it says “I do Data Science. What’s your superpower?”

Data Science is a superpower because it gives you the power to solve problems. It gives you the power to prove that there are problems – like proving that your classroom is way too hot for compulsory blazers, or showing that the noise level in the gym is actually a health and safety issue (I hated sport at school!) – and it gives you the power to figure out how to fix them, as well as the power to show how well you’ve fixed them. 

So I want to start with some examples of some really amazing data science applications that happen in the real world.

Oddly enough, I’m going to start with my physiotherapist, Joshua Heerey. A lot of physios approach the job somewhat unscientifically. They poke, prod, and wrangle you about, pronounce their diagnosis and then give you some fiendishly painful exercises to do that may or may not solve the problem. When I developed hip problems, I was in a lot of pain. I saw a physio who poked, prodded, and diagnosed me with bursitis. He gave me a few things to do, applied ultrasound and heat, and made no difference at all. He then diagnosed something different, gave me more exercises, and again we achieved nothing. If anything, it was getting worse.

So I went to see Josh. Josh’s approach to physiotherapy is rather different. After listening to the problem and asking questions, Josh measures weakness in different muscle groups using a dynamometer – a force meter.  He uses repeated measurements to ensure accuracy. He finds the weak muscles and records just how weak they are. He also measures the angles each joint can bend to. He assigns exercises (they still hurt, btw) to strengthen the muscles that are weak. Each time I went back he’d measure them again, see which ones were improving, and by how much. In short, he applied data science to physiotherapy, and voila! Together we cured my hip. 

This is a very scientific approach to healthcare. Measure the problem. Work to fix it. Measure it again to see how well the fix has worked. Adjust treatment if necessary. Measure it again. It’s not rocket science, but it absolutely is data science. 
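
If you’d like to see what that measure, treat, measure again loop could look like in code, here is a minimal Python sketch. The muscle names and every number in it are made up for illustration; they are not Josh’s actual measurements, and this is not the software he uses.

```python
from statistics import mean

# Made-up dynamometer readings in newtons: three repeated measurements
# per muscle group, per visit. None of these numbers come from a real patient.
visits = {
    "visit 1": {"hip abductors": [98, 102, 100], "hip flexors": [148, 152, 150]},
    "visit 2": {"hip abductors": [119, 121, 120], "hip flexors": [150, 153, 150]},
}

def average_strength(visit):
    """Average the repeated measurements for each muscle group."""
    return {muscle: mean(readings) for muscle, readings in visit.items()}

def percent_change(before, after):
    """Percentage change in average strength for each muscle group."""
    return {
        muscle: round(100 * (after[muscle] - before[muscle]) / before[muscle], 1)
        for muscle in before
    }

before = average_strength(visits["visit 1"])
after = average_strength(visits["visit 2"])
change = percent_change(before, after)

for muscle, pct in change.items():
    print(f"{muscle}: {before[muscle]} N -> {after[muscle]} N ({pct:+}%)")
```

Averaging the repeated readings smooths out one-off wobbles, and comparing visits shows which muscles are responding to the exercises and which need a change of plan.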

The next story is about a study by Professor Rosalind Picard at MIT that used a wearable device that measured skin conductivity to measure stress (this first study was before wearable devices were common). Your skin conducts more electricity when you sweat, and you sweat when you’re stressed, so in theory higher conductivity means more stress. Of course, there are other reasons why you might be sweating, or why your skin’s conductivity might change, hence the study. They wanted to figure out how good the device was at measuring stress. The device recorded measurements throughout the day, which were then matched against a diary kept by the participant, so that the researchers could track whether people were actually stressed when the data made it look as though they were. 

The researcher loaned the device to a student who wanted to use it to measure his autistic brother’s anxiety levels.  One day this device gave a massive spike in readings. Nothing the researchers could do in the lab could trigger a reading this high. They tried all sorts of stressors and exercise tests, and simply could not get a reading like that. You could show someone a massive tarantula and not get a response like that.

They thought it must be an anomaly. But rather than throw away the data as an outlier, they carefully tracked it back to the matching diary and discovered that the spike in data happened right before an epileptic seizure.
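
Here is a rough Python sketch of that “flag it, then go and look” habit. I don’t know what analysis the researchers actually ran, so treat this as one possible approach: the readings, the diary entries, and the threshold are all invented for the example.

```python
from statistics import median

# Made-up skin conductance readings (microsiemens), keyed by hour of the day.
readings = {9: 2.1, 10: 2.4, 11: 2.2, 12: 3.0, 13: 2.6, 14: 19.5, 15: 2.3}
# Made-up diary entries for some of those hours.
diary = {12: "exam", 14: "seizure shortly after this reading"}

def flag_unusual(values, threshold=5.0):
    """Return the keys whose value sits far from the median.

    Distance is measured in units of the median absolute deviation (MAD),
    which one huge spike cannot distort the way it distorts an average.
    """
    mid = median(values.values())
    mad = median(abs(v - mid) for v in values.values())
    return [k for k, v in values.items() if abs(v - mid) > threshold * mad]

# Instead of deleting the unusual readings, look up what was happening.
for hour in flag_unusual(readings):
    note = diary.get(hour, "no diary entry")
    print(f"{hour}:00 reading {readings[hour]} -> {note}")
```

The important design choice is in the last loop: the code reports the odd value alongside its context rather than quietly dropping it.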

So those researchers could have ignored a value that wasn’t relevant to the study they were doing, or they could have thrown it away as an outlier, but what they did instead was develop this device – the Embrace – a seizure monitoring watch that not only detects epileptic seizures, it can message caregivers to let them know a seizure has occurred, and it also uses accelerometers, or motion sensors, to figure out if the wearer has collapsed. The Embrace has provided epilepsy sufferers with a new level of independence and safety. And it couldn’t have been done without data science.

This next story is about Jennifer Yeung, a Canadian plane spotter, aerospace engineer, and PhD student. Jennifer’s PhD uses a system called Artemis, which is designed for real time monitoring of neonatal babies, sending data from regional hospitals to specialists elsewhere in the world, so that the babies can receive the best of healthcare even if their doctors are thousands of kilometres away. In 2019 Jennifer visited Pawsey Supercomputing Centre, and used Artemis with machine learning to track changes in babies’ vital signs BEFORE their health crashed, so that they could receive lifesaving treatment before their condition became critical. Incidentally, Jennifer’s main PhD project is to adapt Artemis to monitor the vital signs of astronauts in real time. How cool is that?! And, again, it’s all data science.

Now we’re off to New Zealand, where Dr Céline Cattoën-Gilbert analysed 40 years of climate data on a supercomputer named Maui at New Zealand eScience Infrastructure (NeSI) to create high resolution weather and river flow forecasts to predict floods up to 48 hours in advance. This is obviously amazing news for people in the path of those floods, who used to have to wait until the water was lapping at their doorstep to know there was a problem! Now we can use data science to warn people in time to take precautions, or even evacuate if the flood levels are going to be dangerously high.

We tend to think of data as numbers – counting things, measuring things, monitoring things. But data can also be sound and images. For example, Dr Giacomo Giorli is an oceanographer at the National Institute of Water and Atmospheric Research (NIWA) in New Zealand. There, his team tracks marine mammal populations around New Zealand through underwater acoustic monitoring, again using NeSI supercomputers. Dr Giorli is particularly interested in whales, and wants to track their movements. But it’s hard to detect and monitor whales 24/7. It’s expensive, often cold and wet, you get seasick, and whales can be just plain hard to find sometimes. If you can place microphones underwater, suddenly you can do 24/7 monitoring from the comfort of your local supercomputer.

Now off to space! The craters on a planet’s surface tell its history. Volcanic activity tends to smooth the planet’s surface by covering it with lava, so the more craters we can see, the longer it has been since a volcanic event wiped the surface ‘clean’. The current database for Mars contains 385,000 identified craters with diameters of 1 km or larger, but it took at least six years to construct before it was published in 2012. Planetary scientist Professor Gretchen Benedix at Curtin’s Space Science and Technology Centre used machine learning and the Pawsey Supercomputing Centre’s systems to identify 94 MILLION craters in just 24 hours. Even cooler, they can now identify craters as small as 5 metres across – 200 times more sensitive!
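
If you want to check that “200 times” for yourself, the arithmetic fits in a few lines of Python. The variable names are mine; the figures are the ones quoted in the paragraph above. Bear in mind that the two crater counts cover different size ranges, so the second ratio is only a rough sense of scale.

```python
# Figures quoted in the talk.
old_min_diameter_m = 1_000   # 1 km: smallest crater in the 2012 database
new_min_diameter_m = 5       # smallest crater the machine learning approach finds
old_crater_count = 385_000
new_crater_count = 94_000_000

size_ratio = old_min_diameter_m / new_min_diameter_m
count_ratio = new_crater_count / old_crater_count

print(f"Smallest detectable crater is {size_ratio:.0f} times smaller")
print(f"Roughly {count_ratio:.0f} times as many craters identified")
```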

Now let’s get physical. Curtin graduate student Jordan Makins, with the help of Pawsey Supercomputing Centre, has developed an open source tool for analysing soccer player performance. Feed the tool data about recent games, and it can tell you how well players are performing, and where their weaknesses are. Data Science is heavily used in sport to try to monitor and improve performance.

Any trainspotters here? Let’s talk about how Data Science caught Singapore’s rogue train. In 2016 the Circle Line in Singapore suffered a series of strange disruptions. Trains on the line, apparently at random, lost contact with the control system, which triggered the emergency braking system, leaving the trains dead on the tracks. This is obviously a bit of a problem for a busy train line! The events seemed so random, though, that the train company had no idea what was going on. They called in some data scientists and gave them a dataset containing the date and time of each incident, where it had happened, the ID of the train, and the direction the train was travelling in.

The data scientists tried everything to find a pattern in the data, but it wasn’t always the same train, it didn’t seem to be in the same place, or even the same set of places. It was bizarre. They visualised a whole range of different aspects of the data, using complicated graphs, simple ones, anything they could think of. They crunched all kinds of numbers. Nothing. Eventually they spotted a small pattern in all of the noise: when a train lost signal, another train behind that train but headed in the same direction would often also lose signal directly afterwards. They started to think that perhaps there was a rogue train, causing signal interference with other trains. Complicating their investigation was the fact that the rogue train never interfered with itself, so it did not appear in their data. But that, in itself, was a clue! An extra complication was that a small number of shutdowns are normal, so there was some noise in the data.

Eventually, after a lot of work, they zeroed in on a possible suspect, and checked when that train, Passenger Vehicle 26, was not in service. Lo and behold, very few shutdowns happened during those times! Culprit identified! Passenger Vehicle 26 was repaired to prevent the interference, and the Circle Line went back to normal. Another problem that would have been really hard to solve without data science.
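
That final check, comparing shutdowns when a suspect train was running with shutdowns when it wasn’t, can be sketched in a few lines of Python. This is only an illustration of the idea, not the analysis the Singapore team ran: the daily counts are invented, the train IDs other than Passenger Vehicle 26 (written here as PV26) are made up, and the real investigation worked with much finer detail than whole days.

```python
# Made-up data: how many signal-loss incidents happened each day, and which
# trains were running that day. Train IDs other than PV26 are invented.
days = [
    {"incidents": 9, "in_service": {"PV26", "PV07", "PV31"}},
    {"incidents": 1, "in_service": {"PV07", "PV31"}},
    {"incidents": 8, "in_service": {"PV26", "PV31"}},
    {"incidents": 0, "in_service": {"PV07", "PV12"}},
    {"incidents": 10, "in_service": {"PV26", "PV07", "PV12"}},
    {"incidents": 1, "in_service": {"PV12", "PV31"}},
]

def incident_gap(train, days):
    """Average daily incidents when the train ran, minus when it did not."""
    running = [d["incidents"] for d in days if train in d["in_service"]]
    resting = [d["incidents"] for d in days if train not in d["in_service"]]
    return sum(running) / len(running) - sum(resting) / len(resting)

# Every train that appears anywhere in the data.
trains = sorted(set().union(*(d["in_service"] for d in days)))
gaps = {train: incident_gap(train, days) for train in trains}
suspect = max(gaps, key=gaps.get)

for train, gap in gaps.items():
    print(f"{train}: {gap:+.2f} incidents per day when running")
print("Most suspicious train:", suspect)
```

Notice that the rogue train never needs to show up as a victim in the incident list. Its fingerprint is that everyone else has a quiet day whenever it stays in the depot.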

Now let’s talk about something particularly close to my heart, since I’m in Victoria and only just out of lockdown! Professor Linsey Marr is a scientist who proved back in 2011 that the flu was airborne rather than spread by droplets. Droplet and airborne transmission might sound the same, but the technical difference is crucially important. Droplet diseases spread through droplets – particles emitted when you cough or sneeze. Droplets are heavy. They don’t stay in the air, but they CAN land on surfaces and make you sick if you touch those surfaces and then touch your face or your food. They can also land straight in your mouth, nose, and eyes if someone coughs or sneezes nearby. (How gross is that!?) That’s why social distancing is really important with droplet diseases.

In contrast, diseases that are airborne make you sick if you breathe them in. And, crucially, they stay in the air for much longer. Marr took samples of the air in different rooms, in places like up near ceiling air vents, where droplets simply couldn’t be (because they fall, they don’t fly!), and she found enough flu virus to make people sick. The trick, though, is that she couldn’t get published, because the medical establishment was convinced that the flu was spread by droplets.

The reason? In the 1930s a study of tuberculosis found that only particles smaller than 5 microns could infect people with the disease. This somehow got translated into “only particles smaller than 5 microns can be airborne.” 

The thing is that Professor Marr is an expert in airborne pollutants and indoor air systems, and her engineering training told her quite clearly that the physics of this assumption was all wrong. Particles larger than 5 microns hang in the air all the time!

When covid came around, Professor Marr was quite sure it was also airborne, while the WHO and the American CDC among many others were busy saying it was droplet, so social distancing and hand sanitising were promoted as the way to stop the spread, rather than masks and ventilation.

Frustrated, Professor Marr teamed up with a history researcher by the name of Katherine Randall, who conducted what was effectively research archaeology – digging down into the history of a topic to figure out where certain ideas come from. Randall discovered that the original tuberculosis study, from the 1930s, did indeed establish that only particles smaller than 5 microns can infect a person with tuberculosis, but not because larger particles don’t hang around. Tuberculosis can only make you sick if it gets deep into your lungs, and our lungs very efficiently filter out particles larger than 5 microns well before they get that deep.

Particles larger than 5 microns DO hang around in the air, and while they can’t give you tuberculosis, they can certainly give you covid19 or the flu, because those can make you sick if they get anywhere in your respiratory system. They don’t need to get anywhere near as deep as tuberculosis does.

Linsey Marr challenged scientific orthodoxy, and she’s one of the heretics I talk about in my book, Raising Heretics, because we need people to challenge orthodoxy, but only on the basis of evidence, data, and rational evaluation. Not on the basis of YouTube rabbit holes, Reddit, and TikToks!

We desperately need people who are prepared to be rationally heretical.

Who are prepared to ask “why?” “how can we be sure?” “what have we missed?” “how can we do better?” “who are we hurting?” “how can we fix this for everyone?” “how will we know how well it works?”

These questions are often heretical. By asking them, I’ve sometimes made people very unhappy. These questions are uncomfortable. But they are crucial to building an ethical, sustainable, positive future for all of us.

Heresy has been crucial to our scientific development. In the 1840s Ignaz Semmelweis came up with the radical heresy that doctors washing their hands before (and after) surgeries prevented disease. Prior to this doctors went from autopsies to childbirth without washing their hands or changing their clothes. And they wondered why people died. The idea that this could cause disease was considered so ludicrous that it took decades for the idea of washing hands to be accepted. Semmelweis was so ridiculed and pilloried that his colleagues committed him to an asylum where he was beaten and died.

In 1917 Alice C Evans made the laughably heretical suggestion that milk should be heated to a high temperature, or pasteurised, to kill bacteria that could be harmful to humans. She was not taken seriously, being a woman and without a PhD (which, by the way, was not offered to women at the time), and it took over a decade before milk was regularly pasteurised in the US. After her discovery but before its general acceptance, Alice became significantly ill with undulant fever, a disease caused by one of the bacteria found in raw milk.

In the 1940s and 50s, Barbara McClintock discovered that genes aren’t static sets of instructions passed from generation to generation, but that they can be regulated – turned on and off – by other parts of the genome. She described the reaction to this discovery as “puzzlement, even hostility”, but in the end her research radically changed our understanding of genetics.

In the 1960s, Frances Kelsey of the American Food and Drug Administration refused to approve Thalidomide for use as a morning sickness drug, because she was concerned about the lack of data about whether the drug could cross the placenta, and directly affect babies’ development in the womb. This averted thousands of birth defects in American babies. Sadly, other countries were not so cautious.

More recently, Marshall and Warren’s original paper on ulcers being caused by bacteria rather than stress was rejected and consigned to the bottom 10% of submissions. Barry Marshall eventually drank Helicobacter pylori – the bacterium that causes ulcers – to prove it, thus giving himself gastritis, a precursor to ulcers, which he then cured with antibiotics.

It might surprise you to know that Florence Nightingale was one of the first data scientists, and her use of statistics actually saved a lot of lives. Nightingale discovered that the way field hospitals were recording deaths was wildly inconsistent, and it made it very difficult to understand why soldiers were dying. By standardising the way they were recorded, she was able to analyse the data and figure out that by far the greatest proportion of soldiers were dying from infections spread in the hospital itself, rather than injuries received in battle. Knowing what the problem actually was meant that they could work to fix it. Once hygiene was improved throughout the hospital, deaths and illnesses dramatically reduced, and many lives were saved.
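
To get a feel for why standardising records matters so much, here is a small Python sketch. The labels and categories are invented for the example; they are not Nightingale’s actual records or classification scheme.

```python
from collections import Counter

# Invented examples of inconsistent record keeping; not historical data.
raw_records = ["Fever", "fever ", "wound", "Wounds", "cholera", "Typhus", "FEVER", "shot"]

# A hypothetical mapping from free-text labels to standard categories.
STANDARD_CAUSE = {
    "fever": "infectious disease",
    "cholera": "infectious disease",
    "typhus": "infectious disease",
    "wound": "battle injury",
    "wounds": "battle injury",
    "shot": "battle injury",
}

def standardise(label):
    """Tidy a free-text cause of death and map it to a standard category."""
    return STANDARD_CAUSE.get(label.strip().lower(), "other")

# Counting the raw labels gives eight different "causes"...
print(Counter(raw_records))

# ...while counting standardised labels shows the real pattern.
counts = Counter(standardise(r) for r in raw_records)
print(counts)
```

Before standardising, every record looks like its own unique cause. Afterwards, the biggest category is obvious, and that tells you where to start fixing things.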

You can see that there is no practical limit to the ways we can use Data Science to solve problems. To change the world. From sport to disease, from the ocean to space, Data Science is a tool that empowers us to understand the world, and change it for the better. 

We need you to be data scientists. Not necessarily professionally, but to have enough data literacy to ask difficult questions, to challenge the status quo, to be heretics.  And we need you to do it on the basis of evidence and data. 
