Michael Brand, on the Science of Data Science

“It was such an amazing success rate. It was very obvious to me from the first second that it can’t possibly be true. It was just way, way too good. Probably like the single best indicator that a result is wrong.”

“I’ve seen this all the time in the commercial world. People are very happy to do whatever you want, call it data science or call it anything else in order to finish a project. But as soon as the project is finished, as soon as it’s put into production, nobody goes the extra step of actually measuring how well it works. You did this project in order to solve a problem. Are you not going to measure how well it actually did solve the problem?”

“We do not teach people that making mistakes is not just right, but it’s the only way of learning. It’s the only way of becoming better.”

“It’s perfectly fine to get some ideas from the AI or to use it as a generator of the first draft of your computer program or the first draft of whatever, but it can’t be the last draft. It can’t be something that goes into production without further scrutiny.”

“Unless you’re willing to take a hard look at your data and have that data tell you that you’re wrong, research is not for you.”

“A lot of the time, people do data science just because it’s something that shareholders expect them to do. A manager wants to push a decision through a board meeting, so they’ll tap their favorite data scientist on the shoulder and say, “Give me the data that proves that I’m right.” I call that confirmatory data science. They can go to that board meeting and say, “By the power of data.” That’s not data science.”

“If you want to be serious about being a data scientist, you need to start by being serious about being a scientist. Science starts with peer review. There is no science without peer review.”

Honestly, I want to turn this whole episode into pull quotes! Go listen!

Transcript

Linda: Welcome back to another episode of Make Me Data Literate. This is going to be an interesting one. We haven’t tackled this area before, so let’s get straight into it. Welcome, Michael. Tell us who are you and what do you do?
Michael: Thank you, Linda. I’m absolutely delighted to be here. To your question, I’m a data scientist, and have been for 30 years now, long before the term was ever popularized. I’ve had a very long career. I’ve served as chief scientist, chief data scientist, and equivalent positions in many companies, both nationally and internationally, most recently as chief data scientist at Telstra. I’m also an academic in the field. I’m an adjunct professor of computing technologies at RMIT. I was previously an associate professor of data science and artificial intelligence at Monash University. I also headed the Monash Centre for Data Science.
Six years ago, I started my own company. It’s called Otzma Analytics. It’s a data science consultancy. It advises organizations on how to make data science into a strategic capability, how to make the most out of the data that they have or can get. This can take many forms. I coach executives. I provide upskilling and training for data scientists, often by working alongside them on strategic projects.
Perhaps most uniquely to Otzma Analytics, what I do is provide review for analytics work. So data scientists show me projects that they’ve done and I just kick the tires to see what holds up to scrutiny. This is something that really only a handful of people do globally. I can count them on the fingers of one hand.
Linda: It’s extraordinarily rare to see any data analysis actually subjected to scrutiny and tested properly. I’m excited to hear that that’s part of your role. Is it difficult to encourage companies to do that? I know that there’s a great pressure to analyze, react and jump on to the next thing rather than stop and make sure that it’s solid.
Michael: Well, that’s an interesting question, because there are really several different types of populations that do data science around the world, several types of reasons why people do data science. A lot of the time, people do data science just because it’s something that shareholders expect them to do. A manager wants to push a decision through a board meeting, so they’ll tap their favorite data scientist on the shoulder and say, “Give me the data that proves that I’m right.” I call that confirmatory data science. They can go to that board meeting and say, “By the power of data.” That’s not data science. That’s just not what it is, in my view.
If you want to be serious about being a data scientist, you need to start by being serious about being a scientist. Science starts with peer review. There is no science without peer review. The vast majority of data science that is done around the world is pro-forma data science. I call it cargo cult data science. People do it for various reasons. I’m sure they’re happy doing it. They’re not my clients. My clients are the ones who actually want to do good data science and they love being reviewed because they’re hungry for it.
Linda: That’s beautiful. I can see half a dozen pull quotes in just the last two minutes of conversation. Cargo cult data science might be my new favorite expression.
Michael: One day I will write a book that will be its title.
Linda: I will read it with great enthusiasm. What did you have to learn to do your work? Was there something missing from your formal education?
Michael: Well, my formal education was quite diverse, and in hindsight quite perfect for a data scientist. I did my bachelor’s in industrial engineering, my master’s in applied mathematics, and my PhD in information technology. Those are exactly the three legs data science stands on: you’ve got the domain knowledge, you’ve got the math and stats, you’ve got the computing. I couldn’t have designed it better.
The thing is, I’d been doing data science, I’d been doing research, long before any of these academic degrees. I got the degrees after I already knew these things. Fairly little of my knowledge comes from that formal education. I can tell you that the most fundamental skills needed by a data scientist were not covered in any of these degrees. I’m talking about things like understanding the research discipline, research as a thing unto itself, and having a good grasp of scientific theory.
Again, this comes back to what we said before. Anybody who calls themselves a data scientist and thinks they don’t need peer review, I’m not sure they know what scientist means. It’s so prevalent to see people who do not have this background, who do not have this understanding, and these are exactly the things I immediately find mistakes in when I do reviews. The scientific method: people just don’t understand it. The other thing a good data scientist needs is what is misleadingly called common sense, a very, very rare commodity. I don’t know that anyone can teach that. I think it’s something that you gain by experience.
In the end, when an aspiring data scientist comes to me and says, “What should I learn?”, my main answer is: you should do. Just take data and analyze it. Not Kaggle data, mind you; real data, actual data. There’s no replacement for that.
Linda: No, no. As you know, I’m wholeheartedly in agreement with that. I had a student say to me recently that she does data wrangling as part of her job. She said, “You’re the one who taught me the confidence to play with data like silly putty.” I thought, “That’s beautiful. I’m definitely nicking that one as well.” You have to be able to just get in there and wrangle it and see what you can find, and then challenge yourself and say, “What other reasons might there be for what I just found? How many ways did I go wrong in that?”
Michael: Absolutely. Absolutely. Play with it like silly putty. That’s an expression that I’m going to take away from this conversation.
Linda: Beautiful. We’ll spread it. All credit to Sarah. What do you wish everyone knew about data?
Michael: I think everything I wish people knew about data is stuff that should be obvious. Usually it is obvious to anyone who has ever actually looked at data. But before answering the question, let me just say: what I wish is not that people knew anything about data. I just wish that people knew data.
I’m constantly surrounded by C-level executives who all know so much about data. Data is important to them. Data is the lifeblood of their companies. Data is something they invest in heavily. I go and ask them, “When was the last time you actually looked at any of your data?” It’s all crickets chirping. They don’t look at data. If they had looked at data, I think they’d have many of the insights they want from data science, right off the bat. They’d also learn everything I want people to know about data: that it’s messy, that it’s not objective, that it tells a story, that it needs to be understood in the context of what it means for the business, and so on. If you ever actually look at data, these things immediately jump out at you.
If I need to choose one thing that I wish everybody knew about data, it’s this: data is everybody’s job. It’s a very common attitude among business people to think that data is the job of the data scientist. It isn’t. How can you possibly manage without data? When I was studying industrial engineering, there was a legend going around about the manager of the Hilton hotel chain: that every day, he knew the price of a pack of peanuts, because they used to serve peanuts in the hotel lobbies. And, you know, that’s data.
How can you possibly make good business decisions without being aware of your own data? I believe that managers absolutely need data literacy. I think they need data literacy like they need computer literacy. And some people are horrified when I say that because they say, “Well, do you want to take all data scientists out of a job?” No, I don’t.
I think that when managers have data literacy in the same way that they have computer literacy, they will need data scientists in the same way that they need IT departments.
Linda: I want to come back to the concept of context, because you’ve mentioned it a couple of times now, and it’s one of my big drums that I bang, to everyone’s great despair I think sometimes: you cannot understand a data set without understanding the context. I’m astounded at the number of people who have come out with masters of data science, having been taught a whole bunch of statistical processes and a whole bunch of code, who have never been taught that you can’t just apply these like a cookie cutter, stamping out the data, applying the stats and moving on, without actually knowing the context of the data: what it’s appropriate to do with that data and what isn’t, what it means, and how it all interrelates.
We don’t seem to have grasped the idea that context actually is meaningful with data and you can’t operate sensibly without it.
Michael: Absolutely. I could not agree more. At some point I had a series of videos, you can still find it on YouTube, called How to Lose Money on Analytics, and that’s the repeating theme there. People like the idea of some magic bullet that will spare them the necessity of doing research. Managers, by and large, are afraid of the word research, because they believe that if it’s research then we don’t understand it, and if we don’t understand it then we shouldn’t be doing it.
They are so very likely to be tempted by any vendor that comes in and says, “I have software for you and it will solve your problems. You don’t need data scientists, you don’t need research; here, this will do it for you.” I think the only reason that form of snake oil even exists today is that people don’t understand that that’s not what data scientists do.
Data scientists are not about running the correlation; anybody can run a correlation. Data scientists are about understanding what the results mean, and understanding requires context. It requires semantics.
Linda: Let’s just feed it into ChatGPT and do without our data scientists altogether.
Michael: That’s right, and people are trying, and there are just so many vendors selling this. When I was at Telstra, I had like 10 emails every day from companies saying, “Just give us your data and we’ll show you that we can do wonders with it, and you’ll buy our products and you’ll never need data scientists again.” It’s snake oil; it always has been snake oil. Unfortunately, more and more managers are tempted into doing it, and the result is that fewer and fewer managers believe there is any power in data, because they feed the data into the software and what comes out of it has no strategic value to the company. You cannot make business decisions based on it, and the more that happens, the more people are convinced that the whole data thing is just a fad, a game, a toy without real value.
And then I have to deal with this: every time I speak with data scientists, somebody comes to me and asks, why is it that my manager never listens to me? How come they don’t listen to the data, they don’t listen to the research outputs; what can I do to make them understand? And the answer is: as long as they’re in the snake oil business, they’re not going to listen to you. They need to understand that data is part of their strategic plan in order to treat any of this seriously. If it was a game from day one, they’re not going to listen. And you may want to cast stones, but let the data scientist who never practiced confirmatory data science cast the first stone.
Linda: I’m reminded of the story that came out just a couple of days ago about Air Canada having to honour the refund policy that its chatbot had made up. I mean, that was inevitable. As soon as you start using AI to answer questions, we know that AI has no investment in truth or actual meaningful results. We knew that was going to happen, but the snake oil peddlers were saying, “AI can do all the things that your people do, and you can replace the people.”
Michael: That’s true, and it’s a symptom. It’s just the symptom in which the latest cycle of the snake oil trend manifests the fact that it has no actual value. Every single time, a new buzzword comes in and takes over the conversation, and managers are just so happy to accept the idea that the next software, the next buzzword, maybe it’s mesh computing this time around, maybe it’s whatever, will be the one, despite the fact that everything they’ve tried up until this point did not give the strategic value they were after. They fall into this trap every single time because they’re so afraid of research.
I actually had a conversation with a manager of one of the largest companies in Australia, and at some point he asked me, “When are we going to have a version of this that works?” I said, “A version of what? Do you think we’re selling software?”
No, AI is not going to be the salvation of any of this. It’s just the latest iteration of “let’s obfuscate everything.” Let’s make everything as much of a black box as we possibly can, so that there will be no scrutiny of what we do, and there will be people who just trust us at face value, trust us on faith that this is going to be their salvation. It’s going to work for some months, and then they’re going to forget about it. That will be the end of that hype cycle, and we’re going to have to come up with the next buzzword.
Linda: The lack of demand for evidence that these systems work is breathtaking to me. I’ve got a friend who’s applying for a high level executive job and part of the process is AI interviewing and I find that just astounding because there’s no evidence that that produces anything of value at all. In fact, there’s significant evidence that it’s garbage and biased garbage at that but it’s been adopted so enthusiastically by so many large companies and large organisations and I’m just sitting there going, “What are you doing?” It’s nonsense that you’re injecting into the process. Why would you do that?
Michael: Yes, and I tell managers, there’s no reason not to use AI tools. There’s no reason not to use any form of tools. I’m not against automation. I’m not a Luddite. A lot of what I do is automated but it’s not the research. The research is what you do with these tools, over these tools. The tools don’t do it for you. I believe in human in the loop.
It’s perfectly fine to get some ideas from the AI or to use it as a generator of the first draft of your computer program or the first draft of whatever, but it can’t be the last draft. It can’t be something that goes into production without further scrutiny.
The example that you gave of using it in HR, we’ve seen this before. Facebook had famously used machine learning to choose candidates after interviews. Yes, it blew up because it was biased, and so on. But I think nobody ever actually learned the real lesson of that incident. What Facebook did was say, “Oh, so we trained a machine learning tool on all of our historical data. That machine learning tool ended up being biased. Let’s jettison it and forget we ever did it.” Whereas my takeaway from that is: no, you finally figured out how to measure the bias in your recruitment process. Obviously, your recruitment process was biased long before this system came online, because that’s the data the system was studying. It was copying your human bias. But now you can measure it. You can study it. You can finally solve this problem.
But no, the solution was let’s just ignore it and come back to the no data form of decision making.
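To make that takeaway concrete: below is a minimal sketch, on entirely made-up data with hypothetical column names, of the kind of bias measurement Michael is describing. It compares selection rates across groups in historical hiring decisions, the classic disparate-impact check; a real audit would of course go much deeper.

```python
import pandas as pd

# Hypothetical historical recruitment data: one row per candidate,
# with the outcome of the (human) hiring process.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "hired": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Selection rate per group: P(hired | group).
rates = df.groupby("group")["hired"].mean()
print(rates)

# Disparate-impact ratio: lowest selection rate over highest.
# A value well below ~0.8 is the classic red flag that the process
# itself (not just a model trained on it) is biased.
print(f"disparate impact ratio: {rates.min() / rates.max():.2f}")
```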
Linda: Let’s not look at the results of what we do because we might find out things we don’t want to know.
Michael: Exactly. This is the essence of research, and it’s a learning process. Unless you’re willing to take a hard look at your data and have that data tell you that you’re wrong, research is not for you.
Linda: Right. This is one of the things I always build into the projects that I design for students: you find a problem, you measure it, you analyze it, you try to fix it, and then you measure it again to see what worked and what didn’t. This is fundamental, and it’s often missing from a lot of school projects, because they’re the design-thinking kind of exercise: design a solution to a problem that you can’t actually implement, and walk away saying, hey, aren’t we great? Actually implement it and figure out how well it worked. Imagine if governments did that as a default. Imagine if we always did that when we tried to fix something, and actually checked to see if we did. It seems so obvious to me, but it’s not. It’s the step we skip all the time.
Michael: Yep, and it’s not just governments. I’ve seen this all the time in the commercial world. People are very happy to do whatever you want, call it data science or call it anything else, in order to finish a project. But as soon as the project is finished, as soon as it’s put into production, nobody goes the extra step of actually measuring how well it works. You did this project in order to solve a problem. Are you not going to measure how well it actually did solve the problem? And the answer is they don’t, because nobody has any incentive to do it. The only possible outcome is for the manager who was in charge of the project to now have data showing that the project isn’t working. And yeah, people are really afraid. We have not been teaching this; it goes back to the formal education question. We do not teach people that making mistakes is not just right, but it’s the only way of learning. It’s the only way of becoming better.
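For what it’s worth, the “measure it again” step is cheap. A minimal sketch, on made-up numbers, of a first-pass before/after check on whatever KPI a project was meant to move (a real evaluation would control for seasonality, mix shift, and so on):

```python
from scipy import stats

# Hypothetical KPI samples (say, daily conversion rates) from
# before and after the project went into production.
before = [0.041, 0.039, 0.044, 0.040, 0.042, 0.038, 0.043]
after = [0.045, 0.047, 0.044, 0.049, 0.046, 0.048, 0.045]

# Did the project actually move the needle, or is the difference
# within noise? A two-sample t-test is the crudest first pass.
t_stat, p_value = stats.ttest_ind(after, before)
print(f"mean before = {sum(before) / len(before):.4f}")
print(f"mean after  = {sum(after) / len(after):.4f}")
print(f"p-value     = {p_value:.4f}")
```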
Linda: Right. Right. And not just making mistakes, but going back and looking at your mistakes and learning from them. It should be a fundamental part of every project. Critically evaluating your own work and figuring out what went right and what went wrong and how you could do it better next time. It’s not rocket science. It seems really obvious to me, but it’s not what we do.
Michael: It’s not what we do. I provide some of that in the reviews that I do. But yeah, it’s far from being the standard process around the world. Quite the contrary.
Linda: Yeah. Speaking of mistakes, what are the worst data mistakes that you’ve seen?
Michael: Well, my job is to review other data scientists’ work; this is what I do for a living. I keep thinking I’ve seen it all, and every single time something comes along that’s just so much worse.
Here’s an example. It’s by far not the worst that I’ve seen, but it’s a good demonstration of the kind of issues I run into all the time. Remember I said what I wish is that everybody looked at data? Unfortunately, not even all data scientists look at the data. This particular example came from a review of a project done by an experienced senior data scientist. I’m not picking on a new graduate here.
The project was about finding suspicious activity in accounts. We were looking at a computer system, and the idea was that we might have somebody trying to infiltrate our system. An account like that will have a string of suspicious activities, and we want to find these strings. He built something that correlated strings of activity logged in the system with what was later flagged by the human cybersecurity team as fraudulent, and then built a system that found more suspected fraudulent accounts.
He was presenting this in a review and was extremely happy with his own success rate. It was such an amazing success rate. It was very obvious to me from the first second that it can’t possibly be true. It was just way, way too good. Probably like the single best indicator that a result is wrong. I just asked a very simple question. I said, “Have you actually looked at what were those sets of logged activities based on which the system decided these are fraudulent accounts?” He never did.
During the review, it turned out it was about 20 patterns the software homed in on. It turned out that 95% of everything the program found came from rules where the last activity in the pattern was labelled “ACTLCK”. I asked him, “Do you know what that event is? Do you know what ACTLCK stands for?” He didn’t know. He never bothered looking at his own data: he never looked at the raw data, and never looked at the processed data. Unfortunately for him, I did know what it was. ACTLCK stands for account lock. It was the indication that the existing in-place operational rule-based system had already flagged this account. They had already locked this account. Basically, he was leaking ground truth in a very extraordinary way.
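The check that would have caught this is worth spelling out. A minimal sketch, with hypothetical field names and data, of the one question to ask before trusting such a system: which event dominates my positive predictions?

```python
import pandas as pd

# Hypothetical output of the pattern-mining system: one row per
# flagged account, with the last event of the matched pattern.
flagged = pd.DataFrame({
    "account_id": [101, 102, 103, 104, 105],
    "last_event": ["ACTLCK", "ACTLCK", "ACTLCK", "LOGIN_FAIL", "ACTLCK"],
})

# If one event code accounts for the bulk of the flags, find out
# what that code means before believing the success rate.
print(flagged["last_event"].value_counts(normalize=True))

# Here ACTLCK ("account locked", i.e. the existing rule-based
# system already caught it) drives 80% of the flags: the model
# has learned to predict the label from the label.
```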
I presented this before as a failure to look at data. It certainly is a failure to look at data, but it’s more than that. It’s an unfortunately common lack of humility that many data scientists have. They come into a job believing that they should not consult with SMEs: “We are the replacement for SMEs. SMEs just have some experience. We have data.” No, no. If you only have data, then you only have numbers. You need the SME to be able to interpret what those numbers mean, so that you can learn the semantics of what’s going on, so that you can understand the real meaning behind those numbers. But data scientists think they shouldn’t spend any effort learning the domain.
Linda: Can you clarify for our non-technical listeners, what SME stands for?
Michael: Sorry, subject matter expert. I should have said the full name, not the acronym. Anyway, not consulting with a subject matter expert: to me, that is a lack of common sense, and it’s bad research methodology. It’s a complete lack of appreciation that data needs to be understood in context. This is not an extreme example, and it’s not an isolated incident. I see things like this in literally every single review.
Linda: It’s a classic flaw in the entire tech industry, really, where we constantly come in and go, “Oh, let’s do a startup to solve this problem, that we don’t really understand because we’re tech bros, not experts in that problem, but we don’t need experts. We’ll just solve it with the magic of tech,” and then we’ll hit all of these roadblocks that we couldn’t possibly have predicted were there because they’re relevant to the subject matter. Who knew? You keep reinventing the wheel, worse. You keep reinventing phrenology. You keep reinventing things that we know don’t work because you don’t understand the context, and you haven’t bothered to learn the history of it.
Michael: Yes, and so many data scientists have learned the tools of various types of correlation and come in saying, “We found these wonderful correlations.” A trivial question that should be asked and never is, is: “Is this news?” Maybe this is something that every single person who ever worked in the field already knows.
What data scientists don’t know how to do, largely because it’s much harder to automate so there are no tools for it, is find the second phenomenon. After we’ve taken into account all of the things that the subject matter experts already know, what can we innovate over this? What new insights can we bring into the field? But data scientists have shrouded themselves in this attitude of, “I don’t want to know what the SMEs already know.” They’d rather reinvent the wheel every single time.
Linda: Yep. Yep. Have you ever seen data deliberately misused?
Michael: I guess that depends on what you mean by deliberately. Continuing what we were talking about before, I think there’s a trend towards people understanding data less, and a lot of the mistakes they make you can, at this point, blame on ignorance rather than deliberation. But look, it’s basically two groups, both of which we’ve already mentioned in this conversation. The majority of data science done around the world is done by these two groups, and both of them constantly misuse data.
One we’ve talked about is the people who do confirmatory data science. So the managers who just basically want a particular result, and then the data scientist needs to give them that result, justify the result somehow.
I gave a talk to Monash graduates in data science a couple of years ago. This was a guest talk. It was like the last thing that they heard before their formal schooling was done. I said to them, “Seek a job as a data scientist. You will get hired by a company. On the very first day, someone will ask you to do something that is not data science. They will ask you to do programming. They will ask you to do BI. They will ask you to do data engineering…”
Linda: BI?
Michael: Oh! Business intelligence. So they will ask you to just build tables or bar charts of historical data. Sorry, I keep falling into the trap of acronyms.
Linda: We all do.
Michael: And at that point, you have a choice to make, I say to them. You can decide to say, “Yeah, sure.” And you will never see data science again. All you will do is one of these three things. Or you can decide to say, “No.” And chances are you’ll be out of that job fairly quickly. And there is no right choice or wrong choice here. One of them is going to be better for you tactically. It will secure your current job. One of them is going to be better both for you and for the entire profession strategically.
I know which one I picked. It’s the much harder choice. But I can’t blame people for falling into this trap of, “Oh, well, let’s just do something that is not data science.” And once you start down that road, then you do confirmatory data science and you don’t realize that it’s just wrong. People don’t understand that that is not their job as data scientists. And they in fact should stand up to that kind of behaviour as data scientists. But given that that’s the case, the value that is attributed to data has been going down.
So that’s one group that is misusing data. The other group that is misusing data is those people who claim that they can automate the process. And this is either the vendors who sell you the software, we talked about them, or the people who say, “Why don’t you just outsource all of your data science? We have data scientists. Just give us your data. We’ll give you back the results. It will be all good. Trust us.” And it’s just as much snake oil because they’re not going to spend the effort that you would have done in-house in order to really understand your data. They would not build the understanding of the data over months and years in order to do what’s most right for your specific company. It’s not their business model. Their business model is to just crunch numbers quickly, get back a response quickly, move on to the next job.
And I have been forced more than once to give some of my data to vendors like that, when I was employed by companies. And I always said, “Look, this is a mistake. We are taking data that is a strategic asset of ours and just handing it off to people. This will end up training models that will also be used by our competitors. You’re basically giving away your advantage.” On more than one occasion, I was overruled on this. When that happened, I insisted on only one thing: “At the end, whatever comes out of this project, I want to be able to do validation on it. I want to test the quality of that result.” And begrudgingly, my bosses agreed. And every single time, I found that the result had no value. Moreover, those vendors, every single time, cheated.
Cheated in the sense that they took the data given to them as a test set and pulled it into their training. I could prove that they did that. The first time I saw it, it was shocking to me, because I was working with one of the largest companies in the world. I thought, “These guys are very serious players. They have a very robust name. There’s just no way that they did that.” But the data doesn’t lie. And ever since then, I’ve seen it just everywhere.
Big companies, serious players, and they keep doing the same thing. And I think, again, is that deliberate, or are they just really not good at the job? How can you tell? Either way, what I tell my clients today is: if you feel the need to outsource, if you must do it, you can outsource whatever you want, but always keep the validation function in-house. You can’t not have a data scientist look at the data and validate the results. And what will happen is that very quickly, that data scientist will be able to tell you that the results are worthless. Very quickly, you will realize that the work done by that single data scientist you left in-house is more valuable to you than the entire contract with the big vendor that you’re paying I don’t even want to tell you how much money to. And yeah, very quickly, you will find yourself making the right business decision of cutting ties with that vendor and building out your own in-house data science function.
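One simple way that in-house validation can catch the kind of cheating Michael describes (a sketch of the general idea, not his actual procedure): keep a second, private holdout the vendor never sees. A model that was trained on the “official” test set will score suspiciously better on it than on truly unseen data.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels and vendor predictions on two sets: the test
# set that was handed to the vendor, and a private holdout that
# never left the building.
y_shared = [1, 0, 1, 1, 0, 1, 0, 1]
pred_shared = [1, 0, 1, 1, 0, 1, 0, 1]
y_private = [1, 0, 1, 1, 0, 1, 0, 1]
pred_private = [1, 1, 0, 1, 0, 0, 1, 1]

acc_shared = accuracy_score(y_shared, pred_shared)
acc_private = accuracy_score(y_private, pred_private)
print(f"shared test set: {acc_shared:.2f}, private holdout: {acc_private:.2f}")

# A large gap (here 1.00 vs 0.50) is strong evidence that the shared
# test set leaked into training; on honest, comparable data the two
# scores should agree up to sampling noise.
```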
Linda: Yeah. So how do we spot it when data is deliberately misused? You mentioned one big red flag earlier, which I actually wrote into one of my recent blogs, which is the results are too good, which is an immediate, oh, something’s wrong here. What else do we look for?
Michael: Oh, “the results are too good” is a classic. It actually has a name: it’s called Twyman’s Law. Twyman’s Law says that any statistic that looks too good probably is. Any result that seems too good and meaningful can’t possibly be it. Now, with the specific populations I’ve mentioned, when I go to review their work, I know what I’m going to find. Once you know what you’re looking for, the telltale signs are just too obvious. You look for those ground-truth leaks. You look for signs of overfitting: you look at a model and see that everything being used is the latest trend, whatever algorithm is currently making the rounds, and they’ve only used that one thing. Or you see that nothing has been live-tested against actual data; it’s all based on the one data set that you keep crunching.
These are very easy red flags to spot. I was one time in a meeting with a vendor, where I was there as a consultant. There was the vendor, there was the client, and the client was laying out the basics of their problem. And the vendor said, “Oh, well, we’re going to present for you a deep learning solution that will do this and this and this.” I don’t think they went so far as to actually commit to quantitative targets. But at that point, it was already obvious, before the work was ever done, that this was going to end up being problematic, for the simple reason: how can you possibly tell that it’s going to end up being a deep learning solution before you’ve started the research? You’ve committed to a solution. You’ve finished the research before it started, saying, this is what I’m going to get to.

Look, you have to know the domain in which you’re working in order to understand that the person presenting actually doesn’t. It’s not difficult when they start saying things that make no sense when you’re thinking about the realities of the situation. Very commonly, you find them using metrics that are the kind you’ll find in a statistics book rather than the kind you’ll find in a board meeting. Metrics that make no sense for the business, just because we have some theoretical result saying that if you use this metric, then that algorithm is guaranteed to converge. So you need to know your math, and you need to know your domain.

And quite often, I just ask them to visualize the results that they’ve got. Because you can fool one metric, a chosen metric. In fact, that’s mostly what machine learning algorithms do, right? You give them a metric, and you’re asking them to optimize it. So of course that one metric is going to look good. But when you ask somebody to actually visualize the result, you see the whole behavior of what is being done, and your brain immediately picks up on all of the other metrics they never thought of that completely screw up this one metric.
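Anscombe’s quartet is the classic demonstration of that last point: any one metric can be fooled, but a plot is much harder to fool. A minimal sketch: four data sets with nearly identical correlations that look nothing alike once visualized.

```python
import matplotlib.pyplot as plt
import numpy as np

# Anscombe's quartet: four (x, y) data sets with almost identical
# means, variances, correlations, and regression lines.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
for ax, (x, y) in zip(axes.flat, [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]):
    ax.scatter(x, y)
    # The "fooled" metric: correlation is ~0.816 in all four panels.
    ax.set_title(f"r = {np.corrcoef(x, y)[0, 1]:.3f}")
plt.tight_layout()
plt.show()
```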
Linda: That brings us beautifully to the subject of visualization. What is the first question you ask when you look at graphs in the media?
Michael: Oh, I think my answer to this one will surprise you. My first question when I see a graph in the media is: who drew this? Was this the art department? Was this the article writer? Was this something adapted from a journal paper? Because these are populations that have different incentives, and they can all make graphs that are misleading, but their graphs are misleading in different ways. And that’s not just how I look at graphs, and not just how I look at the media. I start with who did this. I go from that to what was their goal, what were they incentivized to do. I go from that to how they might have gone about accomplishing that goal. And by that time, I already have a pretty good idea of what to expect.
So for example, if this is a graph done by the art department, they will quite often use these nice-looking icons, where the size of the icon is supposed to represent something. But we know that people are very much misled by three-dimensional icons; they just don’t understand how they scale. They think it’s volume rather than height. It’s just how our brain works. And an article writer might decide not to start the graph from zero, just because they want to show the trend. If there’s a trend between, I don’t know, values moving from 999 to 1001, they will start the graph at 998 so that you get this very nice slope.
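That 999-to-1001 trick is easy to reproduce. A minimal sketch of the same made-up numbers plotted twice: once with the axis starting at 998, once anchored at zero.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
values = [999, 999.5, 1000.4, 1001]  # a ~0.2% change overall

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(months, values, marker="o")
ax1.set_ylim(998, 1002)   # truncated axis: looks dramatic
ax1.set_title("Axis starts at 998")

ax2.plot(months, values, marker="o")
ax2.set_ylim(0, 1100)     # honest axis: nearly flat
ax2.set_title("Axis starts at 0")

plt.tight_layout()
plt.show()
```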
If they adapt something from an original paper, usually those problems don’t exist. But the original paper may have had error bars, and if you don’t have a confidence interval next to your numbers, how are you going to interpret those numbers? How will you know which are the more trustworthy ones? But in order to simplify things for the general public, they normally take the error bars away.
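To see what is lost when the error bars go: a minimal sketch, on simulated data, of the same two group means shown with and without 95% confidence intervals. With the intervals drawn, it is visible that the apparent difference may be noise.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 3.0, size=20)
group_b = rng.normal(11.0, 3.0, size=20)

means = [group_a.mean(), group_b.mean()]
# Approximate 95% CI half-widths: 1.96 * standard error of the mean.
cis = [1.96 * g.std(ddof=1) / np.sqrt(len(g)) for g in (group_a, group_b)]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
ax1.bar(["A", "B"], means)
ax1.set_title("As published: no error bars")
ax2.bar(["A", "B"], means, yerr=cis, capsize=6)
ax2.set_title("With 95% CIs: overlap is visible")
plt.tight_layout()
plt.show()
```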
It’s actually part of a larger symptom: there’s an impression that the general public is afraid of data, and that you should hide as much of it as possible. And again, when you do that, what you lose is the context, and when you lose the context, you lose the meaning. Look, I used to have a habit many years ago, I’ve stopped it now: whenever I saw any article mentioning statistics, I would read it just in order to figure out where they got it wrong. Because most article writers are not experts on statistics, and at the same time, they are quite incentivized to make it sensational. So they pick and choose how they present it. For example, I would quite often find that whatever graph was in the article did not actually match the text of the article. A lot of things like that.
My absolute favorite example, this is an oldie but still a goodie: I remember an article with the ultra-sensational headline “Statistical study finds that increased sexual activity contributes to a younger appearance.” I looked at that and said, I don’t even have to read this paper. Everything that is wrong is right there in the title. It’s a statistical study; all you ever measured were correlations. How can you possibly assign causality here? Isn’t it at least as likely, I would argue more likely, that a younger appearance contributes to increased sexual activity? How do you know? Maybe it’s some C causing both A and B. Maybe it’s the way you measured that created the correlation. You cannot assign causality based on correlation. And yet this was the most sensational version of it that they could put in the headline.
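Michael’s “some C causing both A and B” can be demonstrated in a few lines. A minimal sketch of a simulation in which A and B never influence each other at all, yet are strongly correlated through a common cause C:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# C is the confounder; A and B each depend on C plus independent noise.
C = rng.normal(size=n)
A = 2 * C + rng.normal(size=n)
B = 2 * C + rng.normal(size=n)

# A and B never influence each other, yet they correlate strongly.
print(f"corr(A, B) = {np.corrcoef(A, B)[0, 1]:.2f}")  # ~0.8

# Removing C's contribution makes the correlation vanish: the
# residuals after subtracting out the common cause are independent.
resA = A - 2 * C
resB = B - 2 * C
print(f"corr(A, B | C) = {np.corrcoef(resA, resB)[0, 1]:.2f}")  # ~0
```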
Now, I stopped doing that many years ago, and the reason why is that we hardly have any science reporters anymore. In the move from a paper-based industry to an online industry, the newspaper business largely let the specialists go, and now we only have generalists. From generalists, I just don’t even expect them to try getting this right anymore. So if I’m interested in science news, that’s not where I get it from anymore.
Linda: Yeah, that’s a big problem. This has been so interesting, so much vigorous nodding from my end. What excites you about data?
Michael: That’s an easy one, and I think it’s something we already mentioned. It’s the power that it has to make me, to make everyone really, go, “Oh, I’ve been so wrong about all of this.” I deal with strategic data science, and oftentimes when I come to a company and start working, there still isn’t any data there. The work is to find what data is even needed. Can it be collected? Can it be bought? What kind of problems will we encounter if we try this or that? Then there’s quite a long period until enough data is available, and during that time, we start building potential solutions: some algorithm to do this, some algorithm to do that. And once we have enough of that, there is quite often, I want to say almost always, a point where I have a conversation with the client, and the client says, “Well, so we’re done.” And I say, “No, what do you mean, we’re done? We don’t even have the data yet. How can the research possibly be done?” And they’re like, “But you have the software.” Yes, we have the software, to show us that we are wrong.
The next step is going to be to figure out how we are wrong and build the next set of solutions. Those will still be wrong, but they’ll be wrong in more subtle ways. You have to keep digging into that until you get something that has the business value you’re after. And if they say to me, “But you’re a very, very experienced data scientist, isn’t your job to be right the first time around?”, no, it isn’t. If I’m right the first time around, that means I have not learned anything here. Then your data is in no way unique to you, and if it is in no way unique to you, it has no intrinsic value that only you can capitalize on. That is a very, very basic misunderstanding of what research is and how research works. So yeah, that’s really the magical thing about data. It excites me every single time. It is possibly the best teacher, if only you’re willing to listen to what it has to say.
Linda: That’s the key, isn’t it? Being open to being shown that you’re wrong. I think if more of us had that capacity, it would change a lot.
Michael: Yep. Well, it’s an attitude that I have been diligently cultivating with Otzma Analytics. I’ve been coaching managers in it. I’ve been training and upskilling data science teams to get into that mindset. I’ve been doing the reviews that show people who thought they had it all figured out that, in fact, there are still some very basic steps they are missing and need to learn. One company at a time, one client at a time, I’m trying to make this a better world for data science.
Linda: Magnificent. You work on the companies, I work on the schools.
Michael: Absolutely. Both sides of the same coin.
Linda: That’s right. Thank you so much.
Michael: Thank you, Linda. As always, a pleasure, a pleasure talking with you.
