I keep hearing how great chatbots are at summarising, and that this is one thing we can use them for reliably and safely. At the same time, chatbots are taking off in the health space, with a surprising number of clinicians, from psychologists to endocrinologists, embracing AI note-takers that allegedly relieve them of the burden of documentation and let them focus entirely on the patient. The idea is that the clinician checks over the notes after the appointment, to make sure they’re accurate. As one such system boasts, they let you: “Talk naturally with your patients — our AI listens, transcribes, and generates clinical notes, ICD-10 codes, and smart suggestions for treatment plans and next steps. It also creates editable medical documents in Word format.”
The trouble is that, according to the Cambridge Dictionary, to transcribe means “to record something written, spoken, or played by writing it down”. It doesn’t say “to guess at what was said and write down something that may, or may not, be at all correct”.
You might think I am being needlessly negative here, and I do sometimes wonder, given the enthusiasm I see around me, whether I am letting my scepticism of AI run away with me. So let’s consider a recent appointment I had with a new medical specialist. When I arrived, the receptionist had me fill out a form, emphasising that there were two sides. Side two was a consent form for the doctor to use AI during the appointment. I told the receptionist that I had not signed this form, as I did not consent to the use of AI. She was startled, and said she didn’t think he used AI. She told me to make sure I told the doctor when I went in, and it would be fine.
Unfortunately, by the time I went into the doctor’s rooms, I had forgotten, so I didn’t tell him not to use AI. It’s important to note, though, that informed consent must never hinge on the patient remembering to tell the doctor something. It’s only informed consent if it is actively and meaningfully sought. So. I did not consent, but AI was used anyway. No big deal, right? Unless you think consent matters… (And this is by no means the worst consent-related medical incident I have heard of – a friend seeing a different doctor was told “You don’t have to consent, but just know that if you don’t, the notes will be inferior”, which is not so much consent as coercion.)
The rest of the appointment went fine. Many things were discussed, he took a very detailed history, and seemed nicely thorough. He said he’d send me through a summary of our discussion after he had checked it. A day or two later the summary arrived. It was superficially ok, unless you knew my history and checked any of the details. Glancing at it was enough to show me that there were problems, so I went through it carefully and counted at least nine significant errors. Among other things, it described me as taking medications I am not on for issues I don’t have, omitted medications I actually am on for issues I do have, altered the number of pregnancies I had experienced, and left out parts of my medical history that were not only covered in the appointment, but also listed in the referral letter, and, more importantly, relevant to the issues I was seeing him for.
This, then, gives the lie to the “it’s ok as long as you check for errors” myth. Not only did he fail to pick up the errors when he checked; he actually had the correct information in front of him in the referral, and it still didn’t help. It turns out that carefully checking documents is really difficult for people. We tend to assume that information in front of us is accurate, so reading it looking for errors takes a lot of effort, and is extremely error-prone. In fact, reading errors tends to settle them in our heads, persuading us that they were correct all along. This means that accurately fact-checking a document is really hard work. It might even be more work than taking notes in the appointment would be!
We’re just not good at fact-checking things. We skim them, assuming they’re accurate, and then the errors settle in, like stones at the bottom of a pond. Sharp ones. Stones that may well cut our feet when we inadvertently step on them later. I did challenge the specialist about all of this on a subsequent visit, and he claimed that, despite the errors, the notes are still better than the ones he took himself. I would love to see an objective test of this. I’d also like to see alternative approaches, such as recording the appointment and listening back to it while checking the notes, tested against chatbots. But there’s no time for that in the race to the fanciest new clothes for the Emperor!
In related news, I asked three different chatbots – ChatGPT, Claude, and Gemini – to summarise a blog post of mine this afternoon. The blog post was titled “Channels of Information, or How Metro Trains Derailed its Passengers”.
ChatGPT gave me a summary talking about an actual derailment (not mentioned in my post at all), and manufactured the key points of the post. Gemini also manufactured key points, including “Trust in Data: If a system provides incorrect data once, the user loses trust in that system entirely, even when it eventually provides correct info.”, which I did not even imply in that article, much less say outright. It wasn’t at all relevant. Claude did better, without inserting really egregious errors, but it still missed significant details while highlighting inconsequential ones.
So no, chatbots are not “reliable documenters of medical appointments, as long as they are carefully fact-checked” – in part because of how much they get wrong, and in part because people are famously bad at that kind of attention to detail. And no, chatbots are not reliable summarisers of documents either.
We’ve still got a shockingly high level of emperor’s new clothes hype happening, and a shockingly low level of objective, empirical testing of these systems to determine what they can and can’t do.
