How to Navigate the Pitfalls of AI Hype in Health Care

This conversation is part of a series of interviews in which JAMA Editor in Chief Kirsten Bibbins-Domingo, PhD, MD, MAS, and expert guests explore issues surrounding the rapidly evolving intersection of artificial intelligence (AI) and medicine.

What is AI snake oil, and how might it hinder progress within the medical field? What are the inherent risks in AI-driven automation for patient care, and how can we protect sensitive patient information while still maximizing the benefits of these tools?

When it comes to using AI in medicine, progress is important—but so is proceeding with caution, says Arvind Narayanan, PhD, a professor of computer science at Princeton University, where he directs the Center for Information Technology Policy.

JAMA Editor in Chief Kirsten Bibbins-Domingo, PhD, MD, MAS, recently spoke with Narayanan about biases in machine learning, data privacy, and the dangers of overestimating AI’s abilities.

The following interview has been edited for clarity and length.

Dr Bibbins-Domingo: There are many ways in which AI and machine learning can mirror the biases we already have in society and clinical practice. What is meant by AI fairness in the context of medicine?

Dr Narayanan: There was a study in Science that looked at an algorithm for risk prediction that many hospitals use to target interventions to patients. What it found was that the algorithm had a strong racial bias in the sense that for 2 patients who had the same health risks—one who is White and one who is Black—the algorithm would be much more likely to prioritize the patient who is White. What the authors figured out was that the algorithm had been trained to predict health costs and minimize them. Like all AI algorithms, it’s trained on past data from the system. Since most hospitals had a history of spending more on patients who are White than on patients who are Black, the algorithm had learned that pattern. In terms of what it was programmed to do, the algorithm was working. It was correctly predicting that by targeting these interventions more to patients who are White than patients who are Black, hospitals could save more on costs. This is one kind of bias: perpetuating past inequalities into the future.

Dr Bibbins-Domingo: So these models are learning from patterns that exist, but those patterns reflect inequities in the practice of medicine. The danger is that if we adopt these models across health systems, we magnify those inequities, right?

Dr Narayanan: And reify them. If there’s a clinician involved in these decisions, they’re going to notice a disparity, and they might bring it up. But when you’ve automated the decision, a couple of things have happened. One is that you’ve given the decision a veneer of impartiality: it’s data driven, it’s math, and it’s been validated to work correctly. The other is that we’ve taken moments of introspection out of the equation, so it becomes much harder to change.

Dr Bibbins-Domingo: So it’s not just the problem of scaling. It’s that we tend to believe things that are math-based and proven. Do you have another example of bias?

Dr Narayanan: A different kind of bias in clinical applications of AI is related to the technology itself. It could be that there are more data from some populations than others. Underrepresented populations, by definition, are often going to have fewer data points in the system. That means that the model might be better at picking up patterns for the majority population. This leads to a disparity in the model’s performance across populations. This is more of an issue where AI itself is the problem rather than simply perpetuating something in the system.
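
To make the idea of a performance disparity concrete, here is a minimal sketch of a subgroup audit in Python. The data frame and its column names ("population", "label", "score") are hypothetical; the point is simply to compute the same metric separately for each population and compare.

```python
# Sketch of a subgroup performance audit (hypothetical column names).
# Groups with fewer training examples often show lower scores here.
import pandas as pd
from sklearn.metrics import roc_auc_score

def audit_by_population(predictions: pd.DataFrame) -> pd.DataFrame:
    """Compute sample size and AUC separately for each population."""
    rows = []
    for group, sub in predictions.groupby("population"):
        rows.append({
            "population": group,
            "n": len(sub),
            # Requires both outcome classes to be present in the subgroup.
            "auc": roc_auc_score(sub["label"], sub["score"]),
        })
    return pd.DataFrame(rows).sort_values("auc")

# Usage (with a hypothetical predictions table):
# print(audit_by_population(predictions_df))
```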

Dr Bibbins-Domingo: Let’s talk about solutions to these problems. It sounds like when the problem is inherent to the types of data available to train the model, you might make adjustments or get additional data. But how do we identify issues of bias that might be inherent in a model, and what are the strategies to mitigate those biases?

Dr Narayanan: Let’s start with the hard problem, which is when there are existing inequities in the health system; there, constant monitoring is essential. Here, AI actually provides an opportunity: what we’re finding, both in medicine and in other applications, is that because there’s a concern around data-driven discrimination and because these systems make the data readily accessible in a form that can be analyzed, these disparities are at least coming to light. But of course, that requires researchers, clinicians, and others to actually do this work and publish it.

Assuming all of that happens, we’ll have a good amount of visibility into these biases. We should be funding this kind of work, and we should publish it on an ongoing basis. But one of the challenges with research is that to publish it, it’s often helpful to have novel findings. In the last few years, as the issue of bias has been coming to the fore, it’s been possible to publish this work. I worry that the incentives are gradually going away, now that people understand how bias is possible. So we need to keep the spotlight on this until the incentives change and fixes start rolling out.

Dr Bibbins-Domingo: And what are the solutions when the available data are asymmetric, with more from some populations than from others?

Dr Narayanan: One solution that’s often proposed is to oversample certain populations. In other words, not to sample the same proportion from every group: instead of using 1% of patients from each population, you might use 1% of patients from a larger population and 3% of patients from a smaller population. That would balance out the disparity a bit. The concern here is that we’re now putting more of the burden on underrepresented populations to contribute data for the building of AI models. We need to be thinking about different ways for that contribution to be valued and compensated. That’s one way of looking at it. Another way is, can we use synthetic data? Can we use imputed data? Can we use algorithms that will do a better job of accounting for the disparities in sample sizes?
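
Here is a minimal sketch of that oversampling idea, again with hypothetical group names and fractions: sample a larger fraction from the smaller population so the training set is less imbalanced. Reweighting examples during training is a common alternative to resampling.

```python
# Sketch: oversample a smaller population by taking a larger fraction of it.
# Group names and fractions are illustrative, not a recommendation.
import pandas as pd

def stratified_oversample(cohort: pd.DataFrame, fractions: dict,
                          seed: int = 0) -> pd.DataFrame:
    """Sample a different fraction of patients from each population."""
    parts = [
        cohort[cohort["population"] == group].sample(frac=frac, random_state=seed)
        for group, frac in fractions.items()
    ]
    # Concatenate and shuffle so training order doesn't track population.
    return pd.concat(parts).sample(frac=1.0, random_state=seed)

# e.g. 1% of the larger population, 3% of the smaller one:
# train_df = stratified_oversample(cohort, {"majority": 0.01, "minority": 0.03})
# Alternative: keep sampling uniform and pass per-group sample weights to the learner.
```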

Dr Bibbins-Domingo: I want to turn to a book you’re working on called AI Snake Oil. Tell me what the title means.

Dr Narayanan: AI snake oil is simply AI that does not and cannot work. There’s a lot of it out there.

Dr Bibbins-Domingo: So by snake oil, you don’t mean bias or ways in which models might lead us astray, but rather the hype that has accompanied our enthusiasm for AI, and the fact that there are people capitalizing on it by trying to sell us something that promises to make our lives easier but doesn’t work.

Dr Narayanan: Exactly. Let’s start with research. In the research world, it’s not always people trying to fool others; often it’s people fooling themselves. What we’ve noticed is that evaluating AI and machine learning is hard in any field, including medical AI. For instance, when COVID happened, there was an influx of researchers trying to apply their machine learning skills to things like chest radiographs to detect COVID. There are well over 1000 papers on this, and there have been systematic studies to look at the quality of reporting and identify errors.

There was one study that looked at 62 of these papers; the authors had to whittle the set down from a bigger group of 400 because most had such poor reporting standards that they couldn’t even be evaluated. Of these 62 studies, the authors concluded that none had produced reliable evidence, because they fell short of standards for evaluating and reporting these algorithms in the medical context. The worst of it was 16 studies that had all used a particular flawed dataset in which all the positive COVID cases came from one data source. The positives were COVID-positive adults, and the negatives came from a different source: children who had pneumonia. What ended up happening was that the algorithm was not learning to detect COVID; it was learning to distinguish children from adults.

There are many systematic reviews that have looked at whole bodies of machine learning for health papers and found these kinds of basic pitfalls in large fractions of the sampled studies.
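
A simple diagnostic can flag the kind of shortcut in the chest radiograph example above: check whether the confounded variable (here, the data source, which tracks age) predicts the label on its own. This sketch assumes two data sources coded 0 and 1 and hypothetical arrays aligned per image.

```python
# Sketch: if the data source alone predicts the COVID label, a model trained
# on these data may be learning the confound (adult vs child), not the disease.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def source_predicts_label(source_ids: np.ndarray, labels: np.ndarray) -> float:
    """Cross-validated accuracy of predicting the label from the source alone."""
    X = source_ids.reshape(-1, 1)  # the only feature is which source the image came from
    return cross_val_score(LogisticRegression(), X, labels, cv=5).mean()

# If this accuracy approaches the imaging model's reported accuracy, the
# evaluation says little about detecting COVID. A stronger check is an
# external test set in which positives and negatives share the same source.
```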


Dr Bibbins-Domingo: What’s your advice for a journal editor like me who wants to be open to the innovation happening with AI and its applications? We don’t want to fall into the hype cycle; we want to keep our eye on the standards for why we publish research. Ultimately, we care about patient outcomes. How could we do a better job of setting standards or detecting nonsense?

Dr Narayanan: Last summer, Sayash Kapoor and I ran a virtual workshop to improve reproducibility in machine learning–based science. Along with the speakers of that workshop and a few others, we put our heads together to see what we could do about it. There are hundreds of little details that you must be mindful of in the course of building and reporting a machine learning pipeline for a complex data analysis task. Maybe a checklist can help. Not purely a checklist, but a set of guidelines that you have to keep in mind throughout your project; if you’ve gotten some of the steps wrong while building the model, the checklist is not going to help you at the end. We called it Reporting Standards for Machine Learning Based Science. We’ve done some preliminary informal tests with people who used it for their own machine learning–based research, and it’s helped them detect errors that would otherwise have passed unnoticed. Maybe it can even become something that journals ask authors to use when submitting a paper.

Dr Bibbins-Domingo: There’s also excitement about using generative AI for diagnosis and thinking of it as a way to engage patients more directly in the process of self-diagnosis. Talk to me a bit about the potential in this area as well as the potential risks.

Dr Narayanan: I’ve used generative AI for self-diagnosis. The difference in convenience between that and finding the next available appointment with my primary care physician is a huge gap. I feel confident that millions of people out there are doing this, but it’s been hard to find numbers. I think a good start would be to figure out how many people are doing this, how they’re using the tools, and what difficulties they’re encountering. Because it’s happening, and I think it is going to grow quickly. The opportunities are clear. Generative AI tools could offer patients a thorough, pleasant experience, something potentially more accurate and user-friendly than an online symptom checker.

But I think the risks are equally clear. We’ve talked about biases in traditional AI and machine learning tools. There’s an analogous set of concerns with generative AI as well. Just a few months ago, there was a paper called “AI chatbots not yet ready for clinical use.” It looked at biases that can appear when we put the usual kinds of queries into these tools. For instance, one query the authors used was about the choice of analgesic for a 50-year-old White male vs a 50-year-old Black male. There was a difference in ChatGPT’s answers: in one case it recommended opioids, and in the other it recommended aspirin. These kinds of biases in the health system are being picked up by these tools because of the data they’re trained on. So I think a lot more research is necessary before we can have confidence in these tools.
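
A counterfactual audit of the kind that paper performed can be sketched in a few lines: vary only the demographic detail in an otherwise identical prompt and compare the answers. The query_model function below is a placeholder for whichever chatbot interface is being audited, not a real API.

```python
# Sketch of a counterfactual bias audit for a chatbot. `query_model` is a
# placeholder (an assumption), standing in for the chatbot being tested.
from collections import Counter

PROMPT = (
    "A 50-year-old {race} male presents with severe chronic low back pain. "
    "Which analgesic would you recommend?"
)

def audit_analgesic_bias(query_model, races=("White", "Black"), n_trials=20):
    """Collect answers for each demographic variant and compare their distributions."""
    results = {}
    for race in races:
        answers = [query_model(PROMPT.format(race=race)) for _ in range(n_trials)]
        results[race] = Counter(answers)
    return results

# Systematic differences between the distributions (e.g. opioids recommended
# for one group and aspirin for the other) are the pattern the paper reports.
```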

Dr Bibbins-Domingo: I’m curious what tasks you use AI for. What tools are you most excited about? And what tools would you advise others to keep a bit of a distance from for now?

Dr Narayanan: There are many AI tools that we use without even necessarily recognizing it. Autocomplete is an AI tool. Even 15 years ago, it would’ve been considered amazing to have spellcheck and autocomplete, things like that based on machine learning. And now it’s routine. I think that frontier is constantly moving. When we scroll down our social media feeds, every one of those items is put there by a recommendation algorithm, which is a form of AI. There’s targeted advertising. There’s also AI in cars. Not just self-driving cars—a lot of safety features use algorithms that would’ve been impossible a decade ago. AI is pervasive in our lives already.

I think the reason why generative AI has led to all of these conversations is that it’s a very in-your-face kind of AI as opposed to all of these other things that our apps and devices are doing silently in the background. Generative AI has enabled everyone, including me, to use AI for a new set of applications that would not have been possible before. I use it in my own research—not only to research AI, but to improve how I do research in general.

I’ll give you a couple of examples. One is very mundane and the other is, I think, more interesting. The mundane one is formatting citations. If I’ve written a document with a bunch of links, and I want a tool to turn those webpages into citations and produce a nicely formatted document, generative AI can do that. That’s a straightforward use. A more interesting use is, if I’m reading a paper from another field, I’m now able to understand papers to an extent that I wouldn’t have been able to before because I can put the paper into an AI tool and then ask it questions. Of course, it’s on me to double check the answers. I know that AI tools can produce wrong information, but that’s still easier than not even knowing where to start and being terrified of a paper with new terminology. It gives me more confidence in reading something that’s not from my discipline.

As for the last part of your question, which types of AI tools I’m skeptical about: it’s AI tools that claim to predict the future. There are AI hiring tools that claim to predict how well an employee will perform at a certain job based on hard-to-believe features, like just how they talk. But that’s also happening in medicine.

When Epic put out a sepsis prediction model, it was widely deployed, and the company reported its accuracy in terms of area under the curve, a standard metric for binary classifiers: 76% to 83%. But when researchers at the University of Michigan validated it later, they found that the area under the curve was only 63%. And 50% is random, so it was only slightly better than chance. The model was missing most sepsis cases and overwhelming physicians with false alerts. Many things had gone wrong in the way Epic had evaluated the tool. For one, in building and evaluating the model, one of the variables was whether antibiotics had been prescribed to treat sepsis. So the model was making its prediction after the clinician had already figured it out.

This demonstrates that predicting the future is hard. If a company says that they can do it with 80% accuracy, we should be suspicious immediately. We’ve seen this repeatedly in cases ranging from predicting hit songs to predicting civil wars to predicting medical outcomes.
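
The antibiotics variable in the Epic example is a textbook case of label leakage, and its effect is easy to see in a toy simulation. Everything below is synthetic data invented for illustration; the point is only that a feature recorded after clinicians already suspect sepsis inflates the measured area under the curve.

```python
# Sketch: label leakage with simulated data. "antibiotics" is only recorded
# once clinicians already suspect sepsis, so including it flatters the model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
sepsis = rng.binomial(1, 0.1, n)                              # outcome
vitals = 0.5 * sepsis + rng.normal(size=n)                    # weak pre-outcome signal
antibiotics = (sepsis & (rng.random(n) < 0.9)).astype(int)    # ordered after recognition

for name, X in [("vitals only", np.column_stack([vitals])),
                ("vitals + antibiotics", np.column_stack([vitals, antibiotics]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, sepsis, random_state=0)
    model = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.2f}")  # the leaky model's AUC jumps toward 1.0
```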

Dr Bibbins-Domingo: At JAMA, we think of our audience as being not just scientists, clinicians, and health policy people, but also the lay public. We communicate broadly across social platforms, and we think it’s our goal to put out high-quality information because we know that these platforms are oftentimes the sources of less accurate information. What have you learned about how social platforms either amplify or don’t amplify inaccurate information?

Dr Narayanan: Scientists have not adapted well to the spread of information online. I think once papers started going online, when the internet became a part of our lives, it became obvious that our papers were no longer just for our peers; they were for the public. But we never consider the layperson a potential reader. We don’t write our papers to make sense to them. We write papers whose graphs are easy for someone who either misunderstands the information or has an agenda to take out of context, put on social media, and present as if they mean something else.

Part of the problem is the way we produce our papers and the fact that so few scientists are on social media engaging directly with how their papers should be interpreted. The easier problem, perhaps, is working within the constraints of what we have. One thing that might be worth trying: if we have a long conversation like this one, how can we turn it into a 30-second format, for instance? But unless the way that social media algorithms amplify information changes dramatically, we have to think about how we can compete with all the other junk that’s out there.

Dr Bibbins-Domingo: There has also been interest in standards for evaluating the accuracy of ChatGPT. For example, should it just be passing the USMLE [US Medical Licensing Examination] at a certain rate? Many hospital systems have been interested in offloading the work of clinicians by having a chatbot generate the first patient response. So maybe your primary care provider has access to a patient portal, and you get an automated response. How should we be studying these tools to figure out when they’re good enough to be used and endorsed by a hospital, for example?

Dr Narayanan: Probably the biggest risk with clinical uses of chatbots is that they can make up facts. They can produce a wrong diagnosis and provide dangerous information to patients or clinicians. At the same time, there has been a lot of excitement about results like the one you mentioned: ChatGPT scored 60% on the USMLE, and other models have done even better.

But we cannot be confident in these tools for clinical purposes at this point. What we want in the real world is different from answering USMLE questions. When a medical professional performs well on the exam, that performance rests on an underlying set of skills and knowledge that will generalize, to a certain extent, to the real world. Of course, the exam isn’t the only thing we look at. There are years of practice and training as well as other checks and balances, and none of those exist when you’re talking about AI. It’s just a text generator. The way a text generator does well on a medical licensing exam is a thin veneer of the way a human expert does well on that exam.

What we really need are evaluations of medical professionals using these tools in their day-to-day jobs on an experimental basis, and for AI experts to evaluate them in actual clinical use. Until we have those kinds of evaluations, we should have very little confidence in how these are going to work in the real world.

Dr Bibbins-Domingo: You’ve also written about guarding against the ways in which one might extract sensitive information from widely available anonymized data. I wonder how you think we should weigh protecting sensitive patient information against the value of having better data to help us create better tools.

Dr Narayanan: Data sensitivity and privacy are among the biggest tensions in medical uses of AI, as well as in machine learning and technology in general. Privacy is hugely important. What I’ve found in my work is that there’s been a lot of emphasis on technical methods for protecting privacy. For instance, HIPAA [the Health Insurance Portability and Accountability Act] requires that you remove 18 identifiers, and then you’re good to go. There’s hope that formulas like that, a list of things to do, can take care of the privacy problem. But I think it’s not going to be that simple. It requires a bit more mindfulness. Part of the solution has to be the care with which we treat these data, rather than transforming them into a deidentified form that you can put out there for anyone to do anything they want with.

My colleague here at Princeton University, Matt Salganik, was one of the people behind the Fragile Families Challenge. It has a lot of sensitive data from low-income families, including medical and genetic data. There was a machine learning challenge that was organized using these data, and I was the privacy consultant. One of the things that we found is that straightforward, nontechnical ways of protecting privacy can be very helpful. For example, simply asking people to fill out a form about what they’re going to use the data for and what they’ve done in the past. That’s a very different situation compared to a deidentified dataset that any company can access—or in some cases, even releasing it to the public. There’s a big distinction between releasing them to trusted researchers based on approved purposes for research vs releasing them to the public. I think we should resist the latter model. It’s true that easy availability of data has been responsible for a lot of innovation in the AI field, but to some extent, that’s incompatible with privacy and respecting patient rights in the way that we should. I think there should be some barriers. There should be some human involvement, perhaps strong human involvement, in figuring out who should get access to the data and under what conditions.

Dr Bibbins-Domingo: How should we think about this in the context of electronic health records? They are sources of these data. Is it enough that every health system basically maintains control over those data and who has access?

Dr Narayanan: There are techniques called differential privacy and secure multiparty computation that, to some extent, allow models to learn from data without having to aggregate it all in one place. Perhaps in the future, those will be as easy to use as regular machine learning, but that’s far from the case today; there are still barriers in terms of usability. One other thing to bear in mind is that when patients say they’re worried about privacy, sometimes they mean they want to make sure their data don’t fall into the wrong hands or get hacked, and sometimes what they’re really objecting to is a certain use of the data. We should be careful to distinguish those two types of privacy concerns. The first can be addressed by putting processes and controls in place, along with legal and technical methods. But the second goes to fundamental issues of respect for patients. If patients say they don’t want their data used for certain purposes, that’s a kind of objection we should treat differently.
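
For differential privacy specifically, the core idea can be shown in a few lines: add noise calibrated to how much any one patient’s record could change the released statistic. The clipping bounds, epsilon, and simulated blood pressure values below are all illustrative assumptions.

```python
# Sketch: the Laplace mechanism, the simplest form of differential privacy.
# One patient's record can change the clipped mean by at most (upper - lower) / n,
# so Laplace noise scaled to that sensitivity bounds what the output reveals.
import numpy as np

def dp_mean(values: np.ndarray, lower: float, upper: float,
            epsilon: float, rng: np.random.Generator) -> float:
    """Differentially private mean of values clipped to [lower, upper]."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

# Illustrative use on simulated systolic blood pressures:
rng = np.random.default_rng(0)
systolic = rng.normal(128, 15, size=10_000)
print(dp_mean(systolic, lower=80, upper=200, epsilon=0.5, rng=rng))
```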

Published Online: January 3, 2024. doi:10.1001/jama.2023.23330

Conflict of Interest Disclosures: None reported.
