
What does AI mean for medicine? Image by © Tim Sandle
Artificial intelligence (AI) is progressing, getting smarter. An example is with the way neural networks first treat sentences like puzzles solved by word order. Yet, once they have ‘read’ enough, a tipping point sends them diving into word meaning instead—an abrupt “phase transition”. By revealing this hidden switch, researchers from Sissa Medialab believe they can open a window into how transformer models such as ChatGPT grow smarter and hint at new ways to make them leaner, safer, and more predictable.
However, this type of advancement does not mean that AI is advancing in all of the areas that it needs to. One of the more problematic areas is with ethics, and one area of ethics that is of great importance is with medical decisions.
Thinking, Fast and Slow
AI models, including ChatGPT, can make surprisingly basic errors when navigating ethical medical decisions, a new study reveals. For this review, researchers from Mount Sinai’s Windreich Department of AI and Human Health tweaked familiar ethical dilemmas and discovered that AI often defaulted to intuitive but incorrect responses—sometimes ignoring updated facts.
The findings raise serious concerns about using AI for high-stakes health decisions and underscore the need for human oversight, especially when ethical nuance or emotional intelligence is involved.
The research team was inspired by Daniel Kahneman’s book “Thinking, Fast and Slow,” which contrasts fast, intuitive reactions with slower, analytical reasoning. The book’s main thesis is a differentiation between two modes of thought: “System 1” is fast, instinctive and emotional; “System 2” is slower, more deliberative, and more logical.
It has been observed that large language models (LLMs) falter when classic lateral-thinking puzzles receive subtle tweaks. Building on this insight, the study tested how well AI systems shift between these two modes when confronted with well-known ethical dilemmas that had been deliberately tweaked.
Gender bias
To explore this tendency, the scientists tested several commercially available LLMs using a combination of creative lateral thinking puzzles and slightly modified well-known medical ethics cases. In one example, they adapted the classic “Surgeon’s Dilemma,” a widely cited 1970s puzzle that highlights implicit gender bias. In the original version, a boy is injured in a car accident with his father and rushed to the hospital, where the surgeon exclaims, “I can’t operate on this boy — he’s my son!” The twist is that the surgeon is his mother, though many people don’t consider that possibility due to gender bias.
In the researchers’ modified version, the scientists explicitly stated that the boy’s father was the surgeon, removing the ambiguity. Even so, some AI models still responded that the surgeon must be the boy’s mother. The error reveals how LLMs can cling to familiar patterns, even when contradicted by new information.
In another example to test whether LLMs rely on familiar patterns, the researchers drew from a classic ethical dilemma in which religious parents refuse a life-saving blood transfusion for their child. Even when the researchers altered the scenario to state that the parents had already consented, many models still recommended overriding a refusal that no longer existed.
Why human oversight must stay central when we deploy AI in patient care
Consequently, the researchers conclude that where AI is used in medical practice, such findings highlight the need for thoughtful human oversight, especially in situations that require ethical sensitivity, nuanced judgment, or emotional intelligence.
In other words, medics and patients alike should understand that AI is best used as a complement to enhance clinical expertise, not a substitute for it, particularly when navigating complex or high-stakes decisions.
The research team plans to expand their work by testing a wider range of clinical examples. They’re also developing an “AI assurance lab” to systematically evaluate how well different models handle real-world medical complexity.
The research appears in the journal njp Digital Medicine titled “Pitfalls of large language models in medical ethics reasoning.”