The artificial intelligence (AI) chatbot ChatGPT is highly inaccurate at making pediatric diagnoses, a new study finds.
Just as many parents may consult websites like WebMD to check symptoms their children are experiencing, they may also be tempted to consult ChatGPT. But researchers found the AI chatbot — powered by a language model called GPT-3.5 made by OpenAI — failed to correctly diagnose 83% of pediatric cases it examined. They published their findings Jan. 2 in the journal JAMA Pediatrics.
Their research, the first to assess ChatGPT's ability to diagnose pediatric cases, follows a previous study published June 15, 2023 in the journal JAMA. That earlier work showed that GPT-4, a newer language model, correctly diagnosed only 39% of challenging medical cases, including those concerning both adults and kids.
In this new study, the researchers ran 100 patient case challenges sourced from JAMA Pediatrics and The New England Journal of Medicine (NEJM) through ChatGPT, asking the chatbot to “list a differential diagnosis and a final diagnosis.” Differential diagnoses refer to the plausible medical conditions that might explain a person’s symptoms, and after assessing all these possibilities, a doctor then reaches a final diagnosis.
These pediatric cases were published in the journals between 2013 and 2023.
Two medical researchers scored the chatbot's answers by comparing each AI-generated diagnosis with the one the clinicians reached in each case. They rated each AI-generated response as "correct," "incorrect" or "did not fully capture diagnosis."
High levels of inaccuracy
ChatGPT gave outright incorrect diagnoses for 72 of the 100 cases, and a further 11 of its answers were rated "clinically related but too broad to be considered a correct diagnosis." Together, those two groups account for the 83 of 100 cases (83%) the chatbot got wrong.
In one case challenge ChatGPT got wrong, a teenager with autism presented with a rash and joint stiffness. While the physician in the original case diagnosed the teen with scurvy, a condition caused by a severe lack of vitamin C, ChatGPT diagnosed immune thrombocytopenic purpura, an autoimmune disorder that affects blood clotting and causes bruising and bleeding. People with autism can have very restrictive diets, due to sensitivities to food textures or flavors, which can make them prone to vitamin deficiencies.
In another missed case, an infant had a draining abscess on the side of their neck, which the original case physician attributed to branchiootorenal (BOR) syndrome, a developmental condition that affects the formation of the kidneys, ears and neck. ChatGPT instead concluded the infant had a branchial cleft cyst, which forms when tissues in a baby's neck and collarbone develop improperly before birth.
However, in a few cases, ChatGPT reached the same diagnosis as the doctors. For a 15-year-old girl with unexplained pressure on the brain, known as idiopathic intracranial hypertension (IIH), ChatGPT matched the physician's original diagnosis of Addison's disease, a rare hormonal condition that affects the adrenal glands. Rarely, IIH can be a knock-on condition stemming from Addison's disease.
A mixed outlook for healthcare
Although the researchers found high levels of inaccuracy in AI-generated pediatric diagnoses, they said large language models (LLMs) still have value as an "administrative tool for physicians," such as for note-taking. However, they noted that the chatbot's underwhelming diagnostic performance in this study underscores the invaluable role that clinical experience holds.
One of ChatGPT's most significant limitations, the researchers explained, is its inability to spot relationships between medical disorders, such as the link between autism and vitamin deficiencies seen in the aforementioned scurvy case, which was published in 2017 in the journal JAMA Pediatrics. They believe that "more selective training is required" to improve AI's ability to make accurate diagnoses in the future.
These technologies can also be let down by “a lack of real-time access to medical information,” they added. As a result, they warned that AI chatbots may not keep up-to-date with “new research, diagnostic criteria, and current health trends or disease outbreaks.”
“This presents an opportunity for researchers to investigate if specific medical data training and tuning can improve the diagnostic accuracy of LLM-based chatbots,” the researchers concluded in their paper.