In a recent study published in the journal npj Digital Medicine, a group of researchers examined whether prompt engineering improves how reliably and consistently large language models (LLMs) align with evidence-based clinical guidelines.
Study: Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs
Background
LLMs have progressed significantly in natural language processing (NLP), showing promise for medical applications such as diagnosis and guideline adherence. However, their accuracy and reliability in the medical field vary, particularly in complex cases and in consistency with guidelines. Prompt engineering, which refines prompts to elicit better responses from LLMs, appears to be a promising strategy for improving their performance in medical contexts. Further research is needed to enhance LLMs’ accuracy, reliability, and relevance in medical settings, supporting clinical decision-making and patient care.
About the study
The present study tested LLMs’ consistency against the American Academy of Orthopedic Surgeons (AAOS) evidence-based osteoarthritis (OA) guidelines. The AAOS is the largest global association of musculoskeletal specialists, and its OA guidelines are supported by detailed research evidence and span recommendations from treatment to patient education, making them an authoritative resource in the field.
The study implemented four distinct types of prompts: Input-Output (IO) prompting, Zero-Shot Chain of Thought (0-COT) prompting, Prompted Chain of Thought (P-COT) prompting, and Return on Thought (ROT) prompting. The objective was to examine the LLMs’ adherence to the AAOS guidelines and the reliability of their responses upon repeated inquiries. Each prompt was designed to elicit a response that could be evaluated against the recommendations in the AAOS guidelines.
Nine different LLMs were utilized, accessed either through web interfaces or Application Programming Interfaces (APIs), with fine-tuning performed as per protocols described on the OpenAI platform. Statistical analysis, conducted using SPSS and Python, focused on measuring the consistency and reliability of the LLMs’ responses. Consistency was defined as the proportion of instances in which the LLMs’ recommendations matched those of the AAOS guidelines exactly, while reliability was measured by the repeatability of responses to the same questions, assessed using the Fleiss kappa test.
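The two metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the function names, the rating labels, and the sample data are hypothetical, and the sketch assumes each question receives the same number of repeated responses.

```python
from collections import Counter


def consistency_rate(responses, guideline):
    """Fraction of responses whose rating exactly matches the guideline rating."""
    matches = sum(1 for r, g in zip(responses, guideline) if r == g)
    return matches / len(guideline)


def fleiss_kappa(ratings):
    """Fleiss' kappa for repeated categorical ratings.

    `ratings` is a list of items (questions); each item is a list of labels,
    one per repeated query. Every item must hold the same number of labels.
    """
    n_items = len(ratings)
    n_repeats = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_repeats)
        / (n_repeats * (n_repeats - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Agreement expected by chance, from the overall category proportions.
    p_cats = [sum(c[cat] for c in counts) / (n_items * n_repeats) for cat in categories]
    p_e = sum(p ** 2 for p in p_cats)

    if p_e == 1.0:
        # Every response fell into a single category; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: three questions, each asked five times.
repeated = [
    ["strong", "strong", "strong", "strong", "moderate"],
    ["limited", "limited", "moderate", "limited", "limited"],
    ["moderate", "moderate", "moderate", "moderate", "moderate"],
]
print(fleiss_kappa(repeated))
```

A kappa near 1 means the model gives the same rating almost every time a question is repeated, while a value near or below 0 means agreement is no better than chance. Note that high reliability says nothing about consistency: a model can repeat the same wrong rating every time.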
Study results
The present study’s findings highlighted gpt-4-Web (a generative pre-trained transformer model accessed through the web interface) as the superior model in terms of consistency, with rates between 50.6% and 62.9% across different prompts. By comparison, other models such as gpt-3.5-ft-0 and gpt-4-API-0 demonstrated lower consistency rates with specific prompts. The highest consistency was observed with ROT prompting in gpt-4-Web, suggesting that this combination aligns most effectively with clinical guidelines. An analysis across various models and prompts revealed a broad range of consistency rates, with gpt-4 models achieving up to 62.9% and gpt-3.5 models, including fine-tuned versions, reaching up to 55.3%. Bard models showed a consistency range from 19.4% to 44.1%, indicating that the effectiveness of prompts varies across different LLMs.
Subgroup analysis was conducted based on the AAOS’s categorization of recommendation levels from strong to consensus. This analysis aimed to discern whether the strength of evidence impacted consistency rates. It was found that at moderate evidence levels, no significant differences in consistency rates were observed within gpt-4-Web. However, notable differences emerged at the limited evidence level, where ROT and IO prompting significantly outperformed P-COT prompting in gpt-4-Web. Despite these findings, consistency levels in other models generally remained below 70%.
Reliability, assessed using the Fleiss kappa test, varied widely among the models and prompts, with values ranging from -0.002 to 0.984. This variability indicates differing levels of repeatability in responses to the same questions across models and prompts. Notably, IO prompting in gpt-3.5-ft-0 and gpt-3.5-API-0 demonstrated almost perfect reliability, while P-COT prompting in gpt-4-API-0 showed substantial reliability. However, the overall reliability of other prompts and models was moderate or lower.
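Labels such as "almost perfect", "substantial", and "moderate" match the widely used Landis and Koch benchmark scale for kappa values. The sketch below assumes that scale is the one applied here, which the summary above does not state explicitly; the function name is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a kappa value to its Landis and Koch agreement label (assumed scale)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# The two extremes reported in the study.
for value in (-0.002, 0.984):
    print(value, interpret_kappa(value))
```

Under this scale, the lowest reported value (-0.002) falls in the "poor" band and the highest (0.984) in the "almost perfect" band.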
Invalid data were categorized and processed according to specific procedures, with a significant portion of responses to certain prompts being considered invalid, particularly in gpt-3.5-API-0. This contrasted with gpt-4-Web, which had a relatively low rate of invalid responses.
Conclusions
To summarize, the study highlights the impact of prompt engineering on the accuracy of LLMs in medical responses, particularly noting the superior performance of gpt-4-Web with ROT prompting in adhering to clinical guidelines for OA. It underscores the importance of combining prompt engineering, parameter settings, and fine-tuning to enhance LLM utility in clinical medicine. The findings advocate for further exploration into prompt engineering strategies and the development of evaluation frameworks involving healthcare professionals and patients, aiming to improve LLM effectiveness and reliability in medical settings.