
Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

The results of this study suggested that prompt engineering can affect the accuracy of LLMs in answering medical questions. Additionally, LLMs do not always provide the same answer to the same medical question. The combination of ROT prompting and gpt-4-Web outperformed the other combinations in providing professional OA knowledge consistent with clinical guidelines.

We have summarized the current performance of LLMs in diagnosing, querying, and examining patients within clinical medicine in Supplementary Table 3. Indeed, GPT-4 has exhibited superior performance compared to both GPT-3.5 and Bard in the field of clinical medicine16,22,23,24,25,26,27,28,29. In our study, when the performance of the four types of prompts was combined across different models, as shown in Fig. 1, gpt-4-Web, also known as ChatGPT-4, demonstrated the most balanced and prominent performance.

Previous research has primarily assessed GPT-4 through web interfaces in clinical medicine. The study of Fares et al. 30 accessed GPT-4 via the API at different temperatures (temperature = 0, 0.3, 0.7, 1) and found that the model performed better at a temperature of 0.3 in answering ophthalmology-related questions. Our study revealed differences in consistency and reliability between GPT-4 accessed via the web and GPT-4 accessed through the API. Among the GPT-4 products with specific parameter settings, namely gpt-4-Web, gpt-4-API with a temperature of 0 (gpt-4-API-0), and gpt-4-API with a temperature of 1, gpt-4-Web exhibited the most prominent performance. This indicates that adjusting the internal parameters of LLMs for different tasks can alter their performance.
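For illustration, the sketch below shows how such temperature settings can be applied when querying a GPT-4 model through the OpenAI Chat Completions API; the model name, example question, and client setup are assumptions for illustration and not the exact configuration used in this study.

```python
# Minimal sketch: querying a GPT-4 model via the OpenAI Chat Completions API
# at different temperature settings (analogous to gpt-4-API-0 and gpt-4-API-1).
# The model name and question are illustrative assumptions, not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = ("Are lateral wedge insoles recommended for patients "
            "with knee osteoarthritis?")

def ask(question: str, temperature: float) -> str:
    """Send one question to the model at a given sampling temperature."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
    )
    return response.choices[0].message.content

answer_deterministic = ask(QUESTION, temperature=0)  # more focused, less random
answer_default = ask(QUESTION, temperature=1)        # more varied sampling
```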

To our knowledge, there has not yet been research exploring the impact of fine-tuning ChatGPT in clinical medicine. For other LLMs, in the study by Karan et al. 8, Med-PaLM, a version of Flan-PaLM that has been instruction prompt-tuned and is not currently publicly available, was evaluated by a panel of clinicians, who found that 92.6% of the answers generated by Med-PaLM were consistent with the scientific consensus. In our study, IO prompting was used as the input portion of the dataset for fine-tuning GPT-3.5, and the two fine-tuned models achieved consistencies of 55.3% and 45.9% when IO prompting was also used for the inputs. However, when other types of prompts were used as inputs to the fine-tuned models, performance deteriorated (22.4% to 34.1%). Furthermore, fine-tuning could not ensure that GPT-3.5 fully understood the rationale behind each piece of advice in the dataset; as a result, answers could be generated with incorrect rationales. The less-than-ideal fine-tuning results in our study might be due to the setup of the fine-tuning dataset, the capability of the base model, or the fine-tuning methods employed by OpenAI.
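As a sketch of how such a fine-tuning dataset can be structured, the OpenAI fine-tuning API accepts JSONL records in chat format; the example below assumes an IO-style question as the user message and a guideline recommendation strength as the assistant message. The file name, wording, and labels are illustrative assumptions, not the study's actual dataset.

```python
# Sketch: assembling a JSONL fine-tuning file in OpenAI chat format and
# submitting a fine-tuning job for GPT-3.5. Wording and labels are illustrative.
import json
from openai import OpenAI

examples = [
    {
        "messages": [
            {"role": "user", "content": "Lateral wedge insoles are not recommended "
                                         "for patients with knee osteoarthritis. What "
                                         "is the strength of this recommendation?"},
            {"role": "assistant", "content": "Strong"},
        ]
    },
    # ... additional guideline statements and their recommendation strengths
]

with open("oa_guideline_io_prompts.jsonl", "w") as f:
    for record in examples:
        f.write(json.dumps(record) + "\n")

client = OpenAI()
training_file = client.files.create(
    file=open("oa_guideline_io_prompts.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-3.5-turbo"
)
```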

Overall, the comparison of nine LLMs indicates that parameter settings and fine-tuning, along with prompt engineering, could influence the performance of LLMs. Improving LLMs in clinical medicine requires a combination of multiple approaches, accounting for various factors, including model architecture, parameter settings, and fine-tuning techniques.

Supplementary Table 4 briefly summarizes the current application of different types of prompts in clinical medicine. Studies on prompt engineering in clinical medicine are limited, and most primarily apply prompt engineering techniques directly31 or provide an overview of prompt engineering14,32,33 in clinical medicine. The study of Karan et al. 8 found no significant difference between the COT and few-shot prompting strategies. However, self-consistency prompting, particularly on the MedQA dataset, showed an improvement of more than 7%. Conversely, self-consistency led to a decrease in performance on the PubMedQA dataset. Wan et al. 31 demonstrated that few-shot prompting and zero-shot prompting exhibit different levels of sensitivity and specificity in converting symptom narratives using ChatGPT-4.

This study, built upon previous research, further indicated that prompt engineering can influence the performance of LLMs in clinical medicine. Based on current theories of prompt engineering, we developed a new prompting framework, ROT prompting, which demonstrated good performance on gpt-4-Web. As shown in Fig. 2, ROT prompting achieved the highest consistency rate. According to our subgroup analysis, compared with the other three types of prompts within gpt-4-Web, ROT prompting performed more evenly and prominently. At the 'strong' intensity level, ROT prompting was superior to IO prompting, and it was not significantly inferior to the other prompts at the other levels. In contrast, although the answers of P-COT prompting at the 'strong' intensity level were better than those of IO prompting, its performance at the 'limited' intensity level was significantly worse than that of the other prompts.

However, ROT prompting is not necessarily the best prompt for other LLMs. For instance, among the five versions of GPT-3.5, P-COT prompting was the best prompt for GPT-3.5-Web, whereas IO prompting was the best prompt for the other versions. For Bard, the best prompt was 0-COT. This indicates that different prompting strategies should be tried to obtain the best responses from a given model.

ROT prompting asks the LLM to return to its previous thoughts and examine whether they were appropriate, which may improve the robustness of the answers. Furthermore, the ROT-based design can minimize the occurrence of egregiously incorrect answers from gpt-4-Web. For instance, for the 'strong' level suggestion "Lateral wedge insoles are not recommended for patients with knee osteoarthritis", ROT prompting provided four 'strong' answers and one 'moderate' answer in five responses. Initially, in the 'moderate' response (Supplementary Note 1), two "experts" provided "limited" answers and one "expert" answered "moderate". After "discussion", all "experts" agreed on a 'moderate' recommendation. The final reason was that even though there was high-quality evidence to support the advice, there might still be slight potential benefits for some individuals. Notably, the reasons given by the two "experts" who answered "limited" seem more consistent with the statement "Lateral wedge insoles are recommended for patients with knee osteoarthritis." This implies that these two "experts" did not fully understand the medical advice, as "Expert C" mentioned in step five: "Observes that the results are somewhat mixed, but there's a general agreement that the benefits, if any, from lateral wedge insoles are limited." However, after the "discussion", the final revised recommendation and reason were deemed acceptable. Referring to the application of TOT in the 24-point game13, prompts designed in the style of TOT, as well as the ROT prompting in this study, can offer more possibilities at every step of the task, and the LLM can be asked to return to previous thoughts, aiming to induce it to generate more accurate answers.
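The ROT framework itself is a prompt template rather than code; the sketch below is an illustrative reconstruction of how a multi-expert, return-and-revise prompt of this kind could be assembled. The phrasing and the recommendation-strength labels are assumptions for illustration, not the exact ROT prompt used in this study.

```python
# Illustrative reconstruction of an ROT-style prompt: several simulated experts
# reason step by step, return to and re-examine earlier steps, discuss, and agree
# on a final recommendation strength. The wording is an assumption, not the
# exact ROT prompt used in this study.
ROT_TEMPLATE = """Imagine three experts answering the question below.
1. Each expert writes down one step of their reasoning and shares it with the group.
2. Before moving on, each expert returns to their previous steps and checks whether
   they are still appropriate; any step found inappropriate is revised.
3. The experts repeat this process until each reaches a recommendation strength
   (e.g., Strong, Moderate, or Limited).
4. Finally, the experts discuss their answers and agree on a single recommendation
   strength, explaining the rationale.

Question: {question}
"""

question = ("What is the strength of the recommendation that lateral wedge insoles "
            "are not recommended for patients with knee osteoarthritis?")
prompt = ROT_TEMPLATE.format(question=question)
```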

In future studies, considering the varying effectiveness of ROT prompting across different models, a potential direction might involve optimizing it for specific models. The design of ROT prompting also needs to be more closely aligned with different clinical scenarios. For instance, setting up roles with various professional backgrounds in disease diagnosis and treatment could provide more specialized advice. Additionally, incorporating different clinical application scenarios, such as testing and improving the effectiveness of ROT prompting in disease diagnosis and the formulation of patient treatment plans, will be crucial.

Three previous studies6,7,34 briefly described reliability. Yoshiyasu et al. 7 reproduced inaccurate responses only. Walker et al. 6 reported that the internal concordance of the provided information was complete (100%) according to human evaluation. In the study of Fares et al. 34, the authors repeated the experiments 3 times and extracted the responses from ChatGPT-3.5; the κ values were 0.769 for the BCSC set and 0.798 for the OphthoQuestions set.

In this study, reliability was investigated by asking the LLMs the same question five times, and the results suggest that LLMs cannot always provide consistent answers to the same medical question (Table 1 and Fig. 4). The study used the AAOS strength of recommendation as the evaluation standard and found that LLMs often assigned different strengths to the same advice across multiple answers. Only IO prompting in gpt-3.5-API-0 and gpt-3.5-ft-0, both of which were set at a temperature of 0, demonstrated perfect reliability.

Based on the description on the official OpenAI website regarding the Audio endpoint (https://platform.openai.com/docs/api-reference/audio/createTranscription), "The sampling temperature, between 0 and 1, affects randomness. Higher values, such as 0.8, increase randomness, while lower values, such as 0.2, make outputs more focused and deterministic. A setting of 0 allows the model to automatically adjust the temperature based on log probability until certain thresholds are met." We hypothesize that this mechanism also applies to the Chat endpoint (https://platform.openai.com/docs/api-reference/chat/object), although this is not explicitly stated in the corresponding section. The specific thresholds for GPT-3.5 and GPT-4 might differ, and the prompts could influence these thresholds, as consistent responses were observed only in the two groups corresponding to IO prompting in gpt-3.5-API-0 and gpt-3.5-ft-0. Therefore, it is recommended that LLMs be asked the same question several times to obtain more comprehensive answers, and that users keep asking ChatGPT-4 the same question until it provides no additional information.
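As a minimal sketch of this recommendation, repeated queries can be scripted and the agreement among the extracted recommendation strengths inspected. The model name, question, and the naive keyword-based extraction below are assumptions for illustration, not the study's evaluation procedure.

```python
# Sketch: ask the same question several times and check whether the extracted
# recommendation strengths agree across responses.
from collections import Counter
from openai import OpenAI

client = OpenAI()
QUESTION = ("What is the strength of the recommendation that lateral wedge insoles "
            "are not recommended for patients with knee osteoarthritis?")

def extract_strength(text: str) -> str:
    """Very naive extraction of a recommendation strength from a free-text answer."""
    lowered = text.lower()
    for label in ("strong", "moderate", "limited"):
        if label in lowered:
            return label
    return "unclear"

answers = []
for _ in range(5):  # the study queried each model five times per question
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": QUESTION}],
        temperature=1,
    )
    answers.append(extract_strength(response.choices[0].message.content))

counts = Counter(answers)
print(counts)            # e.g., Counter({'strong': 4, 'moderate': 1})
print(len(counts) == 1)  # True only if all five answers agree
```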

In future research on the clinical application of LLMs, particularly from the patient's perspective, OA is a common, frequently occurring condition with various treatment methods. Hence, prompt engineering could play a crucial role in guiding patients to ask medical questions correctly, potentially enhancing patient education and answering their queries more effectively. From the doctor's perspective, our study demonstrated that ROT prompting, developed for the web version of GPT-4, generated better results. However, multiple variables, such as different model architectures and parameters, can complicate outcomes. Therefore, we believe that prompt engineering should be combined with model development, parameter adjustment, and fine-tuning techniques to develop specialized LLMs with medical expertise, which could assist physicians in making clinical decisions.

The application of prompt engineering faces several challenges in the future. First, there is the issue of the robustness of prompts. Prompts based on the same framework may yield different answers due to minor changes in a few words35. Patients or doctors might receive different answers even when using prompts from the same framework. Second, prompt engineering performance depends on the inherent capabilities of the LLM itself. Prompts effective for one model may not be suitable for another. Guidelines for prompt engineering tailored for patients and doctors need to be developed according to the corresponding requirements. Overall, future related studies should examine the applicability and robustness of prompts and formulate relevant guidelines.

Importantly, our research does not include real-time interactions or validation with healthcare professionals or patients. However, our approach to data collection does not rely on subjective human scoring; instead, it objectively assesses the consistency and reliability of LLM responses. Furthermore, the study was designed based on expected answers derived from guidelines and lacked prospective validation. Nevertheless, we acknowledge that this field remains underexplored and that a multitude of techniques warrant further investigation. Our study represents only a preliminary foray into this vast domain.

Given these limitations, future research should aim to develop both an objective benchmark evaluation framework for LLM responses and a human evaluation framework8 involving healthcare professionals and patients.

Our work represents an initial step into this expansive domain, highlighting the importance of continuing research to refine and enhance the application of large language models in healthcare. Future studies should further explore various methodologies to improve the effectiveness and reliability of LLMs in medical settings.

This study revealed that different prompts had variable effects across various models, and gpt-4-Web with ROT prompting had the highest consistency. An appropriate prompt may improve the accuracy of responses to professional medical questions. Moreover, it is advisable to pose the input questions multiple times to gather more comprehensive insights, as responses may vary with each inquiry. In the future of AI healthcare involving LLMs, prompt engineering will serve as a crucial bridge in communication between LLMs and patients, as well as between LLMs and doctors.
