From Prompt Engineering to Prompt Science with Humans in the Loop

In recent years, as the sophistication and capabilities of large language models (LLMs) have grown, so have the tasks for which they’re applicable, going beyond information extraction and synthesis15 to include analysis, content creation, and reasoning.8 Unsurprisingly, many researchers find them useful for research tasks, such as identifying relevant papers,19 synthesizing literature reviews,3 writing proposals,11 and analyzing data.31 They have also been found effective for investigative tasks, such as drug discovery.35 There is growing concern, however, that a large portion of this success hinges on prompt engineering, which is often an ad-hoc process of revising the prompts fed into an LLM until it produces the desired response or analysis.24

Key Insights

  • LLMs are increasingly being used in scientific research, but their application often involves ad-hoc decisions that can impact research quality. Forcing LLMs to produce desired results through prompt adjustments in an unscientific manner may lead to unverified and harmful outcomes.

  • We could use humans in this process, but overreliance on them may defeat the purpose of using LLMs as a strategic tool. A new method shows how humans and LLMs can work in collaboration, bringing verifiability and trustworthiness as well as objectivity and scalability in data analysis.

  • The method, demonstrated here through two applications, includes rigorous labeling, deliberation, and documentation steps to reduce subjectivity.

Unfortunately, using prompt engineering for scientific research is dangerous. At best, there is the possibility of creating a feedback loop with a self-fulfilling prophecy. At worst, researchers may be overfitting hypotheses to data, leading to untrustworthy claims and findings. In most cases, overreliance on prompt engineering can lead to unexplainable, unverifiable, and less generalizable outcomes,6,25,27,37 lacking a scientific approach.

We consider a scientific approach to be one that is well-documented and transparent, verifiable, repeatable, and free of individual biases and subjectivity. Most prompt-engineering techniques violate at least one of these principles. While such ad-hocness can be bad in most situations, it is especially problematic when it comes to scientific research, which demands a certain level of rigor. Thus, if we want to continue harnessing the power of LLMs, we need to do so in a responsible manner and with enough scientific rigor.

Here, we focus on research areas and applications in which an LLM is used for labeling text data, such as identifying user tasks or intents, extracting sentiments, and deriving issues from user reports. Building on an LLM’s ability to summarize or synthesize data, many recent research projects have deployed them for these tasks.24,31 There are also efforts to use LLMs for generating datasets and benchmarks,20,34 which could be foundations for future research. Therefore, it is paramount that outcomes from LLMs are thoroughly validated and proven trustworthy.

In this article, we demonstrate how to do this by creating a verifiable and replicable method of prompt science that aids in prompt engineering. Specifically, we take inspiration from a rich literature and methods of qualitative coding to employ a human in the loop for 1) verifying responses from LLMs and 2) systematically constructing prompts. While not a universal solution for all prompt-engineering situations, this approach is nonetheless applicable to many scenarios involving human- or machine-generated annotations.

Prompt Engineering Prospects and Problems

LLMs are built on a transformer architecture that contains encoder and decoder modules, and are trained on large corpora of text data. What an LLM generates depends heavily on the inner workings of this architecture as well as the training data. Most users and researchers do not have the ability to directly change that architecture or the training process. They can, however, apply various techniques such as fine-tuning,10,12 prompt engineering,33 and retrieval-augmented generation (RAG)39 to shape the outputs as they desire.

Figure 1 shows the conceptual flow of a typical prompt revision or engineering process in a research setting. Here, a researcher is interested in getting some input data analyzed or labeled for a downstream application. To get the desired outcomes, the set of instructions—the prompt provided to the LLM—is iteratively revised, an approach used in many recent works.8,16,30,40 While this can be effective, it leaves us with many potential issues. For instance, if the prompt is revised by a single researcher in an ad-hoc manner just to fine-tune the output, personal biases and subjectivity may be introduced. If this process is not well-documented, it may be difficult to understand or mitigate these issues. This may also make it more difficult for someone to verify or replicate the process. These issues may hinder scientific progress that relies on systematic knowledge being generated, questioned, and shared. There may also be issues of feedback loops and self-fulfilling prophecies, which are especially dangerous when we are dealing with highly opaque systems such as LLMs.

Figure 1.  A typical flow of prompt engineering for a research process that involves labeling data.

When using LLMs for scientific research, we need some level of reliability, objectivity, and transparency in the process. Specifically, there are three problems we need to address: 1) individual bias and subjectivity, 2) vague criteria for assessment, or bending the criteria to match the LLM’s abilities, and 3) a narrow focus on a specific application instead of a general phenomenon. To address these issues, we propose the following steps:

  • Ensure at least two qualified researchers are involved in the whole process.

  • Before focusing on revising the prompt to produce desirable outcomes, clearly and objectively articulate what a desirable outcome would be. This will involve establishing a reliable and, if possible, community-validated metric for assessment.

  • Discuss individual differences and biases with the goal of rooting them out. This will help create assessments and a set of instructions that can be understood and carried out by any researcher with a reasonable set of skills so they can replicate the experiments and achieve similar results.

Though a few recent works have shown promise in addressing these desiderata, they still lack enough rigor to meet the requirements of transparency, objectivity, and replicability. For instance, Shah et al.7 use a verification method for prompt generation, but their approach lacks verification of the criteria as well as replicability across different datasets.

Methodology

In the previous section, we described how a typical prompt-engineering process works; essentially, it involves revising the prompt to an LLM in order to fit the desired response and/or downstream application. We identified that one of the problems with this endeavor is that it is quite ad-hoc, unreliable, and not generalizable. Even with enough documentation of how the final prompt was generated, one cannot ensure that the same process will yield the same quality output in a different situation—be it with a different LLM or the same LLM at a different time. To overcome these problems, in this section we describe an approach that first iterates over the criteria for assessing an LLM’s response and then iterates over the prompt in a systematic way, using a human in the loop. Our method is inspired by a typical qualitative coding process.

Qualitative coding with multiple assessors.  The problem we are trying to address is that of constructing a prompt or a set of instructions that produces desired outcomes for unseen data. This is similar to building a codebook in qualitative research. Take, for instance, work described in Mitsui et al.17 Here, the authors were attempting to label user behaviors (data) with user intents (discrete labels). For this, they needed to identify a finite (and preferably small) set of labels and their desired descriptions. This is called a codebook.

They began with an initial codebook inspired by prior work and literature. They then needed to fine-tune and validate that codebook for the given application, which they did by following the qualitative coding process depicted in Figures 2 and 3.

Figure 2.  Training portion of qualitative coding.

Figure 3.  Testing portion of qualitative coding.

As shown in the figures, in a typical qualitative coding process, two researchers who are sufficiently familiar with the task use the initial codebook and a small set of data to independently provide labels during the training portion. The process continues until there is good enough agreement, or inter-coder reliability (ICR), at which point the codebook is considered ready. It can then be used by the same or other researchers to independently label non-overlapping portions of unseen data during the testing portion.

It is important to emphasize the value of iterating with a human in the loop. During this process, it is not just the codebook that is getting revised, but also the understanding of the two researchers. Through their discussions of disagreements, they are able to think through the task, the data, and the best ways to describe that data. This enhanced understanding impacts the codebook and vice versa. In other words, this process is used to both produce the desired outcome and help the researchers learn and discover. But this is not just simple brainstorming. The researchers are discussing their differences and charting their way forward bound by a rigorous set of criteria. For instance, in this example, the researchers were aiming to produce a codebook with categories that were clear, concise, comprehensive, and reliable. They had specific definitions of these criteria, which they used during their deliberations. This qualitative coding process ensures that individual subjectivity is dissolved to the extent possible while producing a coding scheme that is robust, verifiable, and explainable.

Codebook building as a way to construct prompts.  We will now describe a new method for prompt generation inspired by the qualitative codebook-building process discussed above. Once again, our objective is to generate a prompt for a given LLM in order to achieve a desired response. We also need this prompt to be systematically created with enough generalizability and explainability to be useful for the same or similar experiments by other researchers, by other LLMs, and at different times. We will describe the abstract process in this section and its operationalization in the next section. To begin, we list two primary desiderata:

  • The desired response from the LLM must be verified along a set of dimensions and for a large enough sample size by at least two humans.

  • The prompt must consistently generate the desired responses and must be justifiable, explainable, and verifiable.

To operationalize these two criteria, we propose a multi-phase process, working backward in the pipeline from prompt to application, as shown in Figure 4. This method has four phases.

Figure 4.  Proposed methodology with multi-phase approach to response and prompt validations.

Phase 1: Set up the initial pipeline.  This can be done using any reasonable prompt. For instance, the initial prompt could be a simple and direct instruction to generate the desired response.
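As a rough illustration, the Phase 1 pipeline can be little more than a direct instruction applied to each input. The sketch below assumes a hypothetical call_llm wrapper standing in for whatever LLM client a project uses; the prompt text and function names are illustrative and not taken from the article’s experiments.

    # Phase 1: a deliberately simple initial pipeline (illustrative sketch).
    # `call_llm` is a hypothetical wrapper around whatever LLM client is in use;
    # it takes a prompt string and returns the model's text response.
    from typing import Callable, List

    def initial_pipeline(inputs: List[str], call_llm: Callable[[str], str]) -> List[str]:
        """Label each input using a direct, unrefined instruction (the initial prompt)."""
        initial_prompt = (
            "Read the following user query and assign the single intent label "
            "that best describes it. Respond with the label only.\n\nQuery: {query}"
        )
        return [call_llm(initial_prompt.format(query=text)) for text in inputs]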

Phase 2: Identify appropriate criteria to evaluate the responses.  Before we can tweak the prompt, it is important to establish that we know how to evaluate the response. Establishing the criteria for assessing the outputs outside of the prompt-tuning process ensures the researchers are focused on the task and not the LLM’s capabilities. A set of appropriate criteria could be obtained from prior work or literature, but these criteria still need to be tested and validated for the given application. Moreover, eventually we will need humans to assess whether the LLM is generating desired responses in order to revise the prompt. For these reasons, we need to execute our first human-in-the-loop assessment, involving the following steps:

  • Run the existing pipeline to generate a reasonable number of responses. Give each assessor these responses, the existing set of criteria (codebook), and sufficient details about the application. The assessors should review these outcomes and instructions before starting their assessment. This can be supervised by the lead researcher or by an expert. Note that due to the interactive learning and collaboration that takes place among the human assessors (described in the previous section), they do not need special qualifications other than being able to understand the task at hand and discuss their disagreements.

  • Each assessor applies the criteria to the given responses independently.

  • Compare the assessors’ outcomes and compute the level of agreement. This could be as simple as measuring the percentage of times they agree, or using ICR with an appropriate method, such as Cohen’s kappa36 or Krippendorff’s alpha13 (a minimal sketch of this computation appears after this list).

  • If the amount of agreement is not sufficient (this could vary from application to application), bring the assessors together to discuss their disagreement. This is a very important step, as it not only offers an opportunity to remove individual biases and subjectivity, but also influences the researchers’ collective understanding of the task and the criteria for evaluation. The discussion could result in both a resolution of the conflict and changes to the codebook, which could be about the number or the description of the criteria for assessment. Then, a new set of responses is generated for assessment using the revised codebook, and the above process is repeated.

  • Once a sufficient level of agreement is achieved, the codebook for assessing the responses is finalized. At this point, the assessors may still choose to make minor changes to the codebook to enhance its readability, with the idea that anyone who did not participate in this development process can still understand and use it in much the same way the assessors would.
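To make the agreement computation concrete, the sketch below reports simple percentage agreement and Cohen’s kappa for two assessors using scikit-learn. It assumes each assessor’s labels are recorded as lists aligned by response index, which is one plausible bookkeeping choice rather than something the method prescribes.

    # Phase 2: quantify agreement between two assessors over the same responses.
    from sklearn.metrics import cohen_kappa_score

    def percent_agreement(labels_a, labels_b):
        """Fraction of responses on which the two assessors chose the same label."""
        assert len(labels_a) == len(labels_b)
        return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

    def agreement_report(labels_a, labels_b):
        return {
            "percent_agreement": percent_agreement(labels_a, labels_b),
            "cohens_kappa": cohen_kappa_score(labels_a, labels_b),
        }

    # Hypothetical example: two assessors applying a draft codebook to ten responses.
    a = ["clear", "clear", "vague", "clear", "vague", "clear", "clear", "vague", "clear", "clear"]
    b = ["clear", "vague", "vague", "clear", "vague", "clear", "clear", "clear", "clear", "clear"]
    print(agreement_report(a, b))  # insufficient agreement would trigger a deliberation round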

Phase 3: Iteratively develop the prompt.  Continuing to work backward, we now come to the next phase of codebook development, which happens to be the first phase of the pipeline depicted in Figure 4. The objective here is to see if the LLM, with the current version of the prompt, is meeting our criteria for the responses. If not, we need to figure out what changes can be made to the prompt.

At this point, we know what makes a good response from our LLM. More importantly, we have now trained at least two researchers to evaluate these responses reasonably, objectively, consistently, and independently. They can also more effectively assess which instructions help produce the desired responses. To iteratively develop the prompt, we execute the following steps (sketched in code after the list):

  • Run a reasonable number of data inputs using the current prompt.

  • Give the generated responses to the human assessors, ideally the same ones who were involved in the previous phase. These assessors should use the codebook produced in the previous phase to assess the responses.

  • Have them assess whether the generated responses meet the criteria for the application in question. Compare the assessments by different assessors to find their level of agreement. If that agreement is insufficient, have the assessors discuss their disagreements, possibly under the supervision of a senior researcher or an expert, and revise the codebook as necessary. Here, the codebook is the prompt fed to the LLM. Repeat the above process.

  • Once there is enough agreement among the assessors that the prompt meets the objective of generating desired responses above some threshold, the process is finished. The assessors or the supervising researcher may still make minor adjustments to the prompt for improved readability, interpretability, and generalization.
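The loop in this phase can be summarized roughly as follows. The sketch reuses the hypothetical call_llm wrapper from the Phase 1 sketch; collect_assessments and revise_prompt_via_deliberation stand in for the manual steps in which the two assessors independently apply the Phase 2 codebook and then deliberate, and the threshold and round limit are illustrative values only.

    # Phase 3: revise the prompt until assessors agree it yields acceptable responses.
    # `call_llm`, `collect_assessments`, and `revise_prompt_via_deliberation` are
    # hypothetical stand-ins for the LLM client and the human-in-the-loop steps.
    from sklearn.metrics import cohen_kappa_score

    def develop_prompt(prompt, inputs, call_llm, collect_assessments,
                       revise_prompt_via_deliberation, threshold=0.8, max_rounds=10):
        for _ in range(max_rounds):
            responses = [call_llm(prompt.format(query=x)) for x in inputs]
            # Each assessor independently judges the responses against the Phase 2
            # codebook; returns the two label lists and the fraction judged acceptable.
            labels_a, labels_b, acceptable_fraction = collect_assessments(responses)
            kappa = cohen_kappa_score(labels_a, labels_b)
            if kappa >= threshold and acceptable_fraction >= threshold:
                return prompt  # assessors agree and the responses meet the criteria
            # Otherwise the assessors deliberate and the prompt (the LLM's "codebook") is revised.
            prompt = revise_prompt_via_deliberation(prompt, responses, labels_a, labels_b)
        raise RuntimeError("No satisfactory prompt found within the allotted rounds")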

Phase 4: Validate the whole pipeline.  As an optional final validation phase, run through the whole pipeline using a portion of the test data and evaluate random samples of the responses to ensure the entire process still yields quality results that can be independently and objectively validated. For this, it is ideal to use a different set of assessors and have them independently label that same set of randomly sampled responses. Compute their ICR on this sample to see whether there is good enough agreement among them on labeling, and whether the generated responses are meeting the desired criteria. If either of these objectives are not met, take appropriate corrective actions based on phases 2 and 3.
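As a sketch of this optional check, one might sample the pipeline’s responses at random, hand the same sample to a fresh pair of assessors, and confirm both their agreement and the fraction of responses meeting the criteria; the helper names and thresholds below are illustrative assumptions.

    # Phase 4: spot-check the finished pipeline on a random sample of test data.
    import random
    from sklearn.metrics import cohen_kappa_score

    def validate_pipeline(test_inputs, final_prompt, call_llm, fresh_assessors,
                          sample_size=50, kappa_floor=0.7, quality_floor=0.8):
        sample = random.sample(test_inputs, min(sample_size, len(test_inputs)))
        responses = [call_llm(final_prompt.format(query=x)) for x in sample]
        # Assessors who did not build the codebook label the same responses independently.
        labels_a, labels_b, acceptable_fraction = fresh_assessors(responses)
        kappa = cohen_kappa_score(labels_a, labels_b)
        passed = kappa >= kappa_floor and acceptable_fraction >= quality_floor
        return passed, {"kappa": kappa, "acceptable_fraction": acceptable_fraction}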

As one can see, there are multiple checkpoints and validations in this method that remove ad-hocness, subjectivity, and opaqueness as much as possible. Proper documentation of its execution will yield more informative, meaningful, and scientific communication that can help other researchers trust the outputs as well as test and build these pipelines themselves.

Demonstrations

To demonstrate the methodology explained here, we will now walk through its execution for two specific research applications, resulting from collaborations with two different sets of researchers.

Identifying user intents.  One of the common problems in the areas of information retrieval and recommender systems is that of understanding user intents.38 Doing so requires having a taxonomy of user intents and then using this taxonomy to label user queries, questions, or behaviors with appropriate intents. However, as shown by Shah et al.,28 it is very challenging to construct and use these intent taxonomies. Adopting an existing taxonomy could help with construction, but it may not be sufficiently comprehensive or accurate to be applicable. Conversely, creating a customized taxonomy for a given application could help with applicability, but would incur a high cost.

LLMs could help in constructing and applying taxonomies, but are faced with the same challenges described in the first section, namely, a lack of reliability and validation, the danger of creating feedback loops, and issues with replicability and generalizability. To overcome these issues, the author, along with a set of collaborators, applied the method proposed here. The following are the specific steps we took based on the pipeline presented in Figure 4 and described in the previous section.

Phase 1: Set up the initial pipeline.  For initial testing, we set up a pipeline that used an initial prompt and a small sample of log data from Bing Chat to create a user-intent taxonomy (a zero-shot approach).

Phase 2: Identify appropriate criteria to evaluate the responses.  We identified five criteria—comprehensiveness, consistency, clarity, accuracy, and conciseness—for evaluating the quality of a user-intent taxonomy using prior work21 as well as human assessments, shown in Figure 4.

Phase 3: Iteratively develop the prompt.  We first ran the zero-shot approach to obtain the initial taxonomy. Two researchers familiar with the task of creating and using a user-intent taxonomy assessed the generated taxonomy against those five criteria. Appropriate revisions were made to the original prompt, and a new sample of data was passed through the LLM (here, GPT4) to create a new version of the taxonomy. The researchers repeated the process until they obtained good agreement (ICR9).

Phase 4: Validate the whole pipeline.  Once the taxonomy was finalized (see details in Shah et al.29), we proceeded with the testing phase, in which the LLM was given new data and asked to use the human-validated taxonomy to label user intents. A set of two different researchers independently assigned their own labels using the same taxonomy.

We then compared the labels among the human annotators, as well as between the humans and the LLM, finding high enough ICR in all cases. We repeated this process with two other LLMs—Mistral and Hermes, both open source—and found that their labeling also agreed strongly with the labels obtained from GPT4, as well as with those from the human annotators. This indicated that both the taxonomy and the process for using that taxonomy through an LLM were robust and stable. At this point, we not only had a reasonably reliable and robust pipeline for generating and using a user-intent taxonomy, but that pipeline had also been validated through a rigorous process with a human in the loop, making it considerably more trustworthy and generalizable. The details of this work, along with results and their discussions, can be found in Shah et al.29
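As an illustration of this comparison, pairwise agreement between every pair of label sources (the two human annotators and each LLM) can be tabulated as below; the label values and example data are hypothetical placeholders, not the actual experimental labels.

    # Compare labels from multiple sources (human annotators and different LLMs)
    # by computing pairwise Cohen's kappa over the same set of items.
    from itertools import combinations
    from sklearn.metrics import cohen_kappa_score

    def pairwise_agreement(labels_by_source):
        """labels_by_source maps a source name to its list of labels, aligned by item."""
        return {
            (a, b): cohen_kappa_score(labels_by_source[a], labels_by_source[b])
            for a, b in combinations(labels_by_source, 2)
        }

    # Hypothetical example with two human annotators and three LLMs:
    labels = {
        "human_1": ["learn", "find", "learn", "compare", "find"],
        "human_2": ["learn", "find", "learn", "find", "find"],
        "gpt4":    ["learn", "find", "learn", "compare", "find"],
        "mistral": ["learn", "find", "compare", "compare", "find"],
        "hermes":  ["learn", "find", "learn", "compare", "find"],
    }
    print(pairwise_agreement(labels))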

Auditing an LLM.  We now turn to a different type of problem involving LLMs: auditing. While there have been many recent efforts in this area,18,22 we still lack systematic and scientifically rigorous approaches. We attempted to address this using a simple but effective method: posing the same question to the given LLM in different ways to see how it responds. Different versions of the same question could be generated by humans, but that would hinder scalability. Using an LLM (ideally, one other than the one being audited) could help (as shown in Rastogi et al.22) but, as before, we are faced with the challenge of ensuring that the generation of those probes is reliable, robust, and generalizable. To meet these challenges, we again ran through the method proposed in the codebook-building section above.

Phase 1: Set up the initial pipeline.  The initial prompt for the LLM was “Generate five different questions for the following question that represent the same meaning, but are different.” We used the Mistral LLM and the data we used for the original questions came from the TruthfulQA dataset.14 Given the application—generating different versions of the same question for auditing an LLM—we set the criteria for responses generated by Mistral as relevance and diversity.
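To illustrate, the probe-generation step can be sketched as below. call_llm is again a hypothetical wrapper (the experiments described here used Mistral); appending the original question to the stated instruction, and parsing the reply one probe per line, are simplifying assumptions rather than details from the article.

    # Auditing, Phase 1: generate five rephrasings ("probes") of each original question.
    # `call_llm` is a hypothetical wrapper around the probe-generating LLM.
    PROBE_PROMPT = (
        "Generate five different questions for the following question that "
        "represent the same meaning, but are different.\n\nQuestion: {question}"
    )

    def generate_probes(questions, call_llm):
        probes = {}
        for q in questions:
            reply = call_llm(PROBE_PROMPT.format(question=q))
            # Simplifying assumption: the model returns one probe per line.
            probes[q] = [line.lstrip("-0123456789. ").strip()
                         for line in reply.splitlines() if line.strip()]
        return probes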

Phase 2: Identify appropriate criteria to evaluate the responses.  We used two researchers who understood the application well. Using the existing literature and the task at hand, we determined that a good set of probes would have two important characteristics: relevance, or the semantic similarity of each probe to the original question, and adequate diversity among the probes. As the two researchers started independently assessing the generated probes from a small set of data (10 original questions that resulted in 50 probes) and then comparing their labels with each other, two key things happened: They started firming up the definitions of relevance and diversity, and their assessments started to agree more often. We had to go through three rounds of this exercise to finalize our definitions of the two criteria, as well as how to label them.

Phase 3: Iteratively develop the prompt.  We wanted to have a prompt that consistently generates outputs (a set of questions) that could achieve high enough relevance and diversity for their use in a downstream application. We set the threshold for this at 75%. The same two researchers marked the generated responses for relevance and diversity using a new set of input data. The initial round of assessment showed that only about 50% of the outcomes met the goal of relevance (medium or high) and diversity. To change this, the original prompt was modified to include more details about the application and the criteria. This process was then repeated. After two more iterations, the LLM was generating outcomes that were relevant and diverse for at least 80% of the cases.

Phase 4: Validate the whole pipeline.  A senior researcher tested the whole pipeline with the final versions of the prompt and the criteria with a few rounds of random samplings. They made a few minor adjustments to the prompt for improved readability and generalizability. The details of these experiments, along with a demo, can be found in Amirizaniani et al.1,2

Conclusion

LLMs could simply be stochastic string-generation tools5 that are effective at predicting next tokens. But for many researchers, they are proving to be useful in many research problems that involve labeling or classifying input data based on some analysis or reasoning. However, using LLMs blindly for these research tasks, just because their outputs seem reasonable, can be dangerous. It can perpetuate biases,26 create fabricated or hallucinated responses,23 and provide unverifiable results.32 All of these are harmful for scientific progress.

Creatively or iteratively designing prompts to produce desirable outcomes—what is often referred to as prompt engineering—does not address these concerns. In fact, prompt engineering may even worsen the problems by focusing more on bending the process to generate the desired outcomes than on developing a verifiable, validated, and replicable process. In this article, we showed how prompt engineering can be turned into or aided by prompt science by introducing a multi-phase process with a human in the loop. Specifically, we divided the prompt-to-application pipeline into two major phases that provide two levels of verification: one for the criteria for evaluating the responses generated by the LLM, and the other for the prompt used to generate the responses. Through a method inspired by qualitative coding for codebook development, we showed how to bring scientific rigor to developing a reliable prompt and getting a trustworthy response for a downstream application. We demonstrated this method in two different applications: generating and using a user-intent taxonomy and auditing an LLM.

We believe there are three main reasons this method provides scientific rigor:

  • It offers a systematic and scientifically verifiable way of producing prompt templates, while potentially reducing individual subjectivity and biases. Note that the approach itself does not guarantee removal of individual biases. Rather, it promotes counteracting them by engaging multiple researchers in a well-defined process with objective criteria. This is a known benefit of the qualitative coding process.4

  • Through the involvement of multiple researchers documenting their deliberations and decisions, it fosters openness and replicability. This documentation can and should be made available, much like the code for an open source tool, allowing other researchers to validate, replicate, and extend the work. Engaging multiple researchers in constructing and applying objectively defined labels also helps break the feedback loop or self-fulfilling prophecy in which a single researcher tweaks the prompt to an LLM to achieve desired results.

  • In addition to producing a prompt template that can generate desired outcomes reliably and consistently, the process also adds to our knowledge about the problem at hand (application), how to best assess what we need for that problem, and how we can consistently and objectively analyze the given data.

This rigorous process comes at a cost. Comparing the typical process of prompt engineering (Figure 1) with the proposed method of prompt science (Figure 4), there is at least a threefold cost increase. Guided deliberations and detailed documentation can also add to the cost. Finally, the optional phase 4 adds a lightweight but more continual cost to the whole pipeline to maintain its quality and validity. We can compare this to an assembly line that is run primarily through automated processes with some human supervision. Depending on the risk of bad quality and the cost of the manual quality check, one could decide how often and how rigorously humans should be involved. This then allows us to create more automated systems with enough control and quality assurance—going from copilot to autopilot mode with LLMs. The key, as argued and demonstrated here, is to do so with enough human agency and scientific rigor.

Even if the goal is not to create a fully automated pipeline, considering that the proposed method not only leads to outputs with higher quality, consistency, and reliability, but also an increased understanding of the task and the data, we believe the added cost is justifiable. The broader scientific community could also benefit from exercising and perhaps insisting on such rigor so we can leverage LLMs ethically and responsibly in our research.

Acknowledgments

The taxonomy generation and usage experiments described in the “Identifying user intents” section are results of collaboration with Ryen White and other researchers from Microsoft, as listed in Shah et al.29 The LLM auditing experiments described in the “Auditing an LLM” section are results of hard work by Maryam Amirizaniani, Adrian Lavergne, Elizabeth Okada, and Jihan Yao. We are also thankful to Aman Chadha and Tanya Roosta for their valuable comments and guidance during this process.
