What is chain-of-thought prompting?
Chain-of-thought prompting is a prompt engineering technique that aims to improve language models’ performance on tasks requiring logic, calculation and decision-making by structuring the input prompt in a way that mimics human reasoning.
To construct a chain-of-thought prompt, a user typically appends an instruction such as “Describe your reasoning in steps” or “Explain your answer step by step” to the end of their query to a large language model (LLM). In essence, this prompting technique asks the LLM not only to generate a final answer, but also to detail the series of intermediate steps that led to it.
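To make this concrete, the short Python sketch below shows one way to append such an instruction to a query before sending it to a model. The `build_cot_prompt` and `ask_llm` functions are hypothetical placeholders rather than part of any particular library.

```python
# Minimal sketch of turning a plain query into a chain-of-thought prompt.
# `build_cot_prompt` and `ask_llm` are hypothetical placeholders, not part of any particular library.

def build_cot_prompt(question: str) -> str:
    """Append a step-by-step instruction to the end of the user's query."""
    return f"{question}\n\nExplain your answer step by step."

def ask_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    raise NotImplementedError("Replace with your model client of choice.")

question = (
    "John has one pizza, cut into eight equal slices. "
    "John eats three slices, and his friend eats two slices. How many slices are left?"
)
print(build_cot_prompt(question))
# answer = ask_llm(build_cot_prompt(question))  # would return the model's step-by-step response
```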
Guiding the model to articulate these intermediate steps has shown promising results. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” a seminal paper by the Google Brain research team presented at the 2022 NeurIPS conference, found that chain-of-thought prompting outperformed standard prompting techniques on a range of arithmetic, common-sense and symbolic reasoning benchmarks.
How does chain-of-thought prompting work?
Chain-of-thought prompting is effective because it takes advantage of LLMs’ strengths, such as their sophisticated ability to generate fluent language, while simulating successful techniques from human cognition, such as planning and sequential reasoning.
When human beings are confronted with a challenging problem, we often break it down into smaller, more manageable pieces. For example, solving a complex math equation typically involves a number of sub-steps, each of which is essential to arriving at the final correct answer. Chain-of-thought prompting asks an LLM to mimic this process of decomposing a problem and working through it step by step — essentially, asking the model to “think out loud,” rather than simply providing a direct solution.
The screenshot below shows an example of chain-of-thought prompting. The user presents Chat Generative Pre-trained Transformer (ChatGPT) with a classic river-crossing logic puzzle, adding the phrase “Describe your reasoning step by step” at the end of the prompt. When the chatbot responds, it works through the problem sequentially, describing each crossing leading up to the final solution.
GPT-4 provides a step-by-step solution to a logic puzzle in response to a chain-of-thought prompt.
The following are some other examples of chain-of-thought prompts:
- “John has one pizza, cut into eight equal slices. John eats three slices, and his friend eats two slices. How many slices are left? Explain your reasoning step by step.”
- “Alice left a glass of water outside overnight when the temperature was below freezing. The next morning, she found the glass cracked. Explain step by step why the glass cracked.”
- “If all roses are flowers, and some flowers fade quickly, can we conclude that some roses fade quickly? Explain your reasoning in steps.”
- “A classroom has two blue chairs for every three red chairs. If there are a total of 30 chairs in the classroom, how many blue chairs are there? Describe your reasoning step by step.”
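To make the last of these prompts concrete, the short calculation below mirrors the chain of intermediate steps the model is being asked to spell out; the arithmetic is shown purely for illustration and is not model output.

```python
# Worked version of the chair-ratio prompt above: 2 blue chairs for every 3 red chairs, 30 chairs total.
blue_per_group = 2
red_per_group = 3
total_chairs = 30

chairs_per_group = blue_per_group + red_per_group   # Step 1: each group holds 2 + 3 = 5 chairs
num_groups = total_chairs // chairs_per_group       # Step 2: 30 / 5 = 6 groups
blue_chairs = num_groups * blue_per_group           # Step 3: 6 groups x 2 blue chairs = 12

print(blue_chairs)  # 12
```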
Advantages of chain-of-thought prompting
LLMs can only take in a limited amount of information at one time. Breaking down complex problems into simpler sub-tasks helps mitigate this issue by enabling LLMs to process those smaller components individually, leading to better accuracy and precision in model responses.
Chain-of-thought prompting also takes advantage of LLMs’ extensive pool of general knowledge. LLMs are exposed to a wide array of explanations, definitions and problem-solving examples in the course of their training on vast textual data sets, encompassing books, articles and much of the open internet. Chain-of-thought prompts tap into this reservoir of stored knowledge by triggering the model to call on and apply relevant information.
The technique also directly targets a common limitation of LLMs — difficulty with logical reasoning. Although LLMs excel at generating coherent, relevant text, they were not primarily designed to provide factual information or solve problems. Consequently, they often struggle with reasoning and logic, especially for more complex problems.
Chain-of-thought prompting addresses this limitation by guiding the model to take a structured reasoning approach. By explicitly directing the model to construct a logical pathway from the original query or problem statement to the final solution, chain-of-thought prompting helps reduce the likelihood of logical missteps and oversights.
Finally, chain-of-thought prompting can assist with model debugging and improvement by making the process by which a model arrives at its answer more transparent. Because chain-of-thought prompts ask the model to explicitly delineate its reasoning process, they give model testers and developers better insight into how the model reached a particular conclusion. This, in turn, can make it easier to identify and correct errors when refining the model.
In future work, combining chain-of-thought prompting with fine-tuning could enhance LLMs’ reasoning capabilities. For example, fine-tuning a model on a training data set containing curated examples of step-by-step reasoning and logical deduction could further improve the effectiveness of chain-of-thought prompting.
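For instance, such a data set might pair questions with worked, step-by-step solutions. The record below is a hypothetical illustration of what one training example could look like; the field names are assumptions, not a required schema.

```python
# Hypothetical fine-tuning record pairing a question with explicit step-by-step reasoning.
# The field names ("prompt", "completion") are illustrative assumptions, not a required schema.
training_example = {
    "prompt": (
        "If all roses are flowers, and some flowers fade quickly, "
        "can we conclude that some roses fade quickly? Explain your reasoning in steps."
    ),
    "completion": (
        "Step 1: All roses are flowers, so roses are a subset of flowers. "
        "Step 2: 'Some flowers fade quickly' does not say which flowers those are. "
        "Step 3: The quickly fading flowers could all be non-roses. "
        "Conclusion: No, we cannot conclude that some roses fade quickly."
    ),
}
```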
Limitations of chain-of-thought prompting
Importantly, as the Google research team highlighted in their aforementioned paper, the semblance of reasoning that chain-of-thought prompts elicit from LLMs does not mean that the model is actually thinking. When using an LLM, it’s essential to remember that the model is a neural network trained to predict text sequences based on probability, and there is no evidence to suggest that LLMs are capable of reasoning as humans do. Understanding this distinction helps users recognize the limitations of LLMs and maintain realistic expectations about their capabilities.
LLMs lack consciousness and metacognition, and their general knowledge derives solely from their training data — and therefore reflects that data set’s errors, gaps and biases. Thus, although an LLM can accurately mimic the structure of logical reasoning, it does not necessarily follow that its conclusions themselves are accurate. Chain-of-thought prompts serve as a valuable organizing mechanism for LLM output, but an LLM could nevertheless present a coherent, well-structured output that contains logical errors and oversights.
Techniques such as retrieval-augmented generation (RAG) show promise for mitigating this limitation. RAG enables an LLM to access an external source — such as a vetted database or the internet — in real time when asked to deliver factual information. In this way, RAG eliminates the need for the LLM to rely solely on the internal knowledge base gleaned from its training data, which might be flawed or spotty.
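As a rough illustration, a retrieval-augmented pipeline typically fetches relevant passages first and then folds them into the prompt. In the sketch below, `retrieve` and `ask_llm` are hypothetical placeholders for whatever search index and model client are in use; this is a minimal outline, not a specific library's API.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# `retrieve` and `ask_llm` are hypothetical placeholders, not calls from a specific library.

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k passages most relevant to the query from an external source,
    such as a vetted database or a web search index."""
    raise NotImplementedError("Replace with your search index or vector store.")

def ask_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    raise NotImplementedError("Replace with your model client.")

def answer_with_rag(question: str) -> str:
    passages = retrieve(question)
    context = "\n".join(passages)
    # The retrieved passages ground the answer in external information rather than
    # relying solely on what the model memorized during training.
    prompt = (
        "Use the following passages to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Explain your answer step by step."
    )
    return ask_llm(prompt)
```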
However, while RAG can improve the accuracy and timeliness of an LLM’s outputs, it does not inherently address the problem of logical reasoning. Deduction and reasoning require more than just factual recall; they also involve the ability to derive conclusions through logic and analysis, aspects of AI performance that are more closely related to the algorithmic architecture and training of the LLM itself.
Moreover, the scalability of chain-of-thought prompting remains in question. Although the underlying principle of sequential, stepwise reasoning is broadly applicable in AI and machine learning, the specific technique of chain-of-thought prompting is currently limited to LLMs, as it relies on LLMs’ sophisticated performance on language tasks.
LLMs’ massive size requires significant data, compute and infrastructure, which raises issues around accessibility, efficiency and sustainability. In response to this problem, AI researchers have developed so-called small language models, which — while less powerful than LLMs — perform competitively on various language tasks and require fewer computational resources. However, it remains to be seen whether the benefits of chain-of-thought prompting are fully transferable to smaller models, as reducing their capabilities risks compromising their problem-solving effectiveness.
Finally, it’s important to keep in mind that chain-of-thought prompting is a technique for using an existing model more effectively, not a training method. While chain-of-thought prompts can help users elicit better results from pretrained LLMs, prompt engineering isn’t a cure-all and can’t fix model limitations that should have been handled during the training stage.
Chain-of-thought prompting vs. prompt chaining
Chain-of-thought prompting and prompt chaining sound similar, and both are prompt engineering techniques, but they differ in some important ways.
As discussed above, chain-of-thought prompting asks the model to describe the intermediate steps used to reason its way to a final answer within one response. This is useful for tasks that require detailed explanation, planning and reasoning, such as math problems and logic puzzles, where explaining the thought process is essential to fully understanding the solution.
In contrast, prompt chaining involves an iterative sequence of prompts and responses, in which each subsequent prompt is formulated based on the model’s output in response to the previous one. This makes prompt chaining a useful technique for more creative, exploratory tasks that involve gradual refinement, such as generating detailed narratives and brainstorming ideas.
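A rough sketch of that loop might look like the following, where each new prompt incorporates the model's previous response. Here, `ask_llm` is again a hypothetical stand-in for a real client call and simply returns a canned string so the example runs end to end.

```python
# Minimal sketch of prompt chaining: each prompt is built from the model's previous response.
# `ask_llm` is a hypothetical placeholder for any LLM client call.

def ask_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned string so the sketch runs end to end."""
    return f"[model response to: {prompt[:40]}...]"

steps = [
    "Brainstorm three possible settings for a short story about a lighthouse keeper.",
    "Pick the most atmospheric setting from the list above and outline the plot in five beats.",
    "Expand the outline above into an opening paragraph.",
]

previous_response = ""
for step in steps:
    # Fold the previous response into the next prompt so the idea is refined over several turns.
    prompt = f"{previous_response}\n\n{step}".strip()
    previous_response = ask_llm(prompt)
    print(previous_response)
```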
The fundamental difference between chain-of-thought prompting and prompt chaining lies in iteration and interactivity. Chain-of-thought prompting aims to encapsulate the reasoning process within a single detailed, self-contained response. Prompt chaining, on the other hand, takes a more dynamic approach, with multiple rounds of interaction that allow an idea to develop over time.