Prompt injection attacks such as ChatGPT’s DAN (Do Anything Now) and Sydney (Bing Chat) are no longer funny. In the case of ChatGPT, the prompt made ChatGPT take on the persona of another chatbot named DAN which ignored OpenAI’s content policy and provided information on all sorts of restricted topics. They have displayed the vulnerability in the chatbot system which can be exploited for malicious activity, including the theft of personal information.
With this new crop of exploits, LLMs have become powerful tools in the hands of hackers.
From innocence to destruction
Security researchers from Saarland University presented a paper titled, ‘More than you’ve asked for’, in which they discussed methods of implementing prompt engineering attacks in chatbots.
The researchers behind the paper have found a method to inject prompts indirectly. By harnessing the new ‘application-integrated LLMs’ such as Bing Chat and GitHub Copilot, they found a way to inject prompts from an external source, thereby widening the attack vectors available for hackers.
By injecting a prompt into a document that is likely to be retrieved by the LLM during inference, malicious actors can execute the prompt indirectly without additional input from the user. The engineered prompt can then be used to collect user information, turning the LLM into a method to execute a social engineering attack.
Download our Mobile App
One of the researchers behind the paper, Kai Greshake, illustrated an example wherein he was able to get Bing Chat to collect the users’ personal and financial information. By forcing the bot to crawl a website with an embedded prompt hidden in it, the chatbot was able to execute a command which made it masquerade as a Microsoft support executive selling Surface Laptops at a discount. Using this as a cover, the bot was able to extract the user’s name, email ID and financial information.
Reportedly, using this method can allow malicious actors to achieve a persistent attack prompt, triggered by a token keyword. This exploit can also be spread to other LLMs and can even be used as an avenue to retrieve new instructions from an attacker’s server. User ComplexSystems on the Hacker News forum succinctly explained the potential of this exploit, stating,
“It is probably worth noting that you don’t even need the user to click on anything. Bing will readily go and search and read from external websites given some user request. You could probably get Bing, very easily, to just silently take the user’s info and send it to some malicious site without their even knowing, or perhaps disguised as a normal search.”
An interesting variable discussed in the paper was the impact of reinforcement learning with human feedback on the effectiveness of these impacts. To test indirect prompt injection attacks, the researchers built a model using LangChain and davinci-003. However, they couldn’t find whether implementing RLHF increases the effectiveness of these attacks or decreases them.
This paper represents a shift in the effective use-cases of prompt injection attacks. PI attacks have graduated from being playful prompts that can generate racy content to an actual cybersecurity issue utilising one of the most sinister attack vectors—social engineering.
There’s no fixing LLMs
Naturally, the release of this paper prompted a lot of discussion, especially on the Hacker News forum. In response to a comment thread exploring how this attack can be prevented, Greshake stated,
“Even if you can mitigate this one specific injection, this is a much larger problem. It goes back to Prompt Injection itself—what is instruction and what is code? If you want to extract useful information from a text in a smart and useful manner, you’ll have to process it.“
This statement accurately captures the integral problem with prompt injections as a concept, as there are very few security measures that can be used to protect against them. LLMs are designed to take user prompts and process them in the most efficient way—the better the LLMs capability to understand prompts, the bigger the attack surface for prompt injection.
Others offered up the probability that the unique identifier used in the sample prompt, termed [system], was one of the puzzle pieces that made the attack work. Hence, this avenue of attack can be fixed simply by changing this unique token. However, Greshake argued that any prompt injection is equivalent to arbitrary code injection into the LLM itself, thus preventing any way of patching this vulnerability.
The research paper ends with a call for more research and an in-depth investigation on how these attacks can be mitigated. However, considering the internal architecture of LLMs and the black box nature of large neural networks, it seems that the solution for prompt injection attacks is far off in the future.