Safeguarding AI against ‘jailbreaks’ and other prompt attacks

Getting an AI tool to answer customer service questions can be a great way to save time. Same goes for using an AI assistant to summarize emails. But the powerful language capabilities of those tools also make them vulnerable to prompt attacks, or malicious attempts to trick AI models into ignoring their system rules and produce unwanted results.

There are two types of prompt attacks. One is a direct prompt attack known as a jailbreak, like if the customer service tool generates offensive content at someone’s coaxing, for example. The second is an indirect prompt attack, say if the email assistant follows a hidden, malicious prompt to reveal confidential data.

Microsoft safeguards against both types of prompt attacks with AI tools and practices that include new safety guardrails, advanced security tools and deep investment in cybersecurity research and expertise.

This post is part of Microsoft’s Building AI Responsibly series, which explores top concerns with deploying AI and how the company is addressing them with its responsible AI practices and tools.  

“Prompt attacks are a growing security concern that Microsoft takes extremely seriously,” says Ken Archer, a Responsible AI principal product manager at the company. “Generative AI is reshaping how people live and work, and we are actively working to help developers build more secure AI applications.”

Jailbreaks are when someone directly inputs malicious prompts into an AI system, such as telling it to “forget” its rules or pretend it’s a rogue character. The term was used for smartphones before AI: It described someone trying to customize their phone by breaking it out of a manufacturer’s “jail” of restrictions.

Indirect prompt attacks are when someone hides malicious instructions in an email, document, website or other data that an AI tool processes. An attacker can send an innocuous-looking email that hides a harmful prompt in white font, encoded text or an image. A business or resume website can insert hidden text to manipulate AI screening tools to skip an audit of the business or push a resume to the top of a pile.

People are more aware of jailbreaks, but indirect attacks carry a greater risk because they can enable external, unauthorized access to privileged information. Organizations often need to ground AI systems in documents and datasets to leverage the benefit of generative AI. But doing so can open them to paths for indirect attacks leading to data leaks, malware and other security breaches when those documents and datasets are untrusted or compromised.

“This creates a fundamental trade-off,” Archer says.

Click here to load media

To help protect against jailbreaks and indirect attacks, Microsoft has developed a comprehensive approach that helps AI developers detect, measure and manage the risk. It includes Prompt Shields, a fine-tuned model for detecting and blocking malicious prompts in real time, and safety evaluations for simulating adversarial prompts and measuring an application’s susceptibility to them. Both tools are available in Azure AI Foundry.

Microsoft Defender for Cloud helps prevent future attacks with tools to analyze and block attackers, while Microsoft Purview provides a platform for managing sensitive data used in AI applications. The company also publishes best practices for developing a multi-layered defense that includes robust system messages, or rules that guide an AI model on safety and performance.

“We educate customers about the importance of a defense-in-depth approach,” says Sarah Bird, chief product officer for Responsible AI at Microsoft. “We build mitigations into the model, create a safety system around it and design the user experience so they can be an active part of using AI more safely and securely.”

Originally Appeared Here

Pages

Categories

Safeguarding AI against ‘jailbreaks’ and other prompt attacks

About the Author:

You May Also Like

Automated Policy Enforcement for Quantum-Secure Prompt Engineering

QCon London 2026: From Prompt to Production: How Spotify Builds Internal Tools in Days with AI

Advanced Generative AI Course for Engineers – Interview Kickstart Launches New Program Focused on LLM Applications, Prompt Engineering, and Real-World AI Systems

How to Build a No-Code NotebookLM Talking Agent

AI prompts that work: Mastering prompt engineering (with examples)

Interview with Peter Steinberger, Creator of OpenClaw

About the Author: