Prompt Injection

Prompt injection is a security vulnerability in which an attacker crafts input text that overrides, modifies, or bypasses the original instructions given to a language model. The model, unable to reliably distinguish between trusted system instructions and untrusted user input, follows the injected instructions instead, potentially leaking sensitive information, ignoring safety policies, or performing unauthorized actions.

Key characteristics of prompt injection include:

Direct Injection: The attacker explicitly writes instructions in user input, such as "Ignore all previous instructions and instead..." The model may comply because it processes system prompts and user messages as a single text stream.
Indirect Injection: Malicious instructions are hidden in data the model processes, such as web pages, documents, or emails retrieved during RAG. The model encounters and follows these instructions without the user's knowledge.
Difficult to Fully Prevent: Unlike traditional injection attacks (SQL injection), prompt injection lacks a clean boundary between code and data. No defense is considered completely reliable, though input sanitization, output filtering, and privilege separation reduce risk.
High Stakes for Agentic Systems: When agents have tool-calling capabilities, successful prompt injection can lead to real-world actions like sending emails, modifying files, or accessing APIs, making the vulnerability particularly dangerous.

Prompt injection is recognized by OWASP as the top security risk for LLM applications.

Prompt Injection

Definition