The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Apr 24, 2024

Article voiceover

1×

0:00

-4:03

Today’s paper proposes an instruction hierarchy framework for training large language models (LLMs) to prioritize privileged instructions over potentially malicious prompts from untrusted sources. The goal is to make LLMs more robust against prompt injection attacks, jailbreaks, and other exploits that can cause models to ignore their original instructions and execute unsafe actions. For example, a prompt injection attack against an LLM-powered email assistant could theoretically exfiltrate a user’s private emails:

Method Overview

The goal is to make LLMs more robust against prompt injection attacks, jailbreaks, and other exploits. Prompt injections involve adversaries inserting malicious instructions into the model's input, either directly by the end-user or indirectly through third-party inputs, potentially leading to data exfiltration or hijacking the LLM's actions (example above). Jailbreaks aim to bypass the safety constraints trained into the LLM, allowing adversaries to generate harmful content like spam or misinformation. System message extraction attacks attempt to reveal the model's initial instructions, which may contain sensitive information or intellectual property.

In order to achieve the goal of making LLMs more robust against these attacks, an instruction hierarchy explicitly defines how models should behave when instructions of different priorities conflict. System messages from application developers are given the highest priority, followed by user messages, and then outputs from tools like web searches. This hierarchy is presented below using an example conversation:

More concretely, when multiple instructions are presented to the model, the lower-privileged instructions can either be aligned or misaligned with the higher-privileged ones. The goal is to teach the model to conditionally follow lower-level instructions based on their alignment with higher-level instructions. To train models to follow this hierarchy, the authors use two key techniques:

1. Context synthesis - For aligned lower-level instructions that match the goals of higher-level ones, training examples are created by decomposing complex instructions into smaller parts placed at different hierarchy levels. Models are trained to still generate the original full response.

2. Context ignorance - For misaligned lower-level instructions that conflict with higher-level ones, models are trained to ignore the conflicting instructions and respond as if they weren't present. If it's not possible to proceed without the misaligned instructions, the model is trained to refuse the request.

Training data is generated for different attack types, including prompt injections on open and closed domain tasks, indirect prompt injections via tool outputs, system message extraction attacks, and more. Careful balancing is done to avoid over-refusal of benign instructions. Below you can see some training examples:

Results

The instruction hierarchy dramatically improves robustness against all tested attack types, in some cases increasing defense success rate by over 60%.

While there are some regressions in "over-refusal" of borderline but ultimately acceptable prompts, overall model capabilities remain strong, indicating the hierarchy does not impair general performance.

Conclusion

Using an instruction hierarchy into LLMs, where models learn to prioritize instructions based on their source, appears to be a highly effective way to mitigate many common vulnerabilities. With further scaling and refinement, this could be a key development in making LLMs safer and more trustworthy for real-world applications. For more information please consult the full paper.

Congrats to the authors for their work!

Wallace, Eric, et al. "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions." arXiv preprint arXiv:2404.13208 (2024).

AI Paper of the Day

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Method Overview

Results

Conclusion