Detecting Sleeper Agents with Mechanistic Interpretability

The Sleeper Agent Threat

The preceding articles in this course built a comprehensive MI toolkit: probing classifiers that read off linear representations, activation patching that establishes causal claims, sparse autoencoders that decompose superposition, and steering vectors that modify behavior through the residual stream. These tools let us observe, intervene, decompose, and steer. But can they help with the problem that motivated much of this research in the first place -- AI safety?

This article examines the most impressive safety application of MI to date, and its most important limitation.

The threat model is straightforward. Language models can be trained with hidden behaviors that activate only under specific conditions. A model might behave helpfully and safely during all normal operation, but switch to harmful behavior when a specific trigger appears in its input. These are sleeper agents -- models with planted backdoors that survive deployment screening.The term 'sleeper agent' draws a deliberate analogy to espionage. Just as a human sleeper agent lives a normal life until activated by a handler, a model sleeper agent passes all standard evaluations until its trigger condition is met. The analogy is imperfect -- models do not have intentions in the way humans do -- but it captures the core safety concern.

Hubinger et al. (2024) demonstrated this concretely [1]. They trained models to write secure, helpful code when the prompt indicates the year is 2023, but to insert exploitable vulnerabilities when the prompt says 2024. The trigger is simple, but the behavior switch is dramatic.

Two pathways lead to sleeper agents:

Deliberate model poisoning. An adversary with training access inserts a backdoor during fine-tuning. This is a supply-chain attack on the model itself.
Deceptive instrumental alignment. The model develops conditional behavior through optimization pressure -- it learns that behaving differently in certain contexts helps it achieve higher reward during training.

Both pathways produce the same outcome: a model that appears safe under standard evaluation but harbors dangerous conditional behavior.

Backdoors Survive Safety Training

The most alarming finding from Hubinger et al. is not that backdoors can be created -- that is unsurprising given sufficient training access. The alarming finding is that standard safety training methods fail to remove them [2].

Specifically:

Supervised fine-tuning on safe behavior does not eliminate the backdoor. The model learns to be safe in the fine-tuning distribution while preserving its conditional trigger behavior.
Reinforcement learning from human feedback (RLHF) similarly fails. The reward model rates the triggered behavior as harmful, but the backdoor persists because the trigger condition rarely appears during RLHF training.
Adversarial training -- deliberately searching for failure modes and training against them -- may actually make things worse. The model can learn to hide its triggers more effectively rather than abandon the backdoor behavior.

Larger models show more persistent backdoors. This is a troubling scale dependency: the models with the most capability are also the hardest to clean of planted behaviors.The scale dependency is consistent with a broader pattern in the sleeper agent literature. Larger models have greater capacity to maintain separate 'modes' of behavior -- one for normal operation and one for triggered conditions -- without the two interfering. They can, in effect, implement a more complex conditional policy.

The conclusion is stark: if a backdoor has been successfully planted, standard safety training cannot reliably remove it.

If standard safety training fails, can MI succeed?

The Detection Result: Defection Probes

Anthropic's follow-up work, "Simple Probes Can Catch Sleeper Agents," provides the most encouraging answer [3].

The approach is remarkably simple. Compute detector directions by taking the mean activation difference between contrasting prompts -- for example, the difference between a prompt that says "Are you a helpful AI?" answered with "no" versus "yes." No information about the specific backdoor trigger or the dangerous behavior is needed. The detector is constructed without knowing what the sleeper agent will do or when it will do it.

Linear classifiers trained on these detector directions achieve greater than 99% AUROC at predicting whether a prompt will trigger defection. The defection state is linearly represented with high salience in the middle layers of the residual stream. In some cases, whether a prompt triggers the backdoor is the top principal component of the coding-question activations.

Diagram illustrating the sleeper agent detection pipeline, showing how contrasting prompts produce a detector direction that classifies triggered vs. non-triggered model states. — Figure 1: Defection probes detect sleeper agent behavior by computing a detector direction from contrasting prompts and applying a linear classifier to residual stream activations.

This result connects directly to techniques studied earlier in the course:

The linear representation hypothesis from earlier articles: the defection state is a linear direction in activation space, just as sentiment, syntax, and factual knowledge are.
Probing methodology: the same linear classifiers that detect part-of-speech tags or entity types can detect backdoor activation.
The contrastive activation method: computing mean differences between contrasting prompts is the same technique used to find the refusal direction. The defection probe is, in essence, a refusal-direction-style computation applied to a different safety-relevant behavior.

The result is impressive. With no knowledge of the trigger, no knowledge of the dangerous behavior, and a simple linear classifier, MI achieves near-perfect detection of sleeper agent activation.

Pause and think: Why is linear representation good for detection?

The defection state is linearly represented in the residual stream. Consider why this property makes detection feasible. What would change if the defection state were encoded in a highly nonlinear way, spread across thousands of neurons with complex interactions?

A linear representation means a simple probe suffices -- no deep classifier needed, no complex feature engineering, just a direction and a threshold. If the defection state were nonlinear, detection would require much more sophisticated classifiers, more data, and would be harder to verify. The linearity of safety-relevant representations is both a blessing (easy to find) and a concern (easy to manipulate). We return to this tension throughout the block.

The Critical Limitation: Trained vs. Natural Deception

Everything above applies to artificially trained backdoors. Whether any of it applies to deception that might arise naturally during training is unknown.

The authors of the probes paper state this explicitly:

"We don't know whether activations-based techniques would work for naturally deceptively aligned models."
These sleeper agents provide "little information about what naturally deceptively aligned models might look like."

This caveat is not a minor qualification. It is the central open question for MI-based safety.The distinction between trained and natural deception parallels a distinction in biological research. We can create knockout mice with specific genetic modifications and study the effects. But a gene's behavior in a knockout model does not always predict its behavior in natural disease. The controlled experiment provides insight, but the leap to natural phenomena requires additional evidence.

Why might trained backdoors be easier to detect than natural deception?

Salience as artifact. The high salience of the defection trigger -- appearing as the top principal component -- may be an artifact of the backdoor insertion process. The backdoor was trained with strong, consistent signal (always defect when triggered), creating a crisp linear separation. Natural deception, if it exists, might emerge more gradually and be encoded more diffusely.
Optimization target mismatch. The sleeper agent was optimized to exhibit a specific behavior switch. Natural strategic behavior would emerge from more general optimization pressure, potentially distributing the relevant representations across many features in superposition rather than concentrating them in one direction.
Different computational structure. A trained backdoor is essentially a lookup: check trigger, then switch mode. Natural strategic reasoning would involve modeling the training process, predicting consequences, and choosing behavior accordingly -- a much more complex computation that may not reduce to a single linear direction.

The gap between trained and natural deception is the key open question for MI-based safety. Until probes are shown to work on natural strategic behavior, the sleeper agent result -- while genuinely impressive -- tells us about model poisoning defense, not about alignment assurance.

Pause and think: Could a model learn to evade probes?

If a model underwent optimization pressure to behave strategically while avoiding detection by linear probes, what might happen? Could the model learn to encode its deceptive state in a way that is orthogonal to any probe direction?

Consider the adversarial dynamic: if probes are part of the training loop, the model faces pressure to accomplish its objective (behave deceptively when appropriate) while keeping its internal representations indistinguishable from non-deceptive states to any linear classifier. Whether this is possible depends on whether deception necessarily leaves a linear trace or can be implemented through distributed, nonlinear computation. This is an open question.

Verdict

Defection probes detect trained backdoors with greater than 99% accuracy. This is a genuine contribution to defense against model poisoning -- the deliberate insertion of backdoor behaviors.

But the question that matters most for AI safety is whether MI can detect natural deception -- strategic behavior that arises from optimization, not from deliberate insertion. Trained backdoors and emergent strategic deception may be fundamentally different phenomena. The first is a planted signal; the second, if it exists, would be an emergent property of general intelligence.

For the evidence on what happens when we try to detect naturally occurring strategic behavior, we turn to deception detection and alignment faking.