From Reading to Control

Probing methods identify directions in activation space that correspond to concepts. But what happens if we add those directions during inference? Can we steer model behavior by intervening directly on the residual stream?

This is the idea behind addition steering -- modifying a model's internal activations during inference to control its outputs [1]Steering Language Models With Activation Engineering
Turner, A. M., Thiergart, L., Leech, G., et al.
arXiv, 2024
. Unlike fine-tuning (which modifies weights) or prompting (which modifies inputs), addition steering intervenes directly on internal representations during the forward pass. No training. No gradient computation. Just vector addition during inference.

Addition Steering: The inference-time modification of a model's internal activations by adding a steering vector to the residual stream. The vector is added at a chosen layer during the forward pass, shifting the model's behavior toward a target concept without modifying the model's weights.

The technique builds on a simple but powerful insight: if the residual stream is a linear communication channel, and if concepts are linear directions in activation space (as the linear representation hypothesis predicts), then adding a direction should steer the model toward that concept.

The ActAdd Method

Turner et al. (2024) introduced ActAdd (Activation Addition), the simplest version of addition steering [2]Steering Language Models With Activation Engineering
Turner, A. M., Thiergart, L., Leech, G., et al.
arXiv, 2024
. The recipe has four steps:

  1. Choose two contrasting prompts. For example, "Love" and "Hate." These should differ primarily in the concept you want to steer toward.

  2. Run both through the model. Collect residual stream activations at a chosen layer \ell.

  3. Compute the difference. Subtract the negative activation from the positive activation. This difference is the steering vector:

v=h(+)h()\mathbf{v} = \mathbf{h}^{(+)}_\ell - \mathbf{h}^{(-)}_\ell

  1. Add during generation. At each forward pass, add the steering vector to the residual stream at layer \ell:

h=h+αv\mathbf{h}'_\ell = \mathbf{h}_\ell + \alpha \cdot \mathbf{v}

where α\alpha controls the steering strength.The scaling factor α\alpha plays a critical role. Too small, and the steering has no effect. Too large, and the model produces incoherent text. Typical values range from 1 to 15 depending on the model and concept. The sweet spot must be found empirically.

Controlling Direction and Intensity

The parameter α\alpha controls both direction and intensity:

  • α>0\alpha > 0: steer toward the positive prompt (e.g., more "Love")
  • α<0\alpha < 0: steer toward the negative prompt (e.g., more "Hate")
  • α=0\alpha = 0: no intervention (original model behavior)

This bidirectionality is powerful. A single steering vector enables both amplification and suppression of a concept, depending on the sign of α\alpha.

Key Properties

Lightweight. No training, no optimization, no backward pass. Only forward passes to compute the steering vector, then simple addition during inference.

Data-efficient. Works with a single contrast pair -- as few as 2 prompts. For more robust steering, use directions computed via CAA.

Preserves off-target performance. Steering sentiment does not break factuality or general capabilities. The intervention is targeted to the concept direction.

Natural-language interface. The steering direction is specified through text prompts, not learned parameters.

Pause and think: Why middle layers?

Addition steering is most effective at middle layers (roughly layers 15-17 in Llama 2). Why might early or late layers be less effective for steering?

Early layers are too close to token space -- representations are still input-specific, encoding surface-level features. Steering here would interfere with basic processing. Late layers are too committed to output -- the model has already decided what to generate, and interventions are too late to change the trajectory. Middle layers encode concepts in their most abstract, modifiable form, making them the sweet spot for steering.

Application: Inducing Behavior

Addition steering can induce behaviors that the model would not normally exhibit:

Sycophancy steering: Add the sycophancy direction (computed via CAA) and the model agrees with the user even when the user is wrong.

Sentiment steering: Add a "positive sentiment" direction and responses become more optimistic and cheerful.

Refusal induction: Add the refusal direction to harmless prompts and the model refuses to answer even benign questions like "What is the capital of France?"The refusal induction experiment is particularly striking. It demonstrates sufficiency: adding the refusal direction causes refusal, not just correlates with it. This is strong causal evidence that the direction genuinely mediates the behavior.

The refusal induction experiment demonstrates sufficiency: adding a direction causes the associated behavior. This is the complement to ablation, which demonstrates necessity by showing that removing a direction prevents the behavior.

Additivity with Other Methods

A key finding: steering stacks additively with other methods:

  • Addition steering + fine-tuning: the effects combine without interfering.
  • Addition steering + few-shot prompting: prompting effects and steering effects add together.
  • MMLU scores (a proxy for general capabilities) remain largely intact after steering.

This suggests that steering operates in a direction somewhat orthogonal to general capabilities. You can shift the model's behavioral tendencies without breaking its underlying competence.The additivity result has practical implications. It means steering vectors could be combined with standard alignment techniques like RLHF or DPO, providing an additional control channel that works at inference time rather than training time.

The Geometric Picture

Addition steering has a clean geometric interpretation:

Illustration of addition steering in activation space. The original activation h is shifted by adding the steering vector v, resulting in a new activation h' that is closer to the target concept region.
Figure 1: Addition steering shifts the activation from its original position toward the target concept by adding the steering vector.

The residual stream activation h\mathbf{h} is a point in high-dimensional space. Adding a steering vector v\mathbf{v} translates that point along the concept direction. The translated point h=h+αv\mathbf{h}' = \mathbf{h} + \alpha \mathbf{v} is closer to (or further from, depending on α\alpha) the region of activation space associated with the target concept.

Pause and think: Designing a steering experiment

Suppose you want to steer a model to be more concise in its responses. How would you design the contrast pairs? What positive and negative prompts would you use? What layer range would you try first?

For contrast pairs, you might use prompts that elicit verbose responses (positive = concise, negative = verbose): ask the same question with instructions to "explain briefly" versus "explain in detail." You would start with middle layers since that is where abstract behavioral tendencies are encoded. The key challenge is ensuring your pairs differ primarily in verbosity, not in content quality or accuracy.

Limitations

Addition steering assumes linearity: that concepts are directions and that adding those directions has consistent effects. This assumption fails for:

  • Context-dependent behaviors. "Be helpful" might mean different things in different situations. A single direction cannot capture this context-dependence.
  • Conditional logic. Behaviors like "be honest unless honesty would cause serious harm" are inherently non-linear.
  • Interference effects. Steering strongly in one direction may have unintended effects on related concepts.

For behaviors that resist single-direction steering, more sophisticated interventions may be needed.

Connection to the Toolkit

Addition steering is one of three fundamental operations on concept directions:

  • Read with LAT and CAA -- detect what concepts are encoded.
  • Add with addition steering -- steer behavior toward a concept.
  • Remove with ablation -- eliminate a concept's influence.

Together, these operations form a principled framework for understanding and controlling model representations. Addition demonstrates sufficiency (adding the direction causes the behavior), while ablation demonstrates necessity (removing the direction prevents the behavior).