Interpretation as Question-Answering

Previous approaches to hidden state decoding generate explanations: given an activation, produce a description. But explanations are one-directional. What if we want to ask specific questions about an activation? "Does this representation encode sentiment?" "What entity is being referenced?" "Is harmful content present?"

LatentQA, introduced by Pan et al. [1]LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Pan, A., Chen, L., Steinhardt, J.
arXiv, 2024
, reframes activation interpretation as question-answering. Rather than generating open-ended descriptions, we train a decoder model to answer arbitrary questions about what a representation contains.

LatentQA: A task formulation where a model answers open-ended questions about neural network activations in natural language. The decoder model takes an activation as input (alongside a question) and produces an answer describing relevant properties of that activation.

The Visual Instruction Tuning Analogy

The approach draws a direct analogy to visual instruction tuning (VIT), which trains multimodal models to answer questions about images. In VIT:

  • Input: An image + a question
  • Output: A natural language answer
  • Training: Paired datasets of (image, question, answer)

LatentQA follows the same pattern:

  • Input: An activation + a question
  • Output: A natural language answer
  • Training: Paired datasets of (activation, question, answer)

The key insight is that activations, like images, are dense representations that can be "read" by a decoder trained on appropriate supervision. We are teaching models to perceive and describe activation vectors the way multimodal models perceive and describe images.

Latent Interpretation Tuning (LIT)

The LatentQA pipeline in two steps. Step 1, Curating LatentQA Data with GPT: a control is prepended to a stimulus, the target LLM produces an output, and GPT generates QA pairs about the control. Step 2, Finetuning the Decoder: activations from the stimulus are patched into a decoder LLM (a copy of the target, fine-tuned with LoRA), which is trained to answer the QA pairs.
The LatentQA training pipeline. (1) Data is curated by running a target LLM with a control prepended to a stimulus, capturing activations and generating QA pairs with GPT. (2) A decoder LLM is fine-tuned with LoRA to answer questions about patched-in activations. From Pan et al., LatentQA: Teaching LLMs to Decode Activations Into Natural Language.[2]LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Pan, A., Chen, L., Steinhardt, J.
arXiv, 2024

The training procedure, called Latent Interpretation Tuning, works as follows:

  1. Collect activations. Run a target model on diverse inputs, extracting hidden representations from chosen layers.

  2. Generate Q&A pairs. For each activation, create question-answer pairs that describe its content. Questions can cover:

    • What concept is represented
    • What entity or object is referenced
    • What sentiment or attitude is encoded
    • Whether specific properties are present
  3. Train the decoder. Fine-tune a language model to take (activation, question) pairs as input and produce the correct answer. The activation is fed as an embedding that the decoder conditions on.

  4. Evaluate generalization. Test on held-out activations and questions to verify that the decoder learns general interpretation skills, not just memorized associations.

The resulting decoder can answer novel questions about novel activations, enabling flexible interrogation of what representations encode.

Applications

LatentQA enables several practical capabilities:

Extraction

We can query activations to extract specific knowledge. Given the activation produced when a model processes "The Eiffel Tower is in ___," we can ask:

  • "What city is being referenced?" → "Paris"
  • "What country is relevant?" → "France"
  • "What type of landmark is this?" → "Tower/monument"

This goes beyond vocabulary projection. We are not just asking "what token would be predicted?" but "what knowledge is encoded about this entity?"

The authors demonstrate extracting relational knowledge and uncovering system prompts from model activations, capabilities that help understand what models internally represent about their inputs and context.

Control

Because the decoder provides a differentiable loss over activations, we can optimize activations to satisfy desired properties. If the decoder answers "negative" when asked about sentiment, we can compute gradients to adjust the activation toward "positive."

This enables:

  • Debiasing: Modifying activations so the decoder reports reduced stereotyping
  • Sentiment steering: Adjusting generation tone by optimizing activation properties
  • Concept editing: Changing what properties the decoder detects in a representation

The decoder becomes not just a reading tool but a control signal for modifying representations.

Safety Analysis

The most striking application is safety research. Harmful knowledge may be encoded in activations even when models refuse to express it in text. By querying activations directly, we can detect:

  • Whether bioweapon recipes are encoded in representations
  • Whether hacking techniques are represented
  • Whether the model has learned dangerous capabilities it normally refuses to express

This does not bypass safety training, it reveals what safety training is blocking. Understanding where harmful knowledge resides helps evaluate and improve safety measures.

Pause and think: The dual-use nature of LatentQA

LatentQA can reveal harmful capabilities that models refuse to express. This is valuable for safety research: we need to know what models know to evaluate safety measures.

But the same capability could potentially be used to extract dangerous knowledge. How should we think about this dual-use concern?

Consider:

  • Is revealing latent harmful knowledge more or less dangerous than leaving it hidden?
  • Does knowing where harmful knowledge resides help or hurt safety interventions?
  • Should LatentQA tools be open or restricted?

There are no easy answers. The tension between transparency and misuse potential is a recurring theme in interpretability research.

Architecture

The decoder model receives activations through a projection layer that maps them into the decoder's input space. Several design choices matter:

Which layers to read. Different layers encode different information. Early layers may encode syntactic features; later layers may encode semantic content. The decoder can be trained on activations from specific layers or from all layers.

How to present activations. The activation can be concatenated with the question embedding, inserted at a special position, or processed through cross-attention. The choice affects what information the decoder can access.

Decoder capacity. Larger decoders may answer more complex questions but require more training data. The authors find that moderate-sized decoders generalize well when trained on diverse Q&A pairs.

Relation to Prior Methods

LatentQA builds on insights from earlier approaches:

Method Input Output Format
Patchscopes Activation + template Completion Fill-in-blank
SelfIE Activation + prompt Description Freeform
Trained explanation Activation Explanation Freeform
LatentQA Activation + question Answer Q&A

The Q&A format offers specific advantages:

  • Questions constrain the output space, making evaluation easier
  • Arbitrary questions enable flexible interrogation
  • Differentiable answers enable optimization for control

Limitations

Training data quality. Q&A pairs require either manual annotation or heuristic generation. Incorrect answers in training data will be learned by the decoder.

Question scope. The decoder can only answer questions about properties represented in its training distribution. Novel question types may produce unreliable answers.

Activation-to-embedding mapping. The projection from activations to decoder input may lose information. Properties present in activations may not be accessible through the learned mapping.

Generalization bounds. While decoders generalize to new activations within the training distribution, performance on very different models or very different inputs may degrade.

Why LatentQA Matters

LatentQA demonstrates that treating activations as a modality, like images or audio, is a productive framing. Multimodal techniques developed for vision can be adapted for interpretability.

The Q&A format is particularly powerful because it:

  • Enables systematic benchmarking (correctness of answers can be evaluated)
  • Supports diverse queries without retraining
  • Provides a differentiable interface for control

This sets the stage for building general-purpose activation interpreters that can answer any question about any activation. Activation Oracles pursues exactly this goal, scaling LatentQA to create broadly capable explanation models.

Looking Ahead

LatentQA trains decoders on specific interpretation tasks. But can we create general-purpose explainers that work across diverse tasks and even generalize to scenarios not seen during training? Activation Oracles investigates this question, training broadly on many interpretation tasks to achieve zero-shot generalization to novel settings.