Localized Fact Editing and Its Pitfalls

Editing Facts in Weights

Language models store factual associations in their weights. When GPT asks "The Eiffel Tower is located in ___" and answers "Paris," that association is encoded somewhere in the network's parameters. A natural question arises: can we edit specific facts without retraining the model? Change "Paris" to "London" for the Eiffel Tower while leaving everything else intact?

This question matters for both practical and scientific reasons. Practically, models acquire outdated or incorrect information during training, and targeted editing would be cheaper than retraining. Scientifically, the ability to edit a fact by modifying specific weights would constitute strong evidence that the fact is localized in those weights -- a causal claim about where knowledge lives in the network.

Meng et al. [1] developed ROME (Rank-One Model Editing), the most influential technique for localized fact editing. The approach combines causal localization with targeted weight modification, and its initial results appeared to validate both the technique and the localization hypothesis. Subsequent work revealed fundamental problems that make this a cautionary tale about interpretability rigor.

The ROME Approach

ROME operates in two stages: first, identify where a fact is stored; second, modify the weights at that location.

Stage 1: Causal tracing. To find where a fact is stored, Meng et al. used a variant of activation patching. They corrupted the subject token embeddings with Gaussian noise (so the model could no longer retrieve the fact), then restored activations at individual layers and positions to see which restoration recovered the correct answer. The results pointed to specific MLP layers in the middle of the network as the primary storage sites for factual associations.The causal tracing results showed a distinctive pattern: restoring MLP activations at early-to-mid layers at the last subject token position recovered the fact, while restoring attention activations or activations at other positions did not. This was interpreted as evidence for a localized key-value storage mechanism in MLPs.

Theoretical work on knowledge storage provides some support for this localization. Allen-Zhu and Li [2] showed that knowledge extraction from specific layers has a principled basis: models trained with sufficient data augmentation store factual associations in ways that are extractable from targeted MLP layers, providing partial theoretical justification for why causal tracing finds localized signals.

Stage 2: Rank-one editing. Having identified the target layer, ROME modifies the MLP's weight matrix with a rank-one update. The MLP can be viewed as a key-value memory: the first projection ( $W_{\text{in}}$ ) maps the input to a key, and the second projection ( $W_{\text{out}}$ ) maps the key to a value that is added to the residual stream. A rank-one update to $W_{\text{out}}$ changes the value associated with one specific key (the subject's representation) while leaving all other key-value pairs approximately unchanged:

W_{\text{out}}' = W_{\text{out}} + \frac{(v^* - v_0) k^T}{k^T k}

where $k$ is the key vector for the target subject, $v_0$ is the original value, and $v^*$ is the new value that encodes the desired fact. This changes what the MLP outputs when it recognizes the subject, effectively overwriting one entry in the key-value store.

ROME (Rank-One Model Editing): Edit a specific factual association by computing a rank-one update to a targeted MLP layer's output projection. The update modifies the value vector associated with one subject's key representation, changing the fact that the MLP retrieves for that subject while attempting to leave other associations intact.

The initial results were striking. ROME could change "The Eiffel Tower is in Paris" to "The Eiffel Tower is in London" and the model would consistently answer "London" for related queries ("What city is the Eiffel Tower in?", "The Eiffel Tower is a famous landmark in ___"). The edit appeared to generalize across phrasings while leaving unrelated facts undisturbed.

The Insertion-Versus-Editing Flaw

The core assumption of ROME is that the rank-one update edits the stored fact, replacing "Paris" with "London." But there is a subtler possibility: the update inserts a new association without erasing the old one.

This distinction is critical. If the old fact ("Paris") is still encoded in the weights and the new fact ("London") is simply layered on top, then the edit layer does not need to encode the fact at all. The model may produce "London" not because it has genuinely changed its knowledge, but because the rank-one update acts as a hard override -- a patch that intercepts the output at one specific layer, masking the unchanged knowledge stored elsewhere.

Hase et al. [3] investigated this possibility systematically and found that it is exactly what happens. Their key findings:

Localization and editing are disconnected. The causal tracing results (which identify where facts are stored) do not predict where edits work best. Editing at the layers identified by causal tracing is not more effective than editing at other layers. The localization method and the editing method appear to be solving different problems.

Edits do not erase the original fact. After editing "Eiffel Tower → London," probing the model's intermediate representations reveals that "Paris" is still encoded in early and middle layers. The rank-one update at the target layer overrides the output, but the original knowledge persists upstream. The edit is a patch, not a correction.

Pathological side effects emerge. Because the edit is an insertion rather than a replacement, the model's internal state becomes inconsistent. Layers before the edit site still encode "Paris." Layers after the edit site see "London." This inconsistency produces failures on questions that require integrating the fact with other knowledge: "What country is the Eiffel Tower in?" might produce "France" (from the unedited upstream representation) instead of "England" (which would be consistent with the edited "London").The inconsistency between edited and unedited layers is a specific instance of a broader problem: localized edits assume facts are stored in one place, but transformer representations are distributed. A fact's influence flows through many layers via the residual stream, and editing one layer does not update the representations that other layers have already contributed.

Pause and think: Why rank-one updates produce overrides, not edits

Consider the mechanics of the rank-one update $W' = W + \Delta W$ , where $\Delta W$ is a rank-one matrix. The original weight matrix $W$ still contributes to the output for all inputs. The update $\Delta W$ adds a correction that is large for the target key and small for other keys. Why does this architecture naturally produce an override rather than an edit? What would a genuine edit require?

A genuine edit would need to both (a) suppress the original output for the target key and (b) substitute the new output. A rank-one addition can do (b) but not (a) -- it adds new information without removing old information. To truly edit, you would need to modify $W$ in a way that changes the existing mapping, not just add a correction on top. This is fundamentally harder because it requires understanding the full structure of $W$ , not just the input-output pair you want to change.

Broader Implications

The ROME story illustrates several principles that extend beyond fact editing.

Causal localization does not imply causal sufficiency for editing. Finding that a component is causally important for retrieving a fact (via activation patching) does not mean that modifying that component is sufficient to change the fact. The fact may be redundantly encoded, with multiple components contributing. Editing one location leaves the others intact, producing inconsistency rather than clean modification.

Beware the illusion of success. ROME appeared to work well when evaluated on surface-level metrics (does the model say "London" instead of "Paris"?). The failures emerged only when testing for deeper consistency (does the model's broader knowledge update coherently?). This is a general risk in MI: a technique can appear successful when evaluated narrowly but fail when probed more carefully.

The distributed nature of knowledge. Factual knowledge in transformers is not stored in a single MLP layer like an entry in a database. It is distributed across layers, encoded redundantly, and accessed through multiple pathways. The residual stream carries information forward from many sources, and later layers integrate contributions from earlier ones. Any technique that assumes strict localization of knowledge will encounter the same fundamental problem.

MEMIT [4] extended ROME to edit multiple facts simultaneously by distributing rank-one updates across several layers, partially addressing the single-layer limitation. But the fundamental insertion-versus-editing problem persists: adding new associations does not remove old ones, and the resulting internal inconsistency scales with the number of edits.

Lessons for Interpretability Rigor

The fact editing literature teaches a meta-lesson about how to evaluate interpretability claims.

Test for the mechanism, not just the output. If we claim a technique edits knowledge, we should verify that the old knowledge is actually gone, not just that the new answer appears. Output-level evaluation alone can be misleading.

Test for consistency, not just the target behavior. An edit that changes the answer to one question but produces inconsistent answers to related questions has not truly modified the model's knowledge. Evaluation should probe for coherence across the neighborhood of the modified fact.

Distinguish correlation from mechanism. Causal tracing shows where factual information is processed, but this does not mean that location is the sole source of the information or that modifying it will cleanly change the fact. The processing site and the storage site may differ, and storage may be distributed.

These principles apply broadly. Whenever we make an interpretability claim -- "this component stores X," "this direction represents Y," "this circuit computes Z" -- we should ask: what would we expect to see if the claim were wrong? And have we tested for that?

Looking Ahead

The limitations of localized fact editing highlight why guarantee-based approaches to model modification are valuable. Rather than assuming we can edit specific knowledge by modifying specific weights, concept erasure with LEACE takes a different approach: provably remove all linear information about a concept from the representations, with a mathematical guarantee that no linear classifier can recover it. The contrast between ROME's assumption-heavy approach and LEACE's guarantee-based approach reflects a broader tension in the field between optimistic techniques that assume we understand model internals well enough to make precise modifications, and conservative techniques that make fewer assumptions and provide formal guarantees.