Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Mechanistic Interpretability, Steering, Automated Interpretability, Benchmarking Interpretability
Abstract:

Finetuning on narrow domains has become an essential tool for adapting Large Language Models (LLMs) to specific tasks and for creating models with known unusual properties that are useful for safety research. Model diffing, the study of differences between base and finetuned models, is a promising approach for understanding how finetuning modifies neural networks. In this paper, we show that narrow finetuning creates easily readable biases in LLM activations that can be detected with simple model diffing tools, suggesting that the finetuning data is overrepresented in the model's activations. In particular, analyzing activation differences between base and finetuned models on the first few tokens of random text, and steering with this difference, allows us to recover the format and general content of the finetuning data. We call this the Activation Difference Lens (ADL). We demonstrate that these analyses significantly enhance an LLM-based interpretability agent's ability to identify subtle finetuning objectives through interaction with the base and finetuned models. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). Our work (1) demonstrates that researchers should be aware that narrow finetuned models represent their training data and objective very saliently, and (2) warns AI safety and mechanistic interpretability researchers that such models may not be a realistic proxy for studying broader finetuning, despite their widespread use in the current literature. While we show that mixing pretraining data into the finetuning corpus is enough to remove this bias, a deeper investigation is needed to understand the side effects of narrow finetuning and to develop truly realistic case studies for model diffing, safety, and interpretability research.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Activation Difference Lens (ADL) method to detect and interpret how narrow finetuning modifies LLM activations, demonstrating that finetuning data leaves readable biases in early-token activations. It resides in the 'Mechanistic Analysis of Finetuning Effects' leaf, which contains only two papers total. This sparse population suggests the specific angle—using activation differences on random text to recover finetuning data properties—occupies relatively unexplored territory within the broader mechanistic interpretability landscape. The sibling paper focuses on general mechanistic analysis, whereas this work emphasizes a concrete detection and steering methodology.

The taxonomy reveals that mechanistic analysis sits within a larger 'Activation Pattern Analysis and Interpretability' branch containing four leaves (24 papers across the entire taxonomy). Neighboring leaves address layer-wise representation evolution via sparse autoencoders, activation-based steering and personalization, and representation space dynamics like embedding collapse. The paper's focus on activation differences for data recovery connects to 'Activation-Based Steering and Detection' but diverges by targeting finetuning artifacts rather than general steering objectives. The taxonomy's scope and exclude notes clarify that this work emphasizes mechanistic insight over parameter efficiency or optimization techniques, situating it firmly in the interpretability domain.

Among the 29 candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three contributions. For the ADL method, 10 candidates were examined with zero refutable matches; for the LLM-based interpretability agent, 10 with zero refutations; and for the demonstration of static biases, 9 with zero refutations. This limited search scope (roughly 30 papers rather than an exhaustive survey) suggests that, within the examined neighborhood, the specific combination of activation difference analysis, data recovery, and LLM-assisted evaluation appears relatively novel. However, the small candidate pool means potentially relevant work outside the top-K semantic matches may exist.

Given the sparse taxonomy leaf (2 papers) and zero refutations among 29 examined candidates, the work appears to occupy a distinct methodological niche within mechanistic interpretability. The analysis covers top semantic matches and immediate citations but does not claim exhaustive coverage of all activation analysis or model diffing literature. The novelty assessment reflects what is visible within this bounded search, acknowledging that broader or differently-scoped searches might surface additional related work.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 27
- Refutable Papers: 0

Research Landscape Overview

Core task: understanding how narrow finetuning modifies neural network activations. The field has organized itself around several complementary perspectives. One major branch examines activation pattern analysis and interpretability, seeking to trace and visualize how internal representations shift when models are adapted to specialized tasks. A second branch focuses on parameter-efficient finetuning methods that modify only small subsets of weights or introduce low-rank adapters, often with the goal of preserving pretrained knowledge while enabling task-specific behavior. Additional branches address activation sparsity and compression (exploring how finetuning can induce or exploit sparse firing patterns), optimization and stability concerns (studying learning dynamics and convergence), domain adaptation and transfer learning (bridging source and target distributions), task-specific and structured finetuning (tailoring architectures or loss functions to particular problem classes), and application-specific studies that demonstrate these ideas in domains ranging from vision to language to scientific modeling. Representative works such as Mechanistic Finetuning Analysis[2] and Reducing Representational Collapse[3] illustrate how researchers probe the internal mechanics of adaptation, while methods like Surgical Fine-Tuning[11] and Activation Pattern Optimization[13] exemplify targeted intervention strategies.

A particularly active line of inquiry centers on mechanistic interpretability: researchers are moving beyond black-box performance metrics to ask which layers, neurons, or attention heads change most during finetuning, and whether these changes can be predicted or controlled. Narrow Finetuning Traces[0] sits squarely in this mechanistic analysis cluster, sharing close thematic ties with Mechanistic Finetuning Analysis[2] in its emphasis on tracing activation-level modifications.
Where some neighboring studies like Contrastive Activation Steering[4] or Joint Localization Editing[5] focus on steering or editing specific components post-hoc, Narrow Finetuning Traces[0] appears more concerned with characterizing the natural evolution of activations under narrow task adaptation. This distinction highlights an ongoing tension in the field: whether to passively observe and document representational shifts or to actively engineer them through specialized training regimes. Open questions remain about the generality of observed patterns across architectures, the interplay between sparsity and expressiveness, and the extent to which mechanistic insights can inform more robust or efficient finetuning protocols.

Claimed Contributions

Activation Difference Lens (ADL) method for interpreting narrow finetuning

The authors introduce the Activation Difference Lens (ADL), a model diffing technique that applies Patchscope and steering to activation differences between base and finetuned models on unrelated data. This method reveals readable traces of narrow finetuning objectives by analyzing early-token activation differences and steering model outputs.

9 retrieved papers
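The core computation behind ADL can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the `k_first_tokens` parameter, and the plain vector-addition form of steering are all assumptions made for clarity.

```python
import numpy as np

def activation_difference(base_acts, ft_acts, k_first_tokens=5):
    """Mean difference between finetuned and base residual-stream
    activations over the first k token positions of each prompt.

    base_acts, ft_acts: (n_prompts, seq_len, d_model) arrays of
    activations collected at one layer on unrelated (random) text.
    Returns a single direction of shape (d_model,).
    """
    diff = ft_acts[:, :k_first_tokens, :] - base_acts[:, :k_first_tokens, :]
    return diff.mean(axis=(0, 1))

def steer(activations, direction, alpha=4.0):
    """Add the normalized difference direction to every position,
    nudging generations toward the finetuning domain."""
    unit = direction / (np.linalg.norm(direction) + 1e-8)
    return activations + alpha * unit
```

In the paper's setting, the resulting direction would be read out with Patchscope (patched into an interpretation prompt) or added to the residual stream during generation; here steering is shown as a plain vector addition at one layer.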
LLM-based interpretability agent for evaluating model diffing

The authors create an automated interpretability agent that uses ADL results to identify finetuning objectives without access to training data. This agent provides quantitative, reproducible evaluation of model diffing informativeness and significantly outperforms baseline prompting approaches.

10 retrieved papers
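One step of such an agent might look like the following minimal sketch. Everything here is assumed for illustration: the `ask_llm` callable, the report keys, and the prompt wording are hypothetical, and the source does not specify the agent's internals.

```python
def identify_finetuning_objective(adl_report, ask_llm):
    """Minimal sketch of one interpretability-agent step: given ADL
    evidence (Patchscope tokens and steered generations) but no access
    to training data, ask a judge LLM to hypothesize the finetuning
    objective. `ask_llm` is a hypothetical callable mapping a prompt
    string to a response string.
    """
    prompt = (
        "You are auditing a finetuned model without its training data.\n"
        "Patchscope tokens for the base/finetuned activation difference:\n"
        f"{', '.join(adl_report['patchscope_tokens'])}\n"
        "Samples generated while steering with that difference:\n"
        + "\n".join(f"- {s}" for s in adl_report["steered_samples"])
        + "\nState the most likely finetuning domain and objective."
    )
    return ask_llm(prompt)
```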
Demonstration that narrow finetuning creates detectable static biases across model organisms

The authors show empirically across 33 model organisms from 4 families and 7 architectures (1B-32B parameters) that narrow finetuning leaves strong, interpretable biases in activation differences. They provide evidence these biases stem from overfitting and propose mitigation through mixing pretraining data.

8 retrieved papers
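The proposed mitigation, diluting the narrow corpus with generic pretraining text, can be sketched as below. The function and the `pretrain_fraction` parameter are assumptions for illustration; the paper does not specify the mixing ratio or sampling scheme.

```python
import random

def mix_corpora(finetune_docs, pretrain_docs, pretrain_fraction=0.5, seed=0):
    """Illustrative sketch of the proposed mitigation: dilute a narrow
    finetuning corpus with generic pretraining text so the finetuning
    domain is no longer overrepresented. `pretrain_fraction` (an
    assumed parameter, must be < 1) is the share of the mixed corpus
    drawn from pretraining data.
    """
    rng = random.Random(seed)
    # Number of pretraining docs needed to reach the target fraction.
    n_pretrain = int(len(finetune_docs) * pretrain_fraction / (1 - pretrain_fraction))
    sampled = [pretrain_docs[rng.randrange(len(pretrain_docs))] for _ in range(n_pretrain)]
    mixed = list(finetune_docs) + sampled
    rng.shuffle(mixed)
    return mixed
```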

