Narrow Finetuning Leaves Clearly Readable Traces in the Activation Differences

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Mechanistic Interpretability, Steering, Automated Interpretability, Benchmarking Interpretability
Abstract:

Finetuning on narrow domains has become an essential tool for adapting Large Language Models (LLMs) to specific tasks and for creating models with known unusual properties that are useful for safety research. Model diffing, the study of differences between base and finetuned models, is a promising approach for understanding how finetuning modifies neural networks. In this paper, we show that narrow finetuning creates easily readable biases in LLM activations that can be detected with simple model diffing tools, suggesting that the finetuning data is overrepresented in the model's activations. In particular, analyzing activation differences between base and finetuned models on the first few tokens of random text, and steering with this difference, allows us to recover the format and general content of the finetuning data. We call this the Activation Difference Lens (ADL). We demonstrate that these analyses significantly enhance an LLM-based interpretability agent's ability to identify subtle finetuning objectives through interaction with the base and finetuned models. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). Our work (1) demonstrates that researchers should be aware that narrow finetuned models represent their training data and objective very saliently, and (2) warns AI safety and mechanistic interpretability researchers that such models may not be a realistic proxy for studying broader finetuning, despite their widespread use in the current literature. While we show that mixing pretraining data into the finetuning corpus is enough to remove this bias, a deeper investigation is needed to understand the side effects of narrow finetuning and to develop truly realistic case studies for model diffing, safety, and interpretability research.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Activation Difference Lens (ADL) method to detect and interpret how narrow finetuning modifies LLM activations, demonstrating that finetuning data leaves readable biases in early-token activations. It resides in the 'Mechanistic Analysis of Finetuning Effects' leaf, which contains only two papers total. This sparse population suggests the specific angle—using activation differences on random text to recover finetuning data properties—occupies relatively unexplored territory within the broader mechanistic interpretability landscape. The sibling paper focuses on general mechanistic analysis, whereas this work emphasizes a concrete detection and steering methodology.

The taxonomy reveals that mechanistic analysis sits within a larger 'Activation Pattern Analysis and Interpretability' branch containing four leaves (24 papers across the entire taxonomy). Neighboring leaves address layer-wise representation evolution via sparse autoencoders, activation-based steering and personalization, and representation space dynamics like embedding collapse. The paper's focus on activation differences for data recovery connects to 'Activation-Based Steering and Detection' but diverges by targeting finetuning artifacts rather than general steering objectives. The taxonomy's scope and exclude notes clarify that this work emphasizes mechanistic insight over parameter efficiency or optimization techniques, situating it firmly in the interpretability domain.

Among the 29 candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three contributions. For the ADL method, 10 candidates were examined with zero refutable matches; for the LLM-based interpretability agent, 10 with zero refutations; and for the demonstration of static biases, 9 with zero refutations. This limited search scope (roughly 30 papers rather than an exhaustive survey) suggests that, within the examined neighborhood, the specific combination of activation difference analysis, data recovery, and LLM-assisted evaluation appears relatively novel. However, the small candidate pool means potentially relevant work outside the top-K semantic matches may exist.

Given the sparse taxonomy leaf (2 papers) and zero refutations among 29 examined candidates, the work appears to occupy a distinct methodological niche within mechanistic interpretability. The analysis covers top semantic matches and immediate citations but does not claim exhaustive coverage of all activation analysis or model diffing literature. The novelty assessment reflects what is visible within this bounded search, acknowledging that broader or differently-scoped searches might surface additional related work.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 27
- Refutable Papers: 0

Research Landscape Overview

Core task: understanding how narrow finetuning modifies neural network activations. The field has organized itself around several complementary perspectives. One major branch examines activation pattern analysis and interpretability, seeking to trace and visualize how internal representations shift when models are adapted to specialized tasks. A second branch focuses on parameter-efficient finetuning methods that modify only small subsets of weights or introduce low-rank adapters, often with the goal of preserving pretrained knowledge while enabling task-specific behavior. Additional branches address activation sparsity and compression (exploring how finetuning can induce or exploit sparse firing patterns), optimization and stability concerns (studying learning dynamics and convergence), domain adaptation and transfer learning (bridging source and target distributions), task-specific and structured finetuning (tailoring architectures or loss functions to particular problem classes), and application-specific studies that demonstrate these ideas in domains ranging from vision to language to scientific modeling. Representative works such as Mechanistic Finetuning Analysis[2] and Reducing Representational Collapse[3] illustrate how researchers probe the internal mechanics of adaptation, while methods like Surgical Fine-Tuning[11] and Activation Pattern Optimization[13] exemplify targeted intervention strategies.

A particularly active line of inquiry centers on mechanistic interpretability: researchers are moving beyond black-box performance metrics to ask which layers, neurons, or attention heads change most during finetuning, and whether these changes can be predicted or controlled. Narrow Finetuning Traces[0] sits squarely in this mechanistic analysis cluster, sharing close thematic ties with Mechanistic Finetuning Analysis[2] in its emphasis on tracing activation-level modifications.
Where some neighboring studies like Contrastive Activation Steering[4] or Joint Localization Editing[5] focus on steering or editing specific components post-hoc, Narrow Finetuning Traces[0] appears more concerned with characterizing the natural evolution of activations under narrow task adaptation. This distinction highlights an ongoing tension in the field: whether to passively observe and document representational shifts or to actively engineer them through specialized training regimes. Open questions remain about the generality of observed patterns across architectures, the interplay between sparsity and expressiveness, and the extent to which mechanistic insights can inform more robust or efficient finetuning protocols.

Claimed Contributions

Activation Difference Lens (ADL) method for interpreting narrow finetuning

The authors introduce the Activation Difference Lens (ADL), a model diffing technique that applies Patchscope and steering to activation differences between base and finetuned models on unrelated data. This method reveals readable traces of narrow finetuning objectives by analyzing early-token activation differences and steering model outputs.

9 retrieved papers
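The core computation behind ADL can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function names, the `k_first_tokens` parameter, and the plain vector-addition form of steering are all assumptions made for clarity.

```python
import numpy as np

def activation_difference(base_acts, ft_acts, k_first_tokens=5):
    """Mean difference between finetuned and base residual-stream
    activations over the first k token positions of each prompt.

    base_acts, ft_acts: (n_prompts, seq_len, d_model) arrays of
    activations collected at one layer on unrelated (random) text.
    Returns a single direction of shape (d_model,).
    """
    diff = ft_acts[:, :k_first_tokens, :] - base_acts[:, :k_first_tokens, :]
    return diff.mean(axis=(0, 1))

def steer(activations, direction, alpha=4.0):
    """Add the normalized difference direction to every position,
    nudging generations toward the finetuning domain."""
    unit = direction / (np.linalg.norm(direction) + 1e-8)
    return activations + alpha * unit
```

In the paper's setting, the resulting direction would be read out with Patchscope (patched into an interpretation prompt) or added to the residual stream during generation; here steering is shown as a plain vector addition at one layer.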
LLM-based interpretability agent for evaluating model diffing

The authors create an automated interpretability agent that uses ADL results to identify finetuning objectives without access to training data. This agent provides quantitative, reproducible evaluation of model diffing informativeness and significantly outperforms baseline prompting approaches.

10 retrieved papers
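One step of such an agent might look like the following minimal sketch. Everything here is assumed for illustration: the `ask_llm` callable, the report keys, and the prompt wording are hypothetical, and the source does not specify the agent's internals.

```python
def identify_finetuning_objective(adl_report, ask_llm):
    """Minimal sketch of one interpretability-agent step: given ADL
    evidence (Patchscope tokens and steered generations) but no access
    to training data, ask a judge LLM to hypothesize the finetuning
    objective. `ask_llm` is a hypothetical callable mapping a prompt
    string to a response string.
    """
    prompt = (
        "You are auditing a finetuned model without its training data.\n"
        "Patchscope tokens for the base/finetuned activation difference:\n"
        f"{', '.join(adl_report['patchscope_tokens'])}\n"
        "Samples generated while steering with that difference:\n"
        + "\n".join(f"- {s}" for s in adl_report["steered_samples"])
        + "\nState the most likely finetuning domain and objective."
    )
    return ask_llm(prompt)
```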
Demonstration that narrow finetuning creates detectable static biases across model organisms

The authors show empirically across 33 model organisms from 4 families and 7 architectures (1B-32B parameters) that narrow finetuning leaves strong, interpretable biases in activation differences. They provide evidence these biases stem from overfitting and propose mitigation through mixing pretraining data.

8 retrieved papers
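The proposed mitigation, diluting the narrow corpus with generic pretraining text, can be sketched as below. The function and the `pretrain_fraction` parameter are assumptions for illustration; the paper does not specify the mixing ratio or sampling scheme.

```python
import random

def mix_corpora(finetune_docs, pretrain_docs, pretrain_fraction=0.5, seed=0):
    """Illustrative sketch of the proposed mitigation: dilute a narrow
    finetuning corpus with generic pretraining text so the finetuning
    domain is no longer overrepresented. `pretrain_fraction` (an
    assumed parameter, must be < 1) is the share of the mixed corpus
    drawn from pretraining data.
    """
    rng = random.Random(seed)
    # Number of pretraining docs needed to reach the target fraction.
    n_pretrain = int(len(finetune_docs) * pretrain_fraction / (1 - pretrain_fraction))
    sampled = [pretrain_docs[rng.randrange(len(pretrain_docs))] for _ in range(n_pretrain)]
    mixed = list(finetune_docs) + sampled
    rng.shuffle(mixed)
    return mixed
```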

