Abstract:

Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. Here we show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus LN cannot play a substantial role in language modeling. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting scaling to larger models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face. Furthermore, we test interpretability techniques on LN-free models. Direct logit attribution now gives the exact direct effect of individual components, while the accuracy of attribution patching does not significantly improve. We also confirm that GPT-2's "confidence neurons" are inactive in the LN-free models. Our work clarifies the role of LN layers in language modeling, showing that GPT-2-class models can function without LN layers. We hope that our LN-free analogues of the GPT-2 family of models will enable more precise interpretability research and improve our understanding of language models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper demonstrates that layer normalization can be removed from pretrained GPT-2 models through fine-tuning, achieving minimal performance degradation. It resides in the 'Post-Training LayerNorm Removal via Fine-Tuning' leaf, which contains only three papers total within a taxonomy of 19 papers across seven leaf nodes. This represents a moderately sparse research direction focused specifically on post-hoc removal strategies, distinguishing it from the broader field of normalization analysis or training-time architectural modifications. The work's positioning suggests it addresses a relatively focused question within the larger normalization debate.

The taxonomy reveals three major branches: removal techniques, analytical studies, and application-specific optimizations. The paper's leaf sits within the removal branch alongside alternative normalization methods like RMSNorm replacements. Neighboring leaves include architectural analysis examining LayerNorm placement and outlier feature interactions, plus domain-specific optimizations for privacy-preserving inference and compression. The scope boundaries indicate this work differs from training-from-scratch approaches and from studies merely analyzing normalization's role without proposing removal. Its sibling papers similarly explore post-training elimination strategies, suggesting a coherent subfield examining whether pretrained models can shed normalization layers.

Among the seven candidates examined across the three contributions, four refutable pairs were identified. The core removal technique was compared against three candidates, two of which appear to provide overlapping prior work; the open-source model suite was compared against two candidates, with one refutable match; and the interpretability validation likewise found one refutable candidate among the two examined. This limited search scope (seven candidates in total, rather than dozens) means the analysis captures immediate neighbors but cannot claim exhaustive coverage. The interpretability contribution appears relatively underexplored in the examined literature, while the removal methodology itself encounters more substantial prior work within this constrained sample.

Based on examination of seven semantically related candidates, the work appears to operate in a moderately explored space with identifiable prior art in post-training removal techniques. The interpretability angle and scaling analysis may offer distinguishing elements, though the limited search scope prevents definitive assessment of their novelty. The taxonomy structure suggests this represents incremental progress within an established research direction rather than opening entirely new territory, though the specific combination of removal, release, and interpretability testing may differentiate it from immediate predecessors.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 4
Refutable Paper: 1

Research Landscape Overview

Core task: Removing layer normalization from transformer language models. The field has organized itself around three main branches. The first, LayerNorm Removal and Replacement Techniques, explores methods for eliminating or substituting normalization layers, ranging from post-training fine-tuning approaches that strip out LayerNorm after initial training to architectural redesigns that replace normalization with alternative stabilization mechanisms or remove it entirely from the outset. The second branch, LayerNorm Analysis and Understanding, investigates why normalization works in the first place: studies examine outlier features, attention entropy dynamics, and the interplay between normalization and model memorization or generalization. The third branch, Application-Specific LayerNorm Optimization, tailors normalization strategies to particular domains or deployment constraints, such as privacy-preserving settings or resource-limited environments. Together, these branches reflect a growing interest in understanding, and potentially simplifying, the transformer architecture by questioning the necessity of a component once considered essential.

Recent work has revealed contrasting strategies and open questions. Some studies demonstrate that transformers can be trained from scratch without any normalization if careful initialization and architectural adjustments are made, as seen in Transformers Without Normalization[2] and Pre-RMSNorm Transformers[17]. Others focus on post-hoc removal: Remove GPT2 LayerNorm[9] and Transformers Without LayerNorm[10] show that fine-tuning can successfully eliminate normalization layers from pretrained models, preserving performance while simplifying inference. The original paper, Transformers Without LayerNorm[0], sits squarely within this post-training removal cluster, sharing the goal of stripping normalization after pretraining. It contrasts with approaches like Minimising Outlier Features[3], which address the root causes of instability that normalization typically mitigates, and with Extra RMSNorm[13], which explores adding rather than removing normalization. The central tension remains whether normalization is a training crutch that can be discarded or a fundamental architectural element whose role we have yet to fully understand.

Claimed Contributions

Layer normalization removal from GPT-2 models via fine-tuning

The authors demonstrate that layer normalization can be completely removed from all GPT-2 model variants (Small, Medium, Large, XL) through a sequential fine-tuning procedure, with only a minimal increase in loss. They develop an optimized protocol that replaces each LN with a linear transformation; the amount of fine-tuning data required scales sublinearly with model parameter count.

2 retrieved papers (Can Refute)
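The core idea of replacing LN with a linear transformation can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name and the choice of a single frozen standard deviation per layer are assumptions. Standard LayerNorm divides by a data-dependent per-token standard deviation, which is nonlinear; freezing that divisor to a constant turns the whole operation into an affine map.

```python
import numpy as np

def frozen_std_layernorm(x, gamma, beta, fixed_std):
    """Hypothetical sketch of an LN replacement: the data-dependent
    per-token standard deviation is frozen to a constant, so the op
    becomes affine (mean-centering is linear, the scale is fixed)."""
    centered = x - x.mean(axis=-1, keepdims=True)  # linear centering
    return centered / fixed_std * gamma + beta     # constant scale + affine params

# With beta = 0 the map is homogeneous: doubling the input doubles the output.
x = np.linspace(-1.0, 1.0, 16).reshape(2, 8)
gamma, beta = np.ones(8), np.zeros(8)
y = frozen_std_layernorm(x, gamma, beta, fixed_std=1.0)
assert np.allclose(frozen_std_layernorm(2 * x, gamma, beta, fixed_std=1.0), 2 * y)
```

Because the resulting map is affine, it composes with the adjacent weight matrices into ordinary linear algebra, which is what makes downstream interpretability analysis exact.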
Suite of open-source LN-free GPT-2 models

The authors provide publicly available LN-free versions of the entire GPT-2 model family on Hugging Face. These models serve as resources for mechanistic interpretability research where layer normalization nonlinearities complicate analysis.

1 retrieved paper
Validation of improved interpretability in LN-free models

The authors demonstrate that removing layer normalization eliminates the approximation error in direct logit attribution, reducing it from 50% to 0% and making the technique mathematically equivalent to computing exact direct effects. They also test other interpretability techniques and find that attribution patching accuracy does not improve, suggesting its limitations arise from other nonlinearities.

1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Layer normalization removal from GPT-2 models via fine-tuning

The authors demonstrate that layer normalization can be completely removed from all GPT-2 model variants (Small, Medium, Large, XL) through a sequential fine-tuning procedure, with only a minimal increase in loss. They develop an optimized protocol that replaces each LN with a linear transformation; the amount of fine-tuning data required scales sublinearly with model parameter count.

Contribution

Suite of open-source LN-free GPT-2 models

The authors provide publicly available LN-free versions of the entire GPT-2 model family on Hugging Face. These models serve as resources for mechanistic interpretability research where layer normalization nonlinearities complicate analysis.

Contribution

Validation of improved interpretability in LN-free models

The authors demonstrate that removing layer normalization eliminates the approximation error in direct logit attribution, reducing it from 50% to 0% and making the technique mathematically equivalent to computing exact direct effects. They also test other interpretability techniques and find that attribution patching accuracy does not improve, suggesting its limitations arise from other nonlinearities.
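The exactness claim can be illustrated with a toy calculation. Without a final LayerNorm, the logits are a linear function of the residual stream, so each component's contribution through the unembedding sums exactly to the total logits. The NumPy sketch below (dimensions and variable names are illustrative, not taken from the paper) shows this decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_vocab, n_components = 8, 10, 3

W_U = rng.normal(size=(d_model, n_vocab))          # unembedding matrix
writes = rng.normal(size=(n_components, d_model))  # each component's residual-stream write

resid = writes.sum(axis=0)   # final residual stream = sum of component writes
logits = resid @ W_U         # no final LN: logits are linear in the residual stream

per_component = writes @ W_U                           # direct logit attribution per component
assert np.allclose(per_component.sum(axis=0), logits)  # decomposition is exact
```

With a final LayerNorm in place, the data-dependent division by the residual stream's standard deviation breaks this additivity, which is why direct logit attribution is only approximate in standard GPT-2.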