Structural Inference: Interpreting Small Language Models with Susceptibilities

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Interpretability · Statistical Physics · Singular Learning Theory
Abstract:

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
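To fix ideas, the susceptibility described above can be written as a derivative of a posterior expectation. The following formalization is our reading of the abstract, not an equation quoted from the paper; the notation (observable \phi, per-sample loss \ell, inverse temperature \beta, sample count n) is assumed rather than taken from the source.

\chi \;=\; \frac{d}{d\varepsilon}\,\mathbb{E}_{p_\varepsilon}[\phi(w)]\Big|_{\varepsilon=0} \;=\; -\,n\beta\,\mathrm{Cov}_{p_0}\!\big(\phi(w),\; \mathbb{E}_{q'}[\ell(w,x)] - \mathbb{E}_{q}[\ell(w,x)]\big),

where q_\varepsilon = (1-\varepsilon)\,q + \varepsilon\,q' mixes the base distribution q (e.g., the Pile) with the shifted one q' (e.g., GitHub), and p_\varepsilon is the corresponding tempered posterior over weights w. The covariance form is what makes Monte Carlo estimation with local SGLD samples natural.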

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a linear response framework that treats neural networks as Bayesian statistical mechanical systems, deriving susceptibilities to data distribution perturbations and using these to identify functional modules. Within the taxonomy, it occupies the 'Bayesian and Statistical Mechanical Interpretability Methods' leaf under 'Linear Response Theory and Statistical Physics Frameworks'. Notably, this leaf contains only the original paper itself—no sibling papers are present—indicating this represents a relatively sparse and novel research direction within the broader field of neural network interpretability through perturbation analysis.

The taxonomy reveals that neighboring work exists primarily in two directions: neuronal network models applying linear response to biological systems (two papers in 'Neuronal Network Linear Response Models') and geometric analyses of layer representations without explicit statistical mechanical framing (four papers across geometric subtopics). The original paper bridges these areas by applying physics-inspired linear response theory specifically to artificial neural networks for interpretability, rather than modeling biological neurons or analyzing static geometric properties. This positioning suggests the work synthesizes concepts from adjacent branches—statistical physics formalism and interpretability goals—in a combination not extensively explored by prior literature.

Among the 30 candidates examined across three contributions, none were identified as clearly refuting any claimed novelty. The 'Susceptibility framework' contribution examined 10 candidates with zero refutable matches, as did the 'Structural inference methodology' and 'Per-token attribution scores' contributions. This absence of overlapping prior work across all contributions, combined with the limited search scope, suggests that within the examined literature the specific combination of Bayesian statistical mechanics, susceptibility-based attribution, and modular structure inference appears relatively unexplored. However, the search examined only top-30 semantic matches, leaving open the possibility of relevant work outside this scope.

Given the limited search scale and the paper's placement in an otherwise unpopulated taxonomy leaf, the work appears to occupy a genuinely novel intersection of statistical physics and neural network interpretability. The absence of sibling papers and zero refutable candidates across 30 examined works supports this impression, though the analysis cannot rule out relevant prior work beyond the top-K semantic neighborhood. The framework's distinctiveness lies in its integrated approach—combining Bayesian mechanics, local sampling, and low-rank factorization—rather than any single component in isolation.

Taxonomy

Core-task Taxonomy Papers: 17
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Interpreting neural network components through linear response to data distribution perturbations. This field seeks to understand how network layers and parameters react when input distributions are slightly perturbed, drawing on concepts from statistical physics and linear response theory.

The taxonomy organizes research into several main branches: one rooted in statistical physics frameworks that adapt classical linear response methods to neural systems; another examining geometric and representational properties that emerge in hidden layers; a third focused on training dynamics and how optimization shapes network behavior; a fourth exploring domain-specific applications where physically interpretable models are paramount; and a fifth addressing advanced architectural and theoretical extensions. Representative works span from foundational studies of balanced networks and chaos (Chaos Balanced Networks[3]) to modern analyses of piecewise linear activations (Piecewise Linear ReLU[2]) and separability in hidden representations (Linear Separability Hidden Layers[1]), illustrating how the field bridges classical neuroscience, physics-inspired theory, and contemporary deep learning.

Particularly active lines of work contrast rigorous statistical mechanical approaches, such as those examining neuronal linear response (Linear Response Neuronal[4]) or perturbation expansions (Neural Perturbation Theory[11]), with more applied efforts that embed physical constraints into architectures for domains like granular materials (White-box Granular Materials[5]) or convection modeling (Physically Interpretable Convection[7]). A central tension involves balancing theoretical rigor with practical interpretability: some studies pursue exact characterizations of linear regions and input distributions (Input Distributions Linear Regions[10]), while others prioritize domain-specific fidelity.

The original paper, Structural Inference Susceptibilities[0], sits within the Bayesian and statistical mechanical interpretability branch, emphasizing how susceptibility measures borrowed from physics can reveal structural properties of network components under distributional shifts. This approach aligns closely with works like Linear Response Neuronal[4] in its physics-inspired framing, yet contrasts with more geometry-focused studies (Linear Separability Hidden Layers[1]) by prioritizing statistical inference over purely representational analysis.

Claimed Contributions

Susceptibility framework for neural network interpretability

The authors develop a linear response framework rooted in statistical physics and Bayesian learning theory that treats neural networks as statistical mechanical systems. Susceptibilities measure how infinitesimal perturbations to the data distribution induce first-order changes in the expected behavior of network components, providing a principled link between data structure and model internals (a hedged estimation sketch follows below).

10 retrieved papers
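As a concreteness aid, here is a minimal sketch of how such a susceptibility could be estimated from local SGLD draws, following the covariance identity stated after the abstract. This is not the authors' code; the function and argument names (phi_vals, base_loss, shifted_loss, n, beta) are hypothetical.

import numpy as np

def estimate_susceptibility(phi_vals, base_loss, shifted_loss, n, beta):
    """Linear-response estimate of chi = d E[phi] / d eps at eps = 0.

    phi_vals:     (num_draws,) observable evaluated on each SGLD draw
    base_loss:    (num_draws,) mean loss under the base distribution q
    shifted_loss: (num_draws,) mean loss under the shifted distribution q'
    """
    delta = shifted_loss - base_loss              # loss shift per draw
    phi_c = phi_vals - phi_vals.mean()            # center both series
    delta_c = delta - delta.mean()
    return -n * beta * np.mean(phi_c * delta_c)   # -n*beta*Cov(phi, dL)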
Structural inference methodology

The authors present a methodology that uses response matrices of susceptibilities combined with PCA to identify functional modules and internal structure in neural networks. This approach reveals how models balance expression and suppression of patterns, enabling discovery of circuits like induction heads through data-driven analysis (a short PCA sketch follows below).

10 retrieved papers
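The structural step can be illustrated with a short sketch: stack susceptibilities into a component-by-perturbation response matrix and examine its principal components. The matrix below is a random stand-in, and the shapes and interpretation are our assumptions, not the paper's published setup.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical response matrix: chi[i, j] is the susceptibility of
# component i (e.g., an attention head) to perturbation direction j
# (e.g., shifting the Pile toward GitHub or legal text).
chi = rng.normal(size=(16, 8))

chi_c = chi - chi.mean(axis=0, keepdims=True)       # center columns
U, S, Vt = np.linalg.svd(chi_c, full_matrices=False)
scores = chi_c @ Vt.T                               # PCA scores per component

explained = S**2 / np.sum(S**2)
print("variance explained by leading PCs:", np.round(explained[:3], 3))
# If the response matrix is genuinely low-rank, components with similar
# rows cluster in scores[:, :2], which is how induction-like and
# multigram-like modules could separate.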
Per-token susceptibility attribution scores

The authors show that susceptibilities can be decomposed into per-token contributions with interpretable signs (positive for suppression, negative for expression). These token-level attribution scores can be efficiently estimated using local Stochastic Gradient Langevin Dynamics (SGLD) sampling around network checkpoints (a per-token decomposition sketch follows below).

10 retrieved papers
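Because the loss is a sum over tokens, the covariance estimator splits into one signed term per token. Below is a minimal sketch under that assumption, again with hypothetical names; the paper's sign conventions may differ.

import numpy as np

def per_token_scores(phi_vals, token_losses, n, beta):
    """Split a susceptibility into per-token attribution scores.

    phi_vals:     (num_draws,) observable on each local SGLD draw
    token_losses: (num_draws, num_tokens) each token's loss-shift
                  contribution per draw; rows sum to the total shift
    Returns one signed score per token; summing them recovers chi.
    """
    phi_c = phi_vals - phi_vals.mean()
    tok_c = token_losses - token_losses.mean(axis=0)  # center per token
    return -n * beta * np.mean(phi_c[:, None] * tok_c, axis=0)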

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Susceptibility framework for neural network interpretability. 10 candidate papers compared; none refuted the claimed novelty. (Description as given under Claimed Contributions above.)

Contribution: Structural inference methodology. 10 candidate papers compared; none refuted the claimed novelty. (Description as given under Claimed Contributions above.)

Contribution: Per-token susceptibility attribution scores. 10 candidate papers compared; none refuted the claimed novelty. (Description as given under Claimed Contributions above.)