Structural Inference: Interpreting Small Language Models with Susceptibilities

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Interpretability · Statistical Physics · Singular Learning Theory
Abstract:

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with local SGLD samples and factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
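To fix ideas, the susceptibility described above can be written as a derivative of a posterior expectation. The following formalization is our reading of the abstract, not an equation quoted from the paper; the notation (observable \phi, per-sample loss \ell, inverse temperature \beta, sample count n) is assumed rather than taken from the source.

\chi \;=\; \frac{d}{d\varepsilon}\,\mathbb{E}_{p_\varepsilon}[\phi(w)]\Big|_{\varepsilon=0} \;=\; -\,n\beta\,\mathrm{Cov}_{p_0}\!\big(\phi(w),\; \mathbb{E}_{q'}[\ell(w,x)] - \mathbb{E}_{q}[\ell(w,x)]\big),

where q_\varepsilon = (1-\varepsilon)\,q + \varepsilon\,q' mixes the base distribution q (e.g., the Pile) with the shifted one q' (e.g., GitHub), and p_\varepsilon is the corresponding tempered posterior over weights w. The covariance form is what makes Monte Carlo estimation with local SGLD samples natural.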

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a linear response framework that treats neural networks as Bayesian statistical mechanical systems, deriving susceptibilities to data distribution perturbations and using these to identify functional modules. Within the taxonomy, it occupies the 'Bayesian and Statistical Mechanical Interpretability Methods' leaf under 'Linear Response Theory and Statistical Physics Frameworks'. Notably, this leaf contains only the original paper itself—no sibling papers are present—indicating this represents a relatively sparse and novel research direction within the broader field of neural network interpretability through perturbation analysis.

The taxonomy reveals that neighboring work exists primarily in two directions: neuronal network models applying linear response to biological systems (two papers in 'Neuronal Network Linear Response Models') and geometric analyses of layer representations without explicit statistical mechanical framing (four papers across geometric subtopics). The original paper bridges these areas by applying physics-inspired linear response theory specifically to artificial neural networks for interpretability, rather than modeling biological neurons or analyzing static geometric properties. This positioning suggests the work synthesizes concepts from adjacent branches—statistical physics formalism and interpretability goals—in a combination not extensively explored by prior literature.

Among the 30 candidates examined across three contributions, none were identified as clearly refuting any claimed novelty. The 'Susceptibility framework' contribution examined 10 candidates with zero refutable matches, as did the 'Structural inference methodology' and 'Per-token attribution scores' contributions. This absence of overlapping prior work across all contributions, combined with the limited search scope, suggests that within the examined literature the specific combination of Bayesian statistical mechanics, susceptibility-based attribution, and modular structure inference appears relatively unexplored. However, the search examined only top-30 semantic matches, leaving open the possibility of relevant work outside this scope.

Given the limited search scale and the paper's placement in an otherwise unpopulated taxonomy leaf, the work appears to occupy a genuinely novel intersection of statistical physics and neural network interpretability. The absence of sibling papers and zero refutable candidates across 30 examined works supports this impression, though the analysis cannot rule out relevant prior work beyond the top-K semantic neighborhood. The framework's distinctiveness lies in its integrated approach—combining Bayesian mechanics, local sampling, and low-rank factorization—rather than any single component in isolation.

Taxonomy

Core-task Taxonomy Papers: 17
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Interpreting neural network components through linear response to data distribution perturbations. This field seeks to understand how network layers and parameters react when input distributions are slightly perturbed, drawing on concepts from statistical physics and linear response theory.

The taxonomy organizes research into several main branches: one rooted in statistical physics frameworks that adapt classical linear response methods to neural systems; another examining geometric and representational properties that emerge in hidden layers; a third focused on training dynamics and how optimization shapes network behavior; a fourth exploring domain-specific applications where physically interpretable models are paramount; and a fifth addressing advanced architectural and theoretical extensions. Representative works span from foundational studies of balanced networks and chaos (Chaos Balanced Networks[3]) to modern analyses of piecewise linear activations (Piecewise Linear ReLU[2]) and separability in hidden representations (Linear Separability Hidden Layers[1]), illustrating how the field bridges classical neuroscience, physics-inspired theory, and contemporary deep learning.

Particularly active lines of work contrast rigorous statistical mechanical approaches, such as those examining neuronal linear response (Linear Response Neuronal[4]) or perturbation expansions (Neural Perturbation Theory[11]), with more applied efforts that embed physical constraints into architectures for domains like granular materials (White-box Granular Materials[5]) or convection modeling (Physically Interpretable Convection[7]). A central tension involves balancing theoretical rigor with practical interpretability: some studies pursue exact characterizations of linear regions and input distributions (Input Distributions Linear Regions[10]), while others prioritize domain-specific fidelity.

The original paper, Structural Inference Susceptibilities[0], sits within the Bayesian and statistical mechanical interpretability branch, emphasizing how susceptibility measures borrowed from physics can reveal structural properties of network components under distributional shifts. This approach aligns closely with works like Linear Response Neuronal[4] in its physics-inspired framing, yet contrasts with more geometry-focused studies (Linear Separability Hidden Layers[1]) by prioritizing statistical inference over purely representational analysis.

Claimed Contributions

Susceptibility framework for neural network interpretability

The authors develop a linear response framework rooted in statistical physics and Bayesian learning theory that treats neural networks as statistical mechanical systems. Susceptibilities measure how infinitesimal perturbations to the data distribution induce first-order changes in the expected behavior of network components, providing a principled link between data structure and model internals (a hedged estimation sketch follows below).

10 retrieved papers
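As a concreteness aid, here is a minimal sketch of how such a susceptibility could be estimated from local SGLD draws, following the covariance identity stated after the abstract. This is not the authors' code; the function and argument names (phi_vals, base_loss, shifted_loss, n, beta) are hypothetical.

import numpy as np

def estimate_susceptibility(phi_vals, base_loss, shifted_loss, n, beta):
    """Linear-response estimate of chi = d E[phi] / d eps at eps = 0.

    phi_vals:     (num_draws,) observable evaluated on each SGLD draw
    base_loss:    (num_draws,) mean loss under the base distribution q
    shifted_loss: (num_draws,) mean loss under the shifted distribution q'
    """
    delta = shifted_loss - base_loss              # loss shift per draw
    phi_c = phi_vals - phi_vals.mean()            # center both series
    delta_c = delta - delta.mean()
    return -n * beta * np.mean(phi_c * delta_c)   # -n*beta*Cov(phi, dL)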
Structural inference methodology

The authors present a methodology that uses response matrices of susceptibilities combined with PCA to identify functional modules and internal structure in neural networks. This approach reveals how models balance expression and suppression of patterns, enabling discovery of circuits like induction heads through data-driven analysis (a short PCA sketch follows below).

10 retrieved papers
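The structural step can be illustrated with a short sketch: stack susceptibilities into a component-by-perturbation response matrix and examine its principal components. The matrix below is a random stand-in, and the shapes and interpretation are our assumptions, not the paper's published setup.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical response matrix: chi[i, j] is the susceptibility of
# component i (e.g., an attention head) to perturbation direction j
# (e.g., shifting the Pile toward GitHub or legal text).
chi = rng.normal(size=(16, 8))

chi_c = chi - chi.mean(axis=0, keepdims=True)       # center columns
U, S, Vt = np.linalg.svd(chi_c, full_matrices=False)
scores = chi_c @ Vt.T                               # PCA scores per component

explained = S**2 / np.sum(S**2)
print("variance explained by leading PCs:", np.round(explained[:3], 3))
# If the response matrix is genuinely low-rank, components with similar
# rows cluster in scores[:, :2], which is how induction-like and
# multigram-like modules could separate.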
Per-token susceptibility attribution scores

The authors show that susceptibilities can be decomposed into per-token contributions with interpretable signs (positive for suppression, negative for expression). These token-level attribution scores can be efficiently estimated using local Stochastic Gradient Langevin Dynamics (SGLD) sampling around network checkpoints (a per-token decomposition sketch follows below).

10 retrieved papers
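Because the loss is a sum over tokens, the covariance estimator splits into one signed term per token. Below is a minimal sketch under that assumption, again with hypothetical names; the paper's sign conventions may differ.

import numpy as np

def per_token_scores(phi_vals, token_losses, n, beta):
    """Split a susceptibility into per-token attribution scores.

    phi_vals:     (num_draws,) observable on each local SGLD draw
    token_losses: (num_draws, num_tokens) each token's loss-shift
                  contribution per draw; rows sum to the total shift
    Returns one signed score per token; summing them recovers chi.
    """
    phi_c = phi_vals - phi_vals.mean()
    tok_c = token_losses - token_losses.mean(axis=0)  # center per token
    return -n * beta * np.mean(phi_c[:, None] * tok_c, axis=0)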

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Susceptibility framework for neural network interpretability. 10 candidate papers compared; none refuted the claimed novelty. (Description as given under Claimed Contributions above.)

Contribution: Structural inference methodology. 10 candidate papers compared; none refuted the claimed novelty. (Description as given under Claimed Contributions above.)

Contribution: Per-token susceptibility attribution scores. 10 candidate papers compared; none refuted the claimed novelty. (Description as given under Claimed Contributions above.)