Priors in time: Missing inductive biases for language model interpretability

ICLR 2026 Conference Submission
Anonymous Authors
Top-Down Interpretability, Sparse Autoencoders, Temporal Structure, Stationarity
Abstract:

A central aim of interpretability tools applied to language models is to recover meaningful concepts from model activations. Existing feature extraction methods focus on single activations regardless of the context, implicitly assuming independence (and therefore stationarity). This leaves open whether they can capture the rich temporal and context-sensitive structure in the activations of language models (LMs). Adopting a Bayesian view, we demonstrate that standard Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time. We then show that LM representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. This mismatch casts doubt on existing SAEs' ability to reflect temporal structures of interest in the data. We introduce a novel SAE architecture---Temporal SAE---with a temporal inductive bias that decomposes representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which carries novel information that cannot be inferred from the context. Experiments on LLM activations with Temporal SAE demonstrate its ability to correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, whereas existing SAEs show significant pitfalls on all of these tasks. Our results underscore the need for inductive biases that match the data in designing robust interpretability tools.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Temporal SAE, an architecture that decomposes language model activations into predictable and residual components to capture temporal structure. It sits in the 'Temporal Inductive Biases and Interpretability' leaf, which contains five papers total (including this one). This leaf focuses on understanding temporal structure in model activations through interpretability methods, distinguishing it from neighboring leaves that address architectural modifications or learned representations without interpretability emphasis. The leaf appears moderately populated within the broader taxonomy of fifty papers, suggesting a growing but not yet saturated research direction.

The taxonomy tree reveals that this work is nested under 'Temporal Dynamics in Language Model Architectures', which also includes 'Temporal Representation Learning in Pretrained Models' (five papers on how models encode temporal concepts) and 'Temporal Architecture Design and Mechanisms' (four papers on explicit temporal components). The sibling papers in the same leaf examine related phenomena: intra-token oscillations, brain-inspired temporal processing, and temporal structure in neural representations. The scope note clarifies that this leaf excludes general representation learning without interpretability focus, positioning the work at the intersection of mechanistic interpretability and temporal modeling rather than pure architectural innovation or downstream task performance.

Among twenty-six candidates examined across three contributions, none were found to clearly refute the paper's claims. The first contribution (SAEs impose independence priors) examined six candidates with zero refutations. The second (empirical characterization of temporal structure) examined ten candidates, also with zero refutations. The third (Temporal SAE architecture) examined ten candidates with zero refutations. This suggests that within the limited search scope, the specific combination of Bayesian analysis of SAE priors, empirical temporal dynamics characterization, and the proposed architecture appears relatively novel, though the search scale (twenty-six papers) leaves open the possibility of relevant prior work beyond top-K semantic matches.

Based on the limited literature search, the work appears to occupy a distinct position within temporal interpretability research. The taxonomy structure indicates that while temporal modeling in language models is an active area, the specific focus on SAE priors and temporal inductive biases is less crowded. However, the analysis covers only top-K semantic matches and does not exhaustively survey all interpretability or temporal modeling literature, so conclusions about absolute novelty remain tentative.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
26 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: Temporal structure modeling in language model representations. The field explores how language models capture, represent, and reason about temporal information across diverse contexts. The taxonomy organizes research into several major branches:

Temporal Dynamics in Language Model Architectures examines how model designs encode temporal patterns and inductive biases, including interpretability studies of internal temporal mechanisms. Temporal Reasoning and Knowledge Tasks focuses on benchmarks and methods for temporal inference, event ordering, and time-aware knowledge integration. Application Domains with Temporal Modeling spans practical deployments in video understanding (TimeChat[1], Valley[4]), urban analytics (UrbanGPT[2]), biomedical forecasting (Temporal Biomedical[9]), and recommendation systems (TiLLM-Rec[8]). Temporal Structure in Specialized Contexts addresses domain-specific challenges like remote sensing (Remote Sensing Survey[7]) and traffic prediction (Traffic Prediction[20]). Foundation Models and Temporal Adaptation investigates how large pre-trained models can be adapted for time series and temporal tasks (Time-LLM[28], Time Series Foundation[15]). Peripheral and Cross-Domain Studies includes work on generalization gaps (Temporal Generalization Gap[10]) and cross-modal temporal alignment.

A particularly active line of work centers on understanding the internal temporal mechanisms of language models, contrasting architectural innovations with interpretability-driven analyses. Priors in Time[0] sits within the Temporal Inductive Biases and Interpretability cluster, examining how temporal priors emerge in model representations—a perspective that complements nearby studies like Intra-token Oscillation[17], which investigates fine-grained temporal dynamics within token embeddings, and Temporal Structure Brain[49], which draws parallels between neural language models and biological temporal processing.
This interpretability-focused approach differs from works emphasizing explicit temporal reasoning benchmarks (Temporal Reasoning[5], Understanding Time[12]) or application-driven adaptations. Key open questions include whether temporal structure arises implicitly from training data or requires explicit architectural biases, how to bridge the gap between internal representations and downstream temporal reasoning performance, and whether insights from cognitive neuroscience (Brain LLM Similarity[24]) can inform more temporally-aware model designs.

Claimed Contributions

Demonstration that standard SAEs impose independence priors across time

The authors show that existing SAE training objectives can be interpreted from a Bayesian perspective, revealing that SAEs implicitly assume concepts are uncorrelated across time and that sparsity remains time-invariant. This independence prior conflicts with the rich temporal structure observed in language model activations.

6 retrieved papers

Empirical characterization of temporal structure in LM activations

The authors empirically demonstrate that language model representations possess increasing intrinsic dimensionality over sequence positions and exhibit strong non-stationarity, with up to 80% of representation variance at a given time being predictable from past context. These findings directly contradict the independence assumptions of standard SAEs.

10 retrieved papers

Temporal SAE architecture with explicit temporal inductive biases

The authors propose Temporal SAE, a new interpretability method that explicitly models temporal structure by decomposing activations into predictable (slow-moving, context-dependent) and novel (fast-changing, residual) components. This architecture allows correlations between concepts across time, unlike standard SAEs, and demonstrates improved ability to capture event boundaries and syntactic structure in language.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Demonstration that standard SAEs impose independence priors across time

The authors show that existing SAE training objectives can be interpreted from a Bayesian perspective, revealing that SAEs implicitly assume concepts are uncorrelated across time and that sparsity remains time-invariant. This independence prior conflicts with the rich temporal structure observed in language model activations.
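The Bayesian reading described above can be sketched numerically: a standard SAE loss (squared reconstruction error plus an L1 penalty, summed over tokens) coincides, up to constants, with a negative log joint whose likelihood is Gaussian and whose Laplace prior factorizes over both features and time steps, with no cross-time terms. The toy parameters and shapes below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 8, 16, 32            # tokens, activation dim, dictionary size
lam = 0.5                      # L1 coefficient <-> Laplace scale 1/lam

# Illustrative SAE decoder, activations, and per-token sparse codes.
W_dec = rng.normal(size=(k, d))
x = rng.normal(size=(T, d))            # LM activations, one row per token
z = np.abs(rng.normal(size=(T, k)))    # sparse codes

# Standard SAE objective: a sum over tokens, with no cross-time coupling.
sae_loss = ((x - z @ W_dec) ** 2).sum() + lam * np.abs(z).sum()

# The same quantity read as a negative log joint (up to constants):
# Gaussian reconstruction likelihood plus a Laplace prior that
# factorizes over BOTH features and time steps.
neg_log_lik = ((x - z @ W_dec) ** 2).sum()
neg_log_prior = lam * sum(np.abs(z[t]).sum() for t in range(T))  # independent per t
assert np.isclose(sae_loss, neg_log_lik + neg_log_prior)
```

Because the prior term is a plain sum over t, the implied generative model cannot express any correlation between codes at different positions, which is the independence assumption the contribution identifies.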

Contribution

Empirical characterization of temporal structure in LM activations

The authors empirically demonstrate that language model representations possess increasing intrinsic dimensionality over sequence positions and exhibit strong non-stationarity, with up to 80% of representation variance at a given time being predictable from past context. These findings directly contradict the independence assumptions of standard SAEs.
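The predictability measurement underlying this claim can be illustrated on synthetic data: regress the representation at step t on the representation at step t-1 and report the fraction of variance explained. The AR(1)-style toy activations below stand in for real LM activations; the specific simulation and the least-squares predictor are assumptions for illustration, not the paper's protocol:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 2000, 8

# Toy activations with a strong autoregressive component plus noise,
# standing in for an LM's residual-stream activations over a sequence.
A = 0.9 * np.linalg.qr(rng.normal(size=(d, d)))[0]  # scaled orthogonal transition
x = np.zeros((T, d))
for t in range(1, T):
    x[t] = x[t - 1] @ A + 0.3 * rng.normal(size=d)

# Fraction of variance at step t that is linearly predictable from t-1.
X_past, X_now = x[:-1], x[1:]
W, *_ = np.linalg.lstsq(X_past, X_now, rcond=None)
resid = X_now - X_past @ W
r2 = 1 - resid.var() / X_now.var()
print(f"predictable variance fraction: {r2:.2f}")
```

On this toy process the predictable fraction lands well above one half, mirroring in miniature the paper's finding that a large share of representation variance (up to 80% in their experiments) is predictable from past context.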

Contribution

Temporal SAE architecture with explicit temporal inductive biases

The authors propose Temporal SAE, a new interpretability method that explicitly models temporal structure by decomposing activations into predictable (slow-moving, context-dependent) and novel (fast-changing, residual) components. This architecture allows correlations between concepts across time, unlike standard SAEs, and demonstrates improved ability to capture event boundaries and syntactic structure in language.
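One way to picture the proposed decomposition is a single step that predicts the current activation from context and sparsely encodes only the residual. The linear predictor P, dictionary W, and top-k sparsity rule below are hypothetical stand-ins chosen for brevity, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, k = 6, 16, 32

# Hypothetical components: a context predictor P and a sparse
# dictionary W for the novel part; shapes and names are illustrative.
P = rng.normal(size=(d, d)) / np.sqrt(d)   # predicts x_t from x_{t-1}
W = rng.normal(size=(k, d)) / np.sqrt(d)   # decoder for novelty codes

def temporal_sae_step(x_prev, x_t, topk=4):
    """Split x_t into a context-predictable part and a sparse novelty code."""
    pred = x_prev @ P                      # slow, predictable component
    resid = x_t - pred                     # fast, novel component
    a = resid @ W.T                        # encode the residual
    idx = np.argsort(-np.abs(a))[:topk]    # keep only the top-k codes
    z = np.zeros(k)
    z[idx] = a[idx]
    x_hat = pred + z @ W                   # reconstruction = predicted + novel
    return pred, z, x_hat

x = rng.normal(size=(T, d))
pred, z, x_hat = temporal_sae_step(x[0], x[1])
assert np.count_nonzero(z) == 4
```

Because the sparsity penalty applies only to the novelty code while the predictable component flows through from context, codes at different positions can remain correlated, which is the inductive bias standard SAEs lack.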