Block Recurrent Dynamics in Vision Transformers

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Computer Vision, Interpretability, Dynamical Systems
Abstract:

As Vision Transformers (ViTs) become standard backbones across vision, a mechanistic account of their computational phenomenology is now essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k ≪ L distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether this reflects reusable computation, we operationalize our hypothesis in the form of block-recurrent surrogates of pretrained ViTs, which we call Recurrent Approximations to Phase-structured TransfORmers (Raptor). Using small-scale ViTs, we demonstrate that phase-structure metrics correlate with our ability to accurately fit Raptor, and we identify the role of stochastic depth in promoting the recurrent block structure. We then provide an empirical existence proof for BRH in foundation models by showing that we can train a Raptor model to recover 94% of DINOv2 ImageNet-1k linear probe accuracy with only 2 blocks. To provide a mechanistic account of these observations, we leverage our hypothesis to develop a program of Dynamical Interpretability.
We find (i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations; (ii) token-specific dynamics, where the cls token executes sharp late reorientations while patch tokens exhibit strong late-stage coherence reminiscent of a mean-field effect and converge rapidly toward their mean direction; and (iii) a collapse of the update field to low rank in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find that a compact recurrent program emerges along the depth of ViTs, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
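To make the hypothesis concrete, the rewriting of L distinct blocks into k ≪ L weight-tied blocks can be sketched as follows. The sketch is purely illustrative: `make_block` is a toy stand-in for a Transformer block, and `phase_lengths` is an assumed phase partition; neither is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # token dimension (illustrative)

def make_block():
    # Stand-in for a Transformer block: a residual nonlinear map
    # x -> x + tanh(W x), with W playing the role of the block's weights.
    W = 0.1 * rng.standard_normal((D, D))
    return lambda x: x + np.tanh(x @ W.T)

# Original model: L = 12 distinct blocks, L distinct parameter sets.
L = 12
original_blocks = [make_block() for _ in range(L)]

# Block-recurrent surrogate: k = 2 distinct blocks, each unrolled over a
# contiguous phase, so the total depth (block applications) is still L.
k = 2
phase_lengths = [6, 6]                 # assumed partition; sums to L
shared_blocks = [make_block() for _ in range(k)]

def forward_original(x):
    for blk in original_blocks:
        x = blk(x)
    return x

def forward_recurrent(x):
    for blk, reps in zip(shared_blocks, phase_lengths):
        for _ in range(reps):          # weight tying: same block reused
            x = blk(x)
    return x

x = rng.standard_normal(D)
y1, y2 = forward_original(x), forward_recurrent(x)
# Both perform L block applications, but the surrogate stores only k
# parameter sets; BRH claims the two computations can approximately agree.
```

The point of the sketch is the parameter accounting: depth stays L, while the number of distinct blocks drops from L to k.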

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Block-Recurrent Hypothesis (BRH), proposing that trained Vision Transformers exhibit a phase-structured depth where computation across L blocks can be rewritten using k≪L distinct blocks applied recurrently. It sits within the 'Block-Recurrent and Phase-Structured Transformers' leaf, which contains only two papers total. This represents a sparse research direction within the broader taxonomy of 47 papers across 17 leaf nodes, suggesting the paper addresses a relatively unexplored aspect of vision transformer interpretability and architectural understanding.

The taxonomy reveals neighboring work in 'Video Sequence Modeling with Recurrent Transformers' (4 papers) and 'Recurrent Modules for Image Restoration' (7 papers), which apply recurrent mechanisms to specific tasks rather than analyzing inherent recurrent structure in pretrained models. The paper diverges from these application-focused directions by providing a mechanistic interpretation framework. Nearby branches in 'Spatial-Temporal Factorization' and 'Transformer Architectural Innovations' address complementary concerns about efficiency and attention mechanisms, but do not examine the dynamical flow interpretation that this work emphasizes through representational similarity analysis and phase detection.

Among the 30 candidates examined across the three contributions, none were identified as clearly refuting the work. Each contribution (the Block-Recurrent Hypothesis, the Raptor surrogate method, and the dynamical interpretability framework) was compared against 10 candidates, with none found refutable. This suggests limited direct prior work on block-recurrent depth structure analysis in pretrained ViTs within the search scope. The paper's focus on reusable computation phases and the role of stochastic depth in promoting recurrent structure appears distinct from existing recurrent transformer applications, though the limited search scale means potentially relevant work in mechanistic interpretability or neural network compression may exist beyond these 30 candidates.

Based on the top-30 semantic matches and taxonomy structure, the work appears to occupy a novel position at the intersection of transformer interpretability and recurrent dynamics. The sparse leaf population and absence of refuting candidates within the examined scope suggest substantive novelty, though the analysis does not cover exhaustive mechanistic interpretability literature or broader neural architecture search domains where related compression or phase-detection ideas might exist.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: block recurrent dynamics in vision transformers.

The field has organized itself around several complementary directions. One major branch explores recurrent mechanisms in vision transformers, investigating how iterative refinement and temporal dependencies can be integrated into transformer architectures through block-recurrent structures, phase-structured designs, and hybrid recurrent-attention modules. A second branch focuses on spatial-temporal factorization and multi-scale attention, addressing how to efficiently decompose complex visual inputs across scales and time. Additional branches examine specialized recognition and detection tasks, architectural innovations for efficiency, and domain-specific applications ranging from medical imaging to robotics. Works such as TRecViT[2] and Recurrent Video Restoration[3] illustrate how recurrent components can enhance temporal modeling, while others like RRT-MVS[5] and Recurrent Homography Estimation[16] apply these ideas to geometric vision problems.

Particularly active lines of work reveal trade-offs between computational efficiency and expressive power. Many studies adopt block-recurrent or phase-structured designs to balance the global receptive field of transformers with the parameter efficiency of recurrence, as seen in video processing tasks like Recurrent Video Restoration[3] and video anomaly detection. The original paper, Block Recurrent Dynamics[0], sits within this cluster of block-recurrent and phase-structured transformers, emphasizing structured iterative processing. Compared to nearby works such as Recurrent Visual Reasoning[24], which focuses on reasoning tasks, Block Recurrent Dynamics[0] appears to prioritize the architectural mechanism itself: how recurrent blocks can be systematically integrated into vision transformers. This contrasts with application-driven approaches like Block-recurrent Thermal Detection[25], which adapts recurrent dynamics to a specific sensing modality.

Open questions remain about how best to initialize recurrent states, manage long-range dependencies, and scale these architectures to diverse visual domains.

Claimed Contributions

Block-Recurrent Hypothesis (BRH) and empirical validation

The authors formalize the Block-Recurrent Hypothesis, which states that Vision Transformers can be rewritten using a small number of parameter-tied blocks applied recurrently. They provide empirical evidence across diverse ViTs showing contiguous phase structure in layer-layer similarity matrices and demonstrate that stochastic depth promotes this recurrent block structure.

10 retrieved papers
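The phase-structure evidence rests on layer-layer similarity matrices. A minimal sketch of how contiguous phases appear in such a matrix is given below; linear CKA is used as one common representational similarity metric, and the `linear_cka` helper and toy activations are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA between two representation matrices of shape
    # (n_samples, dim); 1.0 for identical (centered) representations.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
n, d = 64, 8

# Toy "layer activations": layers 0-2 are noisy copies of one signal,
# layers 3-5 of another, mimicking two contiguous phases along depth.
base_a = rng.standard_normal((n, d))
base_b = rng.standard_normal((n, d))
layers = ([base_a + 0.05 * rng.standard_normal((n, d)) for _ in range(3)]
          + [base_b + 0.05 * rng.standard_normal((n, d)) for _ in range(3)])

# Layer-layer similarity matrix: high within-phase blocks on the
# diagonal, low similarity across phases.
S = np.array([[linear_cka(X, Y) for Y in layers] for X in layers])
```

In a trained ViT the same computation would use real per-layer activations over a batch of images; a contiguous block-diagonal pattern in S is what the report refers to as phase structure.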
Raptor: Recurrent Approximations to Phase-structured TransfORmers

The authors develop Raptor, a method to train weight-tied block-recurrent approximations of pretrained ViTs that reconstruct complete internal representation trajectories. They demonstrate that a Raptor model can recover 94% of DINOv2 ImageNet-1k linear probe accuracy using only 2 recurrent blocks, providing constructive verification of functional reuse.

10 retrieved papers
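Raptor is described as reconstructing complete internal representation trajectories rather than only final outputs. One way to express such an objective is sketched below; the `trajectory_loss` definition is an assumption for illustration, not the paper's actual training loss.

```python
import numpy as np

def trajectory_loss(teacher_states, student_states):
    # Mean squared error averaged over the whole depth trajectory,
    # not just the final layer. Both arguments are lists of
    # (n_tokens, dim) hidden states, one entry per block application.
    assert len(teacher_states) == len(student_states)
    return float(np.mean([np.mean((t - s) ** 2)
                          for t, s in zip(teacher_states, student_states)]))

rng = np.random.default_rng(0)
L, n, d = 12, 4, 8
teacher = [rng.standard_normal((n, d)) for _ in range(L)]

# A surrogate matching the teacher at every depth scores zero ...
perfect = [t.copy() for t in teacher]
# ... while matching only the final state still pays for the interior,
# which is what distinguishes trajectory reconstruction from ordinary
# output distillation.
final_only = ([rng.standard_normal((n, d)) for _ in range(L - 1)]
              + [teacher[-1].copy()])

loss_perfect = trajectory_loss(teacher, perfect)
loss_final_only = trajectory_loss(teacher, final_only)
```

Penalizing the full trajectory is the stronger requirement: it forces the weight-tied surrogate to reproduce the teacher's intermediate computation, which is what makes the 94% linear-probe recovery a constructive check of functional reuse.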
Dynamical Interpretability framework for Vision Transformers

The authors introduce a framework for analyzing ViT depth as an iterated dynamical system. Their analysis reveals directional convergence into class-dependent angular basins, token-specific dynamics with specialized behaviors for cls and patch tokens, and collapse of update fields to low-rank subspaces consistent with convergence to low-dimensional attractors.

10 retrieved papers
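The claimed collapse of the update field to low rank can be probed by an effective-rank estimate of per-layer updates. The participation-ratio estimator below is one standard choice, not necessarily the paper's; the toy "early" and "late" updates are illustrative assumptions.

```python
import numpy as np

def effective_rank(M):
    # Participation-ratio effective rank from singular values:
    # (sum s_i^2)^2 / sum s_i^4. Equals r when M has r equal
    # singular values and the rest zero.
    s = np.linalg.svd(M, compute_uv=False)
    p = s ** 2
    return float(p.sum() ** 2 / (p ** 2).sum())

rng = np.random.default_rng(0)
n, d = 128, 32  # n token updates of dimension d (illustrative sizes)

# Early-depth-like updates: tokens move in unstructured, full-rank
# random directions.
early_updates = rng.standard_normal((n, d))

# Late-depth-like updates: every token's update lies in a rank-2
# subspace, mimicking collapse toward a low-dimensional attractor.
basis = rng.standard_normal((2, d))
late_updates = rng.standard_normal((n, 2)) @ basis

r_early = effective_rank(early_updates)
r_late = effective_rank(late_updates)
# r_late stays near 2 while r_early is close to the ambient dimension;
# a depth profile of this quantity is one way to read off the collapse.
```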

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Block-Recurrent Hypothesis (BRH) and empirical validation

The authors formalize the Block-Recurrent Hypothesis, which states that Vision Transformers can be rewritten using a small number of parameter-tied blocks applied recurrently. They provide empirical evidence across diverse ViTs showing contiguous phase structure in layer-layer similarity matrices and demonstrate that stochastic depth promotes this recurrent block structure.

Contribution

Raptor: Recurrent Approximations to Phase-structured TransfORmers

The authors develop Raptor, a method to train weight-tied block-recurrent approximations of pretrained ViTs that reconstruct complete internal representation trajectories. They demonstrate that a Raptor model can recover 94% of DINOv2 ImageNet-1k linear probe accuracy using only 2 recurrent blocks, providing constructive verification of functional reuse.

Contribution

Dynamical Interpretability framework for Vision Transformers

The authors introduce a framework for analyzing ViT depth as an iterated dynamical system. Their analysis reveals directional convergence into class-dependent angular basins, token-specific dynamics with specialized behaviors for cls and patch tokens, and collapse of update fields to low-rank subspaces consistent with convergence to low-dimensional attractors.