Block Recurrent Dynamics in Vision Transformers

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Computer Vision, Interpretability, Dynamical Systems
Abstract:

As Vision Transformers (ViTs) become standard backbones across vision, a mechanistic account of their computational phenomenology is now essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original L blocks can be accurately rewritten using only k ≪ L distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether this reflects reusable computation, we operationalize our hypothesis in the form of block-recurrent surrogates of pretrained ViTs, which we call Recurrent Approximations to Phase-structured TransfORmers (Raptor). Using small-scale ViTs, we demonstrate that phase-structure metrics correlate with our ability to accurately fit Raptor, and we identify the role of stochastic depth in promoting the recurrent block structure. We then provide an empirical existence proof for BRH in foundation models by showing that we can train a Raptor model to recover 94% of DINOv2 ImageNet-1k linear probe accuracy with only 2 blocks. To provide a mechanistic account of these observations, we leverage our hypothesis to develop a program of Dynamical Interpretability.
We find (i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations; (ii) token-specific dynamics, where the cls token executes sharp late reorientations while patch tokens exhibit strong late-stage coherence reminiscent of a mean-field effect and converge rapidly toward their mean direction; and (iii) a collapse of the update field to low rank in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find that a compact recurrent program emerges along the depth of ViTs, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.
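To make the hypothesis concrete, the rewriting of L distinct blocks into k ≪ L weight-tied blocks can be sketched as follows. The sketch is purely illustrative: `make_block` is a toy stand-in for a Transformer block, and `phase_lengths` is an assumed phase partition; neither is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # token dimension (illustrative)

def make_block():
    # Stand-in for a Transformer block: a residual nonlinear map
    # x -> x + tanh(W x), with W playing the role of the block's weights.
    W = 0.1 * rng.standard_normal((D, D))
    return lambda x: x + np.tanh(x @ W.T)

# Original model: L = 12 distinct blocks, L distinct parameter sets.
L = 12
original_blocks = [make_block() for _ in range(L)]

# Block-recurrent surrogate: k = 2 distinct blocks, each unrolled over a
# contiguous phase, so the total depth (block applications) is still L.
k = 2
phase_lengths = [6, 6]                 # assumed partition; sums to L
shared_blocks = [make_block() for _ in range(k)]

def forward_original(x):
    for blk in original_blocks:
        x = blk(x)
    return x

def forward_recurrent(x):
    for blk, reps in zip(shared_blocks, phase_lengths):
        for _ in range(reps):          # weight tying: same block reused
            x = blk(x)
    return x

x = rng.standard_normal(D)
y1, y2 = forward_original(x), forward_recurrent(x)
# Both perform L block applications, but the surrogate stores only k
# parameter sets; BRH claims the two computations can approximately agree.
```

The point of the sketch is the parameter accounting: depth stays L, while the number of distinct blocks drops from L to k.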

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Block-Recurrent Hypothesis (BRH), proposing that trained Vision Transformers exhibit a phase-structured depth where computation across L blocks can be rewritten using k≪L distinct blocks applied recurrently. It sits within the 'Block-Recurrent and Phase-Structured Transformers' leaf, which contains only two papers total. This represents a sparse research direction within the broader taxonomy of 47 papers across 17 leaf nodes, suggesting the paper addresses a relatively unexplored aspect of vision transformer interpretability and architectural understanding.

The taxonomy reveals neighboring work in 'Video Sequence Modeling with Recurrent Transformers' (4 papers) and 'Recurrent Modules for Image Restoration' (7 papers), which apply recurrent mechanisms to specific tasks rather than analyzing inherent recurrent structure in pretrained models. The paper diverges from these application-focused directions by providing a mechanistic interpretation framework. Nearby branches in 'Spatial-Temporal Factorization' and 'Transformer Architectural Innovations' address complementary concerns about efficiency and attention mechanisms, but do not examine the dynamical flow interpretation that this work emphasizes through representational similarity analysis and phase detection.

Among the 30 candidates examined across the three contributions, none were identified as clearly refuting the work. Each contribution (the Block-Recurrent Hypothesis, the Raptor surrogate method, and the dynamical interpretability framework) was compared against 10 candidates, with none found refutable. This suggests limited direct prior work on block-recurrent depth structure analysis in pretrained ViTs within the search scope. The paper's focus on reusable computation phases and the role of stochastic depth in promoting recurrent structure appears distinct from existing recurrent transformer applications, though the limited search scale means potentially relevant work in mechanistic interpretability or neural network compression may exist beyond these 30 candidates.

Based on the top-30 semantic matches and taxonomy structure, the work appears to occupy a novel position at the intersection of transformer interpretability and recurrent dynamics. The sparse leaf population and absence of refuting candidates within the examined scope suggest substantive novelty, though the analysis does not cover exhaustive mechanistic interpretability literature or broader neural architecture search domains where related compression or phase-detection ideas might exist.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: block recurrent dynamics in vision transformers.

The field has organized itself around several complementary directions. One major branch explores recurrent mechanisms in vision transformers, investigating how iterative refinement and temporal dependencies can be integrated into transformer architectures through block-recurrent structures, phase-structured designs, and hybrid recurrent-attention modules. A second branch focuses on spatial-temporal factorization and multi-scale attention, addressing how to efficiently decompose complex visual inputs across scales and time. Additional branches examine specialized recognition and detection tasks, architectural innovations for efficiency, and domain-specific applications ranging from medical imaging to robotics. Works such as TRecViT[2] and Recurrent Video Restoration[3] illustrate how recurrent components can enhance temporal modeling, while others like RRT-MVS[5] and Recurrent Homography Estimation[16] apply these ideas to geometric vision problems.

Particularly active lines of work reveal trade-offs between computational efficiency and expressive power. Many studies adopt block-recurrent or phase-structured designs to balance the global receptive field of transformers with the parameter efficiency of recurrence, as seen in video processing tasks like Recurrent Video Restoration[3] and video anomaly detection. The original paper, Block Recurrent Dynamics[0], sits within this cluster of block-recurrent and phase-structured transformers, emphasizing structured iterative processing. Compared to nearby works such as Recurrent Visual Reasoning[24], which focuses on reasoning tasks, Block Recurrent Dynamics[0] appears to prioritize the architectural mechanism itself: how recurrent blocks can be systematically integrated into vision transformers. This contrasts with application-driven approaches like Block-recurrent Thermal Detection[25], which adapts recurrent dynamics to a specific sensing modality.

Open questions remain about how best to initialize recurrent states, manage long-range dependencies, and scale these architectures to diverse visual domains.

Claimed Contributions

Block-Recurrent Hypothesis (BRH) and empirical validation

The authors formalize the Block-Recurrent Hypothesis, which states that Vision Transformers can be rewritten using a small number of parameter-tied blocks applied recurrently. They provide empirical evidence across diverse ViTs showing contiguous phase structure in layer-layer similarity matrices and demonstrate that stochastic depth promotes this recurrent block structure.

10 retrieved papers
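The phase-structure evidence rests on layer-layer similarity matrices. A minimal sketch of how contiguous phases appear in such a matrix is given below; linear CKA is used as one common representational similarity metric, and the `linear_cka` helper and toy activations are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA between two representation matrices of shape
    # (n_samples, dim); 1.0 for identical (centered) representations.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
n, d = 64, 8

# Toy "layer activations": layers 0-2 are noisy copies of one signal,
# layers 3-5 of another, mimicking two contiguous phases along depth.
base_a = rng.standard_normal((n, d))
base_b = rng.standard_normal((n, d))
layers = ([base_a + 0.05 * rng.standard_normal((n, d)) for _ in range(3)]
          + [base_b + 0.05 * rng.standard_normal((n, d)) for _ in range(3)])

# Layer-layer similarity matrix: high within-phase blocks on the
# diagonal, low similarity across phases.
S = np.array([[linear_cka(X, Y) for Y in layers] for X in layers])
```

In a trained ViT the same computation would use real per-layer activations over a batch of images; a contiguous block-diagonal pattern in S is what the report refers to as phase structure.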
Raptor: Recurrent Approximations to Phase-structured TransfORmers

The authors develop Raptor, a method to train weight-tied block-recurrent approximations of pretrained ViTs that reconstruct complete internal representation trajectories. They demonstrate that a Raptor model can recover 94% of DINOv2 ImageNet-1k linear probe accuracy using only 2 recurrent blocks, providing constructive verification of functional reuse.

10 retrieved papers
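Raptor is described as reconstructing complete internal representation trajectories rather than only final outputs. One way to express such an objective is sketched below; the `trajectory_loss` definition is an assumption for illustration, not the paper's actual training loss.

```python
import numpy as np

def trajectory_loss(teacher_states, student_states):
    # Mean squared error averaged over the whole depth trajectory,
    # not just the final layer. Both arguments are lists of
    # (n_tokens, dim) hidden states, one entry per block application.
    assert len(teacher_states) == len(student_states)
    return float(np.mean([np.mean((t - s) ** 2)
                          for t, s in zip(teacher_states, student_states)]))

rng = np.random.default_rng(0)
L, n, d = 12, 4, 8
teacher = [rng.standard_normal((n, d)) for _ in range(L)]

# A surrogate matching the teacher at every depth scores zero ...
perfect = [t.copy() for t in teacher]
# ... while matching only the final state still pays for the interior,
# which is what distinguishes trajectory reconstruction from ordinary
# output distillation.
final_only = ([rng.standard_normal((n, d)) for _ in range(L - 1)]
              + [teacher[-1].copy()])

loss_perfect = trajectory_loss(teacher, perfect)
loss_final_only = trajectory_loss(teacher, final_only)
```

Penalizing the full trajectory is the stronger requirement: it forces the weight-tied surrogate to reproduce the teacher's intermediate computation, which is what makes the 94% linear-probe recovery a constructive check of functional reuse.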
Dynamical Interpretability framework for Vision Transformers

The authors introduce a framework for analyzing ViT depth as an iterated dynamical system. Their analysis reveals directional convergence into class-dependent angular basins, token-specific dynamics with specialized behaviors for cls and patch tokens, and collapse of update fields to low-rank subspaces consistent with convergence to low-dimensional attractors.

10 retrieved papers
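The claimed collapse of the update field to low rank can be probed by an effective-rank estimate of per-layer updates. The participation-ratio estimator below is one standard choice, not necessarily the paper's; the toy "early" and "late" updates are illustrative assumptions.

```python
import numpy as np

def effective_rank(M):
    # Participation-ratio effective rank from singular values:
    # (sum s_i^2)^2 / sum s_i^4. Equals r when M has r equal
    # singular values and the rest zero.
    s = np.linalg.svd(M, compute_uv=False)
    p = s ** 2
    return float(p.sum() ** 2 / (p ** 2).sum())

rng = np.random.default_rng(0)
n, d = 128, 32  # n token updates of dimension d (illustrative sizes)

# Early-depth-like updates: tokens move in unstructured, full-rank
# random directions.
early_updates = rng.standard_normal((n, d))

# Late-depth-like updates: every token's update lies in a rank-2
# subspace, mimicking collapse toward a low-dimensional attractor.
basis = rng.standard_normal((2, d))
late_updates = rng.standard_normal((n, 2)) @ basis

r_early = effective_rank(early_updates)
r_late = effective_rank(late_updates)
# r_late stays near 2 while r_early is close to the ambient dimension;
# a depth profile of this quantity is one way to read off the collapse.
```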

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Block-Recurrent Hypothesis (BRH) and empirical validation

The authors formalize the Block-Recurrent Hypothesis, which states that Vision Transformers can be rewritten using a small number of parameter-tied blocks applied recurrently. They provide empirical evidence across diverse ViTs showing contiguous phase structure in layer-layer similarity matrices and demonstrate that stochastic depth promotes this recurrent block structure.

Contribution

Raptor: Recurrent Approximations to Phase-structured TransfORmers

The authors develop Raptor, a method to train weight-tied block-recurrent approximations of pretrained ViTs that reconstruct complete internal representation trajectories. They demonstrate that a Raptor model can recover 94% of DINOv2 ImageNet-1k linear probe accuracy using only 2 recurrent blocks, providing constructive verification of functional reuse.

Contribution

Dynamical Interpretability framework for Vision Transformers

The authors introduce a framework for analyzing ViT depth as an iterated dynamical system. Their analysis reveals directional convergence into class-dependent angular basins, token-specific dynamics with specialized behaviors for cls and patch tokens, and collapse of update fields to low-rank subspaces consistent with convergence to low-dimensional attractors.