Revisiting [CLS] and Patch Token Interaction in Vision Transformers

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: representation, vision transformer, SSL, attention, specialization, architecture, interpretability, DINO, DINOv2, CLIP, DeiT
Abstract:

Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite their distinct nature, both token types are processed identically throughout the model. In this work, we investigate the friction between global and local feature learning under different pre-training strategies by analyzing the interactions between class and patch tokens. Our analysis reveals that standard normalization layers introduce an implicit differentiation between these token types. Building on this insight, we propose specialized processing paths that selectively disentangle the computational flow of class and patch tokens, particularly within normalization layers and early query-key-value projections. This targeted specialization leads to significantly improved patch representation quality for dense prediction tasks. Our experiments demonstrate segmentation performance gains of over 2 mIoU points on standard benchmarks, while maintaining strong classification accuracy. The proposed modifications introduce only an 8% increase in parameters, with no additional computational overhead. Through comprehensive ablations, we provide insights into which architectural components benefit most from specialization and how our approach generalizes across model scales and learning frameworks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes specialized processing paths that selectively disentangle class and patch token computation in Vision Transformers, particularly within normalization layers and early query-key-value projections. It resides in the 'Explicit Class-Patch Token Decoupling' leaf, which contains only three papers including this one. This represents a relatively sparse research direction within the broader taxonomy of 35 papers across multiple token processing strategies. The small sibling set suggests this specific approach to architectural specialization—targeting normalization and QKV projections—occupies a focused niche rather than a crowded subfield.

The taxonomy reveals that while explicit decoupling is sparse, neighboring approaches are more populated. The parent category 'Token Specialization and Architectural Modifications' includes multi-class token architectures (four papers) and dual-path attention mechanisms (three papers), indicating alternative strategies for managing token heterogeneity. Adjacent branches like 'Weakly Supervised Semantic Segmentation with Token Interactions' (four subcategories) and 'Patch Token Utilization for Dense Prediction' (three subcategories) leverage token interactions differently—for localization and retrieval rather than architectural separation. The taxonomy's scope notes clarify that methods using token selection or merging without architectural specialization belong elsewhere, positioning this work as fundamentally about structural modification rather than computational efficiency.

Among 24 candidates examined across three contributions, none were found to clearly refute the proposed ideas. The analysis of normalization-induced implicit differentiation examined four candidates with zero refutations. The specialized processing paths contribution examined ten candidates, again with no clear prior work overlap. The comprehensive architectural study similarly found no refutations among ten examined papers. These statistics suggest that within the limited search scope—top-K semantic matches plus citation expansion—the specific combination of normalization analysis and targeted QKV specialization appears relatively unexplored, though the modest search scale (24 papers, not hundreds) means undiscovered prior work remains possible.

Based on the limited literature search, the work appears to occupy a distinct position combining normalization-layer analysis with selective architectural specialization. The sparse population of its taxonomy leaf and absence of refutations among examined candidates suggest novelty, though the 24-paper search scope cannot guarantee exhaustiveness. The approach differs from sibling works by targeting specific architectural components (normalization, QKV) rather than wholesale pathway separation, representing a measured intervention within the class-patch decoupling design space.

Taxonomy

Core-task Taxonomy Papers: 35
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: disentangling class and patch token processing in Vision Transformers. The field structure reflects diverse strategies for managing the dual-token architecture inherent to ViTs, where a class token aggregates global information while patch tokens encode spatial details. The taxonomy reveals several major branches. Token Specialization and Architectural Modifications explores explicit redesigns of how class and patch tokens interact or are processed separately, including works like GradToken[23] and DisentangleFormer[25] that explicitly decouple these streams. Token Selection, Pruning, and Efficiency focuses on computational trade-offs by selectively retaining informative tokens. Weakly Supervised Semantic Segmentation with Token Interactions and Patch Token Utilization for Dense Prediction and Retrieval leverage patch tokens for fine-grained spatial tasks. Domain-Specific Token Processing Adaptations, Cross-Domain Transfer and Adversarial Robustness, Interpretability and Feature Analysis, and Few-Shot and Cross-Attention Learning address specialized application contexts, from medical imaging to adversarial settings, where token roles must be carefully managed.

A particularly active line of work centers on explicit class-patch decoupling mechanisms that challenge the standard unified self-attention paradigm. CLS Patch Interaction[0] sits within this branch, proposing targeted modifications to how class and patch tokens exchange information across transformer layers. This contrasts with approaches like GradToken[23], which uses gradient-based token importance to guide selective processing, and DisentangleFormer[25], which employs separate pathways for different token types. Meanwhile, works such as Token Importance Diversity[1] and Class Token Infusion[2] explore alternative strategies for balancing global and local representations without full architectural separation.

The central tension across these efforts is whether to maintain unified attention for representational richness or to impose structural separation for interpretability and efficiency, with CLS Patch Interaction[0] navigating this trade-off through controlled interaction patterns rather than complete decoupling.

Claimed Contributions

Analysis of implicit differentiation between class and patch tokens via normalization layers

The authors analyze Vision Transformers and discover that normalization layers (particularly LayerNorm before attention) implicitly separate [CLS] and patch tokens despite their shared parameterization, revealing hidden dynamics in how models attempt to distinguish these functionally distinct token types.

4 retrieved papers
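The claimed observation can be illustrated with a toy experiment: a shared LayerNorm maps every token onto (roughly) the same hypersphere, so any systematic [CLS]/patch difference must be carried by the shared affine parameters or reintroduced downstream. Below is a minimal NumPy sketch; all values are synthetic, and the larger pre-norm magnitude given to the [CLS] token is an illustrative assumption, not a measurement from a trained model.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Standard LayerNorm over the feature dimension, with affine
    # parameters shared across all tokens (as in a vanilla ViT).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d = 64
# Toy sequence: 1 [CLS] token + 16 patch tokens. The [CLS] token is
# given a larger-magnitude distribution purely for illustration.
cls_tok = rng.normal(0.0, 4.0, size=(1, d))
patches = rng.normal(0.0, 1.0, size=(16, d))
tokens = np.concatenate([cls_tok, patches], axis=0)

normed = layer_norm(tokens, np.ones(d), np.zeros(d))

# Pre-norm scales differ sharply between the two token types ...
pre_cls = np.linalg.norm(tokens[0])
pre_patch = np.linalg.norm(tokens[1:], axis=1).mean()
# ... while the shared LayerNorm collapses that scale difference,
# leaving only the shared affine transform to tell the types apart.
post_cls = np.linalg.norm(normed[0])
post_patch = np.linalg.norm(normed[1:], axis=1).mean()
print(pre_cls / pre_patch, post_cls / post_patch)
```

The first printed ratio is well above 1 while the second is essentially 1, which is the sense in which a shared normalization layer erases (and must implicitly re-encode) the token-type distinction.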

Specialized processing paths for class and patch tokens

The authors introduce an architectural modification that explicitly separates the processing of [CLS] and patch tokens through dedicated layers with distinct weights, particularly in normalization layers and QKV projections in early transformer blocks, while preserving their interaction through attention mechanisms.

10 retrieved papers
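As described, the modification can be sketched as an attention sub-block in which the [CLS] token receives its own LayerNorm affine parameters and QKV projection while attention itself stays joint, so the two token types still interact. This is a hypothetical reconstruction from the description above; the name `specialized_attention` and the single-head, bias-free projections are simplifying assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 5  # embedding dim, number of tokens (index 0 = [CLS])

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ln(x, g, b, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

# Two parameter sets: one for the [CLS] token, one shared by all
# patch tokens. Only norm affines and the QKV projection are
# duplicated, matching the report's modest parameter increase.
params = {t: dict(g=np.ones(d), b=np.zeros(d),
                  Wqkv=rng.normal(0, 0.1, (d, 3 * d)))
          for t in ("cls", "patch")}

def specialized_attention(x):
    qkv = np.empty((n, 3 * d))
    for i in range(n):
        p = params["cls"] if i == 0 else params["patch"]
        # Token-type-specific norm and QKV projection.
        qkv[i] = ln(x[i], p["g"], p["b"]) @ p["Wqkv"]
    q, k, v = np.split(qkv, 3, axis=-1)
    attn = softmax(q @ k.T / np.sqrt(d))  # joint attention over all tokens
    return x + attn @ v                   # residual connection

x = rng.normal(size=(n, d))
out = specialized_attention(x)
print(out.shape)
```

Note the design point this isolates: specialization happens before the attention matrix is formed, so the [CLS] token can query patches with its own projection while the global token mixing of standard self-attention is preserved.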

Comprehensive study of which architectural components benefit from specialization

The authors conduct systematic ablation studies to identify optimal specialization strategies, showing that targeting the first third of transformer blocks and specific layers (normalization and QKV projections) yields the best performance, and demonstrate generalizability across different model sizes and training paradigms.

10 retrieved papers
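The stated finding, that specializing roughly the first third of blocks works best, amounts to a simple depth-dependent schedule. A hypothetical helper sketching that rule (the function name and the rounding choice are assumptions, not taken from the paper):

```python
def specialized_block_ids(depth, fraction=1 / 3):
    """Indices of transformer blocks to give specialized norm/QKV
    layers, per the reported 'first third' ablation result."""
    return list(range(max(1, round(depth * fraction))))

# E.g., a 12-block ViT-S/16 would specialize blocks 0..3.
print(specialized_block_ids(12))  # [0, 1, 2, 3]
```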

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
