Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: deep learning, architecture, tokenization
Abstract:

Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks toward general models based on powerful architectures (e.g., the Transformer) that learn everything from raw data. Despite this trend, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization–LM–detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness and qualitatively learn meaningful, data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is even greater in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a dynamic chunking mechanism that learns content-dependent segmentation strategies jointly with model training, integrated into a hierarchical network (H-Net) architecture. Within the taxonomy, it resides in the 'Dynamic Chunking Mechanisms' leaf under 'Byte-Level Language Modeling Architectures'. This leaf contains only two papers total, indicating a relatively sparse research direction. The work aims to replace the traditional tokenization–LM–detokenization pipeline with a single end-to-end model operating on raw byte sequences, positioning itself at the intersection of adaptive segmentation and hierarchical representation learning.

The taxonomy reveals that neighboring leaves include 'Multiscale Hierarchical Decoders' (focused on model-agnostic decoder stacks) and 'Tokenizer-Free Generative Models' (emphasizing structured output generation). The broader 'Byte-Level Language Modeling Architectures' branch sits alongside 'Hierarchical Representation Learning from Raw Inputs', which addresses multi-level abstractions across modalities beyond language. The scope note for the paper's leaf explicitly excludes fixed chunking methods, clarifying that the focus is on adaptive, learned segmentation. This positioning suggests the work bridges architectural innovation (hierarchical networks) with learning-based preprocessing (dynamic chunking), diverging from both static segmentation and purely generative byte-level approaches.

Among 30 candidates examined, the analysis identified 2 refutable pairs across 3 contributions. The dynamic chunking mechanism itself (10 candidates examined, 0 refutable) appears relatively novel within the limited search scope. However, the H-Net architecture replacing tokenization pipelines (10 candidates, 2 refutable) shows more substantial prior work overlap, suggesting that hierarchical byte-level architectures have been explored before. The recursive multi-stage hierarchical chunking contribution (10 candidates, 0 refutable) also appears less contested. These statistics indicate that while the core segmentation mechanism may be distinctive, the architectural framing has closer precedents in the examined literature.

Based on the limited top-30 semantic search, the work appears to occupy a sparsely populated research direction (only one sibling paper in its taxonomy leaf). The dynamic chunking mechanism shows fewer overlaps with prior work than the hierarchical architecture component. However, the search scope is narrow—30 candidates cannot capture the full landscape of byte-level modeling or hierarchical sequence processing. A more exhaustive review would be needed to assess whether the combination of learned segmentation and multi-stage hierarchy represents a significant departure from existing methods or an incremental refinement.

Taxonomy

Core-task Taxonomy Papers: 17
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: end-to-end hierarchical sequence modeling without tokenization. This field addresses the challenge of processing raw sequential data, such as byte streams or continuous signals, by learning hierarchical representations directly, bypassing conventional tokenization steps.

The taxonomy reveals four main branches. Byte-Level Language Modeling Architectures explores neural designs that operate on raw bytes, including dynamic chunking mechanisms and multiscale approaches like Multiscale Byte[2] and ByGPT5[14]. Hierarchical Representation Learning from Raw Inputs investigates how models can discover and exploit structure at multiple levels of granularity, with works such as H-Net++[1] and Hierarchy Aware Embedding[5] demonstrating methods for capturing nested patterns. Domain-Specific Hierarchical Sequence Applications extends these ideas to specialized settings, ranging from auditory pathways (Noncanonical Auditory Pathway[6]) to cursive script perception (Cursive Script Perception[15]), while Data Processing and Analysis Methodologies encompasses techniques for handling and interpreting hierarchical data, including qualitative analysis frameworks (Qualitative Data Analysis[9]) and optimization strategies (SGD Adaptive Moment[11]).

A particularly active line of work centers on how to segment or chunk byte sequences adaptively during training. Dynamic Chunking[0] sits squarely within this effort, proposing mechanisms that learn to identify meaningful boundaries in raw input streams without predefined tokens. This contrasts with fixed-window approaches and aligns closely with H-Net++[1], which also emphasizes flexible hierarchical partitioning, though H-Net++[1] may focus more on explicit multi-level architectures. Meanwhile, methods like SIDE[3] and MarkerNet[4] explore alternative strategies for injecting structural priors or marker-based segmentation.
A key open question across these branches is the trade-off between computational efficiency and the expressiveness of learned hierarchies: fully dynamic systems can capture richer patterns but often require more sophisticated training regimes. Dynamic Chunking[0] contributes to this landscape by demonstrating that learnable chunking can improve both modeling flexibility and downstream task performance, positioning it among recent efforts to make byte-level modeling more practical and scalable.

Claimed Contributions

Dynamic chunking mechanism for end-to-end hierarchical sequence modeling

The authors propose a dynamic chunking (DC) mechanism that learns data-dependent segmentation strategies through gradient-based optimization without external supervision. DC combines a routing module predicting boundaries via similarity scores and a smoothing module that interpolates representations, enabling fully end-to-end learning of how to compress sequences.

10 retrieved papers
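The routing-plus-smoothing idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the cosine-based boundary scoring and the interpolation-based smoothing follow the prose description, while the shapes, the 0.5 threshold, and the choice of boundary positions as chunk representatives are assumptions made for the sketch.

```python
import numpy as np

def route_and_smooth(h, threshold=0.5):
    """Toy sketch of similarity-based routing plus smoothing.

    h: (T, d) array of per-position hidden states. A boundary score at
    position t comes from the cosine similarity between adjacent states
    (dissimilar neighbours suggest a chunk boundary); the boundary
    probabilities then weight an interpolation so that the otherwise
    discrete selection remains differentiable in a real implementation.
    """
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    cos = np.sum(hn[1:] * hn[:-1], axis=1)            # (T-1,) adjacent similarities
    # Low similarity -> high boundary probability; position 0 always starts a chunk.
    p = np.concatenate([[1.0], 0.5 * (1.0 - cos)])    # (T,) values in [0, 1]
    boundaries = p >= threshold                       # discrete routing decision
    # Smoothing: blend each state with its predecessor, weighted by p.
    smoothed = h.copy()
    smoothed[1:] = p[1:, None] * h[1:] + (1.0 - p[1:, None]) * h[:-1]
    # Downsample: boundary positions act as chunk representatives.
    return p, boundaries, smoothed[boundaries]

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))
p, boundaries, chunks = route_and_smooth(h)
print(chunks.shape)  # one row per detected chunk, d features each
```

The compression ratio falls out of the data here: more self-similar stretches of input yield fewer boundaries and a shorter chunk sequence.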
Hierarchical network (H-Net) architecture replacing tokenization pipelines

The authors introduce H-Net, a hierarchical U-Net-like architecture with encoder, main network, and decoder components that processes raw byte-level data. This architecture eliminates the need for fixed-vocabulary tokenization by learning segmentation jointly with the model, creating the first truly end-to-end tokenizer-free language model.

10 retrieved papers
Can Refute
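A minimal skeleton of the encoder → chunk → main network → de-chunk → decoder flow might look like the following. Everything here is a stand-in: fixed random linear maps replace the learned sub-networks, chunk boundaries are supplied externally rather than predicted, and `HNetSketch` is a hypothetical name, not the authors' API.

```python
import numpy as np

class HNetSketch:
    """Illustrative U-Net-like flow: encoder -> chunk -> main -> de-chunk -> decoder."""

    def __init__(self, d=16, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-ins for learned sub-networks: fixed random linear maps.
        self.W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
        self.W_main = rng.normal(size=(d, d)) / np.sqrt(d)
        self.W_dec = rng.normal(size=(d, d)) / np.sqrt(d)

    def forward(self, x, boundaries):
        # 1) Byte-level encoder over the full-length sequence.
        h = np.tanh(x @ self.W_enc)                   # (T, d)
        # 2) Chunking: keep only boundary positions as chunk representatives,
        #    so the main network runs on a much shorter sequence.
        z = np.tanh(h[boundaries] @ self.W_main)      # (C, d), C <= T
        # 3) De-chunking: broadcast each chunk state back to its byte positions.
        idx = np.cumsum(boundaries) - 1               # chunk index per byte position
        # 4) Decoder combines upsampled chunk states with the byte-level
        #    encoder states, a U-Net-style skip connection.
        return np.tanh((z[idx] + h) @ self.W_dec)     # (T, d)

net = HNetSketch(d=16)
x = np.random.default_rng(1).normal(size=(32, 16))
boundaries = np.zeros(32, dtype=bool)
boundaries[::4] = True                                # fixed chunks, for illustration only
y = net.forward(x, boundaries)
print(y.shape)  # (32, 16)
```

The point of the sketch is the shape discipline: the expensive main network sees C positions while the cheap outer networks see all T, which is where the claimed efficiency of the hierarchy comes from.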
Recursive multi-stage hierarchical chunking for learning abstractions

The authors demonstrate that H-Net can be recursively nested to create multiple stages of hierarchy, where each stage learns progressively higher-level abstractions from raw data. This recursive design enables the model to discover and operate over learned abstractions rather than handcrafted features, improving scaling with data and parameters.

10 retrieved papers
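The recursive nesting can be illustrated with a toy function that compresses the sequence stage by stage. The fixed pooling ratio is a deliberate simplification for the sketch: in the described design, each stage learns its own boundaries, so the per-stage compression is data-dependent rather than constant.

```python
import numpy as np

def recurse_stages(h, num_stages, ratio=4):
    """Toy recursion: each stage pools the sequence and nests another
    stage inside, mimicking how an H-Net's main network can itself be
    an H-Net. A fixed pooling ratio replaces learned boundaries here.
    """
    if num_stages == 0 or h.shape[0] < ratio:
        return h  # innermost stage: the "main network" would run here
    T = (h.shape[0] // ratio) * ratio
    # Pool every `ratio` positions into one higher-level representative.
    pooled = h[:T].reshape(-1, ratio, h.shape[1]).mean(axis=1)
    return recurse_stages(pooled, num_stages - 1, ratio)

h = np.random.default_rng(0).normal(size=(64, 8))
for s in range(3):
    print(s, recurse_stages(h, s).shape)  # 64 -> 16 -> 4 positions
```

Each added stage shortens the sequence the innermost network must process, which is the mechanism behind the claimed improvement in scaling: higher stages operate over progressively coarser, more abstract units.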

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Dynamic chunking mechanism for end-to-end hierarchical sequence modeling

Contribution 2: Hierarchical network (H-Net) architecture replacing tokenization pipelines

Contribution 3: Recursive multi-stage hierarchical chunking for learning abstractions