Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: deep learning, architecture, tokenization
Abstract:

Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks toward general models based on powerful architectures (e.g., the Transformer) that learn everything from raw data. Despite this trend, pre-processing steps such as tokenization remain a barrier to true end-to-end foundation models. We introduce a collection of new techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies jointly with the rest of the model. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization–LM–detokenization pipeline with a single model learned fully end-to-end. When compute- and data-matched, an H-Net with one stage of hierarchy operating at the byte level outperforms a strong Transformer language model operating over BPE tokens. Iterating the hierarchy to multiple stages further increases performance by modeling multiple levels of abstraction, demonstrating significantly better scaling with data and matching a token-based Transformer of twice its size. H-Nets pretrained on English show significantly increased character-level robustness and qualitatively learn meaningful, data-dependent chunking strategies without any heuristics or explicit supervision. Finally, the H-Net's improvement over tokenized pipelines is even greater in languages and modalities with weaker tokenization heuristics, such as Chinese and code, or DNA sequences (nearly 4x improvement in data efficiency over baselines), showing the potential of true end-to-end models that learn and scale better from unprocessed data.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a dynamic chunking mechanism that learns content-dependent segmentation strategies jointly with model training, integrated into a hierarchical network (H-Net) architecture. Within the taxonomy, it resides in the 'Dynamic Chunking Mechanisms' leaf under 'Byte-Level Language Modeling Architectures'. This leaf contains only two papers total, indicating a relatively sparse research direction. The work aims to replace the traditional tokenization–LM–detokenization pipeline with a single end-to-end model operating on raw byte sequences, positioning itself at the intersection of adaptive segmentation and hierarchical representation learning.

The taxonomy reveals that neighboring leaves include 'Multiscale Hierarchical Decoders' (focused on model-agnostic decoder stacks) and 'Tokenizer-Free Generative Models' (emphasizing structured output generation). The broader 'Byte-Level Language Modeling Architectures' branch sits alongside 'Hierarchical Representation Learning from Raw Inputs', which addresses multi-level abstractions across modalities beyond language. The scope note for the paper's leaf explicitly excludes fixed chunking methods, clarifying that the focus is on adaptive, learned segmentation. This positioning suggests the work bridges architectural innovation (hierarchical networks) with learning-based preprocessing (dynamic chunking), diverging from both static segmentation and purely generative byte-level approaches.

Among 30 candidates examined, the analysis identified 2 refutable pairs across 3 contributions. The dynamic chunking mechanism itself (10 candidates examined, 0 refutable) appears relatively novel within the limited search scope. However, the H-Net architecture replacing tokenization pipelines (10 candidates, 2 refutable) shows more substantial prior work overlap, suggesting that hierarchical byte-level architectures have been explored before. The recursive multi-stage hierarchical chunking contribution (10 candidates, 0 refutable) also appears less contested. These statistics indicate that while the core segmentation mechanism may be distinctive, the architectural framing has closer precedents in the examined literature.

Based on the limited top-30 semantic search, the work appears to occupy a sparsely populated research direction (only one sibling paper in its taxonomy leaf). The dynamic chunking mechanism shows fewer overlaps with prior work than the hierarchical architecture component. However, the search scope is narrow—30 candidates cannot capture the full landscape of byte-level modeling or hierarchical sequence processing. A more exhaustive review would be needed to assess whether the combination of learned segmentation and multi-stage hierarchy represents a significant departure from existing methods or an incremental refinement.

Taxonomy

Core-task Taxonomy Papers: 17
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: end-to-end hierarchical sequence modeling without tokenization. This field addresses the challenge of processing raw sequential data, such as byte streams or continuous signals, by learning hierarchical representations directly, bypassing conventional tokenization steps.

The taxonomy reveals four main branches. Byte-Level Language Modeling Architectures explores neural designs that operate on raw bytes, including dynamic chunking mechanisms and multiscale approaches like Multiscale Byte[2] and ByGPT5[14]. Hierarchical Representation Learning from Raw Inputs investigates how models can discover and exploit structure at multiple levels of granularity, with works such as H-Net++[1] and Hierarchy Aware Embedding[5] demonstrating methods for capturing nested patterns. Domain-Specific Hierarchical Sequence Applications extends these ideas to specialized settings, ranging from auditory pathways (Noncanonical Auditory Pathway[6]) to cursive script perception (Cursive Script Perception[15]), while Data Processing and Analysis Methodologies encompasses techniques for handling and interpreting hierarchical data, including qualitative analysis frameworks (Qualitative Data Analysis[9]) and optimization strategies (SGD Adaptive Moment[11]).

A particularly active line of work centers on how to segment or chunk byte sequences adaptively during training. Dynamic Chunking[0] sits squarely within this effort, proposing mechanisms that learn to identify meaningful boundaries in raw input streams without predefined tokens. This contrasts with fixed-window approaches and aligns closely with H-Net++[1], which also emphasizes flexible hierarchical partitioning, though H-Net++[1] may focus more on explicit multi-level architectures. Meanwhile, methods like SIDE[3] and MarkerNet[4] explore alternative strategies for injecting structural priors or marker-based segmentation.
A key open question across these branches is the trade-off between computational efficiency and the expressiveness of learned hierarchies: fully dynamic systems can capture richer patterns but often require more sophisticated training regimes. Dynamic Chunking[0] contributes to this landscape by demonstrating that learnable chunking can improve both modeling flexibility and downstream task performance, positioning it among recent efforts to make byte-level modeling more practical and scalable.

Claimed Contributions

Dynamic chunking mechanism for end-to-end hierarchical sequence modeling

The authors propose a dynamic chunking (DC) mechanism that learns data-dependent segmentation strategies through gradient-based optimization without external supervision. DC combines a routing module predicting boundaries via similarity scores and a smoothing module that interpolates representations, enabling fully end-to-end learning of how to compress sequences.

10 retrieved papers
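The routing-plus-smoothing idea described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the cosine-based boundary scoring and the interpolation-based smoothing follow the prose description, while the shapes, the 0.5 threshold, and the choice of boundary positions as chunk representatives are assumptions made for the sketch.

```python
import numpy as np

def route_and_smooth(h, threshold=0.5):
    """Toy sketch of similarity-based routing plus smoothing.

    h: (T, d) array of per-position hidden states. A boundary score at
    position t comes from the cosine similarity between adjacent states
    (dissimilar neighbours suggest a chunk boundary); the boundary
    probabilities then weight an interpolation so that the otherwise
    discrete selection remains differentiable in a real implementation.
    """
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)
    cos = np.sum(hn[1:] * hn[:-1], axis=1)            # (T-1,) adjacent similarities
    # Low similarity -> high boundary probability; position 0 always starts a chunk.
    p = np.concatenate([[1.0], 0.5 * (1.0 - cos)])    # (T,) values in [0, 1]
    boundaries = p >= threshold                       # discrete routing decision
    # Smoothing: blend each state with its predecessor, weighted by p.
    smoothed = h.copy()
    smoothed[1:] = p[1:, None] * h[1:] + (1.0 - p[1:, None]) * h[:-1]
    # Downsample: boundary positions act as chunk representatives.
    return p, boundaries, smoothed[boundaries]

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))
p, boundaries, chunks = route_and_smooth(h)
print(chunks.shape)  # one row per detected chunk, d features each
```

The compression ratio falls out of the data here: more self-similar stretches of input yield fewer boundaries and a shorter chunk sequence.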
Hierarchical network (H-Net) architecture replacing tokenization pipelines

The authors introduce H-Net, a hierarchical U-Net-like architecture with encoder, main network, and decoder components that processes raw byte-level data. This architecture eliminates the need for fixed-vocabulary tokenization by learning segmentation jointly with the model, creating the first truly end-to-end tokenizer-free language model.

10 retrieved papers
Can Refute
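A minimal skeleton of the encoder → chunk → main network → de-chunk → decoder flow might look like the following. Everything here is a stand-in: fixed random linear maps replace the learned sub-networks, chunk boundaries are supplied externally rather than predicted, and `HNetSketch` is a hypothetical name, not the authors' API.

```python
import numpy as np

class HNetSketch:
    """Illustrative U-Net-like flow: encoder -> chunk -> main -> de-chunk -> decoder."""

    def __init__(self, d=16, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-ins for learned sub-networks: fixed random linear maps.
        self.W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
        self.W_main = rng.normal(size=(d, d)) / np.sqrt(d)
        self.W_dec = rng.normal(size=(d, d)) / np.sqrt(d)

    def forward(self, x, boundaries):
        # 1) Byte-level encoder over the full-length sequence.
        h = np.tanh(x @ self.W_enc)                   # (T, d)
        # 2) Chunking: keep only boundary positions as chunk representatives,
        #    so the main network runs on a much shorter sequence.
        z = np.tanh(h[boundaries] @ self.W_main)      # (C, d), C <= T
        # 3) De-chunking: broadcast each chunk state back to its byte positions.
        idx = np.cumsum(boundaries) - 1               # chunk index per byte position
        # 4) Decoder combines upsampled chunk states with the byte-level
        #    encoder states, a U-Net-style skip connection.
        return np.tanh((z[idx] + h) @ self.W_dec)     # (T, d)

net = HNetSketch(d=16)
x = np.random.default_rng(1).normal(size=(32, 16))
boundaries = np.zeros(32, dtype=bool)
boundaries[::4] = True                                # fixed chunks, for illustration only
y = net.forward(x, boundaries)
print(y.shape)  # (32, 16)
```

The point of the sketch is the shape discipline: the expensive main network sees C positions while the cheap outer networks see all T, which is where the claimed efficiency of the hierarchy comes from.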
Recursive multi-stage hierarchical chunking for learning abstractions

The authors demonstrate that H-Net can be recursively nested to create multiple stages of hierarchy, where each stage learns progressively higher-level abstractions from raw data. This recursive design enables the model to discover and operate over learned abstractions rather than handcrafted features, improving scaling with data and parameters.

10 retrieved papers
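The recursive nesting can be illustrated with a toy function that compresses the sequence stage by stage. The fixed pooling ratio is a deliberate simplification for the sketch: in the described design, each stage learns its own boundaries, so the per-stage compression is data-dependent rather than constant.

```python
import numpy as np

def recurse_stages(h, num_stages, ratio=4):
    """Toy recursion: each stage pools the sequence and nests another
    stage inside, mimicking how an H-Net's main network can itself be
    an H-Net. A fixed pooling ratio replaces learned boundaries here.
    """
    if num_stages == 0 or h.shape[0] < ratio:
        return h  # innermost stage: the "main network" would run here
    T = (h.shape[0] // ratio) * ratio
    # Pool every `ratio` positions into one higher-level representative.
    pooled = h[:T].reshape(-1, ratio, h.shape[1]).mean(axis=1)
    return recurse_stages(pooled, num_stages - 1, ratio)

h = np.random.default_rng(0).normal(size=(64, 8))
for s in range(3):
    print(s, recurse_stages(h, s).shape)  # 64 -> 16 -> 4 positions
```

Each added stage shortens the sequence the innermost network must process, which is the mechanism behind the claimed improvement in scaling: higher stages operate over progressively coarser, more abstract units.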

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Dynamic chunking mechanism for end-to-end hierarchical sequence modeling

Contribution 2: Hierarchical network (H-Net) architecture replacing tokenization pipelines

Contribution 3: Recursive multi-stage hierarchical chunking for learning abstractions