ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: byte-level language modeling, tokenization
Abstract:

Modern language models (LMs) still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. Our approach is grounded in information theory: ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, allowing the model to dynamically evaluate the information cost of grouping bytes and to decide chunk boundaries during processing. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive, robust, and information-grounded language models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ByteFlow Net proposes a compression-driven segmentation architecture that removes fixed tokenizers and learns to chunk raw byte streams based on the coding rate of latent representations. The paper sits in the Information-Theoretic Segmentation leaf, which contains only two papers in the entire taxonomy of twenty-five works. This indicates a relatively sparse research direction within the broader field of tokenizer-free language modeling, suggesting that principled information-theoretic approaches to adaptive segmentation remain underexplored compared to fixed-granularity methods or heuristic-based learned segmentation.

The taxonomy reveals neighboring approaches in sibling leaves: Learned Segmentation Mechanisms includes models with trainable boundary prediction modules (three papers), while Hierarchical Multi-Scale Architectures employ fixed hierarchical structures at multiple granularities (four papers). ByteFlow Net diverges from these by grounding segmentation decisions in compression principles rather than learned heuristics or predetermined hierarchies. The broader Adaptive Byte Segmentation Architectures branch contrasts with Fixed-Granularity Byte-Level Models, which process bytes uniformly without dynamic chunking—a fundamental architectural distinction that positions ByteFlow Net in the adaptive paradigm.

Among the twenty-two candidates examined in total, the ByteFlow Net architecture contribution has two refutable candidates out of ten examined, and the end-to-end tokenizer-free paradigm has three out of ten. The coding-rate-based chunking criterion appears more novel, with zero refutable candidates among the two examined. These statistics suggest that while the overall architectural concept and tokenizer-free approach have some prior-work overlap within the limited search scope, the specific compression-based chunking mechanism may represent a less-explored technical direction. The analysis covers only top-K semantic matches plus citation expansion; it is not an exhaustive literature review.

Based on the limited search scope of twenty-two candidates, ByteFlow Net appears to occupy a relatively sparse position within information-theoretic segmentation approaches. The specific coding-rate criterion shows minimal overlap with examined prior work, though the broader architectural paradigm has more substantial precedents. The taxonomy structure confirms that adaptive segmentation remains less crowded than fixed-granularity methods, with information-theoretic approaches representing a particularly narrow research thread within the field.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 5

Research Landscape Overview

Core task: tokenizer-free language modeling through adaptive byte segmentation. The field has evolved around several complementary directions. Fixed-Granularity Byte-Level Models such as ByT5[2] and MambaByte[10] process raw bytes uniformly without explicit segmentation, offering simplicity at the cost of longer sequences. Adaptive Byte Segmentation Architectures introduce dynamic grouping mechanisms, ranging from hierarchical patch-based approaches like MEGABYTE[11] to information-theoretic methods that learn optimal boundaries. Multilingual and Cross-Lingual Byte Modeling explores how byte-level representations handle diverse scripts and low-resource languages, while Tokenization Analysis and Byte-Level Inference investigates the theoretical and practical trade-offs of abandoning subword vocabularies. Domain-Specific Byte-Level Applications extend these ideas to specialized tasks such as DNA sequence modeling[4] or network packet analysis[22], demonstrating the versatility of byte-based paradigms beyond natural language.

Recent work has concentrated on bridging the efficiency gap between fixed and adaptive strategies. ByteFlow[0] sits within the Information-Theoretic Segmentation cluster, emphasizing principled boundary detection to balance compression and computational cost. This contrasts with the Byte Latent Transformer[3], a close neighbor that employs learned latent groupings to dynamically chunk byte streams, and with SpaceByte[7], which uses a separate boundary prediction mechanism. While fixed models like ByT5[2] remain competitive for certain multilingual tasks, adaptive approaches such as ByteFlow[0] and Dynamic Chunking[6] aim to capture variable-length semantic units more naturally.

Open questions persist around the scalability of these segmentation strategies, the interpretability of learned boundaries, and whether information-theoretic criteria can consistently outperform heuristic or neural grouping methods across diverse languages and domains.

Claimed Contributions

ByteFlow Net architecture with compression-driven segmentation

The authors propose ByteFlow Net, a hierarchical neural architecture that eliminates fixed tokenizers by learning to segment raw byte streams dynamically. The segmentation is driven by a coding-rate-based compression objective that identifies semantically meaningful boundaries adaptively during the forward pass.

10 retrieved papers · Can Refute
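The hierarchical flow this contribution describes (embed raw bytes, score positions, promote the top-scoring positions to boundaries, pool each chunk into one latent unit) can be sketched as follows. This is an illustrative guess, not the paper's implementation: the function name, the byte-embedding table, and the L2 change-magnitude scorer standing in for the coding-rate criterion are all assumptions made for the example.

```python
import numpy as np

def byteflow_forward(byte_seq, emb, k=4):
    """Illustrative sketch of a tokenizer-free hierarchical forward pass:
    embed raw bytes, score each position, promote the top-k scoring
    positions to chunk boundaries, and mean-pool each chunk into a single
    latent for a downstream global model. The L2 change-magnitude scorer
    below is a stand-in for the paper's coding-rate criterion."""
    ids = np.frombuffer(bytes(byte_seq), dtype=np.uint8)
    H = emb[ids]                                       # (n, d) byte latents
    deltas = np.diff(H, axis=0, prepend=H[:1])         # per-position change signal
    scores = np.linalg.norm(deltas, axis=1)
    bounds = np.sort(np.argsort(scores)[-k:])          # static top-k boundaries
    chunks = np.split(H, bounds[bounds > 0])           # segment at boundaries
    return np.stack([c.mean(axis=0) for c in chunks])  # (<= k+1, d) chunk latents
```

Because the number of promoted boundaries is fixed at k, the output has a bounded shape regardless of input content, which is what makes a static computation graph possible.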
Coding-rate-based chunking criterion for dynamic segmentation

The authors introduce a principled information-theoretic chunking mechanism that uses lossy coding rate to measure the information gain at each byte position. This criterion enables the model to dynamically select chunk boundaries by promoting positions with high coding rates to the global level while maintaining a static computation graph through Top-K selection.

2 retrieved papers
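As a concrete reading of this criterion, the sketch below assumes the standard lossy coding rate R(Z, eps) = 1/2 logdet(I + d/(n eps^2) Z^T Z) over n d-dimensional latents, scores each position by its marginal coding-rate gain, and keeps the k highest-gain positions. The marginal-gain scoring and function names are this report's illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate R(Z, eps) = 0.5 * logdet(I + d/(n*eps^2) * Z.T @ Z)
    for n latent vectors of dimension d stacked as the rows of Z."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * (Z.T @ Z))
    return 0.5 * logdet

def topk_chunk_boundaries(Z, k, eps=0.5):
    """Score each position by the marginal coding-rate gain of appending it
    to the prefix, then keep the k highest-gain positions as boundaries.
    Top-k selection fixes the number of promoted positions, keeping the
    computation graph static."""
    gains = np.empty(len(Z))
    prev = 0.0
    for t in range(len(Z)):
        cur = coding_rate(Z[: t + 1], eps)
        gains[t] = cur - prev
        prev = cur
    return np.sort(np.argsort(gains)[-k:]), gains
```

Positions whose latents add directions not already spanned by the prefix raise the log-determinant most, so boundaries land where new information enters the stream.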
End-to-end tokenizer-free modeling paradigm

The authors establish a new modeling paradigm that eliminates the traditional static tokenization stage entirely, replacing it with an end-to-end learnable segmentation process integrated directly into the neural network's forward computation.

10 retrieved papers · Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ByteFlow Net architecture with compression-driven segmentation


Contribution

Coding-rate-based chunking criterion for dynamic segmentation


Contribution

End-to-end tokenizer-free modeling paradigm

