ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: byte-level language modeling, tokenization
Abstract:

Modern language models (LMs) still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce ByteFlow Net, a new architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. Our approach is grounded in information theory: ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, allowing the model to dynamically evaluate the information cost of grouping bytes and to decide chunk boundaries during processing. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive, robust, and information-grounded language models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ByteFlow Net proposes a compression-driven segmentation architecture that removes fixed tokenizers and learns to chunk raw byte streams based on the coding rate of latent representations. The paper sits in the Information-Theoretic Segmentation leaf, which contains only two papers in the entire taxonomy of twenty-five works. This indicates a relatively sparse research direction within the broader field of tokenizer-free language modeling, suggesting that principled information-theoretic approaches to adaptive segmentation remain underexplored compared to fixed-granularity methods or heuristic-based learned segmentation.

The taxonomy reveals neighboring approaches in sibling leaves: Learned Segmentation Mechanisms includes models with trainable boundary prediction modules (three papers), while Hierarchical Multi-Scale Architectures employ fixed hierarchical structures at multiple granularities (four papers). ByteFlow Net diverges from these by grounding segmentation decisions in compression principles rather than learned heuristics or predetermined hierarchies. The broader Adaptive Byte Segmentation Architectures branch contrasts with Fixed-Granularity Byte-Level Models, which process bytes uniformly without dynamic chunking—a fundamental architectural distinction that positions ByteFlow Net in the adaptive paradigm.

Among the twenty-two candidates examined in total, the ByteFlow Net architecture contribution has two refutable candidates out of ten examined, and the end-to-end tokenizer-free paradigm has three out of ten. The coding-rate-based chunking criterion appears more novel, with zero refutable candidates among the two examined. These statistics suggest that while the overall architectural concept and tokenizer-free approach have some prior-work overlap within the limited search scope, the specific compression-based chunking mechanism may represent a less-explored technical direction. The analysis covers only top-K semantic matches plus citation expansion; it is not an exhaustive literature review.

Based on the limited search scope of twenty-two candidates, ByteFlow Net appears to occupy a relatively sparse position within information-theoretic segmentation approaches. The specific coding-rate criterion shows minimal overlap with examined prior work, though the broader architectural paradigm has more substantial precedents. The taxonomy structure confirms that adaptive segmentation remains less crowded than fixed-granularity methods, with information-theoretic approaches representing a particularly narrow research thread within the field.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 5

Research Landscape Overview

Core task: tokenizer-free language modeling through adaptive byte segmentation. The field has evolved around several complementary directions. Fixed-Granularity Byte-Level Models such as ByT5[2] and MambaByte[10] process raw bytes uniformly without explicit segmentation, offering simplicity at the cost of longer sequences. Adaptive Byte Segmentation Architectures introduce dynamic grouping mechanisms, ranging from hierarchical patch-based approaches like MEGABYTE[11] to information-theoretic methods that learn optimal boundaries. Multilingual and Cross-Lingual Byte Modeling explores how byte-level representations handle diverse scripts and low-resource languages, while Tokenization Analysis and Byte-Level Inference investigates the theoretical and practical trade-offs of abandoning subword vocabularies. Domain-Specific Byte-Level Applications extend these ideas to specialized tasks such as DNA sequence modeling[4] or network packet analysis[22], demonstrating the versatility of byte-based paradigms beyond natural language.

Recent work has concentrated on bridging the efficiency gap between fixed and adaptive strategies. ByteFlow[0] sits within the Information-Theoretic Segmentation cluster, emphasizing principled boundary detection to balance compression and computational cost. This contrasts with the Byte Latent Transformer[3], a close neighbor that employs learned latent groupings to dynamically chunk byte streams, and with SpaceByte[7], which uses a separate boundary prediction mechanism. While fixed models like ByT5[2] remain competitive for certain multilingual tasks, adaptive approaches such as ByteFlow[0] and Dynamic Chunking[6] aim to capture variable-length semantic units more naturally.

Open questions persist around the scalability of these segmentation strategies, the interpretability of learned boundaries, and whether information-theoretic criteria can consistently outperform heuristic or neural grouping methods across diverse languages and domains.

Claimed Contributions

ByteFlow Net architecture with compression-driven segmentation

The authors propose ByteFlow Net, a hierarchical neural architecture that eliminates fixed tokenizers by learning to segment raw byte streams dynamically. The segmentation is driven by a coding-rate-based compression objective that identifies semantically meaningful boundaries adaptively during the forward pass.

10 retrieved papers · Can Refute
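The hierarchical flow this contribution describes (embed raw bytes, score positions, promote the top-scoring positions to boundaries, pool each chunk into one latent unit) can be sketched as follows. This is an illustrative guess, not the paper's implementation: the function name, the byte-embedding table, and the L2 change-magnitude scorer standing in for the coding-rate criterion are all assumptions made for the example.

```python
import numpy as np

def byteflow_forward(byte_seq, emb, k=4):
    """Illustrative sketch of a tokenizer-free hierarchical forward pass:
    embed raw bytes, score each position, promote the top-k scoring
    positions to chunk boundaries, and mean-pool each chunk into a single
    latent for a downstream global model. The L2 change-magnitude scorer
    below is a stand-in for the paper's coding-rate criterion."""
    ids = np.frombuffer(bytes(byte_seq), dtype=np.uint8)
    H = emb[ids]                                       # (n, d) byte latents
    deltas = np.diff(H, axis=0, prepend=H[:1])         # per-position change signal
    scores = np.linalg.norm(deltas, axis=1)
    bounds = np.sort(np.argsort(scores)[-k:])          # static top-k boundaries
    chunks = np.split(H, bounds[bounds > 0])           # segment at boundaries
    return np.stack([c.mean(axis=0) for c in chunks])  # (<= k+1, d) chunk latents
```

Because the number of promoted boundaries is fixed at k, the output has a bounded shape regardless of input content, which is what makes a static computation graph possible.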
Coding-rate-based chunking criterion for dynamic segmentation

The authors introduce a principled information-theoretic chunking mechanism that uses lossy coding rate to measure the information gain at each byte position. This criterion enables the model to dynamically select chunk boundaries by promoting positions with high coding rates to the global level while maintaining a static computation graph through Top-K selection.

2 retrieved papers
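As a concrete reading of this criterion, the sketch below assumes the standard lossy coding rate R(Z, eps) = 1/2 logdet(I + d/(n eps^2) Z^T Z) over n d-dimensional latents, scores each position by its marginal coding-rate gain, and keeps the k highest-gain positions. The marginal-gain scoring and function names are this report's illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate R(Z, eps) = 0.5 * logdet(I + d/(n*eps^2) * Z.T @ Z)
    for n latent vectors of dimension d stacked as the rows of Z."""
    n, d = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * (Z.T @ Z))
    return 0.5 * logdet

def topk_chunk_boundaries(Z, k, eps=0.5):
    """Score each position by the marginal coding-rate gain of appending it
    to the prefix, then keep the k highest-gain positions as boundaries.
    Top-k selection fixes the number of promoted positions, keeping the
    computation graph static."""
    gains = np.empty(len(Z))
    prev = 0.0
    for t in range(len(Z)):
        cur = coding_rate(Z[: t + 1], eps)
        gains[t] = cur - prev
        prev = cur
    return np.sort(np.argsort(gains)[-k:]), gains
```

Positions whose latents add directions not already spanned by the prefix raise the log-determinant most, so boundaries land where new information enters the stream.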
End-to-end tokenizer-free modeling paradigm

The authors establish a new modeling paradigm that eliminates the traditional static tokenization stage entirely, replacing it with an end-to-end learnable segmentation process integrated directly into the neural network's forward computation.

10 retrieved papers · Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ByteFlow Net architecture with compression-driven segmentation


Contribution

Coding-rate-based chunking criterion for dynamic segmentation


Contribution

End-to-end tokenizer-free modeling paradigm

