StochasTok: Improving Fine-Grained Subword Understanding in LLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: language models, tokenization, pretraining, finetuning, subword understanding
Abstract:

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionately with seemingly simple subword-level tasks, like counting the number of 'r's in 'strawberry'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces StochasTok, a stochastic tokenization scheme that randomly splits tokens during training to expose internal word structure. Within the taxonomy, it occupies the 'Stochastic and Dynamic Tokenization Methods' leaf under 'Tokenization Architecture and Representation Design'. Notably, this leaf contains only one paper (the original work itself), indicating a relatively sparse research direction. The broader parent category includes four leaves addressing alternative tokenization architectures, suggesting that stochastic approaches represent a less-explored avenue compared to character-level modeling or hierarchical designs.

The taxonomy reveals neighboring work in character-level modeling (three papers), hierarchical architectures (four papers), and continuous representations (two papers). StochasTok diverges from these by maintaining subword tokenization while introducing controlled randomness, rather than abandoning tokens entirely or layering multiple granularities. The 'Subword Tokenization Analysis and Evaluation' branch (seventeen papers across four leaves) documents extensive empirical work on tokenization failures—the 'strawberry problem' and compositional breakdowns—that motivate StochasTok's design. The taxonomy's scope notes clarify that stochastic methods differ from post-hoc token manipulation (excluded to 'Token-Level Optimization') and static schemes (excluded to 'Subword Tokenization Analysis').

Among the twenty-nine candidates examined, the contribution-level analysis shows varied novelty signals. For the core StochasTok scheme (Contribution A), nine candidates were examined with zero refutations, suggesting limited direct prior work on stochastic token splitting. For the pretraining demonstration (Contribution B), ten candidates were examined, also with zero refutations, indicating that empirical validation of stochastic tokenization on subword tasks appears underexplored. For the post-training application (Contribution C), however, ten candidates were examined and one refutable match was found, suggesting that adapting pretrained models via tokenization modifications has some precedent within the limited search scope.

Based on the top-29 semantic matches and taxonomy structure, StochasTok appears to occupy a relatively novel position within stochastic tokenization methods. The single-paper leaf and low refutation rates across contributions suggest limited direct overlap, though the analysis does not cover exhaustive literature beyond these candidates. The taxonomy context indicates that while tokenization challenges are well-documented (seventeen analysis papers), stochastic solutions remain less developed compared to architectural alternatives like character-level or hierarchical approaches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Paper: 1

Research Landscape Overview

Core task: improving subword-level understanding in large language models. The field addresses fundamental challenges in how LLMs process and represent text at the subword level, and the research organizes into four main branches. Tokenization Architecture and Representation Design explores novel ways to construct and organize token representations, including stochastic and dynamic methods that move beyond fixed vocabularies (e.g., StochasTok[0], Tokenskip[1]) as well as hierarchical and byte-level approaches (SpaceByte[10], Hierarchical Autoregressive Transformers[8]). Subword Tokenization Analysis and Evaluation investigates how existing tokenization schemes affect model behavior, examining issues like the "strawberry problem" (Strawberry Problem[2]) and compositional failures (Tokenization Falling Short[3], Subword Compositionality Understanding[37]). Token-Level Optimization and Manipulation focuses on techniques for pruning, compressing, or selectively modifying tokens during inference or training (Dynamic Token Pruning[21], Toxic Subword Pruning[11]). Token-Level Phenomena and Theoretical Foundations studies the underlying mechanisms and theoretical properties of subword processing, including attention patterns and semantic information flow.

Several active lines of work reveal key trade-offs in the field. One tension involves balancing flexibility with computational efficiency: dynamic tokenization methods promise better adaptability to diverse inputs but introduce overhead, while fixed vocabularies remain efficient yet brittle on edge cases. Another contrast appears between analysis-focused studies that diagnose tokenization failures (Tokenization Falling Short[3], Duplicate Subwords Effect[13]) and intervention-focused work that proposes architectural solutions.

StochasTok[0] sits within the stochastic and dynamic tokenization cluster, emphasizing probabilistic token selection as a way to improve robustness. Compared to deterministic dynamic approaches like Tokenskip[1], which selectively skips tokens, StochasTok[0] introduces randomness to explore multiple segmentation possibilities. This positions it as a middle ground between fully fixed tokenization and more radical architectural redesigns, aiming to enhance subword understanding through controlled stochasticity rather than a complete vocabulary overhaul.

Claimed Contributions

StochasTok: A simple stochastic tokenization scheme

The authors propose StochasTok, a stochastic tokenization method that randomly splits tokens into equivalent pairs of smaller tokens during training. This approach allows language models to observe the fine-grained morphological structure of words, improving subword-level understanding while maintaining compatibility with any base tokenizer.
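The core operation described above — replacing a token, with some probability, by an equivalent pair of smaller in-vocabulary tokens — can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_split`, the toy vocabulary, and the split-probability parameter `p` are all illustrative assumptions.

```python
import random

def stochastic_split(token_ids, vocab, inv_vocab, p=0.1, rng=None):
    """Illustrative StochasTok-style splitting (not the paper's code).

    With probability p, replace a token with a randomly chosen pair of
    smaller tokens whose concatenation reproduces the original text, so
    the underlying string is unchanged and only its segmentation varies.
    """
    rng = rng or random.Random(0)
    out = []
    for tid in token_ids:
        text = inv_vocab[tid]
        # All ways to cut this token's text into two pieces that are
        # both themselves in the vocabulary ("equivalent pairs").
        splits = [(text[:i], text[i:]) for i in range(1, len(text))
                  if text[:i] in vocab and text[i:] in vocab]
        if splits and rng.random() < p:
            left, right = rng.choice(splits)
            out.extend([vocab[left], vocab[right]])
        else:
            out.append(tid)
    return out
```

Because every split pair concatenates back to the original token's text, the training corpus itself is untouched; only its segmentation is resampled, which is what lets the model observe word-internal structure while remaining compatible with any base tokenizer.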

9 retrieved papers
Demonstration of improved subword understanding through pretraining

The authors demonstrate that pretraining language models with StochasTok leads to substantial improvements on various subword-level tasks such as character counting, substring identification, and multi-digit addition. Models pretrained with StochasTok achieve near-perfect accuracy on language game tasks and can grok mathematical operations.

10 retrieved papers
Post-training application to existing pretrained models

The authors show that StochasTok can be applied after pretraining to retrofit existing models with improved subword understanding. This continued pretraining approach allows models that were originally trained with deterministic tokenization to gain subword-level capabilities without requiring expensive retraining from scratch.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

StochasTok: A simple stochastic tokenization scheme

The authors propose StochasTok, a stochastic tokenization method that randomly splits tokens into equivalent pairs of smaller tokens during training. This approach allows language models to observe the fine-grained morphological structure of words, improving subword-level understanding while maintaining compatibility with any base tokenizer.

Contribution

Demonstration of improved subword understanding through pretraining

The authors demonstrate that pretraining language models with StochasTok leads to substantial improvements on various subword-level tasks such as character counting, substring identification, and multi-digit addition. Models pretrained with StochasTok achieve near-perfect accuracy on language game tasks and can grok mathematical operations.

Contribution

Post-training application to existing pretrained models

The authors show that StochasTok can be applied after pretraining to retrofit existing models with improved subword understanding. This continued pretraining approach allows models that were originally trained with deterministic tokenization to gain subword-level capabilities without requiring expensive retraining from scratch.