StochasTok: Improving Fine-Grained Subword Understanding in LLMs

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: language models, tokenization, pretraining, finetuning, subword understanding
Abstract:

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionately with seemingly simple subword-level tasks, like counting the number of 'r's in 'strawberry'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces StochasTok, a stochastic tokenization scheme that randomly splits tokens during training to expose internal word structure. Within the taxonomy, it occupies the 'Stochastic and Dynamic Tokenization Methods' leaf under 'Tokenization Architecture and Representation Design'. Notably, this leaf contains only one paper (the original work itself), indicating a relatively sparse research direction. The broader parent category includes four leaves addressing alternative tokenization architectures, suggesting that stochastic approaches represent a less-explored avenue compared to character-level modeling or hierarchical designs.

The taxonomy reveals neighboring work in character-level modeling (three papers), hierarchical architectures (four papers), and continuous representations (two papers). StochasTok diverges from these by maintaining subword tokenization while introducing controlled randomness, rather than abandoning tokens entirely or layering multiple granularities. The 'Subword Tokenization Analysis and Evaluation' branch (seventeen papers across four leaves) documents extensive empirical work on tokenization failures—the 'strawberry problem' and compositional breakdowns—that motivate StochasTok's design. The taxonomy's scope notes clarify that stochastic methods differ from post-hoc token manipulation (excluded to 'Token-Level Optimization') and static schemes (excluded to 'Subword Tokenization Analysis').

Among the twenty-nine candidates examined, the contribution-level analysis shows varied novelty signals. For the core StochasTok scheme (Contribution A), nine candidates were examined with zero refutations, suggesting limited direct prior work on stochastic token splitting. For the pretraining demonstration (Contribution B), ten candidates were examined, also with zero refutations, indicating that empirical validation of stochastic tokenization on subword tasks appears underexplored. For the post-training application (Contribution C), however, ten candidates were examined and one refutable match was found, suggesting that adapting pretrained models via tokenization modifications has some precedent within the limited search scope.

Based on the top-29 semantic matches and taxonomy structure, StochasTok appears to occupy a relatively novel position within stochastic tokenization methods. The single-paper leaf and low refutation rates across contributions suggest limited direct overlap, though the analysis does not cover exhaustive literature beyond these candidates. The taxonomy context indicates that while tokenization challenges are well-documented (seventeen analysis papers), stochastic solutions remain less developed compared to architectural alternatives like character-level or hierarchical approaches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Paper: 1

Research Landscape Overview

Core task: improving subword-level understanding in large language models. The field addresses fundamental challenges in how LLMs process and represent text at the subword level, and the research organizes into four main branches. Tokenization Architecture and Representation Design explores novel ways to construct and organize token representations, including stochastic and dynamic methods that move beyond fixed vocabularies (e.g., StochasTok[0], Tokenskip[1]) as well as hierarchical and byte-level approaches (SpaceByte[10], Hierarchical Autoregressive Transformers[8]). Subword Tokenization Analysis and Evaluation investigates how existing tokenization schemes affect model behavior, examining issues like the "strawberry problem" (Strawberry Problem[2]) and compositional failures (Tokenization Falling Short[3], Subword Compositionality Understanding[37]). Token-Level Optimization and Manipulation focuses on techniques for pruning, compressing, or selectively modifying tokens during inference or training (Dynamic Token Pruning[21], Toxic Subword Pruning[11]). Token-Level Phenomena and Theoretical Foundations studies the underlying mechanisms and theoretical properties of subword processing, including attention patterns and semantic information flow.

Several active lines of work reveal key trade-offs in the field. One tension involves balancing flexibility with computational efficiency: dynamic tokenization methods promise better adaptability to diverse inputs but introduce overhead, while fixed vocabularies remain efficient yet brittle on edge cases. Another contrast appears between analysis-focused studies that diagnose tokenization failures (Tokenization Falling Short[3], Duplicate Subwords Effect[13]) and intervention-focused work that proposes architectural solutions.

StochasTok[0] sits within the stochastic and dynamic tokenization cluster, emphasizing probabilistic token selection as a way to improve robustness. Compared to deterministic dynamic approaches like Tokenskip[1], which selectively skips tokens, StochasTok[0] introduces randomness to explore multiple segmentation possibilities. This positions it as a middle ground between fully fixed tokenization and more radical architectural redesigns, aiming to enhance subword understanding through controlled stochasticity rather than a complete vocabulary overhaul.

Claimed Contributions

StochasTok: A simple stochastic tokenization scheme

The authors propose StochasTok, a stochastic tokenization method that randomly splits tokens into equivalent pairs of smaller tokens during training. This approach allows language models to observe the fine-grained morphological structure of words, improving subword-level understanding while maintaining compatibility with any base tokenizer.
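The core operation described above — replacing a token, with some probability, by an equivalent pair of smaller in-vocabulary tokens — can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_split`, the toy vocabulary, and the split-probability parameter `p` are all illustrative assumptions.

```python
import random

def stochastic_split(token_ids, vocab, inv_vocab, p=0.1, rng=None):
    """Illustrative StochasTok-style splitting (not the paper's code).

    With probability p, replace a token with a randomly chosen pair of
    smaller tokens whose concatenation reproduces the original text, so
    the underlying string is unchanged and only its segmentation varies.
    """
    rng = rng or random.Random(0)
    out = []
    for tid in token_ids:
        text = inv_vocab[tid]
        # All ways to cut this token's text into two pieces that are
        # both themselves in the vocabulary ("equivalent pairs").
        splits = [(text[:i], text[i:]) for i in range(1, len(text))
                  if text[:i] in vocab and text[i:] in vocab]
        if splits and rng.random() < p:
            left, right = rng.choice(splits)
            out.extend([vocab[left], vocab[right]])
        else:
            out.append(tid)
    return out
```

Because every split pair concatenates back to the original token's text, the training corpus itself is untouched; only its segmentation is resampled, which is what lets the model observe word-internal structure while remaining compatible with any base tokenizer.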

9 retrieved papers
Demonstration of improved subword understanding through pretraining

The authors demonstrate that pretraining language models with StochasTok leads to substantial improvements on various subword-level tasks such as character counting, substring identification, and multi-digit addition. Models pretrained with StochasTok achieve near-perfect accuracy on language game tasks and can grok mathematical operations.

10 retrieved papers
Post-training application to existing pretrained models

The authors show that StochasTok can be applied after pretraining to retrofit existing models with improved subword understanding. This continued pretraining approach allows models that were originally trained with deterministic tokenization to gain subword-level capabilities without requiring expensive retraining from scratch.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

StochasTok: A simple stochastic tokenization scheme

The authors propose StochasTok, a stochastic tokenization method that randomly splits tokens into equivalent pairs of smaller tokens during training. This approach allows language models to observe the fine-grained morphological structure of words, improving subword-level understanding while maintaining compatibility with any base tokenizer.

Contribution

Demonstration of improved subword understanding through pretraining

The authors demonstrate that pretraining language models with StochasTok leads to substantial improvements on various subword-level tasks such as character counting, substring identification, and multi-digit addition. Models pretrained with StochasTok achieve near-perfect accuracy on language game tasks and can grok mathematical operations.

Contribution

Post-training application to existing pretrained models

The authors show that StochasTok can be applied after pretraining to retrofit existing models with improved subword understanding. This continued pretraining approach allows models that were originally trained with deterministic tokenization to gain subword-level capabilities without requiring expensive retraining from scratch.