Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Discrete Diffusion, Instruction Tuning, NLP
Abstract:

Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive models, offering flexible generation orders and strong performance on complex reasoning tasks. However, instruction-tuned dLLMs exhibit a critical vulnerability we term <eos> overflow: as allocated sequence length increases, responses paradoxically become shorter, collapsing into early termination or degenerating into streams of <eos> tokens. Although noticed in practice, this issue has not been systematically analyzed. We trace its root cause to the dual role of <eos> as both termination and padding, which concentrates probability mass on <eos> at later positions and propagates backward to trigger early termination. To address this, we introduce Rainbow Padding, a simple remedy that replaces repeated <eos> placeholders with a repeating cycle of distinct padding tokens, distributing probability mass and breaking <eos> dominance. Experiments show that Rainbow Padding substantially improves length robustness and output quality, with as few as seven padding tokens sufficient to prevent early termination. Moreover, the method integrates efficiently into existing instruction-tuned models: LoRA fine-tuning for a single epoch on minimal data yields significant improvements, making this solution highly practical.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper identifies and addresses a specific failure mode in instruction-tuned diffusion language models, termed '<eos> overflow,' where longer allocated sequence lengths paradoxically trigger shorter outputs or degenerate token streams. Within the taxonomy, this work occupies the 'Early Termination Mitigation via Padding Strategies' leaf under 'Decoding Optimization for Diffusion Language Models.' Notably, this leaf contains only the original paper itself, with no sibling papers, indicating a sparse and potentially underexplored research direction within the broader diffusion LLM ecosystem.

The taxonomy reveals that decoding optimization for diffusion LLMs encompasses two distinct approaches: early termination mitigation (this paper's focus) and confidence-based early exit for faster inference. The sibling leaf 'Fast Decoding via Confidence-Based Early Exit' addresses computational efficiency rather than quality degradation, highlighting a complementary but separate research trajectory. Neighboring branches address architectural design, instruction tuning data generation, and training-time optimization, but none directly tackle the padding-induced pathologies that this paper examines. The taxonomy's scope notes explicitly distinguish inference-time decoding challenges from training-time or architectural interventions, positioning this work as an inference-specific remedy.

Among seven candidates examined across three contributions, no refutable prior work was identified. The core contribution—identifying '<eos> overflow'—examined three candidates with zero refutations, while the Rainbow Padding method examined four candidates, also with zero refutations. The analysis of confidence-based decoding amplification examined no candidates. This limited search scope (seven total candidates from top-K semantic search) suggests the literature review captured closely related diffusion LLM work but may not have exhaustively covered all padding or termination strategies in broader sequence generation contexts. The absence of refutations across all contributions indicates potential novelty within the examined candidate set.

Based on the limited search scope and taxonomy structure, the work appears to address a previously uncharacterized failure mode in a sparse research area. The single-paper leaf and zero refutations among seven candidates suggest the specific problem formulation and solution may be novel within the diffusion LLM literature examined. However, the small candidate pool and narrow taxonomy coverage leave open the possibility of related work in adjacent domains not captured by this analysis.

Taxonomy

Core-task Taxonomy Papers: 5
Claimed Contributions: 3
Contribution Candidate Papers Compared: 7
Refutable Papers: 0

Research Landscape Overview

Core task: mitigating early termination in instruction-tuned diffusion language models. The field structure reflects a maturing ecosystem around diffusion-based text generation, organized into four main branches:

1. Diffusion Language Model Architectures and Training: foundational design choices and pretraining strategies that enable diffusion processes to operate over discrete token spaces, as surveyed in works like Diffusion LLMs Survey[2].
2. Instruction Tuning Data Generation and Methodology: creating high-quality instruction-response datasets, drawing on techniques from Instruction Tuning GPT-4[1] and Self-Instruct Early Stopping[3].
3. Instruction Tuning Optimization and Evaluation: aligning diffusion models with human preferences and measuring their instruction-following capabilities.
4. Decoding Optimization for Diffusion Language Models: inference-time challenges such as sampling efficiency and sequence length control, with contributions like Fast Decoding Diffusion[5] and Textcraftor[4].

A particularly active line of work centers on decoding optimization, where researchers grapple with trade-offs between generation quality, computational cost, and controllability. Early termination, in which diffusion models prematurely halt generation, emerges as a critical bottleneck, especially after instruction tuning. Rainbow Padding[0] sits squarely within the Decoding Optimization branch, specifically targeting early termination mitigation via padding strategies. While Fast Decoding Diffusion[5] emphasizes accelerating sampling through fewer diffusion steps, Rainbow Padding[0] addresses a complementary problem: ensuring that instruction-tuned models generate complete, well-formed responses by strategically managing sequence padding. This contrasts with architectural or training-time interventions: Rainbow Padding offers a lightweight remedy that preserves the benefits of instruction tuning and requires only minimal fine-tuning rather than full retraining.

Claimed Contributions

Identification and analysis of <eos> overflow failure mode in instruction-tuned diffusion LLMs

The authors systematically identify and characterize a critical failure mode called <eos> overflow, where instruction-tuned diffusion LLMs paradoxically produce shorter responses when allocated longer generation budgets. They trace this to the dual use of <eos> as both termination marker and padding token, which creates positional bias amplified by confidence-based decoding.

3 retrieved papers
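To make the failure mode concrete, here is a minimal sketch (our illustration, not the authors' code) of the conventional padding scheme the paper criticizes: the response is filled to the full generation budget with the same <eos> token, so at longer budgets most supervised positions are <eos>, concentrating probability mass on that single token.

```python
EOS = "<eos>"

def pad_with_eos(response_tokens, budget):
    """Pad a tokenized response to `budget` positions using <eos> only."""
    assert len(response_tokens) <= budget
    return response_tokens + [EOS] * (budget - len(response_tokens))

example = pad_with_eos(["The", "answer", "is", "42"], budget=12)
# With a 12-token budget, 8 of the 12 supervised positions are <eos>;
# the fraction grows as the allocated budget grows.
eos_fraction = example.count(EOS) / len(example)
```

The larger the budget relative to the typical response, the more the training signal is dominated by <eos>, which is the positional bias the contribution identifies.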
Analysis of how confidence-based decoding amplifies padding-induced bias

The authors provide a mechanistic analysis demonstrating how adaptive decoding strategies interact with padding-induced positional bias to create cascading <eos> predictions that propagate backward through sequences, and show how cyclic padding patterns can interrupt this cascade.

0 retrieved papers
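The cascade described above can be illustrated with a toy numerical simulation (our construction under stated assumptions, not the paper's analysis): a confidence-based decoder commits the single most confident masked position each step; padding-heavy training is modeled as an <eos> prior that grows toward the end of the sequence, and each committed <eos> boosts the <eos> confidence of the position just before it, mimicking the backward propagation.

```python
def simulate(n=8, content_conf=0.6, boost=0.35):
    """Toy confidence-based decoder with a padding-induced <eos> bias."""
    eos_conf = [0.2 + 0.08 * i for i in range(n)]  # <eos> prior rises with position
    committed, order = {}, []
    while len(committed) < n:
        # Commit the most confident remaining position.
        i = max((p for p in range(n) if p not in committed),
                key=lambda p: max(eos_conf[p], content_conf))
        tok = "<eos>" if eos_conf[i] >= content_conf else "tok"
        committed[i] = tok
        order.append(i)
        if tok == "<eos>" and i > 0 and i - 1 not in committed:
            eos_conf[i - 1] += boost  # cascade onto the left neighbor
    return order, [committed[p] for p in range(n)]

order, tokens = simulate()
# Without the boost, only the last three positions favor <eos>; with it,
# the cascade sweeps backward and all but the first position collapse to <eos>.
```

The same mechanism suggests why a cyclic padding pattern interrupts the cascade: if neighboring padded positions are trained on distinct tokens, committing one padding token no longer reinforces <eos> at the position before it.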
Rainbow Padding method for mitigating early termination

The authors introduce Rainbow Padding, a simple modification to the padding scheme that uses a cyclic sequence of distinct padding tokens instead of repeated <eos> tokens. This approach decouples termination from padding, distributes probability mass across multiple tokens, and can be efficiently integrated into existing instruction-tuned models through minimal fine-tuning.

4 retrieved papers
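The cyclic scheme can be sketched as follows (a minimal illustration: the pad-token names and the single-<eos>-terminator layout are our assumptions, consistent with the paper's decoupling of termination from padding): the tail of the sequence is filled with a repeating cycle of distinct padding tokens, so no single token dominates late positions, and <eos> appears once, purely as a terminator.

```python
from itertools import cycle, islice

# Seven distinct pad tokens, matching the paper's finding that as few
# as seven suffice to prevent early termination.
PAD_CYCLE = [f"<pad{i}>" for i in range(7)]

def rainbow_pad(response_tokens, budget):
    """Terminate with one <eos>, then fill with a cycle of distinct pads."""
    body = response_tokens + ["<eos>"]
    assert len(body) <= budget
    tail = list(islice(cycle(PAD_CYCLE), budget - len(body)))
    return body + tail

padded = rainbow_pad(["The", "answer", "is", "42"], budget=16)
# one <eos>, then <pad0>, <pad1>, ... cycling until the budget is filled
```

Because the padded tail now spreads probability mass over seven tokens instead of concentrating it on <eos>, no single late-position token can dominate confidence-based decoding.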

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

The three claimed contributions are described above under Claimed Contributions. For each, the retrieved candidates were compared without yielding a refutation: three candidates for the identification of the <eos> overflow failure mode, zero candidates for the analysis of confidence-based decoding amplification, and four candidates for the Rainbow Padding method.
The authors introduce Rainbow Padding, a simple modification to the padding scheme that uses a cyclic sequence of distinct padding tokens instead of repeated <eos> tokens. This approach decouples termination from padding, distributes probability mass across multiple tokens, and can be efficiently integrated into existing instruction-tuned models through minimal fine-tuning.