Abstract:

Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on "teaching", which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an incentivization-based reinforcement learning approach to enable ultra-long text generation without relying on synthetic supervised fine-tuning data. It sits within the 'Reinforcement Learning for Long Generation' leaf of the taxonomy, which contains only two papers total. This is a notably sparse research direction compared to neighboring areas like supervised fine-tuning or planning-based methods, suggesting the RL-driven path for ultra-long generation remains relatively underexplored. The work positions itself as starting from a base model and using specialized reward models to guide length control, writing quality, and structural formatting.

The taxonomy reveals several neighboring approaches to ultra-long output generation. Supervised fine-tuning methods (LongWriter and others) rely on curated long-form training data, while planning and decomposition strategies break generation into subtasks. Recurrent and iterative generation methods simulate refinement loops. The paper's RL approach diverges from these by avoiding synthetic data dependency and instead using reward signals to incentivize desired behaviors. The broader 'Ultra-Long Output Generation Methods' branch shows multiple pathways to the same goal, with this work occupying the less-populated RL corner alongside one sibling paper.

Among the 22 candidates examined across the three contributions, the analysis found limited prior-work overlap. For the core RL-without-synthetic-data contribution, 10 candidates were examined and none clearly refutes it. For the composite reward function, 2 candidates were examined, 1 of which appears to provide overlapping prior work. For the model performance claim, 10 candidates were examined with no refutations found. This suggests that, within the limited search scope, the primary methodological novelty appears relatively intact, though the reward modeling component has at least one relevant predecessor. The small candidate pool (22 total) means these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the limited literature search, the work appears to occupy a sparsely populated research direction with modest prior work overlap in its core contributions. The taxonomy context shows this is one of only two papers in its specific leaf, though the broader field of ultra-long generation offers multiple alternative approaches. The analysis covers top semantic matches and does not claim comprehensive field coverage, so additional related work may exist beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Paper: 1

Research Landscape Overview

Core task: ultra-long text generation by large language models. The field has evolved into several interconnected branches addressing different facets of producing and managing extended outputs. Context Window Extension and Length Generalization focuses on enabling models to handle longer inputs through techniques like positional encoding modifications (e.g., LongRoPE[42]) and training strategies that push beyond standard sequence limits. Ultra-Long Output Generation Methods explores mechanisms for producing coherent multi-thousand-token outputs, including reinforcement learning approaches and iterative generation schemes. Evaluation and Benchmarking develops metrics and test suites (such as HelloBench[2] and LongGenBench[48]) to assess both factual accuracy and structural coherence in long-form responses. Meanwhile, Inference Optimization and Efficiency tackles the computational bottlenecks of processing extended contexts, Retrieval-Augmented and Memory-Enhanced Approaches integrate external knowledge to support sustained generation, and Domain-Specific Long-Text Applications adapt these methods to tasks like storytelling or technical documentation. Foundational Analysis and Surveys provide overarching perspectives, while Model Compression and Distillation aim to make long-context capabilities more accessible.

Within Ultra-Long Output Generation Methods, a small cluster of works employs reinforcement learning to guide models toward producing longer, higher-quality outputs without sacrificing coherence or factuality. LongWriter-Zero[0] exemplifies this direction by using RL to train models to generate extended documents without any annotated or synthetic SFT data (the "Zero" setting, in the spirit of R1-Zero), addressing the challenge of maintaining consistency across many paragraphs. This approach contrasts with earlier iterative or planning-based methods like RecurrentGPT[35], which decompose generation into sequential steps, and complements supervised fine-tuning strategies seen in LongWriter[34]. Nearby, UloRL[33] similarly leverages reinforcement learning for ultra-long output, exploring reward shaping and policy optimization to balance length with content quality. The central tension across these efforts lies in balancing fluency, factual grounding (as measured by frameworks like FActScore[29]), and computational cost, with LongWriter-Zero[0] positioned among RL-driven techniques that seek to automate length extension while preserving semantic integrity.

Claimed Contributions

Incentivization-based RL approach for ultra-long text generation without synthetic data

The authors introduce a novel framework that uses reinforcement learning exclusively to enable large language models to generate ultra-long, high-quality text without depending on manually curated or synthetically generated supervised fine-tuning datasets. This approach addresses limitations of traditional SFT methods by optimizing for long-range objectives through reward signals.
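The report does not spell out the RL recipe beyond "optimizing through reward signals" from a base model. As a hypothetical illustration only, the sketch below shows a critic-free, group-relative advantage estimate of the kind commonly used for reward-model-only RL training (e.g., GRPO-style schemes); the function name and normalization details are illustrative assumptions, not the paper's implementation.

```python
def group_relative_advantages(rewards):
    """Given scalar reward-model scores for a group of completions
    sampled from the same prompt, return per-completion advantages
    normalized by the group mean and standard deviation. Because no
    learned value function (critic) is needed, reward-model scores
    alone can drive the policy-gradient update."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:  # all completions scored equally: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Completions scoring above their group's mean receive positive advantages (reinforced) and those below receive negative ones, which is how reward signals alone can incentivize longer, higher-quality outputs without any SFT targets.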

10 retrieved papers
Composite reward function with specialized reward models

The authors design a composite reward function integrating multiple reward models, each targeting distinct aspects of writing quality: a Length RM for appropriate output length, a Writing RM for holistic quality, and a Format RM for structural integrity. These components are normalized and aggregated to provide balanced learning signals for long-form generation.
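The report names the three reward models but not how their scores are combined. The following minimal sketch shows one plausible normalize-then-aggregate scheme: the three-RM decomposition (Length, Writing, Format) comes from the paper, while the min-max normalization, equal default weights, and all function names are illustrative assumptions.

```python
def normalize(scores):
    """Min-max normalize a batch of raw reward-model scores to [0, 1]
    so that components on different scales contribute comparably."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant batch carries no ranking information
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def composite_rewards(length_scores, writing_scores, format_scores,
                      weights=(1.0, 1.0, 1.0)):
    """Aggregate the three normalized reward streams into one scalar
    per sample via a weighted average, yielding a balanced learning
    signal across length control, writing quality, and formatting."""
    wl, ww, wf = weights
    total = wl + ww + wf
    components = zip(normalize(length_scores),
                     normalize(writing_scores),
                     normalize(format_scores))
    return [(wl * l + ww * w + wf * f) / total for l, w, f in components]
```

Normalizing each component before aggregation prevents any single reward model, whose raw scores may span a wider numeric range, from dominating the combined signal.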

2 retrieved papers
Can Refute
LongWriter-Zero model achieving state-of-the-art performance

The authors develop LongWriter-Zero by applying their RL framework to Qwen2.5-32B with continual pretraining and explicit reasoning steps. The model demonstrates superior performance over both SFT baselines and larger models on established benchmarks, establishing a new paradigm for scalable ultra-long text generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Incentivization-based RL approach for ultra-long text generation without synthetic data

The authors introduce a novel framework that uses reinforcement learning exclusively to enable large language models to generate ultra-long, high-quality text without depending on manually curated or synthetically generated supervised fine-tuning datasets. This approach addresses limitations of traditional SFT methods by optimizing for long-range objectives through reward signals.

Contribution

Composite reward function with specialized reward models

The authors design a composite reward function integrating multiple reward models, each targeting distinct aspects of writing quality: a Length RM for appropriate output length, a Writing RM for holistic quality, and a Format RM for structural integrity. These components are normalized and aggregated to provide balanced learning signals for long-form generation.

Contribution

LongWriter-Zero model achieving state-of-the-art performance

The authors develop LongWriter-Zero by applying their RL framework to Qwen2.5-32B with continual pretraining and explicit reasoning steps. The model demonstrates superior performance over both SFT baselines and larger models on established benchmarks, establishing a new paradigm for scalable ultra-long text generation.