Abstract:

Ultra-long generation by large language models (LLMs) is a widely demanded scenario, yet it remains a significant challenge due to their maximum generation length limit and overall quality degradation as sequence length increases. Previous approaches, exemplified by LongWriter, typically rely on "teaching", which involves supervised fine-tuning (SFT) on synthetic long-form outputs. However, this strategy heavily depends on synthetic SFT data, which is difficult and costly to construct, often lacks coherence and consistency, and tends to be overly artificial and structurally monotonous. In this work, we propose an incentivization-based approach that, starting entirely from scratch and without relying on any annotated or synthetic data, leverages reinforcement learning (RL) to foster the emergence of ultra-long, high-quality text generation capabilities in LLMs. We perform RL training starting from a base model, similar to R1-Zero, guiding it to engage in reasoning that facilitates planning and refinement during the writing process. To support this, we employ specialized reward models that steer the LLM towards improved length control, writing quality, and structural formatting. Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and Qwen3-235B.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an incentivization-based reinforcement learning approach to enable ultra-long text generation without relying on synthetic supervised fine-tuning data. It sits within the 'Reinforcement Learning for Long Generation' leaf of the taxonomy, which contains only two papers total. This is a notably sparse research direction compared to neighboring areas like supervised fine-tuning or planning-based methods, suggesting the RL-driven path for ultra-long generation remains relatively underexplored. The work positions itself as starting from a base model and using specialized reward models to guide length control, writing quality, and structural formatting.

The taxonomy reveals several neighboring approaches to ultra-long output generation. Supervised fine-tuning methods (LongWriter and others) rely on curated long-form training data, while planning and decomposition strategies break generation into subtasks. Recurrent and iterative generation methods simulate refinement loops. The paper's RL approach diverges from these by avoiding synthetic data dependency and instead using reward signals to incentivize desired behaviors. The broader 'Ultra-Long Output Generation Methods' branch shows multiple pathways to the same goal, with this work occupying the less-populated RL corner alongside one sibling paper.

Among the 22 candidates examined across the three contributions, the analysis found limited prior-work overlap. For the core RL-without-synthetic-data contribution, 10 candidates were examined and none clearly refutes it. For the composite reward function, 2 candidates were examined, 1 of which appears to provide overlapping prior work. For the model performance claim, 10 candidates were examined with no refutations found. This suggests that, within the limited search scope, the primary methodological novelty appears relatively intact, though the reward modeling component has at least one relevant predecessor. The small candidate pool (22 total) means these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the limited literature search, the work appears to occupy a sparsely populated research direction with modest prior work overlap in its core contributions. The taxonomy context shows this is one of only two papers in its specific leaf, though the broader field of ultra-long generation offers multiple alternative approaches. The analysis covers top semantic matches and does not claim comprehensive field coverage, so additional related work may exist beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Paper: 1

Research Landscape Overview

Core task: ultra-long text generation by large language models. The field has evolved into several interconnected branches addressing different facets of producing and managing extended outputs. Context Window Extension and Length Generalization focuses on enabling models to handle longer inputs through techniques like positional encoding modifications (e.g., LongRoPE[42]) and training strategies that push beyond standard sequence limits. Ultra-Long Output Generation Methods explores mechanisms for producing coherent multi-thousand-token outputs, including reinforcement learning approaches and iterative generation schemes. Evaluation and Benchmarking develops metrics and test suites (such as HelloBench[2] and LongGenBench[48]) to assess both factual accuracy and structural coherence in long-form responses. Meanwhile, Inference Optimization and Efficiency tackles the computational bottlenecks of processing extended contexts, Retrieval-Augmented and Memory-Enhanced Approaches integrate external knowledge to support sustained generation, and Domain-Specific Long-Text Applications adapt these methods to tasks like storytelling or technical documentation. Foundational Analysis and Surveys provide overarching perspectives, while Model Compression and Distillation aim to make long-context capabilities more accessible.

Within Ultra-Long Output Generation Methods, a small cluster of works employs reinforcement learning to guide models toward producing longer, higher-quality outputs without sacrificing coherence or factuality. LongWriter-Zero[0] exemplifies this direction by using RL to train models to generate extended documents without any annotated or synthetic SFT data (the "Zero" setting, in the spirit of R1-Zero), addressing the challenge of maintaining consistency across many paragraphs. This approach contrasts with earlier iterative or planning-based methods like RecurrentGPT[35], which decompose generation into sequential steps, and complements supervised fine-tuning strategies seen in LongWriter[34]. Nearby, UloRL[33] similarly leverages reinforcement learning for ultra-long output, exploring reward shaping and policy optimization to balance length with content quality. The central tension across these efforts lies in balancing fluency, factual grounding (as measured by frameworks like FActScore[29]), and computational cost, with LongWriter-Zero[0] positioned among RL-driven techniques that seek to automate length extension while preserving semantic integrity.

Claimed Contributions

Incentivization-based RL approach for ultra-long text generation without synthetic data

The authors introduce a novel framework that uses reinforcement learning exclusively to enable large language models to generate ultra-long, high-quality text without depending on manually curated or synthetically generated supervised fine-tuning datasets. This approach addresses limitations of traditional SFT methods by optimizing for long-range objectives through reward signals.
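The report does not spell out the RL recipe beyond "optimizing through reward signals" from a base model. As a hypothetical illustration only, the sketch below shows a critic-free, group-relative advantage estimate of the kind commonly used for reward-model-only RL training (e.g., GRPO-style schemes); the function name and normalization details are illustrative assumptions, not the paper's implementation.

```python
def group_relative_advantages(rewards):
    """Given scalar reward-model scores for a group of completions
    sampled from the same prompt, return per-completion advantages
    normalized by the group mean and standard deviation. Because no
    learned value function (critic) is needed, reward-model scores
    alone can drive the policy-gradient update."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:  # all completions scored equally: no learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Completions scoring above their group's mean receive positive advantages (reinforced) and those below receive negative ones, which is how reward signals alone can incentivize longer, higher-quality outputs without any SFT targets.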

10 retrieved papers
Composite reward function with specialized reward models

The authors design a composite reward function integrating multiple reward models, each targeting distinct aspects of writing quality: a Length RM for appropriate output length, a Writing RM for holistic quality, and a Format RM for structural integrity. These components are normalized and aggregated to provide balanced learning signals for long-form generation.
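The report names the three reward models but not how their scores are combined. The following minimal sketch shows one plausible normalize-then-aggregate scheme: the three-RM decomposition (Length, Writing, Format) comes from the paper, while the min-max normalization, equal default weights, and all function names are illustrative assumptions.

```python
def normalize(scores):
    """Min-max normalize a batch of raw reward-model scores to [0, 1]
    so that components on different scales contribute comparably."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # constant batch carries no ranking information
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def composite_rewards(length_scores, writing_scores, format_scores,
                      weights=(1.0, 1.0, 1.0)):
    """Aggregate the three normalized reward streams into one scalar
    per sample via a weighted average, yielding a balanced learning
    signal across length control, writing quality, and formatting."""
    wl, ww, wf = weights
    total = wl + ww + wf
    components = zip(normalize(length_scores),
                     normalize(writing_scores),
                     normalize(format_scores))
    return [(wl * l + ww * w + wf * f) / total for l, w, f in components]
```

Normalizing each component before aggregation prevents any single reward model, whose raw scores may span a wider numeric range, from dominating the combined signal.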

2 retrieved papers
Can Refute
LongWriter-Zero model achieving state-of-the-art performance

The authors develop LongWriter-Zero by applying their RL framework to Qwen2.5-32B with continual pretraining and explicit reasoning steps. The model demonstrates superior performance over both SFT baselines and larger models on established benchmarks, establishing a new paradigm for scalable ultra-long text generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Incentivization-based RL approach for ultra-long text generation without synthetic data

The authors introduce a novel framework that uses reinforcement learning exclusively to enable large language models to generate ultra-long, high-quality text without depending on manually curated or synthetically generated supervised fine-tuning datasets. This approach addresses limitations of traditional SFT methods by optimizing for long-range objectives through reward signals.

Contribution

Composite reward function with specialized reward models

The authors design a composite reward function integrating multiple reward models, each targeting distinct aspects of writing quality: a Length RM for appropriate output length, a Writing RM for holistic quality, and a Format RM for structural integrity. These components are normalized and aggregated to provide balanced learning signals for long-form generation.

Contribution

LongWriter-Zero model achieving state-of-the-art performance

The authors develop LongWriter-Zero by applying their RL framework to Qwen2.5-32B with continual pretraining and explicit reasoning steps. The model demonstrates superior performance over both SFT baselines and larger models on established benchmarks, establishing a new paradigm for scalable ultra-long text generation.