StoryAlign: Evaluating and Training Reward Models for Story Generation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Story Generation, Story Reward, Reward Bench
Abstract:

Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works in complex narrative structure and alignment with human preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and under-explored. In this work, we systematically evaluate the modeling of human story preferences and introduce StoryRMB, the first benchmark for assessing reward models on story preferences. StoryRMB contains 1,133 high-quality, human-verified instances, each consisting of a prompt, one chosen story, and three rejected stories. We find that existing reward models struggle to select human-preferred stories, with the best model achieving only 66.3% accuracy. To address this limitation, we construct roughly 100,000 high-quality story preference pairs across diverse domains and develop StoryReward, an advanced reward model for story preference trained on this dataset. StoryReward achieves state-of-the-art (SoTA) performance on StoryRMB, outperforming much larger models. We also apply StoryReward in downstream test-time scaling for best-of-n (BoN) story selection and find that it generally chooses stories better aligned with human preferences. We will release our dataset, model, and code to facilitate future research.
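The best-of-n (BoN) selection mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_story` and `reward_score` are hypothetical stand-ins for any story generator and any preference scorer such as StoryReward.

```python
# Minimal sketch of best-of-n (BoN) story selection with a reward model.
# Both callables are hypothetical stand-ins: any sampler and any scorer
# with these signatures would work.

def best_of_n(prompt, generate_story, reward_score, n=8):
    """Sample n candidate stories and return the one the reward model scores highest."""
    candidates = [generate_story(prompt) for _ in range(n)]
    scores = [reward_score(prompt, story) for story in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

In the paper's setting the scorer would be StoryReward; the sketch works with any model that maps a (prompt, story) pair to a scalar preference score.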

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 4

Research Landscape Overview

Core task: evaluating and training reward models for story generation. The field organizes around several complementary branches that address different facets of guiding narrative systems toward human preferences. One branch focuses on reward model architecture and training methodology, exploring how to build robust scoring functions from human feedback or other signals (e.g., ImageReward[1], VisionReward[6]). A second branch examines reward-guided generation and decoding strategies, investigating how to integrate learned rewards into sampling or search procedures (e.g., Reward-Augmented Decoding[16]). A third major area is reinforcement learning for story and creative text generation, where policy optimization techniques adapt language models to produce coherent, engaging narratives (e.g., Hierarchical Story Generation[19], Recursively Summarizing Books[24]). Additional branches cover evaluation and benchmarking of reward models (e.g., RoleRMBench[47]), intermediate rewards and critique-based methods that provide step-by-step guidance (e.g., Self-Generated Critiques[49]), controlled generation and safety (e.g., Mitigating Toxicity[25]), and specialized applications ranging from visual storytelling (e.g., StoryLLaVA[22]) to domain-specific scenarios (e.g., Automated Scenario Generation[36]).

Within the reinforcement learning branch, a particularly active line of work applies RL with human feedback to story generation, balancing creativity with alignment to user preferences. StoryAlign[0] sits squarely in this cluster, emphasizing the challenge of training reward models that capture nuanced narrative quality while avoiding common pitfalls such as length biases (cf. Length Correlations[14]) or reward hacking. Nearby efforts like Learning to Reason[2] and RAGferee[3] explore how to incorporate reasoning or retrieval-augmented signals into the reward landscape, whereas BabyStories[32] investigates simpler narrative domains to isolate core alignment questions.
A central tension across these works is whether to rely on end-to-end learned rewards, composite hand-crafted signals (e.g., Composite Rewards[28]), or hybrid critique-based approaches. StoryAlign[0] contributes to this conversation by proposing methods that directly address reward model reliability in open-ended creative settings, positioning itself among studies that seek principled ways to scale human oversight for long-form, stylistically diverse narratives.

Claimed Contributions

StoryRMB benchmark for story preference evaluation

The authors present StoryRMB, a benchmark containing 1,133 high-quality, human-verified instances for evaluating how well reward models capture human story preferences. Each instance includes a prompt, one chosen story, and three rejected stories across five evaluation dimensions: coherence, creativity, characterization, fluency, and relevance.

9 retrieved papers; verdict: Can Refute
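The benchmark's format (one chosen story versus three rejected stories per instance) implies an accuracy metric in which a reward model is credited only when the chosen story outscores all three alternatives. A hedged sketch of that protocol follows; the field names and scoring interface are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of a StoryRMB-style accuracy metric: an instance counts as
# correct only if the reward model scores the chosen story above all three
# rejected alternatives. Field names ("prompt", "chosen", "rejected") are
# illustrative assumptions.

def rmb_accuracy(instances, score):
    """Fraction of instances where the chosen story outscores every rejected one."""
    correct = 0
    for inst in instances:
        chosen_score = score(inst["prompt"], inst["chosen"])
        if all(chosen_score > score(inst["prompt"], r) for r in inst["rejected"]):
            correct += 1
    return correct / len(instances)
```

Under this 1-vs-3 protocol, a random scorer would achieve about 25% accuracy, which puts the reported 66.3% best result in context.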
Automated method for collecting story preference pairs

The authors develop an automated pipeline for constructing approximately 100,000 story preference pairs from human-written stories using three methods: premise back-generation, prompt-guided rewriting, and human-guided continuation. This dataset captures real-world human preferences from online literary platforms.

10 retrieved papers; verdict: Can Refute
StoryReward advanced reward model

The authors introduce StoryReward, a reward model trained on their large-scale preference dataset that achieves state-of-the-art performance on StoryRMB. The model outperforms much larger models and demonstrates effectiveness in test-time scaling applications such as best-of-n story selection.

10 retrieved papers; verdict: Can Refute
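The report does not state StoryReward's training objective. A common choice for training reward models on (chosen, rejected) preference pairs is the Bradley-Terry pairwise loss, sketched below as one plausible formulation rather than the paper's confirmed method.

```python
import math

# Standard Bradley-Terry pairwise loss, -log(sigmoid(r_chosen - r_rejected)),
# widely used to train reward models on preference pairs. This is a common
# formulation, not necessarily the paper's exact objective.

def bradley_terry_loss(chosen_score: float, rejected_score: float) -> float:
    """Low when the chosen story outscores the rejected one by a wide margin."""
    margin = chosen_score - rejected_score
    # Numerically stable log(1 + exp(-margin)) for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Minimizing this loss over a large pair set (here, the ~100,000 pairs) pushes the model to assign higher scores to human-preferred stories, which is exactly the ranking behavior that BoN selection exploits at test time.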

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: StoryRMB benchmark for story preference evaluation
Contribution 2: Automated method for collecting story preference pairs
Contribution 3: StoryReward advanced reward model