StoryAlign: Evaluating and Training Reward Models for Story Generation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Story Generation, Story Reward, Reward Bench
Abstract:

Story generation aims to automatically produce coherent, structured, and engaging narratives. Although large language models (LLMs) have significantly advanced text generation, stories generated by LLMs still diverge from human-authored works in complex narrative structure and alignment with human preferences. A key reason is the absence of effective modeling of human story preferences, which are inherently subjective and under-explored. In this work, we systematically evaluate the modeling of human story preferences and introduce StoryRMB, the first benchmark for assessing reward models on story preferences. StoryRMB contains 1,133 high-quality, human-verified instances, each consisting of a prompt, one chosen story, and three rejected stories. We find that existing reward models struggle to select human-preferred stories, with the best model achieving only 66.3% accuracy. To address this limitation, we construct roughly 100,000 high-quality story preference pairs across diverse domains and develop StoryReward, an advanced reward model for story preference trained on this dataset. StoryReward achieves state-of-the-art (SoTA) performance on StoryRMB, outperforming much larger models. We also apply StoryReward in downstream test-time scaling for best-of-n (BoN) story selection and find that it generally chooses stories better aligned with human preferences. We will release our dataset, model, and code to facilitate future research.
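The best-of-n (BoN) selection mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_story` and `reward_score` are hypothetical stand-ins for any story generator and any preference scorer such as StoryReward.

```python
# Minimal sketch of best-of-n (BoN) story selection with a reward model.
# Both callables are hypothetical stand-ins: any sampler and any scorer
# with these signatures would work.

def best_of_n(prompt, generate_story, reward_score, n=8):
    """Sample n candidate stories and return the one the reward model scores highest."""
    candidates = [generate_story(prompt) for _ in range(n)]
    scores = [reward_score(prompt, story) for story in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

In the paper's setting the scorer would be StoryReward; the sketch works with any model that maps a (prompt, story) pair to a scalar preference score.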

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 4

Research Landscape Overview

Core task: evaluating and training reward models for story generation. The field organizes around several complementary branches that address different facets of guiding narrative systems toward human preferences. One branch focuses on reward model architecture and training methodology, exploring how to build robust scoring functions from human feedback or other signals (e.g., ImageReward[1], VisionReward[6]). A second branch examines reward-guided generation and decoding strategies, investigating how to integrate learned rewards into sampling or search procedures (e.g., Reward-Augmented Decoding[16]). A third major area is reinforcement learning for story and creative text generation, where policy optimization techniques adapt language models to produce coherent, engaging narratives (e.g., Hierarchical Story Generation[19], Recursively Summarizing Books[24]). Additional branches cover evaluation and benchmarking of reward models (e.g., RoleRMBench[47]), intermediate rewards and critique-based methods that provide step-by-step guidance (e.g., Self-Generated Critiques[49]), controlled generation and safety (e.g., Mitigating Toxicity[25]), and specialized applications ranging from visual storytelling (e.g., StoryLLaVA[22]) to domain-specific scenarios (e.g., Automated Scenario Generation[36]).

Within the reinforcement learning branch, a particularly active line of work applies RL with human feedback to story generation, balancing creativity with alignment to user preferences. StoryAlign[0] sits squarely in this cluster, emphasizing the challenge of training reward models that capture nuanced narrative quality while avoiding common pitfalls such as length biases (cf. Length Correlations[14]) or reward hacking. Nearby efforts like Learning to Reason[2] and RAGferee[3] explore how to incorporate reasoning or retrieval-augmented signals into the reward landscape, whereas BabyStories[32] investigates simpler narrative domains to isolate core alignment questions.
A central tension across these works is whether to rely on end-to-end learned rewards, composite hand-crafted signals (e.g., Composite Rewards[28]), or hybrid critique-based approaches. StoryAlign[0] contributes to this conversation by proposing methods that directly address reward model reliability in open-ended creative settings, positioning itself among studies that seek principled ways to scale human oversight for long-form, stylistically diverse narratives.

Claimed Contributions

StoryRMB benchmark for story preference evaluation

The authors present StoryRMB, a benchmark containing 1,133 high-quality, human-verified instances for evaluating how well reward models capture human story preferences. Each instance includes a prompt, one chosen story, and three rejected stories across five evaluation dimensions: coherence, creativity, characterization, fluency, and relevance.

9 retrieved papers; verdict: Can Refute
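The benchmark's format (one chosen story versus three rejected stories per instance) implies an accuracy metric in which a reward model is credited only when the chosen story outscores all three alternatives. A hedged sketch of that protocol follows; the field names and scoring interface are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of a StoryRMB-style accuracy metric: an instance counts as
# correct only if the reward model scores the chosen story above all three
# rejected alternatives. Field names ("prompt", "chosen", "rejected") are
# illustrative assumptions.

def rmb_accuracy(instances, score):
    """Fraction of instances where the chosen story outscores every rejected one."""
    correct = 0
    for inst in instances:
        chosen_score = score(inst["prompt"], inst["chosen"])
        if all(chosen_score > score(inst["prompt"], r) for r in inst["rejected"]):
            correct += 1
    return correct / len(instances)
```

Under this 1-vs-3 protocol, a random scorer would achieve about 25% accuracy, which puts the reported 66.3% best result in context.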
Automated method for collecting story preference pairs

The authors develop an automated pipeline for constructing approximately 100,000 story preference pairs from human-written stories using three methods: premise back-generation, prompt-guided rewriting, and human-guided continuation. This dataset captures real-world human preferences from online literary platforms.

10 retrieved papers; verdict: Can Refute
StoryReward advanced reward model

The authors introduce StoryReward, a reward model trained on their large-scale preference dataset that achieves state-of-the-art performance on StoryRMB. The model outperforms much larger models and demonstrates effectiveness in test-time scaling applications such as best-of-n story selection.

10 retrieved papers; verdict: Can Refute
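The report does not state StoryReward's training objective. A common choice for training reward models on (chosen, rejected) preference pairs is the Bradley-Terry pairwise loss, sketched below as one plausible formulation rather than the paper's confirmed method.

```python
import math

# Standard Bradley-Terry pairwise loss, -log(sigmoid(r_chosen - r_rejected)),
# widely used to train reward models on preference pairs. This is a common
# formulation, not necessarily the paper's exact objective.

def bradley_terry_loss(chosen_score: float, rejected_score: float) -> float:
    """Low when the chosen story outscores the rejected one by a wide margin."""
    margin = chosen_score - rejected_score
    # Numerically stable log(1 + exp(-margin)) for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Minimizing this loss over a large pair set (here, the ~100,000 pairs) pushes the model to assign higher scores to human-preferred stories, which is exactly the ranking behavior that BoN selection exploits at test time.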

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: StoryRMB benchmark for story preference evaluation
Contribution 2: Automated method for collecting story preference pairs
Contribution 3: StoryReward advanced reward model