StoryAlign: Evaluating and Training Reward Models for Story Generation
Research Landscape Overview
Claimed Contributions
The authors present StoryRMB, a benchmark containing 1,133 high-quality, human-verified instances for evaluating how well reward models capture human story preferences. Each instance includes a prompt, one chosen story, and three rejected stories across five evaluation dimensions: coherence, creativity, characterization, fluency, and relevance.
The authors develop an automated pipeline for constructing approximately 100,000 story preference pairs from human-written stories using three methods: premise back-generation, prompt-guided rewriting, and human-guided continuation. This dataset captures real-world human preferences from online literary platforms.
The authors introduce StoryReward, a reward model trained on their large-scale preference dataset that achieves state-of-the-art performance on StoryRMB. The model outperforms much larger models and demonstrates effectiveness in test-time scaling applications such as best-of-n story selection.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Learning to reason for long-form story generation
[24] Recursively summarizing books with human feedback
[32] BabyStories: Can Reinforcement Learning Teach Baby Language Models to Write Better Stories?
Contribution Analysis
Detailed comparisons for each claimed contribution
StoryRMB benchmark for story preference evaluation
The authors present StoryRMB, a benchmark containing 1,133 high-quality, human-verified instances for evaluating how well reward models capture human story preferences. Each instance includes a prompt, one chosen story, and three rejected stories across five evaluation dimensions: coherence, creativity, characterization, fluency, and relevance.
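Given the instance format described above (a prompt, one chosen story, and three rejected stories), a natural evaluation metric is the fraction of instances where a reward model scores the chosen story above every rejected one. A minimal sketch of that protocol, assuming strict ranking against all three rejected stories; the class and function names are illustrative, not from the paper:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    """One StoryRMB-style benchmark instance."""
    prompt: str
    chosen: str
    rejected: List[str]  # three rejected stories


def benchmark_accuracy(
    instances: List[Instance],
    score: Callable[[str, str], float],  # (prompt, story) -> reward
) -> float:
    """Fraction of instances where the reward model scores the
    chosen story strictly above all rejected alternatives."""
    correct = 0
    for inst in instances:
        s_chosen = score(inst.prompt, inst.chosen)
        if all(s_chosen > score(inst.prompt, r) for r in inst.rejected):
            correct += 1
    return correct / len(instances)
```

Under this protocol a random scorer would pass each instance only 25% of the time, so the metric separates reward models more sharply than pairwise comparison.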
[54] LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
[51] ARES: An automated evaluation framework for retrieval-augmented generation systems
[52] OpenMEVA: A benchmark for evaluating open-ended story generation metrics
[53] OpenGenAlign: A Preference Dataset and Benchmark for Trustworthy Reward Modeling in Open-Ended, Long-Context Generation
[56] Hybrid preferences: Learning to route instances for human vs. AI feedback
[57] Are large language models capable of generating human-level narratives?
[58] Meta-evaluation methodology and benchmark for automatic story generation
[59] StoryBench: A multifaceted benchmark for continuous story visualization
[60] The authenticity gap in human evaluation
Automated method for collecting story preference pairs
The authors develop an automated pipeline for constructing approximately 100,000 story preference pairs from human-written stories using three methods: premise back-generation, prompt-guided rewriting, and human-guided continuation. This dataset captures real-world human preferences from online literary platforms.
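Of the three construction methods, premise back-generation is the most readily sketched: infer a prompt from a human-written story, have a model write a story for that prompt, then pair the human story (chosen) with the model story (rejected). The sketch below follows that idea under the assumption that human-written stories are always the preferred side of the pair; the function names and pairing details are illustrative, not the authors' exact pipeline:

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    human_stories: List[str],
    back_generate_premise: Callable[[str], str],  # story -> inferred prompt
    generate_story: Callable[[str], str],         # prompt -> model-written story
) -> List[Dict[str, str]]:
    """Sketch of premise back-generation for preference-pair construction."""
    pairs = []
    for human_story in human_stories:
        prompt = back_generate_premise(human_story)  # recover a plausible prompt
        model_story = generate_story(prompt)         # model attempt at same prompt
        pairs.append({
            "prompt": prompt,
            "chosen": human_story,    # human-written, assumed preferred
            "rejected": model_story,  # model-generated counterpart
        })
    return pairs
```

In practice the two callables would wrap LLM calls; here they are left abstract so the pairing logic itself is the focus.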
[54] LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
[68] StoryER: Automatic story evaluation via ranking, rating and reasoning
[33] Automatic story generation: A survey of approaches
[41] Robust preference learning for storytelling via contrastive reinforcement learning
[69] Eliciting human preferences with language models
[70] Hierarchical neural story generation
[71] GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences
[72] Folktale Story Generation and Automatic Evaluation of Generated Text
[73] Tailored Tales: Enhancing Children's Reading Comprehension with Preference-Tuned Automatic Story Generation
[74] Aligning Text-to-Music Evaluation with Human Preferences
StoryReward reward model for story generation
The authors introduce StoryReward, a reward model trained on their large-scale preference dataset that achieves state-of-the-art performance on StoryRMB. The model outperforms much larger models and demonstrates effectiveness in test-time scaling applications such as best-of-n story selection.