Measuring LLM Novelty As The Frontier Of Original And High-Quality Output

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: generation, evaluation, memorization, novelty, benchmark, creativity
Abstract:

As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to model training data, but original outputs can be of low quality. In contrast, non-expert judges more reliably score quality but may favor memorized outputs, limiting the reliability of human preference as a metric. We introduce a new novelty metric for LLM generations that balances originality and quality: the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. Using this framework, we identify trends that affect the novelty of generations from three families of open-data models (OLMo, OLMo-2, and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that model-generated text from some base LLMs is less novel than human-written text from the internet. However, increasing model scale (OLMo 1B to 7B to 32B) and post-training reliably improve novelty due to improvements in output quality. We also find that improving the base model at the same scale (e.g., OLMo 7B to OLMo-2 7B) leads to higher novelty due to higher originality. Finally, we observe that inference-time methods, such as prompting and providing novel in-context examples, have a much smaller effect on novelty, often increasing originality at the expense of quality. This highlights the need for further research into more effective elicitation strategies as we use models for creative applications.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a novelty metric combining n-gram originality (fraction of unseen n-grams) with task-specific quality scores, then applies this framework to analyze three model families across creative tasks. It resides in the 'Semantic and Quality-Weighted Novelty Metrics' leaf, which contains only three papers total. This leaf sits within the broader 'Novelty and Originality Measurement Frameworks' branch, distinguishing itself from pure n-gram methods and from multidimensional creativity assessments. The sparse population of this specific leaf suggests the quality-weighted approach remains relatively underexplored compared to adjacent research directions.

The taxonomy reveals neighboring work in 'N-gram and Training Data Comparison Methods' (three papers using structural overlap without quality weighting) and 'Domain-Specific Novelty Benchmarks' (three papers on task-tailored evaluation). The broader 'Multidimensional Creativity Assessment' branch contains substantially more papers (eleven across three leaves), indicating that holistic creativity frameworks dominate over targeted novelty metrics. The paper's focus on balancing originality and quality positions it at the intersection of measurement rigor and practical utility, diverging from purely psychometric approaches while remaining more focused than general creativity evaluations.

Among thirty candidates examined, none clearly refuted any of the three contributions. The novelty metric itself (ten candidates, zero refutations) appears distinct within the limited search scope, as does the analysis of model scale and post-training effects (ten candidates, zero refutations). The evaluation of inference-time methods (ten candidates, zero refutations) similarly shows no direct overlap. However, the search examined top-K semantic matches rather than exhaustive coverage, meaning closely related work outside this candidate set could exist. The statistics suggest the specific combination of harmonic-mean quality weighting and systematic model-family comparison is not prominently represented in the examined literature.

Based on the limited search scope and sparse taxonomy leaf, the work appears to occupy a relatively distinct position within novelty measurement research. The absence of refutations across thirty candidates and the small number of sibling papers suggest the quality-weighted approach is less crowded than adjacent areas. However, this assessment reflects the examined sample rather than the entire field, and the taxonomy's structure indicates that related ideas exist in neighboring branches focused on creativity dimensions or domain-specific benchmarks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating novelty in large language model text generation. The field has organized itself around several complementary perspectives. At the broadest level, researchers distinguish between frameworks that measure novelty and originality directly, often through semantic distance or quality-weighted metrics, and those that assess creativity more holistically across multiple dimensions such as fluency, flexibility, and elaboration. A parallel stream focuses on research-idea and scientific-hypothesis generation, where novelty is judged by domain-specific standards of innovation. Meanwhile, concerns about diversity and mode collapse examine whether LLMs produce sufficiently varied outputs or fall into repetitive patterns. Human-AI co-creativity studies compare machine-generated content against human baselines, while detection and attribution work seeks to identify AI-generated text. Domain-specific applications (e.g., storytelling, design, humor) and broader LLM evaluation methodologies round out the taxonomy, reflecting both targeted creative tasks and general quality assessment.

Within this landscape, a particularly active line of work explores how to operationalize novelty through computational metrics that balance semantic distance with output quality. LLM Novelty Frontier[0] sits squarely in this branch, proposing semantic and quality-weighted measures that go beyond simple n-gram overlap or perplexity. Nearby, Language of Creativity[3] examines linguistic markers of creative expression, while Automated Creativity Scoring[11] develops scalable rubrics for evaluating originality. These efforts contrast with multidimensional frameworks like Creativity in LLMs[1] and LLM Creativity[4], which integrate novelty as one facet among several creativity dimensions.
A recurring tension is whether novelty should be measured in isolation or as part of a broader creative profile, and whether automated metrics can capture the nuanced judgments that human evaluators apply. LLM Novelty Frontier[0] emphasizes the former approach, offering targeted novelty scores that complement but remain distinct from holistic creativity assessments.

Claimed Contributions

Novelty metric balancing originality and quality

The authors propose measuring novelty as the harmonic mean of two dimensions: originality (fraction of n-grams absent from training data) and task-specific quality (measured via LLM-as-a-judge). This metric addresses limitations of prior work that evaluates only originality or relies solely on human preference.

10 retrieved papers
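The metric claimed above can be sketched in a few lines. The following is an illustrative reimplementation, not the authors' released code: the n-gram order, whitespace tokenization, and the source of the quality score (an LLM judge in the paper) are all assumptions, and all function names are hypothetical.

```python
# Sketch of a quality-weighted novelty score, assuming: originality is
# the fraction of the generation's n-grams absent from a training-data
# n-gram set, quality is an external score in [0, 1] (an LLM judge in
# the paper), and novelty is their harmonic mean.

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def originality(generation_tokens, training_ngrams, n=5):
    """Fraction of the generation's n-grams unseen during training."""
    grams = ngrams(generation_tokens, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in training_ngrams)
    return unseen / len(grams)

def novelty(originality_score, quality_score):
    """Harmonic mean: high only when BOTH components are high;
    zero if either component is zero."""
    if originality_score + quality_score == 0:
        return 0.0
    return (2 * originality_score * quality_score
            / (originality_score + quality_score))

# Toy example: most of the generation's 5-grams are unseen (high
# originality), but the judge scores quality at only 0.5.
train = set(ngrams("the cat sat on the mat and slept".split(), 5))
gen = "the cat sat on the red mat and purred loudly".split()
o = originality(gen, train, n=5)          # 5 of 6 five-grams unseen
score = novelty(o, quality_score=0.5)     # harmonic mean of 5/6 and 1/2
```

The harmonic mean is the key design choice: unlike an arithmetic mean, it cannot be gamed by maximizing one axis alone, so gibberish (original but low quality) and verbatim recall (high quality but unoriginal) both score near zero.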
Analysis of factors affecting LLM novelty

The authors systematically analyze how model scale, post-training, and base model improvements affect novelty across story completion, poetry writing, and creative tool use tasks. They find that scaling and post-training improve novelty through higher quality, while improved base models increase originality.

10 retrieved papers
Evaluation of inference-time elicitation methods

The authors investigate whether inference-time interventions like temperature sampling, novel in-context examples, and prompting techniques can elicit more novel outputs. They demonstrate that these methods typically trade off originality gains against quality losses, yielding minimal improvements to overall novelty.

10 retrieved papers
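Of the inference-time interventions evaluated above, temperature sampling is the simplest to illustrate. The sketch below is a generic temperature-scaled next-token sampler, not the paper's experimental setup: raising the temperature flattens the distribution and pushes probability mass toward unlikely (more original) tokens, which is exactly the mechanism behind the originality-for-quality trade-off the authors report.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample a token index from logits scaled by 1/temperature.

    Low temperature concentrates mass on the argmax (safer, less
    original); high temperature flattens the distribution (more
    original, but more likely to degrade quality).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the categorical distribution.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# At low temperature the dominant logit wins almost surely; at high
# temperature all three tokens become nearly equally likely.
low_t = sample_with_temperature([10.0, 0.0, 0.0], temperature=0.1)
```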

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
