Measuring LLM Novelty As The Frontier Of Original And High-Quality Output

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: generation, evaluation, memorization, novelty, benchmark, creativity
Abstract:

As large language models (LLMs) are increasingly used for ideation and scientific discovery, it is important to evaluate their ability to generate novel output. Prior work evaluates novelty as originality with respect to model training data, but original outputs can be of low quality. In contrast, non-expert judges more reliably score quality but may favor memorized outputs, limiting the reliability of human preference as a metric. We introduce a new novelty metric for LLM generations that balances originality and quality: the harmonic mean of the fraction of n-grams unseen during training and a task-specific quality score. Using this framework, we identify trends that affect the novelty of generations from three families of open-data models (OLMo, OLMo-2, and Pythia) on three creative tasks: story completion, poetry writing, and creative tool use. We find that model-generated text from some base LLMs is less novel than human-written text from the internet. However, increasing model scale (OLMo 1B to 7B to 32B) and post-training reliably improve novelty due to improvements in output quality. We also find that improving the base model at the same scale (e.g., OLMo 7B to OLMo-2 7B) leads to higher novelty due to higher originality. Finally, we observe that inference-time methods, such as prompting and providing novel in-context examples, have a much smaller effect on novelty, often increasing originality at the expense of quality. This highlights the need for further research into more effective elicitation strategies as we use models for creative applications.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a novelty metric combining n-gram originality (fraction of unseen n-grams) with task-specific quality scores, then applies this framework to analyze three model families across creative tasks. It resides in the 'Semantic and Quality-Weighted Novelty Metrics' leaf, which contains only three papers total. This leaf sits within the broader 'Novelty and Originality Measurement Frameworks' branch, distinguishing itself from pure n-gram methods and from multidimensional creativity assessments. The sparse population of this specific leaf suggests the quality-weighted approach remains relatively underexplored compared to adjacent research directions.

The taxonomy reveals neighboring work in 'N-gram and Training Data Comparison Methods' (three papers using structural overlap without quality weighting) and 'Domain-Specific Novelty Benchmarks' (three papers on task-tailored evaluation). The broader 'Multidimensional Creativity Assessment' branch contains substantially more papers (eleven across three leaves), indicating that holistic creativity frameworks dominate over targeted novelty metrics. The paper's focus on balancing originality and quality positions it at the intersection of measurement rigor and practical utility, diverging from purely psychometric approaches while remaining more focused than general creativity evaluations.

Among thirty candidates examined, none clearly refuted any of the three contributions. The novelty metric itself (ten candidates, zero refutations) appears distinct within the limited search scope, as does the analysis of model scale and post-training effects (ten candidates, zero refutations). The evaluation of inference-time methods (ten candidates, zero refutations) similarly shows no direct overlap. However, the search examined top-K semantic matches rather than exhaustive coverage, meaning closely related work outside this candidate set could exist. The statistics suggest the specific combination of harmonic-mean quality weighting and systematic model-family comparison is not prominently represented in the examined literature.

Based on the limited search scope and sparse taxonomy leaf, the work appears to occupy a relatively distinct position within novelty measurement research. The absence of refutations across thirty candidates and the small number of sibling papers suggest the quality-weighted approach is less crowded than adjacent areas. However, this assessment reflects the examined sample rather than the entire field, and the taxonomy's structure indicates that related ideas exist in neighboring branches focused on creativity dimensions or domain-specific benchmarks.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating novelty in large language model text generation. The field has organized itself around several complementary perspectives. At the broadest level, researchers distinguish between frameworks that measure novelty and originality directly, often through semantic distance or quality-weighted metrics, and those that assess creativity more holistically across multiple dimensions such as fluency, flexibility, and elaboration. A parallel stream focuses on research-idea and scientific-hypothesis generation, where novelty is judged by domain-specific standards of innovation. Meanwhile, concerns about diversity and mode collapse examine whether LLMs produce sufficiently varied outputs or fall into repetitive patterns. Human-AI co-creativity studies compare machine-generated content against human baselines, while detection and attribution work seeks to identify AI-generated text. Domain-specific applications (e.g., storytelling, design, humor) and broader LLM evaluation methodologies round out the taxonomy, reflecting both targeted creative tasks and general quality assessment.

Within this landscape, a particularly active line of work explores how to operationalize novelty through computational metrics that balance semantic distance with output quality. LLM Novelty Frontier[0] sits squarely in this branch, proposing semantic and quality-weighted measures that go beyond simple n-gram overlap or perplexity. Nearby, Language of Creativity[3] examines linguistic markers of creative expression, while Automated Creativity Scoring[11] develops scalable rubrics for evaluating originality. These efforts contrast with multidimensional frameworks like Creativity in LLMs[1] and LLM Creativity[4], which integrate novelty as one facet among several creativity dimensions.
A recurring tension is whether novelty should be measured in isolation or as part of a broader creative profile, and whether automated metrics can capture the nuanced judgments that human evaluators apply. LLM Novelty Frontier[0] emphasizes the former approach, offering targeted novelty scores that complement but remain distinct from holistic creativity assessments.

Claimed Contributions

Novelty metric balancing originality and quality

The authors propose measuring novelty as the harmonic mean of two dimensions: originality (fraction of n-grams absent from training data) and task-specific quality (measured via LLM-as-a-judge). This metric addresses limitations of prior work that evaluates only originality or relies solely on human preference.

10 retrieved papers
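The metric claimed above can be sketched in a few lines. The following is an illustrative reimplementation, not the authors' released code: the n-gram order, whitespace tokenization, and the source of the quality score (an LLM judge in the paper) are all assumptions, and all function names are hypothetical.

```python
# Sketch of a quality-weighted novelty score, assuming: originality is
# the fraction of the generation's n-grams absent from a training-data
# n-gram set, quality is an external score in [0, 1] (an LLM judge in
# the paper), and novelty is their harmonic mean.

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def originality(generation_tokens, training_ngrams, n=5):
    """Fraction of the generation's n-grams unseen during training."""
    grams = ngrams(generation_tokens, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in training_ngrams)
    return unseen / len(grams)

def novelty(originality_score, quality_score):
    """Harmonic mean: high only when BOTH components are high;
    zero if either component is zero."""
    if originality_score + quality_score == 0:
        return 0.0
    return (2 * originality_score * quality_score
            / (originality_score + quality_score))

# Toy example: most of the generation's 5-grams are unseen (high
# originality), but the judge scores quality at only 0.5.
train = set(ngrams("the cat sat on the mat and slept".split(), 5))
gen = "the cat sat on the red mat and purred loudly".split()
o = originality(gen, train, n=5)          # 5 of 6 five-grams unseen
score = novelty(o, quality_score=0.5)     # harmonic mean of 5/6 and 1/2
```

The harmonic mean is the key design choice: unlike an arithmetic mean, it cannot be gamed by maximizing one axis alone, so gibberish (original but low quality) and verbatim recall (high quality but unoriginal) both score near zero.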
Analysis of factors affecting LLM novelty

The authors systematically analyze how model scale, post-training, and base model improvements affect novelty across story completion, poetry writing, and creative tool use tasks. They find that scaling and post-training improve novelty through higher quality, while improved base models increase originality.

10 retrieved papers
Evaluation of inference-time elicitation methods

The authors investigate whether inference-time interventions like temperature sampling, novel in-context examples, and prompting techniques can elicit more novel outputs. They demonstrate that these methods typically trade off originality gains against quality losses, yielding minimal improvements to overall novelty.

10 retrieved papers
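Of the inference-time interventions evaluated above, temperature sampling is the simplest to illustrate. The sketch below is a generic temperature-scaled next-token sampler, not the paper's experimental setup: raising the temperature flattens the distribution and pushes probability mass toward unlikely (more original) tokens, which is exactly the mechanism behind the originality-for-quality trade-off the authors report.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample a token index from logits scaled by 1/temperature.

    Low temperature concentrates mass on the argmax (safer, less
    original); high temperature flattens the distribution (more
    original, but more likely to degrade quality).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling over the categorical distribution.
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1  # guard against floating-point round-off

# At low temperature the dominant logit wins almost surely; at high
# temperature all three tokens become nearly equally likely.
low_t = sample_with_temperature([10.0, 0.0, 0.0], temperature=0.1)
```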

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
