Death of the Novel(ty): Beyond N-Gram Novelty as a Metric for Textual Creativity

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: creativity, creative writing, evaluation, creativity evaluation, machine creativity, n-gram novelty
Abstract:

N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality, collected via close reading of human- and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, approximately 91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike in human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely than humans to produce creative expressions. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs perform much better than random but leave room for improvement, struggling especially to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
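For reference, n-gram novelty is commonly operationalized as the fraction of a text's n-grams that never appear in a reference (e.g., training) corpus. The sketch below illustrates that common formulation; it is a minimal toy implementation, not the paper's exact procedure, and the corpus and text are placeholders.

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text_tokens: List[str],
                  corpus_ngrams: Set[Tuple[str, ...]],
                  n: int = 5) -> float:
    """Fraction of the text's n-grams that never appear in the reference corpus."""
    grams = ngrams(text_tokens, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in corpus_ngrams)
    return unseen / len(grams)

# Toy usage with a placeholder "training corpus".
corpus_tokens = "the cat sat on the mat".split()
corpus_set = set(ngrams(corpus_tokens, 3))
generated = "the cat sat on a velvet sea".split()
print(ngram_novelty(generated, corpus_set, n=3))  # 0.6: 3 of 5 trigrams are unseen
```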

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper operationalizes textual creativity as a dual construct combining novelty and appropriateness (sensicality plus pragmaticality), then empirically tests whether n-gram novelty alone suffices as a proxy. It resides in the 'Critique and Comparative Analysis of Creativity Metrics' leaf alongside three sibling papers that similarly scrutinize existing evaluation approaches. This leaf sits within the broader 'Creativity Evaluation Methodologies and Metrics' branch, which contains three leaves totaling twelve papers. The critique-focused direction is moderately populated, suggesting active debate over how to measure creative language without reducing it to surface-level statistics.

The taxonomy reveals neighboring leaves dedicated to 'Novel Metric Development' (four papers proposing new scoring systems) and 'Domain-Specific Creativity Evaluation' (four papers examining poetry, humor, and game content). The original paper bridges these areas: it critiques n-gram novelty (aligning with its own leaf's scope) while proposing a richer operationalization (touching on novel metric territory). Upstream, the 'Theoretical Foundations' branch (eight papers) establishes cognitive and philosophical grounding for creativity, which the paper invokes when defining appropriateness. Downstream, the 'Human-AI Creativity Comparison' branch (five papers) conducts empirical studies similar to the paper's LLM evaluation component, though those works focus on benchmarking rather than metric critique.

Among the thirty candidates examined, none clearly refutes the three contributions. For the first contribution (operationalizing creativity beyond n-gram novelty), ten candidates were examined with zero refutable overlaps, suggesting that the dual-construct framing (novelty plus appropriateness via close reading) may be relatively novel within the search scope. The second contribution (expert writer annotation dataset) and the third (close reading task for LLM evaluation) were each compared against ten candidates with zero refutations, indicating that the specific methodology of recruiting expert writers for fine-grained sensicality and pragmaticality judgments has limited direct precedent in the retrieved literature. These statistics reflect the bounded search, not exhaustive coverage.

Given the limited search scope of thirty semantically similar candidates, the paper appears to occupy a distinctive niche: it combines metric critique with a human-expert annotation study grounded in close reading, a methodology less common in the retrieved computational creativity literature. The taxonomy context shows that while metric critique is an active area, the specific operationalization and evaluation approach may differentiate this work from prior efforts. Broader literature beyond the top-thirty matches could reveal additional overlaps, particularly in creativity psychology or writing studies domains not fully captured by semantic search.

Taxonomy

Core-task Taxonomy Papers: 43
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating textual creativity beyond n-gram novelty metrics. The field has evolved from simple surface-level measures toward richer conceptual frameworks that capture what makes language genuinely creative.

The taxonomy reveals five major branches: theoretical foundations that ground creativity in linguistic and cognitive theory (Creativity Linguistic Theory[3], Creative Symbol Grounding[19]); methodologies and metrics that propose alternatives to n-gram counting (Multi-Novelty[4], Computational Novelty Framework[6]); human-AI comparison studies that benchmark machine outputs against human creative performance (AI as Salieri[2], Comparative Linguistic Creativity[9]); generation systems exploring techniques like constrained poetry or metaphor production (Creative Story Generation[1], Pun Intended[15]); and specialized applications spanning domains from lyric writing (AI Lyricist[10]) to procedural content (Procedural Content Creativity[7]). These branches interact closely: new metrics often emerge from theoretical insights, while generation systems provide testbeds for evaluation approaches.

A particularly active tension runs through the critique and comparative analysis cluster, where researchers question whether existing metrics truly capture creative depth or merely statistical surprise. Death of Novelty[0] sits squarely within this critical line, challenging the adequacy of n-gram-based measures alongside neighbors like Dynamics Automatized Creativity[5] and Computational Novelty Framework[6], which similarly scrutinize how we operationalize novelty and appropriateness (Balancing Novelty Appropriateness[8]). Rethinking Creativity Evaluation[14] echoes these concerns, advocating for multi-dimensional frameworks that go beyond surface patterns. Meanwhile, works like Surprisal Metaphor Novelty[16] and Predicting Metaphor Novelty[11] explore psycholinguistic grounding for creativity metrics, suggesting that surprisal and semantic distance may better align with human judgments than simple n-gram divergence. The original paper thus contributes to an ongoing reassessment of foundational assumptions, pushing the community toward evaluation paradigms that respect the layered, context-sensitive nature of creative language.

Claimed Contributions

Operationalization of textual creativity beyond n-gram novelty

The authors propose a new operationalization of textual creativity that extends beyond n-gram novelty by requiring both novelty and appropriateness (decomposed into sensicality and pragmaticality), aligning with the standard psychological definition of creativity.

Retrieved papers compared: 10
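Under this operationalization, an expression counts as creative only if it is both novel and appropriate, with appropriateness decomposed into sensicality and pragmaticality. Below is a minimal sketch of that decision rule, assuming each dimension reduces to a binary expert judgment; the class and field names are illustrative, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class ExpressionJudgment:
    novel: bool      # original relative to ordinary usage
    sensical: bool   # semantically coherent in context
    pragmatic: bool  # fits the discourse and its communicative goals

def is_creative(j: ExpressionJudgment) -> bool:
    # Appropriateness decomposes into sensicality and pragmaticality;
    # creativity requires novelty AND appropriateness.
    return j.novel and (j.sensical and j.pragmatic)

# A novel but non-pragmatic expression is not creative under this rule.
print(is_creative(ExpressionJudgment(novel=True, sensical=True, pragmatic=False)))  # False
```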
Expert writer annotation dataset via close reading

The authors collected a dataset of 7542 annotations from 26 professional writers who performed close reading of human and AI-generated passages, rating expressions for novelty, pragmaticality, and sensicality, along with 226 creative expression highlights with justifications.

Retrieved papers compared: 10
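For concreteness, a single record in such a dataset might look like the sketch below. All field names and values are hypothetical; the paper's actual schema is not reproduced here.

```python
# Hypothetical structure of one close-reading annotation record (illustrative only).
annotation = {
    "passage_id": "story_012",
    "source": "ai",                      # or "human"
    "span": (143, 171),                  # character offsets of the rated expression
    "expression": "a velvet sea of streetlights",
    "ratings": {
        "novelty": 4,                    # e.g., Likert-style expert ratings
        "sensicality": 5,
        "pragmaticality": 3,
    },
    "highlighted_as_creative": False,
    "justification": None,               # free-text rationale when highlighted
    "annotator_id": "writer_07",
}
print(annotation["ratings"]["novelty"])
```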
Close reading task for LLM evaluation

The authors introduce a close reading task to evaluate whether LLMs can replicate human judgments by identifying creative and non-pragmatic expressions in text, testing zero-shot, few-shot, and finetuned models on this task.

Retrieved papers compared: 10
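A zero-shot version of this close reading task could be posed to an LLM roughly as follows. The prompt wording and JSON output format are assumptions for illustration, not the paper's exact protocol.

```python
# Illustrative zero-shot prompt for the close reading task; the wording and
# output format are assumptions, not the paper's exact setup.
PROMPT_TEMPLATE = """You are an expert literary reader performing a close reading.

Passage:
{passage}

Identify (a) creative expressions (novel AND sensical AND pragmatic) and
(b) non-pragmatic expressions (ones that do not fit the discourse).
Quote each span verbatim and answer as JSON:
{{"creative": [...], "non_pragmatic": [...]}}"""

def build_close_reading_prompt(passage: str) -> str:
    return PROMPT_TEMPLATE.format(passage=passage)

print(build_close_reading_prompt("The moon unbuttoned its coat of clouds."))
```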

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Operationalization of textual creativity beyond n-gram novelty
Ten candidate papers were compared against this contribution; no refutable overlap was found.

Contribution: Expert writer annotation dataset via close reading
Ten candidate papers were compared against this contribution; no refutable overlap was found.

Contribution: Close reading task for LLM evaluation
Ten candidate papers were compared against this contribution; no refutable overlap was found.
