Death of the Novel(ty): Beyond N-Gram Novelty as a Metric for Textual Creativity

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: creativity, creative writing, evaluation, creativity evaluation, machine creativity, n-gram novelty
Abstract:

N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality, collected via close reading of human- and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, approximately 91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike in human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely than humans to produce creative expressions. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs perform much better than random but leave room for improvement, struggling especially to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
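For reference, n-gram novelty is commonly operationalized as the fraction of a text's n-grams that never appear in a reference (e.g., training) corpus. The sketch below illustrates that common formulation; it is a minimal toy implementation, not the paper's exact procedure, and the corpus and text are placeholders.

```python
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text_tokens: List[str],
                  corpus_ngrams: Set[Tuple[str, ...]],
                  n: int = 5) -> float:
    """Fraction of the text's n-grams that never appear in the reference corpus."""
    grams = ngrams(text_tokens, n)
    if not grams:
        return 0.0
    unseen = sum(1 for g in grams if g not in corpus_ngrams)
    return unseen / len(grams)

# Toy usage with a placeholder "training corpus".
corpus_tokens = "the cat sat on the mat".split()
corpus_set = set(ngrams(corpus_tokens, 3))
generated = "the cat sat on a velvet sea".split()
print(ngram_novelty(generated, corpus_set, n=3))  # 0.6: 3 of 5 trigrams are unseen
```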

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper operationalizes textual creativity as a dual construct combining novelty and appropriateness (sensicality plus pragmaticality), then empirically tests whether n-gram novelty alone suffices as a proxy. It resides in the 'Critique and Comparative Analysis of Creativity Metrics' leaf alongside three sibling papers that similarly scrutinize existing evaluation approaches. This leaf sits within the broader 'Creativity Evaluation Methodologies and Metrics' branch, which contains three leaves totaling twelve papers. The critique-focused direction is moderately populated, suggesting active debate over how to measure creative language without reducing it to surface-level statistics.

The taxonomy reveals neighboring leaves dedicated to 'Novel Metric Development' (four papers proposing new scoring systems) and 'Domain-Specific Creativity Evaluation' (four papers examining poetry, humor, and game content). The original paper bridges these areas: it critiques n-gram novelty (aligning with its own leaf's scope) while proposing a richer operationalization (touching on novel metric territory). Upstream, the 'Theoretical Foundations' branch (eight papers) establishes cognitive and philosophical grounding for creativity, which the paper invokes when defining appropriateness. Downstream, the 'Human-AI Creativity Comparison' branch (five papers) conducts empirical studies similar to the paper's LLM evaluation component, though those works focus on benchmarking rather than metric critique.

Among the thirty candidates examined, none clearly refutes the three contributions. For the first contribution (operationalizing creativity beyond n-gram novelty), ten candidates were examined with zero refutable overlaps, suggesting that the dual-construct framing (novelty plus appropriateness via close reading) may be relatively novel within the search scope. The second contribution (expert writer annotation dataset) and the third (close reading task for LLM evaluation) were each compared against ten candidates with zero refutations, indicating that the specific methodology of recruiting expert writers for fine-grained sensicality and pragmaticality judgments has limited direct precedent in the retrieved literature. These statistics reflect the bounded search, not exhaustive coverage.

Given the limited search scope of thirty semantically similar candidates, the paper appears to occupy a distinctive niche: it combines metric critique with a human-expert annotation study grounded in close reading, a methodology less common in the retrieved computational creativity literature. The taxonomy context shows that while metric critique is an active area, the specific operationalization and evaluation approach may differentiate this work from prior efforts. Broader literature beyond the top-thirty matches could reveal additional overlaps, particularly in creativity psychology or writing studies domains not fully captured by semantic search.

Taxonomy

Core-task Taxonomy Papers: 43
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Evaluating textual creativity beyond n-gram novelty metrics. The field has evolved from simple surface-level measures toward richer conceptual frameworks that capture what makes language genuinely creative.

The taxonomy reveals five major branches: theoretical foundations that ground creativity in linguistic and cognitive theory (Creativity Linguistic Theory[3], Creative Symbol Grounding[19]); methodologies and metrics that propose alternatives to n-gram counting (Multi-Novelty[4], Computational Novelty Framework[6]); human-AI comparison studies that benchmark machine outputs against human creative performance (AI as Salieri[2], Comparative Linguistic Creativity[9]); generation systems exploring techniques like constrained poetry or metaphor production (Creative Story Generation[1], Pun Intended[15]); and specialized applications spanning domains from lyric writing (AI Lyricist[10]) to procedural content (Procedural Content Creativity[7]). These branches interact closely: new metrics often emerge from theoretical insights, while generation systems provide testbeds for evaluation approaches.

A particularly active tension runs through the critique and comparative analysis cluster, where researchers question whether existing metrics truly capture creative depth or merely statistical surprise. Death of Novelty[0] sits squarely within this critical line, challenging the adequacy of n-gram-based measures alongside neighbors like Dynamics Automatized Creativity[5] and Computational Novelty Framework[6], which similarly scrutinize how we operationalize novelty and appropriateness (Balancing Novelty Appropriateness[8]). Rethinking Creativity Evaluation[14] echoes these concerns, advocating for multi-dimensional frameworks that go beyond surface patterns. Meanwhile, works like Surprisal Metaphor Novelty[16] and Predicting Metaphor Novelty[11] explore psycholinguistic grounding for creativity metrics, suggesting that surprisal and semantic distance may better align with human judgments than simple n-gram divergence. The original paper thus contributes to an ongoing reassessment of foundational assumptions, pushing the community toward evaluation paradigms that respect the layered, context-sensitive nature of creative language.

Claimed Contributions

Operationalization of textual creativity beyond n-gram novelty

The authors propose a new operationalization of textual creativity that extends beyond n-gram novelty by requiring both novelty and appropriateness (decomposed into sensicality and pragmaticality), aligning with the standard psychological definition of creativity.

Retrieved papers compared: 10
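Under this operationalization, an expression counts as creative only if it is both novel and appropriate, with appropriateness decomposed into sensicality and pragmaticality. Below is a minimal sketch of that decision rule, assuming each dimension reduces to a binary expert judgment; the class and field names are illustrative, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class ExpressionJudgment:
    novel: bool      # original relative to ordinary usage
    sensical: bool   # semantically coherent in context
    pragmatic: bool  # fits the discourse and its communicative goals

def is_creative(j: ExpressionJudgment) -> bool:
    # Appropriateness decomposes into sensicality and pragmaticality;
    # creativity requires novelty AND appropriateness.
    return j.novel and (j.sensical and j.pragmatic)

# A novel but non-pragmatic expression is not creative under this rule.
print(is_creative(ExpressionJudgment(novel=True, sensical=True, pragmatic=False)))  # False
```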
Expert writer annotation dataset via close reading

The authors collected a dataset of 7542 annotations from 26 professional writers who performed close reading of human and AI-generated passages, rating expressions for novelty, pragmaticality, and sensicality, along with 226 creative expression highlights with justifications.

Retrieved papers compared: 10
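For concreteness, a single record in such a dataset might look like the sketch below. All field names and values are hypothetical; the paper's actual schema is not reproduced here.

```python
# Hypothetical structure of one close-reading annotation record (illustrative only).
annotation = {
    "passage_id": "story_012",
    "source": "ai",                      # or "human"
    "span": (143, 171),                  # character offsets of the rated expression
    "expression": "a velvet sea of streetlights",
    "ratings": {
        "novelty": 4,                    # e.g., Likert-style expert ratings
        "sensicality": 5,
        "pragmaticality": 3,
    },
    "highlighted_as_creative": False,
    "justification": None,               # free-text rationale when highlighted
    "annotator_id": "writer_07",
}
print(annotation["ratings"]["novelty"])
```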
Close reading task for LLM evaluation

The authors introduce a close reading task to evaluate whether LLMs can replicate human judgments by identifying creative and non-pragmatic expressions in text, testing zero-shot, few-shot, and finetuned models on this task.

Retrieved papers compared: 10
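A zero-shot version of this close reading task could be posed to an LLM roughly as follows. The prompt wording and JSON output format are assumptions for illustration, not the paper's exact protocol.

```python
# Illustrative zero-shot prompt for the close reading task; the wording and
# output format are assumptions, not the paper's exact setup.
PROMPT_TEMPLATE = """You are an expert literary reader performing a close reading.

Passage:
{passage}

Identify (a) creative expressions (novel AND sensical AND pragmatic) and
(b) non-pragmatic expressions (ones that do not fit the discourse).
Quote each span verbatim and answer as JSON:
{{"creative": [...], "non_pragmatic": [...]}}"""

def build_close_reading_prompt(passage: str) -> str:
    return PROMPT_TEMPLATE.format(passage=passage)

print(build_close_reading_prompt("The moon unbuttoned its coat of clouds."))
```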

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Operationalization of textual creativity beyond n-gram novelty
Ten candidate papers were compared against this contribution; no refutable overlap was found.

Contribution: Expert writer annotation dataset via close reading
Ten candidate papers were compared against this contribution; no refutable overlap was found.

Contribution: Close reading task for LLM evaluation
Ten candidate papers were compared against this contribution; no refutable overlap was found.
