Death of the Novel(ty): Beyond N-Gram Novelty as a Metric for Textual Creativity
Overview
Overall Novelty Assessment
The paper operationalizes textual creativity as a dual construct combining novelty and appropriateness (sensicality plus pragmaticality), then empirically tests whether n-gram novelty alone suffices as a proxy. It resides in the 'Critique and Comparative Analysis of Creativity Metrics' leaf alongside three sibling papers that similarly scrutinize existing evaluation approaches. This leaf sits within the broader 'Creativity Evaluation Methodologies and Metrics' branch, which contains three leaves totaling twelve papers. The critique-focused direction is moderately populated, suggesting active debate over how to measure creative language without reducing it to surface-level statistics.
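For readers unfamiliar with the baseline under critique, the sketch below illustrates one common way n-gram novelty is computed: the fraction of a passage's n-grams that never occur in a reference corpus. This is a minimal illustration, not the paper's procedure; the function names, whitespace tokenization, and choice of n are assumptions.

def ngrams(tokens, n):
    # All contiguous n-grams in a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(passage, reference_corpus, n=4):
    # Fraction of the passage's n-grams that never appear in the reference corpus.
    # Whitespace tokenization and n=4 are illustrative; a real pipeline would
    # normalize text and compare against a pretraining-scale corpus.
    reference = set()
    for doc in reference_corpus:
        reference.update(ngrams(doc.split(), n))
    passage_grams = ngrams(passage.split(), n)
    if not passage_grams:
        return 0.0
    unseen = sum(1 for gram in passage_grams if gram not in reference)
    return unseen / len(passage_grams)

A passage can score highly on such a measure while remaining nonsensical or contextually inappropriate, which is precisely the gap the dual-construct operationalization targets.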
The taxonomy reveals neighboring leaves dedicated to 'Novel Metric Development' (four papers proposing new scoring systems) and 'Domain-Specific Creativity Evaluation' (four papers examining poetry, humor, and game content). The original paper bridges these areas: it critiques n-gram novelty (aligning with its own leaf's scope) while proposing a richer operationalization (touching on novel metric territory). Upstream, the 'Theoretical Foundations' branch (eight papers) establishes cognitive and philosophical grounding for creativity, which the paper invokes when defining appropriateness. Downstream, the 'Human-AI Creativity Comparison' branch (five papers) conducts empirical studies similar to the paper's LLM evaluation component, though those works focus on benchmarking rather than metric critique.
Among the thirty candidates examined, none clearly refutes the three contributions. For the first contribution, operationalizing creativity beyond n-gram novelty, ten candidates were examined and none presented a refutable overlap, suggesting that the dual-construct framing (novelty plus appropriateness assessed via close reading) may be relatively novel within the search scope. For the second contribution (the expert writer annotation dataset) and the third (the close reading task for LLM evaluation), ten candidates each were examined with no refutations, indicating that the specific methodology of recruiting expert writers for fine-grained sensicality and pragmaticality judgments has limited direct precedent in the retrieved literature. These statistics reflect the bounded search, not exhaustive coverage.
Given the limited search scope of thirty semantically similar candidates, the paper appears to occupy a distinctive niche: it combines metric critique with a human-expert annotation study grounded in close reading, a methodology less common in the retrieved computational creativity literature. The taxonomy context shows that while metric critique is an active area, the specific operationalization and evaluation approach may differentiate this work from prior efforts. Broader literature beyond the top-thirty matches could reveal additional overlaps, particularly in creativity psychology or writing studies domains not fully captured by semantic search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a new operationalization of textual creativity that extends beyond n-gram novelty by requiring both novelty and appropriateness (decomposed into sensicality and pragmaticality), aligning with the standard psychological definition of creativity.
The authors collected a dataset of 7542 annotations from 26 professional writers who performed close reading of human and AI-generated passages, rating expressions for novelty, pragmaticality, and sensicality, along with 226 creative expression highlights with justifications.
The authors introduce a close reading task to evaluate whether LLMs can replicate human judgments by identifying creative and non-pragmatic expressions in text, testing zero-shot, few-shot, and finetuned models on this task.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Dynamics of automatized measures of creativity: mapping the landscape to quantify creative ideation
[6] A framework for exploring computational models of novelty in unstructured text
[14] Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
Contribution Analysis
Detailed comparisons for each claimed contribution
Operationalization of textual creativity beyond n-gram novelty
The authors propose a new operationalization of textual creativity that extends beyond n-gram novelty by requiring both novelty and appropriateness (decomposed into sensicality and pragmaticality), aligning with the standard psychological definition of creativity.
[64] Understanding elementary students' creativity as a trade-off between originality and task appropriateness: A Pareto optimization study
[65] Semantic distance: An automated measure of creativity that is novel and appropriate
[66] Creative evaluation: The role of memory in novelty & effectiveness judgements
[67] Creativity, Expectancy Violations, and Impression Formation: Effects of Novelty and Appropriateness in Online Dating Profile Texts
[68] Novelty Seeking Differences in Temporal Dynamics for Novelty and Appropriateness Processing of Creative Information: An ERP Investigation
[69] Toward a meta-theory of creativity forms: How novelty and usefulness shape creativity
[70] Creative Factors and Psychotherapeutic Insight: Effects of Novelty and Appropriateness
[71] Promoting and Assessing Creativity for the Training of Transcreators: Some Inspiring Training Resources
[72] Evaluating creativity: How idea context and rater personality affect considerations of novelty and usefulness
[73] What's creative about sentences? A computational approach to assessing creativity in a sentence generation task
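Before turning to the next contribution, a minimal sketch of how the dual construct above could be scored over annotated expressions is given below: an expression counts as creative only if it is novel and appropriate, with appropriateness requiring both sensicality and pragmaticality. The dataclass fields, rating scale, and threshold are hypothetical simplifications, not the paper's scoring rule.

from dataclasses import dataclass

@dataclass
class ExpressionRating:
    # Hypothetical per-expression ratings on a 1-5 scale (illustrative only).
    novelty: int
    sensicality: int
    pragmaticality: int

def is_creative(rating: ExpressionRating, threshold: int = 4) -> bool:
    # Creative = novel AND appropriate; appropriateness = sensical AND pragmatic.
    appropriate = (rating.sensicality >= threshold
                   and rating.pragmaticality >= threshold)
    return rating.novelty >= threshold and appropriate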
Expert writer annotation dataset via close reading
The authors collected a dataset of 7542 annotations from 26 professional writers who performed close reading of human and AI-generated passages, rating expressions for novelty, pragmaticality, and sensicality, along with 226 creative expression highlights with justifications.
[54] Art or artifice? Large language models and the false promise of creativity
[55] A generalist medical language model for disease diagnosis assistance
[56] People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
[57] … a large sized curated and annotated corpus for discriminating between human written and AI generated texts: A case study of text sourced from Wikipedia and …
[58] Beyond plain toxic: building datasets for detection of flammable topics and inappropriate statements
[59] LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
[60] Decoding AI and human authorship: nuances revealed through NLP and statistical analysis
[61] Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights
[62] Assessing the accuracy and explainability of using ChatGPT to evaluate the quality of health news
[63] FoodSky: A food-oriented large language model that can pass the chef and dietetic examinations
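To make the structure of such expert annotations concrete, the record below sketches what a single close-reading judgment might look like. All field names, scales, and values are hypothetical illustrations, not the released dataset's schema.

# Hypothetical annotation record; keys and values are illustrative only.
annotation = {
    "annotator_id": "writer_07",             # one of the 26 professional writers
    "passage_source": "ai",                  # "human" or "ai"
    "expression": "example highlighted span",
    "novelty": 5,                            # assumed rating scale
    "sensicality": 4,
    "pragmaticality": 3,
    "highlighted_as_creative": True,
    "justification": "free-text rationale for the highlight",
}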
Close reading task for LLM evaluation
The authors introduce a close reading task to evaluate whether LLMs can replicate human judgments by identifying creative and non-pragmatic expressions in text, testing zero-shot, few-shot, and finetuned models on this task.
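A minimal sketch of how such a zero-shot close reading probe might be posed to an LLM appears below. The prompt wording, output format, and the generic llm callable are assumptions for illustration; the paper's exact task instructions, label definitions, and model interfaces may differ, and few-shot or finetuned variants would add exemplars or training rather than change this basic shape.

# Hypothetical zero-shot prompt for the close reading probe; the exact
# instructions and output format used in the paper are assumptions.
CLOSE_READING_PROMPT = """You are performing a close reading of the passage below.
1. Quote each expression you consider creative (novel AND appropriate in context).
2. Quote each expression that is non-pragmatic (it does not fit its context).
Return a JSON object with keys "creative" and "non_pragmatic", each a list of quoted spans.

Passage:
{passage}
"""

def close_reading_probe(passage: str, llm) -> str:
    # `llm` is any callable mapping a prompt string to a completion string,
    # standing in for whichever zero-shot, few-shot, or finetuned model is evaluated.
    return llm(CLOSE_READING_PROMPT.format(passage=passage))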