CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: benchmark, LLM, reasoning, long-context reasoning, Cognitive Load Theory, CLT, synthetic benchmark, natural language benchmark, intrinsic difficulty, extraneous load, needle-in-a-haystack
Abstract:

Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty (d) controls intrinsic load; distractor-to-signal ratio (ρ) regulates extraneous load; and task length (N) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.
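
To make the three tunable dimensions concrete, the sketch below shows one way a single benchmark condition could be represented; the class name, field names, and default values are illustrative assumptions, not CogniLoad's actual interface.

```python
# Illustrative sketch only: CogniLoad's real generator interface is not shown in
# this report, so the class name, fields, and defaults below are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class CogniLoadCondition:
    d: int = 3        # intrinsic difficulty        -> intrinsic load (ICL)
    rho: float = 1.0  # distractor-to-signal ratio  -> extraneous load (ECL)
    n: int = 50       # task length in statements   -> proxy for germane load (GCL)

    def signal_and_distractor_counts(self) -> tuple[int, int]:
        """Split the n statements into signal and distractor counts implied by rho,
        assuming n counts both kinds of statements."""
        signal = round(self.n / (1.0 + self.rho))
        return signal, self.n - signal
```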

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CogniLoad, a synthetic benchmark that applies Cognitive Load Theory to evaluate long-context reasoning in LLMs through independently tunable parameters: intrinsic difficulty, distractor-to-signal ratio, and task length. Within the taxonomy, it resides in the 'Synthetic Benchmark Design with Parametric Control' leaf, which contains only two papers total. This represents a relatively sparse research direction focused specifically on benchmarks with systematic parametric control over cognitive load dimensions, suggesting the paper enters a nascent but well-defined area of investigation.

The taxonomy reveals that CogniLoad's parent branch, 'Cognitive Load Theory Foundations and Benchmarking', encompasses three distinct research directions. Its sibling leaves include 'Cognitive Load Mechanisms and Interference Effects' (examining phenomena like proactive interference and context saturation) and 'Human-Model Cognitive Alignment Studies' (comparing model limitations with human cognitive constraints). These neighboring areas provide complementary perspectives—mechanistic studies investigate specific cognitive phenomena, while alignment research grounds model behavior in human baselines—but neither offers the systematic parametric control that defines CogniLoad's methodological contribution.

Among the three contributions analyzed, the first two appear relatively novel within the limited search scope. 'Grounding LLM evaluation in Cognitive Load Theory' examined 10 candidates with zero refutations, while 'CogniLoad benchmark with independent cognitive load control' examined 8 candidates, also with zero refutations. However, the third contribution, 'Automatic puzzle generation and evaluation algorithm', shows more substantial prior work: among 10 candidates examined, 3 were identified as potentially refuting. This suggests that while the theoretical framing and benchmark design may be distinctive, the technical implementation of puzzle generation has more established precedents in the examined literature.

Based on the analysis of 28 candidate papers from semantic search, CogniLoad appears to occupy a relatively novel position in systematically operationalizing cognitive load theory for LLM evaluation. The sparse population of its taxonomy leaf and low refutation rates for core contributions suggest meaningful differentiation from prior work. However, this assessment is constrained by the limited search scope and does not constitute an exhaustive survey of all potentially relevant benchmarking literature.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 3

Research Landscape Overview

Core task: long-context reasoning with controllable cognitive load dimensions. This emerging field examines how language models handle extended inputs under varying cognitive demands, drawing inspiration from human cognitive load theory. The taxonomy organizes research into three main branches: Cognitive Load Theory Foundations and Benchmarking, which develops theoretical frameworks and evaluation protocols for measuring cognitive strain; Context Management and Memory Optimization Frameworks, which addresses architectural strategies for efficient information retention and retrieval; and Application Domains and Task-Specific Implementations, which explores how cognitive load principles manifest across diverse problem settings. Representative works span from foundational studies on human-like context limitations (Context Limitations Human-like[8]) to recent frameworks that explicitly model cognitive constraints (Cognitive Bandwidth Bottleneck[2], Cognitive Workspace[7]). The field reflects growing recognition that scaling context windows alone is insufficient without understanding the qualitative dimensions of reasoning difficulty.

Particularly active lines of work explore synthetic benchmark design with parametric control over task complexity, enabling systematic study of how models degrade under increasing cognitive demands. CogniLoad[0] sits squarely within this benchmarking thrust, offering controllable dimensions to isolate specific load factors in long-context scenarios. Its emphasis on parametric control aligns closely with seqBench[15], which similarly provides structured evaluation of sequential reasoning capabilities. Meanwhile, neighboring efforts like SAGE[3] and Cognitive Load-Aware Inference[12] pursue complementary angles: the former develops adaptive reasoning strategies, while the latter optimizes inference under explicit load constraints.

A central tension across these branches concerns whether cognitive load should be treated as an intrinsic property of tasks (as in benchmark design) or as a dynamic resource managed by the system (as in memory optimization frameworks). Open questions remain about transferring insights from controlled synthetic settings to real-world applications where multiple load dimensions interact unpredictably.

Claimed Contributions

Grounding LLM evaluation in Cognitive Load Theory

The authors establish a theoretical foundation for evaluating large language models by mapping benchmark parameters to the three types of cognitive load from Cognitive Load Theory: intrinsic load (ICL), extraneous load (ECL), and germane load (GCL). This provides a principled framework for understanding LLM reasoning limitations.
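
As a data-oriented restatement of this claimed mapping (a sketch only; the parameter symbols follow the abstract, but the representation itself is not from the paper):

```python
# Sketch of the parameter-to-load-type mapping described above; the Enum and
# dictionary are illustrative, not taken from the paper.
from enum import Enum


class CognitiveLoad(Enum):
    INTRINSIC = "ICL"   # inherent complexity of the task itself
    EXTRANEOUS = "ECL"  # load imposed by irrelevant material
    GERMANE = "GCL"     # load devoted to constructing schemas

# Benchmark parameter said to operationalize each load type in CogniLoad.
PARAMETER_FOR = {
    CognitiveLoad.INTRINSIC:  "d (intrinsic difficulty)",
    CognitiveLoad.EXTRANEOUS: "rho (distractor-to-signal ratio)",
    CognitiveLoad.GERMANE:    "N (task length, an operational proxy)",
}
```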

10 retrieved papers

CogniLoad benchmark with independent cognitive load control

The authors present CogniLoad, a novel synthetic benchmark that enables independent manipulation of intrinsic difficulty (d), distractor density (ρ), and task length (N). This factorial design allows precise diagnosis of LLM failure modes across distinct cognitive load dimensions.
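
A fully crossed sweep over the three factors might look like the following sketch; the level values are placeholders rather than the settings used in the paper.

```python
# Hypothetical factorial sweep over CogniLoad's three load dimensions.
# The level values below are placeholders, not the paper's actual settings.
from itertools import product

difficulty_levels = [1, 2, 4, 8]          # d: intrinsic difficulty
distractor_ratios = [0.0, 0.5, 1.0, 2.0]  # rho: distractor-to-signal ratio
task_lengths = [25, 50, 100, 200]         # N: task length in statements

conditions = [
    {"d": d, "rho": rho, "N": n}
    for d, rho, n in product(difficulty_levels, distractor_ratios, task_lengths)
]

# Because the grid is fully crossed, any one factor can be varied while the
# other two are held fixed, e.g. a pure length sweep at baseline d and rho:
length_sweep = [c for c in conditions if c["d"] == 1 and c["rho"] == 0.0]
```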

8 retrieved papers

Automatic puzzle generation and evaluation algorithm

The authors develop an algorithmic framework for automatically generating and evaluating natural-language logic puzzles with tunable parameters. This enables scalable, reproducible benchmarking of reasoning capabilities across different models.
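
The report does not reproduce the authors' algorithm, so the toy generator below only illustrates the general recipe (ground-truth answers are known by construction, so grading needs no human labels); the puzzle template and helper logic are invented for illustration.

```python
# Runnable toy approximation of automatic puzzle generation and grading.
# This is NOT CogniLoad's algorithm; the puzzle template is invented here purely
# to show that answers known by construction make evaluation label-free.
import random


def generate_puzzle(d: int, rho: float, n: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    total, signal = 0, []
    for _ in range(d):                        # d relevant update steps (signal)
        delta = rng.randint(1, 9)
        total += delta
        signal.append(f"Alice gains {delta} coins.")
    n_distract = min(max(n - d, 0), round(rho * d))  # respect ratio rho and length n
    distractors = [f"Bob counts cloud number {rng.randint(1, 99)}."
                   for _ in range(n_distract)]
    statements = signal + distractors
    rng.shuffle(statements)                   # bury the signal among distractors
    prompt = " ".join(statements) + " How many coins does Alice have?"
    return {"prompt": prompt, "answer": str(total)}  # answer known by construction


def accuracy(model_fn, puzzles: list[dict]) -> float:
    """Exact-match accuracy of a model callable over generated puzzles."""
    hits = sum(model_fn(p["prompt"]).strip() == p["answer"] for p in puzzles)
    return hits / len(puzzles)
```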

10 retrieved papers (3 potentially refuting)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

Grounding LLM evaluation in Cognitive Load Theory

The authors establish a theoretical foundation for evaluating large language models by mapping benchmark parameters to the three types of cognitive load from Cognitive Load Theory: intrinsic load (ICL), extraneous load (ECL), and germane load (GCL). This provides a principled framework for understanding LLM reasoning limitations.

Contribution 2

CogniLoad benchmark with independent cognitive load control

The authors present CogniLoad, a novel synthetic benchmark that enables independent manipulation of intrinsic difficulty (d), distractor density (ρ), and task length (N). This factorial design allows precise diagnosis of LLM failure modes across distinct cognitive load dimensions.

Contribution 3

Automatic puzzle generation and evaluation algorithm

The authors develop an algorithmic framework for automatically generating and evaluating natural-language logic puzzles with tunable parameters. This enables scalable, reproducible benchmarking of reasoning capabilities across different models.
