CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: benchmark, LLM, reasoning, long-context reasoning, Cognitive Load Theory, CLT, synthetic benchmark, natural language benchmark, intrinsic difficulty, extraneous load, needle-in-a-haystack
Abstract:

Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty (d) controls intrinsic load; distractor-to-signal ratio (ρ) regulates extraneous load; and task length (N) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.
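
To make the three tunable dimensions concrete, the sketch below shows one way a single benchmark condition could be represented; the class name, field names, and default values are illustrative assumptions, not CogniLoad's actual interface.

```python
# Illustrative sketch only: CogniLoad's real generator interface is not shown in
# this report, so the class name, fields, and defaults below are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class CogniLoadCondition:
    d: int = 3        # intrinsic difficulty        -> intrinsic load (ICL)
    rho: float = 1.0  # distractor-to-signal ratio  -> extraneous load (ECL)
    n: int = 50       # task length in statements   -> proxy for germane load (GCL)

    def signal_and_distractor_counts(self) -> tuple[int, int]:
        """Split the n statements into signal and distractor counts implied by rho,
        assuming n counts both kinds of statements."""
        signal = round(self.n / (1.0 + self.rho))
        return signal, self.n - signal
```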

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces CogniLoad, a synthetic benchmark that applies Cognitive Load Theory to evaluate long-context reasoning in LLMs through independently tunable parameters: intrinsic difficulty, distractor-to-signal ratio, and task length. Within the taxonomy, it resides in the 'Synthetic Benchmark Design with Parametric Control' leaf, which contains only two papers total. This represents a relatively sparse research direction focused specifically on benchmarks with systematic parametric control over cognitive load dimensions, suggesting the paper enters a nascent but well-defined area of investigation.

The taxonomy reveals that CogniLoad's parent branch, 'Cognitive Load Theory Foundations and Benchmarking', encompasses three distinct research directions. Its sibling leaves include 'Cognitive Load Mechanisms and Interference Effects' (examining phenomena like proactive interference and context saturation) and 'Human-Model Cognitive Alignment Studies' (comparing model limitations with human cognitive constraints). These neighboring areas provide complementary perspectives—mechanistic studies investigate specific cognitive phenomena, while alignment research grounds model behavior in human baselines—but neither offers the systematic parametric control that defines CogniLoad's methodological contribution.

Among the three contributions analyzed, the first two appear relatively novel within the limited search scope. 'Grounding LLM evaluation in Cognitive Load Theory' examined 10 candidates with zero refutations, while 'CogniLoad benchmark with independent cognitive load control' examined 8 candidates, also with zero refutations. However, the third contribution, 'Automatic puzzle generation and evaluation algorithm', shows more substantial prior work: among 10 candidates examined, 3 were identified as potentially refuting. This suggests that while the theoretical framing and benchmark design may be distinctive, the technical implementation of puzzle generation has more established precedents in the examined literature.

Based on the analysis of 28 candidate papers from semantic search, CogniLoad appears to occupy a relatively novel position in systematically operationalizing cognitive load theory for LLM evaluation. The sparse population of its taxonomy leaf and low refutation rates for core contributions suggest meaningful differentiation from prior work. However, this assessment is constrained by the limited search scope and does not constitute an exhaustive survey of all potentially relevant benchmarking literature.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 3

Research Landscape Overview

Core task: long-context reasoning with controllable cognitive load dimensions. This emerging field examines how language models handle extended inputs under varying cognitive demands, drawing inspiration from human cognitive load theory. The taxonomy organizes research into three main branches: Cognitive Load Theory Foundations and Benchmarking, which develops theoretical frameworks and evaluation protocols for measuring cognitive strain; Context Management and Memory Optimization Frameworks, which addresses architectural strategies for efficient information retention and retrieval; and Application Domains and Task-Specific Implementations, which explores how cognitive load principles manifest across diverse problem settings. Representative works span from foundational studies on human-like context limitations (Context Limitations Human-like[8]) to recent frameworks that explicitly model cognitive constraints (Cognitive Bandwidth Bottleneck[2], Cognitive Workspace[7]). The field reflects growing recognition that scaling context windows alone is insufficient without understanding the qualitative dimensions of reasoning difficulty.

Particularly active lines of work explore synthetic benchmark design with parametric control over task complexity, enabling systematic study of how models degrade under increasing cognitive demands. CogniLoad[0] sits squarely within this benchmarking thrust, offering controllable dimensions to isolate specific load factors in long-context scenarios. Its emphasis on parametric control aligns closely with seqBench[15], which similarly provides structured evaluation of sequential reasoning capabilities. Meanwhile, neighboring efforts like SAGE[3] and Cognitive Load-Aware Inference[12] pursue complementary angles: the former develops adaptive reasoning strategies, while the latter optimizes inference under explicit load constraints.

A central tension across these branches concerns whether cognitive load should be treated as an intrinsic property of tasks (as in benchmark design) or as a dynamic resource managed by the system (as in memory optimization frameworks). Open questions remain about transferring insights from controlled synthetic settings to real-world applications where multiple load dimensions interact unpredictably.

Claimed Contributions

Grounding LLM evaluation in Cognitive Load Theory

The authors establish a theoretical foundation for evaluating large language models by mapping benchmark parameters to the three types of cognitive load from Cognitive Load Theory: intrinsic load (ICL), extraneous load (ECL), and germane load (GCL). This provides a principled framework for understanding LLM reasoning limitations.
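
As a data-oriented restatement of this claimed mapping (a sketch only; the parameter symbols follow the abstract, but the representation itself is not from the paper):

```python
# Sketch of the parameter-to-load-type mapping described above; the Enum and
# dictionary are illustrative, not taken from the paper.
from enum import Enum


class CognitiveLoad(Enum):
    INTRINSIC = "ICL"   # inherent complexity of the task itself
    EXTRANEOUS = "ECL"  # load imposed by irrelevant material
    GERMANE = "GCL"     # load devoted to constructing schemas

# Benchmark parameter said to operationalize each load type in CogniLoad.
PARAMETER_FOR = {
    CognitiveLoad.INTRINSIC:  "d (intrinsic difficulty)",
    CognitiveLoad.EXTRANEOUS: "rho (distractor-to-signal ratio)",
    CognitiveLoad.GERMANE:    "N (task length, an operational proxy)",
}
```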

10 retrieved papers

CogniLoad benchmark with independent cognitive load control

The authors present CogniLoad, a novel synthetic benchmark that enables independent manipulation of intrinsic difficulty (d), distractor density (ρ), and task length (N). This factorial design allows precise diagnosis of LLM failure modes across distinct cognitive load dimensions.
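
A fully crossed sweep over the three factors might look like the following sketch; the level values are placeholders rather than the settings used in the paper.

```python
# Hypothetical factorial sweep over CogniLoad's three load dimensions.
# The level values below are placeholders, not the paper's actual settings.
from itertools import product

difficulty_levels = [1, 2, 4, 8]          # d: intrinsic difficulty
distractor_ratios = [0.0, 0.5, 1.0, 2.0]  # rho: distractor-to-signal ratio
task_lengths = [25, 50, 100, 200]         # N: task length in statements

conditions = [
    {"d": d, "rho": rho, "N": n}
    for d, rho, n in product(difficulty_levels, distractor_ratios, task_lengths)
]

# Because the grid is fully crossed, any one factor can be varied while the
# other two are held fixed, e.g. a pure length sweep at baseline d and rho:
length_sweep = [c for c in conditions if c["d"] == 1 and c["rho"] == 0.0]
```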

8 retrieved papers

Automatic puzzle generation and evaluation algorithm

The authors develop an algorithmic framework for automatically generating and evaluating natural-language logic puzzles with tunable parameters. This enables scalable, reproducible benchmarking of reasoning capabilities across different models.
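
The report does not reproduce the authors' algorithm, so the toy generator below only illustrates the general recipe (ground-truth answers are known by construction, so grading needs no human labels); the puzzle template and helper logic are invented for illustration.

```python
# Runnable toy approximation of automatic puzzle generation and grading.
# This is NOT CogniLoad's algorithm; the puzzle template is invented here purely
# to show that answers known by construction make evaluation label-free.
import random


def generate_puzzle(d: int, rho: float, n: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    total, signal = 0, []
    for _ in range(d):                        # d relevant update steps (signal)
        delta = rng.randint(1, 9)
        total += delta
        signal.append(f"Alice gains {delta} coins.")
    n_distract = min(max(n - d, 0), round(rho * d))  # respect ratio rho and length n
    distractors = [f"Bob counts cloud number {rng.randint(1, 99)}."
                   for _ in range(n_distract)]
    statements = signal + distractors
    rng.shuffle(statements)                   # bury the signal among distractors
    prompt = " ".join(statements) + " How many coins does Alice have?"
    return {"prompt": prompt, "answer": str(total)}  # answer known by construction


def accuracy(model_fn, puzzles: list[dict]) -> float:
    """Exact-match accuracy of a model callable over generated puzzles."""
    hits = sum(model_fn(p["prompt"]).strip() == p["answer"] for p in puzzles)
    return hits / len(puzzles)
```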

10 retrieved papers (3 potentially refuting)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

Grounding LLM evaluation in Cognitive Load Theory

The authors establish a theoretical foundation for evaluating large language models by mapping benchmark parameters to the three types of cognitive load from Cognitive Load Theory: intrinsic load (ICL), extraneous load (ECL), and germane load (GCL). This provides a principled framework for understanding LLM reasoning limitations.

Contribution 2

CogniLoad benchmark with independent cognitive load control

The authors present CogniLoad, a novel synthetic benchmark that enables independent manipulation of intrinsic difficulty (d), distractor density (ρ), and task length (N). This factorial design allows precise diagnosis of LLM failure modes across distinct cognitive load dimensions.

Contribution 3

Automatic puzzle generation and evaluation algorithm

The authors develop an algorithmic framework for automatically generating and evaluating natural-language logic puzzles with tunable parameters. This enables scalable, reproducible benchmarking of reasoning capabilities across different models.
