HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: long-tail benchmark, logic puzzle games, large reasoning model
Abstract:

Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games that require deriving solutions satisfying all constraints. However, whether they can flexibly apply appropriate rules to varying conditions, particularly when faced with non-canonical game variants, remains an open question. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants. To address this, we introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games, designed to test the robustness of LRMs on the "long-tail" of logical games. HardcoreLogic systematically transforms canonical puzzles through three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP), reducing reliance on shortcut memorization. Evaluations on a diverse set of LRMs reveal significant performance drops, even for models achieving top scores on existing benchmarks, indicating heavy reliance on memorized stereotypes. While increased complexity is the dominant source of difficulty, models also struggle with subtle rule variations that do not necessarily increase puzzle difficulty. Our systematic error analysis on solvable and unsolvable puzzles further highlights gaps in genuine reasoning. Overall, HardcoreLogic exposes the limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HardcoreLogic, a benchmark of over 5,000 puzzles across 10 games designed to test large reasoning models on non-canonical logic puzzle variants. According to the taxonomy tree, this work occupies a unique position: it is the sole paper in the 'Long-Tail and Variant-Based Puzzle Benchmarks' leaf, which sits under the broader 'Puzzle-Based Reasoning Benchmarks and Datasets' branch. This leaf explicitly focuses on systematic robustness testing through uncommon puzzle configurations, distinguishing it from the more populated sibling categories of synthetic generation (4 papers) and static curated benchmarks (7 papers).

The taxonomy reveals that HardcoreLogic's closest conceptual neighbors are in adjacent leaves: 'Logical Constraint Satisfaction Puzzles' (3 papers on SAT-based and rule-based generators) and 'Text-Based Logical Puzzle Reasoning' (3 papers on deductive tasks without visual components). While these siblings address canonical puzzle formats or synthetic generation with verifiable rewards, HardcoreLogic diverges by systematically transforming existing puzzles through increased complexity, uncommon elements, and unsolvable instances. The taxonomy's scope notes clarify that variant-based benchmarks explicitly exclude canonical formats, positioning this work as a stress test for generalization rather than a static evaluation suite.

Among the 20 candidates examined through limited semantic search, none were found to clearly refute any of the three contributions. The 'HardcoreLogic benchmark' contribution examined 0 candidates (likely due to its specificity as a named artifact). The 'systematic long-tail transformation methodology' and 'comprehensive error analysis framework' each examined 10 candidates, with all 10 classified as non-refutable or unclear in both cases. This suggests that within the examined scope, the specific combination of transformation dimensions (IC, UE, UP) and the error analysis approach appear distinct from prior work, though the limited search scale (20 papers, not hundreds) means unexplored literature may exist.

Given the constrained search scope of 20 candidates and the paper's solitary position in its taxonomy leaf, the work appears to occupy a relatively sparse research direction. The absence of sibling papers and the zero refutations across examined candidates suggest novelty in the specific benchmark design and transformation methodology, though the broader themes of robustness testing and memorization analysis connect to established concerns in adjacent taxonomy branches. The analysis reflects top-K semantic matches and does not constitute an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating large reasoning models on long-tail logic puzzle variants. The field structure reflects broad interest in testing and understanding reasoning capabilities through diverse puzzle-based challenges. The taxonomy organizes work into several main branches: Puzzle-Based Reasoning Benchmarks and Datasets, which houses efforts to construct specialized test sets ranging from visual puzzles like PuzzleVQA[2] to constraint-satisfaction problems such as SATBench[11]; Reasoning Mechanisms and Cognitive Processes, exploring how models perform inference steps and scale reasoning effort; Bias and Fairness in Puzzle Reasoning, examining systematic errors and biases that emerge in logic grid tasks; Commonsense and Abductive Reasoning, addressing puzzles that require world knowledge or lateral thinking (e.g., BRAINTEASER[9], TurtleSoup Puzzles[12]); Formal Verification and Theorem Proving, connecting puzzle solving to rigorous logical frameworks; and Complex Systems Decision Support, linking puzzle reasoning to real-world decision-making scenarios. Together, these branches span the spectrum from synthetic benchmarks to cognitive modeling and practical applications.

Within the Puzzle-Based Reasoning Benchmarks and Datasets branch, a particularly active line of work focuses on long-tail and variant-based puzzle benchmarks that stress-test models on rare or systematically modified puzzle instances. HardcoreLogic[0] sits squarely in this cluster, emphasizing evaluation on unusual logic puzzle variants that probe whether large reasoning models can generalize beyond common training distributions. Nearby efforts such as Enigmata[1] and Knights and Knaves[10] explore specialized puzzle families, while PuzzleWorld[6] and Puzzle Prodigies[7] offer broader multi-puzzle testbeds.
A key theme across these works is the tension between memorization and genuine reasoning: studies like Memorization Logical Reasoning[4] and UNcommonsense Reasoning[5] highlight how models may rely on pattern matching rather than robust inference. HardcoreLogic[0] contributes to this conversation by targeting the long tail of puzzle variants, aiming to reveal whether scaling and training advances translate into flexible problem-solving or merely surface-level pattern recognition.

Claimed Contributions

HardcoreLogic benchmark with long-tail logic puzzle transformations

The authors introduce HardcoreLogic, a benchmark containing over 5,000 logic puzzles spanning 10 game types. The benchmark systematically transforms canonical puzzles through three dimensions: Increased Complexity, Uncommon Elements, and Unsolvable Puzzles, reducing reliance on memorization and testing model robustness on non-canonical variants.

Retrieved candidate papers: 0
Systematic long-tail transformation methodology

The authors develop a systematic transformation framework that modifies standard logic puzzles along three dimensions: expanding search spaces and strengthening constraints (IC), introducing novel rules and altered forms (UE), and creating unsolvable instances (UP). This methodology enables controlled difficulty scaling and reduces training data overlap.

Retrieved candidate papers: 10
Comprehensive error analysis framework for LRM failures

The authors perform a systematic error analysis categorizing LRM failures into six types for solvable puzzles and four types for unsolvable puzzles. This analysis reveals that factual errors dominate across models, stronger models exhibit brute-force behaviors, and weaker models struggle with degenerate outputs, providing insights into reasoning limitations.

Retrieved candidate papers: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though a signal constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

HardcoreLogic benchmark with long-tail logic puzzle transformations

The authors introduce HardcoreLogic, a benchmark containing over 5,000 logic puzzles spanning 10 game types. The benchmark systematically transforms canonical puzzles through three dimensions: Increased Complexity, Uncommon Elements, and Unsolvable Puzzles, reducing reliance on memorization and testing model robustness on non-canonical variants.

Contribution 2

Systematic long-tail transformation methodology

The authors develop a systematic transformation framework that modifies standard logic puzzles along three dimensions: expanding search spaces and strengthening constraints (IC), introducing novel rules and altered forms (UE), and creating unsolvable instances (UP). This methodology enables controlled difficulty scaling and reduces training data overlap.

Contribution 3

Comprehensive error analysis framework for LRM failures

The authors perform a systematic error analysis categorizing LRM failures into six types for solvable puzzles and four types for unsolvable puzzles. This analysis reveals that factual errors dominate across models, stronger models exhibit brute-force behaviors, and weaker models struggle with degenerate outputs, providing insights into reasoning limitations.
