HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: long-tail benchmark, logic puzzle games, large reasoning model
Abstract:

Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games that require deriving solutions satisfying all constraints. However, whether they can flexibly apply appropriate rules to varying conditions, particularly when faced with non-canonical game variants, remains an open question. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants. To address this, we introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games, designed to test the robustness of LRMs on the "long-tail" of logical games. HardcoreLogic systematically transforms canonical puzzles through three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP), reducing reliance on shortcut memorization. Evaluations on a diverse set of LRMs reveal significant performance drops, even for models achieving top scores on existing benchmarks, indicating heavy reliance on memorized stereotypes. While increased complexity is the dominant source of difficulty, models also struggle with subtle rule variations that do not necessarily increase puzzle difficulty. Our systematic error analysis on solvable and unsolvable puzzles further highlights gaps in genuine reasoning. Overall, HardcoreLogic exposes the limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HardcoreLogic, a benchmark of over 5,000 puzzles across 10 games designed to test large reasoning models on non-canonical logic puzzle variants. According to the taxonomy tree, this work occupies a unique position: it is the sole paper in the 'Long-Tail and Variant-Based Puzzle Benchmarks' leaf, which sits under the broader 'Puzzle-Based Reasoning Benchmarks and Datasets' branch. This leaf explicitly focuses on systematic robustness testing through uncommon puzzle configurations, distinguishing it from the more populated sibling categories of synthetic generation (4 papers) and static curated benchmarks (7 papers).

The taxonomy reveals that HardcoreLogic's closest conceptual neighbors are in adjacent leaves: 'Logical Constraint Satisfaction Puzzles' (3 papers on SAT-based and rule-based generators) and 'Text-Based Logical Puzzle Reasoning' (3 papers on deductive tasks without visual components). While these siblings address canonical puzzle formats or synthetic generation with verifiable rewards, HardcoreLogic diverges by systematically transforming existing puzzles through increased complexity, uncommon elements, and unsolvable instances. The taxonomy's scope notes clarify that variant-based benchmarks explicitly exclude canonical formats, positioning this work as a stress test for generalization rather than a static evaluation suite.

Among the 20 candidates examined through limited semantic search, none were found to clearly refute any of the three contributions. The 'HardcoreLogic benchmark' contribution examined 0 candidates (likely due to its specificity as a named artifact). The 'systematic long-tail transformation methodology' and 'comprehensive error analysis framework' each examined 10 candidates, with all 10 classified as non-refutable or unclear in both cases. This suggests that within the examined scope, the specific combination of transformation dimensions (IC, UE, UP) and the error analysis approach appear distinct from prior work, though the limited search scale (20 papers, not hundreds) means unexplored literature may exist.

Given the constrained search scope of 20 candidates and the paper's solitary position in its taxonomy leaf, the work appears to occupy a relatively sparse research direction. The absence of sibling papers and the zero refutations across examined candidates suggest novelty in the specific benchmark design and transformation methodology, though the broader themes of robustness testing and memorization analysis connect to established concerns in adjacent taxonomy branches. The analysis reflects top-K semantic matches and does not constitute an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 21
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating large reasoning models on long-tail logic puzzle variants. The field structure reflects broad interest in testing and understanding reasoning capabilities through diverse puzzle-based challenges. The taxonomy organizes work into several main branches: Puzzle-Based Reasoning Benchmarks and Datasets, which houses efforts to construct specialized test sets ranging from visual puzzles like PuzzleVQA[2] to constraint-satisfaction problems such as SATBench[11]; Reasoning Mechanisms and Cognitive Processes, exploring how models perform inference steps and scale reasoning effort; Bias and Fairness in Puzzle Reasoning, examining systematic errors and biases that emerge in logic grid tasks; Commonsense and Abductive Reasoning, addressing puzzles that require world knowledge or lateral thinking (e.g., BRAINTEASER[9], TurtleSoup Puzzles[12]); Formal Verification and Theorem Proving, connecting puzzle solving to rigorous logical frameworks; and Complex Systems Decision Support, linking puzzle reasoning to real-world decision-making scenarios. Together, these branches span the spectrum from synthetic benchmarks to cognitive modeling and practical applications.

Within the Puzzle-Based Reasoning Benchmarks and Datasets branch, a particularly active line of work focuses on long-tail and variant-based puzzle benchmarks that stress-test models on rare or systematically modified puzzle instances. HardcoreLogic[0] sits squarely in this cluster, emphasizing evaluation on unusual logic puzzle variants that probe whether large reasoning models can generalize beyond common training distributions. Nearby efforts such as Enigmata[1] and Knights and Knaves[10] explore specialized puzzle families, while PuzzleWorld[6] and Puzzle Prodigies[7] offer broader multi-puzzle testbeds.
A key theme across these works is the tension between memorization and genuine reasoning: studies like Memorization Logical Reasoning[4] and UNcommonsense Reasoning[5] highlight how models may rely on pattern matching rather than robust inference. HardcoreLogic[0] contributes to this conversation by targeting the long tail of puzzle variants, aiming to reveal whether scaling and training advances translate into flexible problem-solving or merely surface-level pattern recognition.

Claimed Contributions

HardcoreLogic benchmark with long-tail logic puzzle transformations

The authors introduce HardcoreLogic, a benchmark containing over 5,000 logic puzzles spanning 10 game types. The benchmark systematically transforms canonical puzzles through three dimensions: Increased Complexity, Uncommon Elements, and Unsolvable Puzzles, reducing reliance on memorization and testing model robustness on non-canonical variants.

Retrieved candidate papers: 0
Systematic long-tail transformation methodology

The authors develop a systematic transformation framework that modifies standard logic puzzles along three dimensions: expanding search spaces and strengthening constraints (IC), introducing novel rules and altered forms (UE), and creating unsolvable instances (UP). This methodology enables controlled difficulty scaling and reduces training data overlap.

Retrieved candidate papers: 10
Comprehensive error analysis framework for LRM failures

The authors perform a systematic error analysis categorizing LRM failures into six types for solvable puzzles and four types for unsolvable puzzles. This analysis reveals that factual errors dominate across models, stronger models exhibit brute-force behaviors, and weaker models struggle with degenerate outputs, providing insights into reasoning limitations.

Retrieved candidate papers: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though a signal constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

HardcoreLogic benchmark with long-tail logic puzzle transformations

The authors introduce HardcoreLogic, a benchmark containing over 5,000 logic puzzles spanning 10 game types. The benchmark systematically transforms canonical puzzles through three dimensions: Increased Complexity, Uncommon Elements, and Unsolvable Puzzles, reducing reliance on memorization and testing model robustness on non-canonical variants.

Contribution 2

Systematic long-tail transformation methodology

The authors develop a systematic transformation framework that modifies standard logic puzzles along three dimensions: expanding search spaces and strengthening constraints (IC), introducing novel rules and altered forms (UE), and creating unsolvable instances (UP). This methodology enables controlled difficulty scaling and reduces training data overlap.

Contribution 3

Comprehensive error analysis framework for LRM failures

The authors perform a systematic error analysis categorizing LRM failures into six types for solvable puzzles and four types for unsolvable puzzles. This analysis reveals that factual errors dominate across models, stronger models exhibit brute-force behaviors, and weaker models struggle with degenerate outputs, providing insights into reasoning limitations.
