HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games
Overview
Overall Novelty Assessment
The paper introduces HardcoreLogic, a benchmark of over 5,000 puzzles across 10 games designed to test large reasoning models on non-canonical logic puzzle variants. According to the taxonomy tree, this work occupies a unique position: it is the sole paper in the 'Long-Tail and Variant-Based Puzzle Benchmarks' leaf, which sits under the broader 'Puzzle-Based Reasoning Benchmarks and Datasets' branch. This leaf explicitly focuses on systematic robustness testing through uncommon puzzle configurations, distinguishing it from the more populated sibling categories of synthetic generation (4 papers) and static curated benchmarks (7 papers).
The taxonomy reveals that HardcoreLogic's closest conceptual neighbors are in adjacent leaves: 'Logical Constraint Satisfaction Puzzles' (3 papers on SAT-based and rule-based generators) and 'Text-Based Logical Puzzle Reasoning' (3 papers on deductive tasks without visual components). While these siblings address canonical puzzle formats or synthetic generation with verifiable rewards, HardcoreLogic diverges by systematically transforming existing puzzles through increased complexity, uncommon elements, and unsolvable instances. The taxonomy's scope notes clarify that variant-based benchmarks explicitly exclude canonical formats, positioning this work as a stress test for generalization rather than a static evaluation suite.
Among the 20 candidates examined through limited semantic search, none clearly refuted any of the three claimed contributions. The 'HardcoreLogic benchmark' contribution had 0 candidates examined, likely because it is a specific named artifact. The 'systematic long-tail transformation methodology' and 'comprehensive error analysis framework' contributions each had 10 candidates examined, all of which were classified as non-refutable or unclear. This suggests that, within the examined scope, the specific combination of transformation dimensions (IC, UE, UP) and the error analysis approach appear distinct from prior work, though the limited search scale (20 papers, not hundreds) means relevant literature may remain unexplored.
Given the constrained search scope of 20 candidates and the paper's solitary position in its taxonomy leaf, the work appears to occupy a relatively sparse research direction. The absence of sibling papers and the zero refutations across examined candidates suggest novelty in the specific benchmark design and transformation methodology, though the broader themes of robustness testing and memorization analysis connect to established concerns in adjacent taxonomy branches. The analysis reflects top-K semantic matches and does not constitute an exhaustive field survey.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce HardcoreLogic, a benchmark containing over 5,000 logic puzzles spanning 10 game types. The benchmark systematically transforms canonical puzzles through three dimensions: Increased Complexity, Uncommon Elements, and Unsolvable Puzzles, reducing reliance on memorization and testing model robustness on non-canonical variants.
The authors develop a systematic transformation framework that modifies standard logic puzzles along three dimensions: expanding search spaces and strengthening constraints (IC), introducing novel rules and altered forms (UE), and creating unsolvable instances (UP). This methodology enables controlled difficulty scaling and reduces training data overlap.
The authors perform a systematic error analysis categorizing LRM failures into six types for solvable puzzles and four types for unsolvable puzzles. The analysis reveals that factual errors dominate across models, that stronger models exhibit brute-force behaviors, and that weaker models frequently produce degenerate outputs, providing insight into current reasoning limitations.
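The three transformation dimensions claimed above can be pictured as a small data model. The sketch below is purely illustrative: the class and method names are hypothetical and do not reflect the authors' actual implementation; only the IC/UE/UP labels and the idea that UP instances are unsolvable come from the paper's description.

```python
from dataclasses import dataclass
from enum import Enum


class Transformation(Enum):
    """Hypothetical labels for the paper's three transformation dimensions."""
    INCREASED_COMPLEXITY = "IC"  # expanded search space, strengthened constraints
    UNCOMMON_ELEMENTS = "UE"     # novel rules or altered puzzle forms
    UNSOLVABLE_PUZZLE = "UP"     # instance constructed to have no valid solution


@dataclass
class PuzzleVariant:
    """A non-canonical instance derived from a canonical puzzle."""
    game: str                              # e.g. "sudoku"
    transformations: list[Transformation]  # dimensions applied so far
    solvable: bool = True

    def apply(self, t: Transformation) -> "PuzzleVariant":
        """Return a new variant with one more transformation applied."""
        still_solvable = self.solvable and t is not Transformation.UNSOLVABLE_PUZZLE
        return PuzzleVariant(self.game, self.transformations + [t], still_solvable)


variant = PuzzleVariant("sudoku", []).apply(Transformation.UNCOMMON_ELEMENTS)
print([t.value for t in variant.transformations], variant.solvable)  # ['UE'] True
```

A variant can carry several transformations at once, which matches the benchmark's framing of controlled difficulty scaling: each applied dimension moves the instance further from its canonical, likely-memorized form.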
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
HardcoreLogic benchmark with long-tail logic puzzle transformations
The authors introduce HardcoreLogic, a benchmark containing over 5,000 logic puzzles spanning 10 game types. The benchmark systematically transforms canonical puzzles through three dimensions: Increased Complexity, Uncommon Elements, and Unsolvable Puzzles, reducing reliance on memorization and testing model robustness on non-canonical variants.
Systematic long-tail transformation methodology
The authors develop a systematic transformation framework that modifies standard logic puzzles along three dimensions: expanding search spaces and strengthening constraints (IC), introducing novel rules and altered forms (UE), and creating unsolvable instances (UP). This methodology enables controlled difficulty scaling and reduces training data overlap.
[1] Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles
[15] AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models
[22] FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving
[23] The Transformation Logics
[24] Puzzles: A benchmark for neural algorithmic reasoning
[25] Generating Solvable and Difficult Logic Grid Puzzles
[26] Developing configurations and solutions for logical puzzles with UML and OCL
[27] Finding the question: A puzzle-based approach to the logic of discovery
[28] Difficulty Rating of Sudoku Puzzles: An Overview and Evaluation
[29] The CrossSong Puzzle: Developing a logic puzzle for musical thinking
Comprehensive error analysis framework for LRM failures
The authors perform a systematic error analysis categorizing LRM failures into six types for solvable puzzles and four types for unsolvable puzzles. The analysis reveals that factual errors dominate across models, that stronger models exhibit brute-force behaviors, and that weaker models frequently produce degenerate outputs, providing insight into current reasoning limitations.
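The error analysis described above amounts to tallying labeled failures per category, split by solvability. The sketch below illustrates how such a tally might be computed; it is a hypothetical illustration, not the authors' code, and it uses only the categories explicitly named in this report (factual errors, brute-force behavior, degenerate outputs) plus a catch-all, since the paper's full six-way and four-way taxonomies are not reproduced here.

```python
from collections import Counter

# Only categories named in the analysis above, plus a placeholder "other";
# the paper's complete 6-type (solvable) / 4-type (unsolvable) taxonomies
# are not reproduced in this report.
KNOWN_CATEGORIES = {"factual_error", "brute_force", "degenerate_output", "other"}


def tally_errors(labeled_failures: list[tuple[str, str]]) -> dict[str, Counter]:
    """Group (split, category) failure labels into per-split category counts.

    `split` is "solvable" or "unsolvable"; `category` must be a known label.
    """
    counts: dict[str, Counter] = {"solvable": Counter(), "unsolvable": Counter()}
    for split, category in labeled_failures:
        if category not in KNOWN_CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        counts[split][category] += 1
    return counts


failures = [
    ("solvable", "factual_error"),
    ("solvable", "factual_error"),
    ("solvable", "brute_force"),
    ("unsolvable", "degenerate_output"),
]
print(tally_errors(failures)["solvable"]["factual_error"])  # 2
```

Reading the per-split counters side by side is what supports findings like "factual errors dominate": the dominant category is simply the one with the largest count within each split.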