HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Overview
Overall Novelty Assessment
The paper introduces HeuriGym, an agentic benchmark framework for evaluating LLM-generated heuristics in combinatorial optimization, together with the Quality-Yield Index (QYI) metric and a suite of nine benchmark problems. It sits in the Benchmarking and Evaluation leaf of the taxonomy, which contains four papers in total. This is a relatively small research direction: it is no larger than Domain-Specific Heuristic Discovery (four papers) and smaller than Evolutionary and Reflective Heuristic Search (five papers). Given the rapid growth in heuristic generation methods, this suggests that systematic evaluation infrastructure remains underdeveloped.
The taxonomy shows that many branches focus on algorithmic approaches (evolutionary frameworks, tree search methods, direct solution generation), while the Benchmarking and Evaluation category addresses a distinct need for rigorous assessment tools. Neighboring leaves such as Iterative Optimization and Prompting (two papers) and Hyper-Heuristic and Instance-Specific Methods (three papers) develop complementary techniques but lack standardized evaluation protocols. The scope note for Benchmarking and Evaluation explicitly excludes methods that propose new algorithms. This positions HeuriGym as infrastructure rather than a new optimization technique and sets it apart from most of the fifty-paper taxonomy, which focuses on algorithmic innovation.
Among the thirty candidates examined, none clearly refutes any of the three contributions. Ten candidates were examined for each contribution (the HeuriGym framework, the Quality-Yield Index metric, and the benchmark suite), and none showed a refutable overlap. This suggests that, within the limited search scope, no prior work provides a directly comparable agentic evaluation framework that combines iterative refinement, code execution feedback, and the QYI metric. The sibling papers in Benchmarking and Evaluation, which include comprehensive evaluation studies and capability assessments, likely pursue overlapping evaluation goals, although the analysis does not indicate that they provide identical infrastructure or metrics.
Based on the top-thirty semantic matches examined, the work appears to occupy a distinct position within evaluation methodology for LLM-based optimization. The absence of refutable candidates across all contributions, combined with the sparse Benchmarking and Evaluation category, suggests the specific combination of agentic framework design and the QYI metric may be novel. However, this assessment is constrained by the limited search scope and does not account for potential overlap with evaluation frameworks outside the examined candidates or in adjacent fields beyond combinatorial optimization.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose HeuriGym, an end-to-end agentic framework that enables LLMs to generate heuristic algorithms for combinatorial optimization problems, receive execution feedback, and iteratively refine their solutions. The framework provides automated verification and quantitative evaluation, and it supports realistic programming tasks across multiple domains.
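To make the generate, execute, verify, and refine cycle concrete, here is a minimal Python sketch of such a loop. It is an illustration under stated assumptions, not HeuriGym's implementation: `agentic_refine`, `Feedback`, `toy_propose`, `is_permutation`, and `path_length` are hypothetical names, the LLM is replaced by a stub that returns a different candidate program each round, and the toy objective (order a list of integers to minimize the total step between neighbours) exists only to exercise the loop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Instance = List[int]
Solution = List[int]
# A "program" stands in for LLM-generated heuristic code that the loop executes.
Program = Callable[[Instance], Solution]


@dataclass
class Feedback:
    round_idx: int
    feasible: bool
    mean_cost: Optional[float]  # None when any instance failed
    message: str


def agentic_refine(
    propose: Callable[[List[Feedback]], Program],  # hypothetical LLM stand-in
    instances: List[Instance],
    verify: Callable[[Instance, Solution], bool],
    evaluate: Callable[[Instance, Solution], float],
    max_rounds: int = 3,
) -> List[Feedback]:
    history: List[Feedback] = []
    for r in range(max_rounds):
        program = propose(history)  # generate or refine, conditioned on past feedback
        costs: List[float] = []
        message = "ok"
        for inst in instances:
            try:
                sol = program(inst)  # execute the candidate heuristic
            except Exception as exc:  # runtime errors become feedback, not crashes
                message = f"runtime error: {exc}"
                break
            if not verify(inst, sol):  # automated feasibility check
                message = "infeasible solution"
                break
            costs.append(evaluate(inst, sol))  # quantitative evaluation
        feasible = message == "ok"
        mean_cost = sum(costs) / len(costs) if feasible and costs else None
        history.append(Feedback(r, feasible, mean_cost, message))
    return history


def toy_propose(history: List[Feedback]) -> Program:
    # Stub "LLM": returns a progressively better candidate each round.
    candidates: List[Program] = [
        lambda xs: xs[:-1],     # buggy: drops an element, so it is infeasible
        lambda xs: list(xs),    # feasible but naive: keeps the input order
        lambda xs: sorted(xs),  # optimal for this particular toy objective
    ]
    return candidates[min(len(history), len(candidates) - 1)]


def is_permutation(inst: Instance, sol: Solution) -> bool:
    return sorted(sol) == sorted(inst)


def path_length(inst: Instance, sol: Solution) -> float:
    return float(sum(abs(b - a) for a, b in zip(sol, sol[1:])))


log = agentic_refine(toy_propose, [[3, 1, 2], [5, 0, 4, 1]], is_permutation, path_length)
for fb in log:
    print(fb)
```

In this run, round 0 fails verification and its message is the kind of signal a real agent would feed back to the model; rounds 1 and 2 are feasible, with mean costs of 7.5 and 3.5. Returning failures as feedback instead of raising is what lets the loop keep refining.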
The authors introduce QYI as a unified metric that combines solution quality (relative to expert baselines) and yield (success rate) through a harmonic mean. It addresses a limitation of conventional pass@k metrics, which only record whether an attempt succeeds, by capturing both feasibility and solution quality in multi-round agentic settings.
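The QYI computation can be sketched as follows. The text above states only that QYI is a harmonic mean of quality and yield, so the normalization here is an assumption: yield is the fraction of instances with a feasible solution, quality is the mean over feasible instances of expert cost divided by LLM cost (capped at 1, minimization assumed), and `quality_yield_index` is a hypothetical helper name. The paper's exact definition may differ.

```python
from typing import Optional, Sequence


def quality_yield_index(
    llm_costs: Sequence[Optional[float]],
    expert_costs: Sequence[float],
) -> float:
    """Harmonic mean of quality and yield under an assumed normalization.

    llm_costs[i] is None when the generated program crashed or produced an
    infeasible solution on instance i. All costs are minimized and positive.
    """
    if not llm_costs or len(llm_costs) != len(expert_costs):
        raise ValueError("need exactly one LLM result per expert baseline")
    feasible = [(c, e) for c, e in zip(llm_costs, expert_costs) if c is not None]
    yield_ = len(feasible) / len(llm_costs)  # success rate over all instances
    if not feasible:
        return 0.0
    # Assumed per-instance quality: expert cost / LLM cost, capped at 1.0.
    quality = sum(min(1.0, e / c) for c, e in feasible) / len(feasible)
    return 2 * quality * yield_ / (quality + yield_)


expert = [10.0, 20.0, 40.0, 50.0]
half_solved_expert_level = [10.0, 20.0, None, None]  # yield 0.5, quality 1.0
all_solved_twice_cost = [20.0, 40.0, 80.0, 100.0]    # yield 1.0, quality 0.5
print(quality_yield_index(half_solved_expert_level, expert))
print(quality_yield_index(all_solved_twice_cost, expert))
```

Both runs evaluate to 2/3. A feasibility-only success rate would rate the second run as perfect, since every instance gets a valid answer, whereas the harmonic mean penalizes weakness in either quality or yield.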
The authors construct a benchmark of nine carefully selected combinatorial optimization problems from domains including computer systems, logistics, and biology. These problems have well-defined objectives and large solution spaces, and they are designed to resist memorization while requiring genuine algorithmic reasoning and problem-specific heuristic design.
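What a well-defined objective with automated verification could look like for a single benchmark entry is sketched below. Everything here is hypothetical: `ProblemSpec`, the toy two-machine makespan problem, and the longest-processing-time (LPT) greedy baseline are illustrative choices, not one of the nine HeuriGym problems or its actual problem format. The sketch separates a feasibility check from a scalar objective and shows how quickly the solution space grows (2^n assignments for n jobs).

```python
from dataclasses import dataclass
from typing import Callable, List

Jobs = List[int]        # processing time of each job
Assignment = List[int]  # machine index (0 or 1) chosen for each job


@dataclass(frozen=True)
class ProblemSpec:
    """Hypothetical container for one benchmark problem."""
    name: str
    verify: Callable[[Jobs, Assignment], bool]
    objective: Callable[[Jobs, Assignment], int]  # lower is better


def verify_two_machine(jobs: Jobs, assign: Assignment) -> bool:
    # Feasible iff every job is placed on exactly one of the two machines.
    return len(assign) == len(jobs) and all(m in (0, 1) for m in assign)


def makespan(jobs: Jobs, assign: Assignment) -> int:
    loads = [0, 0]
    for duration, machine in zip(jobs, assign):
        loads[machine] += duration
    return max(loads)


def lpt_baseline(jobs: Jobs) -> Assignment:
    # Longest-processing-time-first greedy: a classic hand-written heuristic.
    assign = [0] * len(jobs)
    loads = [0, 0]
    for j in sorted(range(len(jobs)), key=lambda k: jobs[k], reverse=True):
        m = 0 if loads[0] <= loads[1] else 1  # put the job on the less loaded machine
        assign[j] = m
        loads[m] += jobs[j]
    return assign


toy = ProblemSpec("two-machine makespan (toy)", verify_two_machine, makespan)
jobs = [7, 5, 4, 3, 3]
baseline = lpt_baseline(jobs)
assert toy.verify(jobs, baseline)
# Exhaustive search is only possible because 5 jobs give 2**5 = 32 assignments.
optimum = min(
    toy.objective(jobs, [(mask >> i) & 1 for i in range(len(jobs))])
    for mask in range(2 ** len(jobs))
)
print("LPT makespan:", toy.objective(jobs, baseline), "| optimum:", optimum)
```

Here the greedy baseline reaches a makespan of 12 while exhaustive search finds 11, the kind of gap a crafted heuristic would be measured against; at realistic sizes the 2^n assignments rule out exhaustive search entirely.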
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] On the capability of LLMs in combinatorial optimization
[21] A Comprehensive Evaluation of Contemporary ML-Based Solvers for Combinatorial Optimization
[45] Exploring combinatorial problem solving with large language models: A case study on the travelling salesman problem using GPT-3.5 Turbo
Contribution Analysis
Detailed comparisons for each claimed contribution
HeuriGym: An agentic benchmark framework for evaluating LLM-generated heuristics
[1] Large language models as end-to-end combinatorial optimization solvers
[5] ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution
[6] Self-guiding exploration for combinatorial problems
[33] Generalizable Heuristic Generation Through Large Language Models with Meta-Optimization
[70] STRCMP: Integrating Graph Structural Priors with Language Models for Combinatorial Optimization
[71] Learning Improvement Heuristics for Solving Routing Problems
[72] Heuristic Search Value Iteration for POMDPs
[73] Exact and heuristic methods in combinatorial optimization
[74] Search-based LLMs for code optimization
[75] Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Quality-Yield Index (QYI) metric
[51] Dynamic impact for ant colony optimization algorithm
[52] A comprehensive review on multi-objective optimization techniques: Past, present and future
[53] Comparative study of state-of-the-art metaheuristics for solving constrained mechanical design optimization problems: experimental analyses and performance …
[54] Dhole optimization algorithm: A new metaheuristic algorithm for solving optimization problems
[55] Seasons optimization algorithm
[56] A Particle Swarm Optimization-Guided Ivy Algorithm for Global Optimization Problems
[57] Novel performance metrics for robust multi-objective optimization algorithms
[58] Failure risk management: adaptive performance control and mission abort decisions
[59] Probability and certainty in the performance of evolutionary and swarm optimization algorithms
[60] The hybrid Harris hawks optimizer-arithmetic optimization algorithm: a new hybrid algorithm for sizing optimization and design of microgrids
Benchmark suite of nine combinatorial optimization problems