HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Benchmark, Large Language Models, Combinatorial Optimization, Code Generation, Agent, Automatic Heuristic Generation
Abstract:

While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on various problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HeuriGym, an agentic benchmark framework for evaluating LLM-generated heuristics in combinatorial optimization, alongside the Quality-Yield Index metric and a suite of nine benchmark problems. It resides in the Benchmarking and Evaluation leaf of the taxonomy, which contains four papers total. This represents a relatively sparse research direction compared to more crowded areas like Evolutionary and Reflective Heuristic Search (five papers) or Domain-Specific Heuristic Discovery (four papers), suggesting that systematic evaluation infrastructure remains underdeveloped despite rapid growth in heuristic generation methods.

The taxonomy reveals that while numerous branches focus on algorithmic approaches—evolutionary frameworks, tree search methods, direct solution generation—the Benchmarking and Evaluation category addresses a distinct need for rigorous assessment tools. Neighboring leaves like Iterative Optimization and Prompting (two papers) and Hyper-Heuristic and Instance-Specific Methods (three papers) develop complementary techniques but lack standardized evaluation protocols. The scope note for Benchmarking and Evaluation explicitly excludes methods proposing new algorithms, positioning HeuriGym as infrastructure rather than a novel optimization technique, which differentiates it from the majority of the fifty-paper taxonomy focused on algorithmic innovation.

Among the thirty candidates examined, none clearly refutes any of the three contributions. Each contribution (the HeuriGym framework, the Quality-Yield Index metric, and the benchmark suite) was compared against ten candidates, with zero refutable overlaps. This suggests that, within the limited search scope, no prior work provides a directly comparable agentic evaluation framework combining iterative refinement, code-execution feedback, and the specific QYI metric. However, the sibling papers in Benchmarking and Evaluation, including comprehensive evaluation studies and capability assessments, likely pursue overlapping evaluation goals, though the analysis does not indicate that they provide identical infrastructure or metrics.

Based on the top-thirty semantic matches examined, the work appears to occupy a distinct position within evaluation methodology for LLM-based optimization. The absence of refutable candidates across all contributions, combined with the sparse Benchmarking and Evaluation category, suggests the specific combination of agentic framework design and the QYI metric may be novel. However, this assessment is constrained by the limited search scope and does not account for potential overlap with evaluation frameworks outside the examined candidates or in adjacent fields beyond combinatorial optimization.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: LLM-generated heuristics for combinatorial optimization problems. The field has rapidly diversified into several major branches that reflect different strategies for leveraging large language models in optimization. LLM-Based Heuristic Generation Frameworks focus on using LLMs to produce or evolve heuristic code, often through iterative refinement or evolutionary search, as seen in works like ReEvo[5] and Evolution of Heuristics[28]. Direct Solution Generation and End-to-End Solving explores LLMs that attempt to solve problems in one shot or through autoregressive generation, exemplified by approaches such as one-shot autoregressive generation[23]. Hybrid and Augmented Approaches combine LLMs with traditional solvers or neural methods, while Iterative Optimization and Prompting emphasizes multi-step reasoning and prompt engineering to guide search. Hyper-Heuristic and Instance-Specific Methods, including works like Llm-driven instance-specific heuristic generation[40], tailor heuristics to particular problem instances. Meanwhile, Benchmarking and Evaluation provides critical infrastructure for assessing these diverse techniques, and Application-Specific Implementations demonstrate deployment in domains ranging from scheduling to routing.

A particularly active line of work centers on evolutionary and reflective frameworks that iteratively refine heuristics, contrasting with more direct generation methods that rely on single-pass prompting. Multi-objective approaches such as Pareto-Grid-Guided Large Language Models[36] and Multi-objective evolution of heuristic[30] address trade-offs between solution quality and computational cost, while reduction-based methods explore structured reasoning to decompose complex problems.

HeuriGym[0] sits within the Benchmarking and Evaluation branch, providing a standardized environment for comparing LLM-driven heuristic generation methods. Its emphasis on systematic evaluation complements nearby works like A Comprehensive Evaluation of[21] and Exploring combinatorial problem solving[45], which also focus on rigorous assessment of LLM capabilities. By offering a unified testbed, HeuriGym[0] addresses the open question of how to fairly compare the growing variety of LLM-based optimization techniques across different problem classes and instance characteristics.

Claimed Contributions

HeuriGym: An agentic benchmark framework for evaluating LLM-generated heuristics

The authors propose HeuriGym, an end-to-end agentic framework that enables LLMs to generate heuristic algorithms for combinatorial optimization problems, receive execution feedback, and iteratively refine solutions. The framework includes automated verification, quantitative evaluation, and supports realistic programming tasks across multiple domains.

10 retrieved papers
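The propose-execute-refine cycle described above can be sketched as a minimal loop. Everything here is illustrative: `Result`, `refine_loop`, and the toy propose/execute callables are hypothetical stand-ins for this summary, not HeuriGym's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    feasible: bool  # did the heuristic pass automated verification?
    cost: float     # objective value of the produced solution
    log: str        # execution output fed back to the model


def refine_loop(propose: Callable[[Optional[str]], str],
                execute: Callable[[str], Result],
                max_rounds: int = 5) -> Optional[Result]:
    """Iterate: the LLM proposes heuristic code, the harness executes and
    verifies it, and the execution log is fed back for refinement."""
    feedback: Optional[str] = None
    best: Optional[Result] = None
    for _ in range(max_rounds):
        code = propose(feedback)   # LLM drafts or revises a heuristic
        result = execute(code)     # sandboxed run + feasibility check
        if result.feasible and (best is None or result.cost < best.cost):
            best = result
        feedback = result.log      # evaluative feedback for the next round
    return best


# Toy stand-ins: a scripted "LLM" whose runs improve across rounds.
runs = iter([Result(False, float("inf"), "infeasible"),
             Result(True, 10.0, "ok"),
             Result(True, 7.0, "ok")])
best = refine_loop(lambda fb: "heuristic_code", lambda code: next(runs),
                   max_rounds=3)
print(best.cost)  # prints 7.0
```

The loop keeps the best feasible solution seen so far, so a late infeasible round cannot discard earlier progress.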
Quality-Yield Index (QYI) metric

The authors introduce QYI as a unified metric that combines solution quality (relative to expert baselines) and yield (success rate) using a harmonic mean formulation. This metric addresses limitations of traditional PASS@k metrics by capturing both feasibility and solution quality in multi-round agentic settings.

10 retrieved papers
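Given the stated harmonic-mean formulation, QYI can be sketched as below. The function name and the exact normalization of quality and yield to the unit interval are assumptions for illustration, not the paper's reference implementation.

```python
def qyi(quality: float, yield_rate: float) -> float:
    """Quality-Yield Index: harmonic mean of quality and yield.

    quality: solution quality relative to the expert baseline (0..1,
             where 1 matches the expert)
    yield_rate: fraction of generated heuristics that produce feasible
                solutions (0..1)
    """
    if quality + yield_rate == 0:
        return 0.0
    return 2 * quality * yield_rate / (quality + yield_rate)


# Illustrative inputs: expert-relative quality 0.75 with 50% yield.
print(round(qyi(0.75, 0.5), 3))  # prints 0.6
```

Because the harmonic mean is dominated by the smaller operand, a model cannot score well by producing many infeasible solutions of high nominal quality, or feasible but poor ones.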
Benchmark suite of nine combinatorial optimization problems

The authors construct a benchmark of nine carefully selected combinatorial optimization problems from domains including computer systems, logistics, and biology. These problems feature well-defined objectives, large solution spaces, and are designed to resist memorization while requiring genuine algorithmic reasoning and problem-specific heuristic design.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: HeuriGym agentic benchmark framework (described above). Ten candidate papers retrieved; no refutable overlap found.

Contribution: Quality-Yield Index (QYI) metric (described above). Ten candidate papers retrieved; no refutable overlap found.

Contribution: Benchmark suite of nine combinatorial optimization problems (described above). Ten candidate papers retrieved; no refutable overlap found.