HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Benchmark, Large Language Models, Combinatorial Optimization, Code Generation, Agent, Automatic Heuristic Generation
Abstract:

While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on various problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work, and the current automated pipeline does not reliably align or distinguish these cases; human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HeuriGym, an agentic benchmark framework for evaluating LLM-generated heuristics in combinatorial optimization, alongside the Quality-Yield Index metric and a suite of nine benchmark problems. It resides in the Benchmarking and Evaluation leaf of the taxonomy, which contains four papers total. This represents a relatively sparse research direction compared to more crowded areas like Evolutionary and Reflective Heuristic Search (five papers) or Domain-Specific Heuristic Discovery (four papers), suggesting that systematic evaluation infrastructure remains underdeveloped despite rapid growth in heuristic generation methods.

The taxonomy reveals that while numerous branches focus on algorithmic approaches—evolutionary frameworks, tree search methods, direct solution generation—the Benchmarking and Evaluation category addresses a distinct need for rigorous assessment tools. Neighboring leaves like Iterative Optimization and Prompting (two papers) and Hyper-Heuristic and Instance-Specific Methods (three papers) develop complementary techniques but lack standardized evaluation protocols. The scope note for Benchmarking and Evaluation explicitly excludes methods proposing new algorithms, positioning HeuriGym as infrastructure rather than a novel optimization technique, which differentiates it from the majority of the fifty-paper taxonomy focused on algorithmic innovation.

Among the thirty candidates examined, none clearly refutes any of the three contributions. Each contribution (the HeuriGym framework, the Quality-Yield Index metric, and the benchmark suite) was compared against ten candidates, with zero refutable overlaps. This suggests that, within the limited search scope, no prior work provides a directly comparable agentic evaluation framework combining iterative refinement, code-execution feedback, and the specific QYI metric. However, the sibling papers in Benchmarking and Evaluation, including comprehensive evaluation studies and capability assessments, likely pursue overlapping evaluation goals, though the analysis does not indicate that they provide identical infrastructure or metrics.

Based on the top-thirty semantic matches examined, the work appears to occupy a distinct position within evaluation methodology for LLM-based optimization. The absence of refutable candidates across all contributions, combined with the sparse Benchmarking and Evaluation category, suggests the specific combination of agentic framework design and the QYI metric may be novel. However, this assessment is constrained by the limited search scope and does not account for potential overlap with evaluation frameworks outside the examined candidates or in adjacent fields beyond combinatorial optimization.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: LLM-generated heuristics for combinatorial optimization problems. The field has rapidly diversified into several major branches that reflect different strategies for leveraging large language models in optimization. LLM-Based Heuristic Generation Frameworks focus on using LLMs to produce or evolve heuristic code, often through iterative refinement or evolutionary search, as seen in works like ReEvo[5] and Evolution of Heuristics[28]. Direct Solution Generation and End-to-End Solving explores LLMs that attempt to solve problems in one shot or through autoregressive generation, exemplified by approaches such as one-shot autoregressive generation[23]. Hybrid and Augmented Approaches combine LLMs with traditional solvers or neural methods, while Iterative Optimization and Prompting emphasizes multi-step reasoning and prompt engineering to guide search. Hyper-Heuristic and Instance-Specific Methods, including works like Llm-driven instance-specific heuristic generation[40], tailor heuristics to particular problem instances. Meanwhile, Benchmarking and Evaluation provides critical infrastructure for assessing these diverse techniques, and Application-Specific Implementations demonstrate deployment in domains ranging from scheduling to routing.

A particularly active line of work centers on evolutionary and reflective frameworks that iteratively refine heuristics, contrasting with more direct generation methods that rely on single-pass prompting. Multi-objective approaches such as Pareto-Grid-Guided Large Language Models[36] and Multi-objective evolution of heuristic[30] address trade-offs between solution quality and computational cost, while reduction-based methods explore structured reasoning to decompose complex problems.

HeuriGym[0] sits within the Benchmarking and Evaluation branch, providing a standardized environment for comparing LLM-driven heuristic generation methods. Its emphasis on systematic evaluation complements nearby works like A Comprehensive Evaluation of[21] and Exploring combinatorial problem solving[45], which also focus on rigorous assessment of LLM capabilities. By offering a unified testbed, HeuriGym[0] addresses the open question of how to fairly compare the growing variety of LLM-based optimization techniques across different problem classes and instance characteristics.

Claimed Contributions

HeuriGym: An agentic benchmark framework for evaluating LLM-generated heuristics

The authors propose HeuriGym, an end-to-end agentic framework that enables LLMs to generate heuristic algorithms for combinatorial optimization problems, receive execution feedback, and iteratively refine solutions. The framework includes automated verification, quantitative evaluation, and supports realistic programming tasks across multiple domains.

10 retrieved papers
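The propose-execute-refine cycle described above can be sketched as a minimal loop. Everything here is illustrative: `Result`, `refine_loop`, and the toy propose/execute callables are hypothetical stand-ins for this summary, not HeuriGym's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    feasible: bool  # did the heuristic pass automated verification?
    cost: float     # objective value of the produced solution
    log: str        # execution output fed back to the model


def refine_loop(propose: Callable[[Optional[str]], str],
                execute: Callable[[str], Result],
                max_rounds: int = 5) -> Optional[Result]:
    """Iterate: the LLM proposes heuristic code, the harness executes and
    verifies it, and the execution log is fed back for refinement."""
    feedback: Optional[str] = None
    best: Optional[Result] = None
    for _ in range(max_rounds):
        code = propose(feedback)   # LLM drafts or revises a heuristic
        result = execute(code)     # sandboxed run + feasibility check
        if result.feasible and (best is None or result.cost < best.cost):
            best = result
        feedback = result.log      # evaluative feedback for the next round
    return best


# Toy stand-ins: a scripted "LLM" whose runs improve across rounds.
runs = iter([Result(False, float("inf"), "infeasible"),
             Result(True, 10.0, "ok"),
             Result(True, 7.0, "ok")])
best = refine_loop(lambda fb: "heuristic_code", lambda code: next(runs),
                   max_rounds=3)
print(best.cost)  # prints 7.0
```

The loop keeps the best feasible solution seen so far, so a late infeasible round cannot discard earlier progress.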
Quality-Yield Index (QYI) metric

The authors introduce QYI as a unified metric that combines solution quality (relative to expert baselines) and yield (success rate) using a harmonic mean formulation. This metric addresses limitations of traditional PASS@k metrics by capturing both feasibility and solution quality in multi-round agentic settings.

10 retrieved papers
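Given the stated harmonic-mean formulation, QYI can be sketched as below. The function name and the exact normalization of quality and yield to the unit interval are assumptions for illustration, not the paper's reference implementation.

```python
def qyi(quality: float, yield_rate: float) -> float:
    """Quality-Yield Index: harmonic mean of quality and yield.

    quality: solution quality relative to the expert baseline (0..1,
             where 1 matches the expert)
    yield_rate: fraction of generated heuristics that produce feasible
                solutions (0..1)
    """
    if quality + yield_rate == 0:
        return 0.0
    return 2 * quality * yield_rate / (quality + yield_rate)


# Illustrative inputs: expert-relative quality 0.75 with 50% yield.
print(round(qyi(0.75, 0.5), 3))  # prints 0.6
```

Because the harmonic mean is dominated by the smaller operand, a model cannot score well by producing many infeasible solutions of high nominal quality, or feasible but poor ones.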
Benchmark suite of nine combinatorial optimization problems

The authors construct a benchmark of nine carefully selected combinatorial optimization problems from domains including computer systems, logistics, and biology. These problems feature well-defined objectives, large solution spaces, and are designed to resist memorization while requiring genuine algorithmic reasoning and problem-specific heuristic design.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: HeuriGym agentic benchmark framework (described above). Ten candidate papers retrieved; no refutable overlap found.

Contribution: Quality-Yield Index (QYI) metric (described above). Ten candidate papers retrieved; no refutable overlap found.

Contribution: Benchmark suite of nine combinatorial optimization problems (described above). Ten candidate papers retrieved; no refutable overlap found.