GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: prompt optimization, natural language, reflection, large language models, agent design, agent discovery, code optimization, compound AI systems, genetic, language-based learning, evolutionary algorithms
Abstract:

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +10% accuracy on AIME-2025).

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GEPA, a prompt optimizer that combines genetic algorithms with natural language reflection to refine prompts for compound AI systems. According to the taxonomy, GEPA occupies the 'Genetic-Reflective Hybrid Optimization' leaf under 'Evolutionary and Multi-Objective Prompt Optimization'. This leaf currently contains only the original paper itself, with no sibling papers identified. The broader evolutionary branch includes one other leaf ('Evolutionary Multi-Objective Instruction Generation' with one paper), suggesting this hybrid genetic-reflective direction is relatively sparse compared to other areas of the field.

The taxonomy reveals several neighboring research directions. The closest conceptual relatives appear in 'Self-Reflective and Iterative Learning Systems' (three papers across meta-introspection, agentic context engineering, and composite learning units) and 'Task-Adaptive and Feedback-Driven Prompt Frameworks' (three papers covering critique-synthesis optimization, constraint-driven refinement, and dynamic prompting). The taxonomy explicitly distinguishes GEPA's approach from pure evolutionary methods (which lack reflection) and from pure reflection-only systems (which lack genetic mechanisms). This positioning suggests GEPA bridges two established paradigms—evolutionary search and self-reflective learning—that have previously been explored separately in the literature.

Among the three contributions analyzed, the core GEPA system examined ten candidates with none appearing to refute it, while the Pareto-based selection strategy examined three candidates with similar results. However, the reflective prompt mutation mechanism examined three candidates and found one that appears to provide overlapping prior work. This suggests that while GEPA's overall architecture may be distinctive, the use of natural language feedback for prompt refinement has precedent in the limited set of sixteen total candidates examined. The analysis explicitly notes this is based on top-K semantic search plus citation expansion, not an exhaustive literature review.

Given the limited search scope, GEPA appears to occupy a relatively novel position by explicitly hybridizing genetic algorithms with reflection-based prompt evolution. The sparse population of its taxonomy leaf and the absence of sibling papers suggest this specific combination is underexplored. However, the presence of overlapping work on reflective mutation indicates that individual components draw on established techniques. The assessment is constrained by examining only sixteen candidates total, leaving open the possibility of additional relevant work beyond the top semantic matches.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 1

Research Landscape Overview

Core task: optimizing prompts for compound AI systems using natural language reflection.

The field encompasses a diverse set of approaches for improving how language models are instructed and composed. At the highest level, the taxonomy reveals several major branches: evolutionary and multi-objective methods that treat prompt design as a search problem; differentiation-based techniques that borrow gradient-inspired ideas from classical optimization; self-reflective and iterative learning systems that enable models to critique and refine their own outputs; task-adaptive and feedback-driven frameworks that adjust prompts in response to performance signals; reactive and cross-model steering methods that dynamically guide generation; scaffolded and multi-step architectures that decompose complex reasoning into structured stages; and domain-specific applications that tailor language-guided optimization to specialized fields. Works such as Textgrad[1] and Trace AutoDiff[8] illustrate gradient-inspired approaches, while DSPy Assertions[9] and Scaffolded Language Models[13] exemplify structured reasoning pipelines; domain applications like LLM Reticular Chemistry[2] and Text to Robotic Assembly[15] demonstrate how these techniques extend beyond general-purpose tasks.

A particularly active line of inquiry explores how evolutionary search can be combined with reflective feedback to balance exploration and exploitation in prompt space. GEPA[0] sits within this genetic-reflective hybrid cluster, blending population-based mutation with natural language critique to iteratively refine prompts. This contrasts with purely evolutionary methods like PromptWizard[6] and InstOptima[5], which rely more heavily on discrete variation operators, and with purely reflective systems such as Reflectevo[4] and Agentic Context Engineering[3], which emphasize iterative self-improvement without explicit genetic mechanisms. The hybrid approach aims to leverage the broad search coverage of evolutionary algorithms while incorporating the semantic guidance that reflection provides, addressing a central trade-off in the field: how to efficiently navigate vast prompt spaces without sacrificing the nuanced understanding that language models can bring to their own optimization.

Claimed Contributions

GEPA (Genetic-Pareto) prompt optimizer

GEPA is a sample-efficient prompt optimization method for compound AI systems that combines reflective prompt evolution with Pareto-based candidate selection. It samples trajectories, reflects on them in natural language to diagnose problems, proposes prompt updates, and combines lessons from the Pareto frontier of attempts.

Retrieved papers: 10
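The loop described above (sample trajectories, reflect, propose an update, select from the Pareto frontier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evaluate` (returning per-task scores) and `reflect_and_mutate` (an LLM-backed mutation step) are hypothetical callables supplied by the caller, and the accept-child-only-if-it-improves rule is an assumption.

```python
import random


def gepa_loop(seed_prompt, evaluate, reflect_and_mutate, budget, rng=random):
    """Sketch of a GEPA-style outer loop over prompt candidates.

    Maintains a pool of candidates with per-task scores, picks a parent
    from the Pareto frontier (candidates best on at least one task),
    mutates it via reflection, and keeps the child if its aggregate
    score improves on the parent's.
    """
    pool = {seed_prompt: evaluate(seed_prompt)}
    for _ in range(budget):
        # Pareto frontier: every candidate that is best on some task.
        n_tasks = len(next(iter(pool.values())))
        frontier = set()
        for t in range(n_tasks):
            best = max(s[t] for s in pool.values())
            frontier |= {p for p, s in pool.items() if s[t] == best}
        parent = rng.choice(sorted(frontier))
        child = reflect_and_mutate(parent)
        if child not in pool:
            child_scores = evaluate(child)
            # Assumed acceptance rule: keep only improving children.
            if sum(child_scores) > sum(pool[parent]):
                pool[child] = child_scores
    # Return the candidate with the best aggregate score.
    return max(pool, key=lambda p: sum(pool[p]))
```

With deterministic stand-ins for the evaluator and mutator, the loop accepts an improving child and then stops accepting once no mutation beats the current frontier.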
Reflective prompt mutation using natural language feedback

The method leverages execution and evaluation traces as diagnostic signals, using LLMs to perform reflective credit assignment and propose targeted prompt updates based on natural language feedback rather than scalar rewards alone.

Retrieved papers: 3
Status: Can Refute
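A minimal sketch of what such reflective mutation might look like, assuming a hypothetical text-to-text `llm` callable and a simple trace format with `input`, `output`, and `feedback` fields; the actual meta-prompt GEPA uses is not specified in this report.

```python
def reflective_mutation(prompt, traces, llm):
    """Sketch of reflective prompt mutation.

    Packs execution traces and natural-language evaluator feedback into
    a meta-prompt and asks an LLM (the `llm` callable, hypothetical
    here) to diagnose failures and propose a revised instruction.
    """
    rendered = "\n\n".join(
        f"Input: {t['input']}\nOutput: {t['output']}\nFeedback: {t['feedback']}"
        for t in traces
    )
    meta_prompt = (
        "You are improving an instruction for an LLM module.\n"
        f"Current instruction:\n{prompt}\n\n"
        f"Rollouts and evaluator feedback:\n{rendered}\n\n"
        "Diagnose the failures and write an improved instruction."
    )
    return llm(meta_prompt).strip()
```

The key contrast with scalar-reward methods is visible in the meta-prompt: the evaluator's feedback reaches the optimizer as text, not as a single number.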
Pareto-based candidate selection strategy

GEPA employs a Pareto-based illumination strategy that maintains candidates achieving the best score on at least one task, stochastically sampling from this frontier to balance exploration and exploitation, avoiding local optima that trap greedy selection methods.

Retrieved papers: 3
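The selection rule described above can be sketched as follows. The frontier criterion (best score on at least one task) follows the description; the win-count-proportional sampling weights are an assumption for illustration, since the report only states that sampling from the frontier is stochastic.

```python
import random


def pareto_sample(scores, rng=random):
    """Sketch of Pareto-based candidate selection.

    `scores` maps each candidate to its list of per-task scores. Keeps
    candidates that achieve the best score on at least one task, then
    samples a parent with probability proportional to how many tasks it
    wins (an assumed weighting).
    """
    wins = {c: 0 for c in scores}
    n_tasks = len(next(iter(scores.values())))
    for t in range(n_tasks):
        best = max(s[t] for s in scores.values())
        for c, s in scores.items():
            if s[t] == best:
                wins[c] += 1
    frontier = [c for c, w in wins.items() if w > 0]
    weights = [wins[c] for c in frontier]
    return rng.choices(frontier, weights=weights, k=1)[0]
```

A candidate that is dominated everywhere (best on no task) is never sampled, which is what distinguishes this frontier strategy from greedy selection of the single best aggregate scorer.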

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: GEPA (Genetic-Pareto) prompt optimizer

GEPA is a sample-efficient prompt optimization method for compound AI systems that combines reflective prompt evolution with Pareto-based candidate selection. It samples trajectories, reflects on them in natural language to diagnose problems, proposes prompt updates, and combines lessons from the Pareto frontier of attempts.

Contribution: Reflective prompt mutation using natural language feedback

The method leverages execution and evaluation traces as diagnostic signals, using LLMs to perform reflective credit assignment and propose targeted prompt updates based on natural language feedback rather than scalar rewards alone.

Contribution: Pareto-based candidate selection strategy

GEPA employs a Pareto-based illumination strategy that maintains candidates achieving the best score on at least one task, stochastically sampling from this frontier to balance exploration and exploitation, avoiding local optima that trap greedy selection methods.