GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: prompt optimization, natural language, reflection, large language models, agent design, agent discovery, code optimization, compound AI systems, genetic, language-based learning, evolutionary algorithms
Abstract:

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +10% accuracy on AIME-2025).

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GEPA, a prompt optimizer that combines genetic algorithms with natural language reflection to refine prompts for compound AI systems. According to the taxonomy, GEPA occupies the 'Genetic-Reflective Hybrid Optimization' leaf under 'Evolutionary and Multi-Objective Prompt Optimization'. This leaf currently contains only the original paper itself, with no sibling papers identified. The broader evolutionary branch includes one other leaf ('Evolutionary Multi-Objective Instruction Generation' with one paper), suggesting this hybrid genetic-reflective direction is relatively sparse compared to other areas of the field.

The taxonomy reveals several neighboring research directions. The closest conceptual relatives appear in 'Self-Reflective and Iterative Learning Systems' (three papers across meta-introspection, agentic context engineering, and composite learning units) and 'Task-Adaptive and Feedback-Driven Prompt Frameworks' (three papers covering critique-synthesis optimization, constraint-driven refinement, and dynamic prompting). The taxonomy explicitly distinguishes GEPA's approach from pure evolutionary methods (which lack reflection) and from pure reflection-only systems (which lack genetic mechanisms). This positioning suggests GEPA bridges two established paradigms—evolutionary search and self-reflective learning—that have previously been explored separately in the literature.

Among the three contributions analyzed, the core GEPA system examined ten candidates with none appearing to refute it, while the Pareto-based selection strategy examined three candidates with similar results. However, the reflective prompt mutation mechanism examined three candidates and found one that appears to provide overlapping prior work. This suggests that while GEPA's overall architecture may be distinctive, the use of natural language feedback for prompt refinement has precedent in the limited set of sixteen total candidates examined. The analysis explicitly notes this is based on top-K semantic search plus citation expansion, not an exhaustive literature review.

Given the limited search scope, GEPA appears to occupy a relatively novel position by explicitly hybridizing genetic algorithms with reflection-based prompt evolution. The sparse population of its taxonomy leaf and the absence of sibling papers suggest this specific combination is underexplored. However, the presence of overlapping work on reflective mutation indicates that individual components draw on established techniques. The assessment is constrained by examining only sixteen candidates total, leaving open the possibility of additional relevant work beyond the top semantic matches.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 1

Research Landscape Overview

Core task: optimizing prompts for compound AI systems using natural language reflection.

The field encompasses a diverse set of approaches for improving how language models are instructed and composed. At the highest level, the taxonomy reveals several major branches: evolutionary and multi-objective methods that treat prompt design as a search problem; differentiation-based techniques that borrow gradient-inspired ideas from classical optimization; self-reflective and iterative learning systems that enable models to critique and refine their own outputs; task-adaptive and feedback-driven frameworks that adjust prompts in response to performance signals; reactive and cross-model steering methods that dynamically guide generation; scaffolded and multi-step architectures that decompose complex reasoning into structured stages; and domain-specific applications that tailor language-guided optimization to specialized fields. Works such as Textgrad[1] and Trace AutoDiff[8] illustrate gradient-inspired approaches, while DSPy Assertions[9] and Scaffolded Language Models[13] exemplify structured reasoning pipelines; domain applications like LLM Reticular Chemistry[2] and Text to Robotic Assembly[15] demonstrate how these techniques extend beyond general-purpose tasks.

A particularly active line of inquiry explores how evolutionary search can be combined with reflective feedback to balance exploration and exploitation in prompt space. GEPA[0] sits within this genetic-reflective hybrid cluster, blending population-based mutation with natural language critique to iteratively refine prompts. This contrasts with purely evolutionary methods like PromptWizard[6] and InstOptima[5], which rely more heavily on discrete variation operators, and with purely reflective systems such as Reflectevo[4] and Agentic Context Engineering[3], which emphasize iterative self-improvement without explicit genetic mechanisms. The hybrid approach aims to leverage the broad search coverage of evolutionary algorithms while incorporating the semantic guidance that reflection provides, addressing a central trade-off in the field: how to efficiently navigate vast prompt spaces without sacrificing the nuanced understanding that language models can bring to their own optimization.

Claimed Contributions

GEPA (Genetic-Pareto) prompt optimizer

GEPA is a sample-efficient prompt optimization method for compound AI systems that combines reflective prompt evolution with Pareto-based candidate selection. It samples trajectories, reflects on them in natural language to diagnose problems, proposes prompt updates, and combines lessons from the Pareto frontier of attempts.

Retrieved papers: 10
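The loop described above (sample trajectories, reflect, propose an update, select from the Pareto frontier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `evaluate` (returning per-task scores) and `reflect_and_mutate` (an LLM-backed mutation step) are hypothetical callables supplied by the caller, and the accept-child-only-if-it-improves rule is an assumption.

```python
import random


def gepa_loop(seed_prompt, evaluate, reflect_and_mutate, budget, rng=random):
    """Sketch of a GEPA-style outer loop over prompt candidates.

    Maintains a pool of candidates with per-task scores, picks a parent
    from the Pareto frontier (candidates best on at least one task),
    mutates it via reflection, and keeps the child if its aggregate
    score improves on the parent's.
    """
    pool = {seed_prompt: evaluate(seed_prompt)}
    for _ in range(budget):
        # Pareto frontier: every candidate that is best on some task.
        n_tasks = len(next(iter(pool.values())))
        frontier = set()
        for t in range(n_tasks):
            best = max(s[t] for s in pool.values())
            frontier |= {p for p, s in pool.items() if s[t] == best}
        parent = rng.choice(sorted(frontier))
        child = reflect_and_mutate(parent)
        if child not in pool:
            child_scores = evaluate(child)
            # Assumed acceptance rule: keep only improving children.
            if sum(child_scores) > sum(pool[parent]):
                pool[child] = child_scores
    # Return the candidate with the best aggregate score.
    return max(pool, key=lambda p: sum(pool[p]))
```

With deterministic stand-ins for the evaluator and mutator, the loop accepts an improving child and then stops accepting once no mutation beats the current frontier.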
Reflective prompt mutation using natural language feedback

The method leverages execution and evaluation traces as diagnostic signals, using LLMs to perform reflective credit assignment and propose targeted prompt updates based on natural language feedback rather than scalar rewards alone.

Retrieved papers: 3
Status: Can Refute
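A minimal sketch of what such reflective mutation might look like, assuming a hypothetical text-to-text `llm` callable and a simple trace format with `input`, `output`, and `feedback` fields; the actual meta-prompt GEPA uses is not specified in this report.

```python
def reflective_mutation(prompt, traces, llm):
    """Sketch of reflective prompt mutation.

    Packs execution traces and natural-language evaluator feedback into
    a meta-prompt and asks an LLM (the `llm` callable, hypothetical
    here) to diagnose failures and propose a revised instruction.
    """
    rendered = "\n\n".join(
        f"Input: {t['input']}\nOutput: {t['output']}\nFeedback: {t['feedback']}"
        for t in traces
    )
    meta_prompt = (
        "You are improving an instruction for an LLM module.\n"
        f"Current instruction:\n{prompt}\n\n"
        f"Rollouts and evaluator feedback:\n{rendered}\n\n"
        "Diagnose the failures and write an improved instruction."
    )
    return llm(meta_prompt).strip()
```

The key contrast with scalar-reward methods is visible in the meta-prompt: the evaluator's feedback reaches the optimizer as text, not as a single number.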
Pareto-based candidate selection strategy

GEPA employs a Pareto-based illumination strategy that maintains candidates achieving the best score on at least one task, stochastically sampling from this frontier to balance exploration and exploitation, avoiding local optima that trap greedy selection methods.

Retrieved papers: 3
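The selection rule described above can be sketched as follows. The frontier criterion (best score on at least one task) follows the description; the win-count-proportional sampling weights are an assumption for illustration, since the report only states that sampling from the frontier is stochastic.

```python
import random


def pareto_sample(scores, rng=random):
    """Sketch of Pareto-based candidate selection.

    `scores` maps each candidate to its list of per-task scores. Keeps
    candidates that achieve the best score on at least one task, then
    samples a parent with probability proportional to how many tasks it
    wins (an assumed weighting).
    """
    wins = {c: 0 for c in scores}
    n_tasks = len(next(iter(scores.values())))
    for t in range(n_tasks):
        best = max(s[t] for s in scores.values())
        for c, s in scores.items():
            if s[t] == best:
                wins[c] += 1
    frontier = [c for c, w in wins.items() if w > 0]
    weights = [wins[c] for c in frontier]
    return rng.choices(frontier, weights=weights, k=1)[0]
```

A candidate that is dominated everywhere (best on no task) is never sampled, which is what distinguishes this frontier strategy from greedy selection of the single best aggregate scorer.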

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though that signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: GEPA (Genetic-Pareto) prompt optimizer

GEPA is a sample-efficient prompt optimization method for compound AI systems that combines reflective prompt evolution with Pareto-based candidate selection. It samples trajectories, reflects on them in natural language to diagnose problems, proposes prompt updates, and combines lessons from the Pareto frontier of attempts.

Contribution: Reflective prompt mutation using natural language feedback

The method leverages execution and evaluation traces as diagnostic signals, using LLMs to perform reflective credit assignment and propose targeted prompt updates based on natural language feedback rather than scalar rewards alone.

Contribution: Pareto-based candidate selection strategy

GEPA employs a Pareto-based illumination strategy that maintains candidates achieving the best score on at least one task, stochastically sampling from this frontier to balance exploration and exploitation, avoiding local optima that trap greedy selection methods.