MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Self-play, Multi-Agent System, Strategic Games
Abstract:

Developing large language models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains for MASs on reasoning benchmarks. When integrated into leading MASs, our MARS agent achieves significant performance gains of up to 10.0% on AIME, 6.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at https://anonymous.4open.science/r/MARS-LLM.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MARS, an end-to-end reinforcement learning framework for multi-agent reasoning through self-play in cooperative and competitive games. It resides in the 'Multi-Agent Self-Play in Strategic Games' leaf, which contains four papers including the original work. This leaf sits within the broader 'Self-Play Training Frameworks for Multi-Agent Reasoning' branch, indicating a moderately populated research direction focused on zero-shot strategic learning without human supervision. The taxonomy reveals this is an active but not overcrowded area, with sibling works like Multi-agent KTO and MARSHAL exploring similar self-play dynamics.

The taxonomy tree shows neighboring leaves addressing related but distinct challenges: 'Zero-Shot Self-Play with Verifiable Rewards' focuses on outcome-based training in structured environments, while 'Adversarial Language Games for Strategic Learning' emphasizes deception and linguistic strategy. The 'Supervised and Hybrid Training Approaches' branch explores methods blending self-play with human data, contrasting with MARS's purely self-supervised approach. The scope notes clarify that MARS's game-based setting excludes single-agent reasoning tasks and non-game environments, positioning it squarely within strategic multi-agent interaction research rather than general LLM training or domain-specific applications.

Among twenty-four candidates examined, the contribution-level analysis reveals mixed novelty signals. The core MARS framework (ten candidates examined, zero refutations) appears relatively novel within the limited search scope. However, the turn-level advantage estimator and agent-specific normalization (four candidates examined, one refutation) show overlap with prior work, suggesting incremental refinement of existing credit assignment techniques. The generalization claim from games to multi-agent systems (ten candidates examined, zero refutations) appears less explored in the examined literature, though the limited sample size prevents definitive conclusions about its novelty across the broader field.

Based on the top-twenty-four semantic matches examined, MARS appears to offer a coherent integration of self-play mechanisms with multi-agent strategic reasoning, though specific technical components show partial overlap with existing methods. The analysis covers a focused subset of the literature rather than an exhaustive survey, leaving open questions about how MARS compares to work outside the examined candidate pool or in adjacent research communities not captured by the taxonomy structure.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: Multi-agent reasoning of large language models through self-play in strategic games. The field has organized itself around several complementary directions. Self-Play Training Frameworks for Multi-Agent Reasoning explore how agents can improve strategic reasoning by competing or cooperating with themselves, often drawing on game-theoretic principles to refine policy learning. Supervised and Hybrid Training Approaches blend self-play with external supervision or human demonstrations, aiming to balance exploration with guided learning. Domain-Specific Self-Play Applications adapt these ideas to particular settings such as negotiation, dialogue games, or embodied environments. Evaluation Benchmarks and Analysis provide standardized testbeds and metrics to measure strategic competence, while Multi-Agent System Optimization and Meta-Learning investigate higher-level mechanisms for evolving agent populations and tuning training dynamics. Representative works like LLMArena[1] and Absolute Zero[2] illustrate how self-play can be scaled and systematized, whereas efforts such as SPIRAL[6] and Sirius[8] highlight domain-specific instantiations.

Within the self-play training landscape, a handful of works focus on refining multi-agent interactions through iterative policy updates and preference-based optimization. Multi-agent KTO[3] and its enhanced variant[33] exemplify methods that leverage pairwise comparisons to guide agent improvement without requiring explicit reward models. MARSHAL[25] similarly emphasizes structured self-play in strategic games, exploring how agents can learn robust policies through repeated competitive episodes. MARS[0] sits naturally within this cluster, sharing the emphasis on self-play dynamics and strategic reasoning but distinguishing itself by integrating multi-agent interactions more tightly with game-theoretic equilibria. Compared to Multi-agent KTO[3], which prioritizes preference learning, MARS[0] appears to place greater weight on emergent strategic behaviors and equilibrium convergence. This positioning reflects ongoing debates about whether self-play should prioritize sample efficiency, robustness to opponent diversity, or alignment with human strategic intuitions.

Claimed Contributions

MARS framework for multi-agent reasoning through self-play

The authors introduce MARS, a reinforcement learning framework that trains LLMs to develop multi-agent reasoning abilities by playing both cooperative and competitive strategic games against themselves. This framework enables agents to learn generalizable skills that transfer to multi-agent systems beyond the training games.

Retrieved papers: 10
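The core idea of this contribution, a single model playing every seat of a strategic game against itself and collecting per-role trajectories for RL updates, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `MatchingPennies`, `RandomPolicy`, and `self_play_episode` are hypothetical stand-ins for the paper's games and LLM policy.

```python
import random

class MatchingPennies:
    """Toy two-player zero-sum game standing in for the strategic games
    used in the paper (hypothetical, for illustration only)."""
    roles = ("even", "odd")

    def reset(self):
        return {"turn": 0, "moves": {}}

    def current_role(self, state):
        return self.roles[state["turn"] % 2]

    def step(self, state, action):
        role = self.current_role(state)
        moves = dict(state["moves"], **{role: action})
        next_state = {"turn": state["turn"] + 1, "moves": moves}
        if len(moves) < 2:
            return next_state, 0.0, False
        # Terminal reward from the last mover's perspective (a
        # simplification): "even" wins when both picks match.
        even_wins = moves["even"] == moves["odd"]
        reward = 1.0 if (role == "even") == even_wins else -1.0
        return next_state, reward, True

class RandomPolicy:
    """Stand-in for an LLM policy; the same policy plays every role."""
    def act(self, role, state):
        return random.choice(("heads", "tails"))

def self_play_episode(policy, game, max_turns=10):
    """Roll out one self-play episode: one shared policy controls all
    roles, yielding a per-role trajectory for downstream RL updates."""
    trajectories = {role: [] for role in game.roles}
    state = game.reset()
    for _ in range(max_turns):
        role = game.current_role(state)
        action = policy.act(role, state)  # same model, every seat
        next_state, reward, done = game.step(state, action)
        trajectories[role].append((action, reward))
        state = next_state
        if done:
            break
    return trajectories
```

In this sketch both roles' experience comes from the same policy, which is what lets a single trained agent acquire skills for every side of cooperative and competitive interactions.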
Turn-level advantage estimator and agent-specific normalization

The authors propose two technical innovations: a turn-level advantage estimator that computes cumulative returns before normalization for accurate credit assignment across multiple turns, and an agent-specific advantage normalization that partitions trajectories by player role to handle heterogeneous game roles and asymmetric information.

Retrieved papers: 4 (Can Refute)
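As described, the estimator computes each turn's cumulative return-to-go first and only then normalizes, with the normalization statistics partitioned by player role. A minimal sketch under those assumptions (the function name, data layout, and discount parameter are hypothetical, not the authors' code):

```python
import math
from collections import defaultdict

def turn_level_advantages(trajectories, gamma=1.0):
    """Turn-level advantage estimation with agent-specific normalization.

    trajectories: list of episodes; each episode is a list of
    (role, turn_reward) tuples, one entry per turn.
    Returns, per episode, a list of normalized per-turn advantages.
    """
    # 1) Turn-level credit assignment: the cumulative (discounted)
    #    return-to-go is computed per turn *before* any normalization.
    returns = []
    for episode in trajectories:
        g, ep_returns = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, r = episode[t]
            g = r + gamma * g
            ep_returns[t] = g
        returns.append(ep_returns)

    # 2) Agent-specific normalization: partition returns by player role
    #    and standardize within each partition, so heterogeneous roles
    #    with asymmetric rewards do not share statistics.
    by_role = defaultdict(list)
    for episode, ep_returns in zip(trajectories, returns):
        for (role, _), g in zip(episode, ep_returns):
            by_role[role].append(g)
    stats = {}
    for role, vals in by_role.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[role] = (mean, math.sqrt(var) + 1e-8)

    advantages = []
    for episode, ep_returns in zip(trajectories, returns):
        ep_adv = [(g - stats[role][0]) / stats[role][1]
                  for (role, _), g in zip(episode, ep_returns)]
        advantages.append(ep_adv)
    return advantages
```

The ordering matters: normalizing raw per-turn rewards first would destroy the cumulative structure needed for credit assignment, while pooling all roles into one normalizer would bias advantages whenever roles receive systematically different returns.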
Demonstration of generalization from games to multi-agent systems

The authors conduct comprehensive experiments showing that skills learned through self-play in strategic games transfer to improved performance in established multi-agent systems like AutoGen and MAD, achieving gains up to 10.0% on AIME and 6.6% on GPQA-Diamond across reasoning benchmarks.

Retrieved papers: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MARS framework for multi-agent reasoning through self-play
Contribution 2: Turn-level advantage estimator and agent-specific normalization
Contribution 3: Demonstration of generalization from games to multi-agent systems