MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Model, Self-play, Multi-Agent System, Strategic Games
Abstract:

Developing large language models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains for MASs on reasoning benchmarks. When integrated into leading MASs, our MARS agent achieves significant performance gains of up to 10.0% on AIME, 6.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at https://anonymous.4open.science/r/MARS-LLM.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MARS, an end-to-end reinforcement learning framework for multi-agent reasoning through self-play in cooperative and competitive games. It resides in the 'Multi-Agent Self-Play in Strategic Games' leaf, which contains four papers including the original work. This leaf sits within the broader 'Self-Play Training Frameworks for Multi-Agent Reasoning' branch, indicating a moderately populated research direction focused on zero-shot strategic learning without human supervision. The taxonomy reveals this is an active but not overcrowded area, with sibling works like Multi-agent KTO and MARSHAL exploring similar self-play dynamics.

The taxonomy tree shows neighboring leaves addressing related but distinct challenges: 'Zero-Shot Self-Play with Verifiable Rewards' focuses on outcome-based training in structured environments, while 'Adversarial Language Games for Strategic Learning' emphasizes deception and linguistic strategy. The 'Supervised and Hybrid Training Approaches' branch explores methods blending self-play with human data, contrasting with MARS's purely self-supervised approach. The scope notes clarify that MARS's game-based setting excludes single-agent reasoning tasks and non-game environments, positioning it squarely within strategic multi-agent interaction research rather than general LLM training or domain-specific applications.

Among twenty-four candidates examined, the contribution-level analysis reveals mixed novelty signals. The core MARS framework (ten candidates examined, zero refutations) appears relatively novel within the limited search scope. However, the turn-level advantage estimator and agent-specific normalization (four candidates examined, one refutation) show overlap with prior work, suggesting incremental refinement of existing credit assignment techniques. The generalization claim from games to multi-agent systems (ten candidates examined, zero refutations) appears less explored in the examined literature, though the limited sample size prevents definitive conclusions about its novelty across the broader field.

Based on the top-twenty-four semantic matches examined, MARS appears to offer a coherent integration of self-play mechanisms with multi-agent strategic reasoning, though specific technical components show partial overlap with existing methods. The analysis covers a focused subset of the literature rather than an exhaustive survey, leaving open questions about how MARS compares to work outside the examined candidate pool or in adjacent research communities not captured by the taxonomy structure.

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: Multi-agent reasoning of large language models through self-play in strategic games. The field has organized itself around several complementary directions. Self-Play Training Frameworks for Multi-Agent Reasoning explore how agents can improve strategic reasoning by competing or cooperating with themselves, often drawing on game-theoretic principles to refine policy learning. Supervised and Hybrid Training Approaches blend self-play with external supervision or human demonstrations, aiming to balance exploration with guided learning. Domain-Specific Self-Play Applications adapt these ideas to particular settings such as negotiation, dialogue games, or embodied environments. Evaluation Benchmarks and Analysis provide standardized testbeds and metrics to measure strategic competence, while Multi-Agent System Optimization and Meta-Learning investigate higher-level mechanisms for evolving agent populations and tuning training dynamics. Representative works like LLMArena[1] and Absolute Zero[2] illustrate how self-play can be scaled and systematized, whereas efforts such as SPIRAL[6] and Sirius[8] highlight domain-specific instantiations.

Within the self-play training landscape, a handful of works focus on refining multi-agent interactions through iterative policy updates and preference-based optimization. Multi-agent KTO[3] and its enhanced variant[33] exemplify methods that leverage pairwise comparisons to guide agent improvement without requiring explicit reward models. MARSHAL[25] similarly emphasizes structured self-play in strategic games, exploring how agents can learn robust policies through repeated competitive episodes. MARS[0] sits naturally within this cluster, sharing the emphasis on self-play dynamics and strategic reasoning but distinguishing itself by integrating multi-agent interactions more tightly with game-theoretic equilibria. Compared to Multi-agent KTO[3], which prioritizes preference learning, MARS[0] appears to place greater weight on emergent strategic behaviors and equilibrium convergence. This positioning reflects ongoing debates about whether self-play should prioritize sample efficiency, robustness to opponent diversity, or alignment with human strategic intuitions.

Claimed Contributions

MARS framework for multi-agent reasoning through self-play

The authors introduce MARS, a reinforcement learning framework that trains LLMs to develop multi-agent reasoning abilities by playing both cooperative and competitive strategic games against themselves. This framework enables agents to learn generalizable skills that transfer to multi-agent systems beyond the training games.

Retrieved papers: 10
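The core idea of this contribution, a single model playing every seat of a strategic game against itself and collecting per-role trajectories for RL updates, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `MatchingPennies`, `RandomPolicy`, and `self_play_episode` are hypothetical stand-ins for the paper's games and LLM policy.

```python
import random

class MatchingPennies:
    """Toy two-player zero-sum game standing in for the strategic games
    used in the paper (hypothetical, for illustration only)."""
    roles = ("even", "odd")

    def reset(self):
        return {"turn": 0, "moves": {}}

    def current_role(self, state):
        return self.roles[state["turn"] % 2]

    def step(self, state, action):
        role = self.current_role(state)
        moves = dict(state["moves"], **{role: action})
        next_state = {"turn": state["turn"] + 1, "moves": moves}
        if len(moves) < 2:
            return next_state, 0.0, False
        # Terminal reward from the last mover's perspective (a
        # simplification): "even" wins when both picks match.
        even_wins = moves["even"] == moves["odd"]
        reward = 1.0 if (role == "even") == even_wins else -1.0
        return next_state, reward, True

class RandomPolicy:
    """Stand-in for an LLM policy; the same policy plays every role."""
    def act(self, role, state):
        return random.choice(("heads", "tails"))

def self_play_episode(policy, game, max_turns=10):
    """Roll out one self-play episode: one shared policy controls all
    roles, yielding a per-role trajectory for downstream RL updates."""
    trajectories = {role: [] for role in game.roles}
    state = game.reset()
    for _ in range(max_turns):
        role = game.current_role(state)
        action = policy.act(role, state)  # same model, every seat
        next_state, reward, done = game.step(state, action)
        trajectories[role].append((action, reward))
        state = next_state
        if done:
            break
    return trajectories
```

In this sketch both roles' experience comes from the same policy, which is what lets a single trained agent acquire skills for every side of cooperative and competitive interactions.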
Turn-level advantage estimator and agent-specific normalization

The authors propose two technical innovations: a turn-level advantage estimator that computes cumulative returns before normalization for accurate credit assignment across multiple turns, and an agent-specific advantage normalization that partitions trajectories by player role to handle heterogeneous game roles and asymmetric information.

Retrieved papers: 4 (Can Refute)
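As described, the estimator computes each turn's cumulative return-to-go first and only then normalizes, with the normalization statistics partitioned by player role. A minimal sketch under those assumptions (the function name, data layout, and discount parameter are hypothetical, not the authors' code):

```python
import math
from collections import defaultdict

def turn_level_advantages(trajectories, gamma=1.0):
    """Turn-level advantage estimation with agent-specific normalization.

    trajectories: list of episodes; each episode is a list of
    (role, turn_reward) tuples, one entry per turn.
    Returns, per episode, a list of normalized per-turn advantages.
    """
    # 1) Turn-level credit assignment: the cumulative (discounted)
    #    return-to-go is computed per turn *before* any normalization.
    returns = []
    for episode in trajectories:
        g, ep_returns = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, r = episode[t]
            g = r + gamma * g
            ep_returns[t] = g
        returns.append(ep_returns)

    # 2) Agent-specific normalization: partition returns by player role
    #    and standardize within each partition, so heterogeneous roles
    #    with asymmetric rewards do not share statistics.
    by_role = defaultdict(list)
    for episode, ep_returns in zip(trajectories, returns):
        for (role, _), g in zip(episode, ep_returns):
            by_role[role].append(g)
    stats = {}
    for role, vals in by_role.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[role] = (mean, math.sqrt(var) + 1e-8)

    advantages = []
    for episode, ep_returns in zip(trajectories, returns):
        ep_adv = [(g - stats[role][0]) / stats[role][1]
                  for (role, _), g in zip(episode, ep_returns)]
        advantages.append(ep_adv)
    return advantages
```

The ordering matters: normalizing raw per-turn rewards first would destroy the cumulative structure needed for credit assignment, while pooling all roles into one normalizer would bias advantages whenever roles receive systematically different returns.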
Demonstration of generalization from games to multi-agent systems

The authors conduct comprehensive experiments showing that skills learned through self-play in strategic games transfer to improved performance in established multi-agent systems like AutoGen and MAD, achieving gains up to 10.0% on AIME and 6.6% on GPQA-Diamond across reasoning benchmarks.

Retrieved papers: 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MARS framework for multi-agent reasoning through self-play
Contribution 2: Turn-level advantage estimator and agent-specific normalization
Contribution 3: Demonstration of generalization from games to multi-agent systems