MARS: Reinforcing Multi-Agent Reasoning of LLMs through Self-Play in Strategic Games
Overview
Overall Novelty Assessment
The paper introduces MARS, an end-to-end reinforcement learning framework for multi-agent reasoning through self-play in cooperative and competitive games. It resides in the 'Multi-Agent Self-Play in Strategic Games' leaf, which contains four papers including the original work. This leaf sits within the broader 'Self-Play Training Frameworks for Multi-Agent Reasoning' branch, indicating a moderately populated research direction focused on zero-shot strategic learning without human supervision. The taxonomy reveals this is an active but not overcrowded area, with sibling works like Multi-agent KTO and MARSHAL exploring similar self-play dynamics.
The taxonomy tree shows neighboring leaves addressing related but distinct challenges: 'Zero-Shot Self-Play with Verifiable Rewards' focuses on outcome-based training in structured environments, while 'Adversarial Language Games for Strategic Learning' emphasizes deception and linguistic strategy. The 'Supervised and Hybrid Training Approaches' branch explores methods blending self-play with human data, contrasting with MARS's purely self-supervised approach. The scope notes clarify that MARS's game-based setting excludes single-agent reasoning tasks and non-game environments, positioning it squarely within strategic multi-agent interaction research rather than general LLM training or domain-specific applications.
Among twenty-four candidates examined, the contribution-level analysis reveals mixed novelty signals. The core MARS framework (ten candidates examined, zero refutations) appears relatively novel within the limited search scope. However, the turn-level advantage estimator and agent-specific normalization (four candidates examined, one refutation) show overlap with prior work, suggesting incremental refinement of existing credit assignment techniques. The generalization claim from games to multi-agent systems (ten candidates examined, zero refutations) appears less explored in the examined literature, though the limited sample size prevents definitive conclusions about its novelty across the broader field.
Based on the top-twenty-four semantic matches examined, MARS appears to offer a coherent integration of self-play mechanisms with multi-agent strategic reasoning, though specific technical components show partial overlap with existing methods. The analysis covers a focused subset of the literature rather than an exhaustive survey, leaving open questions about how MARS compares to work outside the examined candidate pool or in adjacent research communities not captured by the taxonomy structure.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce MARS, a reinforcement learning framework that trains LLMs to develop multi-agent reasoning abilities by playing both cooperative and competitive strategic games against themselves. This framework enables agents to learn generalizable skills that transfer to multi-agent systems beyond the training games.
The authors propose two technical innovations: a turn-level advantage estimator that computes cumulative returns before normalization for accurate credit assignment across multiple turns, and an agent-specific advantage normalization that partitions trajectories by player role to handle heterogeneous game roles and asymmetric information.
The authors conduct comprehensive experiments showing that skills learned through self-play in strategic games transfer to established multi-agent systems such as AutoGen and MAD, with gains of up to 10.0% on AIME and 6.6% on GPQA-Diamond across reasoning benchmarks.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] Multi-agent KTO: Reinforcing strategic interactions of large language model in language game PDF
[25] MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs PDF
[33] Multi-agent KTO: Enhancing Strategic Interactions of Large Language Model in Language Game PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
MARS framework for multi-agent reasoning through self-play
The authors introduce MARS, a reinforcement learning framework that trains LLMs to develop multi-agent reasoning abilities by playing both cooperative and competitive strategic games against themselves. This framework enables agents to learn generalizable skills that transfer to multi-agent systems beyond the training games.
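To make the self-play setup concrete, the following is a minimal, hypothetical sketch of one rollout in which a single policy controls every seat of a toy two-player game. The `MatchingPennies` class, the game interface, and the `self_play_episode` helper are illustrative assumptions, not the authors' implementation:

```python
import random

class MatchingPennies:
    """Toy competitive game used only to illustrate the loop: player 0 then
    player 1 each pick 0 or 1; player 0 scores +1 on a match, player 1
    scores +1 on a mismatch (zero-sum)."""
    num_players = 2

    def reset(self):
        return []  # state = list of moves made so far

    def done(self, state):
        return len(state) == 2

    def current_player(self, state):
        return len(state)

    def step(self, state, action):
        next_state = state + [action]
        if len(next_state) < 2:          # game not over yet: no payoff
            return next_state, {0: 0.0, 1: 0.0}
        match = next_state[0] == next_state[1]
        return next_state, {0: 1.0 if match else -1.0,
                            1: -1.0 if match else 1.0}

def self_play_episode(policy, game):
    """One self-play rollout: the same policy controls every seat, and a
    per-player trajectory is kept so each role can be credited separately."""
    trajectories = {p: [] for p in range(game.num_players)}
    state = game.reset()
    while not game.done(state):
        player = game.current_player(state)
        action = policy(state, player)
        next_state, rewards = game.step(state, action)
        trajectories[player].append((state, action, rewards[player]))
        state = next_state
    return trajectories

# Example: a uniform-random policy playing against itself.
random_policy = lambda state, player: random.choice([0, 1])
trajectories = self_play_episode(random_policy, MatchingPennies())
```

In an actual training loop, the per-player trajectories would feed an RL update of the shared policy, so the agent improves by exploiting its own weaknesses in both cooperative and competitive roles.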
[8] Sirius: Self-improving multi-agent systems via bootstrapped reasoning PDF
[9] Self-playing Adversarial Language Game Enhances LLM Reasoning PDF
[15] Meta-thinking in llms via multi-agent reinforcement learning: A survey PDF
[22] EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning PDF
[38] Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game PDF
[39] A comprehensive review of multi-agent reinforcement learning in video games PDF
[40] Grandmaster level in StarCraft II using multi-agent reinforcement learning PDF
[41] Douzero: Mastering doudizhu with self-play deep reinforcement learning PDF
[42] Role play: Learning adaptive role-specific strategies in multi-agent interactions PDF
[43] "Other-Play" for zero-shot coordination PDF
Turn-level advantage estimator and agent-specific normalization
The authors propose two technical innovations: a turn-level advantage estimator that computes cumulative returns before normalization for accurate credit assignment across multiple turns, and an agent-specific advantage normalization that partitions trajectories by player role to handle heterogeneous game roles and asymmetric information.
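As a rough illustration of the two estimators described above, the sketch below computes per-turn returns-to-go before any normalization, then z-scores the resulting advantages within each player-role partition. The `turn_level_advantages` helper and its interface are hypothetical assumptions for exposition, not the authors' code:

```python
import numpy as np

def turn_level_advantages(turn_rewards, roles, gamma=1.0):
    """Hypothetical sketch of the two ideas: (1) cumulative returns-to-go
    are computed per turn BEFORE normalization, so credit assignment sees
    the full downstream outcome of each turn; (2) advantages are then
    normalized separately within each player role, so heterogeneous roles
    with asymmetric payoffs are not mixed into one statistic."""
    turn_rewards = np.asarray(turn_rewards, dtype=float)
    # (1) Returns-to-go: discounted sum of future rewards at each turn.
    returns = np.zeros_like(turn_rewards)
    running = 0.0
    for t in reversed(range(len(turn_rewards))):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    # (2) Agent-specific normalization: z-score within each role partition.
    advantages = np.empty_like(returns)
    for role in set(roles):
        idx = [i for i, r in enumerate(roles) if r == role]
        group = returns[idx]
        advantages[idx] = (group - group.mean()) / (group.std() + 1e-8)
    return advantages
```

Normalizing after accumulating returns (rather than normalizing raw per-turn rewards first) preserves the relative magnitude of long-horizon outcomes, which is the credit-assignment property the contribution claims.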
[25] MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs PDF
[35] STAS: Spatial-Temporal Return Decomposition for Multi-agent Reinforcement Learning PDF
[36] STAS: Spatial-Temporal Return Decomposition for Solving Sparse Rewards Problems in Multi-agent Reinforcement Learning PDF
[37] Multi-Agent Reinforcement Learning with Temporally Smoothed Actions PDF
Demonstration of generalization from games to multi-agent systems
The authors conduct comprehensive experiments showing that skills learned through self-play in strategic games transfer to established multi-agent systems such as AutoGen and MAD, with gains of up to 10.0% on AIME and 6.6% on GPQA-Diamond across reasoning benchmarks.