Code World Models for General Game Playing

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, code world models, code generation, information set MCTS, planning, partial observability, two-player games, imperfect information games
Abstract:

Large Language Models' (LLMs') reasoning abilities are increasingly being applied to classical board and card games, but the dominant approach, prompting for direct move generation, has significant drawbacks. It relies on the model's implicit and fragile pattern matching, leading to frequent illegal moves and strategically shallow play. Here we introduce an alternative: we use the LLM to translate natural-language rules and game trajectories into a formal, executable world model represented as Python code. This generated code world model (CWM), comprising functions for state transition, legal-move enumeration, and termination checks, serves as a verifiable simulation engine for high-performance planning algorithms such as Monte Carlo tree search (MCTS). In addition, we prompt the LLM to generate heuristic value functions (to make MCTS more efficient) and inference functions (to estimate hidden states in imperfect-information games). Our method offers three distinct advantages over using the LLM directly as a policy: (1) Verifiability: the generated CWM serves as a formal specification of the game's rules, allowing planners to algorithmically enumerate valid actions and avoid illegal moves, contingent on the correctness of the synthesized model; (2) Strategic depth: we combine the LLM's semantic understanding with the deep search power of classical planners; and (3) Generalization: we direct the LLM to focus on the meta-task of data-to-code translation, enabling it to adapt to new games more easily. We evaluate our agent on 10 games, 4 of which are novel and created for this paper; 5 are fully observed (perfect information) and 5 are partially observed (imperfect information). Our method outperforms or matches Gemini 2.5 Pro in 9 of the 10 games.
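To make the abstract's interface concrete, here is a minimal, hand-written sketch of what a synthesized CWM might look like: functions for state transition, legal-move enumeration, and termination checks, shown for a toy one-pile Nim variant. The game, the function names (`legal_actions`, `step`, `is_terminal`, `winner`), and the `State` layout are illustrative assumptions, not the paper's actual generated code.

```python
# Hypothetical sketch of a CWM: an LLM-generated Python module exposing
# state transition, legal-move enumeration, and termination checks.
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Game state for a toy Nim variant: a single pile of stones."""
    pile: int
    player: int  # 0 or 1, whose turn it is


def legal_actions(state: State) -> list[int]:
    """Enumerate valid moves: take 1-3 stones, never more than remain."""
    return [n for n in (1, 2, 3) if n <= state.pile]


def step(state: State, action: int) -> State:
    """State transition: remove `action` stones and pass the turn."""
    assert action in legal_actions(state), "a planner never plays an illegal move"
    return State(pile=state.pile - action, player=1 - state.player)


def is_terminal(state: State) -> bool:
    return state.pile == 0


def winner(state: State) -> int:
    """Last stone wins, so the player *not* to move at termination won."""
    assert is_terminal(state)
    return 1 - state.player
```

Because every move a planner considers comes from `legal_actions`, illegal moves are ruled out by construction, provided the synthesized model itself is correct.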

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes generating executable Python code from natural-language game rules to serve as a formal world model for planning algorithms like MCTS. It resides in the 'General Game Playing via Code Generation' leaf, which contains only three papers in total: this work and two siblings (Code to Play, Code World MCTS). This is a notably sparse research direction within the broader taxonomy of 39 papers across 36 topics, suggesting that the specific combination of LLM-driven code synthesis for general game playing with verifiable planning is relatively underexplored compared to adjacent areas such as neural world models or direct LLM game generation.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Domain-Specific Code Generation' focuses on specialized domains (3D environments, traffic scenarios) rather than general game playing, while 'Formal Specification Languages for Games' emphasizes declarative DSLs like VGDL rather than LLM-driven Python synthesis. The parallel branch 'Neural World Models' trades code interpretability for learned dynamics, and 'Direct LLM Game Generation' bypasses explicit world model construction entirely. The paper's approach sits at the intersection of symbolic verifiability (via code) and LLM flexibility, distinguishing it from purely neural methods while maintaining broader applicability than domain-specific code generators.

Among the 29 candidates examined across the three contributions, no clearly refuting prior work was identified: 9 candidates for the core contribution (Code World Models for verifiable planning), 10 for inference function synthesis in imperfect information games, and 10 for closed-deck learning under partial observability, with zero refutable matches in each case. This suggests that, within the limited search scope, the specific combination of LLM-generated executable code, MCTS integration, and imperfect-information handling appears relatively novel. However, the small candidate pool and sparse taxonomy leaf mean this assessment reflects top-30 semantic matches rather than exhaustive field coverage.

Based on the limited literature search, the work appears to occupy a sparsely populated niche combining code-based world model synthesis with general game playing. The absence of refuting candidates across all three contributions, coupled with the small taxonomy leaf (3 papers), suggests potential novelty within the examined scope. However, the analysis covers only 29 candidates from semantic search, leaving open the possibility of relevant work outside this retrieval window, particularly in adjacent areas like hybrid symbolic-neural methods or domain-specific planning frameworks.

Taxonomy

Core-task Taxonomy Papers: 39
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: Generating executable world models from natural language game descriptions.

The field divides into several complementary branches. Code-Based World Model Synthesis focuses on translating language into executable programs or domain-specific languages, often leveraging general game playing frameworks and symbolic representations. Neural World Models learn latent dynamics directly from data, trading interpretability for flexibility in complex environments. Direct LLM Game Generation exploits large language models to produce game content or mechanics end-to-end, sometimes bypassing explicit intermediate representations. Benchmarks and Evaluation Frameworks provide standardized testbeds and metrics for comparing these diverse approaches, while Supporting Techniques and Applications encompass auxiliary methods such as procedural generation, constraint-based design, and real-world deployment scenarios. Together, these branches reflect a spectrum from symbolic, verifiable code synthesis to learned, black-box neural dynamics.

Within Code-Based World Model Synthesis, a particularly active line of work explores general game playing via code generation, where systems produce executable game logic from textual descriptions. Code World Models[0] sits squarely in this cluster, emphasizing the generation of interpretable, modular code that can be executed and debugged. Nearby efforts such as Code to Play[15] and Code World MCTS[29] similarly prioritize code as the primary representation, but differ in their search or planning strategies for refining generated programs. In contrast, works like Word to World Models[1] and Gavel[2] blend symbolic and neural components, using language models to guide code synthesis while maintaining some degree of learned flexibility. The main trade-off across these approaches is between the transparency and verifiability of pure code generation and the adaptability of hybrid or fully neural methods. Code World Models[0] leans toward the former, offering a clear executable artifact that domain experts can inspect and modify, distinguishing it from more opaque neural alternatives while sharing the code-centric philosophy of its immediate neighbors.

Claimed Contributions

Code World Models for game playing with verifiable planning

The authors propose using LLMs to synthesize executable Python code representing game rules and dynamics (Code World Models) from textual descriptions and example trajectories. This CWM serves as a verifiable simulation engine for classical planning algorithms like MCTS, enabling algorithmic enumeration of valid actions and avoiding illegal moves.

9 retrieved papers
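A rough sketch of how a synthesized CWM can drive planning: the planner below is a flat Monte Carlo search (a simpler stand-in for the MCTS the paper uses) that evaluates each legal action by random rollouts through the generated transition function. The toy game (one-pile Nim, take 1-3 stones, last stone wins) and all function names are illustrative assumptions, not the authors' implementation.

```python
# Flat Monte Carlo planning over a toy CWM: one-pile Nim, last stone wins.
import random


def legal_actions(pile: int) -> list[int]:
    return [n for n in (1, 2, 3) if n <= pile]


def step(pile: int, action: int) -> int:
    return pile - action


def rollout(pile: int, to_move: int) -> int:
    """Play random legal moves to termination; return the winning player."""
    while pile > 0:
        pile = step(pile, random.choice(legal_actions(pile)))
        to_move = 1 - to_move
    return 1 - to_move  # the player who just moved took the last stone


def plan(pile: int, player: int, sims: int = 200) -> int:
    """Pick the action with the highest Monte Carlo win rate for `player`."""
    best, best_rate = None, -1.0
    for a in legal_actions(pile):
        wins = sum(rollout(step(pile, a), 1 - player) == player
                   for _ in range(sims))
        if wins / sims > best_rate:
            best, best_rate = a, wins / sims
    return best
```

The key point is that search strength comes from the classical planner; the CWM only has to be a correct simulator, which is exactly what the verifiability claim rests on.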
Inference function synthesis for imperfect information games

The authors introduce a novel paradigm where the LLM synthesizes inference functions that act as encoders mapping observations to plausible latent histories, while the CWM acts as a decoder. This enables ISMCTS planning in partially observable games by estimating hidden states from observations.

10 retrieved papers
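The encoder/decoder split described above can be illustrated with a hand-written stand-in for an LLM-synthesized inference function: it maps a public observation to the set of hidden states consistent with it, from which a determinizing planner such as ISMCTS samples one full state per iteration and simulates it with the CWM. The toy 8-card game, the observation dictionary keys, and the function names are all assumptions made for this sketch.

```python
# Inference-function sketch for a toy imperfect-information card game.
import itertools
import random

DECK = list(range(1, 9))  # toy 8-card deck


def infer_hidden_hands(observation: dict) -> list[list[int]]:
    """Enumerate opponent hands consistent with the observation:
    cards we hold or have seen played cannot be in the opponent's hand."""
    visible = set(observation["my_hand"]) | set(observation["played"])
    unseen = [c for c in DECK if c not in visible]
    k = observation["opponent_hand_size"]
    return [list(h) for h in itertools.combinations(unseen, k)]


def sample_determinization(observation: dict, rng=random) -> list[int]:
    """Draw one plausible hidden state; ISMCTS repeats this each iteration."""
    return rng.choice(infer_hidden_hands(observation))
```

Here the inference function plays the encoder role (observation to plausible latent states) and the CWM the decoder role (latent state forward to simulated outcomes).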
Closed deck learning for strictly partial observability

The authors develop a method for learning CWMs in a closed deck scenario where hidden states are never observed, even post-hoc. They construct a regularized autoencoder where the inference function encodes observations to hidden action sequences and the CWM decodes them back, with game rules serving as structural regularizers.

10 retrieved papers
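The closed-deck consistency check described above can be sketched as follows: a candidate hidden action sequence is "decoded" by replaying it through the CWM, and it is accepted only if every step is legal under the rules (the structural regularizer) and the observations it emits match what was actually seen (the reconstruction term). The toy counter game, its parity-only observation, and all names are illustrative assumptions, not the paper's code.

```python
# Closed-deck reconstruction check for a toy partially observed game:
# each hidden action adds 1-3 to a counter, but the public observation
# after each step is only the counter's parity.
def legal_actions(total: int) -> list[int]:
    return [1, 2, 3]


def step(total: int, action: int) -> int:
    return total + action


def observe(total: int) -> int:
    return total % 2  # partial observation: parity only


def reconstruction_ok(hidden_actions: list[int], observations: list[int]) -> bool:
    """Decode: replay hidden actions; require legality and matching obs."""
    total = 0
    for action, obs in zip(hidden_actions, observations):
        if action not in legal_actions(total):  # rules as structural regularizer
            return False
        total = step(total, action)
        if observe(total) != obs:               # reconstruction term
            return False
    return True
```

Even though the true hidden sequence is never revealed, the rules sharply constrain which latent explanations of the observation stream are admissible, which is what makes learning feasible in the closed-deck setting.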

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Code World Models for game playing with verifiable planning

The authors propose using LLMs to synthesize executable Python code representing game rules and dynamics (Code World Models) from textual descriptions and example trajectories. This CWM serves as a verifiable simulation engine for classical planning algorithms like MCTS, enabling algorithmic enumeration of valid actions and avoiding illegal moves.

Contribution

Inference function synthesis for imperfect information games

The authors introduce a novel paradigm where the LLM synthesizes inference functions that act as encoders mapping observations to plausible latent histories, while the CWM acts as a decoder. This enables ISMCTS planning in partially observable games by estimating hidden states from observations.

Contribution

Closed deck learning for strictly partial observability

The authors develop a method for learning CWMs in a closed deck scenario where hidden states are never observed, even post-hoc. They construct a regularized autoencoder where the inference function encodes observations to hidden action sequences and the CWM decodes them back, with game rules serving as structural regularizers.