ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reinforcement Learning, Reward Machines, Foundation Models, Generalization, Exploration
Abstract:

Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs), an automata-based formalism for reward specification, serve as the mechanism for specifying RL objectives and are constructed automatically by FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specification in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural-language specifications; (ii) associate a language embedding with each RM automaton state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ARM-FM, a framework that automatically generates reward machines from natural language specifications using foundation models, then associates language embeddings with automata states for task generalization. It resides in the 'Structured Reward Representation and Automata-Based Methods' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that automata-based reward design with foundation models remains relatively underexplored compared to more crowded branches like LLM-Generated Reward Code or Zero-Shot VLM Reward Models.

The taxonomy reveals that most neighboring work falls into either unstructured LLM-driven code generation (e.g., LLM-Generated Reward Code, Iterative LLM Reward Optimization) or direct VLM scoring approaches (Zero-Shot VLM Reward Models). ARM-FM diverges by imposing formal automata structure on the reward specification, prioritizing compositionality and interpretability over end-to-end code synthesis. The sibling paper in the same leaf (LLM Reward Machine) also uses automata but differs in how foundation models are integrated. Nearby branches like LLM-Based Reward Shaping focus on heuristic extraction rather than formal state machines, highlighting ARM-FM's distinct emphasis on structured temporal logic.

Among twenty-nine candidates examined, none clearly refute any of the three core contributions. The ARM-FM framework itself was assessed against ten candidates with zero refutable overlaps; Language-Aligned Reward Machines (LARMs) against nine candidates with zero refutations; and the policy conditioning method against ten candidates with zero refutations. This suggests that within the limited search scope, the combination of automata-based reward design, language-aligned state embeddings, and zero-shot generalization via policy conditioning appears relatively novel. However, the search examined only top-K semantic matches and citations, not the entire literature.

Given the sparse population of the automata-based methods leaf and the absence of clear prior work among examined candidates, ARM-FM appears to occupy a distinct niche. The analysis is constrained by the limited search scope and does not cover all possible related work in formal methods or symbolic RL. The framework's novelty hinges on its integration of foundation models with structured reward machines, a combination that the examined literature does not extensively address.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: automated reward design in reinforcement learning using foundation models. The field has evolved into several distinct branches that reflect different modalities and design philosophies. Vision-Language Model-Based Reward Specification leverages multimodal understanding to ground rewards in visual and linguistic cues, while Large Language Model-Driven Reward Function Design focuses on generating code-based or symbolic reward functions from natural language descriptions (e.g., Reward Design LM[4], Text2Reward[12]). Structured Reward Representation and Automata-Based Methods emphasize formal specifications and temporal logic to ensure interpretability and compositionality. Video-Based Reward Learning extracts reward signals directly from video demonstrations (e.g., Video Prediction Rewards[3]), and Foundation Model-Enhanced RL Frameworks and Pretraining explore how pretrained representations can bootstrap policy learning. Language Model Reward Alignment and Preference Learning addresses the challenge of aligning model outputs with human preferences, often through techniques like Direct Preference Optimization[28]. Specialized Foundation Model Applications target domain-specific problems, and Surveys and Conceptual Frameworks provide overarching perspectives on integrating foundation models into RL pipelines.

Recent work has highlighted trade-offs between flexibility and interpretability. Many studies in the LLM-driven branch pursue end-to-end code generation for reward shaping (LLM Reward Shaping[5]), offering rapid prototyping but sometimes sacrificing formal guarantees. In contrast, the Structured Reward Representation and Automata-Based Methods branch prioritizes compositional and verifiable reward structures, as seen in LLM Reward Machine[13], which uses automata to encode task semantics.

ARM-FM[0] sits within this latter branch, emphasizing structured representations that combine the expressiveness of foundation models with the rigor of automata-based formalisms. Compared to more exploratory approaches like Self-Refined LLM Reward[8] or Video2Reward[43], ARM-FM[0] focuses on maintaining interpretability and compositionality, aligning closely with works that seek to bridge symbolic reasoning and neural reward design. This positioning reflects an ongoing tension in the field between leveraging the generative power of foundation models and ensuring that learned rewards remain transparent and aligned with human intent.

Claimed Contributions

ARM-FM framework for automated compositional reward design

The authors introduce a framework that automatically generates reward machines from natural language task descriptions using foundation models. This framework enables compositional reward design by decomposing complex tasks into structured automata-based representations that provide dense learning signals for reinforcement learning agents.

10 retrieved papers
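To make the automata-based representation concrete, the following is a minimal sketch of a reward machine as a finite automaton whose transitions emit rewards. The state names, events, and reward values are hypothetical illustrations, not taken from the ARM-FM paper; in ARM-FM the structure below would be produced by a foundation model from a natural-language task description.

```python
from dataclasses import dataclass, field

@dataclass
class RewardMachine:
    """Minimal reward machine: a finite automaton whose transitions emit rewards."""
    initial: str
    terminal: set
    # (state, event) -> (next_state, reward)
    delta: dict = field(default_factory=dict)

    def step(self, state, event):
        # Events with no matching transition self-loop with zero reward
        # (a common reward-machine convention).
        return self.delta.get((state, event), (state, 0.0))

# Hypothetical task: "pick up the key, then open the door".
rm = RewardMachine(
    initial="u0",
    terminal={"u2"},
    delta={
        ("u0", "got_key"):   ("u1", 0.5),  # subtask 1 done: dense intermediate signal
        ("u1", "door_open"): ("u2", 1.0),  # task complete
    },
)

state = rm.initial
state, r1 = rm.step(state, "got_key")
state, r2 = rm.step(state, "door_open")
assert state in rm.terminal and (r1, r2) == (0.5, 1.0)
```

The intermediate reward on reaching `u1` is what the report means by a "dense learning signal": the agent is rewarded for completing each subtask rather than only at the end of the episode.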
Language-Aligned Reward Machines (LARMs)

The authors introduce LARMs, which extend traditional reward machines by equipping each automaton state with natural language instructions and embeddings. This enables the creation of a semantically grounded skill space where policies can share knowledge across related subtasks, facilitating transfer learning and compositional reasoning.

9 retrieved papers
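The LARM idea can be sketched as pairing each automaton state with an instruction string and its embedding. The snippet below is illustrative only: `toy_embed` is a deterministic hash-based stand-in for the pretrained language encoder the paper presumably uses, and the state names and instructions are hypothetical.

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Stand-in for a real sentence encoder (deterministic, hash-based).
    ARM-FM would use a pretrained language model embedding instead."""
    h = hashlib.sha256(text.encode()).digest()
    v = [b / 255.0 for b in h[:dim]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Vectors are unit-norm, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# A language-aligned RM pairs each automaton state with an instruction
# and that instruction's embedding.
larm_states = {
    "u0": "pick up the key",
    "u1": "open the door with the key",
}
larm_embeddings = {u: toy_embed(instr) for u, instr in larm_states.items()}

# The same subtask appearing in a different task maps to the same point in
# embedding space, which is the hook for sharing skills across tasks.
assert cosine(larm_embeddings["u0"], toy_embed("pick up the key")) > 0.999
```

With a real encoder, *related* (not just identical) instructions also land near each other, which is what grounds the semantically structured skill space described above.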
Policy conditioning method for experience reuse and zero-shot generalization

The authors develop a method where RL policies are conditioned on language embeddings of reward machine states rather than treating states as isolated symbols. This approach enables agents to reuse learned skills across different tasks and achieve zero-shot generalization to novel task compositions without additional training.

10 retrieved papers
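The conditioning mechanism can be sketched as a policy that scores actions from the concatenation of the environment observation and the embedding of the current RM state. The linear scorer below is a toy stand-in for the neural policy the paper describes; dimensions and the action space are hypothetical.

```python
import random

class ConditionedPolicy:
    """Toy policy conditioned on the RM-state language embedding.

    A linear scorer over concat(observation, state_embedding); in ARM-FM this
    would be a neural network, but the conditioning interface is the same.
    """
    def __init__(self, obs_dim, emb_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0, 0.1) for _ in range(obs_dim + emb_dim)]
                  for _ in range(n_actions)]

    def act(self, obs, rm_state_embedding):
        x = list(obs) + list(rm_state_embedding)
        scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        return max(range(len(scores)), key=scores.__getitem__)

policy = ConditionedPolicy(obs_dim=4, emb_dim=8, n_actions=3)

# The same weights act under any RM state: swapping in the embedding of a
# novel subtask changes behavior without retraining, which is the mechanism
# behind the claimed zero-shot generalization.
action = policy.act([0.1, 0.2, 0.3, 0.4], [0.0] * 8)
assert action in range(3)
```

Because the RM state enters only through its embedding, a policy trained on one set of subtasks can be queried with the embedding of an unseen subtask, rather than needing a separate policy per discrete automaton state.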

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: ARM-FM framework for automated compositional reward design

Contribution: Language-Aligned Reward Machines (LARMs)

Contribution: Policy conditioning method for experience reuse and zero-shot generalization

(Each contribution is described in full under Claimed Contributions above.)