ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reinforcement Learning, Reward Machines, Foundation Models, Generalization, Exploration
Abstract:

Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs), an automata-based formalism for reward specification, serve as the mechanism for specifying RL objectives and are constructed automatically by FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specification in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural-language specifications; (ii) associate a language embedding with each RM automaton state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ARM-FM, a framework that automatically generates reward machines from natural language specifications using foundation models, then associates language embeddings with automata states for task generalization. It resides in the 'Structured Reward Representation and Automata-Based Methods' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that automata-based reward design with foundation models remains relatively underexplored compared to more crowded branches like LLM-Generated Reward Code or Zero-Shot VLM Reward Models.

The taxonomy reveals that most neighboring work falls into either unstructured LLM-driven code generation (e.g., LLM-Generated Reward Code, Iterative LLM Reward Optimization) or direct VLM scoring approaches (Zero-Shot VLM Reward Models). ARM-FM diverges by imposing formal automata structure on the reward specification, prioritizing compositionality and interpretability over end-to-end code synthesis. The sibling paper in the same leaf (LLM Reward Machine) also uses automata but differs in how foundation models are integrated. Nearby branches like LLM-Based Reward Shaping focus on heuristic extraction rather than formal state machines, highlighting ARM-FM's distinct emphasis on structured temporal logic.

Among twenty-nine candidates examined, none clearly refute any of the three core contributions. The ARM-FM framework itself was assessed against ten candidates with zero refutable overlaps; Language-Aligned Reward Machines (LARMs) against nine candidates with zero refutations; and the policy conditioning method against ten candidates with zero refutations. This suggests that within the limited search scope, the combination of automata-based reward design, language-aligned state embeddings, and zero-shot generalization via policy conditioning appears relatively novel. However, the search examined only top-K semantic matches and citations, not the entire literature.

Given the sparse population of the automata-based methods leaf and the absence of clear prior work among examined candidates, ARM-FM appears to occupy a distinct niche. The analysis is constrained by the limited search scope and does not cover all possible related work in formal methods or symbolic RL. The framework's novelty hinges on its integration of foundation models with structured reward machines, a combination that the examined literature does not extensively address.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: automated reward design in reinforcement learning using foundation models. The field has evolved into several distinct branches that reflect different modalities and design philosophies. Vision-Language Model-Based Reward Specification leverages multimodal understanding to ground rewards in visual and linguistic cues, while Large Language Model-Driven Reward Function Design focuses on generating code-based or symbolic reward functions from natural language descriptions (e.g., Reward Design LM[4], Text2Reward[12]). Structured Reward Representation and Automata-Based Methods emphasize formal specifications and temporal logic to ensure interpretability and compositionality. Video-Based Reward Learning extracts reward signals directly from video demonstrations (e.g., Video Prediction Rewards[3]), and Foundation Model-Enhanced RL Frameworks and Pretraining explore how pretrained representations can bootstrap policy learning. Language Model Reward Alignment and Preference Learning addresses the challenge of aligning model outputs with human preferences, often through techniques like Direct Preference Optimization[28]. Specialized Foundation Model Applications target domain-specific problems, and Surveys and Conceptual Frameworks provide overarching perspectives on integrating foundation models into RL pipelines.

Recent work has highlighted trade-offs between flexibility and interpretability. Many studies in the LLM-driven branch pursue end-to-end code generation for reward shaping (LLM Reward Shaping[5]), offering rapid prototyping but sometimes sacrificing formal guarantees. In contrast, the Structured Reward Representation and Automata-Based Methods branch prioritizes compositional and verifiable reward structures, as seen in LLM Reward Machine[13], which uses automata to encode task semantics.

ARM-FM[0] sits within this latter branch, emphasizing structured representations that combine the expressiveness of foundation models with the rigor of automata-based formalisms. Compared to more exploratory approaches like Self-Refined LLM Reward[8] or Video2Reward[43], ARM-FM[0] focuses on maintaining interpretability and compositionality, aligning closely with works that seek to bridge symbolic reasoning and neural reward design. This positioning reflects an ongoing tension in the field between leveraging the generative power of foundation models and ensuring that learned rewards remain transparent and aligned with human intent.

Claimed Contributions

ARM-FM framework for automated compositional reward design

The authors introduce a framework that automatically generates reward machines from natural language task descriptions using foundation models. This framework enables compositional reward design by decomposing complex tasks into structured automata-based representations that provide dense learning signals for reinforcement learning agents.

10 retrieved papers
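To make the automata-based representation concrete, the following is a minimal sketch of a reward machine as a finite automaton whose transitions emit rewards. The state names, events, and reward values are hypothetical illustrations, not taken from the ARM-FM paper; in ARM-FM the structure below would be produced by a foundation model from a natural-language task description.

```python
from dataclasses import dataclass, field

@dataclass
class RewardMachine:
    """Minimal reward machine: a finite automaton whose transitions emit rewards."""
    initial: str
    terminal: set
    # (state, event) -> (next_state, reward)
    delta: dict = field(default_factory=dict)

    def step(self, state, event):
        # Events with no matching transition self-loop with zero reward
        # (a common reward-machine convention).
        return self.delta.get((state, event), (state, 0.0))

# Hypothetical task: "pick up the key, then open the door".
rm = RewardMachine(
    initial="u0",
    terminal={"u2"},
    delta={
        ("u0", "got_key"):   ("u1", 0.5),  # subtask 1 done: dense intermediate signal
        ("u1", "door_open"): ("u2", 1.0),  # task complete
    },
)

state = rm.initial
state, r1 = rm.step(state, "got_key")
state, r2 = rm.step(state, "door_open")
assert state in rm.terminal and (r1, r2) == (0.5, 1.0)
```

The intermediate reward on reaching `u1` is what the report means by a "dense learning signal": the agent is rewarded for completing each subtask rather than only at the end of the episode.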
Language-Aligned Reward Machines (LARMs)

The authors introduce LARMs, which extend traditional reward machines by equipping each automaton state with natural language instructions and embeddings. This enables the creation of a semantically grounded skill space where policies can share knowledge across related subtasks, facilitating transfer learning and compositional reasoning.

9 retrieved papers
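The LARM idea can be sketched as pairing each automaton state with an instruction string and its embedding. The snippet below is illustrative only: `toy_embed` is a deterministic hash-based stand-in for the pretrained language encoder the paper presumably uses, and the state names and instructions are hypothetical.

```python
import hashlib
import math

def toy_embed(text, dim=8):
    """Stand-in for a real sentence encoder (deterministic, hash-based).
    ARM-FM would use a pretrained language model embedding instead."""
    h = hashlib.sha256(text.encode()).digest()
    v = [b / 255.0 for b in h[:dim]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Vectors are unit-norm, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# A language-aligned RM pairs each automaton state with an instruction
# and that instruction's embedding.
larm_states = {
    "u0": "pick up the key",
    "u1": "open the door with the key",
}
larm_embeddings = {u: toy_embed(instr) for u, instr in larm_states.items()}

# The same subtask appearing in a different task maps to the same point in
# embedding space, which is the hook for sharing skills across tasks.
assert cosine(larm_embeddings["u0"], toy_embed("pick up the key")) > 0.999
```

With a real encoder, *related* (not just identical) instructions also land near each other, which is what grounds the semantically structured skill space described above.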
Policy conditioning method for experience reuse and zero-shot generalization

The authors develop a method where RL policies are conditioned on language embeddings of reward machine states rather than treating states as isolated symbols. This approach enables agents to reuse learned skills across different tasks and achieve zero-shot generalization to novel task compositions without additional training.

10 retrieved papers
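The conditioning mechanism can be sketched as a policy that scores actions from the concatenation of the environment observation and the embedding of the current RM state. The linear scorer below is a toy stand-in for the neural policy the paper describes; dimensions and the action space are hypothetical.

```python
import random

class ConditionedPolicy:
    """Toy policy conditioned on the RM-state language embedding.

    A linear scorer over concat(observation, state_embedding); in ARM-FM this
    would be a neural network, but the conditioning interface is the same.
    """
    def __init__(self, obs_dim, emb_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0, 0.1) for _ in range(obs_dim + emb_dim)]
                  for _ in range(n_actions)]

    def act(self, obs, rm_state_embedding):
        x = list(obs) + list(rm_state_embedding)
        scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        return max(range(len(scores)), key=scores.__getitem__)

policy = ConditionedPolicy(obs_dim=4, emb_dim=8, n_actions=3)

# The same weights act under any RM state: swapping in the embedding of a
# novel subtask changes behavior without retraining, which is the mechanism
# behind the claimed zero-shot generalization.
action = policy.act([0.1, 0.2, 0.3, 0.4], [0.0] * 8)
assert action in range(3)
```

Because the RM state enters only through its embedding, a policy trained on one set of subtasks can be queried with the embedding of an unseen subtask, rather than needing a separate policy per discrete automaton state.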

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: ARM-FM framework for automated compositional reward design

Contribution: Language-Aligned Reward Machines (LARMs)

Contribution: Policy conditioning method for experience reuse and zero-shot generalization

(Each contribution is described in full under Claimed Contributions above.)