Distributions as Actions: A Unified Framework for Diverse Action Spaces

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: deterministic policy gradient, actor-critic, continuous control, discrete control, hybrid control, action space, reinforcement learning
Abstract:

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, \emph{Distributions-as-Actions Policy Gradient} (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce \emph{interpolated critic learning} (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, \emph{Distributions-as-Actions Actor-Critic} (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a distributions-as-actions framework that treats parameterized action distributions as the fundamental action representation, enabling unified policy learning across discrete, continuous, and hybrid action spaces. Within the taxonomy, it occupies the Distribution-Based Action Parameterization leaf under Action Space Unification and Representation, where it is currently the sole paper. This leaf sits alongside Latent Action Encoding and Universal Action Spaces, indicating that distribution-based parameterization represents a distinct but relatively unexplored approach to action space unification compared to latent embedding or universal representation strategies.

The taxonomy reveals that neighboring research directions include Hybrid Action Space Methods, which explicitly decompose discrete-continuous actions through hierarchical or joint optimization, and Variable and Extensible Action Spaces, which handle dynamic action sets. The paper's approach differs by transforming heterogeneous action types into a continuous distribution parameter space rather than decomposing or adapting action structures. The scope note for Distribution-Based Action Parameterization explicitly excludes hybrid action decomposition methods, suggesting the paper's unified continuous parameterization offers an alternative to the hierarchical and joint optimization strategies prevalent in the Hybrid Action Space Methods branch.

Among the three contributions analyzed, the distributions-as-actions framework and DA-PG estimator each examined ten candidates with zero refutable prior work, suggesting these core ideas appear relatively novel within the limited search scope of twenty-six candidates. The interpolated critic learning contribution examined six candidates and found one potentially refutable match, indicating some overlap with existing critic learning techniques. The statistics reflect a focused literature search rather than exhaustive coverage, so these findings characterize novelty relative to the top semantic matches and their citations, not the entire field.

Based on the limited search scope, the framework appears to introduce a distinctive approach to action space unification, particularly in its treatment of distributions as first-class actions rather than intermediate representations. The analysis covers top-K semantic matches and citation expansion but does not claim comprehensive field coverage. The single-paper leaf status and absence of refutable prior work for the core framework suggest it occupies a relatively sparse research direction, though the interpolated critic learning component shows more connection to existing techniques.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: unified reinforcement learning across diverse action spaces. The field addresses the challenge of designing RL agents that operate effectively when actions vary in type, dimensionality, or structure across tasks or environments. The taxonomy reveals several major branches:

- Action Space Unification and Representation explores foundational techniques for encoding heterogeneous actions into common frameworks, including distribution-based parameterizations.
- Hybrid Action Space Methods tackles settings where discrete and continuous controls must be jointly optimized, as in Hybrid Actor-Critic[27] and Multi-Agent Hybrid Actions[12].
- Variable and Extensible Action Spaces considers scenarios where the action set changes dynamically or grows over time, exemplified by In-Context Variable Actions[13].
- Multi-Agent Heterogeneous Action Coordination and Cross-Embodiment Transfer Learning focus, respectively, on coordination among agents with differing capabilities and on knowledge transfer across robotic platforms.
- Large-Scale and Combinatorial Action Spaces and Domain-Specific Applications address scalability challenges and practical deployments in areas such as networking, robotics, and resource allocation.
- Algorithmic Foundations and Frameworks provides the underlying theory and software tools that enable these approaches.

Within this landscape, a particularly active line of work centers on parameterized and hybrid action representations, where methods must balance expressiveness with computational tractability. Distributions as Actions[0] sits within the Action Space Unification branch, specifically under Distribution-Based Action Parameterization, proposing that agents output probability distributions over action components rather than point estimates, thereby capturing uncertainty and enabling smoother generalization across heterogeneous spaces.

This contrasts with hierarchical decomposition strategies such as Hierarchical Parameterized Actions[4], which structure complex actions into multi-level decision trees, and with imitation-based approaches such as Deep Implicit Imitation[5], which learn action mappings from demonstrations without explicit parameterization. The distribution-based perspective offers a middle ground: it retains the flexibility to handle varied action types while avoiding the rigid hierarchies and the reliance on expert data that characterize neighboring methods, positioning it as a unifying representational choice for agents operating in structurally diverse environments.

Claimed Contributions

Distributions-as-actions framework

The authors propose a new RL framework where the agent outputs distribution parameters rather than actions directly, with action sampling treated as part of the environment. This reformulation transforms any action space (discrete, continuous, or hybrid) into a continuous parameter space, enabling unified algorithmic treatment across diverse action types.
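To make the boundary shift concrete, here is a minimal sketch of the idea in code. It is not the paper's implementation: the toy bandit environment, the wrapper class, and all names (`TwoArmedBandit`, `DistributionsAsActionsWrapper`) are hypothetical. The point it illustrates is that once sampling moves inside the environment, the agent's "action" is a point on the probability simplex, i.e., a continuous object, even though the underlying environment is discrete.

```python
import numpy as np

class TwoArmedBandit:
    """Toy discrete environment: two arms with fixed mean rewards."""
    def step(self, action):  # action in {0, 1}
        reward = [0.2, 0.8][action] + 0.1 * np.random.randn()
        return None, reward, True, {}

class DistributionsAsActionsWrapper:
    """Hypothetical wrapper: the agent submits categorical probabilities
    (distribution parameters), and sampling happens inside the environment
    boundary, so the effective action space is continuous."""
    def __init__(self, env, n_actions, rng=None):
        self.env = env
        self.n_actions = n_actions
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def step(self, probs):
        probs = np.asarray(probs, dtype=float)
        probs = probs / probs.sum()  # project onto the simplex
        action = self.rng.choice(self.n_actions, p=probs)  # sampling is now "environment"
        return self.env.step(action)

env = DistributionsAsActionsWrapper(TwoArmedBandit(), n_actions=2)
_, reward, done, _ = env.step([0.1, 0.9])  # the "action" is a simplex point
```

The same construction extends to continuous or hybrid spaces by swapping the categorical for, e.g., Gaussian parameters or a product of both.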

Retrieved papers compared: 10
Distributions-as-Actions Policy Gradient (DA-PG) estimator

The authors develop a policy gradient estimator that generalizes the deterministic policy gradient to the distributions-as-actions framework. They prove this estimator has strictly lower variance than both likelihood-ratio and reparameterization estimators when using a perfect critic.
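Because the distribution-parameter space is continuous, the actor update can follow the deterministic-policy-gradient template. The sketch below is a hedged illustration of that template, not the paper's DA-PG code: the network shapes, the dummy batch, and the names `actor`/`critic` are all assumptions. The key change versus vanilla DDPG/TD3 is that the critic is learned over (state, distribution-parameter) pairs rather than over sampled actions.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 3

# Actor deterministically maps a state to categorical parameters w;
# the softmax keeps w on the probability simplex.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                      nn.Linear(32, n_actions), nn.Softmax(dim=-1))

# Critic takes the distribution parameters, not a sampled action.
critic = nn.Sequential(nn.Linear(state_dim + n_actions, 32), nn.Tanh(),
                       nn.Linear(32, 1))

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(8, state_dim)            # dummy batch of states
w = actor(states)                             # distribution parameters as "actions"
q = critic(torch.cat([states, w], dim=-1))    # Q(s, w)

# DPG-style update in parameter space: ascend Q through the
# deterministic map s -> w, mirroring the DDPG/TD3 actor step.
loss = -q.mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Since no sampling appears in the actor's computation graph, the update needs neither a likelihood-ratio term nor a reparameterization trick, which is the intuition behind the lower-variance claim.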

Retrieved papers compared: 10
Interpolated critic learning (ICL)

The authors propose a critic learning method that trains the value function at linearly interpolated points between the current distribution parameters and deterministic parameters corresponding to sampled actions. This approach improves critic generalization and provides more informative gradient signals for policy optimization.
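A minimal sketch of the interpolation step follows, under stated assumptions: the categorical case, a placeholder TD target, and made-up shapes; the paper's exact loss and schedule for the interpolation coefficient may differ. The critic is regressed at a random point on the segment between the current distribution parameters w and the "deterministic" parameters given by the one-hot encoding of the sampled action, so it sees inputs ranging from fully stochastic to fully deterministic parameterizations.

```python
import torch
import torch.nn as nn

state_dim, n_actions, batch = 4, 3, 8
critic = nn.Sequential(nn.Linear(state_dim + n_actions, 32), nn.Tanh(),
                       nn.Linear(32, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

states = torch.randn(batch, state_dim)
w = torch.softmax(torch.randn(batch, n_actions), dim=-1)   # current dist. params
actions = torch.multinomial(w, 1).squeeze(-1)              # sampled actions
onehot = torch.nn.functional.one_hot(actions, n_actions).float()
td_target = torch.randn(batch, 1)                          # placeholder TD target

# Interpolated critic learning: evaluate the critic on a convex
# combination of w and the one-hot "deterministic" parameters, and
# regress toward the same TD target.
lam = torch.rand(batch, 1)
w_interp = (1 - lam) * w + lam * onehot    # stays on the simplex
q = critic(torch.cat([states, w_interp], dim=-1))
loss = ((q - td_target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

Because both endpoints lie on the simplex, every interpolated input is itself a valid distribution parameter, so the critic is only ever trained on points the actor could in principle output.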

Retrieved papers compared: 6 (1 flagged as Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Distributions-as-actions framework

The authors propose a new RL framework where the agent outputs distribution parameters rather than actions directly, with action sampling treated as part of the environment. This reformulation transforms any action space (discrete, continuous, or hybrid) into a continuous parameter space, enabling unified algorithmic treatment across diverse action types.

Contribution 2: Distributions-as-Actions Policy Gradient (DA-PG) estimator

The authors develop a policy gradient estimator that generalizes the deterministic policy gradient to the distributions-as-actions framework. They prove this estimator has strictly lower variance than both likelihood-ratio and reparameterization estimators when using a perfect critic.

Contribution 3: Interpolated critic learning (ICL)

The authors propose a critic learning method that trains the value function at linearly interpolated points between the current distribution parameters and deterministic parameters corresponding to sampled actions. This approach improves critic generalization and provides more informative gradient signals for policy optimization.