Distributions as Actions: A Unified Framework for Diverse Action Spaces

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: deterministic policy gradient, actor-critic, continuous control, discrete control, hybrid control, action space, reinforcement learning
Abstract:

We introduce a novel reinforcement learning (RL) framework that treats parameterized action distributions as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, hybrid, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, \emph{Distributions-as-Actions Policy Gradient} (DA-PG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce \emph{interpolated critic learning} (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical actor-critic algorithm, \emph{Distributions-as-Actions Actor-Critic} (DA-AC). Empirically, DA-AC achieves competitive performance in various settings across discrete, continuous, and hybrid control.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a distributions-as-actions framework that treats parameterized action distributions as the fundamental action representation, enabling unified policy learning across discrete, continuous, and hybrid action spaces. Within the taxonomy, it occupies the Distribution-Based Action Parameterization leaf under Action Space Unification and Representation, where it is currently the sole paper. This leaf sits alongside Latent Action Encoding and Universal Action Spaces, indicating that distribution-based parameterization represents a distinct but relatively unexplored approach to action space unification compared to latent embedding or universal representation strategies.

The taxonomy reveals that neighboring research directions include Hybrid Action Space Methods, which explicitly decompose discrete-continuous actions through hierarchical or joint optimization, and Variable and Extensible Action Spaces, which handle dynamic action sets. The paper's approach differs by transforming heterogeneous action types into a continuous distribution parameter space rather than decomposing or adapting action structures. The scope note for Distribution-Based Action Parameterization explicitly excludes hybrid action decomposition methods, suggesting the paper's unified continuous parameterization offers an alternative to the hierarchical and joint optimization strategies prevalent in the Hybrid Action Space Methods branch.

Among the three contributions analyzed, the distributions-as-actions framework and DA-PG estimator each examined ten candidates with zero refutable prior work, suggesting these core ideas appear relatively novel within the limited search scope of twenty-six candidates. The interpolated critic learning contribution examined six candidates and found one potentially refutable match, indicating some overlap with existing critic learning techniques. The statistics reflect a focused literature search rather than exhaustive coverage, so these findings characterize novelty relative to the top semantic matches and their citations, not the entire field.

Based on the limited search scope, the framework appears to introduce a distinctive approach to action space unification, particularly in its treatment of distributions as first-class actions rather than intermediate representations. The analysis covers top-K semantic matches and citation expansion but does not claim comprehensive field coverage. The single-paper leaf status and absence of refutable prior work for the core framework suggest it occupies a relatively sparse research direction, though the interpolated critic learning component shows more connection to existing techniques.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: unified reinforcement learning across diverse action spaces. The field addresses the challenge of designing RL agents that operate effectively when actions vary in type, dimensionality, or structure across tasks or environments. The taxonomy reveals several major branches:

- Action Space Unification and Representation explores foundational techniques for encoding heterogeneous actions into common frameworks, including distribution-based parameterizations.
- Hybrid Action Space Methods tackles settings where discrete and continuous controls must be jointly optimized, as in Hybrid Actor-Critic[27] and Multi-Agent Hybrid Actions[12].
- Variable and Extensible Action Spaces considers scenarios where the action set changes dynamically or grows over time, exemplified by In-Context Variable Actions[13].
- Multi-Agent Heterogeneous Action Coordination and Cross-Embodiment Transfer Learning focus, respectively, on coordination among agents with differing capabilities and on knowledge transfer across robotic platforms.
- Large-Scale and Combinatorial Action Spaces and Domain-Specific Applications address scalability challenges and practical deployments in areas such as networking, robotics, and resource allocation.
- Algorithmic Foundations and Frameworks provides the underlying theory and software tools that enable these approaches.

Within this landscape, a particularly active line of work centers on parameterized and hybrid action representations, where methods must balance expressiveness with computational tractability. Distributions as Actions[0] sits within the Action Space Unification branch, specifically under Distribution-Based Action Parameterization, proposing that agents output probability distributions over action components rather than point estimates, thereby capturing uncertainty and enabling smoother generalization across heterogeneous spaces.

This contrasts with hierarchical decomposition strategies such as Hierarchical Parameterized Actions[4], which structure complex actions into multi-level decision trees, and with imitation-based approaches such as Deep Implicit Imitation[5], which learn action mappings from demonstrations without explicit parameterization. The distribution-based perspective offers a middle ground: it retains the flexibility to handle varied action types while avoiding the rigid hierarchies and the reliance on expert data that characterize neighboring methods, positioning it as a unifying representational choice for agents operating in structurally diverse environments.

Claimed Contributions

Distributions-as-actions framework

The authors propose a new RL framework where the agent outputs distribution parameters rather than actions directly, with action sampling treated as part of the environment. This reformulation transforms any action space (discrete, continuous, or hybrid) into a continuous parameter space, enabling unified algorithmic treatment across diverse action types.
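To make the boundary shift concrete, here is a minimal sketch of the idea in code. It is not the paper's implementation: the toy bandit environment, the wrapper class, and all names (`TwoArmedBandit`, `DistributionsAsActionsWrapper`) are hypothetical. The point it illustrates is that once sampling moves inside the environment, the agent's "action" is a point on the probability simplex, i.e., a continuous object, even though the underlying environment is discrete.

```python
import numpy as np

class TwoArmedBandit:
    """Toy discrete environment: two arms with fixed mean rewards."""
    def step(self, action):  # action in {0, 1}
        reward = [0.2, 0.8][action] + 0.1 * np.random.randn()
        return None, reward, True, {}

class DistributionsAsActionsWrapper:
    """Hypothetical wrapper: the agent submits categorical probabilities
    (distribution parameters), and sampling happens inside the environment
    boundary, so the effective action space is continuous."""
    def __init__(self, env, n_actions, rng=None):
        self.env = env
        self.n_actions = n_actions
        self.rng = rng if rng is not None else np.random.default_rng(0)

    def step(self, probs):
        probs = np.asarray(probs, dtype=float)
        probs = probs / probs.sum()  # project onto the simplex
        action = self.rng.choice(self.n_actions, p=probs)  # sampling is now "environment"
        return self.env.step(action)

env = DistributionsAsActionsWrapper(TwoArmedBandit(), n_actions=2)
_, reward, done, _ = env.step([0.1, 0.9])  # the "action" is a simplex point
```

The same construction extends to continuous or hybrid spaces by swapping the categorical for, e.g., Gaussian parameters or a product of both.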

Retrieved papers compared: 10
Distributions-as-Actions Policy Gradient (DA-PG) estimator

The authors develop a policy gradient estimator that generalizes the deterministic policy gradient to the distributions-as-actions framework. They prove this estimator has strictly lower variance than both likelihood-ratio and reparameterization estimators when using a perfect critic.
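Because the distribution-parameter space is continuous, the actor update can follow the deterministic-policy-gradient template. The sketch below is a hedged illustration of that template, not the paper's DA-PG code: the network shapes, the dummy batch, and the names `actor`/`critic` are all assumptions. The key change versus vanilla DDPG/TD3 is that the critic is learned over (state, distribution-parameter) pairs rather than over sampled actions.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 3

# Actor deterministically maps a state to categorical parameters w;
# the softmax keeps w on the probability simplex.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(),
                      nn.Linear(32, n_actions), nn.Softmax(dim=-1))

# Critic takes the distribution parameters, not a sampled action.
critic = nn.Sequential(nn.Linear(state_dim + n_actions, 32), nn.Tanh(),
                       nn.Linear(32, 1))

opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(8, state_dim)            # dummy batch of states
w = actor(states)                             # distribution parameters as "actions"
q = critic(torch.cat([states, w], dim=-1))    # Q(s, w)

# DPG-style update in parameter space: ascend Q through the
# deterministic map s -> w, mirroring the DDPG/TD3 actor step.
loss = -q.mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Since no sampling appears in the actor's computation graph, the update needs neither a likelihood-ratio term nor a reparameterization trick, which is the intuition behind the lower-variance claim.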

Retrieved papers compared: 10
Interpolated critic learning (ICL)

The authors propose a critic learning method that trains the value function at linearly interpolated points between the current distribution parameters and deterministic parameters corresponding to sampled actions. This approach improves critic generalization and provides more informative gradient signals for policy optimization.
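A minimal sketch of the interpolation step follows, under stated assumptions: the categorical case, a placeholder TD target, and made-up shapes; the paper's exact loss and schedule for the interpolation coefficient may differ. The critic is regressed at a random point on the segment between the current distribution parameters w and the "deterministic" parameters given by the one-hot encoding of the sampled action, so it sees inputs ranging from fully stochastic to fully deterministic parameterizations.

```python
import torch
import torch.nn as nn

state_dim, n_actions, batch = 4, 3, 8
critic = nn.Sequential(nn.Linear(state_dim + n_actions, 32), nn.Tanh(),
                       nn.Linear(32, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

states = torch.randn(batch, state_dim)
w = torch.softmax(torch.randn(batch, n_actions), dim=-1)   # current dist. params
actions = torch.multinomial(w, 1).squeeze(-1)              # sampled actions
onehot = torch.nn.functional.one_hot(actions, n_actions).float()
td_target = torch.randn(batch, 1)                          # placeholder TD target

# Interpolated critic learning: evaluate the critic on a convex
# combination of w and the one-hot "deterministic" parameters, and
# regress toward the same TD target.
lam = torch.rand(batch, 1)
w_interp = (1 - lam) * w + lam * onehot    # stays on the simplex
q = critic(torch.cat([states, w_interp], dim=-1))
loss = ((q - td_target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```

Because both endpoints lie on the simplex, every interpolated input is itself a valid distribution parameter, so the critic is only ever trained on points the actor could in principle output.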

Retrieved papers compared: 6 (1 flagged as Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Distributions-as-actions framework

The authors propose a new RL framework where the agent outputs distribution parameters rather than actions directly, with action sampling treated as part of the environment. This reformulation transforms any action space (discrete, continuous, or hybrid) into a continuous parameter space, enabling unified algorithmic treatment across diverse action types.

Contribution 2: Distributions-as-Actions Policy Gradient (DA-PG) estimator

The authors develop a policy gradient estimator that generalizes the deterministic policy gradient to the distributions-as-actions framework. They prove this estimator has strictly lower variance than both likelihood-ratio and reparameterization estimators when using a perfect critic.

Contribution 3: Interpolated critic learning (ICL)

The authors propose a critic learning method that trains the value function at linearly interpolated points between the current distribution parameters and deterministic parameters corresponding to sampled actions. This approach improves critic generalization and provides more informative gradient signals for policy optimization.