Does “Do Differentiable Simulators Give Better Policy Gradients?” Give Better Policy Gradients?

ICLR 2026 Conference Submission
Anonymous Authors
Differentiable simulation, Reinforcement learning, Policy gradient, Model-based reinforcement learning, Monte Carlo gradient estimation, Reparameterization gradient, Likelihood ratio gradient, Score function gradient estimator, Inverse variance weighting, Randomized smoothing
Abstract:

In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces two methods for policy gradient estimation in discontinuous dynamics: DDCG, a lightweight discontinuity detection test that switches between first-order and zeroth-order estimators, and IVW-H, a per-step inverse-variance weighting scheme. It resides in the 'Discontinuity Detection and Estimator Switching' leaf under 'Theoretical Convergence and Optimization', a leaf that contains only this single paper. This isolation suggests that the specific combination of explicit discontinuity detection with adaptive estimator switching represents a relatively unexplored niche within the broader field of policy gradient methods for nonsmooth dynamics.

The taxonomy reveals that neighboring research directions pursue alternative strategies: the 'Smoothing and Mollification Techniques' leaf contains three papers that regularize discontinuities rather than detect them, while 'Convergence in Non-Smooth and Weakly Smooth Settings' focuses on theoretical guarantees without explicit switching mechanisms. The 'Differentiable Simulation for Policy Learning' branch encompasses six papers in contact-rich tasks and three in adaptive hybrid optimization, suggesting that many researchers address discontinuities by constructing smooth surrogate models rather than handling them directly. The paper's approach diverges by retaining the original nonsmooth dynamics and selectively applying appropriate estimators.

Among nine candidates examined across three contributions, none clearly refuted the proposed methods. For DDCG, two candidates were examined with zero refutable matches; for the empirical re-evaluation of bias and AoBG limitations, seven candidates were examined, also with zero refutations. No candidates were retrieved for IVW-H, indicating limited prior work on per-step inverse-variance weighting in this context. The absence of refutable prior work within this limited search scope suggests that the specific combination of discontinuity detection criteria and variance-based estimator selection has not been extensively explored, though the small candidate pool (nine total) means substantial related work may exist beyond the top-K semantic matches examined.

Given the limited search scope of nine candidates and the paper's placement in a singleton taxonomy leaf, the work appears to occupy a distinct methodological position. The analysis captures methods that either smooth discontinuities or develop general convergence theory, but the specific focus on lightweight detection tests and inverse-variance weighting for estimator selection seems less represented. However, the small candidate pool and narrow semantic search window mean this assessment reflects only a localized view of the literature, not an exhaustive survey of all gradient estimation techniques for nonsmooth reinforcement learning.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 9
Refutable Papers: 0

Research Landscape Overview

Core task: policy gradient estimation in discontinuous dynamics. The field addresses the challenge of computing reliable gradients for policy optimization when the underlying system exhibits discontinuities—such as contact events in robotics, mode switches in hybrid systems, or discrete decision points.

The taxonomy reveals several complementary research directions. Differentiable Simulation for Policy Learning encompasses works like Parallel Differentiable Simulation[1] and Quadruped Differentiable Learning[3] that build smooth surrogate models or simulators to enable gradient flow. Model-Free and Hybrid Methods include approaches that sidestep explicit gradient computation or blend learning with classical control. Theoretical Convergence and Optimization investigates the mathematical foundations, ensuring that gradient estimators remain valid despite nonsmoothness. Specialized Gradient Methods and Extensions develop tailored estimators—such as reparameterization tricks or smoothing techniques—to handle specific discontinuity structures. Algorithmic Techniques and Optimization Methods focus on practical solver strategies, while Neural Network Training and Gradient Techniques address backpropagation challenges in networks encountering discontinuous activations or data.

A central tension emerges between smoothing-based methods, which approximate discontinuities to recover differentiability, and exact or adaptive techniques that detect and handle discontinuities explicitly. Works like Adaptive Gradient Policy[2] and Adaptive Horizon Contact[5] exemplify adaptive strategies that switch estimators or adjust horizons based on detected events, while others such as Mollification Policy Gradient[29] and Smoothing Nonsmooth Gradient[43] apply regularization to render the problem tractable.

Better Policy Gradients[0] sits within the Theoretical Convergence and Optimization branch, specifically under Discontinuity Detection and Estimator Switching. Its emphasis on detecting discontinuities and selecting appropriate estimators aligns closely with adaptive approaches like Adaptive Gradient Policy[2] and Adaptive Horizon Contact[5], yet it contributes a more rigorous theoretical framework for when and how to switch between gradient approximations. This positions the work as a bridge between purely smoothing methods and fully model-free alternatives, offering principled guidance for practitioners navigating the trade-off between computational efficiency and gradient accuracy in contact-rich or hybrid domains.

Claimed Contributions

Discontinuity Detection Composite Gradient (DDCG)

DDCG is a method that uses a statistical test to detect discontinuities and adaptively switches between 0th-order and 1st-order gradient estimators. Unlike prior work (AoBG), it requires minimal hyperparameter tuning and maintains robustness even with small sample sizes by checking variance reliability and local smoothness conditions.

2 retrieved papers
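The switching idea behind DDCG can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's algorithm: the actual DDCG test for variance reliability and local smoothness is not specified here, and the threshold `alpha` merely stands in for its single hyperparameter.

```python
import numpy as np

def switch_gradient(g0_samples, g1_samples, alpha=2.0):
    """Hypothetical estimator-switching rule in the spirit of DDCG.

    g0_samples: zeroth-order (e.g., REINFORCE-style) gradient samples,
                shape (n_samples, dim) -- unbiased but noisy.
    g1_samples: first-order (pathwise) gradient samples, same shape --
                low variance when dynamics are smooth, unreliable near
                discontinuities.
    alpha:      assumed single hyperparameter; inflated first-order
                variance relative to the zeroth-order baseline is treated
                as evidence of nonsmoothness.
    """
    g0 = np.asarray(g0_samples, dtype=float)
    g1 = np.asarray(g1_samples, dtype=float)
    v0 = g0.var(axis=0).mean()  # average per-coordinate variance, 0th order
    v1 = g1.var(axis=0).mean()  # average per-coordinate variance, 1st order
    if v1 > alpha * v0:
        # Suspected discontinuity: fall back to the unbiased 0th-order mean.
        return g0.mean(axis=0)
    # Smooth regime: trust the low-variance 1st-order mean.
    return g1.mean(axis=0)
```

The rule falls back to the zeroth-order mean whenever the first-order samples are unusually dispersed, a crude proxy for a nearby discontinuity blowing up pathwise derivatives; DDCG's actual test is presumably more principled than this variance-ratio heuristic.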
Stepwise Inverse Variance Weighting (IVW-H)

IVW-H is a per-step, per-action inverse variance weighting scheme that combines 0th-order and 1st-order gradient estimators at each time step. It stabilizes variance in practical robotics control tasks without requiring explicit discontinuity detection, demonstrating that variance control can be sufficient in such settings.

0 retrieved papers
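Generic inverse-variance weighting of two estimators can be sketched as follows. This is a hedged illustration of the textbook per-coordinate formula g = (g0/v0 + g1/v1) / (1/v0 + 1/v1), not the paper's IVW-H implementation; in particular, the sample-based variance estimates and the `eps` regularizer are assumptions.

```python
import numpy as np

def inverse_variance_combine(g0_samples, g1_samples, eps=1e-8):
    """Combine 0th- and 1st-order gradient samples per coordinate by
    inverse-variance weighting (illustrative sketch, not IVW-H itself).

    Both inputs have shape (n_samples, dim). The coordinate-wise weight of
    each estimator is the reciprocal of its empirical variance, so the
    lower-variance estimator dominates where it is more reliable.
    """
    g0 = np.asarray(g0_samples, dtype=float)
    g1 = np.asarray(g1_samples, dtype=float)
    m0, v0 = g0.mean(axis=0), g0.var(axis=0) + eps  # eps avoids divide-by-zero
    m1, v1 = g1.mean(axis=0), g1.var(axis=0) + eps
    w0, w1 = 1.0 / v0, 1.0 / v1
    return (w0 * m0 + w1 * m1) / (w0 + w1)
```

Applied per time step and per action dimension, as the contribution describes, such a scheme needs no explicit discontinuity test: coordinates where the first-order estimator blows up are automatically down-weighted toward the zeroth-order estimate.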
Re-evaluation of empirical bias phenomenon and AoBG limitations

The authors systematically reproduce and re-evaluate experiments from prior work (AoBG), revealing that while the empirical bias phenomenon exists in discontinuous settings, the AoBG method requires extensive task-specific hyperparameter tuning and has limited sample efficiency, motivating the need for more robust alternatives.

7 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though a signal constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Discontinuity Detection Composite Gradient (DDCG)

DDCG is a method that uses a statistical test to detect discontinuities and adaptively switches between 0th-order and 1st-order gradient estimators. Unlike prior work (AoBG), it requires minimal hyperparameter tuning and maintains robustness even with small sample sizes by checking variance reliability and local smoothness conditions.

Contribution

Stepwise Inverse Variance Weighting (IVW-H)

IVW-H is a per-step, per-action inverse variance weighting scheme that combines 0th-order and 1st-order gradient estimators at each time step. It stabilizes variance in practical robotics control tasks without requiring explicit discontinuity detection, demonstrating that variance control can be sufficient in such settings.

Contribution

Re-evaluation of empirical bias phenomenon and AoBG limitations

The authors systematically reproduce and re-evaluate experiments from prior work (AoBG), revealing that while the empirical bias phenomenon exists in discontinuous settings, the AoBG method requires extensive task-specific hyperparameter tuning and has limited sample efficiency, motivating the need for more robust alternatives.