Abstract:

Model-based offline reinforcement learning (RL) aims to enhance offline RL with a dynamics model that facilitates policy exploration. However, inevitable model errors can lead to model exploitation, which degrades algorithm performance. Adversarial model learning offers a theoretical framework for mitigating model exploitation by solving a maximin formulation, and RAMBO provides a practical implementation via model gradients. However, we empirically observe that severe Q-value underestimation and gradient explosion can occur in RAMBO with only slight hyperparameter tuning, suggesting that it tends to be overly conservative and suffers from unstable model updates. To address these issues, we propose RObust value-aware Model learning via Implicitly differentiable adaptive weighting (ROMI). Instead of updating the dynamics model with model gradients, ROMI introduces a novel robust value-aware model learning approach, which requires the dynamics model to predict future states whose values are close to the minimum Q-value within a scale-adjustable state uncertainty set, enabling controllable conservatism and stable model updates. To further improve out-of-distribution (OOD) generalization during multi-step rollouts, we propose implicitly differentiable adaptive weighting, a bi-level optimization scheme that adaptively achieves dynamics- and value-aware model learning. Empirical results on the D4RL and NeoRL benchmarks show that ROMI significantly outperforms RAMBO and achieves competitive or superior performance compared to state-of-the-art methods on datasets where RAMBO typically underperforms.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ROMI, a robust value-aware model learning approach for offline reinforcement learning that addresses Q-value underestimation and gradient instability in prior adversarial model learning methods. It resides in the Conservative and Pessimistic Value Learning leaf, which contains six papers including the original work. This leaf sits within the broader Value Function Estimation and Regularization branch, indicating a moderately populated research direction focused on preventing overestimation through pessimistic value constraints. The taxonomy shows this is an active area with multiple competing approaches to incorporating conservatism into offline RL.

The paper's leaf neighbors include Implicit and Detached Value Learning (two papers) and Robust Value Functions Under Uncertainty (three papers), both addressing distributional shift through different mechanisms. The broader Model-Based Offline RL branch, particularly Robust Model-Based Offline RL (six papers) and Conservative Model-Based Policy Optimization (two papers), represents closely related work that learns dynamics models with robustness considerations. The taxonomy structure reveals that while value-based conservatism is well-explored, the integration of value-awareness directly into model learning occupies a less crowded intersection between model-based and value-based approaches.

Among 24 candidates examined across three contributions, the analysis found six refutable pairs. The robust value-aware model learning contribution (Contribution A) examined four candidates with zero refutations, suggesting relative novelty in this specific formulation. However, the implicitly differentiable adaptive weighting (Contribution B) examined ten candidates with two refutations, and the dual reformulation of Wasserstein uncertainty sets (Contribution C) examined ten candidates with four refutations, indicating more substantial prior work in these technical components. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage.

Based on the top-24 semantic matches examined, the core value-aware model learning approach appears less explored than its constituent optimization techniques. The taxonomy positioning suggests the work occupies a meaningful but not entirely novel intersection between conservative value learning and model-based methods. The analysis cannot assess whether deeper literature searches or domain-specific venues would reveal additional overlapping work beyond the candidates examined.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 24
Refutable papers: 6

Research Landscape Overview

Core task: robust value-aware model learning for offline reinforcement learning. The field addresses the challenge of learning effective policies from fixed datasets without environment interaction, organizing itself around several complementary strategies. Value Function Estimation and Regularization methods, exemplified by Conservative Q-Learning[13] and related pessimistic approaches, focus on preventing overestimation by constraining learned values to remain conservative on out-of-distribution actions. Model-Based Offline RL techniques learn dynamics models to augment limited data, while Policy Constraint and Regularization Methods enforce behavioral similarity to the dataset. Sequence Modeling and Diffusion-Based Approaches reframe the problem through generative modeling, and branches addressing Data Quality and Corruption Robustness tackle noisy or adversarially perturbed datasets. Specialized techniques, domain applications, and theoretical analyses round out the taxonomy, reflecting both methodological diversity and the practical need to handle real-world data imperfections.

A central tension emerges between conservative value estimation and expressive model learning. Works like Pessimism Efficiency[5] and Randomized Value Functions[33] explore how much pessimism is necessary and how uncertainty quantification can guide safe extrapolation, while Universal Value Uncertainties[38] seeks principled ways to measure confidence across different settings.

The Robust Value-Aware Model[0] sits within the Conservative and Pessimistic Value Learning cluster, emphasizing the integration of value-awareness directly into model learning to achieve robustness against distribution shift. This contrasts with purely model-free conservative methods like Conservative Q-Learning[13], which penalize values without explicit dynamics modeling, and with approaches such as Value-Aware Importance Weighting[2] that reweight data based on value estimates. By coupling model learning with value-based robustness criteria, the original work bridges model-based efficiency and pessimistic safety, addressing scenarios where both data scarcity and model misspecification pose risks.

Claimed Contributions

Robust value-aware model learning with scale-adjustable state uncertainty set

The authors introduce a novel model learning approach that requires the dynamics model to predict future states with values close to the minimum Q-value within a scale-adjustable state uncertainty set. This enables controllable conservatism and stable model updates, addressing RAMBO's over-conservatism and training instability issues.

4 retrieved papers
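As a rough sketch of the inner minimization this contribution relies on, the toy code below approximates the minimum of a value function over a scale-adjustable ε-ball around a model-predicted next state via projected gradient descent. All names are illustrative, and a simple quadratic stands in for the learned min-Q critic; this is not the paper's implementation.

```python
import numpy as np

def min_value_in_ball(v_fn, grad_v, s_hat, eps, steps=50, lr=0.1):
    """Approximate min_{||s - s_hat|| <= eps} V(s) by projected gradient descent.

    v_fn / grad_v: a value function and its gradient (hypothetical stand-ins
    for the learned min-Q critic); s_hat: model-predicted next state;
    eps: the scale-adjustable radius that controls conservatism.
    """
    s = s_hat.copy()
    for _ in range(steps):
        s = s - lr * grad_v(s)           # descend on V
        d = s - s_hat
        n = np.linalg.norm(d)
        if n > eps:                      # project back onto the eps-ball
            s = s_hat + d * (eps / n)
    return v_fn(s)

# Toy quadratic value function V(s) = ||s||^2: the ball's minimizer is the
# point of the ball closest to the origin.
v = lambda s: float(s @ s)
gv = lambda s: 2.0 * s
s_hat = np.array([3.0, 4.0])             # ||s_hat|| = 5
print(min_value_in_ball(v, gv, s_hat, eps=1.0))  # ~16.0 = (5 - 1)^2
```

Shrinking eps toward zero recovers the unperturbed value V(ŝ), while larger eps makes the target more pessimistic, which is the "controllable conservatism" knob the contribution describes.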
Implicitly differentiable adaptive weighting via bi-level optimization

The authors propose a bi-level optimization framework where an adaptive weighting network re-weights training samples in the inner level (achieving dynamics awareness), while the outer level updates the weighting network by minimizing the robust value-aware model loss with implicit differentiation (achieving value awareness). This hierarchical approach improves OOD generalization.

10 retrieved papers
Can Refute
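The bi-level scheme can be illustrated on a toy problem in which the inner re-weighted fit has a closed form, so the hypergradient given by the implicit function theorem is exact. The weighting, inner objective, and outer target below are hypothetical stand-ins for the paper's networks, not its actual losses.

```python
import numpy as np

def inner_solution(w, x):
    """Inner level: theta*(w) = argmin_theta sum_i w_i (theta - x_i)^2,
    a stand-in for the re-weighted dynamics-model fit (closed form here)."""
    return float(w @ x / w.sum())

def hypergradient(w, x, target):
    """Outer gradient dL/dw via the implicit function theorem, with
    outer loss L(theta*) = (theta* - target)^2. Inner stationarity
    sum_i w_i (theta* - x_i) = 0 gives dtheta*/dw_j = -(theta* - x_j) / sum_i w_i."""
    theta = inner_solution(w, x)
    dtheta_dw = -(theta - x) / w.sum()
    return 2.0 * (theta - target) * dtheta_dw

# Adapt the sample weights so the inner fit matches an outer (value-aware)
# target of 2.0; data and target are purely illustrative.
x = np.array([0.0, 1.0, 3.0])
w = np.ones(3)
for _ in range(200):
    w = np.clip(w - 0.5 * hypergradient(w, x, target=2.0), 1e-3, None)
print(inner_solution(w, x))  # approaches 2.0
```

The outer update never backpropagates through the inner optimization trajectory; it only uses the inner problem's stationarity condition, which is the efficiency argument behind implicit differentiation in bi-level schemes of this kind.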
Dual reformulation of Wasserstein dynamics uncertainty set into state uncertainty set

The authors establish a theoretical result showing that the Wasserstein dynamics uncertainty set can be reformulated into a state uncertainty set through dual transformation. This reformulation enables practical computation of the minimum expected value over the uncertainty set and provides a principled way to control conservatism via the uncertainty set scale.

10 retrieved papers
Can Refute
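The claimed duality can be sanity-checked numerically in one dimension: for a point-mass nominal next-state distribution and a 1-Lipschitz toy value function, the Wasserstein-DRO dual objective, which involves only an inner minimization over perturbed states, matches the analytic primal worst case. The setup and symbols are illustrative, not the paper's construction.

```python
import numpy as np

# Primal:  inf_{W1(p, delta_{s_hat}) <= eps} E_p[V]
# Dual:    sup_{lam >= 0} [ -lam*eps + inf_s ( V(s) + lam*|s - s_hat| ) ]
V = np.abs                  # toy value function V(s) = |s|, 1-Lipschitz
s_hat, eps = 2.0, 0.5

s_grid = np.linspace(-5.0, 5.0, 20001)      # grid for the inner inf over states
lam_grid = np.linspace(0.0, 5.0, 501)       # grid for the outer sup over lam

dual_vals = [-lam * eps + np.min(V(s_grid) + lam * np.abs(s_grid - s_hat))
             for lam in lam_grid]
dual = max(dual_vals)

# For this 1-Lipschitz V the primal optimum is V(s_hat) - eps = 1.5:
# transporting mass toward 0 trades value 1:1 against the transport budget.
print(dual)  # ~1.5, matching the primal
```

The supremum is attained at λ = 1 (the Lipschitz constant of V), where the dual evaluates to −0.5 + 2 = 1.5; the inner term inf_s (V(s) + λ|s − ŝ|) is the minimization over a state neighborhood that the contribution's state-uncertainty-set reformulation exposes.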

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Robust value-aware model learning with scale-adjustable state uncertainty set

The authors introduce a novel model learning approach that requires the dynamics model to predict future states with values close to the minimum Q-value within a scale-adjustable state uncertainty set. This enables controllable conservatism and stable model updates, addressing RAMBO's over-conservatism and training instability issues.

Contribution

Implicitly differentiable adaptive weighting via bi-level optimization

The authors propose a bi-level optimization framework where an adaptive weighting network re-weights training samples in the inner level (achieving dynamics awareness), while the outer level updates the weighting network by minimizing the robust value-aware model loss with implicit differentiation (achieving value awareness). This hierarchical approach improves OOD generalization.

Contribution

Dual reformulation of Wasserstein dynamics uncertainty set into state uncertainty set

The authors establish a theoretical result showing that the Wasserstein dynamics uncertainty set can be reformulated into a state uncertainty set through dual transformation. This reformulation enables practical computation of the minimum expected value over the uncertainty set and provides a principled way to control conservatism via the uncertainty set scale.