Decoupled Q-Chunking

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: reinforcement learning, action chunking, offline RL
Abstract:

The bootstrapping bias problem is a long-standing challenge for temporal-difference (TD) methods in off-policy reinforcement learning (RL). Multi-step return backups can alleviate this issue but require delicate importance sampling to correct their off-policy bias. Recent work has proposed chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, enabling unbiased multi-step backups. However, extracting policies from chunked critics is challenging: the policy must output the entire action chunk open-loop, which can be sub-optimal in environments that demand reactivity and becomes increasingly difficult to model as the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning policies over long action chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned benchmarks and show that it reliably outperforms prior methods.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes decoupling chunk lengths between critics and policies in temporal-difference learning to mitigate bootstrapping bias while preserving policy reactivity. It resides in the 'Decoupled Critic-Policy Chunking' leaf, which currently contains only this work among the six papers examined across the taxonomy. This positioning suggests the paper occupies a relatively sparse research direction within the broader action chunking literature, where most prior work either couples critic and policy chunking or pursues adaptive stepsize methods without explicit decoupling.

The taxonomy reveals neighboring approaches in 'Unified Action Chunking for Offline-to-Online RL' and 'Vision-Language-Action Model Fine-Tuning with Chunking', both applying chunking to different learning settings but maintaining unified chunk lengths. The 'Adaptive Multi-Step Temporal-Difference Methods' branch explores dynamic horizon selection through sequence compression or context-aware stepsize learning, offering an alternative to fixed chunking. The paper's decoupling strategy diverges from these directions by maintaining multi-step critic benefits while allowing shorter policy chunks, bridging the gap between fixed chunking and adaptive methods.

Among the six candidate papers examined, none was found to clearly refute the theoretical analysis of action chunking Q-learning. The core algorithmic contributions (the DQC algorithm and the distilled partial critic) were not compared against any candidates in this limited search. Within the top semantic matches and the citation-expanded set, then, no work directly anticipates the specific combination of decoupled chunking with optimistic backup for partial action sequences, though the small search scope limits definitive conclusions about field-wide novelty.

Based on the limited literature search covering six semantically related papers, the work appears to introduce a distinct approach within action chunking methods. The absence of sibling papers in its taxonomy leaf and lack of refuting candidates among examined works suggest novelty in the decoupling mechanism, though the restricted search scope means potentially relevant work in broader multi-step RL or hierarchical control may not have been captured.

Taxonomy

Core-task Taxonomy Papers: 6
Claimed Contributions: 3
Contribution Candidate Papers Compared: 6
Refutable Papers: 0

Research Landscape Overview

Core task: mitigating bootstrapping bias in temporal-difference learning with action chunking. The field addresses how reinforcement learning agents can improve credit assignment and value estimation by grouping primitive actions into temporally extended chunks.

The taxonomy reveals four main branches. Action Chunking for Multi-Step Value Estimation explores methods that explicitly leverage action sequences to reduce bootstrapping error, often by decoupling how critics and policies handle temporal abstraction. Adaptive Multi-Step Temporal-Difference Methods focuses on dynamically adjusting the horizon over which value targets are computed, balancing bias and variance without necessarily committing to fixed chunk boundaries. Developmental and Cognitive Architectures examines biologically inspired or hierarchical frameworks that naturally produce chunked behavior through working memory or skill discovery. Finally, Temporal Action Detection in Video represents a distinct application domain where chunking serves video understanding rather than control. Representative works such as Action Chunking RL[1] and Non-Local Temporal Difference[3] illustrate how chunking can be integrated into both policy parameterization and value function updates.

Within the Action Chunking for Multi-Step Value Estimation branch, a central theme is whether to unify or decouple the chunking mechanisms used by the critic and the policy. Some approaches tie chunk length directly to policy outputs, while others allow the value function to bootstrap over different horizons to reduce bias. Decoupled Q-Chunking[0] sits squarely in the latter camp, proposing that the critic can chunk independently of the policy's action sequences, thereby mitigating the bootstrapping bias that arises when value targets rely on overly short rollouts. This contrasts with methods like Action Chunking RL[1], which couples chunking more tightly to policy execution, and with adaptive schemes such as Context-aware Active Multi-Step[5], which dynamically select step sizes based on state context. The interplay between fixed versus adaptive chunking, and between coupled versus decoupled critic-policy designs, remains an active area of exploration, with trade-offs in sample efficiency, computational overhead, and the stability of learned value estimates.

Claimed Contributions

Theoretical analysis of action chunking Q-learning

The authors formalize the open-loop consistency condition and quantify the value estimation bias in action chunking Q-learning (Theorem 4.4). They derive conditions under which action chunking Q-learning outperforms standard n-step return methods (Theorem 4.8), providing theoretical foundations for when chunked critics should be preferred.

6 retrieved papers
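The report cites only the theorem numbers. To make the object of the analysis concrete, the backup such results concern can be written out; the notation below is ours and shows only the generic form of a chunked Q-learning update, not the paper's exact statement:

```latex
% Chunked TD backup for a length-h action chunk (generic form,
% notation ours). The critic bootstraps only every h steps, so the
% h intermediate actions need no per-step importance correction --
% the usual price of n-step returns under off-policy data.
\[
Q\bigl(s_t,\, a_{t:t+h}\bigr) \;\leftarrow\;
  \sum_{i=0}^{h-1} \gamma^{i}\, r_{t+i}
  \;+\; \gamma^{h}\, \max_{a' \in \mathcal{A}^{h}} Q\bigl(s_{t+h},\, a'\bigr)
\]
```

A bias analysis of this form would then presumably compare this target, which is exact only when the behavior data is open-loop consistent over the chunk, against the n-step return, whose off-policy bias grows with the lookahead.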
Decoupled Q-chunking (DQC) algorithm

The authors introduce DQC, which trains a policy to predict shorter partial action chunks while using a chunked critic that operates over longer complete action chunks. This is achieved through a distilled critic that optimistically approximates the maximum value achievable when extending partial chunks to complete ones, retaining multi-step value propagation benefits while avoiding open-loop sub-optimality.

0 retrieved papers
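The mechanics described above can be sketched in a few lines of Python. Everything here is illustrative: the function names, chunk shapes, and the quadratic stand-in critic are our assumptions, not the paper's code. A chunked critic scores complete H-step chunks, while the value of a shorter h-step partial chunk is taken optimistically over possible completions.

```python
import random

H, h, ACT_DIM = 4, 2, 3  # complete chunk, partial chunk, action dim (illustrative)

def chunked_critic(state, chunk):
    """Stand-in for the chunked critic Q(s, a_{1:H}); a simple
    quadratic score so the sketch stays self-contained."""
    assert len(chunk) == H
    return -sum((a - s) ** 2 for step in chunk for a, s in zip(step, state))

def partial_chunk_value(state, partial, n_completions=64, seed=0):
    """Optimistic value of an h-step partial chunk: the max of the
    chunked critic over sampled completions. DQC instead *distills*
    this maximum into a separate partial critic; the sampled max here
    is only a crude stand-in for that learned approximation."""
    assert len(partial) == h
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_completions):
        tail = [[rng.gauss(0.0, 1.0) for _ in range(ACT_DIM)]
                for _ in range(H - h)]
        best = max(best, chunked_critic(state, partial + tail))
    return best

# The policy is then trained to output only h-step chunks that score
# well under the partial value -- it never commits to all H actions.
state = [0.0] * ACT_DIM
reactive = [[0.0] * ACT_DIM for _ in range(h)]   # close to the state
committed = [[3.0] * ACT_DIM for _ in range(h)]  # far from the state
assert partial_chunk_value(state, reactive) > partial_chunk_value(state, committed)
```

The sample-based max is where the "optimistic backup" of the abstract enters: the partial chunk is credited with the best achievable continuation, not an average one.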
Distilled partial critic with implicit maximization

The authors develop a separate partial critic that is trained via implicit maximization loss to approximate the maximum value achievable when a partial action chunk is extended to a complete chunk. This enables policy optimization over shorter action chunks while leveraging the value learning benefits of longer-horizon chunked critics.

0 retrieved papers
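The report does not spell out the loss. One standard way to realize an "implicit maximization" is expectile regression, the asymmetric squared loss used by IQL-style methods; the sketch below assumes that flavor (our guess, not a detail taken from the paper). With tau > 0.5, under-predictions of sampled targets are penalized more heavily, so the minimizer sits near the top of the target distribution rather than at its mean:

```python
def expectile_grad(pred, target, tau=0.9):
    """Gradient (w.r.t. pred) of the asymmetric squared loss
    |tau - 1{target < pred}| * (target - pred)**2."""
    diff = target - pred
    weight = tau if diff > 0 else 1.0 - tau
    return -2.0 * weight * diff

# Q-values of sampled completions of one partial chunk (made-up numbers).
targets = [0.0, 1.0, 2.0, 10.0]

# Scalar gradient descent stands in for training the partial critic.
pred, lr = 0.0, 0.05
for _ in range(2000):
    pred -= lr * sum(expectile_grad(pred, t) for t in targets)

# pred converges to the 0.9-expectile of the targets (about 7.75),
# well above their mean of 3.25 -- a soft, implicit maximum.
```

Because the maximization is implicit in the loss, the partial critic never has to enumerate or sample completions at policy-optimization time, which is what makes training the policy over short chunks cheap.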

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.
