Pose-RFT: Aligning MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Human Pose Estimation · Multimodal Large Language Model · Reinforcement Fine-Tuning
Abstract:

Generating 3D human poses from multimodal inputs such as text or images requires models to capture both rich semantic and spatial correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise, their supervised fine-tuning (SFT) paradigm struggles to resolve the task's inherent ambiguity. Its reliance on objectives like SMPL parameter regression creates a critical alignment gap, compromising the model's ability to achieve the required semantic and spatial fidelity. To close the gap, we propose Pose-RFT, a framework that shifts the learning paradigm from supervised imitation to reward-driven reinforcement fine-tuning (RFT). We address the core technical challenge of this task: a hybrid action space requiring joint optimization of discrete language and continuous pose outputs. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that enables stable optimization by performing group-wise reward normalization over sampled responses. Pose-RFT incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of our approach in closing the alignment gap for 3D pose generation.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, its coverage is NOT exhaustive and its judgments are approximate. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Pose-RFT, a framework that applies reinforcement fine-tuning to multimodal 3D pose generation from text and images. It resides in the 'Language-to-Pose Generation with LLMs' leaf, which contains five papers including the original work. This leaf sits within the broader 'Language and Vision-Guided Pose Generation' branch, indicating a moderately populated research direction focused on translating natural language and visual inputs into pose representations. The taxonomy shows this is an active but not overcrowded area, with sibling leaves addressing image-to-pose generation and interactive editing as complementary approaches.

The taxonomy reveals that neighboring research directions include 'Image-to-Pose and Avatar Generation' and 'Multimodal Interactive Editing and Synthesis', both under the same parent branch. These sibling leaves focus on visual-only generation and combined text-image-sketch editing respectively, suggesting the field is exploring diverse input modalities for pose synthesis. The exclude_note for the parent branch clarifies that sensor fusion and temporal forecasting belong elsewhere, positioning this work firmly in the generative modeling space rather than estimation or prediction. The scope_note emphasizes generative models over discriminative approaches, aligning with Pose-RFT's use of MLLMs and reward-driven optimization.

Among thirty candidates examined, the analysis found limited prior-work overlap. For the Pose-RFT framework contribution, ten candidates were examined and none appeared to refute it, suggesting that the shift from supervised fine-tuning to reinforcement learning for pose generation may be relatively unexplored in this specific context. For the HyGRPO algorithm contribution, ten candidates were examined and one potentially refutable match was found, indicating that hybrid action space optimization has some precedent but may still offer novel technical elements. For the task-specific reward functions contribution, ten candidates were examined with no refutations, suggesting this aspect may represent a less-explored direction within the limited search scope.

Based on the top-thirty semantic matches examined, the work appears to occupy a distinct position within language-to-pose generation by emphasizing reward-driven refinement over supervised imitation. The taxonomy context shows this leaf is moderately populated with five papers, suggesting active but not saturated research interest. However, the limited search scope means the analysis captures only a subset of potentially relevant prior work, particularly in adjacent areas like reinforcement learning for generative models or hybrid optimization methods that may exist outside the immediate pose generation literature.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: 3D human pose generation from multimodal inputs. The field encompasses diverse approaches that leverage combinations of visual, textual, temporal, and sensor data to reconstruct or synthesize human poses in three dimensions.

The taxonomy reveals several major branches. Multimodal Fusion for 3D Pose Estimation integrates heterogeneous sensor streams such as RGB-D, IMUs, and LiDAR to improve robustness and accuracy, often addressing challenges in occlusion and viewpoint variation. Language and Vision-Guided Pose Generation focuses on translating natural language descriptions or visual cues into pose representations, enabling intuitive human-computer interaction and creative applications. Temporal Pose Forecasting and Motion Synthesis emphasizes predicting future poses or generating plausible motion sequences, drawing on recurrent and diffusion-based models. Domain-Specific Pose Estimation Applications tailors methods to specialized contexts like autonomous driving, healthcare monitoring, and egocentric vision, while Data Augmentation and Domain Adaptation tackles the scarcity and distribution shift of training data. Single-Modality and Constrained Pose Estimation explores scenarios with limited input modalities, and Surveys and Comprehensive Reviews provide structured overviews of the rapidly evolving landscape. Recent work highlights contrasts between data-driven synthesis and language-conditioned generation.

Within Language and Vision-Guided Pose Generation, a small cluster of studies explores leveraging large language models to bridge textual descriptions and 3D poses. PoseScript[5] pioneered structured text-to-pose mappings, while ChatPose[7] and PoseLLaVA[24] extended conversational and vision-language capabilities for interactive pose editing and retrieval.
Pose-RFT[0] situates itself in this language-to-pose generation subfield, emphasizing refinement through feedback mechanisms to improve alignment between natural language prompts and generated poses. Compared to PoseScript[5], which focuses on compositional text encoding, Pose-RFT[0] appears to prioritize iterative correction, potentially offering finer control over pose attributes. Meanwhile, FreeMotion[32] explores unconstrained motion synthesis from language, suggesting ongoing interest in balancing expressiveness with physical plausibility across this branch.

Claimed Contributions

Pose-RFT framework for 3D pose generation

The authors introduce Pose-RFT, a novel framework that shifts the learning paradigm from supervised fine-tuning to reward-driven reinforcement fine-tuning for multimodal large language models performing 3D human pose generation. This framework addresses the alignment gap caused by the one-to-many nature of pose generation tasks.

10 retrieved papers
HyGRPO algorithm for hybrid action spaces

The authors develop Hybrid Action Space Group Relative Policy Optimization (HyGRPO), a novel online reinforcement learning algorithm designed to jointly optimize discrete language tokens and continuous 3D pose parameters. The algorithm uses group-wise reward normalization over sampled responses to enable stable optimization in the hybrid action space.

10 retrieved papers
Can Refute
Task-specific reward functions for pose generation

The authors propose four task-specific reward functions: a spatial location reward for image-to-pose generation, a semantic alignment reward for text-to-pose generation, a format correctness reward, and a text embedding similarity reward. These rewards guide the policy optimization to achieve both spatial accuracy and semantic alignment.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Pose-RFT framework for 3D pose generation

The authors introduce Pose-RFT, a novel framework that shifts the learning paradigm from supervised fine-tuning to reward-driven reinforcement fine-tuning for multimodal large language models performing 3D human pose generation. This framework addresses the alignment gap caused by the one-to-many nature of pose generation tasks.
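The "one-to-many" ambiguity mentioned above can be made concrete with a toy example: for a prompt like "raise the left arm", many distinct joint configurations are equally valid, yet a regression objective penalizes every sample that deviates from the single annotated pose, while a reward can score whether the described property holds. All names and values below are hypothetical illustrations, not the paper's actual formulation.

```python
# Toy illustration of the one-to-many ambiguity in pose generation.
# The angles and threshold are hypothetical placeholders.
ground_truth_angle = 1.2          # the single annotated shoulder angle (radians)
valid_samples = [0.9, 1.2, 1.5]   # distinct but equally valid generations

# SFT-style regression loss: only the exact annotation gets zero loss,
# so valid alternative poses are penalized as if they were errors.
sft_losses = [(a - ground_truth_angle) ** 2 for a in valid_samples]

# Reward-style objective: scores whether the described property holds
# (arm raised above a threshold), so every valid pose scores fully.
rewards = [1.0 if a > 0.8 else 0.0 for a in valid_samples]

print(sft_losses)  # only the middle sample reaches zero loss
print(rewards)     # all three samples receive full reward
```

This is the alignment gap in miniature: the regression objective and the task's success criterion disagree on two of the three samples, whereas the reward agrees with all three.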

Contribution

HyGRPO algorithm for hybrid action spaces

The authors develop Hybrid Action Space Group Relative Policy Optimization (HyGRPO), a novel online reinforcement learning algorithm designed to jointly optimize discrete language tokens and continuous 3D pose parameters. The algorithm uses group-wise reward normalization over sampled responses to enable stable optimization in the hybrid action space.
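The group-wise reward normalization described here can be sketched in a few lines: rewards for the G responses sampled from the same prompt are standardized within the group to form advantages. The function name and the idea of reusing one scalar advantage for both the discrete and continuous heads are assumptions based on this description and on GRPO-style methods generally; the paper's exact formulation may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of sampled responses.

    Each response's advantage is its reward standardized against the
    group mean and standard deviation, as in GRPO-style objectives.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four responses sampled for one prompt, scored by the reward model.
# In a hybrid action space, the same scalar advantage could weight both
# the token log-likelihood term and the continuous pose-parameter term.
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
print(adv)  # zero-mean, above-average responses get positive advantage
```

Normalizing within the group removes the need for a learned value baseline, which is one reason this family of methods tends to be stable with small sample groups.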

Contribution

Task-specific reward functions for pose generation

The authors propose four task-specific reward functions: a spatial location reward for image-to-pose generation, a semantic alignment reward for text-to-pose generation, a format correctness reward, and a text embedding similarity reward. These rewards guide the policy optimization to achieve both spatial accuracy and semantic alignment.
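The four rewards listed above would typically be combined into one scalar per sampled response before normalization. The sketch below shows one plausible weighted combination; the component names, score fields, and weights are hypothetical placeholders, and the paper defines the actual metrics (e.g., a keypoint-based spatial score or embedding cosine similarity for the text reward).

```python
def total_reward(response, weights=(1.0, 1.0, 0.5, 0.5)):
    """Combine four per-response reward terms into one scalar.

    `response` is a dict of precomputed scores; missing terms default
    to zero so image-only or text-only prompts can omit the other task's
    reward. All field names and weights are illustrative assumptions.
    """
    w_spatial, w_semantic, w_format, w_embed = weights
    r = 0.0
    r += w_format * (1.0 if response.get("format_ok") else 0.0)
    r += w_spatial * response.get("spatial_score", 0.0)    # image-to-pose
    r += w_semantic * response.get("semantic_score", 0.0)  # text-to-pose
    r += w_embed * response.get("text_sim", 0.0)
    return r

print(total_reward({"format_ok": True, "spatial_score": 0.7,
                    "semantic_score": 0.9, "text_sim": 0.8}))
```

Gating everything behind a format-correctness term is a common pattern in reinforcement fine-tuning: a response that cannot be parsed into valid pose parameters contributes no task reward at all.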