Pose-RFT: Aligning MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Human Pose Estimation · Multimodal Large Language Model · Reinforcement Fine-Tuning
Abstract:

Generating 3D human poses from multimodal inputs such as text or images requires models to capture both rich semantic and spatial correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise, their supervised fine-tuning (SFT) paradigm struggles to resolve the task's inherent ambiguity. Its reliance on objectives like SMPL parameter regression creates a critical alignment gap, compromising the model's ability to achieve the required semantic and spatial fidelity. To close the gap, we propose Pose-RFT, a framework that shifts the learning paradigm from supervised imitation to reward-driven reinforcement fine-tuning (RFT). We address the core technical challenge of this task: a hybrid action space requiring joint optimization of discrete language and continuous pose outputs. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that enables stable optimization by performing group-wise reward normalization over sampled responses. Pose-RFT incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of our approach in closing the alignment gap for 3D pose generation.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, its coverage is NOT exhaustive and its judgments are approximate. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Pose-RFT, a framework that applies reinforcement fine-tuning to multimodal 3D pose generation from text and images. It resides in the 'Language-to-Pose Generation with LLMs' leaf, which contains five papers including the original work. This leaf sits within the broader 'Language and Vision-Guided Pose Generation' branch, indicating a moderately populated research direction focused on translating natural language and visual inputs into pose representations. The taxonomy shows this is an active but not overcrowded area, with sibling leaves addressing image-to-pose generation and interactive editing as complementary approaches.

The taxonomy reveals that neighboring research directions include 'Image-to-Pose and Avatar Generation' and 'Multimodal Interactive Editing and Synthesis', both under the same parent branch. These sibling leaves focus on visual-only generation and combined text-image-sketch editing respectively, suggesting the field is exploring diverse input modalities for pose synthesis. The exclude_note for the parent branch clarifies that sensor fusion and temporal forecasting belong elsewhere, positioning this work firmly in the generative modeling space rather than estimation or prediction. The scope_note emphasizes generative models over discriminative approaches, aligning with Pose-RFT's use of MLLMs and reward-driven optimization.

Among thirty candidates examined, the analysis found limited prior-work overlap. For the Pose-RFT framework contribution, ten candidates were examined and none appeared to refute it, suggesting that the shift from supervised fine-tuning to reinforcement learning for pose generation may be relatively unexplored in this specific context. For the HyGRPO algorithm contribution, ten candidates were examined and one potentially refutable match was found, indicating that hybrid action space optimization has some precedent but may still offer novel technical elements. For the task-specific reward functions contribution, ten candidates were examined with no refutations, suggesting this aspect may represent a less-explored direction within the limited search scope.

Based on the top-thirty semantic matches examined, the work appears to occupy a distinct position within language-to-pose generation by emphasizing reward-driven refinement over supervised imitation. The taxonomy context shows this leaf is moderately populated with five papers, suggesting active but not saturated research interest. However, the limited search scope means the analysis captures only a subset of potentially relevant prior work, particularly in adjacent areas like reinforcement learning for generative models or hybrid optimization methods that may exist outside the immediate pose generation literature.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: 3D human pose generation from multimodal inputs. The field encompasses diverse approaches that leverage combinations of visual, textual, temporal, and sensor data to reconstruct or synthesize human poses in three dimensions.

The taxonomy reveals several major branches. Multimodal Fusion for 3D Pose Estimation integrates heterogeneous sensor streams such as RGB-D, IMUs, and LiDAR to improve robustness and accuracy, often addressing challenges in occlusion and viewpoint variation. Language and Vision-Guided Pose Generation focuses on translating natural language descriptions or visual cues into pose representations, enabling intuitive human-computer interaction and creative applications. Temporal Pose Forecasting and Motion Synthesis emphasizes predicting future poses or generating plausible motion sequences, drawing on recurrent and diffusion-based models. Domain-Specific Pose Estimation Applications tailors methods to specialized contexts like autonomous driving, healthcare monitoring, and egocentric vision, while Data Augmentation and Domain Adaptation tackles the scarcity and distribution shift of training data. Single-Modality and Constrained Pose Estimation explores scenarios with limited input modalities, and Surveys and Comprehensive Reviews provide structured overviews of the rapidly evolving landscape. Recent work highlights contrasts between data-driven synthesis and language-conditioned generation.

Within Language and Vision-Guided Pose Generation, a small cluster of studies explores leveraging large language models to bridge textual descriptions and 3D poses. PoseScript[5] pioneered structured text-to-pose mappings, while ChatPose[7] and PoseLLaVA[24] extended conversational and vision-language capabilities for interactive pose editing and retrieval.
Pose-RFT[0] situates itself in this language-to-pose generation subfield, emphasizing refinement through feedback mechanisms to improve alignment between natural language prompts and generated poses. Compared to PoseScript[5], which focuses on compositional text encoding, Pose-RFT[0] appears to prioritize iterative correction, potentially offering finer control over pose attributes. Meanwhile, FreeMotion[32] explores unconstrained motion synthesis from language, suggesting ongoing interest in balancing expressiveness with physical plausibility across this branch.

Claimed Contributions

Pose-RFT framework for 3D pose generation

The authors introduce Pose-RFT, a novel framework that shifts the learning paradigm from supervised fine-tuning to reward-driven reinforcement fine-tuning for multimodal large language models performing 3D human pose generation. This framework addresses the alignment gap caused by the one-to-many nature of pose generation tasks.

10 retrieved papers
HyGRPO algorithm for hybrid action spaces

The authors develop Hybrid Action Space Group Relative Policy Optimization (HyGRPO), a novel online reinforcement learning algorithm designed to jointly optimize discrete language tokens and continuous 3D pose parameters. The algorithm uses group-wise reward normalization over sampled responses to enable stable optimization in the hybrid action space.

10 retrieved papers
Can Refute
Task-specific reward functions for pose generation

The authors propose four task-specific reward functions: a spatial location reward for image-to-pose generation, a semantic alignment reward for text-to-pose generation, a format correctness reward, and a text embedding similarity reward. These rewards guide the policy optimization to achieve both spatial accuracy and semantic alignment.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Pose-RFT framework for 3D pose generation

The authors introduce Pose-RFT, a novel framework that shifts the learning paradigm from supervised fine-tuning to reward-driven reinforcement fine-tuning for multimodal large language models performing 3D human pose generation. This framework addresses the alignment gap caused by the one-to-many nature of pose generation tasks.
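The "one-to-many" ambiguity mentioned above can be made concrete with a toy example: for a prompt like "raise the left arm", many distinct joint configurations are equally valid, yet a regression objective penalizes every sample that deviates from the single annotated pose, while a reward can score whether the described property holds. All names and values below are hypothetical illustrations, not the paper's actual formulation.

```python
# Toy illustration of the one-to-many ambiguity in pose generation.
# The angles and threshold are hypothetical placeholders.
ground_truth_angle = 1.2          # the single annotated shoulder angle (radians)
valid_samples = [0.9, 1.2, 1.5]   # distinct but equally valid generations

# SFT-style regression loss: only the exact annotation gets zero loss,
# so valid alternative poses are penalized as if they were errors.
sft_losses = [(a - ground_truth_angle) ** 2 for a in valid_samples]

# Reward-style objective: scores whether the described property holds
# (arm raised above a threshold), so every valid pose scores fully.
rewards = [1.0 if a > 0.8 else 0.0 for a in valid_samples]

print(sft_losses)  # only the middle sample reaches zero loss
print(rewards)     # all three samples receive full reward
```

This is the alignment gap in miniature: the regression objective and the task's success criterion disagree on two of the three samples, whereas the reward agrees with all three.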

Contribution

HyGRPO algorithm for hybrid action spaces

The authors develop Hybrid Action Space Group Relative Policy Optimization (HyGRPO), a novel online reinforcement learning algorithm designed to jointly optimize discrete language tokens and continuous 3D pose parameters. The algorithm uses group-wise reward normalization over sampled responses to enable stable optimization in the hybrid action space.
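The group-wise reward normalization described here can be sketched in a few lines: rewards for the G responses sampled from the same prompt are standardized within the group to form advantages. The function name and the idea of reusing one scalar advantage for both the discrete and continuous heads are assumptions based on this description and on GRPO-style methods generally; the paper's exact formulation may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group of sampled responses.

    Each response's advantage is its reward standardized against the
    group mean and standard deviation, as in GRPO-style objectives.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four responses sampled for one prompt, scored by the reward model.
# In a hybrid action space, the same scalar advantage could weight both
# the token log-likelihood term and the continuous pose-parameter term.
adv = group_relative_advantages([0.2, 0.8, 0.5, 0.5])
print(adv)  # zero-mean, above-average responses get positive advantage
```

Normalizing within the group removes the need for a learned value baseline, which is one reason this family of methods tends to be stable with small sample groups.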

Contribution

Task-specific reward functions for pose generation

The authors propose four task-specific reward functions: a spatial location reward for image-to-pose generation, a semantic alignment reward for text-to-pose generation, a format correctness reward, and a text embedding similarity reward. These rewards guide the policy optimization to achieve both spatial accuracy and semantic alignment.
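The four rewards listed above would typically be combined into one scalar per sampled response before normalization. The sketch below shows one plausible weighted combination; the component names, score fields, and weights are hypothetical placeholders, and the paper defines the actual metrics (e.g., a keypoint-based spatial score or embedding cosine similarity for the text reward).

```python
def total_reward(response, weights=(1.0, 1.0, 0.5, 0.5)):
    """Combine four per-response reward terms into one scalar.

    `response` is a dict of precomputed scores; missing terms default
    to zero so image-only or text-only prompts can omit the other task's
    reward. All field names and weights are illustrative assumptions.
    """
    w_spatial, w_semantic, w_format, w_embed = weights
    r = 0.0
    r += w_format * (1.0 if response.get("format_ok") else 0.0)
    r += w_spatial * response.get("spatial_score", 0.0)    # image-to-pose
    r += w_semantic * response.get("semantic_score", 0.0)  # text-to-pose
    r += w_embed * response.get("text_sim", 0.0)
    return r

print(total_reward({"format_ok": True, "spatial_score": 0.7,
                    "semantic_score": 0.9, "text_sim": 0.8}))
```

Gating everything behind a format-correctness term is a common pattern in reinforcement fine-tuning: a response that cannot be parsed into valid pose parameters contributes no task reward at all.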