RL makes MLLMs see better than SFT
Overview
Overall Novelty Assessment
The paper investigates how reinforcement learning (specifically DPO) reshapes vision encoders in multimodal language models, contrasting this with supervised fine-tuning. It resides in the 'Reinforcement Learning for Vision Encoder Training' leaf under 'Training Paradigms and Optimization', which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of 50 papers across 19 leaf nodes, suggesting the work addresses an emerging and under-explored area. The sibling paper (VLM-R1) also examines RL-based optimization for multimodal models, indicating nascent interest in this training paradigm.
The taxonomy reveals that most vision encoder training research concentrates on supervised and self-supervised pre-training (three papers in that leaf) or visual instruction tuning (three papers). The paper's leaf sits alongside these more populated directions within 'Training Paradigms and Optimization', which collectively address how to learn effective visual representations. Neighboring branches like 'Cross-Modal Alignment and Integration' and 'Vision Encoder Architecture and Design' focus on complementary aspects—bridging modalities and architectural choices—rather than the training strategy itself. The scope notes clarify that this leaf specifically excludes supervised methods and instruction tuning, positioning the work as an alternative training paradigm.
Among the 30 candidates examined across the three contributions, none clearly refuted any claim. The first contribution (the systematic SFT vs. RL comparison) was checked against 10 candidates with no refuting matches; the second (RL producing stronger localized representations) and the third (the PIVOT recipe) were each checked against 10 candidates, likewise with no refuting matches. This suggests that, within the limited search scope, the specific angle of analyzing how RL fundamentally reshapes vision encoder representations (via gradient visualization, ImageNet classification, and segmentation probing) remains relatively unexplored. However, the small candidate pool means this assessment reflects the top 30 semantic matches rather than exhaustive coverage.
Given the sparse leaf occupancy and the absence of refuting prior work among the examined candidates, the contributions appear relatively novel within the scope searched. The systematic comparison of training paradigms' effects on the vision encoder itself, rather than only on downstream task performance, is a distinct analytical focus. That said, the analysis covers only 30 candidates, and the broader literature on RL for multimodal models may contain relevant work not captured by this semantic search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors perform a comprehensive analysis comparing supervised finetuning (SFT) and reinforcement learning (DPO) in multimodal language models, examining not only downstream MLLM performance but also the impact on the vision encoder itself through vision-only tasks and gradient visualizations.
The paper demonstrates that reinforcement learning (specifically DPO) fundamentally reshapes visual representations in MLLMs, yielding stronger and more fine-grained localization capabilities compared to supervised finetuning, as evidenced by ImageNet classification, segmentation probing, and gradient analysis.
The authors propose PIVOT, a training method that applies RL-based preference optimization to evolve vision encoders for MLLMs. PIVOT-trained encoders outperform larger and more heavily pretrained counterparts while requiring less than 1% of the computational cost of standard vision pretraining.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[44] VLM-R1: A stable and generalizable R1-style large vision-language model
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic comparison of SFT and RL (DPO) effects on MLLMs and vision encoders
The authors perform a comprehensive analysis comparing supervised finetuning (SFT) and reinforcement learning (DPO) in multimodal language models, examining not only downstream MLLM performance but also the impact on the vision encoder itself through vision-only tasks and gradient visualizations.
[51] Reason-RFT: Reinforcement fine-tuning for visual reasoning of vision language models
[71] SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
[72] Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning
[73] MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO
[74] Kimi-VL Technical Report
[75] Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback
[76] Evolution and Prospects of Foundation Models: From Large Language Models to Large Multimodal Models
[77] Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
[78] Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
[79] DocThinker: Explainable multimodal large language models with rule-based reinforcement learning for document understanding
Finding that RL produces stronger and more localized visual representations than SFT
The paper demonstrates that reinforcement learning (specifically DPO) fundamentally reshapes visual representations in MLLMs, yielding stronger and more fine-grained localization capabilities compared to supervised finetuning, as evidenced by ImageNet classification, segmentation probing, and gradient analysis.
[51] Reason-RFT: Reinforcement fine-tuning for visual reasoning of vision language models
[52] BigCharts-R1: Enhanced chart reasoning with visual reinforcement finetuning
[53] Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner
[54] VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning
[55] Q-Insight: Understanding image quality via visual reinforcement learning
[56] UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
[57] RewardMap: Tackling sparse rewards in fine-grained visual reasoning via multi-stage reinforcement learning
[58] Learning only with images: Visual reinforcement learning with reasoning, rendering, and visual feedback
[59] R1-Omni: Explainable omni-multimodal emotion recognition with reinforcement learning
[60] SimpleVLA-RL: Scaling VLA training via reinforcement learning
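The ImageNet classification and segmentation probing cited as evidence for this contribution typically freeze the vision encoder and fit a lightweight linear probe on its features, with probe accuracy serving as a proxy for representation quality. Below is a minimal sketch of such a probe, using closed-form ridge regression on one-hot labels over synthetic features; the function name and toy data are illustrative, not drawn from the paper.

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels,
                          test_feats, test_labels, l2=1e-3):
    """Fit a ridge-regression linear probe on frozen encoder features
    (one-hot targets) and report test accuracy, a common proxy for
    representation quality in ImageNet-style linear probing."""
    n_classes = int(train_labels.max()) + 1
    Y = np.eye(n_classes)[train_labels]          # one-hot targets
    X = train_feats
    d = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + l2*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ Y)
    preds = (test_feats @ W).argmax(axis=1)
    return float((preds == test_labels).mean())

# Toy demo with synthetic "encoder features": two separated clusters.
rng = np.random.default_rng(0)
f0 = rng.normal(loc=-2.0, size=(100, 16))
f1 = rng.normal(loc=+2.0, size=(100, 16))
feats = np.vstack([f0, f1])
labels = np.array([0] * 100 + [1] * 100)
acc = linear_probe_accuracy(feats, labels, feats, labels)
```

In this framing, a stronger encoder yields higher probe accuracy without any encoder fine-tuning, which is what lets the probe isolate representation quality from downstream task adaptation.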
PIVOT: a simple recipe for building strong vision encoders for MLLMs
The authors propose PIVOT, a training method that applies RL-based preference optimization to evolve vision encoders for MLLMs. PIVOT-trained encoders outperform larger and more heavily pretrained counterparts while requiring less than 1% of the computational cost of standard vision pretraining.
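PIVOT's exact recipe is not reproduced here, but the DPO objective it builds on has a compact closed form: the policy is rewarded for widening its log-probability margin between the preferred and dispreferred response, relative to a frozen reference model. A minimal numpy sketch follows; the input values and the beta setting are illustrative.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for a single preference pair:
    -log sigmoid(beta * margin), where the margin compares how much the
    policy prefers the chosen response over the rejected one, relative
    to a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written stably as log1p(exp(-x))
    return np.log1p(np.exp(-logits))

# Example pair: the policy already slightly prefers the chosen response.
loss = dpo_loss(logp_chosen=-5.0, logp_rejected=-6.0,
                ref_logp_chosen=-5.5, ref_logp_rejected=-5.5, beta=0.1)
```

When both margins are equal the loss is log(2), i.e. the policy is indifferent; the gradient then pushes probability mass toward the preferred response, which is the mechanism a PIVOT-style method would use to steer the vision encoder's representations.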