RL makes MLLMs see better than SFT

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Multimodal LLM, Reinforcement Learning, Vision Model, Visual Representation
Abstract:

A dominant assumption in Multimodal Language Model (MLLM) research is that an MLLM's performance is largely inherited from its LLM backbone, given the backbone's immense parameter scale and remarkable capabilities. This has created a void in our understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms from Supervised Finetuning (SFT) to Reinforcement Learning (RL) magnifies this oversight: there is a significant lack of analysis of how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT on strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. Our main finding is that RL produces stronger and more localized visual representations than SFT, boosting the vision encoder's contribution to the MLLM. We then distill our findings into a simple recipe for building strong vision encoders for MLLMs: Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how reinforcement learning (specifically DPO) reshapes vision encoders in multimodal language models, contrasting this with supervised fine-tuning. It resides in the 'Reinforcement Learning for Vision Encoder Training' leaf under 'Training Paradigms and Optimization', which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of 50 papers across 19 leaf nodes, suggesting the work addresses an emerging and under-explored area. The sibling paper (VLM-R1) also examines RL-based optimization for multimodal models, indicating nascent interest in this training paradigm.

The taxonomy reveals that most vision encoder training research concentrates on supervised and self-supervised pre-training (three papers in that leaf) or visual instruction tuning (three papers). The paper's leaf sits alongside these more populated directions within 'Training Paradigms and Optimization', which collectively address how to learn effective visual representations. Neighboring branches like 'Cross-Modal Alignment and Integration' and 'Vision Encoder Architecture and Design' focus on complementary aspects—bridging modalities and architectural choices—rather than the training strategy itself. The scope notes clarify that this leaf specifically excludes supervised methods and instruction tuning, positioning the work as an alternative training paradigm.

Among 30 candidates examined across three contributions, none were found to clearly refute any claim. The first contribution (systematic SFT vs. RL comparison) examined 10 candidates with zero refutable matches. The second contribution (RL producing stronger localized representations) and third contribution (PIVOT recipe) each examined 10 candidates, also with zero refutable matches. This suggests that within the limited search scope, the specific angle of analyzing how RL fundamentally reshapes vision encoder representations—through gradient visualization, ImageNet classification, and segmentation—appears relatively unexplored. However, the small candidate pool means this assessment reflects top-30 semantic matches rather than exhaustive coverage.

Given the sparse leaf occupancy and absence of refuting prior work among examined candidates, the contributions appear to occupy a relatively novel position within the limited search scope. The systematic comparison of training paradigms' effects on vision encoders, rather than just downstream task performance, represents a distinct analytical focus. However, the analysis is constrained by examining only 30 candidates, and the broader literature on RL for multimodal models may contain relevant work not captured in this semantic search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: vision encoder training in multimodal language models. The field has evolved around several interconnected branches that address different facets of building effective multimodal systems. Vision Encoder Architecture and Design explores foundational choices such as selecting between CLIP-style and DINO-style encoders (CLIP to DINO[1]) or designing specialized architectures like Vcoder[2] and TokenPacker[3]. Training Paradigms and Optimization investigates how to effectively learn visual representations, spanning supervised fine-tuning, reinforcement learning approaches, and pretraining strategies exemplified by MM1 Pretraining[9] and PaLM-E[4]. Cross-Modal Alignment and Integration focuses on bridging vision and language modalities through projection layers and alignment mechanisms, while Multimodal Model Applications and Capabilities demonstrates the breadth of tasks these systems can handle, from visual question answering to embodied AI (OpenVLA[42]). Evaluation, Benchmarking, and Analysis provides the measurement frameworks needed to assess progress (Benchmark Survey[14]), and Efficient and Lightweight MLLM Design addresses deployment constraints through compression and distillation techniques like MobileVLM V2[26].

A particularly active tension exists between different training paradigms: while many works rely on supervised fine-tuning with large-scale instruction datasets (Visual Instruction Tuning[34]), recent efforts explore whether reinforcement learning can yield better alignment and reasoning capabilities. RL Better Than SFT[0] sits squarely within this emerging direction, investigating reinforcement learning for vision encoder training alongside related work like VLM-R1[44], which also examines RL-based optimization for multimodal models. This contrasts with the dominant supervised paradigm seen in systems like Cambrian[5] and Qwen2-VL[28], which achieve strong performance through careful data curation and architectural choices.
The central question is whether RL's ability to optimize for task-specific rewards can overcome the sample efficiency and stability challenges that have historically favored supervised approaches in vision-language pretraining.

Claimed Contributions

Systematic comparison of SFT and RL (DPO) effects on MLLMs and vision encoders

The authors perform a comprehensive analysis comparing supervised finetuning (SFT) and reinforcement learning (DPO) in multimodal language models, examining not only downstream MLLM performance but also the impact on the vision encoder itself through vision-only tasks and gradient visualizations.

10 retrieved papers
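For readers unfamiliar with the RL method this contribution centers on, the core of Direct Preference Optimization (DPO) can be sketched in a few lines. This is a generic, illustrative implementation of the standard DPO objective, not the authors' code; the function name and scalar inputs are hypothetical stand-ins for per-example log-probabilities.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the policy being
    trained and under a frozen reference model. Lower loss means the
    policy prefers the chosen response more strongly than the
    reference model does.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits) for stability
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference (both margins zero), the loss is log 2; it falls below that as the policy learns to favor the chosen response over the rejected one.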
Finding that RL produces stronger and more localized visual representations than SFT

The paper demonstrates that reinforcement learning (specifically DPO) fundamentally reshapes visual representations in MLLMs, yielding stronger and more fine-grained localization capabilities compared to supervised finetuning, as evidenced by ImageNet classification, segmentation probing, and gradient analysis.

10 retrieved papers
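The ImageNet-classification probing cited as evidence for this contribution is conventionally done by freezing the encoder and fitting a linear classifier on its features. The following is a minimal sketch of such a probe, assuming features are precomputed NumPy arrays; the function name and the closed-form ridge solver are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels,
                          test_feats, test_labels, l2=1e-3):
    """Fit a ridge-regression linear probe on frozen encoder features
    and report top-1 accuracy on held-out features. Labels are one-hot
    encoded and the probe is solved in closed form (no gradient steps),
    so the score reflects the linear separability of the features."""
    n_classes = int(train_labels.max()) + 1
    Y = np.eye(n_classes)[train_labels]                     # one-hot targets
    X = np.hstack([train_feats, np.ones((len(train_feats), 1))])  # bias column
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    Xt = np.hstack([test_feats, np.ones((len(test_feats), 1))])
    preds = (Xt @ W).argmax(axis=1)
    return (preds == test_labels).mean()
```

Comparing this accuracy for encoders taken from SFT- versus RL-trained MLLMs is one way to quantify the "stronger representations" claim.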
PIVOT: a simple recipe for building strong vision encoders for MLLMs

The authors propose PIVOT, a training method that applies RL-based preference optimization to evolve vision encoders for MLLMs. PIVOT-trained encoders outperform larger and more heavily pretrained counterparts while requiring less than 1% of the computational cost of standard vision pretraining.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic comparison of SFT and RL (DPO) effects on MLLMs and vision encoders

The authors perform a comprehensive analysis comparing supervised finetuning (SFT) and reinforcement learning (DPO) in multimodal language models, examining not only downstream MLLM performance but also the impact on the vision encoder itself through vision-only tasks and gradient visualizations.

Contribution

Finding that RL produces stronger and more localized visual representations than SFT

The paper demonstrates that reinforcement learning (specifically DPO) fundamentally reshapes visual representations in MLLMs, yielding stronger and more fine-grained localization capabilities compared to supervised finetuning, as evidenced by ImageNet classification, segmentation probing, and gradient analysis.
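The gradient analysis mentioned here generally measures how strongly each pixel influences the encoder's representation, so that "more localized" representations show saliency concentrated on the relevant object. Below is a toy, framework-free sketch that uses central finite differences in place of backpropagation; the `encode` callable and function name are hypothetical, and real analyses would use autograd on the actual encoder.

```python
import numpy as np

def saliency_map(encode, image, eps=1e-3):
    """Approximate |d ||encode(x)|| / d pixel| for every pixel via
    central finite differences. Bright entries in the returned map are
    pixels whose perturbation most changes the feature vector that
    `encode` produces for an HxW image array."""
    sal = np.zeros(image.shape)
    for idx in np.ndindex(*image.shape):
        bumped, dipped = image.copy(), image.copy()
        bumped[idx] += eps
        dipped[idx] -= eps
        delta = np.linalg.norm(encode(bumped)) - np.linalg.norm(encode(dipped))
        sal[idx] = abs(delta / (2 * eps))
    return sal
```

For an encoder that only attends to part of the image, the map is nonzero exactly on that region, which is the localization behavior the contribution attributes to RL-trained encoders.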

Contribution

PIVOT: a simple recipe for building strong vision encoders for MLLMs

The authors propose PIVOT, a training method that applies RL-based preference optimization to evolve vision encoders for MLLMs. PIVOT-trained encoders outperform larger and more heavily pretrained counterparts while requiring less than 1% of the computational cost of standard vision pretraining.