RL makes MLLMs see better than SFT

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Multimodal LLM, Reinforcement Learning, Vision Model, Visual Representation
Abstract:

A dominant assumption in Multimodal Language Model (MLLM) research is that an MLLM's performance is largely inherited from its LLM backbone, given the backbone's immense parameter scale and remarkable capabilities. This has created a void in our understanding of the vision encoder, which determines how MLLMs perceive images. The recent shift in MLLM training paradigms from Supervised Finetuning (SFT) to Reinforcement Learning (RL) magnifies this oversight: there is a significant lack of analysis of how such training reshapes the vision encoder as well as the MLLM. To address this, we first investigate the impact of training strategies on MLLMs, where RL shows a clear advantage over SFT on strongly vision-related VQA benchmarks. Motivated by this, we conduct a critical yet under-explored analysis of the vision encoder of MLLMs through diverse and in-depth experiments, ranging from ImageNet classification and segmentation to gradient visualization. Our results demonstrate that an MLLM's post-training strategy (i.e., SFT or RL) not only leads to distinct outcomes on MLLM downstream tasks, but also fundamentally reshapes the MLLM's underlying visual representations. Our main finding is that RL produces stronger and more localized visual representations than SFT, boosting the vision encoder's contribution to the MLLM. We then distill our findings into a simple recipe for building strong vision encoders for MLLMs: Preference-Instructed Vision OpTimization (PIVOT). When integrated into MLLMs, a PIVOT-trained vision encoder outperforms even larger and more heavily-trained counterparts, despite requiring less than 1% of the computational cost of standard vision pretraining. This result opens an effective and efficient path for advancing the vision backbones of MLLMs.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how reinforcement learning (specifically DPO) reshapes vision encoders in multimodal language models, contrasting this with supervised fine-tuning. It resides in the 'Reinforcement Learning for Vision Encoder Training' leaf under 'Training Paradigms and Optimization', which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of 50 papers across 19 leaf nodes, suggesting the work addresses an emerging and under-explored area. The sibling paper (VLM-R1) also examines RL-based optimization for multimodal models, indicating nascent interest in this training paradigm.

The taxonomy reveals that most vision encoder training research concentrates on supervised and self-supervised pre-training (three papers in that leaf) or visual instruction tuning (three papers). The paper's leaf sits alongside these more populated directions within 'Training Paradigms and Optimization', which collectively address how to learn effective visual representations. Neighboring branches like 'Cross-Modal Alignment and Integration' and 'Vision Encoder Architecture and Design' focus on complementary aspects—bridging modalities and architectural choices—rather than the training strategy itself. The scope notes clarify that this leaf specifically excludes supervised methods and instruction tuning, positioning the work as an alternative training paradigm.

Among 30 candidates examined across three contributions, none were found to clearly refute any claim. The first contribution (systematic SFT vs. RL comparison) examined 10 candidates with zero refutable matches. The second contribution (RL producing stronger localized representations) and third contribution (PIVOT recipe) each examined 10 candidates, also with zero refutable matches. This suggests that within the limited search scope, the specific angle of analyzing how RL fundamentally reshapes vision encoder representations—through gradient visualization, ImageNet classification, and segmentation—appears relatively unexplored. However, the small candidate pool means this assessment reflects top-30 semantic matches rather than exhaustive coverage.

Given the sparse leaf occupancy and absence of refuting prior work among examined candidates, the contributions appear to occupy a relatively novel position within the limited search scope. The systematic comparison of training paradigms' effects on vision encoders, rather than just downstream task performance, represents a distinct analytical focus. However, the analysis is constrained by examining only 30 candidates, and the broader literature on RL for multimodal models may contain relevant work not captured in this semantic search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: vision encoder training in multimodal language models. The field has evolved around several interconnected branches that address different facets of building effective multimodal systems. Vision Encoder Architecture and Design explores foundational choices such as selecting between CLIP-style and DINO-style encoders (CLIP to DINO[1]) or designing specialized architectures like Vcoder[2] and TokenPacker[3]. Training Paradigms and Optimization investigates how to effectively learn visual representations, spanning supervised fine-tuning, reinforcement learning approaches, and pretraining strategies exemplified by MM1 Pretraining[9] and PaLM-E[4]. Cross-Modal Alignment and Integration focuses on bridging vision and language modalities through projection layers and alignment mechanisms, while Multimodal Model Applications and Capabilities demonstrates the breadth of tasks these systems can handle, from visual question answering to embodied AI (OpenVLA[42]). Evaluation, Benchmarking, and Analysis provides the measurement frameworks needed to assess progress (Benchmark Survey[14]), and Efficient and Lightweight MLLM Design addresses deployment constraints through compression and distillation techniques like MobileVLM V2[26].

A particularly active tension exists between different training paradigms: while many works rely on supervised fine-tuning with large-scale instruction datasets (Visual Instruction Tuning[34]), recent efforts explore whether reinforcement learning can yield better alignment and reasoning capabilities. RL Better Than SFT[0] sits squarely within this emerging direction, investigating reinforcement learning for vision encoder training alongside related work like VLM-R1[44], which also examines RL-based optimization for multimodal models. This contrasts with the dominant supervised paradigm seen in systems like Cambrian[5] and Qwen2-VL[28], which achieve strong performance through careful data curation and architectural choices.
The central question is whether RL's ability to optimize for task-specific rewards can overcome the sample efficiency and stability challenges that have historically favored supervised approaches in vision-language pretraining.

Claimed Contributions

Systematic comparison of SFT and RL (DPO) effects on MLLMs and vision encoders

The authors perform a comprehensive analysis comparing supervised finetuning (SFT) and reinforcement learning (DPO) in multimodal language models, examining not only downstream MLLM performance but also the impact on the vision encoder itself through vision-only tasks and gradient visualizations.

10 retrieved papers
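For readers unfamiliar with the RL method this contribution centers on, the core of Direct Preference Optimization (DPO) can be sketched in a few lines. This is a generic, illustrative implementation of the standard DPO objective, not the authors' code; the function name and scalar inputs are hypothetical stand-ins for per-example log-probabilities.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are sequence log-probabilities under the policy being
    trained and under a frozen reference model. Lower loss means the
    policy prefers the chosen response more strongly than the
    reference model does.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits) for stability
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference (both margins zero), the loss is log 2; it falls below that as the policy learns to favor the chosen response over the rejected one.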
Finding that RL produces stronger and more localized visual representations than SFT

The paper demonstrates that reinforcement learning (specifically DPO) fundamentally reshapes visual representations in MLLMs, yielding stronger and more fine-grained localization capabilities compared to supervised finetuning, as evidenced by ImageNet classification, segmentation probing, and gradient analysis.

10 retrieved papers
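The ImageNet-classification probing cited as evidence for this contribution is conventionally done by freezing the encoder and fitting a linear classifier on its features. The following is a minimal sketch of such a probe, assuming features are precomputed NumPy arrays; the function name and the closed-form ridge solver are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

def linear_probe_accuracy(train_feats, train_labels,
                          test_feats, test_labels, l2=1e-3):
    """Fit a ridge-regression linear probe on frozen encoder features
    and report top-1 accuracy on held-out features. Labels are one-hot
    encoded and the probe is solved in closed form (no gradient steps),
    so the score reflects the linear separability of the features."""
    n_classes = int(train_labels.max()) + 1
    Y = np.eye(n_classes)[train_labels]                     # one-hot targets
    X = np.hstack([train_feats, np.ones((len(train_feats), 1))])  # bias column
    W = np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)
    Xt = np.hstack([test_feats, np.ones((len(test_feats), 1))])
    preds = (Xt @ W).argmax(axis=1)
    return (preds == test_labels).mean()
```

Comparing this accuracy for encoders taken from SFT- versus RL-trained MLLMs is one way to quantify the "stronger representations" claim.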
PIVOT: a simple recipe for building strong vision encoders for MLLMs

The authors propose PIVOT, a training method that applies RL-based preference optimization to evolve vision encoders for MLLMs. PIVOT-trained encoders outperform larger and more heavily pretrained counterparts while requiring less than 1% of the computational cost of standard vision pretraining.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Systematic comparison of SFT and RL (DPO) effects on MLLMs and vision encoders

The authors perform a comprehensive analysis comparing supervised finetuning (SFT) and reinforcement learning (DPO) in multimodal language models, examining not only downstream MLLM performance but also the impact on the vision encoder itself through vision-only tasks and gradient visualizations.

Contribution

Finding that RL produces stronger and more localized visual representations than SFT

The paper demonstrates that reinforcement learning (specifically DPO) fundamentally reshapes visual representations in MLLMs, yielding stronger and more fine-grained localization capabilities compared to supervised finetuning, as evidenced by ImageNet classification, segmentation probing, and gradient analysis.
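The gradient analysis mentioned here generally measures how strongly each pixel influences the encoder's representation, so that "more localized" representations show saliency concentrated on the relevant object. Below is a toy, framework-free sketch that uses central finite differences in place of backpropagation; the `encode` callable and function name are hypothetical, and real analyses would use autograd on the actual encoder.

```python
import numpy as np

def saliency_map(encode, image, eps=1e-3):
    """Approximate |d ||encode(x)|| / d pixel| for every pixel via
    central finite differences. Bright entries in the returned map are
    pixels whose perturbation most changes the feature vector that
    `encode` produces for an HxW image array."""
    sal = np.zeros(image.shape)
    for idx in np.ndindex(*image.shape):
        bumped, dipped = image.copy(), image.copy()
        bumped[idx] += eps
        dipped[idx] -= eps
        delta = np.linalg.norm(encode(bumped)) - np.linalg.norm(encode(dipped))
        sal[idx] = abs(delta / (2 * eps))
    return sal
```

For an encoder that only attends to part of the image, the map is nonzero exactly on that region, which is the localization behavior the contribution attributes to RL-trained encoders.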

Contribution

PIVOT: a simple recipe for building strong vision encoders for MLLMs

The authors propose PIVOT, a training method that applies RL-based preference optimization to evolve vision encoders for MLLMs. PIVOT-trained encoders outperform larger and more heavily pretrained counterparts while requiring less than 1% of the computational cost of standard vision pretraining.