Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
Overview
Overall Novelty Assessment
The paper contributes a unified framework for analyzing contrastive learning versus supervised fine-tuning in LLM-based reranking, alongside a multimodal reranking benchmark (MRB) and state-of-the-art GMR models. It resides in the 'Contrastive vs. Supervised Fine-Tuning Objectives' leaf, which contains only two papers total (including this one). This leaf sits within the broader 'Training Objective Design and Comparison' branch, indicating a relatively sparse research direction focused specifically on direct objective comparisons for LLM rerankers. The taxonomy reveals that while training objective design is an active area, head-to-head comparisons of CL versus SFT remain underexplored.
The taxonomy shows neighboring leaves addressing reinforcement learning hybrids, specialized loss functions, and alternative supervision sources. The 'Reinforcement Learning and Hybrid Training Approaches' leaf explores multi-objective optimization, while 'Specialized Loss Functions for Reranking' examines novel loss designs for ranking errors. The 'Alternative Supervision Signals' leaf investigates LLM annotations versus click data. The original paper diverges from these by focusing on foundational objective comparison rather than hybrid methods or supervision sources, and by extending the analysis to multimodal retrieval contexts where text and vision signals interact.
Among 24 candidates examined, the unified framework contribution (4 candidates, 0 refutable) appears relatively novel, with no clear prior work decomposing objectives into weight and direction components for LLM reranking. The MRB benchmark contribution (10 candidates, 1 refutable) shows more overlap, suggesting existing multimodal evaluation resources may partially cover this ground. The GMR models contribution (10 candidates, 0 refutable) appears novel in achieving state-of-the-art multimodal reranking performance. The limited search scope means these assessments reflect top-30 semantic matches and immediate citations, not exhaustive field coverage.
Given the sparse taxonomy leaf and limited refutation signals, the work appears to occupy a relatively underexplored niche at the intersection of objective comparison and multimodal reranking. The analysis is constrained by the 24-candidate search scope and may not capture all relevant prior work in adjacent areas like BERT-based objective studies or broader multimodal retrieval benchmarks. The framework and model contributions show stronger novelty signals than the benchmark component within this limited examination.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors develop a unified framework (URL) that decomposes reranking loss functions into weight and direction components, enabling systematic comparison between supervised fine-tuning and contrastive learning. Through this decomposition, they demonstrate that SFT's superior performance stems primarily from its weight component, which provides stronger optimization signals than CL.
The authors construct MRB (multimodal reranking benchmark), a comprehensive evaluation benchmark containing 40 test datasets spanning diverse modalities including single-modal, cross-modal, and fused-modal retrieval tasks. This benchmark enables rigorous assessment of universal multimodal reranking models across different domains and task types.
The authors develop GMR-3B and GMR-7B, instruction-aware multimodal LLM rerankers trained using supervised fine-tuning on approximately 1.5 million diverse query-document pairs. These models establish new state-of-the-art performance on the MRB benchmark, demonstrating the practical effectiveness of their SFT-based approach for universal multimodal reranking.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline
Contribution Analysis
Detailed comparisons for each claimed contribution
Unified framework for analyzing SFT and CL in LLM reranking
As summarized above, the unified framework (URL) decomposes reranking loss functions into weight and direction components, attributing SFT's advantage over CL primarily to the stronger optimization signal of its weight component.
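The weight/direction decomposition can be made concrete on a toy example. The sketch below is an illustrative reconstruction under assumed definitions (weight = gradient magnitude, direction = unit gradient vector), not the paper's exact formulation: it factors the gradient of a listwise softmax cross-entropy loss over candidate scores, the common core of both SFT-style and CL-style objectives, into these two components.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def listwise_ce_grad(scores, pos, tau=1.0):
    """Gradient of -log softmax(scores / tau)[pos] w.r.t. the scores.
    tau = 1 mimics an SFT-style objective; a small tau mimics the sharp
    temperature typical of contrastive losses."""
    p = softmax(scores / tau)
    g = p.copy()
    g[pos] -= 1.0
    return g / tau

def decompose(grad):
    """Factor a gradient into a scalar weight (its magnitude) and a unit
    direction -- an assumed, illustrative reading of the weight/direction
    split, not necessarily the paper's definition."""
    w = np.linalg.norm(grad)
    return w, (grad / w if w > 0 else grad)

scores = np.array([2.0, 0.5, -1.0, 0.3])  # one positive, three negatives
g_sft = listwise_ce_grad(scores, pos=0, tau=1.0)
g_cl = listwise_ce_grad(scores, pos=0, tau=0.05)
w_sft, d_sft = decompose(g_sft)
w_cl, d_cl = decompose(g_cl)
# Both objectives push the positive's score up and the negatives' down,
# but with different weights -- the axis along which the framework
# compares SFT and CL.
```

Under this reading, comparing `w_sft` against `w_cl` across training examples is what isolates the weight component's contribution to final ranking quality.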
[5] Large Language Models for Reranking: A Survey
[9] Self-supervised scientific document recommendation based on contrastive learning
[10] HMCL: Task-Optimal Text Representation Adaptation through Hierarchical Contrastive Learning
[11] Improving Fine-tuning of Language Models with an Emphasis on Isotropy and Rank
MRB benchmark for multimodal reranking evaluation
As summarized above, MRB comprises 40 test datasets spanning single-modal, cross-modal, and fused-modal retrieval tasks, enabling assessment of universal multimodal rerankers across domains.
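Benchmarks of this kind typically score rerankers with rank-sensitive metrics; NDCG@k is a common choice. Below is a minimal, generic sketch of the metric (not MRB's official scoring script), where `relevances` lists graded relevance labels in the order the reranker produced:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked documents."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking, normalized by the DCG of the
    ideal ranking (relevant documents sorted first)."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A reranker that places the only relevant document first is scored 1.0.
perfect = ndcg_at_k([1, 0, 0, 0, 0], 5)   # 1.0
```

Averaging NDCG@k over each of the 40 test datasets, then reporting per-modality and overall means, is the usual way such a benchmark produces a single leaderboard number.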
[15] MM-Embed: Universal multimodal retrieval with multimodal LLMs
[12] UniIR: Training and benchmarking universal multimodal information retrievers
[13] A language-guided cross-modal semantic fusion retrieval method
[14] A survey on multimodal benchmarks: In the era of large AI models
[16] MMDocIR: Benchmarking multi-modal retrieval for long documents
[17] MultiBench: Multiscale benchmarks for multimodal representation learning
[18] Polysemous visual-semantic embedding for cross-modal retrieval
[19] SMIL: Multimodal Learning with Severely Missing Modality
[20] WikiDO: A new benchmark evaluating cross-modal retrieval for vision-language models
[21] Deep supervised cross-modal retrieval
GMR models achieving state-of-the-art multimodal reranking
As summarized above, GMR-3B and GMR-7B are instruction-aware multimodal LLM rerankers trained with SFT on roughly 1.5 million diverse query-document pairs, and they set new state-of-the-art results on MRB.
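SFT for LLM rerankers is commonly implemented as cross-entropy on a binary relevance token predicted after the (query, document) prompt; whether GMR uses exactly this formulation is an assumption here. A minimal sketch with a toy vocabulary (the token ids, logits, and yes/no labels are all hypothetical):

```python
import numpy as np

def sft_pointwise_loss(vocab_logits, yes_id, no_id, is_relevant):
    """Pointwise SFT loss: cross-entropy at the next-token position,
    with target token 'yes' for a relevant (query, document) pair and
    'no' otherwise. A generic recipe, not necessarily GMR's setup."""
    logits = np.asarray(vocab_logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary.
    m = logits.max()
    log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))
    target = yes_id if is_relevant else no_id
    return -log_probs[target]

def rerank_score(vocab_logits, yes_id, no_id):
    """At inference, score = p(yes) / (p(yes) + p(no)), i.e. a sigmoid
    over the two label-token logits; documents are sorted by this score."""
    y, n = vocab_logits[yes_id], vocab_logits[no_id]
    return 1.0 / (1.0 + np.exp(n - y))

# Toy vocabulary of 5 tokens; ids 0 and 1 play the roles of 'yes'/'no'.
logits = np.array([2.0, -1.0, 0.1, 0.0, -0.5])
loss = sft_pointwise_loss(logits, yes_id=0, no_id=1, is_relevant=True)
score = rerank_score(logits, yes_id=0, no_id=1)
```

In this reading, the 1.5 million query-document pairs each contribute one such cross-entropy term, and the resulting score is what orders candidates at reranking time.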