Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking
Overview
Overall Novelty Assessment
The paper contributes a unified framework for analyzing contrastive learning versus supervised fine-tuning in LLM-based reranking, alongside a multimodal reranking benchmark (MRB) and state-of-the-art GMR models. It resides in the 'Contrastive vs. Supervised Fine-Tuning Objectives' leaf, which contains only two papers total (including this one). This leaf sits within the broader 'Training Objective Design and Comparison' branch, indicating a relatively sparse research direction focused specifically on direct objective comparisons for LLM rerankers. The taxonomy reveals that while training objective design is an active area, head-to-head comparisons of CL versus SFT remain underexplored.
The taxonomy shows neighboring leaves addressing reinforcement learning hybrids, specialized loss functions, and alternative supervision sources. The 'Reinforcement Learning and Hybrid Training Approaches' leaf explores multi-objective optimization, while 'Specialized Loss Functions for Reranking' examines novel loss designs for ranking errors. The 'Alternative Supervision Signals' leaf investigates LLM annotations versus click data. The original paper diverges from these by focusing on foundational objective comparison rather than hybrid methods or supervision sources, and by extending the analysis to multimodal retrieval contexts where text and vision signals interact.
Among 24 candidates examined, the unified framework contribution (4 candidates, 0 refutable) appears relatively novel, with no clear prior work decomposing objectives into weight and direction components for LLM reranking. The MRB benchmark contribution (10 candidates, 1 refutable) shows more overlap, suggesting existing multimodal evaluation resources may partially cover this ground. The GMR models contribution (10 candidates, 0 refutable) appears novel in achieving state-of-the-art multimodal reranking performance. The limited search scope means these assessments reflect top-30 semantic matches and immediate citations, not exhaustive field coverage.
Given the sparse taxonomy leaf and limited refutation signals, the work appears to occupy a relatively underexplored niche at the intersection of objective comparison and multimodal reranking. The analysis is constrained by the 24-candidate search scope and may not capture all relevant prior work in adjacent areas like BERT-based objective studies or broader multimodal retrieval benchmarks. The framework and model contributions show stronger novelty signals than the benchmark component within this limited examination.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors develop a unified framework (URL) that decomposes reranking loss functions into weight and direction components, enabling systematic comparison between supervised fine-tuning and contrastive learning. Through this decomposition, they demonstrate that SFT's superior performance stems primarily from its weight component, which provides stronger optimization signals than CL.
The authors construct MRB (multimodal reranking benchmark), a comprehensive evaluation benchmark containing 40 test datasets spanning diverse modalities including single-modal, cross-modal, and fused-modal retrieval tasks. This benchmark enables rigorous assessment of universal multimodal reranking models across different domains and task types.
The authors develop GMR-3B and GMR-7B, instruction-aware multimodal LLM rerankers trained using supervised fine-tuning on approximately 1.5 million diverse query-document pairs. These models establish new state-of-the-art performance on the MRB benchmark, demonstrating the practical effectiveness of their SFT-based approach for universal multimodal reranking.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline
Contribution Analysis
Detailed comparisons for each claimed contribution
Unified framework for analyzing SFT and CL in LLM reranking
As summarized above, the unified framework (URL) decomposes reranking loss functions into weight and direction components, attributing SFT's advantage over CL primarily to the stronger optimization signal of its weight component.
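The weight/direction decomposition can be made concrete on a toy example. The sketch below is an illustrative reconstruction under assumed definitions (weight = gradient magnitude, direction = unit gradient vector), not the paper's exact formulation: it factors the gradient of a listwise softmax cross-entropy loss over candidate scores, the common core of both SFT-style and CL-style objectives, into these two components.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def listwise_ce_grad(scores, pos, tau=1.0):
    """Gradient of -log softmax(scores / tau)[pos] w.r.t. the scores.
    tau = 1 mimics an SFT-style objective; a small tau mimics the sharp
    temperature typical of contrastive losses."""
    p = softmax(scores / tau)
    g = p.copy()
    g[pos] -= 1.0
    return g / tau

def decompose(grad):
    """Factor a gradient into a scalar weight (its magnitude) and a unit
    direction -- an assumed, illustrative reading of the weight/direction
    split, not necessarily the paper's definition."""
    w = np.linalg.norm(grad)
    return w, (grad / w if w > 0 else grad)

scores = np.array([2.0, 0.5, -1.0, 0.3])  # one positive, three negatives
g_sft = listwise_ce_grad(scores, pos=0, tau=1.0)
g_cl = listwise_ce_grad(scores, pos=0, tau=0.05)
w_sft, d_sft = decompose(g_sft)
w_cl, d_cl = decompose(g_cl)
# Both objectives push the positive's score up and the negatives' down,
# but with different weights -- the axis along which the framework
# compares SFT and CL.
```

Under this reading, comparing `w_sft` against `w_cl` across training examples is what isolates the weight component's contribution to final ranking quality.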
[5] Large Language Models for Reranking: A Survey
[9] Self-supervised scientific document recommendation based on contrastive learning
[10] HMCL: Task-Optimal Text Representation Adaptation through Hierarchical Contrastive Learning
[11] Improving Fine-tuning of Language Models with an Emphasis on Isotropy and Rank
MRB benchmark for multimodal reranking evaluation
As summarized above, MRB comprises 40 test datasets spanning single-modal, cross-modal, and fused-modal retrieval tasks, enabling assessment of universal multimodal rerankers across domains.
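Benchmarks of this kind typically score rerankers with rank-sensitive metrics; NDCG@k is a common choice. Below is a minimal, generic sketch of the metric (not MRB's official scoring script), where `relevances` lists graded relevance labels in the order the reranker produced:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked documents."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking, normalized by the DCG of the
    ideal ranking (relevant documents sorted first)."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# A reranker that places the only relevant document first is scored 1.0.
perfect = ndcg_at_k([1, 0, 0, 0, 0], 5)   # 1.0
```

Averaging NDCG@k over each of the 40 test datasets, then reporting per-modality and overall means, is the usual way such a benchmark produces a single leaderboard number.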
[15] MM-Embed: Universal multimodal retrieval with multimodal LLMs
[12] UniIR: Training and benchmarking universal multimodal information retrievers
[13] A language-guided cross-modal semantic fusion retrieval method
[14] A survey on multimodal benchmarks: In the era of large AI models
[16] MMDocIR: Benchmarking multi-modal retrieval for long documents
[17] MultiBench: Multiscale benchmarks for multimodal representation learning
[18] Polysemous visual-semantic embedding for cross-modal retrieval
[19] SMIL: Multimodal Learning with Severely Missing Modality
[20] WikiDO: A new benchmark evaluating cross-modal retrieval for vision-language models
[21] Deep supervised cross-modal retrieval
GMR models achieving state-of-the-art multimodal reranking
As summarized above, GMR-3B and GMR-7B are instruction-aware multimodal LLM rerankers trained with SFT on roughly 1.5 million diverse query-document pairs, and they set new state-of-the-art results on MRB.
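SFT for LLM rerankers is commonly implemented as cross-entropy on a binary relevance token predicted after the (query, document) prompt; whether GMR uses exactly this formulation is an assumption here. A minimal sketch with a toy vocabulary (the token ids, logits, and yes/no labels are all hypothetical):

```python
import numpy as np

def sft_pointwise_loss(vocab_logits, yes_id, no_id, is_relevant):
    """Pointwise SFT loss: cross-entropy at the next-token position,
    with target token 'yes' for a relevant (query, document) pair and
    'no' otherwise. A generic recipe, not necessarily GMR's setup."""
    logits = np.asarray(vocab_logits, dtype=float)
    # Numerically stable log-softmax over the vocabulary.
    m = logits.max()
    log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))
    target = yes_id if is_relevant else no_id
    return -log_probs[target]

def rerank_score(vocab_logits, yes_id, no_id):
    """At inference, score = p(yes) / (p(yes) + p(no)), i.e. a sigmoid
    over the two label-token logits; documents are sorted by this score."""
    y, n = vocab_logits[yes_id], vocab_logits[no_id]
    return 1.0 / (1.0 + np.exp(n - y))

# Toy vocabulary of 5 tokens; ids 0 and 1 play the roles of 'yes'/'no'.
logits = np.array([2.0, -1.0, 0.1, 0.0, -0.5])
loss = sft_pointwise_loss(logits, yes_id=0, no_id=1, is_relevant=True)
score = rerank_score(logits, yes_id=0, no_id=1)
```

In this reading, the 1.5 million query-document pairs each contribute one such cross-entropy term, and the resulting score is what orders candidates at reranking time.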