UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: multimodal embedding, representation learning, multimodal large language model, reasoning model
Abstract:

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework built on a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces UME-R1, a framework for generating multimodal embeddings through explicit reasoning processes, combining supervised fine-tuning with reinforcement learning. It resides in the 'Reasoning-Driven Embedding Generation' leaf, which contains only two papers including the original work. This sparse population suggests the specific approach of unifying discriminative and generative embeddings via reasoning-augmented generation remains relatively unexplored. The taxonomy shows this leaf sits within the broader 'Generative Multimodal Embedding Architectures' branch, indicating the work addresses a specialized niche within the larger field of reasoning-driven multimodal systems.

The taxonomy reveals neighboring research directions that contextualize this work. The sibling leaf 'Bidirectional Embedding Models' focuses on transforming causal models into bidirectional representations through continual pre-training, differing from UME-R1's reasoning-first approach. Nearby branches include 'Multimodal Chain-of-Thought Reasoning Frameworks' emphasizing explicit step-by-step inference traces, and 'Reinforcement Learning for Multimodal Reasoning' exploring RL optimization strategies. UME-R1 bridges these areas by applying RL specifically to embedding quality rather than general reasoning outputs, distinguishing it from methods that optimize token-level chain-of-thought or visual reasoning traces.

Among the thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. For the core UME-R1 framework, ten candidates were examined and one appears to constitute overlapping prior work, suggesting some precedent for reasoning-driven embedding generation exists within this limited search scope. For the cold-start supervised fine-tuning dataset, ten candidates were likewise examined with one potential overlap, indicating the dataset construction method may have partial precedent. For the rule-based reinforcement learning contribution, ten candidates were examined with zero refutations, making it appear more distinctive within the examined literature. These statistics reflect a constrained search rather than exhaustive coverage of the field.

Based on the limited thirty-candidate search, the work appears to occupy a sparsely populated research direction with some precedent in reasoning-driven embeddings but potentially novel integration of RL for embedding optimization. The taxonomy structure confirms this sits at the intersection of multiple established areas rather than within a single crowded subfield. The analysis cannot assess whether broader literature beyond the top-thirty semantic matches contains additional overlapping work, particularly in adjacent communities working on multimodal representation learning or reasoning-augmented retrieval systems.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: reasoning-driven generative multimodal embeddings. The field centers on developing models that integrate reasoning processes directly into the generation of multimodal representations, enabling richer semantic understanding across vision, language, and other modalities. The taxonomy reveals a diverse landscape organized around several complementary themes. Multimodal Chain-of-Thought Reasoning Frameworks[2][3] emphasize explicit step-by-step reasoning traces that guide cross-modal understanding, while Generative Multimodal Embedding Architectures focus on how embeddings themselves can be shaped by reasoning signals. Reinforcement Learning for Multimodal Reasoning[40] explores optimization strategies that reward coherent multimodal inference, and Multimodal Knowledge Integration branches incorporate external knowledge graphs[19][24] to ground reasoning. Compositional and Specialized Reasoning Mechanisms address structured problem decomposition and domain-specific challenges, whereas Implicit Cross-Modal Alignment[32] investigates latent representations that capture reasoning without explicit symbolic steps. Application-Specific Reasoning Tasks and Theoretical Foundations round out the taxonomy by situating these methods in concrete domains and broader conceptual frameworks[20][25].

Recent work highlights a tension between explicit reasoning traces and implicit embedding-based approaches. Multimodal CoT[3] and related frameworks produce interpretable intermediate steps, yet can be computationally expensive and sensitive to prompt design. In contrast, reasoning-driven embedding generation—exemplified by Think Then Embed[1] and UME-R1[0]—aims to distill reasoning directly into the embedding space, offering efficiency gains and smoother integration with retrieval or generation pipelines.
UME-R1[0] sits within this Generative Multimodal Embedding Architectures branch, closely aligned with Think Then Embed[1] in its emphasis on embedding-level reasoning signals rather than token-level chain-of-thought. Compared to reinforcement-driven methods like R1-Onevision[4] or knowledge-augmented systems[19], UME-R1[0] prioritizes end-to-end differentiable reasoning within the embedding itself, trading off explicit interpretability for compactness and scalability. This positioning reflects ongoing exploration of where reasoning should occur—whether as visible intermediate outputs or as latent structure encoded in representations.

Claimed Contributions

UME-R1 framework for reasoning-driven generative multimodal embeddings

The authors introduce UME-R1, a framework that enables multimodal embedding models to produce both discriminative and reasoning-driven generative embeddings. The framework uses a two-stage training approach: supervised fine-tuning to equip the model with reasoning capabilities, followed by reinforcement learning to enhance reasoning and optimize generative embedding quality.
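The contrast between the two embedding routes can be sketched with toy stand-ins. Everything below is an illustrative assumption, not the paper's implementation: the function names, the last-token pooling choice, and the placeholder generator/encoder are all hypothetical.

```python
import numpy as np

def toy_generate(query):
    """Stand-in for the MLLM: reason about the query, then write a summary."""
    reasoning = f"The query '{query}' describes a visual scene..."
    summary = query.upper()  # placeholder for the model's distilled summary
    return reasoning, summary

def toy_encode(text):
    """Stand-in encoder: map text to a tiny feature vector."""
    return np.array([len(text), text.count(" ")], dtype=float)

def discriminative_embedding(hidden_states):
    """Conventional route: pool the input's hidden states directly
    (last-token pooling is one common choice)."""
    return hidden_states[-1]

def generative_embedding(query):
    """Generative route: reason first, then embed the generated summary,
    so the embedding reflects the reasoning trace."""
    reasoning, summary = toy_generate(query)
    return toy_encode(summary), reasoning

emb, trace = generative_embedding("a dog catching a frisbee")
```

The point of the sketch is the control flow, not the toy functions: a discriminative embedding is read off the input's representations, while a generative embedding is computed only after the model has produced reasoning and a summary.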

10 retrieved papers
Can Refute
Multimodal embedding cold-start SFT dataset with CoT annotations

The authors construct a supervised fine-tuning dataset by augmenting existing multimodal embedding training data with chain-of-thought reasoning annotations and summaries, enabling models to learn reasoning capabilities alongside embedding generation.
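A minimal sketch of such an augmentation step is shown below, under assumptions: the `<think>`/`<summary>` target format, the field names, and the instruction text are all hypothetical, and in practice the rationale and summary would come from a stronger teacher model rather than be hand-written.

```python
def to_cot_sft_record(sample, rationale, summary):
    """Wrap a plain embedding training pair into a cold-start SFT record.

    `sample` holds an existing (query, positive) pair; `rationale` and
    `summary` are the chain-of-thought annotations added on top.
    """
    # Hypothetical target format: reasoning trace followed by a summary
    # that the model learns to emit before producing its embedding.
    target = f"<think>{rationale}</think>\n<summary>{summary}</summary>"
    return {
        "instruction": "Reason about the query, then summarize and embed it.",
        "query": sample["query"],
        "positive": sample["positive"],
        "target": target,
    }

record = to_cot_sft_record(
    {"query": "a dog catching a frisbee", "positive": "img_001.jpg"},
    rationale="The query describes an action scene involving a dog...",
    summary="Dog catching frisbee outdoors.",
)
```

The design choice worth noting is that the original retrieval supervision (the positive) is preserved unchanged; the CoT annotations are layered on as an additional generation target.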

10 retrieved papers
Can Refute
Rule-based reinforcement learning for multimodal embeddings

The authors develop a novel reward policy for reinforcement learning that considers ranking and similarity gaps simultaneously, enabling effective RL training for embedding tasks, which, unlike mathematical problems, lack a single definitive correct answer.
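One plausible shape for such a reward is sketched below: a binary ranking term (does the positive outrank every negative?) plus a bounded similarity-gap term (margin between the positive score and the hardest negative). This is a hedged illustration of the general idea, not the paper's formulation; `embedding_reward`, the `tanh` squashing, and `margin_scale` are all assumptions.

```python
import numpy as np

def embedding_reward(query_emb, pos_emb, neg_embs, margin_scale=1.0):
    """Hypothetical rule-based reward combining ranking and similarity gap.

    Returns a scalar in roughly [-1, 2]: +1 if the positive outranks all
    negatives (ranking term), plus a squashed similarity-gap term that
    gives a graded signal even when the ranking term saturates.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos_sim = cos(query_emb, pos_emb)
    hardest_neg = max(cos(query_emb, n) for n in neg_embs)

    rank_reward = 1.0 if pos_sim > hardest_neg else 0.0
    # Gap term: signed, bounded margin between positive and hardest negative.
    gap_reward = float(np.tanh(margin_scale * (pos_sim - hardest_neg)))
    return rank_reward + gap_reward
```

Combining the two terms addresses the "no definitive answer" issue: even when two candidate embeddings both rank the positive first, the gap term still prefers the one that separates it more cleanly from the hardest negative.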

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
