UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: multimodal embedding, representation learning, multimodal large language model, reasoning model
Abstract:

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework built on a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces UME-R1, a framework for generating multimodal embeddings through explicit reasoning processes, combining supervised fine-tuning with reinforcement learning. It resides in the 'Reasoning-Driven Embedding Generation' leaf, which contains only two papers including the original work. This sparse population suggests the specific approach of unifying discriminative and generative embeddings via reasoning-augmented generation remains relatively unexplored. The taxonomy shows this leaf sits within the broader 'Generative Multimodal Embedding Architectures' branch, indicating the work addresses a specialized niche within the larger field of reasoning-driven multimodal systems.

The taxonomy reveals neighboring research directions that contextualize this work. The sibling leaf 'Bidirectional Embedding Models' focuses on transforming causal models into bidirectional representations through continual pre-training, differing from UME-R1's reasoning-first approach. Nearby branches include 'Multimodal Chain-of-Thought Reasoning Frameworks' emphasizing explicit step-by-step inference traces, and 'Reinforcement Learning for Multimodal Reasoning' exploring RL optimization strategies. UME-R1 bridges these areas by applying RL specifically to embedding quality rather than general reasoning outputs, distinguishing it from methods that optimize token-level chain-of-thought or visual reasoning traces.

Among the thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. For the core UME-R1 framework, ten candidates were examined and one appears to constitute overlapping prior work, suggesting some precedent for reasoning-driven embedding generation exists within this limited search scope. For the cold-start supervised fine-tuning dataset, ten candidates were likewise examined with one potential overlap, indicating the dataset construction method may have partial precedent. For the rule-based reinforcement learning contribution, ten candidates were examined with zero refutations, making it appear more distinctive within the examined literature. These statistics reflect a constrained search rather than exhaustive coverage of the field.

Based on the limited thirty-candidate search, the work appears to occupy a sparsely populated research direction with some precedent in reasoning-driven embeddings but potentially novel integration of RL for embedding optimization. The taxonomy structure confirms this sits at the intersection of multiple established areas rather than within a single crowded subfield. The analysis cannot assess whether broader literature beyond the top-thirty semantic matches contains additional overlapping work, particularly in adjacent communities working on multimodal representation learning or reasoning-augmented retrieval systems.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: reasoning-driven generative multimodal embeddings. The field centers on developing models that integrate reasoning processes directly into the generation of multimodal representations, enabling richer semantic understanding across vision, language, and other modalities. The taxonomy reveals a diverse landscape organized around several complementary themes. Multimodal Chain-of-Thought Reasoning Frameworks[2][3] emphasize explicit step-by-step reasoning traces that guide cross-modal understanding, while Generative Multimodal Embedding Architectures focus on how embeddings themselves can be shaped by reasoning signals. Reinforcement Learning for Multimodal Reasoning[40] explores optimization strategies that reward coherent multimodal inference, and Multimodal Knowledge Integration branches incorporate external knowledge graphs[19][24] to ground reasoning. Compositional and Specialized Reasoning Mechanisms address structured problem decomposition and domain-specific challenges, whereas Implicit Cross-Modal Alignment[32] investigates latent representations that capture reasoning without explicit symbolic steps. Application-Specific Reasoning Tasks and Theoretical Foundations round out the taxonomy by situating these methods in concrete domains and broader conceptual frameworks[20][25].

Recent work highlights a tension between explicit reasoning traces and implicit embedding-based approaches. Multimodal CoT[3] and related frameworks produce interpretable intermediate steps, yet can be computationally expensive and sensitive to prompt design. In contrast, reasoning-driven embedding generation—exemplified by Think Then Embed[1] and UME-R1[0]—aims to distill reasoning directly into the embedding space, offering efficiency gains and smoother integration with retrieval or generation pipelines.
UME-R1[0] sits within this Generative Multimodal Embedding Architectures branch, closely aligned with Think Then Embed[1] in its emphasis on embedding-level reasoning signals rather than token-level chain-of-thought. Compared to reinforcement-driven methods like R1-Onevision[4] or knowledge-augmented systems[19], UME-R1[0] prioritizes end-to-end differentiable reasoning within the embedding itself, trading off explicit interpretability for compactness and scalability. This positioning reflects ongoing exploration of where reasoning should occur—whether as visible intermediate outputs or as latent structure encoded in representations.

Claimed Contributions

UME-R1 framework for reasoning-driven generative multimodal embeddings

The authors introduce UME-R1, a framework that enables multimodal embedding models to produce both discriminative and reasoning-driven generative embeddings. The framework uses a two-stage training approach: supervised fine-tuning to equip the model with reasoning capabilities, followed by reinforcement learning to enhance reasoning and optimize generative embedding quality.
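The contrast between the two embedding routes can be sketched with toy stand-ins. Everything below is an illustrative assumption, not the paper's implementation: the function names, the last-token pooling choice, and the placeholder generator/encoder are all hypothetical.

```python
import numpy as np

def toy_generate(query):
    """Stand-in for the MLLM: reason about the query, then write a summary."""
    reasoning = f"The query '{query}' describes a visual scene..."
    summary = query.upper()  # placeholder for the model's distilled summary
    return reasoning, summary

def toy_encode(text):
    """Stand-in encoder: map text to a tiny feature vector."""
    return np.array([len(text), text.count(" ")], dtype=float)

def discriminative_embedding(hidden_states):
    """Conventional route: pool the input's hidden states directly
    (last-token pooling is one common choice)."""
    return hidden_states[-1]

def generative_embedding(query):
    """Generative route: reason first, then embed the generated summary,
    so the embedding reflects the reasoning trace."""
    reasoning, summary = toy_generate(query)
    return toy_encode(summary), reasoning

emb, trace = generative_embedding("a dog catching a frisbee")
```

The point of the sketch is the control flow, not the toy functions: a discriminative embedding is read off the input's representations, while a generative embedding is computed only after the model has produced reasoning and a summary.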

10 retrieved papers
Can Refute
Multimodal embedding cold-start SFT dataset with CoT annotations

The authors construct a supervised fine-tuning dataset by augmenting existing multimodal embedding training data with chain-of-thought reasoning annotations and summaries, enabling models to learn reasoning capabilities alongside embedding generation.
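A minimal sketch of such an augmentation step is shown below, under assumptions: the `<think>`/`<summary>` target format, the field names, and the instruction text are all hypothetical, and in practice the rationale and summary would come from a stronger teacher model rather than be hand-written.

```python
def to_cot_sft_record(sample, rationale, summary):
    """Wrap a plain embedding training pair into a cold-start SFT record.

    `sample` holds an existing (query, positive) pair; `rationale` and
    `summary` are the chain-of-thought annotations added on top.
    """
    # Hypothetical target format: reasoning trace followed by a summary
    # that the model learns to emit before producing its embedding.
    target = f"<think>{rationale}</think>\n<summary>{summary}</summary>"
    return {
        "instruction": "Reason about the query, then summarize and embed it.",
        "query": sample["query"],
        "positive": sample["positive"],
        "target": target,
    }

record = to_cot_sft_record(
    {"query": "a dog catching a frisbee", "positive": "img_001.jpg"},
    rationale="The query describes an action scene involving a dog...",
    summary="Dog catching frisbee outdoors.",
)
```

The design choice worth noting is that the original retrieval supervision (the positive) is preserved unchanged; the CoT annotations are layered on as an additional generation target.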

10 retrieved papers
Can Refute
Rule-based reinforcement learning for multimodal embeddings

The authors develop a novel reward policy for reinforcement learning that considers ranking and similarity gaps simultaneously, enabling effective RL training for embedding tasks, which, unlike mathematical problems, lack a single definitive correct answer.
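One plausible shape for such a reward is sketched below: a binary ranking term (does the positive outrank every negative?) plus a bounded similarity-gap term (margin between the positive score and the hardest negative). This is a hedged illustration of the general idea, not the paper's formulation; `embedding_reward`, the `tanh` squashing, and `margin_scale` are all assumptions.

```python
import numpy as np

def embedding_reward(query_emb, pos_emb, neg_embs, margin_scale=1.0):
    """Hypothetical rule-based reward combining ranking and similarity gap.

    Returns a scalar in roughly [-1, 2]: +1 if the positive outranks all
    negatives (ranking term), plus a squashed similarity-gap term that
    gives a graded signal even when the ranking term saturates.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos_sim = cos(query_emb, pos_emb)
    hardest_neg = max(cos(query_emb, n) for n in neg_embs)

    rank_reward = 1.0 if pos_sim > hardest_neg else 0.0
    # Gap term: signed, bounded margin between positive and hardest negative.
    gap_reward = float(np.tanh(margin_scale * (pos_sim - hardest_neg)))
    return rank_reward + gap_reward
```

Combining the two terms addresses the "no definitive answer" issue: even when two candidate embeddings both rank the positive first, the gap term still prefers the one that separates it more cleanly from the hardest negative.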

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
