Revela: Dense Retriever Learning via Language Modeling

ICLR 2026 Conference Submission
Anonymous Authors
Information Retrieval, Unsupervised Learning
Abstract:

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers?

To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next-token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves the unsupervised state of the art on BEIR with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
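The abstract describes the mechanism only in prose. The following is a minimal, self-contained sketch under stated assumptions (toy embedding table and LM head standing in for the real retriever and language model; all module names are hypothetical, not the authors' implementation) of how retriever-computed similarities could weight cross-document context inside a next-token loss, so that the language-modeling gradient reaches the retriever:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (hypothetical): a shared embedding "encoder" and a tiny LM head.
B, T, V, D = 4, 8, 32, 16                 # batch, seq length, vocab, hidden dim
tokens = torch.randint(0, V, (B, T))      # a batch of document chunks
embed = torch.nn.Embedding(V, D)
lm_head = torch.nn.Linear(D, V)

# 1) Retriever side: one vector per chunk (mean pooling), cosine similarities,
#    and a softmax over the other in-batch chunks as attention weights.
doc_vecs = F.normalize(embed(tokens).mean(dim=1), dim=-1)          # (B, D)
sim = doc_vecs @ doc_vecs.T                                        # (B, B)
sim = sim.masked_fill(torch.eye(B, dtype=torch.bool), float("-inf"))  # no self-attention
w = F.softmax(sim, dim=-1)                                         # (B, B) in-batch weights

# 2) LM side: mix each chunk's local hidden states with similarity-weighted
#    cross-document context, then predict the next token.
h = embed(tokens)                                                  # (B, T, D) local context
cross = torch.einsum("bj,jtd->btd", w, h)                          # weighted neighbor context
logits = lm_head(h[:, :-1] + cross[:, :-1])                        # local + cross-document
loss = F.cross_entropy(logits.reshape(-1, V), tokens[:, 1:].reshape(-1))

# 3) Backprop: the next-token loss flows through `w` into the retriever's
#    parameters, so better similarity estimates directly lower the LM loss.
loss.backward()
```

The design point the sketch illustrates is that no contrastive loss or labeled pair appears anywhere: the only objective is next-token prediction, and the retriever improves because its similarity scores gate how useful the cross-document context is.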

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Revela, a framework that trains dense retrievers through language modeling objectives by conditioning next token prediction on cross-document context weighted by retriever similarity scores. According to the taxonomy tree, this work occupies a singleton leaf node titled 'Self-Supervised Dense Retriever Learning via Language Modeling' with no sibling papers, suggesting it addresses a relatively sparse and specialized research direction. The taxonomy contains 50 papers across 36 topic areas, yet this particular leaf stands alone, indicating that explicitly framing retriever training as chunk-level language modeling with in-batch attention mechanisms represents a less crowded conceptual space within the broader self-supervised retrieval literature.

The taxonomy reveals that Revela's closest neighbors lie in adjacent branches: Self-Supervised Pre-training Architectures explores masked auto-encoder designs like RetroMAE and Condenser models that compress semantics into dense vectors, while Contrastive and Self-Supervised Representation Learning emphasizes metric learning frameworks without architectural novelty. Synthetic Data Generation branches generate pseudo-queries or hypothetical documents to enable training, and Large Language Model Adaptation adapts autoregressive LLMs into dual encoders. Revela diverges from these directions by avoiding synthetic query generation, eschewing explicit contrastive losses, and focusing instead on modeling semantic dependencies through language modeling objectives applied directly to document chunks, positioning it at a conceptual intersection that the taxonomy captures as a distinct leaf.

Among 26 candidates examined across three contributions, the literature search found limited direct overlap. The core Revela framework contribution examined 10 candidates with zero refutable matches, suggesting novelty within the examined scope. The in-batch attention mechanism contribution reviewed 6 candidates, again with no refutations. However, the performance claim contribution examined 10 candidates and identified 1 refutable match, indicating that at least one prior work within this limited sample demonstrates comparable effectiveness without query-document pairs. The search scope—26 papers from semantic search and citation expansion—provides a snapshot rather than exhaustive coverage, meaning these statistics reflect top-K similarity matches rather than comprehensive field-wide analysis.

Based on the limited search scope of 26 candidates, Revela appears to occupy a relatively novel position by explicitly casting retrieval as chunk-level language modeling with similarity-weighted cross-document attention. The singleton taxonomy leaf and low refutation rates across most contributions suggest conceptual distinctiveness, though the performance contribution shows at least one overlapping prior result. The analysis does not cover the full breadth of self-supervised retrieval literature, particularly work published after the search cutoff or in non-indexed venues, leaving open questions about broader field coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: self-supervised dense retriever learning via language modeling. This field centers on training neural retrievers without manually labeled query-document pairs by leveraging language model objectives and self-generated signals. The taxonomy reveals a rich landscape organized around several complementary directions. Self-Supervised Pre-training Architectures explore masked language modeling variants and encoder-decoder designs (e.g., RetroMAE[7], Corpus Aware Pretraining[6]) that shape retrieval-oriented representations. Contrastive and Self-Supervised Representation Learning emphasizes contrastive objectives and metric learning frameworks (e.g., Contrastive BERT[12], Deep Metric Learning[32]) to pull relevant items closer in embedding space. Synthetic Data Generation and Query Augmentation tackle the scarcity of labeled data by creating pseudo-queries or augmented examples (e.g., Self-Generated Queries[16], Questions Are All[18]). Large Language Model Adaptation branches adapt decoder-only or instruction-tuned LLMs for dense retrieval (e.g., Llama2Vec[17], Instruction-Tuning[14]), while Domain Adaptation and Transfer Learning address cross-lingual and specialized-corpus challenges (e.g., Cross-lingual Dense[19], Legal Case Retrieval[9]). Hybrid and Enhanced Retrieval Strategies combine lexical and neural signals (e.g., Fusion Techniques[38]), and Specialized Retrieval Applications target domains like geospatial data or histology (e.g., Embedding Earth[45], Slide-Level Histology[47]).

A particularly active line of work focuses on designing effective self-supervised objectives that align language model pre-training with retrieval goals, balancing reconstruction fidelity and discriminative power. Some studies emphasize retrieval-oriented masking strategies (e.g., Retrieval Oriented Masking[50], ConTextual MAE[43]) to guide the model toward salient query-document features, while others explore pseudo-relevance feedback loops (e.g., Pseudo-Relevance Labeling[29], Meticulous Pseudo-Relevance[36]) to iteratively refine retrieval quality.

Revela[0] sits within the core Self-Supervised Dense Retriever Learning via Language Modeling branch, directly addressing how language modeling can bootstrap retrieval without external supervision. Its emphasis on leveraging LM signals aligns it closely with works like REALM[1] and Designing Accurate Retrieval[3], which also integrate language model objectives into retrieval pipelines, though Revela[0] appears to focus more explicitly on self-supervised mechanisms than on the hybrid or instruction-based adaptations seen in neighboring efforts like Instruction-Tuning[14] or LLM Dense Foundation[46].

Claimed Contributions

Revela framework for self-supervised retriever learning via language modeling

Revela is a novel framework that trains dense retrievers through language modeling by conditioning next token prediction on both local context and cross-document context via an in-batch attention mechanism. The retriever is optimized jointly with the language model without requiring annotated or synthetic query-document pairs.

10 retrieved papers
In-batch attention mechanism weighted by retriever similarity scores

The framework introduces an in-batch attention mechanism where next-token prediction is conditioned on both the input sequence and other sequences within the same batch. The attention weights are determined by retriever-computed similarity scores, enabling joint optimization of the retriever and language model.
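One plausible formalization of this description, using illustrative symbols since the report does not reproduce Revela's actual equations (the encoder $E_\theta$, similarity function $\mathrm{sim}$, and context representations $c_j$ are assumptions, not quoted from the paper):

```latex
% Similarity-weighted in-batch attention over the other documents in a batch:
\[
  w_{ij} \;=\; \frac{\exp\!\big(\mathrm{sim}(E_\theta(d_i),\, E_\theta(d_j))\big)}
                    {\sum_{k \neq i} \exp\!\big(\mathrm{sim}(E_\theta(d_i),\, E_\theta(d_k))\big)},
  \qquad j \neq i.
\]
% Next-token prediction conditioned on local context plus the
% similarity-weighted cross-document context:
\[
  \mathcal{L}(\theta, \phi) \;=\;
  -\sum_{i}\sum_{t}
    \log p_\phi\!\Big(x^{(i)}_{t} \,\Big|\, x^{(i)}_{<t},\;
      \textstyle\sum_{j \neq i} w_{ij}\, c_j \Big).
\]
% Because w_{ij} depends on the retriever parameters theta, minimizing the
% language-modeling loss L jointly optimizes retriever and language model.
```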

6 retrieved papers
Superior performance without query-document pairs across multiple benchmarks

Revela achieves state-of-the-art results on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks without using annotated or synthetic query-document pairs. It outperforms larger supervised models and proprietary APIs while using significantly less training data and compute resources.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Revela framework for self-supervised retriever learning via language modeling

Revela is a novel framework that trains dense retrievers through language modeling by conditioning next token prediction on both local context and cross-document context via an in-batch attention mechanism. The retriever is optimized jointly with the language model without requiring annotated or synthetic query-document pairs.

Contribution

In-batch attention mechanism weighted by retriever similarity scores

The framework introduces an in-batch attention mechanism where next-token prediction is conditioned on both the input sequence and other sequences within the same batch. The attention weights are determined by retriever-computed similarity scores, enabling joint optimization of the retriever and language model.

Contribution

Superior performance without query-document pairs across multiple benchmarks

Revela achieves state-of-the-art results on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks without using annotated or synthetic query-document pairs. It outperforms larger supervised models and proprietary APIs while using significantly less training data and compute resources.