Revela: Dense Retriever Learning via Language Modeling

ICLR 2026 Conference Submission
Anonymous Authors
Information Retrieval, Unsupervised Learning
Abstract:

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next-token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers?

To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next-token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves the unsupervised state of the art on BEIR with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
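The abstract describes the mechanism only in prose. The following is a minimal, self-contained sketch under stated assumptions (toy embedding table and LM head standing in for the real retriever and language model; all module names are hypothetical, not the authors' implementation) of how retriever-computed similarities could weight cross-document context inside a next-token loss, so that the language-modeling gradient reaches the retriever:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins (hypothetical): a shared embedding "encoder" and a tiny LM head.
B, T, V, D = 4, 8, 32, 16                 # batch, seq length, vocab, hidden dim
tokens = torch.randint(0, V, (B, T))      # a batch of document chunks
embed = torch.nn.Embedding(V, D)
lm_head = torch.nn.Linear(D, V)

# 1) Retriever side: one vector per chunk (mean pooling), cosine similarities,
#    and a softmax over the other in-batch chunks as attention weights.
doc_vecs = F.normalize(embed(tokens).mean(dim=1), dim=-1)          # (B, D)
sim = doc_vecs @ doc_vecs.T                                        # (B, B)
sim = sim.masked_fill(torch.eye(B, dtype=torch.bool), float("-inf"))  # no self-attention
w = F.softmax(sim, dim=-1)                                         # (B, B) in-batch weights

# 2) LM side: mix each chunk's local hidden states with similarity-weighted
#    cross-document context, then predict the next token.
h = embed(tokens)                                                  # (B, T, D) local context
cross = torch.einsum("bj,jtd->btd", w, h)                          # weighted neighbor context
logits = lm_head(h[:, :-1] + cross[:, :-1])                        # local + cross-document
loss = F.cross_entropy(logits.reshape(-1, V), tokens[:, 1:].reshape(-1))

# 3) Backprop: the next-token loss flows through `w` into the retriever's
#    parameters, so better similarity estimates directly lower the LM loss.
loss.backward()
```

The design point the sketch illustrates is that no contrastive loss or labeled pair appears anywhere: the only objective is next-token prediction, and the retriever improves because its similarity scores gate how useful the cross-document context is.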

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Revela, a framework that trains dense retrievers through language modeling objectives by conditioning next token prediction on cross-document context weighted by retriever similarity scores. According to the taxonomy tree, this work occupies a singleton leaf node titled 'Self-Supervised Dense Retriever Learning via Language Modeling' with no sibling papers, suggesting it addresses a relatively sparse and specialized research direction. The taxonomy contains 50 papers across 36 topic areas, yet this particular leaf stands alone, indicating that explicitly framing retriever training as chunk-level language modeling with in-batch attention mechanisms represents a less crowded conceptual space within the broader self-supervised retrieval literature.

The taxonomy reveals that Revela's closest neighbors lie in adjacent branches: Self-Supervised Pre-training Architectures explores masked auto-encoder designs like RetroMAE and Condenser models that compress semantics into dense vectors, while Contrastive and Self-Supervised Representation Learning emphasizes metric learning frameworks without architectural novelty. Synthetic Data Generation branches generate pseudo-queries or hypothetical documents to enable training, and Large Language Model Adaptation adapts autoregressive LLMs into dual encoders. Revela diverges from these directions by avoiding synthetic query generation, eschewing explicit contrastive losses, and focusing instead on modeling semantic dependencies through language modeling objectives applied directly to document chunks, positioning it at a conceptual intersection that the taxonomy captures as a distinct leaf.

Among 26 candidates examined across three contributions, the literature search found limited direct overlap. The core Revela framework contribution examined 10 candidates with zero refutable matches, suggesting novelty within the examined scope. The in-batch attention mechanism contribution reviewed 6 candidates, again with no refutations. However, the performance claim contribution examined 10 candidates and identified 1 refutable match, indicating that at least one prior work within this limited sample demonstrates comparable effectiveness without query-document pairs. The search scope—26 papers from semantic search and citation expansion—provides a snapshot rather than exhaustive coverage, meaning these statistics reflect top-K similarity matches rather than comprehensive field-wide analysis.

Based on the limited search scope of 26 candidates, Revela appears to occupy a relatively novel position by explicitly casting retrieval as chunk-level language modeling with similarity-weighted cross-document attention. The singleton taxonomy leaf and low refutation rates across most contributions suggest conceptual distinctiveness, though the performance contribution shows at least one overlapping prior result. The analysis does not cover the full breadth of self-supervised retrieval literature, particularly work published after the search cutoff or in non-indexed venues, leaving open questions about broader field coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 1

Research Landscape Overview

Core task: self-supervised dense retriever learning via language modeling. This field centers on training neural retrievers without manually labeled query-document pairs by leveraging language model objectives and self-generated signals. The taxonomy reveals a rich landscape organized around several complementary directions. Self-Supervised Pre-training Architectures explore masked language modeling variants and encoder-decoder designs (e.g., RetroMAE[7], Corpus Aware Pretraining[6]) that shape retrieval-oriented representations. Contrastive and Self-Supervised Representation Learning emphasizes contrastive objectives and metric learning frameworks (e.g., Contrastive BERT[12], Deep Metric Learning[32]) to pull relevant items closer in embedding space. Synthetic Data Generation and Query Augmentation tackle the scarcity of labeled data by creating pseudo-queries or augmented examples (e.g., Self-Generated Queries[16], Questions Are All[18]). Large Language Model Adaptation branches adapt decoder-only or instruction-tuned LLMs for dense retrieval (e.g., Llama2Vec[17], Instruction-Tuning[14]), while Domain Adaptation and Transfer Learning address cross-lingual and specialized-corpus challenges (e.g., Cross-lingual Dense[19], Legal Case Retrieval[9]). Hybrid and Enhanced Retrieval Strategies combine lexical and neural signals (e.g., Fusion Techniques[38]), and Specialized Retrieval Applications target domains like geospatial data or histology (e.g., Embedding Earth[45], Slide-Level Histology[47]).

A particularly active line of work focuses on designing effective self-supervised objectives that align language model pre-training with retrieval goals, balancing reconstruction fidelity and discriminative power. Some studies emphasize retrieval-oriented masking strategies (e.g., Retrieval Oriented Masking[50], ConTextual MAE[43]) to guide the model toward salient query-document features, while others explore pseudo-relevance feedback loops (e.g., Pseudo-Relevance Labeling[29], Meticulous Pseudo-Relevance[36]) to iteratively refine retrieval quality.

Revela[0] sits within the core Self-Supervised Dense Retriever Learning via Language Modeling branch, directly addressing how language modeling can bootstrap retrieval without external supervision. Its emphasis on leveraging LM signals aligns it closely with works like REALM[1] and Designing Accurate Retrieval[3], which also integrate language model objectives into retrieval pipelines, though Revela[0] appears to focus more explicitly on self-supervised mechanisms than on the hybrid or instruction-based adaptations seen in neighboring efforts like Instruction-Tuning[14] or LLM Dense Foundation[46].

Claimed Contributions

Revela framework for self-supervised retriever learning via language modeling

Revela is a novel framework that trains dense retrievers through language modeling by conditioning next token prediction on both local context and cross-document context via an in-batch attention mechanism. The retriever is optimized jointly with the language model without requiring annotated or synthetic query-document pairs.

10 retrieved papers
In-batch attention mechanism weighted by retriever similarity scores

The framework introduces an in-batch attention mechanism where next-token prediction is conditioned on both the input sequence and other sequences within the same batch. The attention weights are determined by retriever-computed similarity scores, enabling joint optimization of the retriever and language model.
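One plausible formalization of this description, using illustrative symbols since the report does not reproduce Revela's actual equations (the encoder $E_\theta$, similarity function $\mathrm{sim}$, and context representations $c_j$ are assumptions, not quoted from the paper):

```latex
% Similarity-weighted in-batch attention over the other documents in a batch:
\[
  w_{ij} \;=\; \frac{\exp\!\big(\mathrm{sim}(E_\theta(d_i),\, E_\theta(d_j))\big)}
                    {\sum_{k \neq i} \exp\!\big(\mathrm{sim}(E_\theta(d_i),\, E_\theta(d_k))\big)},
  \qquad j \neq i.
\]
% Next-token prediction conditioned on local context plus the
% similarity-weighted cross-document context:
\[
  \mathcal{L}(\theta, \phi) \;=\;
  -\sum_{i}\sum_{t}
    \log p_\phi\!\Big(x^{(i)}_{t} \,\Big|\, x^{(i)}_{<t},\;
      \textstyle\sum_{j \neq i} w_{ij}\, c_j \Big).
\]
% Because w_{ij} depends on the retriever parameters theta, minimizing the
% language-modeling loss L jointly optimizes retriever and language model.
```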

6 retrieved papers
Superior performance without query-document pairs across multiple benchmarks

Revela achieves state-of-the-art results on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks without using annotated or synthetic query-document pairs. It outperforms larger supervised models and proprietary APIs while using significantly less training data and compute resources.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Revela framework for self-supervised retriever learning via language modeling

Revela is a novel framework that trains dense retrievers through language modeling by conditioning next token prediction on both local context and cross-document context via an in-batch attention mechanism. The retriever is optimized jointly with the language model without requiring annotated or synthetic query-document pairs.

Contribution

In-batch attention mechanism weighted by retriever similarity scores

The framework introduces an in-batch attention mechanism where next-token prediction is conditioned on both the input sequence and other sequences within the same batch. The attention weights are determined by retriever-computed similarity scores, enabling joint optimization of the retriever and language model.

Contribution

Superior performance without query-document pairs across multiple benchmarks

Revela achieves state-of-the-art results on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks without using annotated or synthetic query-document pairs. It outperforms larger supervised models and proprietary APIs while using significantly less training data and compute resources.