OSCAR: Online Soft Compression for RAG

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: RAG, Compression, Embedding, Efficiency, Question Answering
Abstract:

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, leading to improved accuracy and relevance. However, scaling RAG pipelines remains computationally expensive as context length grows. On one hand, hard compression methods have recently been proposed to prune the retrieved text on the fly, but with a limited compression ratio. On the other hand, soft compression methods achieve higher compression rates, but perform a costly offline compression with a dedicated LLM. In this paper, we introduce OSCAR, a novel query-dependent online soft compression method for RAG. OSCAR bridges the gap between online hard and offline soft compression methods, bringing the best of both: it dynamically compresses retrieved information at inference time, eliminating storage overhead and enabling higher compression rates than existing methods. Our experiments demonstrate state-of-the-art performance with a 2-5x inference speed-up and minimal, if any, accuracy loss for LLMs ranging from 1B to 24B parameters.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OSCAR, a query-dependent online soft compression method for RAG that dynamically compresses retrieved documents at inference time using continuous embeddings. Within the taxonomy, OSCAR resides in the 'Online Compression with Reranking Integration' leaf under 'Query-Dependent Online Soft Compression Methods', sharing this leaf with only one sibling paper. This positioning indicates a relatively sparse research direction focused specifically on combining dynamic soft compression with reranking mechanisms, distinguishing it from broader compression approaches that lack explicit reranking components or operate offline.

The taxonomy reveals that OSCAR's immediate neighbors include 'Pretraining-Free Compression Architectures' within the same parent branch, which emphasizes lightweight compression without extensive pretraining. Adjacent branches address 'Hybrid Compression with Selective Retrieval' (combining compression with adaptive document selection) and 'Task-Aware Dynamic Compression for Long Contexts' (optimizing compression based on downstream task requirements). OSCAR's focus on query-dependent online soft compression with reranking integration positions it at the intersection of efficiency and relevance optimization, diverging from purely efficiency-driven methods or those requiring offline preprocessing.

Among the 30 candidates examined, the contribution-level analysis shows mixed novelty signals. For the core OSCAR method (Contribution 1), 10 candidates were examined with zero refutations, suggesting relative novelty in its specific approach. However, for the two architectural contributions, efficient compressor architectures (Contribution 2) and simultaneous compression with reranking (Contribution 3), one refutable candidate was found among the 10 examined for each. This indicates that while the overall OSCAR framework appears novel within the limited search scope, specific architectural choices and the compression-reranking integration concept have some overlap with existing work in the examined literature.

Based on the top-30 semantic matches and taxonomy structure, OSCAR appears to occupy a moderately novel position, particularly in its query-dependent online soft compression approach. The limited search scope and sparse taxonomy leaf suggest the work addresses an emerging research direction, though certain architectural and integration aspects show partial overlap with prior methods. A more exhaustive literature review would be needed to definitively assess novelty across the broader RAG compression landscape.

Taxonomy

Core-task taxonomy papers: 13
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: online soft compression for retrieval-augmented generation. The field addresses the challenge of efficiently integrating retrieved documents into language models by compressing context on-the-fly rather than relying solely on hard filtering or static summarization. The taxonomy reveals several complementary directions: Query-Dependent Online Soft Compression Methods focus on tailoring compression to each query's needs, often leveraging learned representations to distill retrieved passages; Hybrid Compression with Selective Retrieval combines compression with intelligent document selection to balance coverage and efficiency; Task-Aware Dynamic Compression for Long Contexts adapts compression strategies based on downstream task requirements and context length constraints; Robustness and Interpretability in Compressed RAG examines how compression affects model reliability and explainability; Specialized Compression for Non-Text Retrieval extends these ideas to multimodal or structured data; and Domain-Specific Adaptive Retrieval Systems tailor compression and retrieval jointly to particular application areas.

Representative works like PISCO[1] and SARA[2] illustrate query-dependent approaches, while DynamicKV[3] exemplifies task-aware dynamic methods. A central tension across branches involves the trade-off between aggressive compression for efficiency and preserving sufficient signal for accurate generation, with many studies exploring learned compression modules that can be fine-tuned for specific retrieval pipelines.

OSCAR[0] sits within the Query-Dependent Online Soft Compression Methods branch, specifically in the Online Compression with Reranking Integration cluster alongside OSCAR Reranking[4]. This positioning reflects its emphasis on integrating reranking signals directly into the compression process, allowing the model to prioritize salient content based on both query relevance and retrieval confidence.
Compared to simpler compression schemes like Simple Context Compression[6], OSCAR[0] and its neighbor OSCAR Reranking[4] leverage richer reranking feedback to guide soft compression decisions. This contrasts with works in the Robustness and Interpretability branch, such as Explainable RAG[5], which prioritize transparency over compression efficiency, highlighting ongoing questions about how to balance compactness, fidelity, and interpretability in retrieval-augmented systems.

Claimed Contributions

OSCAR: Online Soft Compression Method for RAG

The authors propose OSCAR, which dynamically compresses retrieved documents into query-optimized representations for efficient answer generation. OSCAR bridges the gap between online hard compression and offline soft compression methods, achieving 2-5× inference speed-up with minimal accuracy loss.
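The core idea of soft compression, condensing a retrieved document's token states into a handful of continuous "memory" embeddings that the generator consumes in place of the full text, can be illustrated with a toy cross-attention pooling sketch. Shapes, the attention-pooling mechanism, and all variable names here are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, k = 64, 200, 8          # hidden size, document length, compressed length

doc_states = rng.normal(size=(L, d))    # contextual token states of one retrieved document
mem_queries = rng.normal(size=(k, d))   # learned compression queries, one per memory slot

# Cross-attention pooling: each memory slot attends over the document tokens
# and aggregates them into a single continuous embedding.
scores = mem_queries @ doc_states.T / np.sqrt(d)          # (k, L)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)             # row-wise softmax
memory = weights @ doc_states                             # (k, d) compressed representation

compression_rate = L / k
print(memory.shape)        # (8, 64): 200 token states reduced to 8 vectors
print(compression_rate)    # 25.0
```

The generator then sees k vectors per document instead of L, which is where the reported inference speed-up comes from; doing this at query time (online) is what removes the storage overhead of offline soft compression.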

10 retrieved papers
Two Efficient Compressor Architectures

The authors design two compressor variants: OSCAR-N-Layers uses the first N layers of the pretrained generator backbone, while OSCAR-llama employs a smaller 1B parameter LLM with alignment layers. These architectures enable fast online compression while maintaining generation quality.
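The two variants differ only in where the compressor's forward pass comes from: a truncated copy of the generator itself, or a separate small LM whose output is projected into the generator's hidden space. A minimal sketch of that structural difference, with stand-in layers and hypothetical class names (`NLayerCompressor`, `SmallLMCompressor` are not the paper's identifiers):

```python
import numpy as np

class NLayerCompressor:
    """OSCAR-N-Layers style: reuse only the first n layers of the generator."""
    def __init__(self, generator_layers, n):
        self.layers = generator_layers[:n]   # truncated backbone
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

class SmallLMCompressor:
    """OSCAR-llama style: a smaller LM plus a linear alignment layer that
    maps its hidden size into the generator's hidden space."""
    def __init__(self, small_lm, d_small, d_gen, rng):
        self.small_lm = small_lm
        self.align = rng.normal(size=(d_small, d_gen)) / np.sqrt(d_small)
    def __call__(self, x):
        return self.small_lm(x) @ self.align

rng = np.random.default_rng(0)

def make_layer(d):
    # Stand-in transformer layer: a residual linear map.
    w = rng.normal(size=(d, d)) / np.sqrt(d)
    return lambda x: x + x @ w

gen_layers = [make_layer(64) for _ in range(32)]       # toy 32-layer generator
shallow = NLayerCompressor(gen_layers, n=5)            # runs 5/32 of the depth
small = SmallLMCompressor(make_layer(32), 32, 64, rng) # smaller width, aligned to 64

doc = rng.normal(size=(10, 64))
print(shallow(doc).shape)                              # (10, 64)
print(small(rng.normal(size=(10, 32))).shape)          # (10, 64)
```

Both routes produce states in the generator's hidden dimension while running far less compute than a full generator forward pass, which is what makes online compression affordable.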

10 retrieved papers
Can Refute
Simultaneous Compression and Reranking

The authors extend OSCAR to perform both document compression and reranking in a single forward pass by adding a reranking token and training objective. This makes compression essentially free in standard RAG pipelines that already include reranking.
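Conceptually, adding a reranking token means one extra slot is processed alongside the memory slots, and a small head maps that slot's state to a relevance score, so compression and reranking share a single forward pass. The sketch below assumes the same toy attention-pooling compressor as above; the `[RANK]` slot, the linear scoring head, and all names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8

def compress_and_rank(doc_states, mem_queries, rank_query, w_rank):
    # One forward pass over k memory slots plus one [RANK] slot.
    queries = np.vstack([mem_queries, rank_query])            # (k+1, d)
    scores = queries @ doc_states.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    states = weights @ doc_states                             # (k+1, d)
    memory, rank_state = states[:k], states[k]
    relevance = float(rank_state @ w_rank)                    # scalar rerank score
    return memory, relevance

mem_queries = rng.normal(size=(k, d))
rank_query = rng.normal(size=(d,))       # the extra [RANK] query
w_rank = rng.normal(size=(d,))           # linear relevance head

docs = [rng.normal(size=(50, d)) for _ in range(3)]           # 3 retrieved documents
results = [compress_and_rank(x, mem_queries, rank_query, w_rank) for x in docs]
order = sorted(range(3), key=lambda i: -results[i][1])        # rerank by score
print([mem.shape for mem, _ in results], order)
```

Because each document yields both its compressed memory and its relevance score in the same pass, a pipeline that already pays for reranking gets the compressed representations at no extra forward-pass cost, which is the sense in which compression becomes "essentially free".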

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

OSCAR: Online Soft Compression Method for RAG

The authors propose OSCAR, which dynamically compresses retrieved documents into query-optimized representations for efficient answer generation. OSCAR bridges the gap between online hard compression and offline soft compression methods, achieving 2-5× inference speed-up with minimal accuracy loss.

Contribution

Two Efficient Compressor Architectures

The authors design two compressor variants: OSCAR-N-Layers uses the first N layers of the pretrained generator backbone, while OSCAR-llama employs a smaller 1B parameter LLM with alignment layers. These architectures enable fast online compression while maintaining generation quality.

Contribution

Simultaneous Compression and Reranking

The authors extend OSCAR to perform both document compression and reranking in a single forward pass by adding a reranking token and training objective. This makes compression essentially free in standard RAG pipelines that already include reranking.