GRACE: Generative Representation Learning via Contrastive Policy Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Text Representation, Reinforcement Learning
Abstract:

Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black-box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a framework that reimagines contrastive signals not as losses to be minimized but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy π_θ that produces explicit, human-interpretable rationales: structured natural-language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves the overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent decision traces.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GRACE, a framework that trains LLMs as text encoders by treating contrastive signals as rewards for policy gradient optimization rather than direct losses. It resides in the 'Language Model Alignment via Contrastive Rewards' leaf, which contains four papers total (including GRACE). This leaf sits within the broader 'Generative Models with Contrastive Reward Optimization' branch, indicating a moderately populated research direction focused on aligning generative models through contrastive feedback. The taxonomy shows this is an active but not overcrowded area, with sibling leaves addressing vision-language alignment and image generation, suggesting the language-only focus occupies a distinct niche.

The taxonomy reveals neighboring work in vision-language alignment (two papers) and image/flow model alignment (three papers), all sharing the core idea of policy optimization with contrastive rewards but differing in modality. The 'Contrastive Representation Learning for RL' branch (seven papers across three leaves) explores contrastive methods for state representations in control tasks, a conceptually related but application-distinct direction. GRACE's emphasis on interpretable rationales and explicit reasoning distinguishes it from these neighbors, which typically optimize for task performance or preference alignment without generating intermediate explanations. The taxonomy's scope and exclude notes clarify that GRACE's generative-contrastive fusion places it firmly in the alignment category, not pure representation learning.

Among 30 candidates examined, the core GRACE framework (Contribution 1) shows no clear refutation across 10 candidates, suggesting novelty in combining policy optimization with rationale generation for text encoding. The multi-component reward function (Contribution 2) encountered one refutable candidate among 10 examined, indicating some overlap with existing reward design strategies. The unsupervised extension (Contribution 3) found three refutable candidates among 10, pointing to more substantial prior work in adapting contrastive methods to unlabeled settings. These statistics reflect a limited search scope—top-30 semantic matches—so the analysis captures immediate neighbors rather than exhaustive coverage.

Given the search scale, GRACE appears to occupy a relatively novel position within language model alignment, particularly in its emphasis on interpretable rationales as policy outputs. The framework's novelty is strongest in its core mechanism, while its reward design and unsupervised adaptation show more connection to existing techniques. The taxonomy context suggests this work extends a growing but not saturated research direction, with clear boundaries separating it from vision-language and pure RL contrastive methods.

Taxonomy

Core-task Taxonomy Papers: 22
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Generative representation learning via policy optimization with contrastive rewards. This field sits at the intersection of representation learning, generative modeling, and reinforcement learning, exploring how contrastive objectives can guide policy-based optimization to produce meaningful representations. The taxonomy reveals four main branches: one focuses on contrastive representation learning specifically for RL tasks, where methods like CURL[2] and Return Based Contrastive[7] learn state representations that improve sample efficiency and generalization in control problems. A second branch examines generative models that incorporate contrastive reward signals during optimization, often applied to language model alignment and creative generation tasks. A third branch pursues joint generative-contrastive frameworks that simultaneously train generative and discriminative components, while a fourth addresses theoretical underpinnings and conceptual connections across these paradigms. Together, these branches illustrate a shift from purely supervised contrastive learning toward policy-driven approaches that optimize generative processes under contrastive feedback.

Recent work has intensified around language model alignment via contrastive rewards, where policy optimization techniques refine generative outputs by contrasting desirable and undesirable samples. GRACE[0] exemplifies this direction, leveraging contrastive reward structures to guide representation learning in generative settings. Nearby efforts such as PrLM[8] and Robust Storytelling[14] similarly explore how contrastive signals can shape language generation, though they differ in whether they emphasize robustness, coherence, or alignment with human preferences.
Meanwhile, methods like Goal Conditioned Representations[4] and Contrastive Agent Modeling[9] highlight alternative angles within the broader landscape, focusing on goal-driven or agent-centric contrastive learning rather than purely generative alignment. The interplay between generative flexibility and contrastive discrimination remains an open question, with ongoing debates about sample efficiency, scalability, and the trade-offs between exploration and exploitation in policy-based generative learning.

Claimed Contributions

GRACE framework for generative representation learning via contrastive policy optimization

The authors propose GRACE, a framework that reinterprets contrastive learning signals as reward signals for policy gradient optimization rather than traditional loss functions. This allows LLMs to generate explicit, interpretable rationales that are then encoded into embeddings, transforming the model from an opaque encoder into an interpretable agent.
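The core mechanism described above can be sketched in a few lines of illustrative Python. This is a minimal sketch, not the paper's implementation: function names such as `contrastive_reward` and `reinforce_loss` are our own, and the reward shown is a generic InfoNCE-style log-probability that scales a REINFORCE-style policy-gradient term for the sampled rationale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(token_embeddings):
    """Mean-pool a rationale's token embeddings into a single vector."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def contrastive_reward(query_emb, pos_emb, neg_embs, tau=0.05):
    """InfoNCE-style reward: log-probability of the positive among all
    candidates. Always <= 0; closer to 0 when the positive outranks the
    negatives by a wide margin."""
    logits = [cosine(query_emb, pos_emb) / tau]
    logits += [cosine(query_emb, n) / tau for n in neg_embs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[0] - log_z

def reinforce_loss(log_prob_rationale, reward, baseline=0.0):
    """Policy-gradient surrogate: the contrastive signal acts as a reward
    scaling the log-likelihood of the generated rationale, rather than as
    a loss applied directly to the embedding."""
    return -(reward - baseline) * log_prob_rationale
```

In this framing, gradients flow through the rationale's generation probability rather than through the embedding itself, which is what lets the contrastive signal act as a reward instead of a loss.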

10 retrieved papers
Multi-component reward function combining contrastive learning, consistency, and hard negative mining

The authors design a composite reward function that integrates contrastive learning rewards, consistency rewards across multiple interpretations, and hard negative mining. This reward structure guides the policy optimization to produce both high-quality embeddings and coherent rationales.
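A composite reward of this shape might be assembled as below. The weights, the mean-pairwise-cosine consistency term, and the max-similarity hard-negative penalty are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def consistency_reward(rationale_embs):
    """Mean pairwise cosine similarity among embeddings of several
    rationales sampled for the same input; higher means the model's
    interpretations agree with each other."""
    pairs = [(i, j) for i in range(len(rationale_embs))
             for j in range(i + 1, len(rationale_embs))]
    if not pairs:
        return 0.0
    return sum(cosine(rationale_embs[i], rationale_embs[j])
               for i, j in pairs) / len(pairs)

def composite_reward(r_contrastive, r_consistency, hard_neg_sims,
                     w_con=1.0, w_cons=0.5, w_hard=0.5):
    """Weighted combination: reward contrastive alignment and consistency,
    penalize similarity to the hardest (most similar) negative."""
    hard_penalty = max(hard_neg_sims) if hard_neg_sims else 0.0
    return (w_con * r_contrastive
            + w_cons * r_consistency
            - w_hard * hard_penalty)
```

The hard-negative term focuses the penalty on the single most confusable negative, which is one common way to operationalize hard negative mining inside a scalar reward.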

10 retrieved papers
Can Refute
Unsupervised extension adapting the framework to settings without labeled query-document pairs

The authors extend GRACE to an unsupervised learning paradigm inspired by SimCSE, where different interpretations of the same text serve as positive pairs. This enables representation learning from raw text alone without requiring supervised query-document annotations.
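The SimCSE-style positive-pair construction can be sketched as follows, under the stated assumption that two independently sampled interpretations of the same text form a positive pair while the other texts in the batch serve as negatives (the function name `unsupervised_rewards` is ours).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def unsupervised_rewards(embs_a, embs_b, tau=0.05):
    """embs_a[i] and embs_b[i] embed two different sampled interpretations
    of text i. Each text's reward is the InfoNCE log-probability of matching
    its own second interpretation against every text in the batch, so no
    labeled query-document pairs are needed."""
    rewards = []
    for i in range(len(embs_a)):
        logits = [cosine(embs_a[i], embs_b[j]) / tau
                  for j in range(len(embs_b))]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        rewards.append(logits[i] - log_z)
    return rewards
```

This mirrors SimCSE's trick of using two stochastic views of the same input as positives, except the views here are generated interpretations rather than dropout-perturbed encodings.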

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GRACE framework for generative representation learning via contrastive policy optimization

The authors propose GRACE, a framework that reinterprets contrastive learning signals as reward signals for policy gradient optimization rather than traditional loss functions. This allows LLMs to generate explicit, interpretable rationales that are then encoded into embeddings, transforming the model from an opaque encoder into an interpretable agent.

Contribution

Multi-component reward function combining contrastive learning, consistency, and hard negative mining

The authors design a composite reward function that integrates contrastive learning rewards, consistency rewards across multiple interpretations, and hard negative mining. This reward structure guides the policy optimization to produce both high-quality embeddings and coherent rationales.

Contribution

Unsupervised extension adapting the framework to settings without labeled query-document pairs

The authors extend GRACE to an unsupervised learning paradigm inspired by SimCSE, where different interpretations of the same text serve as positive pairs. This enables representation learning from raw text alone without requiring supervised query-document annotations.