GRACE: Generative Representation Learning via Contrastive Policy Optimization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Models, Text Representation, Reinforcement Learning
Abstract:

Prevailing methods for training Large Language Models (LLMs) as text encoders rely on contrastive losses that treat the model as a black-box function, discarding its generative and reasoning capabilities in favor of static embeddings. We introduce GRACE (Generative Representation Learning via Contrastive Policy Optimization), a framework that reimagines contrastive signals not as losses to be minimized but as rewards that guide a generative policy. In GRACE, the LLM acts as a policy π_θ that produces explicit, human-interpretable rationales: structured natural-language explanations of its semantic understanding. These rationales are then encoded into high-quality embeddings via mean pooling. Using policy gradient optimization, we train the model with a multi-component reward function that maximizes similarity between query-positive pairs and minimizes similarity with negatives. This transforms the LLM from an opaque encoder into an interpretable agent whose reasoning process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad cross-category gains: averaged over four backbones, the supervised setting improves the overall score by 11.5% over base models, and the unsupervised variant adds 6.9%, while preserving general capabilities. This work treats contrastive objectives as rewards over rationales, unifying representation learning with generation to produce stronger embeddings and transparent decision traces.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces GRACE, a framework that trains LLMs as text encoders by treating contrastive signals as rewards for policy gradient optimization rather than direct losses. It resides in the 'Language Model Alignment via Contrastive Rewards' leaf, which contains four papers total (including GRACE). This leaf sits within the broader 'Generative Models with Contrastive Reward Optimization' branch, indicating a moderately populated research direction focused on aligning generative models through contrastive feedback. The taxonomy shows this is an active but not overcrowded area, with sibling leaves addressing vision-language alignment and image generation, suggesting the language-only focus occupies a distinct niche.

The taxonomy reveals neighboring work in vision-language alignment (two papers) and image/flow model alignment (three papers), all sharing the core idea of policy optimization with contrastive rewards but differing in modality. The 'Contrastive Representation Learning for RL' branch (seven papers across three leaves) explores contrastive methods for state representations in control tasks, a conceptually related but application-distinct direction. GRACE's emphasis on interpretable rationales and explicit reasoning distinguishes it from these neighbors, which typically optimize for task performance or preference alignment without generating intermediate explanations. The taxonomy's scope and exclude notes clarify that GRACE's generative-contrastive fusion places it firmly in the alignment category, not pure representation learning.

Among 30 candidates examined, the core GRACE framework (Contribution 1) shows no clear refutation across 10 candidates, suggesting novelty in combining policy optimization with rationale generation for text encoding. The multi-component reward function (Contribution 2) encountered one refutable candidate among 10 examined, indicating some overlap with existing reward design strategies. The unsupervised extension (Contribution 3) found three refutable candidates among 10, pointing to more substantial prior work in adapting contrastive methods to unlabeled settings. These statistics reflect a limited search scope—top-30 semantic matches—so the analysis captures immediate neighbors rather than exhaustive coverage.

Given the search scale, GRACE appears to occupy a relatively novel position within language model alignment, particularly in its emphasis on interpretable rationales as policy outputs. The framework's novelty is strongest in its core mechanism, while its reward design and unsupervised adaptation show more connection to existing techniques. The taxonomy context suggests this work extends a growing but not saturated research direction, with clear boundaries separating it from vision-language and pure RL contrastive methods.

Taxonomy

Core-task Taxonomy Papers: 22
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Generative representation learning via policy optimization with contrastive rewards. This field sits at the intersection of representation learning, generative modeling, and reinforcement learning, exploring how contrastive objectives can guide policy-based optimization to produce meaningful representations. The taxonomy reveals four main branches: one focuses on contrastive representation learning specifically for RL tasks, where methods like CURL[2] and Return Based Contrastive[7] learn state representations that improve sample efficiency and generalization in control problems. A second branch examines generative models that incorporate contrastive reward signals during optimization, often applied to language model alignment and creative generation tasks. A third branch pursues joint generative-contrastive frameworks that simultaneously train generative and discriminative components, while a fourth addresses theoretical underpinnings and conceptual connections across these paradigms. Together, these branches illustrate a shift from purely supervised contrastive learning toward policy-driven approaches that optimize generative processes under contrastive feedback.

Recent work has intensified around language model alignment via contrastive rewards, where policy optimization techniques refine generative outputs by contrasting desirable and undesirable samples. GRACE[0] exemplifies this direction, leveraging contrastive reward structures to guide representation learning in generative settings. Nearby efforts such as PrLM[8] and Robust Storytelling[14] similarly explore how contrastive signals can shape language generation, though they differ in whether they emphasize robustness, coherence, or alignment with human preferences.
Meanwhile, methods like Goal Conditioned Representations[4] and Contrastive Agent Modeling[9] highlight alternative angles within the broader landscape, focusing on goal-driven or agent-centric contrastive learning rather than purely generative alignment. The interplay between generative flexibility and contrastive discrimination remains an open question, with ongoing debates about sample efficiency, scalability, and the trade-offs between exploration and exploitation in policy-based generative learning.

Claimed Contributions

GRACE framework for generative representation learning via contrastive policy optimization

The authors propose GRACE, a framework that reinterprets contrastive learning signals as reward signals for policy gradient optimization rather than traditional loss functions. This allows LLMs to generate explicit, interpretable rationales that are then encoded into embeddings, transforming the model from an opaque encoder into an interpretable agent.
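The core mechanism described above can be sketched in a few lines of illustrative Python. This is a minimal sketch, not the paper's implementation: function names such as `contrastive_reward` and `reinforce_loss` are our own, and the reward shown is a generic InfoNCE-style log-probability that scales a REINFORCE-style policy-gradient term for the sampled rationale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(token_embeddings):
    """Mean-pool a rationale's token embeddings into a single vector."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def contrastive_reward(query_emb, pos_emb, neg_embs, tau=0.05):
    """InfoNCE-style reward: log-probability of the positive among all
    candidates. Always <= 0; closer to 0 when the positive outranks the
    negatives by a wide margin."""
    logits = [cosine(query_emb, pos_emb) / tau]
    logits += [cosine(query_emb, n) / tau for n in neg_embs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[0] - log_z

def reinforce_loss(log_prob_rationale, reward, baseline=0.0):
    """Policy-gradient surrogate: the contrastive signal acts as a reward
    scaling the log-likelihood of the generated rationale, rather than as
    a loss applied directly to the embedding."""
    return -(reward - baseline) * log_prob_rationale
```

In this framing, gradients flow through the rationale's generation probability rather than through the embedding itself, which is what lets the contrastive signal act as a reward instead of a loss.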

10 retrieved papers
Multi-component reward function combining contrastive learning, consistency, and hard negative mining

The authors design a composite reward function that integrates contrastive learning rewards, consistency rewards across multiple interpretations, and hard negative mining. This reward structure guides the policy optimization to produce both high-quality embeddings and coherent rationales.
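A composite reward of this shape might be assembled as below. The weights, the mean-pairwise-cosine consistency term, and the max-similarity hard-negative penalty are illustrative assumptions, not the paper's exact formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def consistency_reward(rationale_embs):
    """Mean pairwise cosine similarity among embeddings of several
    rationales sampled for the same input; higher means the model's
    interpretations agree with each other."""
    pairs = [(i, j) for i in range(len(rationale_embs))
             for j in range(i + 1, len(rationale_embs))]
    if not pairs:
        return 0.0
    return sum(cosine(rationale_embs[i], rationale_embs[j])
               for i, j in pairs) / len(pairs)

def composite_reward(r_contrastive, r_consistency, hard_neg_sims,
                     w_con=1.0, w_cons=0.5, w_hard=0.5):
    """Weighted combination: reward contrastive alignment and consistency,
    penalize similarity to the hardest (most similar) negative."""
    hard_penalty = max(hard_neg_sims) if hard_neg_sims else 0.0
    return (w_con * r_contrastive
            + w_cons * r_consistency
            - w_hard * hard_penalty)
```

The hard-negative term focuses the penalty on the single most confusable negative, which is one common way to operationalize hard negative mining inside a scalar reward.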

10 retrieved papers
Can Refute
Unsupervised extension adapting the framework to settings without labeled query-document pairs

The authors extend GRACE to an unsupervised learning paradigm inspired by SimCSE, where different interpretations of the same text serve as positive pairs. This enables representation learning from raw text alone without requiring supervised query-document annotations.
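The SimCSE-style positive-pair construction can be sketched as follows, under the stated assumption that two independently sampled interpretations of the same text form a positive pair while the other texts in the batch serve as negatives (the function name `unsupervised_rewards` is ours).

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def unsupervised_rewards(embs_a, embs_b, tau=0.05):
    """embs_a[i] and embs_b[i] embed two different sampled interpretations
    of text i. Each text's reward is the InfoNCE log-probability of matching
    its own second interpretation against every text in the batch, so no
    labeled query-document pairs are needed."""
    rewards = []
    for i in range(len(embs_a)):
        logits = [cosine(embs_a[i], embs_b[j]) / tau
                  for j in range(len(embs_b))]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        rewards.append(logits[i] - log_z)
    return rewards
```

This mirrors SimCSE's trick of using two stochastic views of the same input as positives, except the views here are generated interpretations rather than dropout-perturbed encodings.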

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GRACE framework for generative representation learning via contrastive policy optimization

The authors propose GRACE, a framework that reinterprets contrastive learning signals as reward signals for policy gradient optimization rather than traditional loss functions. This allows LLMs to generate explicit, interpretable rationales that are then encoded into embeddings, transforming the model from an opaque encoder into an interpretable agent.

Contribution

Multi-component reward function combining contrastive learning, consistency, and hard negative mining

The authors design a composite reward function that integrates contrastive learning rewards, consistency rewards across multiple interpretations, and hard negative mining. This reward structure guides the policy optimization to produce both high-quality embeddings and coherent rationales.

Contribution

Unsupervised extension adapting the framework to settings without labeled query-document pairs

The authors extend GRACE to an unsupervised learning paradigm inspired by SimCSE, where different interpretations of the same text serve as positive pairs. This enables representation learning from raw text alone without requiring supervised query-document annotations.