PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Conditional Semantic Textual Similarity; Reinforcement Learning; Large Language Models; Natural Language Processing; Curriculum Learning
Abstract:

Conditional Semantic Textual Similarity (C-STS) measures the semantic proximity between text segments under a specific condition, thereby overcoming the ambiguity inherent in traditional STS. However, existing methods are largely confined to discriminative models, failing to fully integrate recent breakthroughs in the NLP community concerning Large Language Models (LLMs) and Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this task, as it can directly optimize the non-differentiable Spearman ranking metric and guide the reasoning process required by C-STS. Yet we find that naively applying listwise RL fails to produce meaningful improvements, as the model is overwhelmed by a complex, coarse-grained reward signal. To address this challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning framework. PoLi-RL employs a two-stage curriculum: it first trains the model with simple pointwise rewards to establish fundamental scoring capabilities, then transitions to a hybrid reward that combines pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward (PSRR) mechanism that computes ranking rewards in parallel slices, where each slice comprises same-indexed completions from different samples. This provides a precise, differentiated learning signal for each individual completion, enabling granular credit assignment and effective optimization. On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture. As the first work to successfully apply RL to C-STS, our study introduces a powerful and effective paradigm for training LLMs on complex, ranking-based conditional judgment tasks. Our code and checkpoints are available at https://anonymous.4open.science/r/PoLi-RL.
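The abstract's central premise is that the non-differentiable Spearman metric can be optimized directly when used as an RL reward. A minimal, pure-Python sketch of Spearman's rho via the rank-difference formula (assuming no tied scores; this is an illustration, not the authors' implementation):

```python
def rank(xs):
    """Return 1-based ranks of xs (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for pos, i in enumerate(order):
        ranks[i] = pos + 1
    return ranks

def spearman_reward(pred_scores, gold_scores):
    """Spearman's rho, 1 - 6*sum(d^2) / (n*(n^2-1)): usable as a scalar
    RL reward even though it is non-differentiable w.r.t. the scores."""
    n = len(pred_scores)
    d2 = sum((a - b) ** 2
             for a, b in zip(rank(pred_scores), rank(gold_scores)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# A policy whose scores order the list correctly earns reward 1.0;
# a fully reversed ordering earns -1.0.
print(spearman_reward([0.1, 0.5, 0.9], [1, 3, 5]))
```

Because the reward depends only on the induced ranking, policy-gradient methods can optimize it without a differentiable surrogate, which is the property the paper exploits.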

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces PoLi-RL, a two-stage reinforcement learning framework for conditional semantic textual similarity (C-STS). Within the taxonomy, it occupies the 'Reinforcement Learning for C-STS' leaf under 'Conditional and Aspect-Specific Similarity Frameworks'. Notably, this leaf contains only the original paper itself—no sibling papers are present. This indicates that applying RL to C-STS is a relatively sparse research direction within the broader field of conditional similarity measurement, which includes more populated branches such as contrastive learning approaches and attention-based mechanisms.

The taxonomy reveals that neighboring leaves focus on contrastive learning (two papers) and attention/routing mechanisms (five papers) for C-STS. These sibling branches emphasize supervised or self-supervised objectives rather than policy-based optimization. The broader 'Conditional and Aspect-Specific Similarity Frameworks' category also includes dataset construction efforts (four papers), suggesting that the field is still establishing foundational resources. PoLi-RL diverges from these directions by framing C-STS as a sequential decision problem, directly optimizing ranking metrics rather than relying on contrastive losses or architectural innovations alone.

Among seventeen candidates examined, none clearly refute the paper's three main contributions. The PoLi-RL framework (five candidates examined, zero refutable) and the Parallel Slice Ranking Reward mechanism (two candidates examined, zero refutable) appear novel within the limited search scope. The claim of being the first end-to-end LLM-based cross-encoder with RL for C-STS (ten candidates examined, zero refutable) also lacks direct prior work among the candidates reviewed. However, the search scope is modest—seventeen papers total—so the absence of refutations reflects the limited sample rather than exhaustive coverage of the literature.

Based on the top-seventeen semantic matches and the sparse taxonomy leaf, the work appears to explore a relatively underexplored intersection of RL and C-STS. The analysis does not cover broader RL-for-NLP literature or recent LLM fine-tuning methods outside the C-STS context, so the novelty assessment is necessarily scoped to the immediate research area. The taxonomy structure and contribution-level statistics suggest the approach is distinctive within the examined sample, though a more comprehensive search would be needed to confirm its originality across the wider NLP community.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: conditional semantic textual similarity measurement, i.e., assessing semantic similarity between texts when specific aspects, contexts, or constraints must be taken into account. The taxonomy reveals several major branches:

- Conditional and Aspect-Specific Similarity Frameworks: methods that explicitly model user-defined conditions or aspects (e.g., CSTS[3], Conditional Contrastive Learning[1]).
- Context-Aware and Contextual Similarity Measurement: leveraging surrounding discourse or situational cues (e.g., Context-Based Similarity[5]).
- Embedding and Representation Learning for Similarity: neural encoders and contrastive objectives tailored to similarity tasks.
- Domain-Specific Similarity Applications: adapting these techniques to specialized areas such as legal documents (Legal Case Similarity[14]) or medical text.
- Similarity for Downstream NLP Tasks: how similarity measures support retrieval, clustering, and other applications.
- Theoretical and Methodological Foundations: conceptual underpinnings and evaluation frameworks.

Together, these branches reflect a shift from generic similarity metrics toward fine-grained, condition-aware approaches that capture nuanced semantic relationships. A particularly active line of work explores how to incorporate explicit conditions or aspects into similarity computation, balancing flexibility with interpretability: some studies introduce contrastive learning schemes that condition on user-specified attributes (Conditional Contrastive Learning[1]), while others propose structured frameworks for aspect-level comparison (CSTS[3]). Within this landscape, PoLi-RL[0] sits squarely in the Reinforcement Learning for C-STS branch, employing policy-based optimization to learn condition-sensitive similarity functions.

Compared to supervised or contrastive baselines such as CSTS[3] or Conditional Contrastive Learning[1], PoLi-RL[0] emphasizes adaptive decision-making under varying conditions, potentially offering greater robustness when labeled data for specific conditions is scarce. Open questions remain around how best to represent conditions (through discrete labels or continuous embeddings) and how to ensure that learned similarity functions generalize across diverse domains and evolving user requirements.

Claimed Contributions

PoLi-RL: A two-stage Point-to-List Reinforcement Learning framework for C-STS

The authors propose PoLi-RL, a progressive two-stage training curriculum for C-STS. Stage I uses simple pointwise rewards to ground the model in basic scoring rules, while Stage II introduces a hybrid reward combining pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions.

5 retrieved papers
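The two-stage curriculum described above can be sketched in code. The specific reward definitions and mixing weights below are illustrative assumptions, not the paper's exact formulation:

```python
def _ranks(xs):
    """1-based ranks (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def pointwise_reward(pred, gold, max_err=4.0):
    """Stage I signal: absolute-error reward for one prediction, in [0, 1]."""
    return max(0.0, 1.0 - abs(pred - gold) / max_err)

def pairwise_reward(preds, golds):
    """Fraction of correctly ordered pairs among pairs with distinct golds."""
    pairs = [(i, j) for i in range(len(preds))
             for j in range(i + 1, len(preds)) if golds[i] != golds[j]]
    if not pairs:
        return 0.0
    good = sum((preds[i] - preds[j]) * (golds[i] - golds[j]) > 0
               for i, j in pairs)
    return good / len(pairs)

def listwise_reward(preds, golds):
    """Spearman's rho over the whole list."""
    n = len(preds)
    d2 = sum((a - b) ** 2 for a, b in zip(_ranks(preds), _ranks(golds)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def curriculum_reward(preds, golds, stage, w=(0.4, 0.3, 0.3)):
    """Stage 1: pointwise only. Stage 2: hybrid of all three objectives.
    The weights w are hypothetical."""
    point = sum(pointwise_reward(p, g) for p, g in zip(preds, golds)) / len(preds)
    if stage == 1:
        return point
    return (w[0] * point
            + w[1] * pairwise_reward(preds, golds)
            + w[2] * listwise_reward(preds, golds))
```

The stage switch captures the paper's stated rationale: grounding the model in per-example scoring first, so that the later, harder ranking signal refines rather than overwhelms it.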
Parallel Slice Ranking Reward (PSRR) mechanism

The authors introduce PSRR, a novel reward computation mechanism that organizes multiple completions into parallel slices and computes ranking rewards within each slice. This two-level decomposition allows each completion to receive a unique and precise reward, enabling fine-grained credit assignment and stable training for ranking-based tasks.

2 retrieved papers
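As described, PSRR groups same-indexed completions from different samples into slices and scores each slice as a ranked list, so each of a sample's completions lands in a different slice and receives its own reward. A minimal sketch under assumed shapes (one scalar score per completion, one gold score per sample, Spearman as the slice-level ranking reward; all of these are assumptions for illustration):

```python
def _ranks(xs):
    """1-based ranks (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def _spearman(a, b):
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(_ranks(a), _ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def psrr_rewards(completion_scores, gold_scores):
    """completion_scores[i][k]: score from the k-th completion of sample i.
    Slice k collects the k-th completion of every sample; the slice's
    ranking quality against gold_scores becomes the reward of every
    completion in that slice, so completions of the same sample can
    receive different, slice-level credit."""
    n_samples = len(completion_scores)
    n_completions = len(completion_scores[0])
    rewards = [[0.0] * n_completions for _ in range(n_samples)]
    for k in range(n_completions):
        slice_scores = [completion_scores[i][k] for i in range(n_samples)]
        slice_reward = _spearman(slice_scores, gold_scores)
        for i in range(n_samples):
            rewards[i][k] = slice_reward
    return rewards

# Slice 0 ranks the three samples correctly; slice 1 reverses them,
# so each sample's two completions get opposite rewards.
print(psrr_rewards([[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]], [1, 2, 3]))
```

This illustrates the claimed benefit: a single listwise reward over all completions would give every completion of a sample the same credit, whereas slicing differentiates them.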
First end-to-end LLM-based cross-encoder with RL for C-STS

The authors claim to be the first to apply an end-to-end LLM-based cross-encoder architecture to the Conditional Semantic Textual Similarity task and the first to successfully integrate reinforcement learning for training in this domain, establishing a new paradigm for complex conditional judgment tasks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In the retrieved landscape it therefore appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: PoLi-RL, a two-stage Point-to-List Reinforcement Learning framework for C-STS

The authors propose PoLi-RL, a progressive two-stage training curriculum for C-STS. Stage I uses simple pointwise rewards to ground the model in basic scoring rules, while Stage II introduces a hybrid reward combining pointwise, pairwise, and listwise objectives to refine the model's ability to discern subtle semantic distinctions.

Contribution 2: Parallel Slice Ranking Reward (PSRR) mechanism

The authors introduce PSRR, a novel reward computation mechanism that organizes multiple completions into parallel slices and computes ranking rewards within each slice. This two-level decomposition allows each completion to receive a unique and precise reward, enabling fine-grained credit assignment and stable training for ranking-based tasks.

Contribution 3: First end-to-end LLM-based cross-encoder with RL for C-STS

The authors claim to be the first to apply an end-to-end LLM-based cross-encoder architecture to the Conditional Semantic Textual Similarity task and the first to successfully integrate reinforcement learning for training in this domain, establishing a new paradigm for complex conditional judgment tasks.