PSP: Prompt-Guided Self-Training Sampling Policy for Active Prompt Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Active Prompt Learning · Reinforcement Learning · CLIP
Abstract:

Active Prompt Learning (APL) with vision-language models (e.g., CLIP) has attracted considerable attention for reducing the dependence on fully labeled datasets in downstream task adaptation. However, existing methods fail to explicitly leverage prompts to guide sample selection, so the selected samples do little to help the prompt template adapt to the downstream task, while valuable complementary information in the unselected samples is overlooked. To fill this gap, we propose a novel Prompt-Guided Self-Training Sampling Policy (PSP) for APL, which integrates Soft Actor-Critic with a customized real-pseudo hybrid reward and vectorized critics so that prompts guide sample selection toward samples that facilitate optimization of the prompt template, jointly considering both selected and unselected samples. Specifically, PSP comprises two components: a Vectorized Soft Actor-Critic Sampling Policy (VSSP) and an Uncertainty Augmented Self-Training (UST) mechanism. VSSP customizes a real-pseudo hybrid reward based on learned prompts and image features, which is fed into vectorized critics to estimate a Q-value for each sample and to compute gradients that optimize the actor, allowing it to refine its sampling policy end-to-end and identify the most informative samples for prompt learning. Moreover, UST leverages the CLIP model from the previous round to generate reliable pseudo-labeled data based on the uncertainty and confidence of averaged predictions, thereby deepening the model's understanding of the overall data distribution. Extensive experiments on diverse real-world datasets validate the effectiveness of PSP.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Prompt-Guided Self-Training Sampling Policy (PSP) that combines reinforcement learning-based sample selection with self-training for active prompt learning. It resides in the 'Policy-Based Active Selection' leaf, which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that policy-based approaches to active sample selection in prompt learning remain relatively underexplored compared to other branches like prompt optimization strategies or domain-specific adaptations.

The taxonomy reveals that most active learning work in this field concentrates on 'Uncertainty and Confidence-Based Selection' (three papers) and 'Open-Set and Open-Vocabulary Active Learning' (one paper), while the broader 'Prompt Optimization and Tuning Strategies' branch is substantially more populated with methods focused on context learning, visual prompts, and regularization. The paper's integration of policy networks with prompt-guided rewards positions it at the intersection of active selection and prompt optimization, diverging from purely uncertainty-driven or heuristic selection methods that dominate neighboring leaves. This cross-cutting approach appears less common in the current landscape.

Among the 21 candidates examined, the contribution-level analysis shows mixed novelty signals. For the core PSP framework, 10 candidates were examined with no clear refutations, suggesting the overall approach is reasonably distinctive. For the Vectorized Soft Actor-Critic component, only 1 candidate was examined, also with no refutation, though the limited search scope makes this less conclusive. For the Uncertainty Augmented Self-Training mechanism, 10 candidates were examined and 1 refutable match was found, indicating some overlap with existing self-training or uncertainty-based methods. Because of the limited search scale, these findings reflect top-K semantic matches rather than exhaustive coverage.

Based on the available signals from 21 examined candidates, the work appears to occupy a relatively novel position by combining policy-based selection with prompt-guided rewards, though the self-training component shows some prior overlap. The sparse population of its taxonomy leaf and the cross-cutting nature of its approach suggest potential novelty, but the limited search scope prevents definitive conclusions about how thoroughly the space of policy-based active prompt learning has been explored.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 1

Research Landscape Overview

Core task: active prompt learning with vision-language models. The field has evolved into a rich ecosystem organized around several complementary themes. Active Sample Selection and Data Efficiency explores how to choose the most informative examples for prompt tuning, often employing policy-based or uncertainty-driven strategies to minimize annotation costs. Prompt Optimization and Tuning Strategies focuses on the mechanics of learning effective prompts, ranging from gradient-based methods like Conditional Prompt Learning[3] to ensemble and meta-learning approaches. Domain-Specific and Task-Specific Adaptations addresses tailoring prompts to specialized contexts such as medical imaging or open-vocabulary detection, while Robustness and Distribution Challenges tackles generalization under domain shift and out-of-distribution scenarios. Meanwhile, the Federated and Privacy-Preserving Learning, Unsupervised and Low-Resource Learning, and Continual and Lifelong Learning branches investigate settings where data is decentralized, scarce, or arrives sequentially. Finally, Theoretical Foundations and Model Analysis provides a deeper understanding of why and how prompts work.

Within the Active Sample Selection branch, a handful of works treat sample selection as a learnable policy problem, aiming to identify which examples yield the greatest improvement in prompt quality. PSP Prompt Sampling[0] sits squarely in this policy-based active selection cluster, emphasizing strategic sampling to boost data efficiency. It shares common ground with Active Prompt Learning[1] and Active Prompt Priors[2], both of which also prioritize intelligent example selection but may differ in their underlying selection criteria or integration with prompt optimization loops.
Compared to broader prompt tuning methods like Conditional Prompt Learning[3] or Unsupervised Prompt Learning[4], PSP Prompt Sampling[0] places greater emphasis on the active curation of training samples rather than solely on the prompt representation itself. This positioning highlights an ongoing tension in the field: whether gains come primarily from better prompts or from better data, and how these two dimensions can be jointly optimized.

Claimed Contributions

Prompt-Guided Self-Training Sampling Policy (PSP) for Active Prompt Learning

The authors introduce PSP, a framework that combines Soft Actor-Critic with a tailored real-pseudo hybrid reward and vectorized critics to explicitly leverage prompts for guiding sample selection in active prompt learning. This approach bridges sample selection and prompt learning by jointly considering both selected and unselected samples.

10 retrieved papers
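The claimed framework alternates rounds of policy-driven selection, oracle labeling, and prompt tuning. As a rough illustration of that round structure (not the authors' implementation; the function name, the flat per-sample policy scores, and the budget value are all hypothetical), one selection round might be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch_psp_round(pool_feats, labeled_idx, budget, policy_scores):
    """One hypothetical PSP round: the policy scores the unlabeled pool,
    the top-`budget` samples are queried for real labels, and the rest
    remain available for pseudo-labeling by the self-training step."""
    unlabeled = np.setdiff1d(np.arange(len(pool_feats)), labeled_idx)
    # Rank the unlabeled samples by the actor's selection score, descending.
    ranked = unlabeled[np.argsort(policy_scores[unlabeled])[::-1]]
    selected = ranked[:budget]    # sent to the oracle for real labels
    unselected = ranked[budget:]  # candidates for pseudo-labels
    return selected, unselected

# Toy usage: a pool of 10 samples, 2 already labeled, budget of 3.
feats = rng.normal(size=(10, 4))
scores = rng.random(10)
sel, unsel = sketch_psp_round(feats, np.array([0, 1]), 3, scores)
```

By construction, every selected sample scores at least as high as every unselected one, and previously labeled samples are never re-queried.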
Vectorized Soft Actor-Critic Sampling Policy (VSSP)

VSSP is a component that customizes a real-pseudo hybrid reward using learned prompts and image features, which is then fed into vectorized critics to estimate Q-values for each sample and compute actor gradients. This enables the actor to refine its sampling policy in an end-to-end manner to identify the most informative samples for prompt learning.

1 retrieved paper
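The vectorized-critic and actor-gradient machinery can be illustrated at toy scale. The sketch below is hypothetical, not the paper's implementation: it scores a candidate pool with a small linear critic ensemble, takes the ensemble minimum as the per-sample Q-value (the clipped double-Q trick standard in Soft Actor-Critic), and then ascends the exact gradient of the entropy-regularized objective J = E_π[Q] + αH(π) for a softmax selection policy over the pool:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_gradient(logits, q_values, alpha=0.1):
    """Exact gradient of J = E_pi[Q] + alpha * H(pi) with respect to the
    logits of a softmax selection policy over the candidate pool."""
    pi = softmax(logits)
    adv = q_values - alpha * np.log(pi)  # soft advantage per sample
    return pi * (adv - pi @ adv)         # centered softmax policy gradient

# "Vectorized critics": an ensemble of K linear critics, each mapping a
# sample feature vector to a scalar Q; the ensemble minimum serves as the
# per-sample value, as in SAC's clipped double-Q estimate.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))      # 6 candidate samples, 4-dim features
critics = rng.normal(size=(2, 4))    # K = 2 critic weight vectors
q = (critics @ feats.T).min(axis=0)  # per-sample Q-values, shape (6,)

logits = np.zeros(6)
for _ in range(200):                 # plain gradient ascent on J
    logits += 0.5 * actor_gradient(logits, q)
pi = softmax(logits)
```

After ascent, the policy concentrates on high-Q candidates while the entropy term keeps it stochastic; the gradient expression can be verified against a finite-difference check of J.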
Uncertainty Augmented Self-Training (UST) mechanism

UST is a mechanism that leverages the teacher CLIP model to generate reliable pseudo-labeled data by evaluating uncertainty and confidence of average predictions across multiple augmentations. This mechanism extracts complementary information from unselected samples to deepen understanding of the overall data distribution.

10 retrieved papers
Can Refute
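The UST filter described above can be made concrete. Assuming, hypothetically, that the previous round's model yields class probabilities for several augmented views of each unlabeled image, a confidence-plus-entropy gate over the averaged predictions might look like the following (function name and both thresholds are illustrative):

```python
import numpy as np

def ust_pseudo_labels(probs_aug, conf_thr=0.8, unc_thr=0.5):
    """Hypothetical UST-style filter. `probs_aug` has shape
    (n_aug, n_samples, n_classes): class probabilities from the
    previous-round model for several augmented views of each image.
    A sample is pseudo-labeled only if its averaged prediction is both
    confident (high max probability) and certain (low entropy)."""
    mean_p = probs_aug.mean(axis=0)                       # average over views
    conf = mean_p.max(axis=1)                             # confidence
    ent = -(mean_p * np.log(mean_p + 1e-12)).sum(axis=1)  # predictive entropy
    keep = (conf >= conf_thr) & (ent <= unc_thr)
    return np.where(keep)[0], mean_p.argmax(axis=1)[keep]

# Toy check: sample 0 is stable and confident across views; sample 1's
# prediction flips between augmentations, so its average is uninformative.
p = np.array([
    [[0.9, 0.1], [0.9, 0.1]],   # view 1
    [[0.9, 0.1], [0.1, 0.9]],   # view 2 flips sample 1's prediction
])
idx, labels = ust_pseudo_labels(p)
```

Here only sample 0 passes both gates (average [0.9, 0.1]: confidence 0.9, entropy ≈ 0.33), while sample 1 averages to [0.5, 0.5] and is rejected; this is the sense in which augmentation-averaged uncertainty screens out unreliable pseudo-labels.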

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
