FedPOB: Sample-Efficient Federated Prompt Optimization via Bandits

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Prompt Optimization, Neural Bandits, Dueling Bandits
Abstract:

The performance of large language models (LLMs) is highly sensitive to the input prompt, making prompt optimization a critical task. However, real-world application is hindered by three major challenges: (1) the black-box nature of powerful proprietary LLMs, (2) the need for high sample efficiency due to query costs, and (3) the desire for privacy-preserving collaboration among multiple users. To address these challenges simultaneously, we introduce a novel framework for sample-efficient federated prompt optimization based on multi-armed bandits (MABs). The MAB framework is uniquely suited for this problem as it is (1) inherently a black-box optimization method, (2) practically sample-efficient, and (3) enables collaborative learning with theoretically guaranteed benefit from more participating agents. We first propose the Federated Prompt Optimization via Bandits (FedPOB) algorithm, a federated variant of the Linear UCB algorithm, where agents collaborate by sharing model parameters instead of raw data. We then extend our approach to the practical setting of comparative user feedback by introducing FedPOB with Preference Feedback (FedPOB-Pref), an efficient algorithm based on federated dueling bandits. Extensive experiments demonstrate that both FedPOB and FedPOB-Pref significantly outperform existing baselines and that their performance consistently improves as more agents participate in the collaboration, validating the effectiveness of our federated approach.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FedPOB and FedPOB-Pref, two federated prompt optimization algorithms based on multi-armed bandits for black-box LLMs. According to the taxonomy, this work resides in the 'Linear Bandit and Preference-Based Methods' leaf under 'Bandit-Based Optimization Frameworks'. Notably, this leaf contains only the original paper itself—no sibling papers are present. This indicates that bandit-based federated prompt optimization with preference feedback represents a sparse, relatively unexplored research direction within the broader field of federated LLM tuning.

The taxonomy reveals that neighboring branches include 'Discrete Prompt Tuning Methods' (with query-efficient and general discrete frameworks), 'Bayesian and Likelihood-Free Methods' (using Gaussian processes or preimage-informed optimization), and 'Continuous and Hybrid Approaches' (edge-deployed systems and textual gradient methods). The paper's bandit framework diverges from discrete token-level search by treating prompt selection as sequential decision-making, and from Bayesian methods by avoiding probabilistic surrogate models. Its focus on preference feedback also distinguishes it from score-based discrete tuning and continuous embedding optimization, bridging exploration-exploitation trade-offs with privacy-preserving collaboration.

Among the twelve candidates examined, none were found to refute any of the three core contributions. For the FedPOB algorithm with score feedback, five candidates were reviewed with zero refutable overlaps; for FedPOB-Pref with preference feedback, four candidates yielded no prior work that clearly anticipates this approach; and for the multi-armed bandit framework itself, three candidates were examined without identifying substantial precedent. This suggests that within the limited search scope, the combination of federated learning, linear bandits, and preference-based prompt optimization appears relatively novel, though the analysis does not claim exhaustive coverage of all related literature.

Given the top-twelve semantic matches examined, the work occupies a sparsely populated niche at the intersection of bandit algorithms, federated learning, and black-box LLM tuning. The absence of sibling papers in the taxonomy leaf and the lack of refutable candidates among those reviewed indicate that this specific combination of techniques has not been extensively explored in prior work. However, the limited search scope means that related methods in adjacent areas—such as discrete tuning or Bayesian optimization—may share conceptual overlap not captured by the current analysis.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 12
Refutable Papers: 0

Research Landscape Overview

Core task: federated prompt optimization for black-box large language models. The field addresses how to tune prompts for LLMs when model internals are inaccessible and data must remain decentralized across clients.

The taxonomy reveals several main branches. Discrete Prompt Tuning Methods focus on selecting or editing token-level prompts in a federated setting, often leveraging gradient-free search or evolutionary strategies. Bandit-Based Optimization Frameworks treat prompt selection as a sequential decision problem, using linear bandits or preference feedback to guide exploration without requiring model gradients. Bayesian and Likelihood-Free Methods employ probabilistic modeling and simulation-based inference to navigate the black-box landscape. Continuous and Hybrid Approaches blend soft prompt embeddings with discrete search, while Privacy-Preserving and Synthetic Data Methods emphasize differential privacy and data generation to protect client information. Survey and Conceptual Frameworks provide overarching perspectives on federated LLM deployment, and Multi-Agent Collaboration Frameworks explore coordination among distributed agents. Representative works such as FedDTPT[1], Efficient Federated Prompt[2], and FedBPT[5] illustrate how discrete tuning and privacy constraints shape the design space.

Within this landscape, a particularly active line of work centers on bandit-based and preference-driven optimization, where methods balance exploration and exploitation under communication constraints. FedPOB[0] sits squarely in this branch, employing linear bandit and preference-based techniques to iteratively refine prompts without accessing model gradients. This contrasts with purely discrete approaches like FedDTPT[1], which rely on token-level edits and evolutionary search, and with Bayesian methods that model uncertainty through surrogate functions. Nearby works such as Federated Blackbox Edge[3] and FedOne[4] also tackle black-box tuning but differ in their handling of client heterogeneity and communication efficiency. A central tension across these branches is the trade-off between prompt expressiveness, privacy guarantees, and the number of federated rounds required for convergence. FedPOB[0] addresses this by leveraging preference feedback to guide search efficiently, positioning it as a middle ground between fully discrete token search and continuous embedding optimization.

Claimed Contributions

FedPOB algorithm for federated prompt optimization with score feedback

The authors introduce FedPOB, a federated variant of the Linear UCB algorithm that allows multiple agents to collaboratively optimize prompts for black-box LLMs by sharing model parameters rather than raw data. The method is designed to be sample-efficient and provides theoretical guarantees that performance improves with more participating agents.

5 retrieved papers
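The federated LinUCB scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each candidate prompt is represented by a fixed feature vector (e.g., an embedding), and all names (`Agent`, `federated_sync`, the hyperparameters) are hypothetical.

```python
import numpy as np

class Agent:
    """One client running LinUCB over prompt feature vectors."""
    def __init__(self, dim, lam=1.0, alpha=1.0):
        self.lam = lam
        self.A = lam * np.eye(dim)   # regularized Gram matrix of pulled features
        self.b = np.zeros(dim)       # reward-weighted feature sum
        self.alpha = alpha           # exploration coefficient

    def select_prompt(self, features):
        """Pick the index of the prompt with the highest UCB score."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b       # ridge-regression reward estimate
        scores = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
                  for x in features]
        return int(np.argmax(scores))

    def update(self, x, reward):
        """Incorporate the observed score for the queried prompt."""
        self.A += np.outer(x, x)
        self.b += reward * x

def federated_sync(agents, dim, lam=1.0):
    """Server-side step: aggregate sufficient statistics (A, b) across
    agents -- model parameters are shared, never raw (prompt, score) data."""
    A_global = lam * np.eye(dim) + sum(a.A - lam * np.eye(dim) for a in agents)
    b_global = sum(a.b for a in agents)
    for a in agents:
        a.A = A_global.copy()
        a.b = b_global.copy()
```

Because the aggregated statistics are sums over all agents' observations, each client's confidence set shrinks with the total number of participants, which is the intuition behind the claimed benefit of more agents.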
FedPOB-Pref algorithm for federated prompt optimization with preference feedback

The authors propose FedPOB-Pref to handle scenarios where only pairwise preference feedback is available instead of explicit numerical scores. This algorithm is based on federated linear dueling bandits and incorporates dynamic regularization to achieve both communication efficiency and strong performance.

4 retrieved papers
Multi-armed bandit framework for federated prompt optimization

The authors establish a new framework that casts federated prompt optimization as a multi-armed bandit problem. This framework is uniquely suited because MABs are inherently black-box optimization methods, practically sample-efficient, and enable collaborative learning with theoretical guarantees of benefit from more participating agents.

3 retrieved papers
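The core framing, treating each candidate prompt as an arm and the black-box LLM's task score as a noisy reward, can be shown with a classic (non-federated, non-contextual) UCB1 loop. This toy sketch uses a simulated `score_fn` in place of an actual LLM query; all names are illustrative.

```python
import math
import random

def ucb1_select(counts, means, t):
    """Classic UCB1 arm selection: exploit high empirical means while
    exploring under-sampled prompts."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try every prompt at least once
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))

def run_bandit(score_fn, n_prompts, horizon, seed=0):
    """Sequentially query the (simulated) black box and return the prompt
    index with the best empirical mean score after `horizon` rounds."""
    random.seed(seed)
    counts = [0] * n_prompts
    means = [0.0] * n_prompts
    for t in range(1, horizon + 1):
        i = ucb1_select(counts, means, t)
        r = score_fn(i)            # in practice: query the LLM, score output
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental running mean
    return max(range(n_prompts), key=lambda i: means[i])
```

This framing makes the three claimed properties concrete: only prompt-in/score-out access is needed (black-box), regret bounds limit wasted queries (sample efficiency), and, as in the FedPOB sketch above, statistics from multiple agents can be pooled (collaboration).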

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though this remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FedPOB algorithm for federated prompt optimization with score feedback

The authors introduce FedPOB, a federated variant of the Linear UCB algorithm that allows multiple agents to collaboratively optimize prompts for black-box LLMs by sharing model parameters rather than raw data. The method is designed to be sample-efficient and provides theoretical guarantees that performance improves with more participating agents.

Contribution

FedPOB-Pref algorithm for federated prompt optimization with preference feedback

The authors propose FedPOB-Pref to handle scenarios where only pairwise preference feedback is available instead of explicit numerical scores. This algorithm is based on federated linear dueling bandits and incorporates dynamic regularization to achieve both communication efficiency and strong performance.

Contribution

Multi-armed bandit framework for federated prompt optimization

The authors establish a new framework that casts federated prompt optimization as a multi-armed bandit problem. This framework is uniquely suited because MABs are inherently black-box optimization methods, practically sample-efficient, and enable collaborative learning with theoretical guarantees of benefit from more participating agents.