Thompson Sampling via Fine-Tuning of LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Bayesian optimization, Thompson Sampling, discrete domain, variational Bayesian optimistic sampling, cumulative regret, theory, large language model, fine-tuning, probability of maximality, probability of optimality
Abstract:

Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT), leverages the prior knowledge embedded in prompt-conditioned large language models and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality, a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. Within a collection of methods covering Bayesian optimization, reinforcement learning, and evolutionary search, ToSFiT exhibits both state-of-the-art sample efficiency and computational efficiency.
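As context for the mechanism the abstract describes, the classical Thompson-sampling selection rule already selects each candidate with probability equal to its posterior probability of being the maximizer, with no acquisition function ever maximized explicitly. The sketch below illustrates only that rule; the setup (five candidates with independent Gaussian posteriors) is an illustrative assumption and not the paper's LLM-based model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: 5 discrete candidates with unknown mean rewards,
# modeled with independent Gaussian posteriors (a stand-in surrogate, not
# the paper's prompt-conditioned LLM policy).
true_means = np.array([0.0, 1.0, 0.2, 0.1, 2.0])
post_mean = np.zeros(5)   # posterior means under a N(0, 1) prior
counts = np.zeros(5)      # observation counts per candidate

for t in range(2000):
    # Thompson sampling: draw one value per candidate from its posterior and
    # play the argmax. This picks each candidate with probability equal to
    # its posterior probability of maximality (PoM).
    post_var = 1.0 / (1.0 + counts)           # conjugate update, noise var = 1
    theta = rng.normal(post_mean, np.sqrt(post_var))
    a = int(np.argmax(theta))

    # Observe a noisy reward and update the chosen candidate's posterior mean.
    r = true_means[a] + rng.normal()
    counts[a] += 1
    post_mean[a] += (r - post_mean[a]) / (1.0 + counts[a])

best = int(np.argmax(post_mean))  # candidate the posterior now favors
```

Over the run, sampling concentrates on the highest-reward candidate as its PoM approaches one, which is the behavior ToSFiT approximates by fine-tuning a generative policy instead of maintaining an explicit per-candidate posterior.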

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Thompson Sampling via Fine-Tuning (ToSFiT), which combines Thompson sampling with large language model fine-tuning to optimize over unstructured discrete spaces. It resides in the 'Thompson Sampling and Posterior Sampling' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that posterior sampling approaches remain relatively underexplored compared to surrogate modeling or gradient-free acquisition optimization methods that dominate other branches.

The taxonomy reveals that most acquisition function optimization work concentrates on gradient-free reparameterization techniques or game-theoretic frameworks, both appearing in sibling leaves under the same parent category. The broader 'Acquisition Function Optimization and Search Strategies' branch sits alongside 'Surrogate Modeling and Representation Learning', which contains four distinct subcategories with substantially more papers. ToSFiT diverges from representation-learning approaches by avoiding explicit embeddings or latent spaces, instead directly parameterizing candidate selection probabilities through language model adaptation.

Among thirty candidates examined, none clearly refute the three core contributions. The theoretical regret bound for variational Bayesian optimization accounting for reward correlation was examined against ten candidates with zero refutations. The ToSFiT algorithm combining pre-training with posterior fine-tuning similarly showed no overlapping prior work among ten examined papers. The empirical validation across FAQ refinement, protein search, and quantum circuit design also encountered no refutable candidates in its ten-paper examination. These statistics suggest novelty within the limited search scope, though the small candidate pool leaves open the possibility of relevant work beyond top-thirty semantic matches.

The analysis indicates that ToSFiT occupies a sparsely populated research direction, with limited prior work on Thompson sampling for discrete Bayesian optimization. The absence of refutations across all contributions within the examined candidate set supports perceived novelty, though the thirty-paper scope represents a focused rather than exhaustive literature review. The taxonomy context confirms that posterior sampling remains less developed than alternative acquisition strategies in this domain.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Bayesian optimization in large unstructured discrete spaces. The field addresses the challenge of optimizing black-box functions over combinatorial or categorical domains where traditional gradient-based methods fail. The taxonomy reveals several complementary research directions:

- Surrogate Modeling and Representation Learning: builds effective probabilistic models over discrete inputs, often using embeddings or neural architectures to capture structure.
- Acquisition Function Optimization and Search Strategies: develops methods for efficiently proposing new candidates, including evolutionary approaches and posterior sampling techniques.
- High-Dimensional and Structured Space Techniques: tackles scalability through dimensionality reduction and exploiting latent regularities.
- Mixed-Variable and Hybrid Space Optimization: handles domains combining discrete and continuous parameters.
- Multi-Fidelity and Multi-Objective Frameworks: extends the paradigm to settings with multiple information sources or competing objectives.
- Domain-Specific Applications: demonstrates successes in areas like neural architecture search, materials discovery, and prompt engineering.
- Algorithmic Enhancements: provides theoretical guarantees and practical improvements.

A particularly active tension exists between representation-driven approaches that learn continuous relaxations or embeddings (e.g., Dictionary Embeddings[2], BANANAS[13]) and methods that operate directly in discrete space through sophisticated search strategies. Thompson Sampling LLMs[0] sits within the acquisition-function branch, specifically employing posterior sampling to guide exploration in discrete domains. This contrasts with gradient-free local search methods like Amortized Discrete[12], which learns to propose candidates via amortized inference. While many works focus on structured spaces with known combinatorial properties (Combinatorial Structures[1], Permutation Spaces[7]), Thompson Sampling LLMs[0] targets unstructured settings where leveraging large language models for proposal generation offers a promising alternative to hand-crafted search heuristics, addressing scalability challenges that arise when the discrete space lacks exploitable regularity.

Claimed Contributions

Improved regret bound for VBOS accounting for reward correlation

The authors derive a tighter regret bound for Variational Bayesian Optimistic Sampling that scales with the maximal information gain γ_T rather than the domain size |X|, making it applicable to large discrete spaces. They also extend this bound to approximate VBOS implementations using gradient-based optimization.

10 retrieved papers
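For context, the two scaling regimes mentioned in this contribution can be written side by side. The forms below are the standard shapes from the Thompson-sampling literature (finite-armed analyses vs. information-gain analyses for correlated rewards), shown only to illustrate the direction of the claimed improvement; the paper's exact bound, constants, and definition of γ_T may differ:

```latex
\underbrace{\mathbb{E}[R_T] \;=\; \tilde{O}\!\left(\sqrt{|\mathcal{X}|\, T}\right)}_{\text{domain-size scaling}}
\qquad\longrightarrow\qquad
\underbrace{\mathbb{E}[R_T] \;=\; \tilde{O}\!\left(\sqrt{T\, \gamma_T}\right)}_{\text{information-gain scaling}}
```

When rewards are strongly correlated across candidates, γ_T can grow far more slowly than |X|, which is what makes the second form usable in large discrete spaces.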
ToSFiT algorithm combining pre-training with fine-tuning toward the posterior PoM

The authors propose Thompson Sampling via Fine-Tuning (ToSFiT), which initializes a generative policy using pre-trained LLMs with prompt conditioning and then carefully adapts it toward the posterior probability of maximality through fine-tuning, guided by their theoretical analysis.

10 retrieved papers
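The mechanism behind this contribution can be sketched in miniature. In the hypothetical stand-in below (not the authors' implementation), a softmax distribution over four candidates plays the role of the prompt-conditioned LLM policy, independent Gaussian posteriors play the role of the reward posterior, and "fine-tuning" is gradient descent on the cross-entropy between the policy and a Monte Carlo estimate of the posterior probability of maximality:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy posterior over 4 candidates' rewards (independent Gaussians).
post_mean = np.array([0.2, 0.8, 0.5, 0.1])
post_std = np.array([0.3, 0.3, 0.3, 0.3])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Step 1: Monte Carlo estimate of the probability of maximality (PoM):
# the fraction of joint posterior draws in which each candidate is best.
draws = rng.normal(post_mean, post_std, size=(20000, 4))
pom = np.bincount(draws.argmax(axis=1), minlength=4) / draws.shape[0]

# Step 2: "fine-tune" the policy toward the PoM target by minimizing the
# cross-entropy H(pom, policy); for a softmax parameterization, its exact
# gradient with respect to the logits is (policy - pom).
logits = np.zeros(4)  # uniform start, standing in for the pre-trained prior
for _ in range(2000):
    logits -= 0.5 * (softmax(logits) - pom)

policy = softmax(logits)  # now approximately matches the PoM distribution
```

Sampling from the resulting policy then reproduces Thompson-sampling behavior; in ToSFiT the analogous update is applied to LLM parameters rather than to a small logit vector.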
Empirical validation across three diverse optimization tasks

The authors demonstrate ToSFiT's effectiveness on three distinct tasks spanning natural language, biological sequences, and quantum circuits, showing it achieves state-of-the-art sample efficiency and computational efficiency compared to existing Bayesian optimization, reinforcement learning, and evolutionary methods.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Improved regret bound for VBOS accounting for reward correlation

The authors derive a tighter regret bound for Variational Bayesian Optimistic Sampling that scales with the maximal information gain γ_T rather than the domain size |X|, making it applicable to large discrete spaces. They also extend this bound to approximate VBOS implementations using gradient-based optimization.

Contribution

ToSFiT algorithm combining pre-training with fine-tuning toward the posterior PoM

The authors propose Thompson Sampling via Fine-Tuning (ToSFiT), which initializes a generative policy using pre-trained LLMs with prompt conditioning and then carefully adapts it toward the posterior probability of maximality through fine-tuning, guided by their theoretical analysis.

Contribution

Empirical validation across three diverse optimization tasks

The authors demonstrate ToSFiT's effectiveness on three distinct tasks spanning natural language, biological sequences, and quantum circuits, showing it achieves state-of-the-art sample efficiency and computational efficiency compared to existing Bayesian optimization, reinforcement learning, and evolutionary methods.