Thompson Sampling via Fine-Tuning of LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Bayesian optimization, Thompson Sampling, discrete domain, variational Bayesian optimistic sampling, cumulative regret, theory, large language model, fine-tuning, probability of maximality, probability of optimality
Abstract:

Bayesian optimization in large unstructured discrete spaces is often hindered by the computational cost of maximizing acquisition functions due to the absence of gradients. We propose a scalable alternative based on Thompson sampling that eliminates the need for acquisition function maximization by directly parameterizing the probability that a candidate yields the maximum reward. Our approach, Thompson Sampling via Fine-Tuning (ToSFiT), leverages the prior knowledge embedded in prompt-conditioned large language models and incrementally adapts them toward the posterior. Theoretically, we derive a novel regret bound for a variational formulation of Thompson Sampling that matches the strong guarantees of its standard counterpart. Our analysis reveals the critical role of careful adaptation to the posterior probability of maximality, a principle that underpins our ToSFiT algorithm. Empirically, we validate our method on three diverse tasks: FAQ response refinement, thermally stable protein search, and quantum circuit design. Within a collection of methods covering Bayesian optimization, reinforcement learning, and evolutionary search, ToSFiT exhibits both state-of-the-art sample efficiency and computational efficiency.
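As context for the mechanism the abstract describes, the classical Thompson-sampling selection rule already selects each candidate with probability equal to its posterior probability of being the maximizer, with no acquisition function ever maximized explicitly. The sketch below illustrates only that rule; the setup (five candidates with independent Gaussian posteriors) is an illustrative assumption and not the paper's LLM-based model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: 5 discrete candidates with unknown mean rewards,
# modeled with independent Gaussian posteriors (a stand-in surrogate, not
# the paper's prompt-conditioned LLM policy).
true_means = np.array([0.0, 1.0, 0.2, 0.1, 2.0])
post_mean = np.zeros(5)   # posterior means under a N(0, 1) prior
counts = np.zeros(5)      # observation counts per candidate

for t in range(2000):
    # Thompson sampling: draw one value per candidate from its posterior and
    # play the argmax. This picks each candidate with probability equal to
    # its posterior probability of maximality (PoM).
    post_var = 1.0 / (1.0 + counts)           # conjugate update, noise var = 1
    theta = rng.normal(post_mean, np.sqrt(post_var))
    a = int(np.argmax(theta))

    # Observe a noisy reward and update the chosen candidate's posterior mean.
    r = true_means[a] + rng.normal()
    counts[a] += 1
    post_mean[a] += (r - post_mean[a]) / (1.0 + counts[a])

best = int(np.argmax(post_mean))  # candidate the posterior now favors
```

Over the run, sampling concentrates on the highest-reward candidate as its PoM approaches one, which is the behavior ToSFiT approximates by fine-tuning a generative policy instead of maintaining an explicit per-candidate posterior.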

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Thompson Sampling via Fine-Tuning (ToSFiT), which combines Thompson sampling with large language model fine-tuning to optimize over unstructured discrete spaces. It resides in the 'Thompson Sampling and Posterior Sampling' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that posterior sampling approaches remain relatively underexplored compared to surrogate modeling or gradient-free acquisition optimization methods that dominate other branches.

The taxonomy reveals that most acquisition function optimization work concentrates on gradient-free reparameterization techniques or game-theoretic frameworks, both appearing in sibling leaves under the same parent category. The broader 'Acquisition Function Optimization and Search Strategies' branch sits alongside 'Surrogate Modeling and Representation Learning', which contains four distinct subcategories with substantially more papers. ToSFiT diverges from representation-learning approaches by avoiding explicit embeddings or latent spaces, instead directly parameterizing candidate selection probabilities through language model adaptation.

Among thirty candidates examined, none clearly refute the three core contributions. The theoretical regret bound for variational Bayesian optimization accounting for reward correlation was examined against ten candidates with zero refutations. The ToSFiT algorithm combining pre-training with posterior fine-tuning similarly showed no overlapping prior work among ten examined papers. The empirical validation across FAQ refinement, protein search, and quantum circuit design also encountered no refutable candidates in its ten-paper examination. These statistics suggest novelty within the limited search scope, though the small candidate pool leaves open the possibility of relevant work beyond top-thirty semantic matches.

The analysis indicates that ToSFiT occupies a sparsely populated research direction, with limited prior work on Thompson sampling for discrete Bayesian optimization. The absence of refutations across all contributions within the examined candidate set supports perceived novelty, though the thirty-paper scope represents a focused rather than exhaustive literature review. The taxonomy context confirms that posterior sampling remains less developed than alternative acquisition strategies in this domain.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Bayesian optimization in large unstructured discrete spaces. The field addresses the challenge of optimizing black-box functions over combinatorial or categorical domains where traditional gradient-based methods fail. The taxonomy reveals several complementary research directions:

- Surrogate Modeling and Representation Learning: builds effective probabilistic models over discrete inputs, often using embeddings or neural architectures to capture structure.
- Acquisition Function Optimization and Search Strategies: develops methods for efficiently proposing new candidates, including evolutionary approaches and posterior sampling techniques.
- High-Dimensional and Structured Space Techniques: tackles scalability through dimensionality reduction and exploiting latent regularities.
- Mixed-Variable and Hybrid Space Optimization: handles domains combining discrete and continuous parameters.
- Multi-Fidelity and Multi-Objective Frameworks: extends the paradigm to settings with multiple information sources or competing objectives.
- Domain-Specific Applications: demonstrates successes in areas like neural architecture search, materials discovery, and prompt engineering.
- Algorithmic Enhancements: provides theoretical guarantees and practical improvements.

A particularly active tension exists between representation-driven approaches that learn continuous relaxations or embeddings (e.g., Dictionary Embeddings[2], BANANAS[13]) and methods that operate directly in discrete space through sophisticated search strategies. Thompson Sampling LLMs[0] sits within the acquisition-function branch, specifically employing posterior sampling to guide exploration in discrete domains. This contrasts with gradient-free local search methods like Amortized Discrete[12], which learns to propose candidates via amortized inference. While many works focus on structured spaces with known combinatorial properties (Combinatorial Structures[1], Permutation Spaces[7]), Thompson Sampling LLMs[0] targets unstructured settings where leveraging large language models for proposal generation offers a promising alternative to hand-crafted search heuristics, addressing scalability challenges that arise when the discrete space lacks exploitable regularity.

Claimed Contributions

Improved regret bound for VBOS accounting for reward correlation

The authors derive a tighter regret bound for Variational Bayesian Optimistic Sampling that scales with the maximal information gain γ_T rather than the domain size |X|, making it applicable to large discrete spaces. They also extend this bound to approximate VBOS implementations using gradient-based optimization.

10 retrieved papers
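For context, the two scaling regimes mentioned in this contribution can be written side by side. The forms below are the standard shapes from the Thompson-sampling literature (finite-armed analyses vs. information-gain analyses for correlated rewards), shown only to illustrate the direction of the claimed improvement; the paper's exact bound, constants, and definition of γ_T may differ:

```latex
\underbrace{\mathbb{E}[R_T] \;=\; \tilde{O}\!\left(\sqrt{|\mathcal{X}|\, T}\right)}_{\text{domain-size scaling}}
\qquad\longrightarrow\qquad
\underbrace{\mathbb{E}[R_T] \;=\; \tilde{O}\!\left(\sqrt{T\, \gamma_T}\right)}_{\text{information-gain scaling}}
```

When rewards are strongly correlated across candidates, γ_T can grow far more slowly than |X|, which is what makes the second form usable in large discrete spaces.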
ToSFiT algorithm combining pre-training with fine-tuning toward the posterior PoM

The authors propose Thompson Sampling via Fine-Tuning (ToSFiT), which initializes a generative policy using pre-trained LLMs with prompt conditioning and then carefully adapts it toward the posterior probability of maximality through fine-tuning, guided by their theoretical analysis.

10 retrieved papers
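The mechanism behind this contribution can be sketched in miniature. In the hypothetical stand-in below (not the authors' implementation), a softmax distribution over four candidates plays the role of the prompt-conditioned LLM policy, independent Gaussian posteriors play the role of the reward posterior, and "fine-tuning" is gradient descent on the cross-entropy between the policy and a Monte Carlo estimate of the posterior probability of maximality:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy posterior over 4 candidates' rewards (independent Gaussians).
post_mean = np.array([0.2, 0.8, 0.5, 0.1])
post_std = np.array([0.3, 0.3, 0.3, 0.3])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Step 1: Monte Carlo estimate of the probability of maximality (PoM):
# the fraction of joint posterior draws in which each candidate is best.
draws = rng.normal(post_mean, post_std, size=(20000, 4))
pom = np.bincount(draws.argmax(axis=1), minlength=4) / draws.shape[0]

# Step 2: "fine-tune" the policy toward the PoM target by minimizing the
# cross-entropy H(pom, policy); for a softmax parameterization, its exact
# gradient with respect to the logits is (policy - pom).
logits = np.zeros(4)  # uniform start, standing in for the pre-trained prior
for _ in range(2000):
    logits -= 0.5 * (softmax(logits) - pom)

policy = softmax(logits)  # now approximately matches the PoM distribution
```

Sampling from the resulting policy then reproduces Thompson-sampling behavior; in ToSFiT the analogous update is applied to LLM parameters rather than to a small logit vector.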
Empirical validation across three diverse optimization tasks

The authors demonstrate ToSFiT's effectiveness on three distinct tasks spanning natural language, biological sequences, and quantum circuits, showing it achieves state-of-the-art sample efficiency and computational efficiency compared to existing Bayesian optimization, reinforcement learning, and evolutionary methods.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Improved regret bound for VBOS accounting for reward correlation

The authors derive a tighter regret bound for Variational Bayesian Optimistic Sampling that scales with the maximal information gain γ_T rather than the domain size |X|, making it applicable to large discrete spaces. They also extend this bound to approximate VBOS implementations using gradient-based optimization.

Contribution

ToSFiT algorithm combining pre-training with fine-tuning toward the posterior PoM

The authors propose Thompson Sampling via Fine-Tuning (ToSFiT), which initializes a generative policy using pre-trained LLMs with prompt conditioning and then carefully adapts it toward the posterior probability of maximality through fine-tuning, guided by their theoretical analysis.

Contribution

Empirical validation across three diverse optimization tasks

The authors demonstrate ToSFiT's effectiveness on three distinct tasks spanning natural language, biological sequences, and quantum circuits, showing it achieves state-of-the-art sample efficiency and computational efficiency compared to existing Bayesian optimization, reinforcement learning, and evolutionary methods.