Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs

ICLR 2026 Conference Submission
Anonymous Authors
Reinforcement Learning, Graph Neural Networks, Combinatorial Optimization, SAT Solving, GNNs, Graph Learning, GRPO
Abstract:

Boolean Satisfiability (SAT) solvers are foundational to computer science, yet their performance typically hinges on hand-crafted heuristics. This work introduces Reinforcement Learning from Algorithm Feedback (RLAF) as a paradigm for learning to guide SAT solver branching heuristics with Graph Neural Networks (GNNs). Central to our approach is a novel and generic mechanism for injecting inferred variable weights and polarities into the branching heuristics of existing SAT solvers. In a single forward pass, a GNN assigns these parameters to all variables. Casting this one-shot guidance as a reinforcement learning problem lets us train the GNN with off-the-shelf policy-gradient methods, such as GRPO, directly using the solver's computational cost as the sole reward signal. Extensive evaluations demonstrate that RLAF-trained policies significantly reduce the mean solve times of different base solvers across diverse SAT problem distributions, achieving more than a 2x speedup in some cases, while generalizing effectively to larger and harder problems after training. Notably, these policies consistently outperform expert-supervised approaches based on learning handcrafted weighting heuristics, offering a promising path towards data-driven heuristic design in combinatorial optimization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Reinforcement Learning from Algorithm Feedback (RLAF) for training GNN-based branching policies in SAT solvers, using solver runtime as the reward signal and policy-gradient methods like GRPO. It resides in the 'Reinforcement Learning-Based CDCL Branching' leaf, which contains only three papers total (including this work and two siblings). This is a relatively sparse research direction within the broader taxonomy of 17 papers across multiple branches, suggesting the specific combination of RL training and CDCL integration remains an emerging area rather than a crowded subfield.

The taxonomy reveals several neighboring directions: supervised learning-based CDCL branching (one paper mimicking expert heuristics), domain-specific CDCL methods (one paper on logic equivalence checking), and local search GNN branching (one paper on stochastic algorithms). The paper's focus on RL-driven policy learning distinguishes it from supervised approaches that rely on labeled expert demonstrations. Broader branches address non-SAT combinatorial problems (graph optimization, neural network verification, constraint programming) and theoretical foundations (expressive power, multi-task learning), indicating the field spans both application-specific solver development and methodological inquiry into GNN capabilities.

Across the 28 candidates examined, the RLAF paradigm contribution has 3 refutable candidates out of 10, while the generic weight-injection mechanism has 2 refutable candidates out of 10. The one-shot GNN policy contribution appears more novel, with 0 refutable candidates among 8 examined. These statistics reflect a limited semantic search scope, not exhaustive coverage. The presence of some overlapping prior work on RL-based branching and weight parameterization suggests incremental refinement of existing ideas, though the specific integration of GRPO and one-shot variable parameterization may offer distinguishing technical details.

Based on the top-28 semantic matches and taxonomy structure, the work appears to advance an active but not densely populated research direction. The limited number of sibling papers and the sparse RL-based CDCL branch indicate room for methodological contributions, though the refutable candidates for two contributions signal that core ideas around RL training and weight injection have precedents. The analysis does not cover exhaustive citation networks or domain-specific venues, so additional related work may exist beyond this scope.

Taxonomy

Core-task Taxonomy Papers: 17
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 5

Research Landscape Overview

Core task: Learning to guide SAT solver branching heuristics with graph neural networks. The field has evolved around several complementary directions. The main branches include GNN-based branching heuristics specifically designed for SAT solvers, extensions to non-SAT combinatorial problems such as mixed-integer programming and constraint satisfaction, and foundational work on theoretical expressiveness and methodological principles. Within the SAT-focused branch, a key distinction emerges between approaches that integrate GNNs directly into conflict-driven clause learning (CDCL) solvers and those that explore alternative architectures or extended formalisms like quantified Boolean formulas.

Representative works such as GNN Reinforcement SAT[4] and Q-Learning Branching Heuristic[13] illustrate reinforcement learning strategies for CDCL integration, while studies like GNN Branching Capacity[3] and GNN Expressive Power[10] examine the representational limits of graph neural networks in capturing solver dynamics. Meanwhile, branches addressing non-SAT domains (e.g., Neural Branch and Bound[7], Targeted Branching MIS[6]) demonstrate how similar GNN principles transfer to broader combinatorial settings, and survey-oriented efforts (Data-Driven Combinatorial Solvers[15]) provide comparative perspectives across methods.

Particularly active lines of work center on reinforcement learning-based CDCL branching and the interplay between learned heuristics and traditional solver feedback. Algorithm Feedback GNN[0] sits within the reinforcement learning-based CDCL branching cluster, emphasizing how algorithm-level feedback can refine GNN policies during search. This contrasts with nearby approaches like GNN Reinforcement SAT[4], which pioneered RL integration but may rely on simpler reward signals, and Q-Learning Branching Heuristic[13], which explores tabular or value-based methods rather than policy gradients.
A recurring theme across these studies is the trade-off between sample efficiency and generalization: some methods prioritize rapid adaptation to specific problem families, while others aim for broader transferability. Open questions include how to best incorporate clause-learning dynamics into GNN architectures and whether hybrid strategies that blend learned and hand-crafted heuristics can outperform purely data-driven policies.

Claimed Contributions

Reinforcement Learning from Algorithm Feedback (RLAF) paradigm

The authors propose RLAF, a training paradigm that uses reinforcement learning to train GNN-based policies for guiding SAT solver branching heuristics. The approach uses the solver's computational cost as the sole reward signal, eliminating the need for expert supervision.
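The reward construction behind this claim can be illustrated with a minimal sketch. Assuming the reward is simply the negated solver cost (e.g. number of decisions or runtime) and that, as in GRPO, advantages are normalized within a group of rollouts on the same formula, the group update statistic reduces to the following. The function name and the choice of cost metric are illustrative, not the paper's exact interface:

```python
def group_advantages(costs):
    """Turn raw solver costs from one formula's rollout group into
    GRPO-style normalized advantages (lower cost -> higher advantage)."""
    rewards = [-c for c in costs]            # sole reward: negative solver cost
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0                  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]
```

Each sampled parameterization's log-probability would then be weighted by its advantage in a standard policy-gradient step; no expert labels enter anywhere, only the solver's own cost.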

10 retrieved papers
Can Refute
Generic mechanism for injecting variable weights into branching heuristics

The authors introduce a method that modifies existing SAT solver branching heuristics by incorporating external variable-wise weights and polarities. This mechanism scales the solver's original scoring function without sacrificing its inherent heuristic properties.
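A minimal sketch of this injection idea, assuming a VSIDS-style activity score: the external weight multiplies the solver's native score, and the learned polarity fixes the phase of the chosen variable. All names (`activity`, `weights`, `polarities`) are hypothetical stand-ins, not the paper's actual solver interface:

```python
def pick_branching_literal(activity, weights, polarities, unassigned):
    """Choose the next decision literal under external guidance.

    activity:   dict var -> solver's native branching score (e.g. VSIDS)
    weights:    dict var -> GNN-inferred multiplicative weight (> 0)
    polarities: dict var -> GNN-inferred preferred sign (+1 or -1)
    unassigned: iterable of currently unassigned variables
    """
    # Multiplicative scaling preserves the native heuristic's dynamics:
    # among equally weighted variables, the solver's own ordering survives.
    best = max(unassigned, key=lambda v: weights[v] * activity[v])
    return best * polarities[best]  # signed literal: variable with chosen phase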

10 retrieved papers
Can Refute
One-shot GNN-based policy for variable parameterization

The authors develop a GNN-based policy that assigns weights and polarities to all variables in a single forward pass, avoiding costly repeated network evaluations. This one-shot formulation enables efficient training using standard policy-gradient methods like GRPO.
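The one-shot formulation can be sketched as follows. Here the single GNN forward pass is faked as a list of per-variable distribution parameters; the distribution families (log-normal weights, Bernoulli polarities) are illustrative assumptions, chosen only to show how one pass yields both a full parameterization and the joint log-probability a policy-gradient method needs:

```python
import math
import random

def sample_assignment(gnn_out, rng):
    """gnn_out: list of (mu, sigma, p_pos) per variable, from ONE forward pass.
    Returns (weights, polarities, log_prob of the sampled assignment)."""
    weights, polarities, log_prob = [], [], 0.0
    for mu, sigma, p_pos in gnn_out:
        z = rng.gauss(mu, sigma)             # weight = exp(z) keeps weights positive
        weights.append(math.exp(z))
        log_prob += -0.5 * math.log(2 * math.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2)
        pol = 1 if rng.random() < p_pos else -1
        polarities.append(pol)
        log_prob += math.log(p_pos if pol == 1 else 1.0 - p_pos)
    return weights, polarities, log_prob
```

Because sampling and scoring happen once per formula rather than once per branching decision, the network cost is amortized over the entire solver run.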

8 retrieved papers
Cannot Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Reinforcement Learning from Algorithm Feedback (RLAF) paradigm


Contribution

Generic mechanism for injecting variable weights into branching heuristics


Contribution

One-shot GNN-based policy for variable parameterization
