Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs
Overview
Overall Novelty Assessment
The paper introduces Reinforcement Learning from Algorithm Feedback (RLAF) for training GNN-based branching policies in SAT solvers, using solver runtime as the reward signal and policy-gradient methods like GRPO. It resides in the 'Reinforcement Learning-Based CDCL Branching' leaf, which contains only three papers total (including this work and two siblings). This is a relatively sparse research direction within the broader taxonomy of 17 papers across multiple branches, suggesting the specific combination of RL training and CDCL integration remains an emerging area rather than a crowded subfield.
The taxonomy reveals several neighboring directions: supervised learning-based CDCL branching (one paper mimicking expert heuristics), domain-specific CDCL methods (one paper on logic equivalence checking), and local search GNN branching (one paper on stochastic algorithms). The paper's focus on RL-driven policy learning distinguishes it from supervised approaches that rely on labeled expert demonstrations. Broader branches address non-SAT combinatorial problems (graph optimization, neural network verification, constraint programming) and theoretical foundations (expressive power, multi-task learning), indicating the field spans both application-specific solver development and methodological inquiry into GNN capabilities.
Among the 28 candidates examined, the RLAF paradigm contribution has 3 refutable candidates out of the 10 compared against it, while the generic weight-injection mechanism has 2 out of 10. The one-shot GNN policy contribution appears most novel, with 0 refutable candidates among the 8 examined. These statistics reflect a limited semantic search scope, not exhaustive coverage. The presence of overlapping prior work on RL-based branching and weight parameterization suggests incremental refinement of existing ideas, though the specific integration of GRPO and one-shot variable parameterization may offer distinguishing technical details.
Based on the top-28 semantic matches and taxonomy structure, the work appears to advance an active but not densely populated research direction. The limited number of sibling papers and the sparse RL-based CDCL branch indicate room for methodological contributions, though the refutable candidates for two contributions signal that core ideas around RL training and weight injection have precedents. The analysis does not cover exhaustive citation networks or domain-specific venues, so additional related work may exist beyond this scope.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose RLAF, a training paradigm that uses reinforcement learning to train GNN-based policies for guiding SAT solver branching heuristics. The approach uses the solver's computational cost as the sole reward signal, eliminating the need for expert supervision.
The authors introduce a method that modifies existing SAT solver branching heuristics by incorporating external variable-wise weights and polarities. This mechanism scales the solver's original scoring function without sacrificing its inherent heuristic properties.
The authors develop a GNN-based policy that assigns weights and polarities to all variables in a single forward pass, avoiding costly repeated network evaluations. This one-shot formulation enables efficient training using standard policy-gradient methods like GRPO.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Improving SAT solver heuristics with graph networks and reinforcement learning PDF
[13] Can Q-Learning with Graph Networks Learn a Generalizable Branching Heuristic for a SAT Solver? PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Reinforcement Learning from Algorithm Feedback (RLAF) paradigm
The authors propose RLAF, a training paradigm that uses reinforcement learning to train GNN-based policies for guiding SAT solver branching heuristics. The approach uses the solver's computational cost as the sole reward signal, eliminating the need for expert supervision.
[1] Learning local search heuristics for boolean satisfiability PDF
[4] Improving SAT solver heuristics with graph networks and reinforcement learning PDF
[13] Can Q-Learning with Graph Networks Learn a Generalizable Branching Heuristic for a SAT Solver? PDF
[8] Neural Guidance in Constraint Solvers PDF
[11] A minimalist approach to deep multi-task learning PDF
[18] Learning splitting heuristics in divide-and-conquer SAT solvers with reinforcement learning PDF
[19] A Deep Reinforcement Learning Heuristic for SAT based on Antagonist Graph Neural Networks PDF
[20] A Survey of Machine Learning Approaches in Logic Synthesis PDF
[21] LangSAT: A Novel Framework Combining NLP and Reinforcement Learning for SAT Solving PDF
[22] Applications of Q-Learning to Network Optimization and Graph Problems PDF
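The RLAF idea described above can be sketched as a GRPO-style policy-gradient step: for each formula, a group of weight/polarity assignments is sampled from the policy, the solver is run once per sample, and advantages are computed by normalizing the negative runtimes within the group. This is a minimal illustrative sketch, not the paper's implementation; `sample_fn`, `solve_fn`, and `policy_logprob_fn` are hypothetical placeholder interfaces.

```python
import math

def grpo_advantages(runtimes):
    """Group-normalized advantages: reward = -runtime, standardized within
    the group of samples drawn for one problem instance."""
    rewards = [-t for t in runtimes]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def train_step(policy_logprob_fn, instance, sample_fn, solve_fn, group_size=4):
    """One RLAF-style step on a single instance.

    Hypothetical interfaces (assumptions, not the paper's API):
      sample_fn(instance)            -> one weight/polarity assignment
      solve_fn(instance, sample)     -> solver runtime under that assignment
      policy_logprob_fn(instance, s) -> log-probability of sample s
    Returns the scalar policy-gradient surrogate loss (to be minimized).
    """
    samples = [sample_fn(instance) for _ in range(group_size)]
    runtimes = [solve_fn(instance, s) for s in samples]
    advs = grpo_advantages(runtimes)
    # Surrogate: -sum(A_i * log pi(a_i | instance)); no expert labels needed,
    # solver runtime is the only feedback signal.
    return -sum(a * policy_logprob_fn(instance, s) for a, s in zip(advs, samples))
```

Because the reward is standardized within each group, faster-than-average solves receive positive advantages and slower ones negative, without needing a learned value baseline.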
Generic mechanism for injecting variable weights into branching heuristics
The authors introduce a method that modifies existing SAT solver branching heuristics by incorporating external variable-wise weights and polarities. This mechanism scales the solver's original scoring function without sacrificing its inherent heuristic properties.
[28] Learning rate based branching heuristic for SAT solvers PDF
[34] Additive versus multiplicative clause weighting for SAT PDF
[26] Enhancing SAT Solving with GNN for Clause Weight Prediction PDF
[29] RASLite: Enhancing (W)PMS Solvers Through Dynamic Initial Weight Approach PDF
[30] A dynamic clause specific initial weight assignment for solving satisfiability problems using local search PDF
[31] Exponential recency weighted average branching heuristic for SAT solvers PDF
[32] Solving (Weighted) Partial MaxSAT by Dynamic Local Search for SAT PDF
[33] Dynamic branching in qualitative constraint networks via counting local models PDF
[35] MathSAT: Tight Integration of SAT and Mathematical Decision Procedures PDF
[36] Phase selection heuristics for satisfiability solvers PDF
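The weight-injection mechanism can be illustrated with a toy example, assuming a VSIDS-style activity table: each variable's base score is multiplied by an externally supplied positive weight, and the branch polarity is read from an external per-variable table. This is a simplified sketch of the general idea, not the paper's solver integration; the helper names are hypothetical.

```python
def weighted_scores(activity, weights):
    """Scale each variable's base branching score (e.g., VSIDS activity) by an
    external positive weight. Missing weights default to 1.0, so unweighted
    variables keep their original score and the heuristic's dynamics survive."""
    return {v: s * weights.get(v, 1.0) for v, s in activity.items()}

def pick_branch(activity, weights, polarities):
    """Branch on the variable with the highest weighted score; its polarity
    comes from the external table (defaulting to a positive phase)."""
    scores = weighted_scores(activity, weights)
    v = max(scores, key=scores.get)
    return v, polarities.get(v, True)
```

With multiplicative scaling, a weight of 1.0 for every variable recovers the solver's unmodified behavior, which is why the mechanism composes with any score-based branching heuristic.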
One-shot GNN-based policy for variable parameterization
The authors develop a GNN-based policy that assigns weights and polarities to all variables in a single forward pass, avoiding costly repeated network evaluations. This one-shot formulation enables efficient training using standard policy-gradient methods like GRPO.
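The one-shot formulation can be illustrated with a toy message-passing network over the variable-clause bipartite graph of a CNF formula. The following NumPy sketch uses random, untrained parameters and is not the paper's architecture; it only shows how a single forward pass yields a positive weight and a polarity probability for every variable simultaneously, with no per-decision network calls.

```python
import numpy as np

def one_shot_policy(n_vars, clauses, dim=8, rounds=2, seed=0):
    """Toy one-shot policy: a few rounds of variable<->clause message passing,
    then two linear heads mapping each variable embedding to a positive weight
    (via exp) and a polarity probability (via sigmoid). Clauses are lists of
    signed DIMACS-style literals, e.g. [1, -2] means (x1 OR NOT x2)."""
    rng = np.random.default_rng(seed)
    var_h = rng.standard_normal((n_vars, dim))
    W_v = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    W_c = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    for _ in range(rounds):
        # Clauses aggregate embeddings of their variables...
        cls_h = np.tanh(
            np.stack([var_h[[abs(l) - 1 for l in c]].mean(0) for c in clauses]) @ W_c
        )
        # ...then variables aggregate messages from their clauses.
        msgs = np.zeros_like(var_h)
        cnt = np.zeros(n_vars)
        for j, c in enumerate(clauses):
            for l in c:
                msgs[abs(l) - 1] += cls_h[j]
                cnt[abs(l) - 1] += 1
        var_h = np.tanh((msgs / np.maximum(cnt, 1)[:, None]) @ W_v)
    head_w = rng.standard_normal(dim)
    head_p = rng.standard_normal(dim)
    weights = np.exp(var_h @ head_w)                    # positive weights
    polarities = 1.0 / (1.0 + np.exp(-(var_h @ head_p)))  # probabilities in (0, 1)
    return weights, polarities
```

Because the entire parameterization is produced before the solver starts, the policy's cost is amortized over the whole run, which is what makes policy-gradient training with runtime rewards tractable.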