NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language

ICLR 2026 Conference Submission
Anonymous Authors
Neuro-Symbolic, Vision and Language, Compositional Reasoning
Abstract:

Modern Vision-Language Models (VLMs) have achieved impressive performance in various tasks, yet they often struggle with compositional reasoning, the ability to decompose and recombine concepts to solve novel problems. While neuro-symbolic approaches offer a promising direction, they are typically constrained by crisp logical execution or predefined predicates, which limit flexibility. In this work, we introduce NePTune, a neuro-symbolic framework that overcomes these limitations through a hybrid execution model that integrates the perception capabilities of foundation vision models with the compositional expressiveness of symbolic reasoning. NePTune dynamically translates natural language queries into executable Python programs that blend imperative control flow with soft logic operators capable of reasoning over VLM-generated uncertainty. Operating in a training-free manner, NePTune, with a modular design, decouples perception from reasoning, yet its differentiable operations support fine-tuning. We evaluate NePTune on multiple visual reasoning benchmarks and various domains, utilizing adversarial tests, and demonstrate a significant improvement over strong base models, as well as its effective compositional generalization and adaptation capabilities in novel environments.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces NePTune, a neuro-symbolic framework that translates natural language queries into executable Python programs combining imperative control flow with soft logic operators for compositional visual reasoning. It resides in the 'Program Synthesis and Modular Execution' leaf, which contains four papers total (including NePTune itself). This leaf sits within the broader 'Neuro-Symbolic and Prompting-Based Compositional Reasoning' branch, indicating a moderately populated research direction focused on training-free or minimally-trained approaches. The taxonomy shows this is an active but not overcrowded area, with sibling works like Visual Programming and Visual Program Distillation establishing the paradigm of synthesizing programs to orchestrate vision modules.

The taxonomy reveals several neighboring research directions that contextualize NePTune's positioning. Adjacent leaves include 'Chain-of-Thought and Structured Prompting Strategies' (six papers) and 'Natural Language Inference and Linguistic Decomposition' (two papers), both exploring structured reasoning without full program synthesis. Further afield, the 'Training-Based Improvement' branch encompasses contrastive learning, reinforcement learning, and architectural modifications—approaches that require substantial training, unlike NePTune's training-free design. The taxonomy's scope and exclude notes clarify that NePTune belongs in program synthesis rather than prompting strategies because it generates executable code rather than reasoning traces alone, and its training-free nature distinguishes it from methods requiring fine-tuning.

The contribution-level analysis examined 30 candidate papers across three contributions, with 10 candidates per contribution. None of the contributions were clearly refuted by the examined literature. For 'Hybrid Neuro-Symbolic Execution Model', all 10 candidates were non-refutable or unclear; similarly, 'Domain Adaptable Framework with Zero-Shot Generalization' and 'Strong Compositional Generalization Capabilities' each showed 10 non-refutable candidates. This suggests that among the limited set of 30 semantically similar papers examined, no single work directly overlaps with NePTune's specific combination of soft logic operators, differentiable operations, and modular execution. However, the scale of this search—30 candidates from top-K retrieval—means the analysis captures immediate neighbors rather than exhaustive prior work.

Given the limited search scope of 30 candidates, the paper appears to occupy a relatively distinct position within the program synthesis subfield. The absence of clear refutations across all three contributions, combined with the taxonomy showing only four papers in this specific leaf, suggests the work introduces novel technical elements. However, this assessment is constrained by the

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: compositional reasoning in vision-language models. The field addresses how models can understand and reason about complex visual scenes by composing simpler concepts (such as objects, attributes, spatial relations, and actions) into coherent interpretations. The taxonomy reflects several complementary research directions: one branch focuses on evaluation and benchmarking (e.g., Winoground[28], Compositional Grounding Challenges[7]), establishing testbeds that reveal where models struggle with fine-grained compositional distinctions. Another branch investigates training strategies and data augmentation (Synthetic Preference Compositional[1], Multimodal Synthetic Data[24]) to improve compositional generalization. A third branch explores neuro-symbolic and prompting-based methods, including program synthesis and modular execution approaches that decompose reasoning into interpretable steps. Additional branches examine model architecture and representation (Text Encoders Bottleneck[43], Visual Structures Binding[34]), domain-specific applications (NavGPT[9], UniChart[44]), multimodal model composition (Model Composition Multimodal[13]), and adversarial robustness (Jailbreak Pieces[4]).

Within the neuro-symbolic and prompting-based line of work, a particularly active theme is the use of program synthesis to break down complex queries into modular, executable steps, an approach exemplified by Visual Programming[23] and Visual Program Distillation[6]. These methods aim to leverage the structured reasoning capabilities of language models while grounding execution in specialized vision modules. NePTune[0] sits squarely in this cluster, emphasizing modular execution pipelines that synthesize programs for compositional visual reasoning. Compared to Visual Programming[23], which pioneered the idea of generating Python-like code to orchestrate vision APIs, NePTune[0] extends the paradigm by integrating more sophisticated program structures and reasoning traces. Meanwhile, Reasoning Scaling Generating[12] explores how scaling inference-time computation can enhance compositional problem-solving, highlighting a complementary direction that balances symbolic structure with neural flexibility. Together, these works illustrate ongoing efforts to marry interpretability, modularity, and performance in compositional vision-language understanding.
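
To make the program-synthesis pattern described above concrete, the following is a minimal Python sketch of the general idea: a language model emits a short program, and an executor runs it against vision modules. It is not the pipeline of Visual Programming, Visual Program Distillation, or NePTune. The names `fake_llm_generate_program`, `detect`, `has_attribute`, `run_program`, and the toy `SCENE` are all hypothetical stand-ins introduced for illustration, and the "LLM" simply returns a fixed string.

```python
from typing import Any, Dict, List

Scene = List[Dict[str, Any]]


def fake_llm_generate_program(query: str) -> str:
    """Hypothetical stand-in for an LLM call: returns a fixed program string.

    A real system would prompt a language model with the query and an API
    description; here the output is hard-coded for illustration.
    """
    return (
        "def answer(scene):\n"
        "    count = 0\n"
        "    for obj in detect(scene, 'dog'):\n"
        "        if has_attribute(obj, 'brown'):\n"
        "            count += 1\n"
        "    return count\n"
    )


def detect(scene: Scene, category: str) -> Scene:
    """Hypothetical vision module: look up objects in a pre-annotated scene."""
    return [obj for obj in scene if obj["category"] == category]


def has_attribute(obj: Dict[str, Any], attribute: str) -> bool:
    """Hypothetical vision module: crisp attribute check on a toy annotation."""
    return attribute in obj["attributes"]


def run_program(source: str, scene: Scene) -> Any:
    """Execute generated source with only the vision modules in scope.

    Assumption: the generated code is trusted. A real system would need
    sandboxing before calling exec on model output.
    """
    namespace: Dict[str, Any] = {"detect": detect, "has_attribute": has_attribute}
    exec(source, namespace)
    return namespace["answer"](scene)


# Toy scene annotations, made up for this example.
SCENE: Scene = [
    {"category": "dog", "attributes": ["brown", "small"]},
    {"category": "dog", "attributes": ["white"]},
    {"category": "dog", "attributes": ["brown"]},
    {"category": "cat", "attributes": ["brown"]},
]

program = fake_llm_generate_program("How many brown dogs are there?")
result = run_program(program, SCENE)
```

In this sketch the modules return crisp values, which is the limitation the abstract attributes to earlier neuro-symbolic approaches; the report describes NePTune as replacing such crisp checks with soft operators over VLM confidence scores.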

Claimed Contributions

Hybrid Neuro-Symbolic Execution Model

The authors introduce a framework that integrates imperative Python control flow with soft compositional logic operations based on fuzzy logic principles. This hybrid approach enables reasoning over VLM-generated uncertainty scores while maintaining the expressive power of a general-purpose programming language.
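
As a rough illustration of what mixing imperative control flow with fuzzy-logic operators could look like, consider the Python sketch below. It is an assumption-laden example, not the authors' implementation: the report does not say which fuzzy operators NePTune uses, so the product t-norm and probabilistic sum are chosen here only because they are common choices. `vlm_score`, `FAKE_SCORES`, and `is_there_a_red_cube` are hypothetical names, and the confidence values are made up.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical VLM confidences in [0, 1] for (object, concept) pairs.
# These numbers are invented for illustration only.
FAKE_SCORES: Dict[Tuple[str, str], float] = {
    ("obj1", "red"): 0.9, ("obj1", "cube"): 0.8,
    ("obj2", "red"): 0.2, ("obj2", "cube"): 0.95,
    ("obj3", "red"): 0.85, ("obj3", "cube"): 0.1,
}


def vlm_score(obj: str, concept: str) -> float:
    """Hypothetical stand-in for querying a VLM about one object."""
    return FAKE_SCORES.get((obj, concept), 0.0)


def soft_and(a: float, b: float) -> float:
    """Product t-norm (an assumed choice of fuzzy conjunction)."""
    return a * b


def soft_or(a: float, b: float) -> float:
    """Probabilistic sum (the t-conorm dual to the product t-norm)."""
    return a + b - a * b


def soft_not(a: float) -> float:
    """Standard fuzzy negation."""
    return 1.0 - a


def soft_exists(scores: Iterable[float]) -> float:
    """Soft existential quantifier: fold soft_or over all scores."""
    result = 0.0
    for score in scores:
        result = soft_or(result, score)
    return result


def is_there_a_red_cube(objects: List[str]) -> float:
    """Ordinary Python loop whose body combines uncertain scores softly."""
    per_object = []
    for obj in objects:
        per_object.append(soft_and(vlm_score(obj, "red"), vlm_score(obj, "cube")))
    return soft_exists(per_object)


score = is_there_a_red_cube(["obj1", "obj2", "obj3"])
```

The loop and the list are plain Python, while every truth value stays a number in [0, 1], so uncertainty from the perception step is carried through to the final answer rather than being thresholded early.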

10 retrieved papers
Domain Adaptable Framework with Zero-Shot Generalization

The authors develop a modular system where an LLM dynamically generates Python programs without requiring predefined predicates. The framework operates in a training-free manner for zero-shot tasks, yet its differentiable operations support fine-tuning for domain adaptation.
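
The claim that differentiable operations support fine-tuning can be illustrated with a small, self-contained Python sketch. Everything here is an assumption made for the example: the report does not describe which parameters NePTune tunes or how, so the sketch invents a two-parameter calibration (`a`, `b`) applied to raw perception scores, uses a product-style soft AND, and computes gradients by hand. `DATA`, `forward`, `mean_loss`, and `train` are hypothetical names, and the data are made up.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# (raw score for concept 1, raw score for concept 2, target truth value).
# Invented values, for illustration only.
DATA: List[Tuple[float, float, float]] = [
    (2.0, 1.5, 1.0),
    (-1.0, 2.0, 0.0),
    (0.5, -2.0, 0.0),
]


def forward(a: float, b: float, z1: float, z2: float) -> Tuple[float, float, float]:
    """Calibrate two raw scores, then combine them with a soft AND (product)."""
    s1 = sigmoid(a * z1 + b)
    s2 = sigmoid(a * z2 + b)
    return s1, s2, s1 * s2


def mean_loss(a: float, b: float) -> float:
    """Mean squared error between the soft AND output and the target."""
    total = 0.0
    for z1, z2, target in DATA:
        total += (forward(a, b, z1, z2)[2] - target) ** 2
    return total / len(DATA)


def train(steps: int = 500, lr: float = 0.1) -> Tuple[float, float, List[float]]:
    """Plain gradient descent on (a, b) using hand-derived gradients."""
    a, b = 1.0, 0.0
    losses = [mean_loss(a, b)]
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z1, z2, target in DATA:
            s1, s2, out = forward(a, b, z1, z2)
            d_out = 2.0 * (out - target)       # d loss / d out
            d_u1 = d_out * s2 * s1 * (1 - s1)  # chain rule through product and sigmoid
            d_u2 = d_out * s1 * s2 * (1 - s2)
            grad_a += d_u1 * z1 + d_u2 * z2
            grad_b += d_u1 + d_u2
        a -= lr * grad_a / len(DATA)
        b -= lr * grad_b / len(DATA)
        losses.append(mean_loss(a, b))
    return a, b, losses


a_final, b_final, losses = train()
```

Because the soft AND is a smooth function of its inputs, the error at the program's output can be propagated back to the calibration parameters, which is the property a crisp boolean `and` would not provide. In practice an automatic-differentiation library would replace the hand-written gradients.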

10 retrieved papers
Strong Compositional Generalization Capabilities

Through extensive experiments on multiple benchmarks including adversarial tests and domain-shift scenarios, the authors demonstrate that NePTune significantly outperforms existing methods in compositional reasoning and shows robust generalization to novel environments.

10 retrieved papers
