NePTune: A Neuro-Pythonic Framework for Tunable Compositional Reasoning on Vision-Language

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.0

Neuro-Symbolic, Vision and Language, Compositional Reasoning

Abstract:

Modern Vision-Language Models (VLMs) have achieved impressive performance in various tasks, yet they often struggle with compositional reasoning, the ability to decompose and recombine concepts to solve novel problems. While neuro-symbolic approaches offer a promising direction, they are typically constrained by crisp logical execution or predefined predicates, which limit flexibility. In this work, we introduce NePTune, a neuro-symbolic framework that overcomes these limitations through a hybrid execution model that integrates the perception capabilities of foundation vision models with the compositional expressiveness of symbolic reasoning. NePTune dynamically translates natural language queries into executable Python programs that blend imperative control flow with soft logic operators capable of reasoning over VLM-generated uncertainty. Operating in a training-free manner, NePTune, with a modular design, decouples perception from reasoning, yet its differentiable operations support fine-tuning. We evaluate NePTune on multiple visual reasoning benchmarks and various domains, utilizing adversarial tests, and demonstrate a significant improvement over strong base models, as well as its effective compositional generalization and adaptation capabilities in novel environments.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces NePTune, a neuro-symbolic framework that translates natural language queries into executable Python programs combining imperative control flow with soft logic operators for compositional visual reasoning. It resides in the 'Program Synthesis and Modular Execution' leaf, which contains four papers total (including NePTune itself). This leaf sits within the broader 'Neuro-Symbolic and Prompting-Based Compositional Reasoning' branch, indicating a moderately populated research direction focused on training-free or minimally-trained approaches. The taxonomy shows this is an active but not overcrowded area, with sibling works like Visual Programming and Visual Program Distillation establishing the paradigm of synthesizing programs to orchestrate vision modules.

The taxonomy reveals several neighboring research directions that contextualize NePTune's positioning. Adjacent leaves include 'Chain-of-Thought and Structured Prompting Strategies' (six papers) and 'Natural Language Inference and Linguistic Decomposition' (two papers), both exploring structured reasoning without full program synthesis. Further afield, the 'Training-Based Improvement' branch encompasses contrastive learning, reinforcement learning, and architectural modifications—approaches that require substantial training, unlike NePTune's training-free design. The taxonomy's scope and exclude notes clarify that NePTune belongs in program synthesis rather than prompting strategies because it generates executable code rather than reasoning traces alone, and its training-free nature distinguishes it from methods requiring fine-tuning.

The contribution-level analysis examined 30 candidate papers across three contributions, with 10 candidates per contribution. None of the contributions were clearly refuted by the examined literature. For 'Hybrid Neuro-Symbolic Execution Model', all 10 candidates were non-refutable or unclear; similarly, 'Domain Adaptable Framework with Zero-Shot Generalization' and 'Strong Compositional Generalization Capabilities' each showed 10 non-refutable candidates. This suggests that among the limited set of 30 semantically similar papers examined, no single work directly overlaps with NePTune's specific combination of soft logic operators, differentiable operations, and modular execution. However, the scale of this search—30 candidates from top-K retrieval—means the analysis captures immediate neighbors rather than exhaustive prior work.

Given the limited search scope of 30 candidates, the paper appears to occupy a relatively distinct position within the program synthesis subfield. The absence of clear refutations across all three contributions, combined with the taxonomy showing only four papers in this specific leaf, suggests the work introduces novel technical elements. However, this assessment is constrained by the

Taxonomy


Research Landscape Overview

Core task: compositional reasoning in vision-language models. The field addresses how models can understand and reason about complex visual scenes by composing simpler concepts (such as objects, attributes, spatial relations, and actions) into coherent interpretations. The taxonomy reflects several complementary research directions: one branch focuses on evaluation and benchmarking (e.g., Winoground[28], Compositional Grounding Challenges[7]), establishing testbeds that reveal where models struggle with fine-grained compositional distinctions. Another branch investigates training strategies and data augmentation (Synthetic Preference Compositional[1], Multimodal Synthetic Data[24]) to improve compositional generalization. A third branch explores neuro-symbolic and prompting-based methods, including program synthesis and modular execution approaches that decompose reasoning into interpretable steps. Additional branches examine model architecture and representation (Text Encoders Bottleneck[43], Visual Structures Binding[34]), domain-specific applications (NavGPT[9], UniChart[44]), multimodal model composition (Model Composition Multimodal[13]), and adversarial robustness (Jailbreak Pieces[4]).

Within the neuro-symbolic and prompting-based line of work, a particularly active theme is the use of program synthesis to break down complex queries into modular, executable steps, an approach exemplified by Visual Programming[23] and Visual Program Distillation[6]. These methods aim to leverage the structured reasoning capabilities of language models while grounding execution in specialized vision modules. NePTune[0] sits squarely in this cluster, emphasizing modular execution pipelines that synthesize programs for compositional visual reasoning. Compared to Visual Programming[23], which pioneered the idea of generating Python-like code to orchestrate vision APIs, NePTune[0] extends the paradigm by integrating more sophisticated program structures and reasoning traces. Meanwhile, Reasoning Scaling Generating[12] explores how scaling inference-time computation can enhance compositional problem-solving, highlighting a complementary direction that balances symbolic structure with neural flexibility. Together, these works illustrate ongoing efforts to marry interpretability, modularity, and performance in compositional vision-language understanding.

Claimed Contributions

Hybrid Neuro-Symbolic Execution Model

10 retrieved papers

The authors introduce a framework that integrates imperative Python control flow with soft compositional logic operations based on fuzzy logic principles. This hybrid approach enables reasoning over VLM-generated uncertainty scores while maintaining the expressive power of a general-purpose programming language.
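To make the idea of soft logic operators inside ordinary Python control flow concrete, the following is a minimal sketch only. It assumes the product t-norm and its dual (standard fuzzy-logic choices), which may differ from the operators NePTune actually uses. The function `vlm_score`, the table `FAKE_VLM_SCORES`, and the example query are hypothetical stand-ins for a real VLM call and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Soft truth values live in [0, 1]. These are textbook fuzzy-logic operators,
# assumed here for illustration; the paper's exact definitions are not given
# in this report.
def soft_and(a: float, b: float) -> float:
    """Product t-norm: soft conjunction."""
    return a * b

def soft_or(a: float, b: float) -> float:
    """Probabilistic sum: soft disjunction (dual of the product t-norm)."""
    return a + b - a * b

def soft_not(a: float) -> float:
    """Standard fuzzy negation."""
    return 1.0 - a

def soft_exists(scores: List[float]) -> float:
    """Soft existential quantifier: fold soft_or over all scores."""
    result = 0.0
    for s in scores:
        result = soft_or(result, s)
    return result

# Hypothetical stand-in for VLM confidence scores (object, predicate) -> [0, 1].
FAKE_VLM_SCORES: Dict[Tuple[str, str], float] = {
    ("obj0", "is red"): 0.9,
    ("obj0", "is a cube"): 0.8,
    ("obj0", "is on the table"): 0.5,
    ("obj1", "is red"): 0.1,
    ("obj1", "is a cube"): 0.9,
    ("obj1", "is on the table"): 0.2,
}

def vlm_score(obj: str, predicate: str) -> float:
    """Hypothetical perception call; a real system would query a VLM here."""
    return FAKE_VLM_SCORES[(obj, predicate)]

def exists_red_cube_off_table(
    objects: List[str],
    score: Callable[[str, str], float] = vlm_score,
) -> float:
    """Query: 'Is there a red cube that is not on the table?'

    An ordinary Python loop (imperative control flow) drives the soft
    operators, so the answer is a graded score rather than a crisp boolean.
    """
    per_object = []
    for obj in objects:
        red_cube = soft_and(score(obj, "is red"), score(obj, "is a cube"))
        off_table = soft_not(score(obj, "is on the table"))
        per_object.append(soft_and(red_cube, off_table))
    return soft_exists(per_object)
```

Calling `exists_red_cube_off_table(["obj0", "obj1"])` returns a value in [0, 1] that can be thresholded or compared against the scores of competing answers. Because the loop is plain Python, a generated program could equally use counting, early exits, or helper functions around the same operators.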


Domain Adaptable Framework with Zero-Shot Generalization

10 retrieved papers

The authors develop a modular system where an LLM dynamically generates Python programs without requiring predefined predicates. The framework operates in a training-free manner for zero-shot tasks, yet its differentiable operations support fine-tuning for domain adaptation.
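The claim that differentiable operations support fine-tuning can be illustrated with a small sketch. It assumes, purely for illustration, that each perception score is `sigmoid(w * logit)` with a single learnable calibration weight `w`, and that the soft conjunction is a product. Neither assumption comes from the paper, and the names `forward`, `loss_and_grad`, and `fine_tune` are hypothetical. The point is only that a product-based soft operator has a closed-form gradient, so a loss on the program's output can be pushed back to upstream parameters.

```python
import math
from typing import List, Tuple

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def forward(w: float, x1: float, x2: float) -> float:
    """Soft AND (product) of two scores, each computed as sigmoid(w * logit)."""
    return sigmoid(w * x1) * sigmoid(w * x2)

def loss_and_grad(w: float, x1: float, x2: float, target: float) -> Tuple[float, float]:
    """Squared error of the soft-AND output and its derivative w.r.t. w."""
    s1, s2 = sigmoid(w * x1), sigmoid(w * x2)
    out = s1 * s2
    # Chain rule: d sigmoid(w * x) / dw = s * (1 - s) * x
    ds1 = s1 * (1.0 - s1) * x1
    ds2 = s2 * (1.0 - s2) * x2
    # Product rule for the soft AND
    dout = ds1 * s2 + s1 * ds2
    loss = (out - target) ** 2
    return loss, 2.0 * (out - target) * dout

def fine_tune(
    w: float, x1: float, x2: float, target: float,
    lr: float = 0.5, steps: int = 50,
) -> Tuple[float, List[float]]:
    """Plain gradient descent on w; returns the final w and the loss history."""
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(w, x1, x2, target)
        history.append(loss)
        w -= lr * grad
    return w, history
```

With positive logits and a target of 1.0, repeated steps increase `w` and shrink the loss. A crisp boolean `and` would give no such gradient signal, which is the design reason for preferring soft operators when adaptation is wanted.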


Strong Compositional Generalization Capabilities

10 retrieved papers

Through extensive experiments on multiple benchmarks including adversarial tests and domain-shift scenarios, the authors demonstrate that NePTune significantly outperforms existing methods in compositional reasoning and shows robust generalization to novel environments.


Core Task Comparisons

Comparisons with papers in the same taxonomy category

[6] Visual program distillation: Distilling tools and programmatic reasoning into vision-language models

Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman (2024)

[12] Reasoning, scaling, generating with vision-language models

Z Wang (2024)

[23] Visual programming: Compositional visual reasoning without training

Tanmay Gupta, Aniruddha Kembhavi (2023)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Hybrid Neuro-Symbolic Execution Model

[61] Continuum-Interaction-Driven Intelligence: Human-Aligned Neural Architecture via Crystallized Reasoning and Fluid Generation

Cannot Refute

[62] Ergo: a quest for declarativity in logic programming

Cannot Refute

[63] Equipping robot control programs with first-order probabilistic reasoning capabilities

Cannot Refute

[64] Covering Designers' Bayes-ic Needs: Probabilistic Semantics for Structured Design Spaces

Cannot Refute

[65] Declarative Modelling and Reasoning for Combinatorial Problem Solving and Argumentation under Uncertainty

Cannot Refute

[66] Approximate verification in an open source world

Cannot Refute

[67] A framework for engineering intelligent control systems

Cannot Refute

[68] Towards a General Knowledge Representation Language

Cannot Refute

[69] Classification and Fitness Evaluation using Fuzzy Logic Based Approach

Cannot Refute

[70] WP1-Risk-Informed Decision Making (RIDM)

Cannot Refute

Contribution

Domain Adaptable Framework with Zero-Shot Generalization

[51] CodeGen: An open large language model for code with multi-turn program synthesis

Cannot Refute

[52] Compositional exemplars for in-context learning

Cannot Refute

[53] BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Cannot Refute

[54] LLM-guided compositional program synthesis

Cannot Refute

[55] Exploration and adaptation of large language models for specialized domains

Cannot Refute

[56] Compositional task representations for large language models

Cannot Refute

[57] mmT5: Modular multilingual pre-training solves source language hallucinations

Cannot Refute

[58] Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

Cannot Refute

[59] Compositional Hardness of Code in Large Language Models--A Probabilistic Perspective

Cannot Refute

[60] Uncovering LLMs for service-composition: challenges and opportunities

Cannot Refute

Contribution

Strong Compositional Generalization Capabilities

[7] Investigating compositional challenges in vision-language models for visual grounding

Cannot Refute

[10] Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model

Cannot Refute

[21] In-context compositional generalization for large vision-language models

Cannot Refute

[45] CF-VLM: CounterFactual Vision-Language Fine-tuning

Cannot Refute

[71] CREPE: Can vision-language foundation models reason compositionally?

Cannot Refute

[72] Delving into Out-of-Distribution Detection with Vision-Language Representations

Cannot Refute

[73] Enhancing Compositional Generalization via Compositional Feature Alignment

Cannot Refute

[74] Reason-RFT: Reinforcement fine-tuning for visual reasoning of vision language models

Cannot Refute

[75] Causal Attention for Vision-Language Tasks

Cannot Refute

[76] Compositional Kronecker Context Optimization for vision-language models

Cannot Refute
