PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
Overview
Overall Novelty Assessment
PixelCraft proposes a multi-agent system combining a dispatcher, planner, reasoner, critics, and visual tool agents for structured image reasoning. The taxonomy places this work in the 'Multi-Agent Systems for Visual Reasoning' leaf, which currently contains only this paper among 50 total papers spanning 36 topics. This indicates a sparse, emerging research direction: the multi-agent collaborative approach to structured images remains largely unexplored within the broader visual reasoning landscape.
The taxonomy reveals neighboring approaches in compositional reasoning (Visual Programming, scene graph methods), chain-of-thought frameworks (MINT-CoT, ReFocus), and tool-augmented MLLMs. PixelCraft diverges from single-model compositional methods by distributing reasoning across specialized agents rather than generating unified programs. It differs from chain-of-thought approaches by maintaining a dynamic image memory and enabling agent discussion rather than linear reasoning chains. The scope notes indicate this work bridges modular reasoning with collaborative problem-solving, occupying a distinct position between end-to-end neural methods and purely symbolic decomposition.
Among the 22 candidates examined across the three contributions, none clearly refutes the proposed approach. For the multi-agent system architecture, 10 candidates were examined with no refuting overlap; for the pixel-level grounding model and tool agents, another 10 candidates likewise showed none; for the planner-managed image memory, 2 candidates were examined without refutation. Within this limited search scope, the specific combination of high-fidelity grounding, multi-agent collaboration, and dynamic image memory appears novel, though the analysis does not exhaustively cover prior work in agent-based visual reasoning or grounding methods.
Based on the top-22 semantic matches, the work appears to introduce a distinctive architectural approach in a sparse research area. The absence of refuting candidates reflects both the limited search scope and the relative novelty of applying multi-agent frameworks to structured image reasoning. However, the analysis cannot assess overlap with the broader agent-based systems literature or with comprehensive grounding surveys beyond the examined candidates.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce PixelCraft, a multi-agent framework comprising a dispatcher, planner, reasoner, critics, and visual tool agents. The system enables non-linear reasoning through query-aware agent selection, role-driven discussion, and iterative self-correction, supported by a planner-managed image memory that allows selective recall of prior visual states instead of streaming all images.
The authors develop a hybrid dataset of synthesized charts and annotated geometric diagrams to fine-tune Qwen2.5-VL-3B for precise pixel-level grounding. This grounding model provides accurate coordinates that drive classical CV operators within tool agents, enabling high-fidelity image processing for structured visual reasoning.
The authors introduce an image memory mechanism managed by the planner that stores intermediate visual outputs and allows selective recall of historical images. This design enables branching and backtracking in the reasoning process, departing from linear visual chain-of-thought approaches while reducing long-context overhead.
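The three contributions above interact in one loop: a dispatcher selects agents for the query, tool agents emit intermediate images, the planner stores them in memory, and the reasoner answers from selectively recalled states. A minimal sketch of that loop follows; the agent roles are from the paper, but every signature, the keyword-match dispatch, and the stubbed reasoner/critic are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a PixelCraft-style agent loop. All names and logic
# below the role labels are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class ImageMemory:
    """Planner-managed store of intermediate visual states."""
    states: dict = field(default_factory=dict)

    def save(self, key, image):
        self.states[key] = image

    def recall(self, keys):
        # Selective recall: return only the requested states, rather than
        # streaming every stored image back into the model context.
        return {k: self.states[k] for k in keys if k in self.states}


def run_query(query, image, tool_agents, max_rounds=3):
    memory = ImageMemory()
    memory.save("original", image)
    answer = None
    for round_id in range(max_rounds):
        # Dispatcher: query-aware agent selection (assumed: naive keyword match).
        selected = [a for a in tool_agents if a["skill"] in query]
        # Tool agents: produce intermediate images; the planner stores each one.
        for agent in selected:
            memory.save(f"{agent['skill']}_{round_id}", agent["run"](image))
        # Reasoner: answers from selectively recalled states (stubbed here).
        recalled = memory.recall([f"{a['skill']}_{round_id}" for a in selected])
        answer = f"answer from {sorted(recalled)}"
        # Critic: would accept the answer or trigger another round (stubbed).
        if selected:
            break
    return answer, memory
```

For example, with a single hypothetical `crop` tool agent, `run_query("crop the chart region", "img0", tool_agents)` dispatches that one agent, stores its output under `crop_0`, and answers from the recalled state.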
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
PixelCraft multi-agent system for structured image reasoning
The authors introduce PixelCraft, a multi-agent framework comprising a dispatcher, planner, reasoner, critics, and visual tool agents. The system enables non-linear reasoning through query-aware agent selection, role-driven discussion, and iterative self-correction, supported by a planner-managed image memory that allows selective recall of prior visual states instead of streaming all images.
[53] MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
[54] A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation
[55] Multi-Agent System for Comprehensive Soccer Understanding
[56] Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception
[57] A Multimodal Multi-Agent Framework for Radiology Report Generation
[58] Vision-Inertial Collaborative Localization of Multi-Agents with Remote Interaction
[59] T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
[60] Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis
[61] A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering
[62] MCCD: Multi-Agent Collaboration-Based Compositional Diffusion for Complex Text-to-Image Generation
High-fidelity pixel-level grounding model and tool agents
The authors develop a hybrid dataset of synthesized charts and annotated geometric diagrams to fine-tune Qwen2.5-VL-3B for precise pixel-level grounding. This grounding model provides accurate coordinates that drive classical CV operators within tool agents, enabling high-fidelity image processing for structured visual reasoning.
[24] GeoChat: Grounded Large Vision-Language Model for Remote Sensing
[63] Your Large Vision-Language Model Only Needs a Few Attention Heads for Visual Grounding
[64] SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
[65] DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
[66] GLaMM: Pixel Grounding Large Multimodal Model
[67] PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
[68] Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine
[69] GeoMag: A Vision-Language Model for Pixel-Level Fine-Grained Remote Sensing Image Parsing
[70] GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
[71] GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding
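The grounding-driven tool pipeline described for this contribution can be sketched in two stages: a grounding model maps a textual referent to pixel coordinates, and a classical CV operator consumes those coordinates. The real system fine-tunes Qwen2.5-VL-3B for the first stage; the stub below with a fixed prediction is a stand-in assumption, and the pixel grid replaces a real image so the example stays self-contained.

```python
# Illustrative grounding -> classical-CV hand-off. The ground() stub and its
# fixed box are assumptions; the paper's model is a fine-tuned Qwen2.5-VL-3B.

def ground(image, referent):
    """Stand-in for the grounding model: returns an (x1, y1, x2, y2) pixel box."""
    return {"bar peak": (1, 0, 3, 2)}[referent]  # hypothetical fixed prediction

def crop(image, box):
    """Classical CV operator: crop a row-major pixel grid to the given box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11]]
patch = crop(image, ground(image, "bar peak"))  # [[1, 2], [5, 6]]
```

The design point is the division of labor: the learned model only has to emit accurate coordinates, while deterministic operators (crop, mark, zoom, and the like) do the pixel manipulation, so fidelity does not depend on the model redrawing the image.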
Planner-managed image memory for flexible reasoning
The authors introduce an image memory mechanism managed by the planner that stores intermediate visual outputs and allows selective recall of historical images. This design enables branching and backtracking in the reasoning process, departing from linear visual chain-of-thought approaches while reducing long-context overhead.
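The branching and backtracking behavior described above distinguishes this memory from a linear visual chain of thought: a new visual state may extend any stored state, and the planner can later recall only the path that led to a chosen state. A minimal sketch of such a structure follows; the tree representation and method names are assumptions, not the paper's implementation.

```python
# Hypothetical branch-and-backtrack image memory, contrasted with a linear
# visual chain of thought. Structure and names are illustrative assumptions.

class MemoryNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

class BranchingImageMemory:
    def __init__(self):
        self.root = MemoryNode("original")
        self.nodes = {"original": self.root}

    def add(self, name, parent_name):
        # Branch: a new visual state may extend ANY stored state,
        # not just the most recent one.
        node = MemoryNode(name, self.nodes[parent_name])
        self.nodes[parent_name].children.append(node)
        self.nodes[name] = node
        return node

    def backtrack(self, name):
        # Recall only the root-to-state path, discarding dead branches,
        # which keeps the recalled context short.
        path, node = [], self.nodes[name]
        while node:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))

mem = BranchingImageMemory()
mem.add("crop_legend", "original")
mem.add("zoom_axis", "original")        # second branch from the same ancestor
mem.add("highlight_peak", "zoom_axis")
path = mem.backtrack("highlight_peak")  # the chain actually fed to the reasoner
```

Here `crop_legend` is an abandoned branch: backtracking from `highlight_peak` yields only `original -> zoom_axis -> highlight_peak`, which is how selective recall reduces long-context overhead relative to replaying every intermediate image.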