PixelCraft: A Multi-Agent system for High-Fidelity Visual Reasoning on Structured Images

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: chart understanding, multi-agent system, visual reasoning
Abstract:

Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained by low-fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

PixelCraft proposes a multi-agent system combining a dispatcher, planner, reasoner, critics, and visual tool agents for structured image reasoning. The taxonomy places this work in the 'Multi-Agent Systems for Visual Reasoning' leaf, which currently contains only this paper among 50 total papers across 36 topics. This indicates a sparse, emerging research direction within the broader visual reasoning landscape, suggesting the multi-agent collaborative approach for structured images represents relatively unexplored territory in the field.

The taxonomy reveals neighboring approaches in compositional reasoning (Visual Programming, scene graph methods), chain-of-thought frameworks (MINT-CoT, ReFocus), and tool-augmented MLLMs. PixelCraft diverges from single-model compositional methods by distributing reasoning across specialized agents rather than generating unified programs. It differs from chain-of-thought approaches by maintaining dynamic image memory and enabling agent discussion rather than linear reasoning chains. The scope notes indicate this work bridges modular reasoning with collaborative problem-solving, occupying a distinct position between end-to-end neural methods and purely symbolic decomposition.

Among 22 candidates examined across three contributions, none clearly refutes the proposed approach. For the multi-agent system architecture, 10 candidates were examined with no refutable overlap; for the pixel-level grounding model and tool agents, 10 candidates were examined with the same result; for the planner-managed image memory, 2 candidates were examined without refutation. Within this limited search scope, the specific combination of high-fidelity grounding, multi-agent collaboration, and dynamic image memory appears novel, though the analysis does not exhaustively cover prior work in agent-based visual reasoning or grounding methods.

Based on top-22 semantic matches, the work appears to introduce a distinctive architectural approach in a sparse research area. The absence of refutable candidates reflects both the limited search scope and the relative novelty of applying multi-agent frameworks to structured image reasoning. However, the analysis cannot assess overlap with broader agent-based systems literature or comprehensive grounding method surveys beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: visual reasoning on structured images. The field encompasses a diverse set of approaches for enabling models to interpret and reason over visually presented structured data such as charts, tables, diagrams, and geometric layouts. The taxonomy reveals several major branches: some focus on understanding structured visual data itself (e.g., chart and table comprehension as in ChartQA[11] and Visual-TableQA[13]), while others emphasize compositional and modular reasoning strategies that decompose complex queries into interpretable steps (e.g., Visual Programming[1]). Neural architectures dedicated to visual reasoning, reinforcement learning methods for iterative decision-making (Grounded Reinforcement Learning for[22]), and chain-of-thought reasoning for vision-language models (MINT-CoT[23], Uni-cot[41]) represent distinct methodological threads. Additional branches address latent reasoning mechanisms, benchmark development (Plotqa[6], CSVQA[33]), multimodal large language models, and the integration of visual generation with reasoning. A smaller but emerging branch explores multi-agent systems for visual reasoning, where collaborative or modular agents tackle structured visual tasks.

Recent work highlights contrasting trade-offs between end-to-end neural approaches and more interpretable, step-by-step reasoning paradigms. Chain-of-thought methods like MINT-CoT[23] and ReFocus[5] aim to make reasoning transparent, while latent reasoning approaches pursue efficiency without explicit intermediate steps. Within the multi-agent systems branch, PixelCraft[0] situates itself as a collaborative framework that leverages multiple specialized agents to handle structured visual reasoning tasks. This contrasts with single-model strategies such as Visual Programming[1], which relies on modular code generation, and with reinforcement learning methods like Grounded Reinforcement Learning for[22], which iteratively refine decisions through environmental feedback. The multi-agent perspective offers a middle ground, combining modularity with collaborative problem-solving, and reflects ongoing exploration of how to balance interpretability, scalability, and task-specific specialization in visual reasoning systems.

Claimed Contributions

PixelCraft multi-agent system for structured image reasoning

The authors introduce PixelCraft, a multi-agent framework comprising a dispatcher, planner, reasoner, critics, and visual tool agents. The system enables non-linear reasoning through query-aware agent selection, role-driven discussion, and iterative self-correction, supported by a planner-managed image memory that allows selective recall of prior visual states instead of streaming all images.

10 retrieved papers
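
The control flow described above (query-aware agent selection, then reasoning, then critic-driven correction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the keyword-matching dispatcher, the tool names, and the toy reasoner and critic are all hypothetical stand-ins, and the planner and reasoner discussion is collapsed into a single `reasoner` callable for brevity.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: the report does not specify PixelCraft's actual APIs.
Reasoner = Callable[[str, List[str], str], str]   # (query, tools, feedback) -> answer
Critic = Callable[[str, str], str]                # (query, answer) -> feedback, "" means accept


def dispatch(query: str, tools: Dict[str, List[str]]) -> List[str]:
    """Query-aware selection: keep tools whose trigger keywords occur in the query."""
    q = query.lower()
    return [name for name, keywords in tools.items() if any(k in q for k in keywords)]


def solve(query: str, tools: Dict[str, List[str]], reasoner: Reasoner,
          critic: Critic, max_rounds: int = 3) -> Tuple[str, List[Tuple[str, str]]]:
    """Run tool selection, then alternate reasoning and criticism until accepted."""
    selected = dispatch(query, tools)
    trace = [("dispatcher", ", ".join(selected))]
    answer, feedback = "", ""
    for _ in range(max_rounds):
        answer = reasoner(query, selected, feedback)
        trace.append(("reasoner", answer))
        feedback = critic(query, answer)
        trace.append(("critic", feedback or "accept"))
        if not feedback:  # empty feedback: the critic accepts the answer
            break
    return answer, trace


# Toy stand-ins so the sketch runs end to end (tool names are invented).
TOOLS = {
    "crop_subplot": ["subplot", "panel"],
    "zoom_legend": ["legend"],
    "draw_auxiliary_line": ["angle", "triangle"],
}


def toy_reasoner(query: str, selected: List[str], feedback: str) -> str:
    return "B" if feedback else "A"  # revises its answer once it receives feedback


def toy_critic(query: str, answer: str) -> str:
    return "" if answer == "B" else "re-read the legend colours"


answer, trace = solve("Which series in the legend peaks in the left subplot?",
                      TOOLS, toy_reasoner, toy_critic)
```

In a real system each callable would wrap an MLLM call; the sketch only shows how a critic's feedback can route control back to the reasoner instead of ending a linear chain.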

High-fidelity pixel-level grounding model and tool agents

The authors develop a hybrid dataset of synthesized charts and annotated geometric diagrams to fine-tune Qwen2.5-VL-3B for precise pixel-level grounding. This grounding model provides accurate coordinates that drive classical CV operators within tool agents, enabling high-fidelity image processing for structured visual reasoning.

10 retrieved papers
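
To make the idea of grounded coordinates driving a classical CV operator concrete, here is a minimal sketch in plain Python. It assumes the grounding model returns an axis-aligned box `(x0, y0, x1, y1)` in pixel coordinates with exclusive upper bounds; that box format, the list-of-rows image representation, and the `crop` and `zoom` helpers are illustrative assumptions rather than details taken from the paper.

```python
from typing import List, Tuple

Image = List[List[int]]            # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (x0, y0, x1, y1), upper bounds exclusive (assumed format)


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a predicted box to the image bounds and reject empty results."""
    x0, y0, x1, y1 = box
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    if x1 <= x0 or y1 <= y0:
        raise ValueError("empty box after clamping")
    return x0, y0, x1, y1


def crop(image: Image, box: Box) -> Image:
    """Cut out the grounded region exactly, with no resampling."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = clamp_box(box, width, height)
    return [row[x0:x1] for row in image[y0:y1]]


def zoom(image: Image, factor: int) -> Image:
    """Nearest-neighbour upscaling by an integer factor."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]


# A 4x4 test image whose pixel value encodes its position (row * 4 + column).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
predicted_box = (1, 1, 3, 3)       # stand-in for a grounding-model prediction
region = zoom(crop(image, predicted_box), 2)
```

The point of the sketch is the division of labour: the learned model only supplies coordinates, while the pixel manipulation itself is deterministic, so the cropped region is an exact copy of the source pixels.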

Planner-managed image memory for flexible reasoning

The authors introduce an image memory mechanism managed by the planner that stores intermediate visual outputs and allows selective recall of historical images. This design enables branching and backtracking in the reasoning process, departing from linear visual chain-of-thought approaches while reducing long-context overhead.

2 retrieved papers
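
One way to picture an image memory that supports selective recall, branching, and backtracking is a small store in which every intermediate image records the image it was derived from. The sketch below is a hypothetical illustration: the class name `ImageMemory`, its methods, and the id scheme are assumptions, not the paper's design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryEntry:
    image_id: str
    image: object            # any image handle, e.g. a file path or array
    note: str                # short description of how the image was produced
    parent: Optional[str]    # id of the image this one was derived from


class ImageMemory:
    """Hypothetical store of intermediate images, indexed by id."""

    def __init__(self) -> None:
        self._entries: Dict[str, MemoryEntry] = {}

    def store(self, image: object, note: str, parent: Optional[str] = None) -> str:
        """Save an image and return its id; the parent must already be stored."""
        if parent is not None and parent not in self._entries:
            raise KeyError(parent)
        image_id = f"img{len(self._entries)}"
        self._entries[image_id] = MemoryEntry(image_id, image, note, parent)
        return image_id

    def recall(self, image_ids: List[str]) -> List[MemoryEntry]:
        """Selective recall: return only the requested images, not the full history."""
        return [self._entries[i] for i in image_ids]

    def lineage(self, image_id: str) -> List[str]:
        """Ids from the original image down to image_id along one branch."""
        chain: List[str] = []
        current: Optional[str] = image_id
        while current is not None:
            chain.append(current)
            current = self._entries[current].parent
        return chain[::-1]

    def index(self) -> Dict[str, str]:
        """Lightweight text index a planner could read instead of the images."""
        return {i: e.note for i, e in self._entries.items()}


memory = ImageMemory()
root = memory.store("chart.png", "original chart")
legend = memory.store("legend.png", "cropped legend", parent=root)
panel = memory.store("panel.png", "cropped left panel", parent=root)   # second branch
zoomed = memory.store("panel_2x.png", "zoomed left panel", parent=panel)
context = memory.recall([root, zoomed])   # skip the abandoned legend branch
```

Because the planner can consult the text index and request only the ids it needs, the context passed to the reasoner need not grow with every tool call, which matches the reduced long-context overhead described above.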

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one that remains constrained by search coverage and taxonomy granularity.
