VisCoder2: Building Multi-Language Visualization Coding Agents

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Code Models, Visualization, Fine-tuning
Abstract:

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and a lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4.1. Iterative self-debug brings further gains, particularly in symbolic or compiler-dependent languages, reaching an 82.4% overall execution pass rate at the 32B scale.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces three resources for multi-language visualization coding agents: a 679K-sample dataset with multi-turn correction dialogues across 12 languages, a benchmark for generation and self-debug evaluation, and a model family trained on this data. Within the taxonomy, it resides in the 'Multi-Language Visualization Coding Agents with Self-Debug' leaf, which contains only one sibling paper (Multi-Language VisCoding). This leaf sits under 'Visualization Code Generation and Debugging', a moderately populated branch with three sub-categories totaling four papers, indicating a relatively sparse but emerging research direction.

The taxonomy reveals neighboring work in grammar-agnostic visualization pipelines (LIDA) and multimodal code generation from flowcharts, both excluding iterative debugging mechanisms. The broader 'Multi-Language Code Generation and Translation' branch addresses cross-language synthesis but focuses on general-purpose code rather than visualization-specific tasks. The paper's emphasis on executable visualization samples and multi-turn correction dialogues positions it at the intersection of visualization synthesis and iterative debugging, diverging from translation-focused approaches (RepoTransAgent, Rectifier) that handle repository-level code conversion without visualization constraints.

Among the 12 candidates examined, the dataset contribution shows 1 refutable candidate out of 2 examined, the benchmark 2 out of 6, and the model family 1 out of 4. Within the top-ranked semantic matches, some prior work addresses overlapping aspects, particularly multi-language visualization generation and benchmarking. The dataset and benchmark contributions appear to face more substantial prior work than the model family, though the small candidate pool (12 papers) means these findings reflect a narrow slice of the literature rather than exhaustive coverage.

Based on examination of 12 semantically similar papers, the work appears to advance a sparsely populated research direction by combining multi-language support, iterative debugging, and large-scale training data. The analysis captures top-ranked matches but does not encompass the full landscape of visualization generation or code debugging research. The taxonomy structure and sibling count suggest the specific combination of features may be relatively novel within the examined scope.

Taxonomy

Core-task Taxonomy Papers: 10
Claimed Contributions: 3
Contribution Candidate Papers Compared: 12
Refutable Papers: 4

Research Landscape Overview

Core task: multi-language visualization code generation and iterative debugging. The field structure reflects a convergence of code generation, visualization synthesis, and debugging capabilities across multiple programming languages.

The taxonomy organizes work into several main branches. Multi-Language Code Generation and Translation addresses cross-language synthesis and repository-level translation (e.g., RepoTransAgent[3], CruxEval-X[1]). Visualization Code Generation and Debugging focuses on producing and refining chart or plot code from natural language or structured inputs (e.g., LIDA[4], Flowchart to Code[5]). Integrated Development and Debugging Platforms explores holistic environments that combine generation with iterative correction. Cross-Lingual Adaptation and Iterative Training examines training strategies that enable models to handle diverse languages and self-improve. Performance Visualization and Debugging Tools targets runtime analysis and profiling visualizations (e.g., Performance Debugging Visualization[6]). Together, these branches span the spectrum from single-language chart generation to multi-language code translation with debugging loops.

A particularly active line of work centers on agents that generate visualization code in multiple languages and iteratively debug their outputs, exemplified by VisCoder2[0] and Multi-Language VisCoding[9]. These approaches emphasize self-correction mechanisms that let models detect and fix errors across Python, R, and other languages, in contrast to earlier single-language or non-iterative methods such as LIDA[4]. Another emerging theme is cross-language translation with debugging support, as seen in Generate Debug Translate[10] and Rectifier[2], which address syntactic and semantic mismatches when converting code between languages.

VisCoder2[0] sits squarely within the visualization-focused debugging cluster, sharing the multi-language emphasis of Multi-Language VisCoding[9] while extending its iterative refinement capabilities. Compared to broader translation agents like RepoTransAgent[3], VisCoder2[0] narrows its scope to visualization tasks, trading generality for debugging heuristics tailored to chart generation. Open questions remain around scaling these iterative loops to more complex visualizations and integrating runtime feedback from execution environments.

Claimed Contributions

VisCode-Multi-679K dataset

A supervised instruction-tuning dataset comprising 679K executable visualization code samples paired with rendered outputs and multi-turn correction dialogues, spanning twelve programming languages including Python, LaTeX, LilyPond, SVG, HTML, Asymptote, Mermaid, and Vega-Lite.

2 retrieved papers
Can Refute
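A training sample combining executable code, a rendered output, and a multi-turn correction dialogue might be structured roughly as below. All field names here are illustrative assumptions, not the dataset's documented schema.

```python
# Illustrative shape of one VisCode-Multi-679K-style sample.
# Every key name is an assumption made for this sketch.
sample = {
    "language": "python",
    "dialogue": [
        {"role": "user", "content": "Plot monthly revenue as a bar chart."},
        {"role": "assistant", "content": "import matplotlib.pyplot as plt\n..."},
        # Execution feedback turn: the traceback from running the first attempt.
        {"role": "user", "content": "Execution failed:\nNameError: name 'months' is not defined"},
        # Correction turn: the revised, validated code.
        {"role": "assistant", "content": "months = ['Jan', 'Feb', 'Mar']\n..."},
    ],
    "rendered_output": "renders/sample_0001.png",  # path to the validated render
}

# Basic structural checks one might run when validating such data:
assert sample["dialogue"][0]["role"] == "user"
assert any("failed" in turn["content"] for turn in sample["dialogue"])
```

Structuring corrections as extra dialogue turns, rather than as separate samples, is what lets a single supervised pass teach both generation and traceback-conditioned repair.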
VisPlotBench benchmark

A benchmark containing 888 executable visualization tasks across eight programming languages and thirteen visual categories, with standardized evaluation protocols for both single-round code generation and multi-round iterative self-debugging.

6 retrieved papers
Can Refute
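The two evaluation protocols differ only in how many attempts count toward success, so the execution pass rate could be computed along these lines. The per-task log format is an assumption; the benchmark's actual result files are not described in this report.

```python
def pass_rate(results: list[list[bool]], rounds: int) -> float:
    """Fraction of tasks that execute successfully within `rounds` attempts.

    `results[i]` holds per-round success flags for task i: index 0 is the
    initial generation, later indices are self-debug rounds. This layout
    is an assumption made for illustration.
    """
    passed = sum(any(task[:rounds]) for task in results)
    return passed / len(results)


# Toy example: 4 tasks, at most 3 rounds each.
logs = [
    [True],                  # passes on the first attempt
    [False, True],           # fixed in the first debug round
    [False, False, True],    # fixed in the second debug round
    [False, False, False],   # never recovers
]
single_round = pass_rate(logs, rounds=1)  # 0.25
with_debug = pass_rate(logs, rounds=3)    # 0.75
```

Reporting both numbers separates raw generation quality (`rounds=1`) from the added value of the self-debug loop, which is the gap the report's 82.4% figure reflects.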
VisCoder2 model family

A family of visualization coding agents trained on VisCode-Multi-679K at multiple scales (3B, 7B, 14B, 32B parameters) that can generate, execute, and iteratively refine visualization code across multiple programming languages, approaching the performance of proprietary models like GPT-4.1.

4 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: VisCode-Multi-679K dataset
Contribution: VisPlotBench benchmark
Contribution: VisCoder2 model family
