ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Infographic Chart, Chart Understanding, Code Generation, Chart Generation, Dataset
Abstract:

Infographic charts are a powerful medium for communicating abstract data by combining visual elements (e.g., charts, images) with textual information. However, their visual and structural richness poses challenges for large vision-language models (LVLMs), which are typically trained on plain charts. To bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to advance the understanding and generation of infographic charts. The dataset is constructed through an inductive process that identifies 75 chart types, 440 chart variations, and 68 layout templates from real infographic charts and uses them to create synthetic ones programmatically. We showcase the utility of this dataset through: 1) improving infographic chart understanding via fine-tuning, 2) benchmarking code generation for infographic charts, and 3) enabling example-based infographic chart generation. By capturing the visual and structural complexity of real design, ChartGalaxy provides a useful resource for enhancing multimodal reasoning and generation in LVLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ChartGalaxy, a million-scale dataset for infographic chart understanding and generation, constructed through an inductive process identifying 75 chart types, 440 variations, and 68 layout templates. Within the taxonomy, it resides in the 'Large-Scale Dataset Construction' leaf under 'Multimodal Model Development for Chart Tasks', sharing this space with only one sibling paper (NovaChart). This leaf represents a relatively sparse but critical research direction focused on building comprehensive training resources for vision-language models, distinguishing itself from the more crowded branches of design automation and educational applications.

The taxonomy reveals that ChartGalaxy sits within a broader ecosystem of multimodal model development, adjacent to leaves addressing unified multi-task learning frameworks, code-guided synthesis, and specialized UI models. Neighboring branches include 'Automated Design and Authoring Systems' (template extraction, message-driven authoring) and 'Infographic Understanding and Interpretation' (content extraction, cognitive analysis). The scope note for its leaf emphasizes 'diverse chart types, annotations, and task coverage', explicitly excluding domain-specific or small-scale datasets, positioning ChartGalaxy as a general-purpose resource rather than a specialized benchmark.

Among 24 candidates examined across three contributions, the dataset itself (Contribution 1: 10 candidates, 0 refutable) appears novel within the limited search scope, with no prior work directly overlapping its million-scale programmatic construction approach. However, the pipeline for programmatic chart creation (Contribution 2: 10 candidates, 2 refutable) shows more substantial prior work, suggesting that code-based synthesis methods have been explored elsewhere. The three applications demonstrating utility (Contribution 3: 4 candidates, 0 refutable) appear less contested, though the small candidate pool limits confidence in this assessment.
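The per-contribution figures quoted above can be cross-checked with a short tally. This is only an illustrative sketch: the dictionary keys and field names are made up for the example and do not reflect the schema of the tool that produced this report.

```python
# Per-contribution counts as quoted in the assessment paragraph above.
# Field names ("candidates", "refutable") are illustrative assumptions.
contributions = {
    "ChartGalaxy dataset": {"candidates": 10, "refutable": 0},
    "Programmatic creation pipeline": {"candidates": 10, "refutable": 2},
    "Three applications": {"candidates": 4, "refutable": 0},
}

# Totals across all three contributions.
total_candidates = sum(c["candidates"] for c in contributions.values())
total_refutable = sum(c["refutable"] for c in contributions.values())

# Contributions with at least one candidate flagged as refutable.
contested = [name for name, c in contributions.items() if c["refutable"] > 0]

print(total_candidates, total_refutable, contested)
```

The totals agree with the headline numbers (24 candidates, 2 refutable), and only the pipeline contribution has refutable candidates.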

Based on this limited top-24 semantic search, ChartGalaxy's primary novelty appears to lie in its scale and systematic taxonomy-driven construction rather than fundamentally new generation techniques. The analysis does not cover exhaustive literature on chart datasets or programmatic synthesis methods, leaving open the possibility of additional relevant prior work beyond the examined candidates. The sparse population of its taxonomy leaf suggests this dataset-centric direction remains relatively underexplored compared to design automation or educational applications.

Taxonomy

Core-task Taxonomy Papers: 39
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 2

Research Landscape Overview

Core task: infographic chart understanding and generation. The field encompasses both computational methods for interpreting visual data representations and systems that automate their creation. The taxonomy reveals five main branches: Multimodal Model Development focuses on building large-scale datasets and training vision-language models to parse charts and infographics, with works like ChartLlama[6] and ScreenAI[20] advancing multimodal reasoning capabilities. Automated Design and Authoring Systems explore tools that assist or fully automate the generation process, including mixed-initiative approaches such as Mixed Initiative Charts[18] and template-based generators like Infographics Generator[19]. Infographic Understanding and Interpretation addresses how humans and machines extract meaning from complex visual narratives, with studies ranging from tag prediction methods to accessibility considerations in works like Accessible Infographics[10]. Educational Applications examine pedagogical uses across diverse contexts, from geography instruction to medical communication, while Domain-Specific Applications target specialized fields like healthcare patient journeys and military training exercises.

Recent activity centers on scaling up dataset construction to support more robust multimodal models, contrasting data-driven learning approaches with rule-based authoring systems that offer greater designer control. ChartGalaxy[0] sits squarely within the Large-Scale Dataset Construction cluster under Multimodal Model Development, sharing this focus with NovaChart[1], which similarly emphasizes building comprehensive training resources for chart understanding tasks. While NovaChart[1] and related dataset efforts concentrate on breadth and diversity of chart types, ChartGalaxy[0] appears to push toward even larger scale or richer annotation schemes to support next-generation vision-language models. This dataset-centric work contrasts with generation-focused efforts like ChartGen[5] and CycleChart[8], which prioritize synthesis quality and design automation, highlighting an ongoing tension between improving interpretive capabilities through better training data versus developing more sophisticated authoring tools.
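The placement described in this report can be pictured as a small tree. The sketch below encodes only the branches, leaves, and papers that the text names explicitly. The remaining papers and leaves of the 39-paper taxonomy are not listed here, so empty containers stand in for them, and the helper names (`find_path`, `siblings`) are hypothetical.

```python
# Partial taxonomy, limited to what the report text states.
# Empty lists/dicts mark leaves or branches whose contents are not given here.
taxonomy = {
    "Multimodal Model Development for Chart Tasks": {
        "Large-Scale Dataset Construction": ["ChartGalaxy", "NovaChart"],
        "Unified multi-task learning frameworks": [],
        "Code-guided synthesis": [],
        "Specialized UI models": [],
    },
    "Automated Design and Authoring Systems": {
        "Template extraction": [],
        "Message-driven authoring": [],
    },
    "Infographic Understanding and Interpretation": {
        "Content extraction": [],
        "Cognitive analysis": [],
    },
    "Educational Applications": {},
    "Domain-Specific Applications": {},
}


def find_path(tree, paper, path=()):
    """Return the (branch, ..., leaf) path that contains `paper`, or None."""
    for name, child in tree.items():
        if isinstance(child, dict):
            found = find_path(child, paper, path + (name,))
            if found:
                return found
        elif paper in child:
            return path + (name,)
    return None


def siblings(tree, paper):
    """Return the other papers that share a leaf with `paper`."""
    node = tree
    for name in find_path(tree, paper):
        node = node[name]
    return [p for p in node if p != paper]


print(find_path(taxonomy, "ChartGalaxy"))
print(siblings(taxonomy, "ChartGalaxy"))
```

Looking up ChartGalaxy returns the Large-Scale Dataset Construction leaf with NovaChart as its single sibling, which is the "sparse leaf" observation made in the novelty assessment.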

Claimed Contributions

ChartGalaxy dataset

The authors present ChartGalaxy, a large-scale dataset containing 1,701,356 synthetic and 61,833 real infographic charts paired with data tables. The dataset is constructed through an inductive process that identifies chart types, variations, and layout templates from real designs to programmatically create synthetic ones.

10 retrieved papers

Pipeline for programmatic infographic chart creation

The authors develop a human-in-the-loop pipeline that extracts design patterns (75 chart types, 440 variations, 68 layout templates) from real infographic charts and uses them to automatically generate synthetic infographic charts at scale.

10 retrieved papers; status: Can Refute

Three applications demonstrating dataset utility

The authors demonstrate the value of ChartGalaxy through three distinct applications: improving infographic chart understanding via fine-tuning, benchmarking code generation for infographic charts, and enabling example-based infographic chart generation.

4 retrieved papers
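To make the scale figures in the contribution summaries concrete, the following toy sketch pairs data tables with chart variations and layout templates. It is not the authors' pipeline: `ChartSpec` and `sample_specs` are hypothetical names, and the assumption that every variation can be combined with every layout template is made purely for illustration (the actual pipeline may restrict which pairings are valid).

```python
import random
from dataclasses import dataclass

# Counts reported in the contribution summaries above.
N_CHART_TYPES, N_VARIATIONS, N_LAYOUT_TEMPLATES = 75, 440, 68
N_SYNTHETIC, N_REAL = 1_701_356, 61_833


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical description of one synthetic chart to render."""
    variation_id: int
    layout_id: int
    table_id: int


def sample_specs(n_tables, n_specs, seed=0):
    """Pair randomly chosen data tables with a random variation and layout.

    Assumes (for illustration only) that all variation/layout pairs are valid.
    """
    rng = random.Random(seed)
    return [
        ChartSpec(
            variation_id=rng.randrange(N_VARIATIONS),
            layout_id=rng.randrange(N_LAYOUT_TEMPLATES),
            table_id=rng.randrange(n_tables),
        )
        for _ in range(n_specs)
    ]


# Upper bound on distinct designs under the all-pairs assumption.
max_designs = N_VARIATIONS * N_LAYOUT_TEMPLATES

# Dataset composition derived from the reported counts.
total_charts = N_SYNTHETIC + N_REAL
real_share = N_REAL / total_charts

specs = sample_specs(n_tables=1000, n_specs=5)
print(max_designs, total_charts, f"{real_share:.1%}", specs[0])
```

Under these assumptions there are at most 29,920 variation/layout designs, so reaching about 1.7 million synthetic charts would also depend on varying the underlying data tables and other visual elements. Real charts make up roughly 3.5% of the 1,763,189 charts in total.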

