TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Natural Language Processing, AI/NLP for Science, Large Language Models, Vision Language Models, Reinforcement Learning, Code Generation, Representation Learning
Abstract:

Large language models (LLMs) are increasingly used to assist scientists across diverse workflows. A key challenge is generating high-quality figures from textual descriptions, often represented as TikZ programs that can be rendered as scientific images. Prior research has proposed a variety of datasets and modeling approaches for this task. However, existing datasets for Text-to-TikZ are too small and noisy to capture the complexity of TikZ, causing mismatches between text and rendered figures. Moreover, prior approaches rely solely on supervised fine-tuning (SFT), which does not expose the model to the rendered semantics of the figure, often resulting in errors such as looping, irrelevant content, and incorrect spatial relations. To address these issues, we construct DaTikZ-V4, a dataset more than four times larger and substantially higher in quality than DaTikZ-V3, enriched with LLM-generated figure descriptions. Using this dataset, we train TikZilla, a family of small open-source Qwen models (3B and 8B) with a two-stage pipeline of SFT followed by reinforcement learning (RL). For RL, we leverage an image encoder trained via inverse graphics to provide semantically faithful reward signals. Extensive human evaluations with over 1,000 judgments show that TikZilla improves by 1.5-2 points over its base models on a 5-point scale, surpasses GPT-4o by 0.5 points, and matches GPT-5 in the image-based evaluation, while operating at much smaller model sizes. Code, data, and models will be made available.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces TikZilla, a family of small language models (3B and 8B parameters) trained to generate TikZ code from textual descriptions using a two-stage pipeline combining supervised fine-tuning and reinforcement learning. According to the taxonomy tree, this work resides in the 'Reinforcement Learning-Enhanced Generation' leaf under 'Text-to-TikZ Generation Methods'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this specific combination of RL-based training for TikZ generation represents a relatively sparse research direction within the broader field of 30 papers examined.

The taxonomy reveals that neighboring leaves include 'Supervised Fine-Tuning Approaches' (containing one paper on prompt-based LLM pipelines) and 'Zero-Shot and Unaligned Data Methods' (one paper on leveraging unaligned graphics programs). The broader 'Text-to-TikZ Generation Methods' branch sits alongside 'Indirect Generation via TikZ Intermediates' (which uses TikZ as a bridge to image synthesis or multimodal understanding) and 'Domain-Specific TikZ Generation' (targeting mathematical diagrams or specialized scientific figures). TikZilla diverges from these directions by directly synthesizing general-purpose TikZ code while incorporating semantic feedback from rendered outputs, rather than relying on intermediate representations or domain-specific templates.

Among 15 total candidates examined across three contributions, no clearly refuting prior work was identified. The 'TikZilla model family with two-stage training' contribution examined 10 candidates with zero refutable matches, while the 'domain-specific reward model for RL' contribution examined 5 candidates, also with zero refutations. The 'DaTikZ-V4 dataset construction' contribution examined no candidates. This limited search scope—15 papers from semantic search and citation expansion—suggests that within the examined literature, the combination of supervised fine-tuning followed by RL with inverse-graphics-based rewards appears relatively unexplored, though the analysis does not claim exhaustive coverage of all possible prior work.

Based on the top-15 semantic matches and taxonomy structure, the work appears to occupy a novel position by combining RL-based training with TikZ generation, a direction not represented by sibling papers in the same taxonomy leaf. However, the limited search scope and the presence of related supervised and zero-shot approaches in neighboring leaves indicate that the broader problem space is moderately populated. The analysis covers direct methodological overlap but does not exhaustively examine all possible dataset construction techniques or reward modeling strategies in adjacent domains.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 15
Refutable papers: 0

Research Landscape Overview

Core task: generating scientific figures from textual descriptions using TikZ programs. The field encompasses a range of approaches that vary in how they bridge natural language and TikZ code. Text-to-TikZ Generation Methods form the central branch, exploring direct synthesis strategies including prompt-based large language model pipelines (e.g., AutomaTikZ[4], Controllable GPT-4[1]) and reinforcement learning-enhanced techniques.

Indirect Generation via TikZ Intermediates investigates multi-stage workflows that decompose the problem through intermediate representations such as scene graphs or domain-specific languages (e.g., DeTikZify[19], DiagramIR[20]). Domain-Specific TikZ Generation focuses on specialized figure types such as automata, neural network diagrams, and biological pathways, where tailored templates or constraints improve output quality (e.g., Automata TikZ[10], nndiagram[24], sbgntikz[27]). Evaluation and Benchmarking addresses the challenge of assessing generated TikZ code through datasets and metrics (VGBench[2], MathemaTikZ Benchmark[15]), while TikZ Tools and Applications catalogs practical utilities and customization frameworks (TikZ Network[12], Configurable TikZ[23]). Document Understanding and Markup Generation examines the reverse or complementary problem of extracting structured representations from existing documents.

A particularly active line of work explores how to improve generation fidelity and controllability. Some studies emphasize iterative refinement and feedback loops, while others investigate whether intermediate symbolic representations can reduce the semantic gap between text and low-level drawing commands. TikZilla[0] sits within the Reinforcement Learning-Enhanced Generation cluster, applying policy optimization to refine TikZ outputs based on visual or structural rewards. This approach contrasts with purely supervised methods like Words to Visuals[3], which rely on large paired corpora, and with zero-shot or few-shot prompting strategies exemplified by TikZero[5]. By framing generation as a sequential decision problem, TikZilla[0] aims to balance syntactic correctness with semantic alignment, addressing a common trade-off in the field: whether to prioritize rapid prototyping through end-to-end models or invest in more structured, feedback-driven pipelines that can handle complex figure specifications.

Claimed Contributions

DaTikZ-V4 dataset construction

The authors build a new dataset for Text-to-TikZ that is over four times larger than its predecessor, sourced from arXiv, GitHub, TeX StackExchange, and synthetic data. They enhance quality through LLM-based debugging of uncompilable code and VLM-generated figure descriptions, addressing the noise and small scale of prior datasets.

0 retrieved papers
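The LLM-based debugging step described above amounts to a compile-check-and-repair loop: a snippet is kept only if it compiles, possibly after a bounded number of LLM repair attempts. The sketch below illustrates the control flow only; `compiles` and `llm_fix` are hypothetical stand-ins for a real LaTeX toolchain call and an LLM repair prompt, not the authors' actual pipeline.

```python
from typing import Callable, Optional

def repair_tikz(code: str,
                compiles: Callable[[str], bool],
                llm_fix: Callable[[str], str],
                max_attempts: int = 3) -> Optional[str]:
    """Keep a TikZ snippet only if it compiles, possibly after LLM repair.

    Returns the first compilable version, or None if every attempt fails
    (in which case the snippet would be dropped from the dataset).
    """
    for _ in range(max_attempts + 1):
        if compiles(code):
            return code
        code = llm_fix(code)  # ask the LLM for a corrected version
    return None

# Toy stand-ins: "compilation" = balanced braces; "repair" = append a brace.
toy_compiles = lambda c: c.count("{") == c.count("}")
toy_fix = lambda c: c + "}"
```

In a real pipeline, `compiles` would invoke a LaTeX engine on the wrapped snippet and check the exit status, and `llm_fix` would send the code plus the compiler log to an LLM.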
TikZilla model family with two-stage training

The authors introduce TikZilla, a family of small Qwen-based models trained using supervised fine-tuning for syntax alignment followed by reinforcement learning with a domain-specific reward model. This two-stage approach substantially improves Text-to-TikZ generation quality, enabling even 3B parameter models to outperform GPT-4o.

10 retrieved papers
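The RL stage of the two-stage recipe samples candidate TikZ programs, renders and scores them, and reinforces above-average candidates. The paper does not specify the exact policy-gradient variant here, so the following is a minimal REINFORCE-style sketch with a mean-reward baseline, on toy numbers:

```python
import numpy as np

def reinforce_weights(rewards: list) -> np.ndarray:
    """Advantage weights for a group of sampled TikZ programs.

    Each candidate's log-likelihood gradient is scaled by its reward
    minus the group mean (a simple baseline), so above-average
    renderings are reinforced and below-average ones are suppressed.
    """
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Three sampled programs scored by an image-based reward model:
advantages = reinforce_weights([0.9, 0.5, 0.1])
```

The weights sum to zero by construction, so the update shifts probability mass from poorly rendered candidates toward well-rendered ones rather than uniformly inflating all sampled programs.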
Domain-specific reward model for RL

The authors propose the first domain-specific reward model for Text-to-TikZ by retraining an image encoder from DeTikZify-V2 on their larger dataset. This encoder provides semantically meaningful reward signals during RL optimization, correlating more strongly with human judgments than general-purpose metrics like CLIPScore or DreamSIM.

5 retrieved papers
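The reward described above reduces to an embedding similarity between the rendered candidate figure and a reference. A minimal sketch, assuming embeddings have already been extracted; the toy vectors and the mapping to [0, 1] are illustrative assumptions, not DeTikZify-V2's actual interface:

```python
import numpy as np

def cosine_reward(candidate_emb: np.ndarray, reference_emb: np.ndarray) -> float:
    """Cosine similarity between two image embeddings, mapped to [0, 1].

    A figure whose rendering embeds close to the reference earns a high
    reward; an unrelated rendering earns a reward near 0.5 or below.
    """
    denom = float(np.linalg.norm(candidate_emb) * np.linalg.norm(reference_emb))
    cos = float(np.dot(candidate_emb, reference_emb)) / denom if denom > 0 else 0.0
    return 0.5 * (cos + 1.0)  # map [-1, 1] -> [0, 1]

# Toy 4-dimensional "embeddings":
ref = np.array([1.0, 0.0, 0.0, 0.0])
same = np.array([2.0, 0.0, 0.0, 0.0])   # same direction as ref
ortho = np.array([0.0, 1.0, 0.0, 0.0])  # orthogonal to ref
```

The point of retraining the encoder on in-domain data is that "close in embedding space" then tracks human judgments of figure similarity better than general-purpose metrics such as CLIPScore or DreamSIM.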

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.
