JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multimodal LLM, Data Synthesis, Code Generation, Data Visualization
Abstract:

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code, benchmark, and checkpoints will be made publicly available.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a synthesis toolkit for multimodal code data, the JanusCode-800K corpus, and unified models (JanusCoder/JanusCoderV) that generate code from visual and textual inputs. It resides in the 'Unified Multimodal Code Generation Models' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Multimodal Code Generation Frameworks and Methodologies' branch, indicating a moderately populated research direction focused on general-purpose architectures rather than domain-specific solutions. The taxonomy shows this is an active but not overcrowded area, with sibling papers exploring similar unified approaches.

The taxonomy reveals neighboring leaves addressing specialized domains: 'Scientific Visualization and Chart Code Generation' (five papers), 'Web UI Code Generation from Design Images' (eight papers), and 'Multimodal Program Synthesis and Reasoning' (four papers). The original paper's position in the unified models leaf suggests it aims to bridge these specialized directions rather than deepen any single domain. The scope note for this leaf explicitly excludes task-specific models, positioning the work as a horizontal integration effort. Nearby branches like 'Robotic and Embodied Agent Code Generation' (five papers) and 'CAD and 3D Model Code Generation' (three papers) represent alternative application domains that the unified approach might encompass.

Among the 24 candidates examined, the synthesis toolkit and the corpus were each compared against seven candidates, with no clear refutation found for either. The unified-model contribution was compared against ten candidates, one of which was judged potentially refutable, suggesting some overlap in the architectural approach. Within this limited search scope, the toolkit and corpus appear more novel, while the model architecture faces more substantial prior work. This pattern indicates that the data-centric contributions may represent the stronger novelty claims, though the search remains modest relative to the field's breadth: the analysis does not cover exhaustive citation networks or domain-specific venues beyond top semantic matches.

Based on the limited search of 24 candidates, the work appears to occupy a moderately novel position, particularly in its data synthesis and corpus contributions. The unified modeling approach shows some overlap with existing frameworks, consistent with its placement in a leaf containing four other unified models. The taxonomy structure suggests the field is transitioning from specialized systems toward general-purpose architectures, and this work participates in that trend. A more comprehensive literature review would be needed to assess novelty against the full landscape of multimodal code generation research.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: multimodal code generation from visual and textual inputs. This field encompasses methods that translate diverse visual representations, ranging from UI mockups and hand-drawn sketches to mathematical diagrams and robotic scene observations, into executable code.

The taxonomy organizes research into several main branches: UI/Front-End Code Generation focuses on translating design artifacts into web or mobile interfaces (e.g., Design2code[1], ScreenCoder[13]); Specialized Visual-to-Code Generation targets domain-specific applications such as chart reproduction (Plot2code[2], Chart-R1[17]) and CAD modeling (CAD-Coder[8]); Algorithmic and Mathematical Code Generation addresses problems like geometry solving (GeoCoder[31]) and mathematical reasoning (MathCoder-VL[3]); Robotic and Embodied Agent Code Generation produces control programs from sensor data or task descriptions (RoboCodex[23], EmbodiedCoder[29]); Multimodal Code Generation Frameworks and Methodologies develops unified architectures that handle multiple input modalities and code targets; and Multimodal Data Generation and Representation explores synthetic data creation and embedding strategies to support training and evaluation.

A particularly active line of work centers on unified multimodal frameworks that aim to handle diverse visual inputs and code outputs within a single model architecture, contrasting with earlier domain-specific pipelines. JanusCoder[0] exemplifies this direction by proposing a unified approach that integrates visual and textual encoders to generate code across multiple domains, positioning itself alongside other general-purpose systems like VinciCoder[25] and MMCode[26]. While VinciCoder[25] emphasizes cross-modal alignment through contrastive learning and MMCode[26] explores modular reasoning strategies, JanusCoder[0] focuses on end-to-end generation with joint training objectives.
These unified models face trade-offs between generality and domain-specific performance: specialized methods often achieve higher accuracy on narrow tasks, but unified frameworks offer greater flexibility and scalability. Open questions remain around optimal architectural choices for balancing visual understanding with code synthesis, effective strategies for leveraging large-scale multimodal pretraining, and robust evaluation protocols that capture both functional correctness and visual fidelity across diverse application domains.

Claimed Contributions

Complete synthesis toolkit for multimodal code data

The authors introduce a comprehensive toolkit that automates the synthesis of multimodal code data spanning diverse domains (charts, web UIs, visual artifacts, animations) and programming languages. The toolkit leverages reciprocal synergies between data modalities and includes quality control mechanisms through execution validation and reward modeling.

7 retrieved papers
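The report does not specify how the toolkit's execution-validation step is implemented. As a rough, hypothetical sketch (the function name, harness design, and filtering policy below are assumptions, not the authors' code), such a quality filter might run each synthesized sample in a subprocess and keep it only if it exits cleanly and, where applicable, produces its expected visual artifact:

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def passes_execution_check(code: str,
                           expected_artifact: Optional[str] = None,
                           timeout_s: int = 20) -> bool:
    """Keep a synthesized code sample only if it runs to completion and,
    optionally, writes its expected output file (e.g., a rendered chart).
    Hypothetical sketch of execution validation; not the authors' code.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "sample.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # hung samples are discarded
        if result.returncode != 0:
            return False  # crashes, syntax errors, missing dependencies
        if expected_artifact is not None:
            # For visual domains, also require a non-empty rendered output.
            artifact = Path(tmp) / expected_artifact
            return artifact.exists() and artifact.stat().st_size > 0
        return True
```

A real pipeline would additionally sandbox the execution environment and, per the contribution description, score the surviving samples with a reward model; neither step is shown here.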
JanusCode-800K multimodal code corpus

The authors construct JanusCode-800K, claimed as the largest multimodal code corpus to date with approximately 800K samples. The corpus uniquely includes large-scale animation and artifact data previously absent from existing datasets, balancing text-centric and vision-centric code intelligence tasks.

7 retrieved papers
JanusCoder unified visual-programmatic interface models

The authors develop JanusCoder and JanusCoderV as unified models that establish a visual-programmatic interface for code intelligence. Unlike existing specialized models for isolated tasks, these models handle diverse tasks including code generation from textual instructions, visual inputs, or combinations thereof across multiple domains.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
