JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Multimodal LLM, Data Synthesis, Code Generation, Data Visualization
Abstract:

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code, benchmark, and checkpoints will be made publicly available.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a synthesis toolkit for multimodal code data, the JanusCode-800K corpus, and unified models (JanusCoder/JanusCoderV) that generate code from visual and textual inputs. It resides in the 'Unified Multimodal Code Generation Models' leaf, which contains five papers total, including the original work. This leaf sits within the broader 'Multimodal Code Generation Frameworks and Methodologies' branch, indicating a moderately populated research direction focused on general-purpose architectures rather than domain-specific solutions. The taxonomy shows this is an active but not overcrowded area, with sibling papers exploring similar unified approaches.

The taxonomy reveals neighboring leaves addressing specialized domains: 'Scientific Visualization and Chart Code Generation' (five papers), 'Web UI Code Generation from Design Images' (eight papers), and 'Multimodal Program Synthesis and Reasoning' (four papers). The original paper's position in the unified models leaf suggests it aims to bridge these specialized directions rather than deepen any single domain. The scope note for this leaf explicitly excludes task-specific models, positioning the work as a horizontal integration effort. Nearby branches like 'Robotic and Embodied Agent Code Generation' (five papers) and 'CAD and 3D Model Code Generation' (three papers) represent alternative application domains that the unified approach might encompass.

Among the 24 candidates examined, the synthesis toolkit and the corpus were each compared against seven candidates, with no clear refutation found for either. The unified-model contribution was compared against ten candidates, one of which was judged potentially refutable, suggesting some overlap in the architectural approach. Within this limited search scope, the toolkit and corpus appear more novel, while the model architecture faces more substantial prior work. This pattern indicates that the data-centric contributions may represent the stronger novelty claims, though the search remains modest relative to the field's breadth: the analysis does not cover exhaustive citation networks or domain-specific venues beyond top semantic matches.

Based on the limited search of 24 candidates, the work appears to occupy a moderately novel position, particularly in its data synthesis and corpus contributions. The unified modeling approach shows some overlap with existing frameworks, consistent with its placement in a leaf containing four other unified models. The taxonomy structure suggests the field is transitioning from specialized systems toward general-purpose architectures, and this work participates in that trend. A more comprehensive literature review would be needed to assess novelty against the full landscape of multimodal code generation research.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: multimodal code generation from visual and textual inputs. This field encompasses methods that translate diverse visual representations, ranging from UI mockups and hand-drawn sketches to mathematical diagrams and robotic scene observations, into executable code.

The taxonomy organizes research into several main branches: UI/Front-End Code Generation focuses on translating design artifacts into web or mobile interfaces (e.g., Design2code[1], ScreenCoder[13]); Specialized Visual-to-Code Generation targets domain-specific applications such as chart reproduction (Plot2code[2], Chart-R1[17]) and CAD modeling (CAD-Coder[8]); Algorithmic and Mathematical Code Generation addresses problems like geometry solving (GeoCoder[31]) and mathematical reasoning (MathCoder-VL[3]); Robotic and Embodied Agent Code Generation produces control programs from sensor data or task descriptions (RoboCodex[23], EmbodiedCoder[29]); Multimodal Code Generation Frameworks and Methodologies develops unified architectures that handle multiple input modalities and code targets; and Multimodal Data Generation and Representation explores synthetic data creation and embedding strategies to support training and evaluation.

A particularly active line of work centers on unified multimodal frameworks that aim to handle diverse visual inputs and code outputs within a single model architecture, contrasting with earlier domain-specific pipelines. JanusCoder[0] exemplifies this direction by proposing a unified approach that integrates visual and textual encoders to generate code across multiple domains, positioning itself alongside other general-purpose systems like VinciCoder[25] and MMCode[26]. While VinciCoder[25] emphasizes cross-modal alignment through contrastive learning and MMCode[26] explores modular reasoning strategies, JanusCoder[0] focuses on end-to-end generation with joint training objectives.
These unified models face trade-offs between generality and domain-specific performance: specialized methods often achieve higher accuracy on narrow tasks, but unified frameworks offer greater flexibility and scalability. Open questions remain around optimal architectural choices for balancing visual understanding with code synthesis, effective strategies for leveraging large-scale multimodal pretraining, and robust evaluation protocols that capture both functional correctness and visual fidelity across diverse application domains.

Claimed Contributions

Complete synthesis toolkit for multimodal code data

The authors introduce a comprehensive toolkit that automates the synthesis of multimodal code data spanning diverse domains (charts, web UIs, visual artifacts, animations) and programming languages. The toolkit leverages reciprocal synergies between data modalities and includes quality control mechanisms through execution validation and reward modeling.

7 retrieved papers
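The report does not specify how the toolkit's execution-validation step is implemented. As a rough, hypothetical sketch (the function name, harness design, and filtering policy below are assumptions, not the authors' code), such a quality filter might run each synthesized sample in a subprocess and keep it only if it exits cleanly and, where applicable, produces its expected visual artifact:

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def passes_execution_check(code: str,
                           expected_artifact: Optional[str] = None,
                           timeout_s: int = 20) -> bool:
    """Keep a synthesized code sample only if it runs to completion and,
    optionally, writes its expected output file (e.g., a rendered chart).
    Hypothetical sketch of execution validation; not the authors' code.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "sample.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp, capture_output=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False  # hung samples are discarded
        if result.returncode != 0:
            return False  # crashes, syntax errors, missing dependencies
        if expected_artifact is not None:
            # For visual domains, also require a non-empty rendered output.
            artifact = Path(tmp) / expected_artifact
            return artifact.exists() and artifact.stat().st_size > 0
        return True
```

A real pipeline would additionally sandbox the execution environment and, per the contribution description, score the surviving samples with a reward model; neither step is shown here.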
JanusCode-800K multimodal code corpus

The authors construct JanusCode-800K, claimed as the largest multimodal code corpus to date with approximately 800K samples. The corpus uniquely includes large-scale animation and artifact data previously absent from existing datasets, balancing text-centric and vision-centric code intelligence tasks.

7 retrieved papers
JanusCoder unified visual-programmatic interface models

The authors develop JanusCoder and JanusCoderV as unified models that establish a visual-programmatic interface for code intelligence. Unlike existing specialized models for isolated tasks, these models handle diverse tasks including code generation from textual instructions, visual inputs, or combinations thereof across multiple domains.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
