cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: CAD, 3D reconstruction, LLM, VLM, point cloud, DPO, GRPO
Abstract:

Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, most existing methods focus on a single input modality: point clouds, images, or text, which limits their generalizability and robustness, while the few multimodal approaches struggle to deliver competitive quality. Leveraging advances in vision-language models (VLM), we propose cadrille, a multimodal CAD reconstruction model that takes inputs of three modalities and outputs executable Python code for CAD reconstruction. Inspired by the large language model (LLM) training paradigm, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on as many as 10 benchmarks across three modalities and four datasets, including a real-world one.
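The abstract describes RL fine-tuning driven by programmatic feedback: generated Python code is executed and the resulting shape is compared against the target geometry. As an illustrative sketch only (the paper's actual reward may combine validity checks, IoU, and other terms; `sample_fn` is a hypothetical stand-in for the execute-and-sample pipeline), such a reward could score code by a symmetric Chamfer distance between sampled point clouds:

```python
import math


def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between two point sets (lists of 3-tuples)."""
    def one_way(src, dst):
        total = 0.0
        for p in src:
            total += min(math.dist(p, q) ** 2 for q in dst)
        return total / len(src)
    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)


def reward(generated_code, target_pts, sample_fn, n_points=256):
    """Execute generated CAD code and score it against the target.

    `sample_fn(code, n)` is assumed to run the code and sample `n` surface
    points; invalid code (any exception) receives zero reward, so the
    reward also pushes up the validity ratio mentioned in the paper.
    """
    try:
        pred_pts = sample_fn(generated_code, n_points)
    except Exception:
        return 0.0
    # Map distance to (0, 1]: identical shapes score 1.0.
    return math.exp(-chamfer_distance(pred_pts, target_pts))
```

A perfect reconstruction yields a Chamfer distance of zero and hence reward 1.0, while non-executable code is penalized maximally, matching the paper's stated goal of improving both reconstruction quality and validity.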

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces cadrille, a multimodal CAD reconstruction model that accepts point clouds, images, and text as inputs and outputs executable Python code for CAD generation. It resides in the 'Unified Multimodal CAD Reconstruction Frameworks' leaf, which contains four papers including the original work. This leaf represents a relatively focused research direction within the broader taxonomy of 30 papers across the field. The sibling papers in this leaf—GenCAD-3D, CAD-MLLM, and one other—similarly pursue end-to-end multimodal CAD synthesis, indicating a small but active cluster of work addressing the same core challenge.

The taxonomy tree reveals neighboring research directions that handle subsets of the problem. The 'Point Cloud to CAD Construction Sequence Reconstruction' leaf contains three papers focused solely on point cloud inputs, while 'Language-Guided CAD Synthesis with Large Language Models' explores text-driven generation using LLMs. The 'Multimodal Representation Learning for 3D Understanding' branch, with five papers in contrastive pre-training alone, addresses cross-modal alignment without direct CAD output. Cadrille's position bridges these areas by combining multimodal inputs with executable program synthesis, distinguishing it from representation-only methods and single-modality reconstruction approaches.

Among 26 candidates examined, the contribution-level analysis found no clear refutations for any of the three claimed contributions. For the core multimodal reconstruction model, 10 candidates were examined and none was found to constitute overlapping prior work. The RL fine-tuning contribution was checked against 6 candidates, again with no refutations. The state-of-the-art benchmark results contribution was checked against 10 candidates, similarly with no direct overlap found. This suggests that, within the limited search scope, the combination of multimodal inputs, RL-based refinement, and comprehensive benchmark evaluation appears relatively novel, though the small candidate pool and focused leaf structure indicate a nascent research area.

Based on the top-26 semantic matches and the sparse four-paper leaf structure, the work appears to occupy a relatively unexplored niche. The absence of refutable candidates across contributions may reflect both genuine novelty and the limited scale of the literature search. The taxonomy context shows that while related single-modality and representation-learning methods exist in neighboring branches, the specific combination of multimodal CAD reconstruction with RL fine-tuning has fewer direct precedents within the examined scope.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 26
Refutable papers: 0

Research Landscape Overview

Core task: Multimodal CAD reconstruction from point clouds, images, and text. The field structure reflects a progression from foundational representation learning to specialized generation and real-world application. Multimodal Representation Learning for 3D Understanding establishes cross-modal alignment techniques, with works like ULIP[1] and Point-Bind[5] bridging vision, language, and geometry. CAD Generation from Multimodal Inputs focuses on synthesizing parametric models from diverse inputs, encompassing unified frameworks that handle multiple modalities simultaneously, as well as methods targeting specific input types. Point Cloud Completion and Enhancement addresses geometric refinement, while Cross-Modal Shape Abstraction and Transformation explores how different representations can be translated and abstracted. Specialized 3D Generation and Analysis Tasks tackle domain-specific challenges, and Applied CAD Reconstruction and Reverse Engineering brings these techniques to industrial contexts, including manufacturing and design automation.

Recent activity centers on unified frameworks that integrate multiple input modalities into coherent CAD outputs, contrasting with earlier single-modality approaches. A key tension involves balancing geometric fidelity with parametric editability: some methods prioritize accurate surface reconstruction from point clouds, while others emphasize generating clean, editable CAD sequences from higher-level inputs like text or sketches.

Cadrille[0] sits within the unified multimodal frameworks cluster, alongside GenCAD-3D[2] and CAD-MLLM[25], emphasizing end-to-end learning that fuses point clouds, images, and textual descriptions. Compared to Draw Step by Step[3], which focuses on sequential sketch-based generation, Cadrille[0] adopts a broader multimodal stance, aiming to leverage complementary information across input types. This positioning reflects a growing interest in holistic reconstruction pipelines that can handle real-world scenarios where multiple data sources are available but individually incomplete.

Claimed Contributions

cadrille: multimodal CAD reconstruction model

The authors introduce cadrille, a vision-language model that unifies three input modalities (point clouds, multi-view images, and text) within a single framework to generate executable Python code for CAD reconstruction. This is the first multimodal approach to achieve state-of-the-art results across all three modalities.

10 retrieved papers
RL fine-tuning for multimodal CAD reconstruction

The authors propose a novel training pipeline that uses large-scale procedurally generated data for supervised fine-tuning, followed by reinforcement learning fine-tuning on handcrafted data without requiring CAD sequence annotations. This approach addresses domain gap issues and improves both reconstruction quality and validity ratio.

6 retrieved papers
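The report's keywords mention GRPO. Under the assumption that cadrille's RL stage follows the standard GRPO recipe (the paper details are not reproduced here), the model samples a group of candidate programs per input, scores each with the programmatic reward, and normalizes rewards within the group to obtain advantages, so no learned value function is required:

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each sample's reward within its group.

    `rewards` holds one scalar reward per sampled candidate program for the
    same input; samples above the group mean get positive advantage, those
    below get negative advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are centered within each group, they sum to (approximately) zero, so the policy update reinforces only the relatively better candidates for each input rather than the absolute reward scale.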
State-of-the-art results on 10 benchmarks across 3 modalities and 4 datasets

The authors demonstrate that their unified model achieves new state-of-the-art performance on 10 different benchmarks spanning three input modalities and four datasets, including a real-world dataset (CC3D), representing the most comprehensive evaluation of CAD reconstruction methods to date.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

cadrille: multimodal CAD reconstruction model


Contribution

RL fine-tuning for multimodal CAD reconstruction


Contribution

State-of-the-art results on 10 benchmarks across 3 modalities and 4 datasets
