cadrille: Multi-modal CAD Reconstruction with Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: CAD, 3D reconstruction, LLM, VLM, point cloud, DPO, GRPO
Abstract:

Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, most existing methods focus on a single input modality: point clouds, images, or text, which limits their generalizability and robustness, while the few multimodal approaches struggle to deliver competitive quality. Leveraging advances in vision-language models (VLM), we propose cadrille, a multimodal CAD reconstruction model that takes inputs of three modalities and outputs executable Python code for CAD reconstruction. Inspired by the large language model (LLM) training paradigm, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on as many as 10 benchmarks across three modalities and four datasets, including a real-world one.
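The abstract describes RL fine-tuning driven by programmatic feedback: generated Python code is executed and the resulting shape is compared against the target geometry. As an illustrative sketch only (the paper's actual reward may combine validity checks, IoU, and other terms; `sample_fn` is a hypothetical stand-in for the execute-and-sample pipeline), such a reward could score code by a symmetric Chamfer distance between sampled point clouds:

```python
import math


def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between two point sets (lists of 3-tuples)."""
    def one_way(src, dst):
        total = 0.0
        for p in src:
            total += min(math.dist(p, q) ** 2 for q in dst)
        return total / len(src)
    return one_way(pts_a, pts_b) + one_way(pts_b, pts_a)


def reward(generated_code, target_pts, sample_fn, n_points=256):
    """Execute generated CAD code and score it against the target.

    `sample_fn(code, n)` is assumed to run the code and sample `n` surface
    points; invalid code (any exception) receives zero reward, so the
    reward also pushes up the validity ratio mentioned in the paper.
    """
    try:
        pred_pts = sample_fn(generated_code, n_points)
    except Exception:
        return 0.0
    # Map distance to (0, 1]: identical shapes score 1.0.
    return math.exp(-chamfer_distance(pred_pts, target_pts))
```

A perfect reconstruction yields a Chamfer distance of zero and hence reward 1.0, while non-executable code is penalized maximally, matching the paper's stated goal of improving both reconstruction quality and validity.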

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces cadrille, a multimodal CAD reconstruction model that accepts point clouds, images, and text as inputs and outputs executable Python code for CAD generation. It resides in the 'Unified Multimodal CAD Reconstruction Frameworks' leaf, which contains four papers including the original work. This leaf represents a relatively focused research direction within the broader taxonomy of 30 papers across the field. The sibling papers in this leaf—GenCAD-3D, CAD-MLLM, and one other—similarly pursue end-to-end multimodal CAD synthesis, indicating a small but active cluster of work addressing the same core challenge.

The taxonomy tree reveals neighboring research directions that handle subsets of the problem. The 'Point Cloud to CAD Construction Sequence Reconstruction' leaf contains three papers focused solely on point cloud inputs, while 'Language-Guided CAD Synthesis with Large Language Models' explores text-driven generation using LLMs. The 'Multimodal Representation Learning for 3D Understanding' branch, with five papers in contrastive pre-training alone, addresses cross-modal alignment without direct CAD output. Cadrille's position bridges these areas by combining multimodal inputs with executable program synthesis, distinguishing it from representation-only methods and single-modality reconstruction approaches.

Among 26 candidates examined, the contribution-level analysis found no clear refutations for any of the three claimed contributions. For the core multimodal reconstruction model, 10 candidates were examined and none was found to constitute overlapping prior work. The RL fine-tuning contribution was checked against 6 candidates, again with no refutations. The state-of-the-art benchmark results contribution was checked against 10 candidates, similarly with no direct overlap found. This suggests that, within the limited search scope, the combination of multimodal inputs, RL-based refinement, and comprehensive benchmark evaluation appears relatively novel, though the small candidate pool and focused leaf structure indicate a nascent research area.

Based on the top-26 semantic matches and the sparse four-paper leaf structure, the work appears to occupy a relatively unexplored niche. The absence of refutable candidates across contributions may reflect both genuine novelty and the limited scale of the literature search. The taxonomy context shows that while related single-modality and representation-learning methods exist in neighboring branches, the specific combination of multimodal CAD reconstruction with RL fine-tuning has fewer direct precedents within the examined scope.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 26
Refutable papers: 0

Research Landscape Overview

Core task: Multimodal CAD reconstruction from point clouds, images, and text. The field structure reflects a progression from foundational representation learning to specialized generation and real-world application. Multimodal Representation Learning for 3D Understanding establishes cross-modal alignment techniques, with works like ULIP[1] and Point-Bind[5] bridging vision, language, and geometry. CAD Generation from Multimodal Inputs focuses on synthesizing parametric models from diverse inputs, encompassing unified frameworks that handle multiple modalities simultaneously, as well as methods targeting specific input types. Point Cloud Completion and Enhancement addresses geometric refinement, while Cross-Modal Shape Abstraction and Transformation explores how different representations can be translated and abstracted. Specialized 3D Generation and Analysis Tasks tackle domain-specific challenges, and Applied CAD Reconstruction and Reverse Engineering brings these techniques to industrial contexts, including manufacturing and design automation.

Recent activity centers on unified frameworks that integrate multiple input modalities into coherent CAD outputs, contrasting with earlier single-modality approaches. A key tension involves balancing geometric fidelity with parametric editability: some methods prioritize accurate surface reconstruction from point clouds, while others emphasize generating clean, editable CAD sequences from higher-level inputs like text or sketches.

Cadrille[0] sits within the unified multimodal frameworks cluster, alongside GenCAD-3D[2] and CAD-MLLM[25], emphasizing end-to-end learning that fuses point clouds, images, and textual descriptions. Compared to Draw Step by Step[3], which focuses on sequential sketch-based generation, Cadrille[0] adopts a broader multimodal stance, aiming to leverage complementary information across input types. This positioning reflects a growing interest in holistic reconstruction pipelines that can handle real-world scenarios where multiple data sources are available but individually incomplete.

Claimed Contributions

cadrille: multimodal CAD reconstruction model

The authors introduce cadrille, a vision-language model that unifies three input modalities (point clouds, multi-view images, and text) within a single framework to generate executable Python code for CAD reconstruction. This is the first multimodal approach to achieve state-of-the-art results across all three modalities.

10 retrieved papers
RL fine-tuning for multimodal CAD reconstruction

The authors propose a novel training pipeline that uses large-scale procedurally generated data for supervised fine-tuning, followed by reinforcement learning fine-tuning on handcrafted data without requiring CAD sequence annotations. This approach addresses domain gap issues and improves both reconstruction quality and validity ratio.

6 retrieved papers
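The report's keywords mention GRPO. Under the assumption that cadrille's RL stage follows the standard GRPO recipe (the paper details are not reproduced here), the model samples a group of candidate programs per input, scores each with the programmatic reward, and normalizes rewards within the group to obtain advantages, so no learned value function is required:

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each sample's reward within its group.

    `rewards` holds one scalar reward per sampled candidate program for the
    same input; samples above the group mean get positive advantage, those
    below get negative advantage.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are centered within each group, they sum to (approximately) zero, so the policy update reinforces only the relatively better candidates for each input rather than the absolute reward scale.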
State-of-the-art results on 10 benchmarks across 3 modalities and 4 datasets

The authors demonstrate that their unified model achieves new state-of-the-art performance on 10 different benchmarks spanning three input modalities and four datasets, including a real-world dataset (CC3D), representing the most comprehensive evaluation of CAD reconstruction methods to date.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

cadrille: multimodal CAD reconstruction model


Contribution

RL fine-tuning for multimodal CAD reconstruction


Contribution

State-of-the-art results on 10 benchmarks across 3 modalities and 4 datasets
