Abstract:

We introduce TINKER, a novel framework for high-fidelity 3D editing without any per-scene finetuning, requiring only a single edited image (one-shot) or a few edited images (few-shot) as input. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistent edited input views, TINKER delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop a framework that generates multi-view consistent edited views without per-scene training and consists of two novel components: (1) Multi-view consistent editor: enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video scene completion model: leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Extensive experiments show that TINKER significantly reduces the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering-enhancement tasks while also demonstrating strong potential for 4D editing. We believe that TINKER represents a key step towards truly scalable, zero-shot 3D and 4D editing.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

TINKER contributes a feed-forward framework for multi-view consistent 3D editing from one or a few edited images, without per-scene optimization. It resides in the 'Feed-Forward Multi-View Editing' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy. This leaf sits under 'Multi-View Consistent 3D Editing', which itself comprises three sub-categories addressing different editing strategies. The small sibling count suggests this specific approach—training-free propagation of sparse edits—is less crowded than reconstruction or generation branches.

The taxonomy reveals neighboring work in 'Diffusion-Based Multi-View Editing' and 'Progressive and Hierarchical 3D Editing', which enforce consistency through diffusion priors or iterative refinement rather than feed-forward propagation. Adjacent branches like 'Sparse-View 3D Reconstruction' and 'Multi-View Enhancement and Completion' tackle related challenges of building or refining 3D representations from limited inputs, but do not focus on editing. TINKER's positioning bridges feed-forward efficiency with multi-view consistency, diverging from optimization-heavy methods in sibling categories while leveraging diffusion priors similar to neighboring diffusion-based editing approaches.

Of the thirty candidates examined (ten per contribution), the dataset contribution has one refutable match, as does the reference-driven editor, while the any-view-to-video completion model has none. This suggests the dataset and editor components face some prior overlap within the limited search scope, whereas the video-based completion approach may occupy less explored territory. The analysis reflects top-K semantic matches and citation expansion, not exhaustive coverage of all relevant literature.

Given the limited search scope and sparse taxonomy leaf, TINKER appears to address a less saturated research direction, though specific components show varying degrees of prior work overlap. The feed-forward editing paradigm itself remains relatively underexplored compared to reconstruction or generation tasks, as evidenced by the small sibling count and the concentration of examined candidates around particular contributions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Multi-view consistent 3D editing from sparse inputs. The field addresses the challenge of reconstructing, generating, and editing 3D content when only a handful of input views are available, rather than dense multi-view captures. The taxonomy reveals five main branches that collectively tackle different facets of this problem. Sparse-View 3D Reconstruction and Novel View Synthesis focuses on building complete 3D representations from limited observations, employing techniques ranging from neural radiance fields to Gaussian splatting methods like Mvsplat[1] and SparSplat[9]. Text-to-3D and Multi-View Diffusion Generation leverages generative models to produce consistent multi-view imagery or 3D assets from text prompts, often relying on diffusion-based priors. Multi-View Consistent 3D Editing directly targets the modification of existing 3D scenes while preserving geometric and appearance coherence across views. Multi-View Enhancement and Completion aims to refine or fill in missing details in sparse reconstructions, and Feed-Forward 3D Representation Learning develops efficient architectures that predict 3D structures in a single forward pass.

Within this landscape, a particularly active line of work explores feed-forward editing pipelines that bypass slow optimization loops. TINKER[0] exemplifies this direction by enabling rapid, instruction-driven edits on sparse-view scenes, positioning itself alongside other feed-forward editing methods like EditP23[13] and Edit3r[18]. These approaches contrast with optimization-heavy techniques such as MVPGS[3] or Sculpt3d[2], which iteratively refine 3D representations but require longer processing times. Another emerging theme involves integrating multi-view diffusion priors to ensure consistency during editing or completion, as seen in works like Gen3c[5] and MVDiffusion++[8].
TINKER[0] sits at the intersection of feed-forward efficiency and multi-view consistency, emphasizing real-time applicability while maintaining coherence across sparse inputs. Open questions remain around balancing edit fidelity with computational cost and generalizing across diverse scene types and editing instructions.

Claimed Contributions

Multi-view consistent image editing dataset and pipeline

The authors introduce the first large-scale dataset for multi-view consistent image editing, constructed using a novel pipeline that leverages pretrained diffusion models to generate locally consistent image pairs and filters them for global consistency.

10 retrieved papers · Can refute
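As a rough illustration of the pipeline's final stage, a global-consistency filter over edited views can be sketched as a pairwise embedding-similarity check. This is a minimal sketch, not the authors' actual pipeline: the function name `filter_globally_consistent`, the 0.85 threshold, and the use of plain cosine similarity are all assumptions for illustration.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_globally_consistent(edit_embeddings: list[np.ndarray],
                               threshold: float = 0.85) -> bool:
    """Keep a multi-view edit set only if every pair of edited views
    stays above a similarity threshold (a crude proxy for global
    consistency across viewpoints)."""
    n = len(edit_embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_sim(edit_embeddings[i], edit_embeddings[j]) < threshold:
                return False
    return True
```

In practice such a filter would run on features from a pretrained image encoder; here the embeddings are just vectors so the logic stays self-contained.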
Multi-view consistent editor with reference-driven editing

A novel component that fine-tunes large-scale image editing models using LoRA to perform reference-based editing, enabling the model to propagate editing intent across different viewpoints and achieve globally consistent results without per-scene optimization.

10 retrieved papers · Can refute
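The LoRA mechanism this contribution builds on can be shown with a minimal sketch: a frozen pretrained weight W plus a trainable low-rank update scaled by alpha/r. The class name, rank, and initialization values below are illustrative assumptions, not details from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer with a low-rank LoRA update:
    y = x @ (W + (alpha/r) * B @ A).T
    During fine-tuning only A and B would receive gradients; W stays frozen."""

    def __init__(self, W: np.ndarray, r: int = 4, alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                 # frozen pretrained weight
        self.A = rng.normal(0.0, 0.02, (r, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, r))              # trainable up-projection, zero-init
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ (self.W + self.scale * self.B @ self.A).T
```

Because B is zero-initialized, the adapted layer starts out identical to the frozen pretrained layer, which is what makes LoRA a low-risk way to specialize a large editing model.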
Any-view-to-video scene completion model

A depth-conditioned video diffusion model that reframes 3D editing as a reconstruction task, enabling efficient generation of dense multi-view consistent edited views from sparse reference inputs by exploiting spatial-temporal priors from video diffusion models.

10 retrieved papers
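One common way to realize depth conditioning, sketched here without any claim that it matches TINKER's actual design, is to concatenate normalized per-frame depth maps onto the latent channels fed to the video diffusion backbone. The helper `condition_on_depth` and its shapes are assumptions for illustration.

```python
import numpy as np

def condition_on_depth(latents: np.ndarray, depths: np.ndarray) -> np.ndarray:
    """Concatenate per-frame depth maps onto the latent channels, a simple
    way to inject geometric conditioning into a video diffusion backbone.
    Shapes: latents (T, C, H, W), depths (T, H, W) -> output (T, C+1, H, W)."""
    assert latents.shape[0] == depths.shape[0]
    assert latents.shape[2:] == depths.shape[1:]
    # Normalize depth to [0, 1] per frame for stable conditioning.
    d = depths - depths.min(axis=(1, 2), keepdims=True)
    d = d / np.maximum(d.max(axis=(1, 2), keepdims=True), 1e-8)
    return np.concatenate([latents, d[:, None]], axis=1)
```

The backbone's first convolution would then take C+1 input channels, with the extra channel carrying the geometry that keeps generated frames consistent with the sparse reference views.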

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

TINKER: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization | Novelty Validation