Referring Layer Decomposition

ICLR 2026 Conference Submission · Anonymous Authors
Dataset · Benchmark · Layer Decomposition
Abstract:

Precise, object-aware control over visual content is essential for advanced image editing and compositional generation. Yet most existing approaches operate on entire images holistically, limiting the ability to isolate and manipulate individual scene elements. In contrast, layered representations, where scenes are explicitly separated into objects, environmental context, and visual effects, provide a more intuitive and structured framework for interpreting and editing visual content. To bridge this gap and enable both compositional understanding and controllable editing, we introduce the Referring Layer Decomposition (RLD) task, which predicts complete RGBA layers from a single RGB image, conditioned on flexible user prompts such as spatial inputs (e.g., points, boxes, masks), natural language descriptions, or combinations thereof. At the core is RefLade, a large-scale dataset comprising 1.11M image–layer–prompt triplets produced by our scalable data engine, along with 100K manually curated, high-fidelity layers. Coupled with a perceptually grounded, human-preference-aligned automatic evaluation protocol, RefLade establishes RLD as a well-defined and benchmarkable research task. Building on this foundation, we present RefLayer, a simple baseline designed for prompt-conditioned layer decomposition, achieving high visual fidelity and semantic alignment. Extensive experiments show that our approach enables effective training, reliable evaluation, and high-quality image decomposition, while exhibiting strong zero-shot generalization. We will release our dataset, evaluation tools, and model for future research.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the Referring Layer Decomposition (RLD) task, which decomposes RGB images into RGBA layers conditioned on flexible prompts (spatial inputs, text, or combinations). According to the taxonomy, this work resides in the 'Text-Guided Image Layer Decomposition' leaf under 'Layer Decomposition and Extraction'. This leaf contains only three papers total, including the original work, indicating a relatively sparse research direction. The sibling papers focus on similar decomposition goals but differ in prompt modalities or target domains, suggesting RLD occupies a distinct niche within a small but emerging subfield.

The taxonomy reveals that 'Layer Decomposition and Extraction' sits alongside 'Layer-Aware Generation' (which synthesizes layers from scratch) and 'Domain-Specific Layer-Based Generation' (garments, 3D scenes, documents). The original paper's leaf excludes video decomposition and non-textual guidance, distinguishing it from neighboring categories like 'Video Layer Decomposition' and 'Document and Design Layer Decomposition'. The broader field comprises sixteen papers across multiple branches, with text-guided image decomposition representing a minority direction. This positioning suggests the work addresses a gap between general matting techniques and compositional generation pipelines.

Among the twenty-six candidates examined, the RLD task formulation shows no clear refutation across its six candidates, the RefLade dataset encounters two refutable candidates among its ten, and the evaluation protocol encounters none among its ten. The limited search scope (top-K semantic matches plus citation expansion) means these statistics reflect a targeted sample rather than exhaustive coverage. The dataset contribution appears to have more substantial prior-work overlap, whereas the task definition and evaluation protocol seem less directly anticipated by existing literature within the examined set.

Based on the limited search of twenty-six candidates, the work appears to carve out a specific niche in prompt-conditioned layer decomposition, though the dataset component overlaps with prior efforts. The taxonomy structure confirms this is a sparsely populated research direction, with only two sibling papers in the same leaf. However, the analysis does not cover the full landscape of matting, segmentation, or compositional generation methods that may inform or constrain the novelty assessment.

Taxonomy

Core-task Taxonomy Papers: 16
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 2

Research Landscape Overview

Core task: Prompt-conditioned RGBA layer decomposition from images. The field centers on extracting and generating compositional image layers (typically with alpha channels) guided by text or other prompts. The taxonomy reveals four main branches. Layer Decomposition and Extraction focuses on parsing existing images into editable components, often leveraging matting or segmentation techniques to isolate foreground objects or semantic regions. Layer-Aware Generation emphasizes synthesizing new layered content from scratch, producing RGBA outputs that can be composited flexibly. Domain-Specific Layer-Based Generation tailors these ideas to specialized contexts such as garment modeling or document layout, where domain priors guide layer structure. Foundational Scene Understanding underpins many of these methods by providing robust representations of depth, occlusion, and semantic boundaries. Together, these branches reflect a shift from monolithic image editing toward modular, compositional workflows that afford fine-grained control.

Within Layer Decomposition and Extraction, a small handful of works explore text-guided decomposition, where natural language queries specify which elements to isolate. Referring Layer Decomposition[0] exemplifies this direction by enabling users to extract arbitrary layers via referring expressions, contrasting with earlier matting approaches like Portrait Image Matting[13] that target predefined categories. Nearby, Graphic Design Decomposition[14] tackles structured layouts rather than photographic scenes, highlighting the diversity of decomposition targets. Meanwhile, Layer-Aware Generation includes methods such as RGBA Instance Generation[9] and ImmerseGen[2], which synthesize layered assets directly, and Layerflow[3], which orchestrates multi-layer generation pipelines.
The interplay between decomposition and generation remains an open question: whether to parse real images into layers or to generate layered content ab initio depends on the application, and hybrid approaches that refine extracted layers with generative priors are emerging as a promising middle ground.

Claimed Contributions

Referring Layer Decomposition (RLD) task

The authors formalize a novel task that extracts targeted RGBA layers from RGB images based on multi-modal user prompts such as spatial inputs (points, boxes, masks), natural language descriptions, or combinations thereof. This task enables compositional understanding and controllable editing of visual content.

6 retrieved papers
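The implicit contract of the RLD task is that the predicted RGBA layers should recomposite into the input RGB image. As an illustration only (the paper's exact compositing model, layer ordering, and alpha convention are not specified here; the back-to-front straight-alpha "over" operator below is an assumption), such a reconstruction check can be sketched as:

```python
import numpy as np

def composite_over(layers):
    """Composite a back-to-front list of RGBA layers (floats in [0, 1])
    with the standard straight-alpha 'over' operator, returning RGB.

    Illustrative sketch only: layer ordering and the straight-alpha
    convention are assumptions, not the paper's specified model.
    """
    h, w = layers[0].shape[:2]
    out = np.zeros((h, w, 3), dtype=np.float64)  # black backdrop
    for layer in layers:  # back-to-front
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = a * rgb + (1.0 - a) * out  # 'over': src covers dst by alpha
    return out
```

A decomposition method could use such a check as a sanity test: compositing the predicted layers should closely reconstruct the input image.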
RefLade dataset and data engine

The authors develop a scalable, automated data engine and use it to construct RefLade, a dataset of 1.11 million image-layer-prompt triplets with human-curated splits. This dataset establishes RLD as a trainable and benchmarkable research task.

10 retrieved papers (Can Refute)
Human-preference-aligned evaluation protocol

The authors design an automatic evaluation protocol that assesses layer decomposition along three dimensions (preservation, completion, faithfulness) and aggregates them into a unified HPA score that strongly correlates with human judgments, enabling reliable benchmarking without human-in-the-loop evaluation.

10 retrieved papers
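The protocol scores each decomposition along preservation, completion, and faithfulness and fuses them into a single HPA number. The report does not give the aggregation formula; a minimal sketch, assuming a weighted mean over scores in [0, 1] (the equal weights below are hypothetical, whereas the paper fits its aggregation to human preferences), might look like:

```python
def hpa_score(preservation, completion, faithfulness,
              weights=(1 / 3, 1 / 3, 1 / 3)):
    """Aggregate three per-dimension scores (assumed in [0, 1]) into one
    HPA value via a weighted mean.

    Hypothetical sketch: the report only names the three dimensions;
    the weighted-mean form and equal default weights are assumptions.
    """
    scores = (preservation, completion, faithfulness)
    return sum(w * s for w, s in zip(weights, scores))
```

Under this form, calibrating the weights against human rankings is what would make the aggregate "human-preference-aligned" rather than an arbitrary average.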

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Referring Layer Decomposition (RLD) task

Contribution: RefLade dataset and data engine

Contribution: Human-preference-aligned evaluation protocol