Abstract:

While virtual try-on has achieved significant progress, evaluating these models in real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) Current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) Most existing test sets are limited to indoor scenarios, lacking the complexity needed for real-world evaluation; and (3) An ideal benchmark should guide future advances in virtual try-on generation. To address these needs, we introduce the Virtual Try-on Benchmark (VTBench), the first hierarchical try-on benchmark suite, which systematically decomposes virtual image try-on into disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: 1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions of virtual try-on generation (namely, overall image quality, texture preservation, complex-background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics on the corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. 2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark’s alignment with perceptual quality across all evaluation dimensions. 3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To push the field of virtual try-on toward challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VTBench, a hierarchical benchmark suite for virtual try-on evaluation spanning five dimensions: image quality, texture preservation, background consistency, size adaptability, and hand-occlusion handling. Within the taxonomy, it occupies the 'Comprehensive Benchmark Suites' leaf under 'Datasets, Benchmarks, and Evaluation Frameworks'. Notably, this leaf contains only the original paper itself; no sibling papers exist in this category. This isolation suggests that the research direction of multi-dimensional, hierarchical try-on benchmarking is relatively unexplored, in contrast with the crowded generation-methods branches, which contain over thirty papers across warping, diffusion, and specialized scenarios.

The taxonomy reveals that most neighboring work resides in generation methods (warping techniques, diffusion models) or application domains (in-the-wild try-on, immersive reality). The 'Datasets, Benchmarks, and Evaluation Frameworks' branch includes only three leaves: comprehensive benchmarks, 3D garment datasets, and inverse try-on datasets. While 3D reconstruction datasets like Deep Fashion3D and garment extraction benchmarks exist, none provide the systematic, multi-dimensional evaluation framework VTBench proposes. The taxonomy's scope notes clarify that generation-only methods belong elsewhere, reinforcing that VTBench's focus on evaluation criteria and human-aligned metrics distinguishes it from the algorithmic innovations dominating the field.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. The hierarchical benchmark suite itself (Contribution A) was compared against ten candidates with zero refutations, suggesting no prior work offers a comparable multi-dimensional decomposition. However, the novel unpaired evaluation metrics (Contribution B) encountered one refutable candidate among the ten examined, indicating some overlap with existing metric-development efforts. The curated test datasets and human annotations (Contribution C) also showed no refutations across ten candidates. These statistics reflect a limited search scope (top-K semantic matches plus citation expansion) rather than exhaustive coverage, meaning additional relevant work may exist beyond the examined set.

Given the limited search scope of thirty candidates, VTBench appears to occupy a sparse research direction within try-on benchmarking, particularly for hierarchical evaluation frameworks. The single refutation among three contributions suggests most claims are not contradicted by the examined literature, though the small candidate pool and the existence of one overlapping metric work indicate caution is warranted. The taxonomy structure confirms that systematic evaluation infrastructure lags behind generative method development, positioning VTBench as a methodological contribution addressing an underserved need in the field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: virtual try-on evaluation and benchmarking. The field has evolved into a multi-faceted landscape organized around six main branches. Virtual Try-On Generation Methods encompasses the algorithmic core, ranging from early warping-based approaches like VITON[14] to recent diffusion-driven models such as GP-VTON[1] and AnyFit[5], which tackle garment deformation, pose variation, and photorealism. Datasets, Benchmarks, and Evaluation Frameworks provide the empirical foundation, including 3D garment repositories like Deep Fashion3D[2] and comprehensive benchmark suites that standardize performance assessment. Application-Oriented and Domain-Specific Try-On explores specialized contexts, from street fashion in Street TryOn[9] to emerging metaverse platforms discussed in Fashion Metaverse[41], while Fit Assessment and Garment Design Support addresses practical concerns such as size prediction and garment customization. Surveys, Reviews, and User Studies, exemplified by Deep Learning Survey[10] and Image-Based Survey[12], synthesize progress and user-experience insights, and Auxiliary Methods and Related Tasks covers enabling technologies like body scanning and garment capture.

Within this ecosystem, a particularly active tension exists between generation quality and practical evaluation rigor. Many generation methods prioritize visual fidelity and pose robustness (HF-VTON[13] and OmniTry[6] push diffusion architectures toward higher realism), yet standardized benchmarking remains fragmented, with evaluation often relying on ad hoc metrics or limited datasets. VTBench[0] sits squarely in the Comprehensive Benchmark Suites cluster, addressing this gap by providing a unified evaluation framework that spans diverse try-on scenarios and metrics. Unlike generation-focused works such as GP-VTON[1] or fit-assessment tools like Color Histogram Fit[3], VTBench[0] emphasizes systematic comparison across methods, aiming to establish reproducible standards. This positions it as a methodological complement to the generative advances, offering the community a shared reference point for assessing progress and identifying open challenges in realism, garment fidelity, and cross-dataset generalization.

Claimed Contributions

VTBench: First comprehensive hierarchical benchmark suite for virtual try-on

The authors present VTBench, the first comprehensive benchmark suite specifically designed for evaluating virtual try-on models. It systematically decomposes virtual try-on quality into hierarchical, disentangled dimensions (general image quality, garment preservation, auxiliary consistency), each with tailored test sets and evaluation criteria to enable fine-grained assessment of model capabilities.

10 retrieved papers
Novel unpaired evaluation metrics for virtual try-on

The authors develop four novel unpaired evaluation metrics to overcome the difficulty of collecting paired try-on datasets: a font-texture similarity score for texture fidelity, a VLM-based cross-category plausibility assessment, a background-consistency calculator, and a hand-structure consistency evaluator. Together these enable comprehensive evaluation without requiring paired ground-truth data.

10 retrieved papers
Can Refute
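The report names the four metrics but does not reproduce their formulations. As an illustration of the unpaired idea, the sketch below computes a toy background-consistency score: it compares the source person image with the generated try-on only outside a garment mask, so no paired ground truth is required. The function name, array conventions, and the `garment_mask` input are our assumptions, not definitions from the paper.

```python
import numpy as np

def background_consistency(person_img, tryon_img, garment_mask):
    """Toy unpaired metric: how well the background survives try-on.

    person_img, tryon_img: float arrays of shape (H, W, 3) in [0, 1].
    garment_mask: boolean (H, W) array, True where the garment (and
    thus legitimate change) is expected.
    Returns a score in [0, 1]; higher means a better-preserved background.
    """
    background = ~garment_mask                 # pixels that should not change
    if not background.any():                   # degenerate mask: nothing to score
        return 1.0
    # Per-pixel absolute error, averaged over the RGB channels.
    per_pixel_err = np.abs(person_img - tryon_img).mean(axis=-1)
    return float(1.0 - per_pixel_err[background].mean())
```

A real implementation would likely substitute a perceptual distance (e.g. LPIPS) for raw pixel error, but the unpaired structure, masking out the edited region and comparing against the input itself, is the same.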
Curated test datasets and human preference annotations for each evaluation dimension

The authors collect and curate specialized test datasets for each evaluation dimension (complex background, font texture, cross-category, hand-occlusion) and provide human preference annotations to validate that VTBench evaluations align strongly with human perceptual judgments across all dimensions.

10 retrieved papers
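The claim that evaluations "align strongly with human perceptual judgments" is typically quantified as a rank correlation between automatic metric scores and human preference scores over the same generated results. The report gives no formula, so the pure-Python Spearman sketch below is our illustration of such a check, not the paper's actual protocol.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                              # extend over a run of ties
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A value near 1 would mean the metric ranks outputs the way annotators do; in practice such a correlation would be reported per evaluation dimension, over many models' outputs.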

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VTBench: First comprehensive hierarchical benchmark suite for virtual try-on

The authors present VTBench, the first comprehensive benchmark suite specifically designed for evaluating virtual try-on models. It systematically decomposes virtual try-on quality into hierarchical, disentangled dimensions (general image quality, garment preservation, auxiliary consistency), each with tailored test sets and evaluation criteria to enable fine-grained assessment of model capabilities.

Contribution

Novel unpaired evaluation metrics for virtual try-on

The authors develop four novel unpaired evaluation metrics to overcome the difficulty of collecting paired try-on datasets: a font-texture similarity score for texture fidelity, a VLM-based cross-category plausibility assessment, a background-consistency calculator, and a hand-structure consistency evaluator. Together these enable comprehensive evaluation without requiring paired ground-truth data.

Contribution

Curated test datasets and human preference annotations for each evaluation dimension

The authors collect and curate specialized test datasets for each evaluation dimension (complex background, font texture, cross-category, hand-occlusion) and provide human preference annotations to validate that VTBench evaluations align strongly with human perceptual judgments across all dimensions.