Abstract:

While virtual try-on has achieved significant progress, evaluating these models in real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) Current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) Most existing test sets are limited to indoor scenarios, lacking the complexity needed for real-world evaluation; and (3) An ideal benchmark should guide future advances in virtual try-on generation. To address these needs, we introduce the Virtual Try-on Benchmark (VTBench), the first hierarchical try-on benchmark suite, which systematically decomposes virtual image try-on into disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: 1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions of virtual try-on generation (namely, overall image quality, texture preservation, complex-background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics on the corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. 2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark’s alignment with perceptual quality across all evaluation dimensions. 3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To push the field of virtual try-on toward challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces VTBench, a hierarchical benchmark suite for virtual try-on evaluation spanning five dimensions: image quality, texture preservation, background consistency, size adaptability, and hand-occlusion handling. Within the taxonomy, it occupies the 'Comprehensive Benchmark Suites' leaf under 'Datasets, Benchmarks, and Evaluation Frameworks'. Notably, this leaf contains only the original paper itself; no sibling papers exist in this category. This isolation suggests that the research direction of multi-dimensional, hierarchical try-on benchmarking is relatively unexplored, in contrast with the crowded generation-methods branches, which contain over thirty papers across warping, diffusion, and specialized scenarios.

The taxonomy reveals that most neighboring work resides in generation methods (warping techniques, diffusion models) or application domains (in-the-wild try-on, immersive reality). The 'Datasets, Benchmarks, and Evaluation Frameworks' branch includes only three leaves: comprehensive benchmarks, 3D garment datasets, and inverse try-on datasets. While 3D reconstruction datasets like Deep Fashion3D and garment extraction benchmarks exist, none provide the systematic, multi-dimensional evaluation framework VTBench proposes. The taxonomy's scope notes clarify that generation-only methods belong elsewhere, reinforcing that VTBench's focus on evaluation criteria and human-aligned metrics distinguishes it from the algorithmic innovations dominating the field.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. The hierarchical benchmark suite itself (Contribution A) was compared against ten candidates with zero refutations, suggesting no prior work offers a comparable multi-dimensional decomposition. However, the novel unpaired evaluation metrics (Contribution B) encountered one refutable candidate among the ten examined, indicating some overlap with existing metric-development efforts. The curated test datasets and human annotations (Contribution C) also showed no refutations across ten candidates. These statistics reflect a limited search scope (top-K semantic matches plus citation expansion) rather than exhaustive coverage, meaning additional relevant work may exist beyond the examined set.

Given the limited search scope of thirty candidates, VTBench appears to occupy a sparse research direction within try-on benchmarking, particularly for hierarchical evaluation frameworks. The single refutation among three contributions suggests most claims are not contradicted by the examined literature, though the small candidate pool and the existence of one overlapping metric work indicate caution is warranted. The taxonomy structure confirms that systematic evaluation infrastructure lags behind generative method development, positioning VTBench as a methodological contribution addressing an underserved need in the field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Paper: 1

Research Landscape Overview

Core task: virtual try-on evaluation and benchmarking. The field has evolved into a multi-faceted landscape organized around six main branches. Virtual Try-On Generation Methods encompasses the algorithmic core, ranging from early warping-based approaches like VITON[14] to recent diffusion-driven models such as GP-VTON[1] and AnyFit[5], which tackle garment deformation, pose variation, and photorealism. Datasets, Benchmarks, and Evaluation Frameworks provide the empirical foundation, including 3D garment repositories like Deep Fashion3D[2] and comprehensive benchmark suites that standardize performance assessment. Application-Oriented and Domain-Specific Try-On explores specialized contexts, from street fashion in Street TryOn[9] to emerging metaverse platforms discussed in Fashion Metaverse[41], while Fit Assessment and Garment Design Support addresses practical concerns such as size prediction and garment customization. Surveys, Reviews, and User Studies, exemplified by Deep Learning Survey[10] and Image-Based Survey[12], synthesize progress and user-experience insights, and Auxiliary Methods and Related Tasks covers enabling technologies like body scanning and garment capture.

Within this ecosystem, a particularly active tension exists between generation quality and practical evaluation rigor. Many generation methods prioritize visual fidelity and pose robustness (HF-VTON[13] and OmniTry[6] push diffusion architectures toward higher realism), yet standardized benchmarking remains fragmented, with evaluation often relying on ad hoc metrics or limited datasets. VTBench[0] sits squarely in the Comprehensive Benchmark Suites cluster, addressing this gap by providing a unified evaluation framework that spans diverse try-on scenarios and metrics. Unlike generation-focused works such as GP-VTON[1] or fit-assessment tools like Color Histogram Fit[3], VTBench[0] emphasizes systematic comparison across methods, aiming to establish reproducible standards. This positions it as a methodological complement to the generative advances, offering the community a shared reference point for assessing progress and identifying open challenges in realism, garment fidelity, and cross-dataset generalization.

Claimed Contributions

VTBench: First comprehensive hierarchical benchmark suite for virtual try-on

The authors present VTBench, the first comprehensive benchmark suite specifically designed for evaluating virtual try-on models. It systematically decomposes virtual try-on quality into hierarchical, disentangled dimensions (general image quality, garment preservation, auxiliary consistency), each with tailored test sets and evaluation criteria to enable fine-grained assessment of model capabilities.

10 retrieved papers
Novel unpaired evaluation metrics for virtual try-on

The authors develop four novel unpaired evaluation metrics to overcome the difficulty of collecting paired try-on datasets: a font-texture similarity score for texture fidelity, a VLM-based cross-category plausibility assessment, a background-consistency calculator, and a hand-structure consistency evaluator. Together these enable comprehensive evaluation without requiring paired ground-truth data.

10 retrieved papers
Can Refute
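The report names the four metrics but does not reproduce their formulations. As an illustration of the unpaired idea, the sketch below computes a toy background-consistency score: it compares the source person image with the generated try-on only outside a garment mask, so no paired ground truth is required. The function name, array conventions, and the `garment_mask` input are our assumptions, not definitions from the paper.

```python
import numpy as np

def background_consistency(person_img, tryon_img, garment_mask):
    """Toy unpaired metric: how well the background survives try-on.

    person_img, tryon_img: float arrays of shape (H, W, 3) in [0, 1].
    garment_mask: boolean (H, W) array, True where the garment (and
    thus legitimate change) is expected.
    Returns a score in [0, 1]; higher means a better-preserved background.
    """
    background = ~garment_mask                 # pixels that should not change
    if not background.any():                   # degenerate mask: nothing to score
        return 1.0
    # Per-pixel absolute error, averaged over the RGB channels.
    per_pixel_err = np.abs(person_img - tryon_img).mean(axis=-1)
    return float(1.0 - per_pixel_err[background].mean())
```

A real implementation would likely substitute a perceptual distance (e.g. LPIPS) for raw pixel error, but the unpaired structure, masking out the edited region and comparing against the input itself, is the same.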
Curated test datasets and human preference annotations for each evaluation dimension

The authors collect and curate specialized test datasets for each evaluation dimension (complex background, font texture, cross-category, hand-occlusion) and provide human preference annotations to validate that VTBench evaluations align strongly with human perceptual judgments across all dimensions.

10 retrieved papers
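The claim that evaluations "align strongly with human perceptual judgments" is typically quantified as a rank correlation between automatic metric scores and human preference scores over the same generated results. The report gives no formula, so the pure-Python Spearman sketch below is our illustration of such a check, not the paper's actual protocol.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                              # extend over a run of ties
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(metric_scores, human_scores):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A value near 1 would mean the metric ranks outputs the way annotators do; in practice such a correlation would be reported per evaluation dimension, over many models' outputs.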

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

VTBench: First comprehensive hierarchical benchmark suite for virtual try-on

The authors present VTBench, the first comprehensive benchmark suite specifically designed for evaluating virtual try-on models. It systematically decomposes virtual try-on quality into hierarchical, disentangled dimensions (general image quality, garment preservation, auxiliary consistency), each with tailored test sets and evaluation criteria to enable fine-grained assessment of model capabilities.

Contribution

Novel unpaired evaluation metrics for virtual try-on

The authors develop four novel unpaired evaluation metrics to overcome the difficulty of collecting paired try-on datasets: a font-texture similarity score for texture fidelity, a VLM-based cross-category plausibility assessment, a background-consistency calculator, and a hand-structure consistency evaluator. Together these enable comprehensive evaluation without requiring paired ground-truth data.

Contribution

Curated test datasets and human preference annotations for each evaluation dimension

The authors collect and curate specialized test datasets for each evaluation dimension (complex background, font texture, cross-category, hand-occlusion) and provide human preference annotations to validate that VTBench evaluations align strongly with human perceptual judgments across all dimensions.