VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models
Overview
Overall Novelty Assessment
The paper introduces VTBench, a hierarchical benchmark suite for virtual try-on evaluation spanning five dimensions: image quality, texture preservation, background consistency, size adaptability, and hand-occlusion handling. Within the taxonomy, it occupies the 'Comprehensive Benchmark Suites' leaf under 'Datasets, Benchmarks, and Evaluation Frameworks'. Notably, this leaf contains only the paper itself; no sibling papers exist in this category. This isolation suggests that multi-dimensional, hierarchical try-on benchmarking is relatively unexplored, in contrast to the crowded generation-method branches, which together contain over thirty papers across warping, diffusion, and specialized scenarios.
The taxonomy reveals that most neighboring work resides in generation methods (warping techniques, diffusion models) or application domains (in-the-wild try-on, immersive reality). The 'Datasets, Benchmarks, and Evaluation Frameworks' branch includes only three leaves: comprehensive benchmarks, 3D garment datasets, and inverse try-on datasets. While 3D reconstruction datasets like Deep Fashion3D and garment extraction benchmarks exist, none provide the systematic, multi-dimensional evaluation framework VTBench proposes. The taxonomy's scope notes clarify that generation-only methods belong elsewhere, reinforcing that VTBench's focus on evaluation criteria and human-aligned metrics distinguishes it from the algorithmic innovations dominating the field.
Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the hierarchical benchmark suite itself (Contribution A), ten candidates were examined with zero refutations, suggesting that no prior work among them offers a comparable multi-dimensional decomposition. For the novel unpaired evaluation metrics (Contribution B), however, one of the ten candidates examined was refutable, indicating some overlap with existing metric development efforts. The curated test datasets and human annotations (Contribution C), like Contribution A, showed no refutations across ten candidates. These statistics reflect a limited search scope (top-K semantic matches plus citation expansion) rather than exhaustive coverage, so additional relevant work may exist beyond the examined set.
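The per-contribution counts in the paragraph above can be tallied with a short script. This is only a bookkeeping sketch: the numbers are transcribed from that paragraph, and the dictionary layout and variable names are illustrative assumptions, not part of the assessment pipeline.

```python
# Counts transcribed from the paragraph above. The labels A, B and C follow
# its naming; the data layout itself is an illustrative assumption.
candidates = {
    "A: hierarchical benchmark suite": {"examined": 10, "refutable": 0},
    "B: unpaired evaluation metrics": {"examined": 10, "refutable": 1},
    "C: test datasets and annotations": {"examined": 10, "refutable": 0},
}

# Aggregate across the three contributions.
total_examined = sum(c["examined"] for c in candidates.values())
total_refutable = sum(c["refutable"] for c in candidates.values())

for name, counts in candidates.items():
    print(f"{name}: {counts['refutable']}/{counts['examined']} refutable")
print(f"total: {total_refutable}/{total_examined} refutable")
```

The totals match the thirty candidates and the single refutable candidate cited in the summary.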
Given the limited search scope of thirty candidates, VTBench appears to occupy a sparse research direction within try-on benchmarking, particularly for hierarchical evaluation frameworks. The single refutation among three contributions suggests most claims are not contradicted by the examined literature, though the small candidate pool and the existence of one overlapping metric work indicate caution is warranted. The taxonomy structure confirms that systematic evaluation infrastructure lags behind generative method development, positioning VTBench as a methodological contribution addressing an underserved need in the field.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present VTBench, the first comprehensive benchmark suite specifically designed for evaluating virtual try-on models. It systematically decomposes virtual try-on quality into hierarchical, disentangled dimensions (general image quality, garment preservation, auxiliary consistency), each with tailored test sets and evaluation criteria to enable fine-grained assessment of model capabilities.
The authors develop four novel unpaired evaluation metrics to overcome the difficulty of collecting paired try-on datasets. These metrics comprise a font texture similarity score for texture fidelity, a VLM-based cross-category plausibility assessment, a background consistency calculator, and a hand-structure consistency evaluator, enabling comprehensive evaluation without requiring paired ground-truth data.
The authors collect and curate specialized test datasets for each evaluation dimension (complex background, font texture, cross-category, hand-occlusion) and provide human preference annotations to validate that VTBench evaluations align strongly with human perceptual judgments across all dimensions.
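One common way to quantify how well an automatic metric aligns with human preference is a rank correlation between per-model metric scores and per-model human ratings. The sketch below computes Spearman's rank correlation in plain Python. It is an assumption about how such a check could be run, not the protocol the authors actually used, and the function names and the five score values are made up purely for illustration.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx = average_ranks(metric_scores)
    ry = average_ranks(human_scores)
    n = len(rx)
    mean_x = sum(rx) / n
    mean_y = sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for five try-on models (NOT results from the paper).
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]  # automatic metric, one per model
human_scores = [0.40, 0.58, 0.35, 0.52, 0.75]   # human preference rate, one per model

rho = spearman(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would indicate that the metric orders models much as human raters do. With real data, one such correlation could be computed separately for each evaluation dimension.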
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
VTBench: First comprehensive hierarchical benchmark suite for virtual try-on
The authors present VTBench, the first comprehensive benchmark suite specifically designed for evaluating virtual try-on models. It systematically decomposes virtual try-on quality into hierarchical, disentangled dimensions (general image quality, garment preservation, auxiliary consistency), each with tailored test sets and evaluation criteria to enable fine-grained assessment of model capabilities.
[7] ShineOn: Illuminating design choices for practical video-based virtual clothing try-on
[9] Street TryOn: Learning in-the-wild virtual try-on from unpaired person images
[10] Deep learning in virtual try-on: A comprehensive survey
[50] Prediction of garment fit level in 3D virtual environment based on artificial neural networks
[59] A multi-level consistency network for high-fidelity virtual try-on
[60] HOOD: Hierarchical graphs for generalized modelling of clothing dynamics
[61] Virtual try-on systems in fashion consumption: A systematic review
[62] Towards high-fidelity 3D virtual try-on via global collaborative modeling
[63] RMGN: A regional mask guided network for parser-free virtual try-on
[64] Neural Style Transfer for Image-Based Garment Interchange Through Multi-Person Human Views
Novel unpaired evaluation metrics for virtual try-on
The authors develop four novel unpaired evaluation metrics to overcome the difficulty of collecting paired try-on datasets. These metrics comprise a font texture similarity score for texture fidelity, a VLM-based cross-category plausibility assessment, a background consistency calculator, and a hand-structure consistency evaluator, enabling comprehensive evaluation without requiring paired ground-truth data.
[40] Better Fit: Accommodate Variations in Clothing Types for Virtual Try-On
[9] Street TryOn: Learning in-the-wild virtual try-on from unpaired person images
[51] High-resolution virtual try-on with misalignment and occlusion-handled conditions
[52] Image based virtual try-on network from unpaired data
[53] Bridging Fashion and Technology: Synthetic Human Models for an Enhanced E-Commerce Experience
[54] WildVidFit: Video virtual try-on in the wild via image-based controlled diffusion models
[55] Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN
[56] SVTON: Simplified Virtual Try-On
[57] PASTA-GAN++: A Versatile Framework for High-Resolution Unpaired Virtual Try-on
[58] Training-free Clothing Region of Interest Self-correction for Virtual Try-On
Curated test datasets and human preference annotations for each evaluation dimension
The authors collect and curate specialized test datasets for each evaluation dimension (complex background, font texture, cross-category, hand-occlusion) and provide human preference annotations to validate that VTBench evaluations align strongly with human perceptual judgments across all dimensions.