Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a new training paradigm that integrates human perceptual structure at the initialization stage of model training, rather than applying it as a post-hoc fine-tuning step. This approach uses human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder before large-scale contrastive pretraining.
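The perceptual-initialization stage fits the encoder to human two-alternative forced-choice (2AFC) triplet judgments of the kind NIGHTS provides: given a reference image and two alternatives, humans indicate which alternative looks more similar. A minimal numpy sketch of such a 2AFC triplet objective is below; the function name, margin value, and array layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def perceptual_triplet_loss(ref, alt_a, alt_b, human_choice, margin=0.05):
    """Hinge loss on human 2AFC similarity judgments.

    ref, alt_a, alt_b: (N, D) embeddings of reference and the two alternatives.
    human_choice: (N,) array; 0 if humans judged alt_a closer to ref, 1 if alt_b.
    The loss is zero once the model's similarity gap agrees with the human
    judgment by at least `margin` (an illustrative value, not from the paper).
    """
    ref, alt_a, alt_b = l2_normalize(ref), l2_normalize(alt_a), l2_normalize(alt_b)
    sim_a = np.sum(ref * alt_a, axis=-1)
    sim_b = np.sum(ref * alt_b, axis=-1)
    # Signed gap: positive when the model agrees with the human choice.
    gap = np.where(human_choice == 0, sim_a - sim_b, sim_b - sim_a)
    return np.maximum(0.0, margin - gap).mean()
```

Training the randomly initialized vision encoder against a loss of this shape is what turns human triplet data into an initialization rather than a fine-tuning signal.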
The method consists of two sequential stages: first initializing the vision encoder by training on human similarity judgments from NIGHTS, then performing conventional large-scale contrastive pretraining on 15M image-text pairs from YFCC15M, thereby replacing random initialization with perceptual initialization.
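The second stage is standard CLIP-style contrastive pretraining, in which matched image-text pairs are pulled together and mismatched pairs pushed apart via a symmetric InfoNCE loss. A minimal numpy sketch of that objective, applied on top of the perceptually initialized encoder's outputs, is shown below; the temperature value is an illustrative assumption.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text embedding pairs.

    img_emb, txt_emb: (N, D) arrays; row i of each is a matched pair.
    Matched pairs sit on the diagonal of the similarity matrix and serve
    as the positive class for a softmax over each row (and each column).
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities

    def xent_diagonal(l):
        # Cross-entropy with the matched (diagonal) entry as the target.
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logprob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logprob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

Under the proposed paradigm, the only change to this stage is its starting point: the vision encoder enters contrastive pretraining already shaped by human similarity structure rather than by random weights.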
The authors claim this is the first work to directly integrate supervised human perceptual data into the initialization of vision-language models before web-scale training, distinguishing it from prior work that applied human perceptual alignment only as post-hoc fine-tuning.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[2] Perceptual grouping in vision-language models
Contribution Analysis
Detailed comparisons for each claimed contribution
Perceptual-Initialization paradigm for vision-language models
The authors propose a new training paradigm that integrates human perceptual structure at the initialization stage of model training, rather than applying it as a post-hoc fine-tuning step. This approach uses human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder before large-scale contrastive pretraining.
[3] Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
[23] MENTOR: Human Perception-Guided Pretraining for Increased Generalization
[26] Perceptual Inductive Bias Is What You Need Before Contrastive Learning
[27] Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding
[28] POV Learning: Individual Alignment of Multimodal Models using Human Perception
[29] ScanDMM: A Deep Markov Model of Scanpath Prediction for 360° Images
[30] Perceptual quality assessment for no-reference image via optimization-based meta-learning
[31] Fixating on Attention: Integrating Human Eye Tracking into Vision Transformers
[32] Towards model-based recognition of human movements in image sequences
[33] GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction
Two-stage training pipeline with human perceptual initialization
The method consists of two sequential stages: first initializing the vision encoder by training on human similarity judgments from NIGHTS, then performing conventional large-scale contrastive pretraining on 15M image-text pairs from YFCC15M, thereby replacing random initialization with perceptual initialization.
[37] JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining Based on JND Audio Pairs
[26] Perceptual Inductive Bias Is What You Need Before Contrastive Learning
[36] Liftedcl: Lifting contrastive learning for human-centric perception
First approach using human triplet judgments for vision-language model initialization
The authors claim this is the first work to directly integrate supervised human perceptual data into the initialization of vision-language models before web-scale training, distinguishing it from prior work that applied human perceptual alignment only as post-hoc fine-tuning.