Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

ICLR 2026 Conference Submission
Anonymous Authors
Abstract:

We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By using human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach yields significant zero-shot improvements across 29 zero-shot classification benchmarks and 2 retrieval benchmarks, without any task-specific fine-tuning. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of pretraining depending on dataset characteristics. Our approach consistently improves zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to target domains. These findings challenge the conventional practice of using human-perceptual data primarily for fine-tuning, and demonstrate that embedding human perceptual structure during early representation learning yields more capable, better-aligned vision-language systems that generalize immediately to unseen tasks. Our work shows that "beginning with you", that is, starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 1

Research Landscape Overview

Core task: incorporating human perceptual structure into vision-language model initialization. The field organizes around four main branches that reflect distinct stages and concerns in building perceptually grounded multimodal systems. Perceptual Structure Integration Methods explore how human-like perceptual cues, ranging from grouping principles (Perceptual Grouping VLM[2]) to similarity metrics (Perceptual Similarity Registration[15]), can be embedded during model initialization or training. Vision-Language Model Architectures and Capabilities address the design of multimodal backbones and their representational power, spanning general-purpose frameworks (Qwen2-VL[3], Versatile Multimodal Pretraining[13]) and specialized mechanisms for memory or reasoning (MemVLT[8], RecompGPT[20]). Perceptual Alignment and Quality Assessment focuses on measuring and enforcing correspondence between model outputs and human judgments, including alignment strategies (Aligning Perception Language[1], Aligning Human Cognition[16]) and quality evaluation (VisualCritic[7]). Application Domains and Use Cases demonstrate how perceptual grounding benefits downstream tasks such as urban scene understanding (Urban Scene Perception[17]), creative generation (Draw with Thought[5]), and cross-modal retrieval (Human-CLAP[4]).

Several active lines of work highlight trade-offs between early-stage perceptual embedding and post-hoc alignment. Some studies inject perceptual priors directly at initialization (Perceptual Initialization[0], Perceptual Grouping VLM[2]), aiming to shape the learned representation space from the outset, while others refine alignment through distillation or feedback loops after pretraining (Language Visual Distillation[9], MENTOR[23]). Perceptual Initialization[0] sits squarely within the Initialization-Phase Perceptual Embedding cluster, emphasizing the integration of human perceptual structure before large-scale training begins. This contrasts with approaches like Aligning Perception Language[1] or Aligning Human Cognition[16], which typically adjust pretrained models to better match human judgments. By anchoring perceptual cues early, Perceptual Initialization[0] seeks to close the gap between machine and human vision from the ground up, complementing neighboring work (Perceptual Grouping VLM[2]) that similarly leverages Gestalt-like grouping at the model's foundation.

Claimed Contributions

Perceptual-Initialization paradigm for vision-language models

The authors propose a new training paradigm that integrates human perceptual structure at the initialization stage of model training, rather than applying it as a post-hoc fine-tuning step. This approach uses human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder before self-supervised learning.
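NIGHTS provides two-alternative forced-choice triplets: a reference image and two candidates, with a human vote for the more perceptually similar candidate. The paper does not specify its loss in this report, but a minimal sketch of the kind of margin objective such triplet supervision could drive is below; the function name, margin value, and cosine-distance choice are illustrative assumptions, not the authors' method.

```python
import numpy as np

def perceptual_triplet_loss(ref, preferred, rejected, margin=0.05):
    # Margin loss that pulls the human-preferred candidate closer to the
    # reference than the rejected candidate, in embedding space.
    # Inputs are L2-normalized embedding vectors; margin is illustrative.
    d_pos = 1.0 - float(ref @ preferred)  # cosine distance to preferred image
    d_neg = 1.0 - float(ref @ rejected)   # cosine distance to rejected image
    return max(0.0, d_pos - d_neg + margin)
```

When the encoder already agrees with the human judgment by more than the margin, the loss is zero; otherwise the gradient pushes the embedding space toward the human ordering.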

10 retrieved papers
Two-stage training pipeline with human perceptual initialization

The method consists of two sequential stages: first initializing the vision encoder by training on human similarity judgments from NIGHTS, then performing conventional large-scale contrastive pretraining on 15M image-text pairs from YFCC15M. This converts random initialization into perceptual initialization.
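Stage two is conventional CLIP-style contrastive pretraining, whose symmetric InfoNCE objective treats matched image-text pairs as positives and all other batch pairs as negatives. A minimal numpy sketch, assuming L2-normalized embeddings; the temperature value and function name are illustrative, not taken from the paper.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of L2-normalized image/text embeddings,
    # shapes (N, D); matched pairs sit on the diagonal of the logit matrix.
    logits = img_emb @ txt_emb.T / temperature

    def xent(mat):
        mat = mat - mat.max(axis=1, keepdims=True)  # numerical stability
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))         # diagonal = matched pairs

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Under PI, this stage starts from the perceptually initialized encoder instead of random weights; the objective itself is unchanged.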

3 retrieved papers
Can Refute
First approach using human triplet judgments for vision-language model initialization

The authors claim this is the first work to directly integrate supervised human perceptual data into the initialization of vision-language models before web-scale training, distinguishing it from prior work that applied human perceptual alignment only as post-hoc fine-tuning.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Perceptual-Initialization paradigm for vision-language models

The authors propose a new training paradigm that integrates human perceptual structure at the initialization stage of model training, rather than applying it as a post-hoc fine-tuning step. This approach uses human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder before self-supervised learning.

Contribution

Two-stage training pipeline with human perceptual initialization

The method consists of two sequential stages: first initializing the vision encoder by training on human similarity judgments from NIGHTS, then performing conventional large-scale contrastive pretraining on 15M image-text pairs from YFCC15M. This converts random initialization into perceptual initialization.

Contribution

First approach using human triplet judgments for vision-language model initialization

The authors claim this is the first work to directly integrate supervised human perceptual data into the initialization of vision-language models before web-scale training, distinguishing it from prior work that applied human perceptual alignment only as post-hoc fine-tuning.