Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

ICLR 2026 Conference Submission
Anonymous Authors
Abstract:

We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By using human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach yields significant zero-shot improvements across 29 zero-shot classification benchmarks and 2 retrieval benchmarks, without any task-specific fine-tuning. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of pretraining depending on dataset characteristics. Our approach consistently improves zero-shot top-1 accuracy, top-5 accuracy, and retrieval recall (e.g., R@1, R@5) across these diverse evaluation tasks, without requiring any adaptation to target domains. These findings challenge the conventional practice of using human-perceptual data primarily for fine-tuning, and demonstrate that embedding human perceptual structure during early representation learning yields more capable, better-aligned vision-language systems that generalize immediately to unseen tasks. Our work shows that "beginning with you", that is, starting with human perception, provides a stronger foundation for general-purpose vision-language intelligence.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 1

Research Landscape Overview

Core task: incorporating human perceptual structure into vision-language model initialization. The field organizes around four main branches that reflect distinct stages and concerns in building perceptually grounded multimodal systems. Perceptual Structure Integration Methods explore how human-like perceptual cues, ranging from grouping principles (Perceptual Grouping VLM[2]) to similarity metrics (Perceptual Similarity Registration[15]), can be embedded during model initialization or training. Vision-Language Model Architectures and Capabilities address the design of multimodal backbones and their representational power, spanning general-purpose frameworks (Qwen2-VL[3], Versatile Multimodal Pretraining[13]) and specialized mechanisms for memory or reasoning (MemVLT[8], RecompGPT[20]). Perceptual Alignment and Quality Assessment focuses on measuring and enforcing correspondence between model outputs and human judgments, including alignment strategies (Aligning Perception Language[1], Aligning Human Cognition[16]) and quality evaluation (VisualCritic[7]). Application Domains and Use Cases demonstrate how perceptual grounding benefits downstream tasks such as urban scene understanding (Urban Scene Perception[17]), creative generation (Draw with Thought[5]), and cross-modal retrieval (Human-CLAP[4]).

Several active lines of work highlight trade-offs between early-stage perceptual embedding and post-hoc alignment. Some studies inject perceptual priors directly at initialization (Perceptual Initialization[0], Perceptual Grouping VLM[2]), aiming to shape the learned representation space from the outset, while others refine alignment through distillation or feedback loops after pretraining (Language Visual Distillation[9], MENTOR[23]). Perceptual Initialization[0] sits squarely within the Initialization-Phase Perceptual Embedding cluster, emphasizing the integration of human perceptual structure before large-scale training begins. This contrasts with approaches like Aligning Perception Language[1] or Aligning Human Cognition[16], which typically adjust pretrained models to better match human judgments. By anchoring perceptual cues early, Perceptual Initialization[0] seeks to close the gap between machine and human vision from the ground up, complementing neighboring work (Perceptual Grouping VLM[2]) that similarly leverages Gestalt-like grouping at the model's foundation.

Claimed Contributions

Perceptual-Initialization paradigm for vision-language models

The authors propose a new training paradigm that integrates human perceptual structure at the initialization stage of model training, rather than applying it as a post-hoc fine-tuning step. This approach uses human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder before self-supervised learning.
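NIGHTS provides two-alternative forced-choice triplets: a reference image and two candidates, with a human vote for the more perceptually similar candidate. The paper does not specify its loss in this report, but a minimal sketch of the kind of margin objective such triplet supervision could drive is below; the function name, margin value, and cosine-distance choice are illustrative assumptions, not the authors' method.

```python
import numpy as np

def perceptual_triplet_loss(ref, preferred, rejected, margin=0.05):
    # Margin loss that pulls the human-preferred candidate closer to the
    # reference than the rejected candidate, in embedding space.
    # Inputs are L2-normalized embedding vectors; margin is illustrative.
    d_pos = 1.0 - float(ref @ preferred)  # cosine distance to preferred image
    d_neg = 1.0 - float(ref @ rejected)   # cosine distance to rejected image
    return max(0.0, d_pos - d_neg + margin)
```

When the encoder already agrees with the human judgment by more than the margin, the loss is zero; otherwise the gradient pushes the embedding space toward the human ordering.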

10 retrieved papers
Two-stage training pipeline with human perceptual initialization

The method consists of two sequential stages: first initializing the vision encoder by training on human similarity judgments from NIGHTS, then performing conventional large-scale contrastive pretraining on 15M image-text pairs from YFCC15M. This converts random initialization into perceptual initialization.
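Stage two is conventional CLIP-style contrastive pretraining, whose symmetric InfoNCE objective treats matched image-text pairs as positives and all other batch pairs as negatives. A minimal numpy sketch, assuming L2-normalized embeddings; the temperature value and function name are illustrative, not taken from the paper.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of L2-normalized image/text embeddings,
    # shapes (N, D); matched pairs sit on the diagonal of the logit matrix.
    logits = img_emb @ txt_emb.T / temperature

    def xent(mat):
        mat = mat - mat.max(axis=1, keepdims=True)  # numerical stability
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))         # diagonal = matched pairs

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Under PI, this stage starts from the perceptually initialized encoder instead of random weights; the objective itself is unchanged.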

3 retrieved papers
Can Refute
First approach using human triplet judgments for vision-language model initialization

The authors claim this is the first work to directly integrate supervised human perceptual data into the initialization of vision-language models before web-scale training, distinguishing it from prior work that applied human perceptual alignment only as post-hoc fine-tuning.

2 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Perceptual-Initialization paradigm for vision-language models

The authors propose a new training paradigm that integrates human perceptual structure at the initialization stage of model training, rather than applying it as a post-hoc fine-tuning step. This approach uses human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder before self-supervised learning.

Contribution

Two-stage training pipeline with human perceptual initialization

The method consists of two sequential stages: first initializing the vision encoder by training on human similarity judgments from NIGHTS, then performing conventional large-scale contrastive pretraining on 15M image-text pairs from YFCC15M. This converts random initialization into perceptual initialization.

Contribution

First approach using human triplet judgments for vision-language model initialization

The authors claim this is the first work to directly integrate supervised human perceptual data into the initialization of vision-language models before web-scale training, distinguishing it from prior work that applied human perceptual alignment only as post-hoc fine-tuning.