Abstract:

Large Language Models (LLMs), despite being trained on text alone, surprisingly develop rich visual priors. These priors allow latent visual capabilities to be unlocked for vision tasks with a relatively small amount of multimodal data, and enable models to perform symbolic visual generation tasks without ever having seen an image. Through systematic analysis, we reveal that visual priors—the implicit, emergent knowledge about the visual world acquired during language pre-training—are composed of separable perception and reasoning priors with unique scaling trends and origins. We show that an LLM's latent visual reasoning ability is predominantly developed by pre-training on reasoning-centric data (e.g., code, math, academia) and scales progressively. This reasoning prior acquired from language pre-training is transferable and universally applicable to visual reasoning. In contrast, the perception prior emerges more diffusely from broad corpora, and perception ability is more sensitive to the vision encoder and visual instruction tuning data. In parallel, text describing the visual world proves crucial, though its performance impact saturates rapidly. Leveraging these insights, we propose a data-centric recipe for pre-training vision-aware LLMs and verify it in pre-training at the 1T-token scale. Our findings are grounded in over 100 controlled experiments consuming 500,000 GPU-hours, spanning the full MLLM construction pipeline—from LLM pre-training to visual alignment and supervised multimodal fine-tuning—across five model scales, a wide range of data categories and mixtures, and multiple adaptation setups. Along with our main findings, we also propose and investigate several hypotheses, and introduce a Multi-Level Existence Bench (MLE-Bench) to facilitate future research. Together, this work provides a new way of deliberately cultivating visual priors from language pre-training, paving the way for the next generation of multimodal LLMs.

We recommend visiting our anonymous project page (https://anonymouspaperweb.github.io/lsbs/) for an interactive reading experience.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper investigates how visual priors emerge in language models during text-only pre-training, decomposing them into separable perception and reasoning components with distinct scaling behaviors. It resides in the 'Visual Prior Origins and Scaling' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers across 13 leaf nodes. This leaf sits under 'Visual Prior Formation and Mechanisms', a branch focused on understanding the origins and development of visual knowledge in language models rather than architectural integration or downstream applications.

The taxonomy reveals that neighboring research directions address related but distinct questions. The sibling leaf 'Visual Representation Mapping and Alignment' (four papers) examines how visual features map to language model spaces, while the parallel branch 'Visual-Language Architecture Design' (containing multiple leaves with 14 papers total) focuses on connector modules and fusion strategies. The 'Bias and Hallucination Mitigation' branch (three papers) tackles consequences of language priors dominating visual input. The paper's focus on data sources and scaling trends for visual reasoning versus perception distinguishes it from these architectural and bias-correction perspectives, though it shares conceptual overlap with work examining whether models rely on learned visual features or inherited textual biases.

Among 30 candidates examined through semantic search and citation expansion, none were identified as clearly refuting the paper's three main contributions. For the decomposition of visual priors into perception and reasoning components, 10 candidates were examined with zero refutable matches. Similarly, the identification of data sources for visual priors (10 candidates examined) and the proposed data-centric pre-training recipe (10 candidates examined) each showed no clear prior work providing the same insights. This suggests that within the limited search scope, the specific framing of separable perception versus reasoning priors with distinct scaling laws and data origins appears relatively unexplored, though the analysis does not claim exhaustive coverage of all potentially relevant literature.

The limited search scope (30 candidates from top-K semantic matches) and sparse population of the taxonomy leaf (three papers total) together suggest the work addresses questions that have received less systematic attention in prior literature. However, the analysis cannot rule out relevant work outside the semantic search radius or published in adjacent communities. The absence of refutable candidates reflects the specific scope examined rather than definitive proof of novelty across all possible prior work.
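For readers who want a concrete picture of the workflow summarized above, the following is a minimal sketch of the retrieval-and-refutation loop. The function names and stub implementations are illustrative assumptions; they do not reproduce the actual WisPaper pipeline, whose interface is not documented in this report.

```python
# Minimal sketch of the novelty-check loop described above.
# All names and stubs are hypothetical stand-ins, not the real WisPaper API.
from typing import Callable, Dict, List

def assess_novelty(
    claims: List[str],
    search: Callable[[str, int], List[str]],
    expand: Callable[[List[str]], List[str]],
    refutes: Callable[[str, str], bool],
    top_k: int = 10,
) -> Dict[str, int]:
    """Count refutable candidate papers for each claimed contribution."""
    counts: Dict[str, int] = {}
    for claim in claims:
        candidates = search(claim, top_k)              # top-K semantic search
        candidates = candidates + expand(candidates)   # citation expansion
        counts[claim] = sum(refutes(paper, claim) for paper in candidates)
    return counts

# Toy stand-ins so the sketch runs end to end; a real pipeline would query a
# scholarly search index and use an LLM-based comparison step instead.
dummy_search = lambda claim, k: [f"candidate-{i}" for i in range(k)]
dummy_expand = lambda papers: []
dummy_refutes = lambda paper, claim: False

claims = [
    "Decomposition of visual priors into perception and reasoning components",
    "Identification of data sources for visual priors",
    "Data-centric recipe for vision-aware LLM pre-training",
]
print(assess_novelty(claims, dummy_search, dummy_expand, dummy_refutes))
# Each claim maps to 0 refutable matches, mirroring the counts reported above.
```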

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: visual priors emergence in language models for multimodal learning. This field investigates how large language models develop and leverage visual understanding when integrated with vision encoders, spanning questions about representation formation, architectural choices, and downstream performance. The taxonomy organizes research into five main branches:

- Visual Prior Formation and Mechanisms examines how visual knowledge arises during pretraining and scaling, often exploring whether models rely on learned visual features or inherit biases from text (e.g., Pixels Versus Priors[17], Visual Language Mapping[5]).
- Visual-Language Architecture Design focuses on connector modules, attention schemes, and fusion strategies that bridge modalities (e.g., Chat UniVi[1], CogVLM[16]).
- Bias and Hallucination Mitigation addresses spurious correlations and object hallucinations that emerge when language priors dominate visual input (e.g., Debiasing Multimodal[23], Causal LLaVA[42]).
- Multimodal Task Applications demonstrates capabilities across vision-language tasks such as VQA, grounding, and embodied reasoning (e.g., PaLM-E[7], LISA Segmentation[28]).
- Survey and Benchmark Studies provide systematic reviews and evaluation frameworks (e.g., Multimodal LLM Survey[9], MLLM VQA Survey[14]).

A central tension runs through many branches: whether visual understanding in these models stems primarily from pixel-level perception or from textual priors encoded in pretrained language backbones. Visual Priors Language[0] sits within the Visual Prior Origins and Scaling cluster, directly probing this question by analyzing how scaling and training regimes influence the balance between emergent visual reasoning and inherited linguistic biases. This work contrasts with neighbors like Pixels Versus Priors[17], which empirically disentangles pixel contributions from prior knowledge, and Debiasing Multimodal[23], which tackles the downstream consequences of over-reliance on language shortcuts. Meanwhile, architectural studies such as Visual Language Mapping[5] and Vila Pretraining[4] explore how design choices during alignment and pretraining shape the formation of these priors. Across the taxonomy, open questions persist about optimal training recipes, the role of scale versus data quality, and strategies to ensure models ground predictions in visual evidence rather than statistical correlations.
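For quick reference, the branch structure described above can be written down as a small nested mapping. Leaf membership and paper counts are copied from the overview and assessment text where stated; values left as None are not given in this report.

```python
# Sketch of the taxonomy described above. Paper counts come from the report
# text where stated; None marks counts the report does not provide.
taxonomy = {
    "Visual Prior Formation and Mechanisms": {
        "Visual Prior Origins and Scaling": 3,            # contains the paper under review
        "Visual Representation Mapping and Alignment": 4,
    },
    "Visual-Language Architecture Design": 14,            # several leaves, 14 papers total
    "Bias and Hallucination Mitigation": 3,
    "Multimodal Task Applications": None,
    "Survey and Benchmark Studies": None,
}
TOTAL_PAPERS = 50   # across 13 leaf nodes, per the report
```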

Claimed Contributions

Decomposition of visual priors into perception and reasoning components

The authors demonstrate that visual priors acquired during language pre-training are not monolithic but consist of separable perception and reasoning priors. These components exhibit distinct scaling trends, origins, and dependencies on different data sources and training stages.

10 retrieved papers

Identification of data sources for visual priors

The work reveals that visual reasoning capabilities are primarily developed through pre-training on reasoning-centric data such as code, mathematics, and academic texts, and scale progressively with increased exposure. In contrast, perception abilities emerge more diffusely from diverse corpora and are more sensitive to the vision encoder and visual instruction tuning.

10 retrieved papers

Data-centric recipe for vision-aware LLM pre-training

The authors develop and validate a data mixture recipe that balances reasoning-centric content with visual world descriptions to deliberately cultivate visual priors during language pre-training. This recipe is verified through 1T token scale experiments and demonstrates improved multimodal performance without compromising language proficiency.

10 retrieved papers
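The third contribution names a data-centric pre-training recipe, but this report does not reproduce its exact proportions. The sketch below is therefore purely illustrative: the category names and weights are assumptions chosen only to show the shape such a mixture might take, with reasoning-centric sources emphasized and visual-world descriptions kept modest, reflecting the saturation noted in the abstract.

```python
# Purely illustrative sketch of a "vision-aware" pre-training data mixture.
# Category names and weights are assumptions; the paper's verified 1T-token
# recipe is not reproduced here.
from typing import Dict

def normalize(mixture: Dict[str, float]) -> Dict[str, float]:
    """Rescale sampling weights so they sum to 1 over the corpus categories."""
    total = sum(mixture.values())
    return {name: weight / total for name, weight in mixture.items()}

hypothetical_mixture = normalize({
    "code": 0.25,               # reasoning-centric sources, emphasized because the
    "math": 0.15,               # reasoning prior reportedly grows with exposure
    "academic_text": 0.10,
    "visual_world_text": 0.10,  # descriptions of the visual world; the report notes
                                # their benefit saturates quickly
    "general_web": 0.40,        # broad corpora, from which perception emerges diffusely
})

tokens_per_category = {k: round(v * 1e12) for k, v in hypothetical_mixture.items()}
print(tokens_per_category)      # token budget at the 1T-token scale mentioned above
```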

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Decomposition of visual priors into perception and reasoning components

The authors demonstrate that visual priors acquired during language pre-training are not monolithic but consist of separable perception and reasoning priors. These components exhibit distinct scaling trends, origins, and dependencies on different data sources and training stages.

Contribution

Identification of data sources for visual priors

The work reveals that visual reasoning capabilities are primarily developed through pre-training on reasoning-centric data such as code, mathematics, and academic texts, and scale progressively with increased exposure. In contrast, perception abilities emerge more diffusely from diverse corpora and are more sensitive to the vision encoder and visual instruction tuning.

Contribution

Data-centric recipe for vision-aware LLM pre-training

The authors develop and validate a data mixture recipe that balances reasoning-centric content with visual world descriptions to deliberately cultivate visual priors during language pre-training. This recipe is verified through 1T token scale experiments and demonstrates improved multimodal performance without compromising language proficiency.

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training | Novelty Validation