From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Native Vision-Language Models, Vision-Language Primitive, Holistic Vision-Language Buffer
Abstract:

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, greatly narrowing the gap with top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLM development, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Code and weights will be publicly available to promote further research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes NEO, a family of native vision-language models designed to unify pixel and word representations within a shared semantic space. It sits in the Unified Native Architecture Development leaf, which contains only two papers including this one. This is the sparsest research direction in the taxonomy, suggesting that end-to-end native architectures built from first principles remain relatively underexplored compared to modular approaches. The work explicitly contrasts itself with modular VLMs that connect frozen vision encoders to language models via alignment mechanisms, positioning itself as a fundamental rethinking of architecture rather than an incremental adaptation.

The taxonomy reveals that most related work clusters in neighboring leaves under Model Architecture and Design Principles. The Modular Integration and Alignment leaf contains five papers exploring adapter-based connections between pre-trained components, while Efficient and Lightweight Architectures houses five papers optimizing for reduced computational cost. The scope notes clarify that native architectures like NEO exclude modular approaches and efficiency-focused compression, instead emphasizing integrated vision-language primitives. Nearby branches in Training Methodologies address pre-training strategies and instruction tuning, which likely inform NEO's training paradigm but focus on optimization recipes rather than architectural unification.

Among thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core native VLM primitive and the NEO family architecture, ten candidates each were examined with zero refutations, suggesting these contributions occupy relatively unexplored territory within the limited search scope. However, for the Native Rotary Position Embedding with modality-specific decomposition, ten candidates were examined and one refutable match was found, indicating some prior work on position-encoding adaptations for multimodal contexts. The limited search scale means these findings reflect top-K semantic neighbors rather than exhaustive coverage of the field.

Based on the thirty-candidate search, the work appears to address a sparse research direction with limited direct precedent in unified native architectures. The single refutation for position encoding suggests incremental refinement in that component, while the core architectural contributions show no clear overlap within the examined set. The analysis cannot rule out relevant work outside the top-K semantic matches or in adjacent communities not captured by the search strategy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: native vision-language model construction and training. The field organizes around several major branches that reflect distinct stages and concerns in building multimodal systems. Model Architecture and Design Principles focuses on foundational choices—how to unify visual and linguistic representations, whether through adapter-based integration (e.g., Image as a Foreign[48]) or end-to-end native designs. Training Methodologies and Optimization addresses the recipes and techniques that make large-scale pretraining feasible, including data curation, loss formulations, and efficient optimization strategies (e.g., What matters when building[5]). Capability Enhancement and Specialization explores methods to improve reasoning, instruction-following, and task-specific performance, often through reinforcement learning or preference tuning (e.g., Vision-r1[6], Task preference optimization[28]). Domain-Specific Adaptation and Applications examines how general-purpose models are tailored to specialized contexts such as remote sensing (Geochat[3], Rsgpt[10]) or robotics (RT-2[18]). Finally, Surveys, Benchmarks, and Methodological Guidance provides the evaluative and conceptual infrastructure that helps researchers navigate design trade-offs and measure progress.

Recent work reveals active exploration of architectural unification versus modular composition, with some studies advocating tightly integrated native architectures and others favoring lightweight adapters for flexibility. Training efficiency and scalability remain central themes, as researchers balance model size, data volume, and computational cost (Building and better understanding[1], Efficient multimodal large language[12]).

From Pixels to Words[0] sits within the Unified Native Architecture Development cluster, emphasizing end-to-end integration of vision and language components rather than bolt-on adapters. This positions it closely alongside works like Image as a Foreign[48], which also explores native unification strategies, though From Pixels to Words[0] may differ in its specific architectural choices or training protocols. The broader landscape shows ongoing tension between maximizing performance through large-scale pretraining and achieving practical deployment through efficient, specialized designs.

Claimed Contributions

Native VLM primitive with unified vision-language encoding, alignment, and reasoning

The authors propose a unified primitive architecture that integrates flexible position encoding, multi-head native attention, and native rotary position embeddings to simultaneously handle encoding, alignment, and reasoning across vision and language modalities within a single module, eliminating the need for separate visual encoders.

10 retrieved papers
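To make the "flexible position encoding" aspect of such a primitive concrete, the sketch below assigns 3-D (temporal, height, width) indices to a mixed image-text token stream so that both modalities can flow through one module. The convention used here (text tokens advance only the temporal axis; each image occupies a single temporal step with an H x W grid) is an illustrative assumption, not NEO's documented scheme.

```python
def build_positions(seq):
    """Assign (t, h, w) position indices to a mixed token sequence.

    seq: list whose items are either the string "txt" (one text token) or a
    tuple ("img", H, W) describing an image of H x W patches.

    Assumed convention: a text token advances the temporal index by one; an
    image occupies one temporal step and spreads its patches over the
    height/width axes.
    """
    positions, t = [], 0
    for item in seq:
        if item == "txt":
            positions.append((t, 0, 0))
            t += 1
        else:
            _, H, W = item
            for h in range(H):
                for w in range(W):
                    positions.append((t, h, w))
            t += 1
    return positions
```

Under this convention, a sequence of one text token, a 2x2 image, and another text token yields six positions spanning three temporal steps, so text and patches share a single coordinate system without a separate visual encoder path.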
NEO family of native VLMs with pre-Buffer and post-LLM training paradigm

The authors introduce NEO, a family of native vision-language models that partitions the monolithic backbone into pre-Buffer and post-LLM components during pre-training to enable efficient visual learning from scratch while preserving linguistic knowledge, then merges them for end-to-end training in later stages.

10 retrieved papers
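A minimal sketch of how such a stage schedule could be wired is given below. It assumes (the description suggests but does not state this) that preserving linguistic knowledge means keeping the post-LLM blocks frozen during pre-training; the `Block` class and the split point are hypothetical stand-ins for real transformer layers.

```python
class Block:
    """Stand-in for one transformer block; `trainable` mimics requires_grad."""
    def __init__(self, name, trainable=True):
        self.name = name
        self.trainable = trainable

def split_backbone(blocks, n_buffer):
    """Partition the monolithic backbone: the first n_buffer blocks form the
    pre-Buffer (learns visual perception from scratch); the remainder form
    the post-LLM (carries pre-trained linguistic knowledge)."""
    return blocks[:n_buffer], blocks[n_buffer:]

def set_pretrain_stage(blocks, n_buffer):
    """Pre-training stage: update only the pre-Buffer. Freezing the post-LLM
    is an assumption about how linguistic knowledge is preserved."""
    pre_buffer, post_llm = split_backbone(blocks, n_buffer)
    for b in pre_buffer:
        b.trainable = True
    for b in post_llm:
        b.trainable = False

def set_end_to_end_stage(blocks):
    """Later stage: merge the two parts and train the whole backbone."""
    for b in blocks:
        b.trainable = True
```

In a real implementation the same schedule would toggle `requires_grad` on parameter groups and adjust the optimizer accordingly; the point here is only the two-stage partition-then-merge structure.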
Native Rotary Position Embedding (Native-RoPE) with modality-specific decomposition

The authors develop a novel rotary position embedding scheme that fully decomposes channel and frequency allocation across temporal, height, and width dimensions with distinct base frequencies, enabling effective modeling of multi-dimensional spatial-temporal relationships while maintaining compatibility with pre-trained language models.

10 retrieved papers; 1 can refute
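The channel-and-frequency decomposition described above can be sketched as follows: each axis (temporal, height, width) owns a disjoint slice of the head's rotary channels and uses its own base frequency. The split (16, 24, 24) over a 64-channel head and the base values are illustrative placeholders, not NEO's reported configuration.

```python
import numpy as np

def rope_angles(pos, dim, base):
    """Rotary angles for one axis: pos * base**(-2i/dim), i in [0, dim/2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq  # shape (dim // 2,)

def native_rope_3d(t, h, w, head_dim=64,
                   dims=(16, 24, 24),               # hypothetical channel split
                   bases=(10000.0, 100.0, 100.0)):  # hypothetical per-axis bases
    """Sketch of a modality-decomposed 3-D RoPE: concatenate per-axis angle
    tables computed with distinct base frequencies for (t, h, w)."""
    assert sum(dims) == head_dim
    angles = np.concatenate([
        rope_angles(t, dims[0], bases[0]),
        rope_angles(h, dims[1], bases[1]),
        rope_angles(w, dims[2], bases[2]),
    ])  # shape (head_dim // 2,)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate interleaved channel pairs (x0, x1) by the per-pair angles."""
    x0, x1 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x0 * cos - x1 * sin
    out[1::2] = x0 * sin + x1 * cos
    return out
```

Because each pair is a pure rotation, the embedding preserves vector norms, and at position (0, 0, 0) it reduces to the identity, which is one way such a scheme can stay compatible with a pre-trained language model's 1-D RoPE behavior on text-only inputs.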

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Native VLM primitive with unified vision-language encoding, alignment, and reasoning

The authors propose a unified primitive architecture that integrates flexible position encoding, multi-head native attention, and native rotary position embeddings to simultaneously handle encoding, alignment, and reasoning across vision and language modalities within a single module, eliminating the need for separate visual encoders.

Contribution

NEO family of native VLMs with pre-Buffer and post-LLM training paradigm

The authors introduce NEO, a family of native vision-language models that partitions the monolithic backbone into pre-Buffer and post-LLM components during pre-training to enable efficient visual learning from scratch while preserving linguistic knowledge, then merges them for end-to-end training in later stages.

Contribution

Native Rotary Position Embedding (Native-RoPE) with modality-specific decomposition

The authors develop a novel rotary position embedding scheme that fully decomposes channel and frequency allocation across temporal, height, and width dimensions with distinct base frequencies, enabling effective modeling of multi-dimensional spatial-temporal relationships while maintaining compatibility with pre-trained language models.