From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Native Vision-Language Models, Vision-Language Primitive, Holistic Vision-Language Buffer
Abstract:

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (1) What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome? (2) How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for constructing native VLMs. Specifically, one native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of formerly separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of native VLMs built from first principles, greatly narrowing the gap with top-tier modular counterparts across diverse real-world scenarios. With only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense and monolithic model crafted from our elaborate primitives. We position NEO as a cornerstone for scalable and powerful native VLM development, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Code and weights will be publicly available to promote further research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes NEO, a family of native vision-language models designed to unify pixel and word representations within a shared semantic space. It sits in the Unified Native Architecture Development leaf, which contains only two papers including this one. This is the sparsest research direction in the taxonomy, suggesting that end-to-end native architectures built from first principles remain relatively underexplored compared to modular approaches. The work explicitly contrasts itself with modular VLMs that connect frozen vision encoders to language models via alignment mechanisms, positioning itself as a fundamental rethinking of architecture rather than an incremental adaptation.

The taxonomy reveals that most related work clusters in neighboring leaves under Model Architecture and Design Principles. The Modular Integration and Alignment leaf contains five papers exploring adapter-based connections between pre-trained components, while Efficient and Lightweight Architectures houses five papers optimizing for reduced computational cost. The scope notes clarify that native architectures like NEO exclude modular approaches and efficiency-focused compression, instead emphasizing integrated vision-language primitives. Nearby branches in Training Methodologies address pre-training strategies and instruction tuning, which likely inform NEO's training paradigm but focus on optimization recipes rather than architectural unification.

Among thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core native VLM primitive and the NEO family architecture, ten candidates each were examined with zero refutations, suggesting these contributions occupy relatively unexplored territory within the limited search scope. However, for the Native Rotary Position Embedding with modality-specific decomposition, ten candidates were examined and one refutable match was found, indicating some prior work on position-encoding adaptations for multimodal contexts. The limited search scale means these findings reflect top-K semantic neighbors rather than exhaustive coverage of the field.

Based on the thirty-candidate search, the work appears to address a sparse research direction with limited direct precedent in unified native architectures. The single refutation for position encoding suggests incremental refinement in that component, while the core architectural contributions show no clear overlap within the examined set. The analysis cannot rule out relevant work outside the top-K semantic matches or in adjacent communities not captured by the search strategy.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: native vision-language model construction and training. The field organizes around several major branches that reflect distinct stages and concerns in building multimodal systems. Model Architecture and Design Principles focuses on foundational choices—how to unify visual and linguistic representations, whether through adapter-based integration (e.g., Image as a Foreign[48]) or end-to-end native designs. Training Methodologies and Optimization addresses the recipes and techniques that make large-scale pretraining feasible, including data curation, loss formulations, and efficient optimization strategies (e.g., What matters when building[5]). Capability Enhancement and Specialization explores methods to improve reasoning, instruction-following, and task-specific performance, often through reinforcement learning or preference tuning (e.g., Vision-r1[6], Task preference optimization[28]). Domain-Specific Adaptation and Applications examines how general-purpose models are tailored to specialized contexts such as remote sensing (Geochat[3], Rsgpt[10]) or robotics (RT-2[18]). Finally, Surveys, Benchmarks, and Methodological Guidance provides the evaluative and conceptual infrastructure that helps researchers navigate design trade-offs and measure progress.

Recent work reveals active exploration of architectural unification versus modular composition, with some studies advocating tightly integrated native architectures and others favoring lightweight adapters for flexibility. Training efficiency and scalability remain central themes, as researchers balance model size, data volume, and computational cost (Building and better understanding[1], Efficient multimodal large language[12]).

From Pixels to Words[0] sits within the Unified Native Architecture Development cluster, emphasizing end-to-end integration of vision and language components rather than bolt-on adapters. This positions it closely alongside works like Image as a Foreign[48], which also explores native unification strategies, though From Pixels to Words[0] may differ in its specific architectural choices or training protocols. The broader landscape shows ongoing tension between maximizing performance through large-scale pretraining and achieving practical deployment through efficient, specialized designs.

Claimed Contributions

Native VLM primitive with unified vision-language encoding, alignment, and reasoning

The authors propose a unified primitive architecture that integrates flexible position encoding, multi-head native attention, and native rotary position embeddings to simultaneously handle encoding, alignment, and reasoning across vision and language modalities within a single module, eliminating the need for separate visual encoders.

10 retrieved papers
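To make the "flexible position encoding" aspect of such a primitive concrete, the sketch below assigns 3-D (temporal, height, width) indices to a mixed image-text token stream so that both modalities can flow through one module. The convention used here (text tokens advance only the temporal axis; each image occupies a single temporal step with an H x W grid) is an illustrative assumption, not NEO's documented scheme.

```python
def build_positions(seq):
    """Assign (t, h, w) position indices to a mixed token sequence.

    seq: list whose items are either the string "txt" (one text token) or a
    tuple ("img", H, W) describing an image of H x W patches.

    Assumed convention: a text token advances the temporal index by one; an
    image occupies one temporal step and spreads its patches over the
    height/width axes.
    """
    positions, t = [], 0
    for item in seq:
        if item == "txt":
            positions.append((t, 0, 0))
            t += 1
        else:
            _, H, W = item
            for h in range(H):
                for w in range(W):
                    positions.append((t, h, w))
            t += 1
    return positions
```

Under this convention, a sequence of one text token, a 2x2 image, and another text token yields six positions spanning three temporal steps, so text and patches share a single coordinate system without a separate visual encoder path.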
NEO family of native VLMs with pre-Buffer and post-LLM training paradigm

The authors introduce NEO, a family of native vision-language models that partitions the monolithic backbone into pre-Buffer and post-LLM components during pre-training to enable efficient visual learning from scratch while preserving linguistic knowledge, then merges them for end-to-end training in later stages.

10 retrieved papers
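A minimal sketch of how such a stage schedule could be wired is given below. It assumes (the description suggests but does not state this) that preserving linguistic knowledge means keeping the post-LLM blocks frozen during pre-training; the `Block` class and the split point are hypothetical stand-ins for real transformer layers.

```python
class Block:
    """Stand-in for one transformer block; `trainable` mimics requires_grad."""
    def __init__(self, name, trainable=True):
        self.name = name
        self.trainable = trainable

def split_backbone(blocks, n_buffer):
    """Partition the monolithic backbone: the first n_buffer blocks form the
    pre-Buffer (learns visual perception from scratch); the remainder form
    the post-LLM (carries pre-trained linguistic knowledge)."""
    return blocks[:n_buffer], blocks[n_buffer:]

def set_pretrain_stage(blocks, n_buffer):
    """Pre-training stage: update only the pre-Buffer. Freezing the post-LLM
    is an assumption about how linguistic knowledge is preserved."""
    pre_buffer, post_llm = split_backbone(blocks, n_buffer)
    for b in pre_buffer:
        b.trainable = True
    for b in post_llm:
        b.trainable = False

def set_end_to_end_stage(blocks):
    """Later stage: merge the two parts and train the whole backbone."""
    for b in blocks:
        b.trainable = True
```

In a real implementation the same schedule would toggle `requires_grad` on parameter groups and adjust the optimizer accordingly; the point here is only the two-stage partition-then-merge structure.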
Native Rotary Position Embedding (Native-RoPE) with modality-specific decomposition

The authors develop a novel rotary position embedding scheme that fully decomposes channel and frequency allocation across temporal, height, and width dimensions with distinct base frequencies, enabling effective modeling of multi-dimensional spatial-temporal relationships while maintaining compatibility with pre-trained language models.

10 retrieved papers; 1 can refute
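The channel-and-frequency decomposition described above can be sketched as follows: each axis (temporal, height, width) owns a disjoint slice of the head's rotary channels and uses its own base frequency. The split (16, 24, 24) over a 64-channel head and the base values are illustrative placeholders, not NEO's reported configuration.

```python
import numpy as np

def rope_angles(pos, dim, base):
    """Rotary angles for one axis: pos * base**(-2i/dim), i in [0, dim/2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq  # shape (dim // 2,)

def native_rope_3d(t, h, w, head_dim=64,
                   dims=(16, 24, 24),               # hypothetical channel split
                   bases=(10000.0, 100.0, 100.0)):  # hypothetical per-axis bases
    """Sketch of a modality-decomposed 3-D RoPE: concatenate per-axis angle
    tables computed with distinct base frequencies for (t, h, w)."""
    assert sum(dims) == head_dim
    angles = np.concatenate([
        rope_angles(t, dims[0], bases[0]),
        rope_angles(h, dims[1], bases[1]),
        rope_angles(w, dims[2], bases[2]),
    ])  # shape (head_dim // 2,)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate interleaved channel pairs (x0, x1) by the per-pair angles."""
    x0, x1 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x0 * cos - x1 * sin
    out[1::2] = x0 * sin + x1 * cos
    return out
```

Because each pair is a pure rotation, the embedding preserves vector norms, and at position (0, 0, 0) it reduces to the identity, which is one way such a scheme can stay compatible with a pre-trained language model's 1-D RoPE behavior on text-only inputs.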

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Native VLM primitive with unified vision-language encoding, alignment, and reasoning

The authors propose a unified primitive architecture that integrates flexible position encoding, multi-head native attention, and native rotary position embeddings to simultaneously handle encoding, alignment, and reasoning across vision and language modalities within a single module, eliminating the need for separate visual encoders.

Contribution

NEO family of native VLMs with pre-Buffer and post-LLM training paradigm

The authors introduce NEO, a family of native vision-language models that partitions the monolithic backbone into pre-Buffer and post-LLM components during pre-training to enable efficient visual learning from scratch while preserving linguistic knowledge, then merges them for end-to-end training in later stages.

Contribution

Native Rotary Position Embedding (Native-RoPE) with modality-specific decomposition

The authors develop a novel rotary position embedding scheme that fully decomposes channel and frequency allocation across temporal, height, and width dimensions with distinct base frequencies, enabling effective modeling of multi-dimensional spatial-temporal relationships while maintaining compatibility with pre-trained language models.