From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Overview
Overall Novelty Assessment
The paper proposes NEO, a family of native vision-language models designed to unify pixel and word representations within a shared semantic space. It sits in the Unified Native Architecture Development leaf, which contains only two papers including this one. This is the sparsest research direction in the taxonomy, suggesting that end-to-end native architectures built from first principles remain relatively underexplored compared to modular approaches. The work explicitly contrasts itself with modular VLMs that connect frozen vision encoders to language models via alignment mechanisms, positioning itself as a fundamental rethinking of architecture rather than an incremental adaptation.
The taxonomy reveals that most related work clusters in neighboring leaves under Model Architecture and Design Principles. The Modular Integration and Alignment leaf contains five papers exploring adapter-based connections between pre-trained components, while Efficient and Lightweight Architectures houses five papers optimizing for reduced computational cost. The scope notes clarify that native architectures like NEO exclude modular approaches and efficiency-focused compression, instead emphasizing integrated vision-language primitives. Nearby branches in Training Methodologies address pre-training strategies and instruction tuning, which likely inform NEO's training paradigm but focus on optimization recipes rather than architectural unification.
Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For both the core native VLM primitive and the NEO family architecture, ten candidates were examined with zero refutations, suggesting these contributions occupy relatively unexplored territory within the limited search scope. For the Native Rotary Position Embedding with modality-specific decomposition, however, one of the ten candidates examined was a refutable match, indicating some prior work on position-encoding adaptations for multimodal contexts. Because the search is limited in scale, these findings reflect top-K semantic neighbors rather than exhaustive coverage of the field.
Based on the thirty-candidate search, the work appears to address a sparse research direction with limited direct precedent in unified native architectures. The single refutation for position encoding suggests incremental refinement in that component, while the core architectural contributions show no clear overlap within the examined set. The analysis cannot rule out relevant work outside the top-K semantic matches or in adjacent communities not captured by the search strategy.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a unified primitive architecture that integrates flexible position encoding, multi-head native attention, and native rotary position embeddings to simultaneously handle encoding, alignment, and reasoning across vision and language modalities within a single module, eliminating the need for separate visual encoders.
The authors introduce NEO, a family of native vision-language models that partitions the monolithic backbone into pre-Buffer and post-LLM components during pre-training to enable efficient visual learning from scratch while preserving linguistic knowledge, then merges them for end-to-end training in later stages.
The authors develop a novel rotary position embedding scheme that fully decomposes channel and frequency allocation across temporal, height, and width dimensions with distinct base frequencies, enabling effective modeling of multi-dimensional spatial-temporal relationships while maintaining compatibility with pre-trained language models.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[48] Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks
Contribution Analysis
Detailed comparisons for each claimed contribution
Native VLM primitive with unified vision-language encoding, alignment, and reasoning
The authors propose a unified primitive architecture that integrates flexible position encoding, multi-head native attention, and native rotary position embeddings to simultaneously handle encoding, alignment, and reasoning across vision and language modalities within a single module, eliminating the need for separate visual encoders.
[60] Align before fuse: Vision and language representation learning with momentum distillation
[61] Scaling up visual and vision-language representation learning with noisy text supervision
[62] Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
[63] UNITER: Universal image-text representation learning
[64] Prompting large vision-language models for compositional reasoning
[65] Gramian Multimodal Representation Learning and Alignment
[66] SpatialRGPT: Grounded spatial reasoning in vision-language models
[67] LLaVA-ST: A multimodal large language model for fine-grained spatial-temporal understanding
[68] FG-CLIP: Fine-Grained Visual and Textual Alignment
[69] A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges
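The architectural contrast underlying this contribution can be sketched in a few lines. The snippet below is an illustrative toy, not NEO's code: `embed_patch`, `embed_word`, and `shared_block` are hypothetical stand-ins showing how a native primitive consumes interleaved patch and word embeddings through one shared stack, whereas modular VLMs route pixels through a separate frozen encoder and adapter first.

```python
# Hypothetical sketch: a single "native primitive" path for both modalities.
# All names and operations are illustrative stand-ins, not NEO's actual API.

def embed_patch(patch):
    # Pixels are projected directly into the shared space; no frozen ViT.
    return [v / 255.0 for v in patch]

def embed_word(token_id, dim=4):
    # Words land in the same shared space (a toy one-hot embedding here).
    return [float(token_id == i) for i in range(dim)]

def shared_block(x):
    # One primitive handles both modalities; mean-centering is a toy
    # mixing step standing in for multi-head native attention.
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def native_forward(sequence):
    # Encoding, alignment, and reasoning happen in the same module: every
    # element, pixel-born or word-born, passes through identical blocks.
    return [shared_block(e) for e in sequence]

mixed = [embed_patch([0, 128, 255, 64]), embed_word(2)]
out = native_forward(mixed)
```

The point of the sketch is the single forward path: nothing in `native_forward` branches on modality, which is what "eliminating the need for separate visual encoders" amounts to architecturally.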
NEO family of native VLMs with pre-Buffer and post-LLM training paradigm
The authors introduce NEO, a family of native vision-language models that partitions the monolithic backbone into pre-Buffer and post-LLM components during pre-training to enable efficient visual learning from scratch while preserving linguistic knowledge, then merges them for end-to-end training in later stages.
[36] VisionLLM v2: An end-to-end generalist multimodal large language model for hundreds of vision-language tasks
[51] EmbodiedGPT: Vision-language pre-training via embodied chain of thought
[52] Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
[53] BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
[54] PEIT: Bridging the modality gap with pre-trained models for end-to-end image translation
[55] Unified multi-modal diagnostic framework with reconstruction pre-training and heterogeneity-combat tuning
[56] RecogDrive: A reinforced cognitive framework for end-to-end autonomous driving
[57] Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
[58] AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
[59] An Effective End-to-End Solution for Multimodal Action Recognition
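The staged schedule described above can be sketched as freeze-then-merge bookkeeping. This is a minimal toy, assuming a block-level `trainable` flag; the names (`pre_buffer`, `post_llm`, block counts) are illustrative, not NEO's actual implementation.

```python
# Hedged sketch of the two-stage paradigm: partition the backbone into a
# trainable pre-Buffer and a frozen post-LLM portion for pre-training,
# then merge and unfreeze everything for end-to-end training.

class Block:
    def __init__(self, name):
        self.name = name
        self.trainable = True

def make_backbone(n_buffer=4, n_llm=8):
    pre_buffer = [Block(f"buffer_{i}") for i in range(n_buffer)]
    post_llm = [Block(f"llm_{i}") for i in range(n_llm)]
    return pre_buffer, post_llm

def pretrain_stage(pre_buffer, post_llm):
    # Visual primitives are learned from scratch in the pre-Buffer while
    # the language blocks stay frozen, preserving linguistic knowledge.
    for b in post_llm:
        b.trainable = False
    return [b.name for b in pre_buffer + post_llm if b.trainable]

def merge_stage(pre_buffer, post_llm):
    # Later stages merge the partitions back into a monolithic backbone
    # and unfreeze all blocks for end-to-end training.
    backbone = pre_buffer + post_llm
    for b in backbone:
        b.trainable = True
    return backbone

pre, post = make_backbone()
trained_first = pretrain_stage(pre, post)  # only buffer blocks update
backbone = merge_stage(pre, post)          # 12 blocks, all trainable
```

The design choice the sketch isolates is that freezing is temporary and structural: the partition exists only during pre-training, so the merged model retains a single monolithic parameterization afterwards.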
Native Rotary Position Embedding (Native-RoPE) with modality-specific decomposition
The authors develop a novel rotary position embedding scheme that fully decomposes channel and frequency allocation across temporal, height, and width dimensions with distinct base frequencies, enabling effective modeling of multi-dimensional spatial-temporal relationships while maintaining compatibility with pre-trained language models.
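The decomposition described above can be made concrete with a short sketch. This is a hedged illustration, not NEO's Native-RoPE: the channel split, the per-axis base frequencies, and the function name are assumptions chosen only to show the mechanism of allocating rotation pairs across temporal, height, and width axes.

```python
import math

def native_rope_angles(pos, head_dim=64, bases=(1_000_000.0, 10_000.0, 10_000.0)):
    """Hypothetical sketch of a fully decomposed 3-D rotary embedding.

    `pos` is a (t, h, w) index triple. The head's rotation pairs are split
    into three contiguous groups, one per axis, each with its own base
    frequency (the values here are illustrative, not NEO's). Returns one
    rotation angle per channel pair.
    """
    n_pairs = head_dim // 2
    split = n_pairs // 3
    groups = [split, split, n_pairs - 2 * split]  # temporal, height, width
    angles = []
    for axis, (count, base) in enumerate(zip(groups, bases)):
        for i in range(count):
            inv_freq = base ** (-2.0 * i / head_dim)
            angles.append(pos[axis] * inv_freq)
    return angles

# A text token advances only the temporal index (h = w = 0), so its
# height/width angles vanish and the scheme degenerates to ordinary 1-D
# RoPE -- one way such a design can stay compatible with a pre-trained LLM.
text_angles = native_rope_angles((5, 0, 0))
patch_angles = native_rope_angles((5, 3, 7))
```

Note how compatibility falls out of the decomposition rather than being bolted on: because each axis owns disjoint channels, zeroing the spatial indices leaves the temporal channels rotating exactly as in a text-only model.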