Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs
Overview
Overall Novelty Assessment
The paper proposes a context-driven paradigm for biomolecular understanding in scientific LLMs, arguing that high-level structured context from bioinformatics tools outperforms raw sequence inputs. It resides in the 'Tokenization and Representation Evaluation' leaf under 'Benchmarking and Evaluation Methodologies', alongside one sibling paper examining sequence representation methods. This leaf is relatively sparse within a taxonomy of 50 papers across 36 topics, suggesting the specific focus on tokenization strategies and representation quality remains an emerging area of systematic investigation.
The taxonomy reveals neighboring leaves addressing complementary concerns: 'Comprehensive Multi-Task Benchmarking' evaluates LLM performance across diverse biomolecular tasks, while 'Task-Specific Performance Assessment' examines domain-specific capabilities. The paper's emphasis on input representation connects to architectural branches like 'Multimodal Protein-Language Integration' and 'Sequence-to-Text Translation', which explore how models encode and interpret biomolecular data. Its diagnostic stance on the tokenization dilemma bridges evaluation methodology and architectural design, positioning it at the intersection of how models are assessed and how they process sequences.
Among 27 candidates examined through limited semantic search, none clearly refuted the three core contributions: 10 candidates were examined for the context-driven paradigm, 10 for the tokenization dilemma formalization, and 7 for the systematic input-mode comparison, with no refutations in any group. This suggests that, within the search scope, the framing of tokenization as a fundamental bottleneck and the empirical finding that context-only inputs outperform sequence-plus-context combinations remain relatively unexplored. However, the limited search scale means the broader literature may contain related insights not captured here.
The analysis indicates the work addresses a recognized but under-investigated question within biomolecular LLM evaluation. The sparse population of its taxonomy leaf and absence of refuting candidates among 27 examined papers suggest the specific empirical findings and paradigm shift may offer fresh perspective. However, the limited search scope and the paper's position within a broader ecosystem of representation and benchmarking studies warrant cautious interpretation of its novelty claims relative to the full literature landscape.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a new paradigm that bypasses direct sequence interpretation by providing LLMs with high-level textual context generated from bioinformatics tools (e.g., BLAST, Pfam, InterProScan). This approach avoids the tokenization dilemma by leveraging structured, human-readable knowledge that is natively aligned with the LLM's linguistic domain.
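The pipeline this paradigm implies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the annotation entries below are hypothetical stand-ins for parsed BLAST, Pfam, and InterProScan results, and `build_context_prompt` is an assumed helper name.

```python
# Minimal sketch of the context-driven paradigm: rather than feeding the raw
# sequence to an LLM, render structured bioinformatics tool outputs as a
# human-readable context block. The annotation values are hypothetical
# stand-ins for parsed BLAST / Pfam / InterProScan hits.

def build_context_prompt(annotations: dict, question: str) -> str:
    """Render structured tool outputs as plain-text context for an LLM."""
    lines = []
    for tool, hits in annotations.items():
        lines.append(f"[{tool}]")
        for hit in hits:
            lines.append(f"  - {hit}")
    context = "\n".join(lines)
    return f"Context:\n{context}\n\nQuestion: {question}"

annotations = {
    "BLAST": ["best hit: P69905 (hemoglobin subunit alpha), 98% identity"],
    "Pfam": ["PF00042 (Globin domain), positions 27-140"],
    "InterProScan": ["IPR000971 (Globin), oxygen transport"],
}
prompt = build_context_prompt(
    annotations, "What is the likely function of this protein?"
)
print(prompt.splitlines()[0])  # -> Context:
```

Because the prompt consists entirely of natural-language annotations, it stays inside the LLM's native linguistic domain and never requires the model to decode sequence tokens.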
The authors identify and formalize a fundamental challenge in existing Sci-LLMs: the tokenization dilemma. This encompasses two problems—weak representation from granular tokenization that destroys functional motifs, and semantic misalignment when bridging biological and linguistic spaces in multimodal approaches.
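The weak-representation half of the dilemma can be demonstrated with a toy example (ours, not the paper's): fixed, non-overlapping k-mer tokenization, a common choice in genomic language models, can split a functional motif across token boundaries so that no single token preserves it. The sequence and motif below are illustrative only.

```python
# Toy illustration of the "weak representation" problem: granular k-mer
# tokenization fragments a functional motif across token boundaries.

def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mer tokenization of a biological sequence."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

# A zinc-binding-style motif embedded in a toy protein sequence.
motif = "HELGH"
seq = "MKTA" + motif + "VLSD"
tokens = kmer_tokenize(seq)
print(tokens)  # -> ['MKT', 'AHE', 'LGH', 'VLS', 'D']

# The motif exists in the raw sequence but in no single token.
assert motif in seq
assert all(motif not in tok for tok in tokens)
```

A model that only ever sees `'AHE'` and `'LGH'` as atomic units must reassemble the motif from fragments, which is precisely the representational burden the dilemma describes.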
The authors conduct a systematic empirical study comparing three input configurations (sequence-only, context-only, and combined) across multiple state-of-the-art Sci-LLMs on biological reasoning tasks. Their findings demonstrate that context-only consistently outperforms other modes and that raw sequences act as informational noise.
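A harness for such a comparison might be structured as below. This is a schematic sketch under assumed names: `query_model` is a hypothetical stand-in for calling an actual Sci-LLM and is stubbed here so the example runs end-to-end.

```python
# Schematic sketch of the three input configurations compared in the study:
# sequence-only, context-only, and combined.

def make_prompt(mode: str, sequence: str, context: str, question: str) -> str:
    """Assemble a prompt containing only the parts the given mode allows."""
    parts = []
    if mode in ("sequence-only", "combined"):
        parts.append(f"Sequence: {sequence}")
    if mode in ("context-only", "combined"):
        parts.append(f"Context: {context}")
    parts.append(f"Question: {question}")
    return "\n".join(parts)

def query_model(prompt: str) -> str:
    # Stub: a real harness would call the Sci-LLM's API here.
    return "stub answer"

modes = ["sequence-only", "context-only", "combined"]
prompts = {
    m: make_prompt(m, "MKTAHELGHVLSD",
                   "Pfam: PF00042 Globin domain",
                   "Predict the protein's function.")
    for m in modes
}

assert "Sequence:" not in prompts["context-only"]
assert "Context:" not in prompts["sequence-only"]
assert "Sequence:" in prompts["combined"] and "Context:" in prompts["combined"]
```

Holding the question fixed while varying only the input mode is what lets the study attribute performance differences to the representation itself, including the finding that adding the raw sequence to context can degrade accuracy.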
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[34] Biological Sequence Representation Methods and Recent Advances: A Review
Contribution Analysis
Detailed comparisons for each claimed contribution
Context-driven paradigm for biomolecular understanding in Sci-LLMs
The authors propose a new paradigm that bypasses direct sequence interpretation by providing LLMs with high-level textual context generated from bioinformatics tools (e.g., BLAST, Pfam, InterProScan). This approach avoids the tokenization dilemma by leveraging structured, human-readable knowledge that is natively aligned with the LLM's linguistic domain.
[51] The nf-core framework for community-curated bioinformatics pipelines
[52] The Bio3D packages for structural bioinformatics
[53] Unveiling the dynamic role of bioinformatics in automation for efficient and accurate data processing and interpretation
[54] Advances in Structural Bioinformatics
[55] Context-driven interaction retrieval and classification for modeling, curation, and reuse
[56] FTDMP: A Framework for Protein-Protein, Protein-DNA, and Protein-RNA Docking and Scoring.
[57] Developing Tools for Structural Bioinformatics: from Python to Bedside
[58] Bioinformatics in the Age of Big Data: Leveraging Computational Tools for Biological Discoveries
[59] Deciphering the omicron variant: integrated omics analysis reveals critical biomarkers and pathophysiological pathways
[60] MOSCA 2.0: A bioinformatics framework for metagenomics, metatranscriptomics and metaproteomics data analysis and visualization
Identification and formalization of the tokenization dilemma
The authors identify and formalize a fundamental challenge in existing Sci-LLMs: the tokenization dilemma. This encompasses two problems—weak representation from granular tokenization that destroys functional motifs, and semantic misalignment when bridging biological and linguistic spaces in multimodal approaches.
[4] Transformers and genome language models
[68] Geometry Informed Tokenization of Molecules for Language Model Generation
[69] A comparison of tokenization impact in attention based and state space genomic language models
[70] Linguistically inspired roadmap for building biologically reliable protein language models
[71] The impact of tokenizer selection in genomic language models
[72] Bilingual language model for protein sequence and structure
[73] Language modeling techniques for biological sequence processing
[74] DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome
[75] evobpe: Evolutionary protein sequence tokenization
[76] GENA-LM: a family of open-source foundational DNA language models for long sequences
Systematic empirical comparison of input modes for Sci-LLMs
The authors conduct a systematic empirical study comparing three input configurations (sequence-only, context-only, and combined) across multiple state-of-the-art Sci-LLMs on biological reasoning tasks. Their findings demonstrate that context-only consistently outperforms other modes and that raw sequences act as informational noise.