The Human Genomics Long-Range Benchmark: Advancing DNA Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Language Models, DNA, DNA LMs, Benchmark
Abstract:

The advent of language models (LMs) in genomics necessitates benchmarks that can assess models’ capabilities and limitations. In contrast to protein models, DNA LMs can be used to study non-coding regions of the genome and must account for unique challenges, especially interactions across long sequence lengths. However, existing benchmarks for DNA LMs are defined over short sequence datasets and can involve tasks that are not considered biologically meaningful. Here, we present the Human Genomics Long-Range Benchmark (LRB), which focuses on biologically meaningful tasks and supports long-range contexts. We complement our benchmark with fine-tuning recipes that meaningfully improve performance. We evaluate DNA LMs across nine compiled human genome tasks and observe that they achieve competitive performance relative to supervised baselines on several tasks (e.g., genome annotation), but a significant gap remains in other domains, such as variant effect and gene expression prediction. Additionally, we introduce a visualization tool to examine model performance split by genomic properties.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a benchmark suite focused on long-range genomic tasks for DNA language models, emphasizing biologically meaningful evaluations across nine human genome tasks. It resides in the 'Long-Range Genomic Task Benchmarks' leaf, which contains four papers total including this work. This represents a moderately populated research direction within the broader benchmarking landscape, suggesting active but not overcrowded interest in evaluating models on extended genomic contexts that require capturing dependencies across thousands to millions of base pairs.

The taxonomy reveals neighboring evaluation frameworks with distinct emphases: 'General Benchmark Suites' (three papers) cover diverse tasks without long-range focus, while 'Regulatory DNA Benchmarks' (two papers) target chromatin accessibility and transcription factor binding. The original work bridges these by selecting biologically meaningful long-range tasks rather than comprehensive short-context coverage. Its sibling papers in the same leaf (DNALongBench, DART-Eval, and one other) share the long-range evaluation goal but may differ in task selection, species focus, or evaluation protocols, positioning this work within an emerging subfield addressing context-length challenges.

Among thirty candidates examined, none clearly refuted the three main contributions: the benchmark suite itself, fine-tuning recipes, and the visualization tool. For each contribution, ten candidates were reviewed with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of human-focused long-range tasks, accompanying fine-tuning strategies, and genomic property visualization appears relatively distinct. However, the analysis explicitly covers top-K semantic matches and citation expansion, not an exhaustive literature review, leaving open the possibility of unexamined overlapping work.

Given the limited search scope and the moderately populated taxonomy leaf, the work appears to offer a focused contribution to long-range genomic benchmarking. The absence of refutable candidates among thirty examined suggests novelty in the specific task compilation and methodological recipes, though the broader concept of long-range DNA model evaluation is shared with sibling papers. The analysis does not capture potential overlaps outside the top-thirty semantic matches or recent preprints.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Benchmarking DNA language models on long-range genomic tasks.

The field of DNA language modeling has matured into a structured landscape with several distinct branches. DNA Language Model Architectures and Pre-training encompasses foundational models such as HyenaDNA[4], GENA-LM[2], and Evo[16], which explore diverse architectural choices from transformers to state space models for capturing genomic sequences at scale. Benchmarking Frameworks and Evaluation Methodologies focuses on systematic assessment tools like BEND[8], DNALongBench[7], and DART-Eval[14], which provide standardized tasks to compare model performance across regulatory prediction, variant effect estimation, and other genomic challenges. Application-Specific Models and Downstream Tasks targets specialized problems such as gene expression prediction, RNA structure modeling with FlashRNA[20], and variant interpretation, while Model Interpretation and Functional Understanding investigates how these models learn biological signals and dependencies. Theoretical and Methodological Foundations addresses core questions about tokenization strategies, representational power, and the mathematical underpinnings that enable effective genomic sequence modeling.

A particularly active line of work centers on developing comprehensive benchmarks that stress-test models on tasks requiring long-range context, where dependencies span thousands or even millions of base pairs. Genomics Long-Range Benchmark[0] sits squarely within this effort, joining DNALongBench[7] and related frameworks in pushing models beyond local motif recognition toward genome-scale understanding. While Advancing DNA LMs[3] surveys broader architectural trends and DNA Foundation Benchmarking[6] examines general-purpose evaluation, the original work emphasizes the unique challenges of long-range tasks where models like HyenaDNA[4] and Evo[16] demonstrate advantages over traditional transformers. This focus on extended context contrasts with benchmarks like BEND[8], which cover a wider variety of shorter-range regulatory tasks, highlighting an ongoing tension between breadth of evaluation and depth in specific challenging regimes.

Claimed Contributions

Human Genomics Long-Range Benchmark (LRB)

A benchmark compilation of biologically meaningful tasks in human genomics that deliberately incorporates tasks spanning both short and long genomic contexts, allowing users to select arbitrary input sequence lengths for any dataset and thereby empirically measure the importance of long-range context.

10 retrieved papers
Fine-tuning recipes for DNA language models

The authors provide fine-tuning recipes demonstrating that full-model fine-tuning meaningfully outperforms the common practice of keeping the backbone DNA LM's weights frozen during downstream training.

10 retrieved papers
Visualization tool for genomic property analysis

A tool that lets users examine model results in detail by breaking down performance across different genomic properties and annotations.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Human Genomics Long-Range Benchmark (LRB)

A benchmark compilation of biologically meaningful tasks in human genomics that deliberately incorporates tasks spanning both short and long genomic contexts, allowing users to select arbitrary input sequence lengths for any dataset and thereby empirically measure the importance of long-range context.
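The "arbitrary sequence length" idea can be sketched as simple window extraction around a locus of interest. The function name, coordinates, and toy sequence below are illustrative, not the benchmark's actual API.

```python
# Minimal sketch: extract a context window of a chosen size around a position.
# `context_window`, the coordinates, and the toy sequence are hypothetical.

def context_window(sequence: str, center: int, length: int) -> str:
    """Return a window of `length` bases, roughly centered on `center`."""
    half = length // 2
    start = max(0, center - half)
    return sequence[start:start + length]

seq = "ACGT" * 64                                    # 256-bp toy sequence
short = context_window(seq, center=128, length=16)   # short-range input
long_ = context_window(seq, center=128, length=128)  # long-range input
```

Evaluating the same model at several window sizes on one dataset is what allows the long-range contribution of context to be measured empirically.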

Contribution

Fine-tuning recipes for DNA language models

The authors provide fine-tuning recipes demonstrating that full-model fine-tuning meaningfully outperforms the common practice of keeping the backbone DNA LM's weights frozen during downstream training.
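The mechanical difference between the two regimes contrasted above is simply which parameters receive gradient updates. A framework-agnostic sketch, with illustrative `backbone.`/`head.` parameter names that are not taken from the paper:

```python
# Sketch: frozen-backbone probing vs. full fine-tuning differ only in
# which parameter set is marked trainable. Names here are hypothetical.

def select_trainable(param_names, full_finetune: bool) -> set:
    """Return the parameter names that should receive gradient updates."""
    if full_finetune:
        return set(param_names)  # full fine-tuning: update every weight
    # probing: freeze the backbone, train only the task head
    return {n for n in param_names if not n.startswith("backbone.")}

params = ["backbone.layer0.weight", "backbone.layer1.weight",
          "head.weight", "head.bias"]

probe = select_trainable(params, full_finetune=False)
full = select_trainable(params, full_finetune=True)
```

In a deep-learning framework, the same split is typically realized by toggling each backbone parameter's gradient flag before constructing the optimizer.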

Contribution

Visualization tool for genomic property analysis

A tool that lets users examine model results in detail by breaking down performance across different genomic properties and annotations.
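The kind of stratified analysis such a tool supports can be sketched as grouping per-example scores by a genomic annotation and comparing group means. The records below are made-up illustrative data, not results from the paper.

```python
# Sketch: stratify per-example scores by a genomic property ("region")
# and summarize each group. All values here are hypothetical.
from collections import defaultdict
from statistics import mean

records = [
    {"region": "promoter",   "score": 0.81},
    {"region": "promoter",   "score": 0.77},
    {"region": "enhancer",   "score": 0.52},
    {"region": "enhancer",   "score": 0.58},
    {"region": "intergenic", "score": 0.40},
]

by_region = defaultdict(list)
for r in records:
    by_region[r["region"]].append(r["score"])

# Mean score per genomic property: the kind of breakdown the tool would plot
summary = {region: round(mean(scores), 2)
           for region, scores in by_region.items()}
```

Breakdowns like this reveal where aggregate metrics hide weaknesses, e.g., a model that looks strong overall but underperforms on a specific annotation class.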
