BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: AI for biology, foundation models, synthetic captions
Abstract:

This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the use of natural language supervision in organismal biology compared with many other scientific domains. We fill this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-specific descriptive captions. Using these captions, we train BioCAP (i.e., BioCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces BioCAP, a biological foundation model trained with synthetic captions generated by multimodal large language models, targeting species classification and text-image retrieval. It resides in the 'Organismal Biology Multimodal Models' leaf, which contains only three papers, indicating a relatively sparse research direction within the broader taxonomy of 28 papers. This positioning suggests the work addresses a niche problem, aligning organismal images with natural language, where prior efforts are limited compared with the more densely populated biomedical imaging branches.

The taxonomy reveals that neighboring branches focus on clinical imaging (Pathology and Histology, General Biomedical Vision-Language) and molecular biology (Molecule-Text Alignment, Protein-Language Foundation Models), each containing multiple papers. The organismal biology leaf sits under 'Genomic and Cross-Modal Biological Models,' which also includes genomic sequence models. BioCAP's use of synthetic captions connects it conceptually to the 'Synthetic Caption Generation and Utilization' leaf in the general multimodal branch, yet its domain-specific context pipeline (Wikipedia-derived visual information, taxon-tailored examples) distinguishes it from the domain-agnostic methods in that leaf.

Among the 27 candidates examined, the contribution-level analysis shows varied novelty. The core BioCAP model was compared against 7 candidates with no refutations, suggesting limited direct prior work on caption-supervised organismal models. The domain-specific caption generation pipeline was compared against 10 candidates, also with no refutations, indicating this tailored approach may be novel. However, the separated visual projectors contribution was compared against 10 candidates, 2 of which were judged refutable, implying that architectural strategies for heterogeneous supervision have precedent within the limited search scope. These statistics reflect a top-K semantic search, not an exhaustive review.

Overall, the work appears to occupy a sparsely explored intersection of synthetic caption generation and organismal biology, with the caption pipeline and model integration showing novelty within the examined scope. The architectural component for heterogeneous supervision has more substantial prior work among the candidates. The analysis is constrained by the limited search scale and does not cover all possible related efforts in broader vision-language or ecological modeling literature.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 2

Research Landscape Overview

Core task: training biological multimodal foundation models with synthetic captions.

The field has evolved into several distinct branches that reflect the diversity of biological data modalities and application domains. Biomedical Vision-Language Foundation Models form a dense branch focused on clinical imaging and pathology, with works such as Multimodal Pathology Foundation[1] and BiomedCLIP[2] aligning radiology or histopathology images with clinical text. Molecular and Protein Multimodal Models address the challenge of integrating sequence, structure, and functional annotations for proteins and small molecules, exemplified by efforts like BioT5[5] and Molecule Structure Text[8]. Genomic and Cross-Modal Biological Models extend multimodal learning to organismal and ecological scales, bridging genomic sequences with phenotypic or environmental observations. Meanwhile, General Multimodal Architectures and Methods provide foundational techniques, such as contrastive learning and vision-language pretraining, that are adapted across these specialized domains.

A central theme across branches is the trade-off between domain-specific curation and scalable synthetic data generation. Many studies in the biomedical imaging branch rely on expert-annotated datasets, whereas genomic and organismal models often face severe annotation scarcity, motivating synthetic caption strategies.

BioCAP[0] sits within the Organismal Biology Multimodal Models cluster, where it addresses the challenge of pairing visual observations of organisms with descriptive text by leveraging large language models to generate captions at scale. This approach contrasts with neighboring works like Insect Foundation[14], which may emphasize vision-only pretraining or smaller curated datasets, and aligns conceptually with cross-modal strategies seen in Omni-DNA[10] that unify genomic and phenotypic modalities. The broader question remains how synthetic supervision compares with expert curation in downstream biological discovery tasks.

Claimed Contributions

BioCAP: biological foundation model trained with synthetic captions

The authors introduce BioCAP, a multimodal foundation model for organismal biology that is trained using both taxonomic labels and synthetic descriptive captions. This model demonstrates improved performance in species classification and text-image retrieval tasks compared to models trained with labels alone.

7 retrieved papers
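
To make the dual-supervision recipe concrete, the sketch below shows one way an image batch could be aligned with both label text and caption text under a CLIP-style objective. This is a minimal illustration, assuming a symmetric InfoNCE loss; the names `contrastive_loss`, `biocap_step`, and the `lambda_cap` weight are hypothetical and not taken from the paper.

```python
# Minimal sketch of combining label and caption supervision in a
# CLIP-style objective. All names and the loss weighting are
# illustrative assumptions, not the paper's confirmed implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

def biocap_step(img_emb, label_emb, caption_emb, lambda_cap=1.0):
    """Sum the label-alignment and caption-alignment terms.

    lambda_cap is a hypothetical weight balancing the two signals.
    """
    return (contrastive_loss(img_emb, label_emb)
            + lambda_cap * contrastive_loss(img_emb, caption_emb))
```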

Domain-specific context pipeline for synthetic caption generation

The authors develop a pipeline that incorporates Wikipedia-derived visual information and taxon-tailored format examples as domain-specific contexts for multimodal large language models. This approach reduces hallucination and enables the generation of accurate, instance-specific descriptive captions for biological images at scale.

10 retrieved papers
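
The pipeline description lends itself to a short sketch of how the two context sources might be assembled into an MLLM prompt. Everything below is an assumption for illustration: the prompt wording, the `TaxonContext` fields, and the injected `query_mllm` callable stand in for whatever prompts and MLLM interface the authors actually use.

```python
# Illustrative sketch of assembling Wikipedia-derived visual facts and
# taxon-tailored format examples into a captioning prompt. Field names,
# prompt wording, and the query_mllm callable are assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaxonContext:
    taxon_name: str
    wiki_visual_facts: list[str]   # appearance descriptions mined from Wikipedia
    format_examples: list[str]     # caption exemplars tailored to this taxon group

def build_caption_prompt(ctx: TaxonContext) -> str:
    facts = "\n".join(f"- {f}" for f in ctx.wiki_visual_facts)
    examples = "\n".join(f"Example: {e}" for e in ctx.format_examples)
    return (
        f"You are describing one image of {ctx.taxon_name}.\n"
        f"Reference visual traits for this taxon:\n{facts}\n\n"
        f"Match the style of these captions:\n{examples}\n\n"
        "Describe only traits visible in this particular image; "
        "do not assert traits you cannot see."
    )

def generate_caption(image_bytes: bytes, ctx: TaxonContext,
                     query_mllm: Callable[..., str]) -> str:
    """query_mllm wraps any multimodal LLM API (injected, not specified here)."""
    return query_mllm(image=image_bytes, prompt=build_caption_prompt(ctx))
```

Grounding the prompt in reference traits while restricting the output to visible features is what the contribution credits for reducing hallucination and keeping captions instance-specific.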

Separated visual projectors for heterogeneous supervision

The authors propose using two separate visual projection heads after a shared visual encoder to handle heterogeneous supervision from taxonomic labels and descriptive captions. This architectural design allows the model to align different types of textual supervision with images more effectively.

10 retrieved papers
Can Refute (2 papers)
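
The architectural claim is easy to visualize as code. Below is a minimal sketch, assuming a PyTorch-style module: one shared visual backbone feeds two projection heads, one for the label-alignment space and one for the caption-alignment space. The class and attribute names, dimensions, and the choice of plain linear heads are illustrative.

```python
# Sketch of separated visual projectors over a shared encoder. The
# backbone, head type (plain linear), and dimensions are assumptions.
import torch
import torch.nn as nn

class DualProjectorVisionTower(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, embed_dim: int = 512):
        super().__init__()
        self.encoder = encoder                              # shared visual backbone
        self.proj_label = nn.Linear(feat_dim, embed_dim)    # head for label text space
        self.proj_caption = nn.Linear(feat_dim, embed_dim)  # head for caption text space

    def forward(self, images: torch.Tensor):
        feats = self.encoder(images)  # assumed shape: (batch, feat_dim)
        return self.proj_label(feats), self.proj_caption(feats)
```

In a combined objective like the earlier sketch, the `proj_label` output would enter the label-side contrastive term and the `proj_caption` output the caption-side term, letting each head specialize to its supervision signal while the backbone stays shared.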

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
