BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
Overview
Overall Novelty Assessment
The paper introduces BioCAP, a biological foundation model trained with synthetic captions generated by multimodal large language models, targeting species classification and text-image retrieval. It resides in the 'Organismal Biology Multimodal Models' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 28 papers. This positioning suggests the work addresses a niche problem—aligning organismal images with natural language—where prior efforts are limited compared to the more densely populated biomedical imaging branches.
The taxonomy reveals that neighboring branches focus on clinical imaging (Pathology and Histology, General Biomedical Vision-Language) and molecular biology (Molecule-Text Alignment, Protein-Language Foundation Models), each containing multiple papers. The organismal biology leaf sits under 'Genomic and Cross-Modal Biological Models,' which also includes genomic sequence models. BioCAP's use of synthetic captions connects it conceptually to the 'Synthetic Caption Generation and Utilization' leaf in the general multimodal branch, yet its domain-specific context pipeline (Wikipedia-derived visual information, taxon-tailored examples) distinguishes it from domain-agnostic methods like those in that leaf.
Among the 27 candidates examined, the contribution-level analysis shows varied novelty. For the core BioCAP model, 7 candidates were examined with no refutations, suggesting limited direct prior work on caption-supervised organismal models. For the domain-specific caption generation pipeline, 10 candidates were examined, again with no refutations, indicating that this tailored approach may be novel. For the separated visual projectors, however, 10 candidates were examined and 2 refutable cases were found, implying that architectural strategies for heterogeneous supervision have precedent even within the limited search scope. These statistics reflect a top-K semantic search, not an exhaustive review.
Overall, the work appears to occupy a sparsely explored intersection of synthetic caption generation and organismal biology, with the caption pipeline and model integration showing novelty within the examined scope. The architectural component for heterogeneous supervision has more substantial prior work among the candidates. The analysis is constrained by the limited search scale and does not cover all possible related efforts in broader vision-language or ecological modeling literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce BioCAP, a multimodal foundation model for organismal biology that is trained using both taxonomic labels and synthetic descriptive captions. This model demonstrates improved performance in species classification and text-image retrieval tasks compared to models trained with labels alone.
The authors develop a pipeline that incorporates Wikipedia-derived visual information and taxon-tailored format examples as domain-specific contexts for multimodal large language models. This approach reduces hallucination and enables the generation of accurate, instance-specific descriptive captions for biological images at scale.
The authors propose using two separate visual projection heads after a shared visual encoder to handle heterogeneous supervision from taxonomic labels and descriptive captions. This architectural design allows the model to align different types of textual supervision with images more effectively.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[14] Insect-foundation: A foundation model and large multimodal dataset for vision-language insect understanding
[26] BioCAP
Contribution Analysis
Detailed comparisons for each claimed contribution
BioCAP: biological foundation model trained with synthetic captions
The authors introduce BioCAP, a multimodal foundation model for organismal biology that is trained using both taxonomic labels and synthetic descriptive captions. This model demonstrates improved performance in species classification and text-image retrieval tasks compared to models trained with labels alone.
[14] Insect-foundation: A foundation model and large multimodal dataset for vision-language insect understanding
[26] BioCAP
[49] Leveraging Large Image-Caption Datasets for Multimodal Taxon Classification
[50] Multimodal foundation models for zero-shot animal species recognition in camera trap images
[51] Taxonomic Reasoning for Rare Arthropods: Combining Dense Image Captioning and RAG for Interpretable Classification
[52] Using Knowledge Graphs to Harvest
[53] Leveraging Knowledge Graphs to harvest a high-quality dataset for efficient CLIP model training
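The dual-supervision training described in this contribution can be illustrated with a small numeric sketch: a CLIP-style contrastive loss against label text and a second one against caption text, summed into one objective. Batch size, dimensions, the random features, and the function names here are illustrative assumptions, not BioCAP's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
B, d = 4, 8  # batch of 4 image/text pairs, embedding dimension 8

def contrastive_loss(img: np.ndarray, txt: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over the in-batch similarity matrix."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau
    # Cross-entropy with matched pairs on the diagonal, in both directions.
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return float(-(np.diag(log_p_i2t).mean() + np.diag(log_p_t2i).mean()) / 2)

img_feats = rng.standard_normal((B, d))     # stand-in image embeddings
label_txt = rng.standard_normal((B, d))     # embeddings of taxonomic-label text
caption_txt = rng.standard_normal((B, d))   # embeddings of descriptive captions

# One objective supervises the image features with both text types.
total_loss = contrastive_loss(img_feats, label_txt) + contrastive_loss(img_feats, caption_txt)
```

Summing the two losses is only one plausible way to combine the supervision signals; a weighted combination would work the same way.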
Domain-specific context pipeline for synthetic caption generation
The authors develop a pipeline that incorporates Wikipedia-derived visual information and taxon-tailored format examples as domain-specific contexts for multimodal large language models. This approach reduces hallucination and enables the generation of accurate, instance-specific descriptive captions for biological images at scale.
[29] From show to tell: A survey on deep learning-based image captioning
[30] Vcr: Visual caption restoration
[31] Caption alignment and structure-aware attention for scientific table-to-text generation
[32] Automated fact-checking of claims from Wikipedia
[33] Show, Interpret and Tell: Entity-Aware Contextualised Image Captioning in Wikipedia
[34] VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text
[35] A pipeline for generating, annotating and employing synthetic data for real world question answering
[36] ImageCLEF 2021 best of labs: the curious case of caption generation for medical images
[37] Advanced Methods for Remote Sensing Image Captioning
[38] Using a Novel Capsule Network For an Innovative Approach to Image Captioning.
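The context-injection step of the pipeline described above can be sketched as prompt assembly: Wikipedia-derived visual notes and a taxon-tailored format example are packed into the instruction given to a multimodal LLM, grounding the generated caption and discouraging hallucinated traits. The function name, prompt layout, and example strings are all hypothetical illustrations, not BioCAP's code.

```python
def build_caption_prompt(taxon: str, wiki_visual_notes: str, format_example: str) -> str:
    """Assemble a grounded captioning prompt for one image of `taxon`."""
    return (
        f"You are describing a photograph of {taxon}.\n"
        "Reference visual information (derived from Wikipedia):\n"
        f"{wiki_visual_notes}\n\n"
        "Follow the style of this example caption for a related taxon:\n"
        f"{format_example}\n\n"
        "Describe only traits visible in the image; do not invent "
        "features absent from the reference notes."
    )

prompt = build_caption_prompt(
    taxon="Danaus plexippus",
    wiki_visual_notes="Orange wings with black veins and white-spotted margins.",
    format_example="An adult butterfly with pale yellow wings edged in brown.",
)
```

In a full pipeline this prompt would be sent to the multimodal LLM together with the image; only the text-assembly step is sketched here.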
Separated visual projectors for heterogeneous supervision
The authors propose using two separate visual projection heads after a shared visual encoder to handle heterogeneous supervision from taxonomic labels and descriptive captions. This architectural design allows the model to align different types of textual supervision with images more effectively.
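The separated-projector design described above can be sketched numerically: one shared visual embedding is mapped by two independent linear heads, one aligned with taxonomic-label text and one with descriptive captions. The dimensions, the random weights, and the stand-in encoder output are illustrative assumptions, not the model's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_enc, d_proj = 16, 8  # hypothetical encoder and projection dimensions

W_label = rng.standard_normal((d_enc, d_proj))    # projector for label supervision
W_caption = rng.standard_normal((d_enc, d_proj))  # projector for caption supervision

def project(shared_feat: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map one shared encoder feature into the two text-alignment spaces."""
    z_label = shared_feat @ W_label
    z_caption = shared_feat @ W_caption
    # L2-normalise, as is typical before a CLIP-style contrastive loss.
    z_label = z_label / np.linalg.norm(z_label)
    z_caption = z_caption / np.linalg.norm(z_caption)
    return z_label, z_caption

feat = rng.standard_normal(d_enc)  # stand-in for the shared encoder output
z_label, z_caption = project(feat)
```

The point of the design is that the two heads can specialise: the label head learns a space suited to short, categorical text, while the caption head learns one suited to free-form descriptions, without forcing a single projection to serve both.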