Multihead Mixture of Experts for Classification of Gigapixel Pathology Images

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Mixture of Experts, Multiple Instance Learning, Computational Pathology, Computer Vision
Abstract:

Multiple Instance Learning (MIL) is the predominant approach for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second step: the critical layer that transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch's phenotype, yielding synergistic effects with existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multihead mixture-of-experts module designed to improve the performance of any MIL model with minimal change to the total number of parameters. Across 8 MIL methods and 19 different tasks, we find that this improvement to the task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Finally, we identify Instance-Gradient Interference (IGI), a limitation in which heterogeneous instances produce conflicting gradients when processed by a single linear layer, and show that MAMMOTH effectively mitigates IGI by decoupling gradient flows between experts, yielding consistent performance gains in 130 of the 152 examined configurations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MAMMOTH, a parameter-efficient multihead mixture-of-experts module that transforms general-purpose patch features into task-specific representations before aggregation in MIL pipelines. It resides in the 'Mixture of Experts and Task-Specific Transformations' leaf, which contains only two papers total (including this one). This is a notably sparse research direction within the broader MIL landscape, suggesting that explicit modeling of task-specific transformations via mixture-of-experts architectures remains underexplored compared to attention-based aggregation or prototype methods, which occupy more densely populated leaves.

The taxonomy reveals that neighboring leaves focus on attention mechanisms (five papers), prototype learning (three papers), graph-based spatial modeling (two papers), and hierarchical patch selection (four papers). These directions emphasize aggregation strategies or interpretability rather than the intermediate transformation layer. The paper's focus on the linear projection between feature extraction and aggregation diverges from these neighboring approaches, which largely treat this step as a fixed operation. The scope note for this leaf explicitly excludes standard attention aggregation and prototype methods, positioning MAMMOTH as addressing a distinct bottleneck in the MIL pipeline that other branches do not target.

Among 26 candidates examined across three contributions, no refutable prior work was identified. The MAMMOTH module itself was assessed against nine candidates with zero refutations, the Instance-Gradient Interference analysis against seven candidates with zero refutations, and the task-specific transformation bottleneck hypothesis against ten candidates with zero refutations. This suggests that within the limited search scope (primarily top-K semantic matches and citation expansion), no prior work directly anticipates the combination of a multihead mixture-of-experts applied specifically to the task-specific transformation layer in MIL. The single sibling paper in the same leaf (Mixture Experts Tissues) likely addresses tissue-type partitioning rather than per-patch adaptive transformations, though detailed comparison would require full-text review.

Given the sparse taxonomy leaf, the absence of refutable candidates among 26 examined papers, and the paper's focus on an intermediate MIL component that neighboring methods treat as fixed, the work appears to occupy a relatively unexplored niche. However, the limited search scope means this assessment reflects top-K semantic proximity rather than exhaustive coverage of the computational pathology literature. The novelty claim rests on the specific architectural choice (a multihead mixture-of-experts for the task-specific transformation) rather than on introducing mixture-of-experts or MIL concepts themselves, which are established in the field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: classification of gigapixel whole-slide images in computational pathology. The field has evolved into a rich ecosystem of approaches organized around several major branches. Multiple Instance Learning (MIL) architectures and aggregation methods form a central pillar, encompassing attention-based networks, mixture-of-experts designs, and hierarchical aggregation schemes that handle the enormous number of patches extracted from each slide. Feature extraction and representation learning focuses on self-supervised and foundation models that produce robust patch embeddings, while vision-language models and multimodal learning integrate textual reports or clinical metadata to enhance interpretability. End-to-end and full-resolution processing methods attempt to bypass the patch-based bottleneck by operating directly on high-resolution data, whereas training strategies and optimization techniques address sample efficiency, contrastive learning, and domain adaptation. Generative models and synthetic data creation explore augmentation and slide synthesis, preprocessing and infrastructure tackle color normalization and scalable pipelines, application-specific studies target particular cancer types or clinical workflows, and hybrid multi-feature fusion approaches combine complementary representations.

Within the MIL landscape, a particularly active line of work explores mixture-of-experts and task-specific transformations, where models dynamically route or weight different subnetworks to capture tissue heterogeneity. Multihead Mixture Experts[0] exemplifies this direction by employing multiple expert heads to handle diverse histological patterns, closely related to Mixture Experts Tissues[33], which similarly partitions the feature space according to tissue type. These methods contrast with simpler attention pooling schemes like those in Attention Based Networks[40] or Gigapixel Vision Transformers[12], which apply uniform aggregation across all patches.
Meanwhile, vision-language approaches such as Slide Prompt Learning[5] and HistoGPT Reports[7] introduce textual guidance to steer classification, and full-resolution methods like Full Resolution Memory[27] and Scaling Gigapixel Resolution[2] bypass patch extraction altogether. The original paper sits squarely in the mixture-of-experts branch, emphasizing adaptive specialization over monolithic aggregation, and shares conceptual ground with Mixture Experts Tissues[33] while differing from prompt-based or end-to-end alternatives that do not explicitly model expert diversity.

Claimed Contributions

MAMMOTH: Multihead Mixture of Experts module for MIL

The authors propose MAMMOTH, a plug-and-play mixture of experts architecture that replaces the standard linear layer in Multiple Instance Learning frameworks. It uses multihead processing, soft expert assignment, and low-rank decomposition to transform general-purpose patch features into task-specific features for gigapixel pathology image classification while maintaining parameter efficiency.

9 retrieved papers
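The mechanism claimed here (soft expert assignment plus low-rank per-expert projections applied to each patch feature) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the names (`moe_transform`, `n_experts`, `rank`), and the gating design are assumptions, and the multihead dimension is folded away for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out, n_experts, rank = 16, 8, 4, 2  # toy sizes (assumed, not the paper's)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Each expert is a low-rank factorized linear map: W_e = A_e @ B_e,
# so it holds D_in*rank + rank*D_out parameters instead of D_in*D_out.
A = rng.standard_normal((n_experts, D_in, rank)) / np.sqrt(D_in)
B = rng.standard_normal((n_experts, rank, D_out)) / np.sqrt(rank)
W_gate = rng.standard_normal((D_in, n_experts))  # routing weights

def moe_transform(X):
    """Soft mixture of low-rank experts, applied independently per patch."""
    gate = softmax(X @ W_gate)                        # (N, E) soft assignment
    expert_out = np.einsum('nd,edr,ero->neo', X, A, B)  # (N, E, D_out)
    return np.einsum('ne,neo->no', gate, expert_out)    # gate-weighted sum

X = rng.standard_normal((32, D_in))  # 32 patch features from one slide
Z = moe_transform(X)                 # task-specific patch features, (32, 8)
print(Z.shape)
```

Because the gate is a soft assignment, the module stays differentiable end to end, and each patch can draw on a phenotype-appropriate blend of experts rather than one shared linear map.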
Identification of Instance-Gradient Interference (IGI)

The authors identify a previously unrecognized limitation called Instance-Gradient Interference, where heterogeneous patch instances create conflicting gradient updates in standard linear layers. They demonstrate that MAMMOTH addresses this issue by routing different instances to separate experts, enabling decoupled gradient flows.

7 retrieved papers
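The interference claimed here can be demonstrated on a toy regression: two patch "phenotypes" that are easy to separate in feature space but demand opposite uses of another feature. Their loss gradients with respect to a single shared weight vector then point in nearly opposite directions, so updates largely cancel. This sketch is illustrative only; the setup and all names (`grad`, the phenotype construction) are assumptions, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 4

# Two phenotypes, separable by feature 0 (a routing signal), but requiring
# OPPOSITE uses of feature 1 for the task target.
X_a = rng.standard_normal((n, d)); X_a[:, 0] += 3.0
X_b = rng.standard_normal((n, d)); X_b[:, 0] -= 3.0
y_a = X_a[:, 1]    # phenotype A: target follows feature 1
y_b = -X_b[:, 1]   # phenotype B: target opposes feature 1

def grad(X, y, w):
    """Gradient of the MSE loss 0.5*mean((Xw - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)  # a single shared linear layer sees both phenotypes
g_a, g_b = grad(X_a, y_a, w), grad(X_b, y_b, w)
cos = g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
# The cosine is strongly negative: the two phenotypes' updates interfere.
print(f"cosine(g_A, g_B) = {cos:+.2f}")
```

Routing each phenotype to its own expert (here, trivially by the sign of feature 0) means expert A only ever accumulates g_a and expert B only g_b, which is the decoupled-gradient behavior the contribution attributes to MAMMOTH.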
Task-specific transformation as performance bottleneck

The authors identify the task-specific linear transformation layer in MIL pipelines as a critical but previously unexplored performance bottleneck. They demonstrate that improving this transformation yields larger performance gains than changing aggregation methods, showing consistent improvements in 130 of 152 examined configurations.

10 retrieved papers
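The bottleneck claim concerns step 2 of the MIL pipeline described in the abstract. A minimal sketch of that pipeline makes the swap point concrete: the aggregation is held fixed while only the per-patch transformation changes. The function names (`mil_forward`, `transform`, `aggregate`) are hypothetical scaffolding, not the paper's API.

```python
import numpy as np

def mil_forward(patches, transform, aggregate=lambda Z: Z.mean(axis=0)):
    """Generic MIL forward pass: the `transform` argument is the layer under study."""
    Z = transform(patches)   # step 2: general-purpose -> task-specific patch features
    return aggregate(Z)      # step 3: pool patches into one slide-level feature

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 8)) / 4.0
linear = lambda X: X @ W                  # the standard single linear layer

patches = rng.standard_normal((100, 16))  # one slide's extracted patch features
slide_feat = mil_forward(patches, linear)
print(slide_feat.shape)                   # (8,)
```

Under this framing, a MAMMOTH-style module drops in as `transform` without touching `aggregate`, which is why the contribution can compare the effect of the transformation against the choice of pooling (mean, max, attention) on equal footing.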

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MAMMOTH: Multihead Mixture of Experts module for MIL


Contribution

Identification of Instance-Gradient Interference (IGI)


Contribution

Task-specific transformation as performance bottleneck
