Multihead Mixture of Experts for Classification of Gigapixel Pathology Images

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Mixture of Experts, Multiple Instance Learning, Computational Pathology, Computer Vision
Abstract:

Multiple Instance Learning (MIL) is the predominant approach for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second step: the critical layer that transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch's phenotype, yielding synergistic effects with existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multihead mixture-of-experts module designed to improve the performance of any MIL model with minimal change to the total number of parameters. Across 8 MIL methods and 19 different tasks, we find that this improvement to the task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Finally, we identify Instance-Gradient Interference (IGI), a limitation in which heterogeneous instances produce conflicting gradients when processed by a single linear layer, and show that MAMMOTH effectively mitigates IGI by decoupling gradient flows between experts, yielding consistent performance gains in 130 of the 152 examined configurations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MAMMOTH, a parameter-efficient multihead mixture-of-experts module that transforms general-purpose patch features into task-specific representations before aggregation in MIL pipelines. It resides in the 'Mixture of Experts and Task-Specific Transformations' leaf, which contains only two papers total (including this one). This is a notably sparse research direction within the broader MIL landscape, suggesting that explicit modeling of task-specific transformations via mixture-of-experts architectures remains underexplored compared to attention-based aggregation or prototype methods, which occupy more densely populated leaves.

The taxonomy reveals that neighboring leaves focus on attention mechanisms (five papers), prototype learning (three papers), graph-based spatial modeling (two papers), and hierarchical patch selection (four papers). These directions emphasize aggregation strategies or interpretability rather than the intermediate transformation layer. The paper's focus on the linear projection between feature extraction and aggregation diverges from these neighboring approaches, which largely treat this step as a fixed operation. The scope note for this leaf explicitly excludes standard attention aggregation and prototype methods, positioning MAMMOTH as addressing a distinct bottleneck in the MIL pipeline that other branches do not target.

Among 26 candidates examined across three contributions, no refutable prior work was identified. The MAMMOTH module itself was assessed against nine candidates with zero refutations, the Instance-Gradient Interference analysis against seven candidates with zero refutations, and the task-specific transformation bottleneck hypothesis against ten candidates with zero refutations. This suggests that within the limited search scope (primarily top-K semantic matches and citation expansion), no prior work directly anticipates the combination of a multihead mixture-of-experts applied specifically to the task-specific transformation layer in MIL. The single sibling paper in the same leaf (Mixture Experts Tissues) likely addresses tissue-type partitioning rather than per-patch adaptive transformations, though detailed comparison would require full-text review.

Given the sparse taxonomy leaf, the absence of refutable candidates among 26 examined papers, and the paper's focus on an intermediate MIL component that neighboring methods treat as fixed, the work appears to occupy a relatively unexplored niche. However, the limited search scope means this assessment reflects top-K semantic proximity rather than exhaustive coverage of the computational pathology literature. The novelty claim rests on the specific architectural choice (a multihead mixture-of-experts for the task-specific transformation) rather than on introducing mixture-of-experts or MIL concepts themselves, which are established in the field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 0

Research Landscape Overview

Core task: classification of gigapixel whole-slide images in computational pathology. The field has evolved into a rich ecosystem of approaches organized around several major branches. Multiple Instance Learning (MIL) architectures and aggregation methods form a central pillar, encompassing attention-based networks, mixture-of-experts designs, and hierarchical aggregation schemes that handle the enormous number of patches extracted from each slide. Feature extraction and representation learning focuses on self-supervised and foundation models that produce robust patch embeddings, while vision-language models and multimodal learning integrate textual reports or clinical metadata to enhance interpretability. End-to-end and full-resolution processing methods attempt to bypass the patch-based bottleneck by operating directly on high-resolution data, whereas training strategies and optimization techniques address sample efficiency, contrastive learning, and domain adaptation. Generative models and synthetic data creation explore augmentation and slide synthesis, preprocessing and infrastructure tackle color normalization and scalable pipelines, application-specific studies target particular cancer types or clinical workflows, and hybrid multi-feature fusion approaches combine complementary representations.

Within the MIL landscape, a particularly active line of work explores mixture-of-experts and task-specific transformations, where models dynamically route or weight different subnetworks to capture tissue heterogeneity. Multihead Mixture Experts[0] exemplifies this direction by employing multiple expert heads to handle diverse histological patterns, closely related to Mixture Experts Tissues[33], which similarly partitions the feature space according to tissue type. These methods contrast with simpler attention pooling schemes like those in Attention Based Networks[40] or Gigapixel Vision Transformers[12], which apply uniform aggregation across all patches.
Meanwhile, vision-language approaches such as Slide Prompt Learning[5] and HistoGPT Reports[7] introduce textual guidance to steer classification, and full-resolution methods like Full Resolution Memory[27] and Scaling Gigapixel Resolution[2] bypass patch extraction altogether. The original paper sits squarely in the mixture-of-experts branch, emphasizing adaptive specialization over monolithic aggregation, and shares conceptual ground with Mixture Experts Tissues[33] while differing from prompt-based or end-to-end alternatives that do not explicitly model expert diversity.

Claimed Contributions

MAMMOTH: Multihead Mixture of Experts module for MIL

The authors propose MAMMOTH, a plug-and-play mixture of experts architecture that replaces the standard linear layer in Multiple Instance Learning frameworks. It uses multihead processing, soft expert assignment, and low-rank decomposition to transform general-purpose patch features into task-specific features for gigapixel pathology image classification while maintaining parameter efficiency.

9 retrieved papers
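The mechanism claimed here (soft expert assignment plus low-rank per-expert projections applied to each patch feature) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the shapes, the names (`moe_transform`, `n_experts`, `rank`), and the gating design are assumptions, and the multihead dimension is folded away for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out, n_experts, rank = 16, 8, 4, 2  # toy sizes (assumed, not the paper's)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Each expert is a low-rank factorized linear map: W_e = A_e @ B_e,
# so it holds D_in*rank + rank*D_out parameters instead of D_in*D_out.
A = rng.standard_normal((n_experts, D_in, rank)) / np.sqrt(D_in)
B = rng.standard_normal((n_experts, rank, D_out)) / np.sqrt(rank)
W_gate = rng.standard_normal((D_in, n_experts))  # routing weights

def moe_transform(X):
    """Soft mixture of low-rank experts, applied independently per patch."""
    gate = softmax(X @ W_gate)                        # (N, E) soft assignment
    expert_out = np.einsum('nd,edr,ero->neo', X, A, B)  # (N, E, D_out)
    return np.einsum('ne,neo->no', gate, expert_out)    # gate-weighted sum

X = rng.standard_normal((32, D_in))  # 32 patch features from one slide
Z = moe_transform(X)                 # task-specific patch features, (32, 8)
print(Z.shape)
```

Because the gate is a soft assignment, the module stays differentiable end to end, and each patch can draw on a phenotype-appropriate blend of experts rather than one shared linear map.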
Identification of Instance-Gradient Interference (IGI)

The authors identify a previously unrecognized limitation called Instance-Gradient Interference, where heterogeneous patch instances create conflicting gradient updates in standard linear layers. They demonstrate that MAMMOTH addresses this issue by routing different instances to separate experts, enabling decoupled gradient flows.

7 retrieved papers
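The interference claimed here can be demonstrated on a toy regression: two patch "phenotypes" that are easy to separate in feature space but demand opposite uses of another feature. Their loss gradients with respect to a single shared weight vector then point in nearly opposite directions, so updates largely cancel. This sketch is illustrative only; the setup and all names (`grad`, the phenotype construction) are assumptions, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 4

# Two phenotypes, separable by feature 0 (a routing signal), but requiring
# OPPOSITE uses of feature 1 for the task target.
X_a = rng.standard_normal((n, d)); X_a[:, 0] += 3.0
X_b = rng.standard_normal((n, d)); X_b[:, 0] -= 3.0
y_a = X_a[:, 1]    # phenotype A: target follows feature 1
y_b = -X_b[:, 1]   # phenotype B: target opposes feature 1

def grad(X, y, w):
    """Gradient of the MSE loss 0.5*mean((Xw - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(d)  # a single shared linear layer sees both phenotypes
g_a, g_b = grad(X_a, y_a, w), grad(X_b, y_b, w)
cos = g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
# The cosine is strongly negative: the two phenotypes' updates interfere.
print(f"cosine(g_A, g_B) = {cos:+.2f}")
```

Routing each phenotype to its own expert (here, trivially by the sign of feature 0) means expert A only ever accumulates g_a and expert B only g_b, which is the decoupled-gradient behavior the contribution attributes to MAMMOTH.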
Task-specific transformation as performance bottleneck

The authors identify the task-specific linear transformation layer in MIL pipelines as a critical but previously unexplored performance bottleneck. They demonstrate that improving this transformation yields larger performance gains than changing aggregation methods, showing consistent improvements in 130 of 152 examined configurations.

10 retrieved papers
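The bottleneck claim concerns step 2 of the MIL pipeline described in the abstract. A minimal sketch of that pipeline makes the swap point concrete: the aggregation is held fixed while only the per-patch transformation changes. The function names (`mil_forward`, `transform`, `aggregate`) are hypothetical scaffolding, not the paper's API.

```python
import numpy as np

def mil_forward(patches, transform, aggregate=lambda Z: Z.mean(axis=0)):
    """Generic MIL forward pass: the `transform` argument is the layer under study."""
    Z = transform(patches)   # step 2: general-purpose -> task-specific patch features
    return aggregate(Z)      # step 3: pool patches into one slide-level feature

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 8)) / 4.0
linear = lambda X: X @ W                  # the standard single linear layer

patches = rng.standard_normal((100, 16))  # one slide's extracted patch features
slide_feat = mil_forward(patches, linear)
print(slide_feat.shape)                   # (8,)
```

Under this framing, a MAMMOTH-style module drops in as `transform` without touching `aggregate`, which is why the contribution can compare the effect of the transformation against the choice of pooling (mean, max, attention) on equal footing.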

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MAMMOTH: Multihead Mixture of Experts module for MIL


Contribution

Identification of Instance-Gradient Interference (IGI)


Contribution

Task-specific transformation as performance bottleneck
