ASMIL: Attention-Stabilized Multiple Instance Learning for Whole-Slide Imaging

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: whole slide image; multiple instance learning
Abstract:

Attention-based multiple instance learning (MIL) has emerged as a powerful framework for whole slide image (WSI) diagnosis, leveraging attention to aggregate instance-level features into bag-level predictions. Despite this success, we find that such methods exhibit a new failure mode: unstable attention dynamics. Across four representative attention-based MIL methods and two public WSI datasets, we observe that attention distributions oscillate across epochs rather than converging to a consistent pattern, degrading performance. This instability adds to two previously reported challenges: overfitting and over-concentrated attention distributions. To simultaneously overcome these three limitations, we introduce attention-stabilized multiple instance learning (ASMIL), a novel unified framework. ASMIL uses an anchor model to stabilize attention, replaces softmax with a normalized sigmoid function in the anchor to prevent over-concentration, and applies token random dropping to mitigate overfitting. Extensive experiments demonstrate that ASMIL achieves up to a 6.49% F1 score improvement over state-of-the-art methods. Moreover, integrating the anchor model and normalized sigmoid into existing attention-based MIL methods consistently boosts their performance, with F1 score gains up to 10.73%. All code and data are publicly available at https://anonymous.4open.science/r/ASMIL-5018/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ASMIL, a framework addressing three challenges in attention-based MIL for WSI diagnosis: unstable attention dynamics, overfitting, and over-concentrated attention. It resides in the 'Attention Refinement and Localization Improvement' leaf, which contains four papers including the original work. This leaf sits within the broader 'Attention Mechanism Design and Enhancement' branch, one of ten major research directions in a taxonomy spanning fifty papers. The leaf represents a moderately active research area focused specifically on correcting and refining attention mechanisms, distinct from general attention architectures or hierarchical modeling approaches.

The taxonomy reveals neighboring research directions that share overlapping concerns but pursue different strategies. Adjacent leaves include 'Attention Regularization and Entropy-Based Methods' (one paper using entropy maximization), 'Channel and Multi-Dimensional Attention' (two papers on cross-channel dependencies), and 'Top-K and Selective Attention Mechanisms' (two papers on instance selection). The sibling papers in the same leaf (Focus your attention, Attention-Challenging MIL, and AEM) all target attention quality improvement, but through distinct mechanisms such as spatial constraints or error mitigation. ASMIL's anchor-based stabilization approach represents a different technical path within this shared goal of refining attention localization and preventing degradation.

Among the twenty-four candidates examined via semantic search, the contribution-level analysis reveals mixed novelty signals. For the identification of unstable attention dynamics, five candidates were examined with zero refutations, suggesting this diagnostic observation may be relatively fresh. For the anchor model mechanism, ten candidates were examined and two refutable overlaps were found, indicating that prior work on attention stabilization exists within the limited search scope. For the normalized sigmoid function, nine candidates were examined with one refutation, pointing to some precedent for addressing over-concentration. These statistics reflect a targeted literature search, not exhaustive coverage, and suggest that the technical components have varying degrees of prior exploration.

Based on the limited search of twenty-four semantically similar papers, ASMIL appears to combine known attention refinement strategies in a novel configuration targeting a specific failure mode. The unstable dynamics observation seems less explored, while the stabilization and normalization techniques show partial overlap with existing work. The taxonomy context indicates this sits in an active but not overcrowded research direction, with room for incremental contributions that integrate multiple refinement strategies. A broader literature search beyond top-K semantic matches would be needed to assess whether the specific combination and empirical validation represent a substantive advance.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 24
Refutable papers: 3

Research Landscape Overview

Core task: whole slide image diagnosis using attention-based multiple instance learning. The field addresses the challenge of classifying gigapixel pathology images by treating each slide as a bag of smaller patches (instances) and learning which patches are diagnostically relevant.

The taxonomy reveals a rich landscape organized around ten major branches. Attention Mechanism Design and Enhancement focuses on refining how models weight informative patches, including localization improvements and novel attention formulations. Transformer-Based MIL Architectures such as TransMIL[7] and HiViT[3] leverage self-attention for richer contextual modeling. Instance Selection and Hard Example Mining targets the identification of critical or challenging patches, while Feature Representation and Aggregation explores how to combine patch-level embeddings into slide-level predictions. Training Strategies and Learning Paradigms encompass diverse supervision schemes, from pseudo-labeling approaches like PAMIL[11] to iterative refinement methods. Hierarchical and Multi-Scale MIL methods capture tissue structure at multiple resolutions, and Domain-Specific Extensions tailor architectures to particular cancer types or clinical tasks. Generalization and Robustness Enhancement addresses overfitting and domain shift, Interactive and Adaptive frameworks incorporate feedback or dynamic selection, and Comparative Studies provide empirical benchmarks across methods.

Several active lines of work highlight ongoing trade-offs between model complexity, interpretability, and generalization. Attention refinement methods like Focus your attention[4] and Attention-Challenging Multiple Instance Learning[6] aim to sharpen localization and reduce noise in attention maps, a concern shared by AEM[17], which emphasizes error mitigation. ASMIL[0] sits within this Attention Refinement and Localization Improvement cluster, addressing similar goals of improving attention quality and diagnostic precision.
Compared to neighbors like Focus your attention[4], which may emphasize spatial constraints, or AEM[17], which targets error-aware mechanisms, ASMIL[0] offers its own strategy for refining attention to better isolate relevant tissue regions. Meanwhile, transformer-based approaches and hierarchical methods pursue complementary directions, capturing long-range dependencies or multi-scale context, and illustrate the field's exploration of both local precision and global structure. Open questions remain around balancing attention sharpness with robustness, and around integrating these refinements into clinically deployable systems.

Claimed Contributions

Identification and analysis of unstable attention dynamics in attention-based MIL

The authors identify a previously overlooked failure mode where attention distributions in attention-based multiple instance learning oscillate across training epochs rather than converging to consistent patterns. They quantify this instability using Jensen-Shannon divergence and demonstrate its negative impact on performance and interpretability.
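The instability measurement described above can be sketched in a few lines. Nothing below is taken from the ASMIL code; it simply illustrates how Jensen-Shannon divergence flags attention mass migrating between epochs, using made-up attention vectors:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions;
    # symmetric, and bounded by log(2) under the natural logarithm.
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Attention over the same bag at two consecutive epochs (toy values):
attn_epoch_t = np.array([0.70, 0.20, 0.05, 0.05])
attn_epoch_t1 = np.array([0.05, 0.05, 0.20, 0.70])  # mass has migrated

js_divergence(attn_epoch_t, attn_epoch_t1)  # markedly positive: unstable
js_divergence(attn_epoch_t, attn_epoch_t)   # essentially zero: stable
```

Tracked per bag across consecutive epochs, this value traces exactly the kind of oscillation the authors describe: it stays high while attention keeps jumping between patches and decays toward zero once attention converges.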

Retrieved papers: 5
Anchor model for stabilizing attention distributions

The authors propose an anchor model that mirrors the attention block of the online model but is updated via exponential moving average rather than backpropagation. The online model is encouraged to align with the anchor's attention distribution through KL divergence minimization, providing stable training dynamics.
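The mechanism described above maps onto a short sketch. Everything below (the name `AnchorAttention`, the linear attention scoring, the KL direction, and the momentum value) is a hypothetical illustration of the EMA-anchor idea under simplified assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # KL(p || q) for two attention distributions over one bag.
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

class AnchorAttention:
    """EMA 'anchor' copy of the online attention parameters (sketch)."""

    def __init__(self, w_online, momentum=0.99):
        self.w = w_online.copy()  # anchor starts as a copy of the online weights
        self.momentum = momentum

    def ema_update(self, w_online):
        # w_anchor <- m * w_anchor + (1 - m) * w_online; no gradients flow here.
        self.w = self.momentum * self.w + (1.0 - self.momentum) * w_online

    def alignment_loss(self, feats, w_online):
        # Penalize the online attention for drifting from the anchor's.
        a_anchor = softmax(feats @ self.w)
        a_online = softmax(feats @ w_online)
        return kl(a_anchor, a_online)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))       # one bag: 6 instances, 4-dim features
w = rng.normal(size=4)                # online attention weights
anchor = AnchorAttention(w)
w_new = w + 0.5 * rng.normal(size=4)  # pretend one optimizer step happened
loss = anchor.alignment_loss(feats, w_new)  # added to the training objective
anchor.ema_update(w_new)              # anchor drifts slowly toward the online model
```

Because the anchor moves only a small fraction toward the online weights each step, its attention distribution changes slowly, so the KL term pulls the online model back toward a consistent pattern rather than letting it oscillate.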

Retrieved papers: 10
Status: can refute
Normalized sigmoid function to prevent attention over-concentration

The authors introduce a normalized sigmoid function as a replacement for softmax in the anchor model to prevent over-concentrated attention distributions. They provide theoretical analysis showing that NSF achieves selective flattening of attention among informative tokens while suppressing weak ones, which cannot be achieved by softmax with a single temperature parameter.
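A minimal sketch of the idea, assuming the normalized sigmoid is simply an element-wise sigmoid renormalized to sum to one (the paper's exact parameterization may differ). The toy scores show the selective-flattening effect relative to softmax: informative tokens saturate the sigmoid and end up with near-equal weight, while weak tokens stay suppressed:

```python
import numpy as np

def normalized_sigmoid(scores, tau=1.0, eps=1e-12):
    # Element-wise sigmoid, then renormalize so the weights sum to 1.
    s = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=np.float64) / tau))
    return s / (s.sum() + eps)

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

# Three informative tokens with similar scores, two clearly weak tokens:
scores = np.array([4.0, 3.5, 3.8, -5.0, -6.0])
softmax(scores)             # mass skews toward the single top score
normalized_sigmoid(scores)  # near-uniform over informative tokens, weak ones ~0
```

Softmax with one temperature cannot reproduce this: lowering the temperature sharpens everything (over-concentration), while raising it flattens the weak tokens along with the informative ones. The saturating sigmoid flattens only the tokens whose scores are already large.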

Retrieved papers: 9
Status: can refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Identification and analysis of unstable attention dynamics in attention-based MIL

The authors identify a previously overlooked failure mode where attention distributions in attention-based multiple instance learning oscillate across training epochs rather than converging to consistent patterns. They quantify this instability using Jensen-Shannon divergence and demonstrate its negative impact on performance and interpretability.

Contribution

Anchor model for stabilizing attention distributions

The authors propose an anchor model that mirrors the attention block of the online model but is updated via exponential moving average rather than backpropagation. The online model is encouraged to align with the anchor's attention distribution through KL divergence minimization, providing stable training dynamics.

Contribution

Normalized sigmoid function to prevent attention over-concentration

The authors introduce a normalized sigmoid function as a replacement for softmax in the anchor model to prevent over-concentrated attention distributions. They provide theoretical analysis showing that NSF achieves selective flattening of attention among informative tokens while suppressing weak ones, which cannot be achieved by softmax with a single temperature parameter.